Worth Bookmarking: A Roundup of Problems Encountered When Using PyTorch

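Error: TypeError: unhashable type: 'numpy.ndarray'
Cause: This occurs when a PyTorch LongTensor is converted to NumPy and the result is used as a dict key. The printed value already looks like an int, but it is still an ndarray (for example a 0-dimensional one), and ndarrays are not hashable.
Fix: Append .item() to the original expression so the value becomes a plain Python int.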
```
classId = support_y[i].long().cpu().detach().numpy().item()
```
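
To make the difference concrete, here is a minimal sketch. The tensor `support_y` below is a made-up stand-in for the real labels, and the counting loop is only an illustration of using the result as a dict key.

```python
import numpy as np
import torch

support_y = torch.tensor([3, 1, 3])  # hypothetical labels

# Without .item() the value is a 0-d ndarray, which cannot be hashed.
arr = support_y[0].long().cpu().detach().numpy()
try:
    {arr: "x"}
    hashable = True
except TypeError:
    hashable = False

# With .item() the value is a plain Python int and works as a dict key.
class_id = support_y[0].item()
counts = {}
for y in support_y:
    key = y.item()
    counts[key] = counts.get(key, 0) + 1
```

For a tensor that holds a single element, calling `.item()` directly on the tensor is usually enough, without the intermediate `.numpy()`.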

Error during data loading: TypeError: 'int' object is not callable
Cause: The data is an np.array or some other type rather than a Tensor.
Fix:
```
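import torch
from torch import autograd

# Convert the NumPy arrays (or lists) to LongTensors first.
# Since PyTorch 0.4 the Variable wrapper is optional and can be dropped.
data_x = autograd.Variable(torch.LongTensor(data_x))
data_y = autograd.Variable(torch.LongTensor(data_y))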
```

Error during data loading: RuntimeError: DataLoader worker (pid(s) 18620, 45872) exited unexpectedly
Fix: Set num_workers=0 in the loader.

Error: RuntimeError: input.size(-1) must be equal to input_size. Expected 10, got 2000
Cause: The dimensions passed to view are wrong. For LSTM(input, (h0, c0)), the input has shape (batch_size, seq_len, input_size) when batch_first=True, and (seq_len, batch_size, input_size) otherwise.
Fix:
```
lstm_out, self.hidden = self.lstm(
        embeds.view(self.batch_size, 200, EMBEDDING_DIM), self.hidden) 
```
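
The shape rule can be checked with a small standalone sketch. All sizes below are illustrative assumptions, not the values from the original model.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
BATCH_SIZE, SEQ_LEN, EMBEDDING_DIM, HIDDEN_DIM = 4, 200, 10, 16

lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True)
embeds = torch.randn(BATCH_SIZE * SEQ_LEN, EMBEDDING_DIM)

# The last dimension of the view must equal input_size (EMBEDDING_DIM here).
inputs = embeds.view(BATCH_SIZE, SEQ_LEN, EMBEDDING_DIM)

# The hidden state is always (num_layers * num_directions, batch, hidden_size),
# regardless of batch_first.
h0 = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM)
c0 = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM)
lstm_out, (hn, cn) = lstm(inputs, (h0, c0))
```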

Error: AttributeError: module 'torch.utils.data' has no attribute 'random_split'.
Cause: In PyTorch 1.1.0 random_split is exposed in torch.utils.data, whereas in 0.4.0 it lives in torch.utils.data.dataset.
Fix: from torch.utils.data.dataset import random_split.

Error: ValueError: Sum of input lengths does not equal the length of the input dataset!
Cause: A dataset problem: the lengths passed to random_split do not add up to len(dataset).
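
One way to avoid the length mismatch is to derive the last split size from the dataset length instead of hard-coding it. This is a minimal sketch with a made-up ten-element dataset and an assumed 80/20 split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(10))  # hypothetical dataset

n_train = int(0.8 * len(dataset))
n_val = len(dataset) - n_train  # remainder, so the lengths always sum to len(dataset)
train_set, val_set = random_split(dataset, [n_train, n_val])
```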

Error: TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Fix: Change the accuracy computation to:
```
acc1 = (pred_cls1 == val_y1).cpu().sum().numpy() / pred_cls1.size()[0]
```
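
An alternative that skips NumPy entirely is to pull the scalar out with `.item()`. The tensors below are invented examples. The sketch runs on the GPU when one is available and on the CPU otherwise.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pred_cls1 = torch.tensor([0, 1, 1, 2], device=device)  # hypothetical predictions
val_y1 = torch.tensor([0, 1, 2, 2], device=device)     # hypothetical labels

# .item() copies the scalar result to the host as a Python number.
acc1 = (pred_cls1 == val_y1).sum().item() / pred_cls1.size(0)
```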

Error: RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
Cause: The model was wrapped like this:
```
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model.to(device)
```

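while the tensors were not assigned to a specific GPU ID.  
**Fix**: There are two options.  
1) First define `device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')` (this pins the device to GPU 0, "cuda:0"), then wrap the model with `model = torch.nn.DataParallel(model, device_ids=[0, 1, 2])` (assuming three GPUs). After that the tensors must also be moved to the GPU, and all of them must be on the same GPU: `tensor1 = tensor1.to(device)`, `tensor2 = tensor2.to(device)`, and so on. **Note**: calling `tensor1.to(device)` without assigning the result is not enough, because it only returns a copy and leaves the original tensor where it was.  
2) Use tensor.cuda() directly. First `model = torch.nn.DataParallel(model, device_ids=[0, 1, 2])` (assuming three GPUs with IDs 0, 1, 2), then `tensor1 = tensor1.cuda(0)`, `tensor2 = tensor2.cuda(0)`, and so on. (Here I put every tensor on GPU 0; putting all of them on GPU 1 works as well.)  
Reference: [Pitfalls encountered while learning PyTorch (continuously updated)](https://zhuanlan.zhihu.com/p/61892329)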
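
The first option can be sketched as follows. The `nn.Linear` model and the input shape are placeholders, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2)  # hypothetical stand-in for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)  # nn.Module.to moves the parameters in place

x = torch.randn(4, 8)
x.to(device)      # result discarded: x itself is unchanged
x = x.to(device)  # reassign so the moved tensor is the one actually used
out = model(x)
```

Note the asymmetry: `model.to(device)` modifies the module in place, while `tensor.to(device)` returns a new tensor.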

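Error: 'DataParallel' object has no attribute 'init_hidden'.
Cause: nn.DataParallel(m) no longer returns the original m; it returns a DataParallel wrapper, and the original m is stored in the wrapper's module attribute.
Fix: If the model needs to be modified after DataParallel and to(device), unwrap it first: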
```
if isinstance(model, nn.DataParallel):
    model = model.module
```
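
If custom methods are called in several places, a small helper keeps the unwrapping in one spot. The `Tagger` class and the `unwrap` function below are hypothetical names used only for illustration.

```python
import torch.nn as nn

class Tagger(nn.Module):  # hypothetical model with a custom method
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def init_hidden(self):
        return "hidden"

    def forward(self, x):
        return self.fc(x)

def unwrap(model):
    """Return the underlying module whether or not it is wrapped."""
    return model.module if isinstance(model, nn.DataParallel) else model

wrapped = nn.DataParallel(Tagger())
hidden = unwrap(wrapped).init_hidden()  # works on wrapped and plain models alike
```

Unlike reassigning `model = model.module`, this keeps the DataParallel wrapper available for the forward pass.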

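Error: Assertion `cur_target >= 0 && cur_target < n_classes' failed.
Cause: This comes up often in classification training. It generally appears when the number of classes output by the network differs from the number of classes in the labels. PyTorch also requires that the labels passed to CrossEntropyLoss start from 0.
Fix: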

```
import pandas as pd

tags_ids = range(len(tags_set))  # IDs start from 0
tag2id = pd.Series(tags_ids, index=tags_set)
```
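
The same remapping can be done with a plain dict, followed by a range check before the loss is computed. The raw labels and the random logits below are invented for the sketch.

```python
import torch
import torch.nn as nn

raw_labels = [1, 3, 3, 2]  # hypothetical labels that do not start at 0
classes = sorted(set(raw_labels))
label2id = {c: i for i, c in enumerate(classes)}  # maps 1 -> 0, 2 -> 1, 3 -> 2
targets = torch.tensor([label2id[c] for c in raw_labels])

n_classes = len(classes)
logits = torch.randn(len(raw_labels), n_classes)  # stand-in for the model output

# Every target must lie in [0, n_classes) or CrossEntropyLoss will fail.
assert 0 <= targets.min() and targets.max() < n_classes
loss = nn.CrossEntropyLoss()(logits, targets)
```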

Error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
Cause: A model trained on a GPU is being loaded on a CPU.
Fix: model = torch.load(model_path, map_location='cpu')
Similarly, if the model was trained on 4 GPUs and is now loaded on one, use model = torch.load(model_path, map_location='cuda:0');
when going from 4 GPUs to 2, change map_location to map_location={'cuda:1': 'cuda:0'}.
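
A minimal round trip shows the CPU case. The in-memory buffer stands in for `model_path`, and the sketch saves a state_dict rather than the whole model, which is an assumption about how the checkpoint was written.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical model
buffer = io.BytesIO()    # stands in for model_path
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" remaps every storage to the CPU while loading.
state = torch.load(buffer, map_location="cpu")
restored = nn.Linear(4, 2)
restored.load_state_dict(state)
```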
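
Error: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([3403, 128]) from checkpoint, the shape in current model is torch.Size([12386, 128]).
Cause: The word_embeddings parameters in the loaded checkpoint do not match the shape in the current model.
Fix: Keep the lengths of word2id and tag2id consistent, so that the vocabulary size of the current model matches the checkpoint.

Error: RuntimeError: Expected hidden[0] size (2, 359, 256), got (2, 512, 256).
Cause: The data is loaded with DataLoader and the number of training examples is not divisible by batch_size, so the last batch is smaller than batch_size, while the hidden state is initialized with a fixed batch_size: autograd.Variable(torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim // 2))
Fix: If the model cannot handle a batch size that changes on the fly, set drop_last=True on the DataLoader from torch.utils.data so that only full batches are processed during training, e.g. testset_loader = DataLoader(test_db, batch_size=args.batch_size, shuffle=False, num_workers=1, pin_memory=True, drop_last=True)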
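
The effect of drop_last is easy to see on a toy dataset. The sketch also shows an alternative that the original does not use: sizing the hidden state from the batch actually received. The function name and default sizes are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10, 3))  # 10 samples, batch size 4

# Without drop_last the final batch holds the 2 leftover samples.
sizes_default = [x.size(0) for (x,) in DataLoader(dataset, batch_size=4)]
sizes_dropped = [x.size(0) for (x,) in DataLoader(dataset, batch_size=4, drop_last=True)]

def init_hidden(batch_size, num_layers=1, hidden_dim=8):
    # Size the state from the current batch, not from a stored constant.
    shape = (num_layers * 2, batch_size, hidden_dim // 2)
    return torch.zeros(shape), torch.zeros(shape)

h0, c0 = init_hidden(sizes_default[-1])
```

Dropping the last batch discards a few samples per epoch, whereas sizing the state per batch keeps all of them.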
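
Loss becomes NaN (loss=nan) during PyTorch training.
Cause:
1. The learning rate is too high. With a large learning rate the parameters may overshoot, so the minimum is never found; lowering the learning rate lets the parameters move toward the extremum.
2. There is a problem with the loss function.
3. For regression problems, a division by zero may have occurred; adding a small epsilon term may fix it.
4. The data itself may contain NaN; check input and target with numpy.any(numpy.isnan(x)).
5. The target must be something the loss function can handle. For example, with a sigmoid output the target should not be negative (for binary cross-entropy it must lie in [0, 1]). Again, check the dataset.
Fix: Lower the learning rate or increase batch_size.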
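
Items 3 and 4 can be checked quickly in code. The arrays below are invented, and `eps = 1e-8` is an assumed value that may need tuning for real data.

```python
import numpy as np
import torch

# Item 4: look for NaN in the input or target before training.
x = np.array([1.0, np.nan, 3.0])
has_nan_np = bool(np.any(np.isnan(x)))

# Item 3: 0 / 0 produces NaN; a small epsilon in the denominator avoids it.
num = torch.tensor([1.0, 0.0])
den = torch.tensor([2.0, 0.0])
unsafe = num / den

eps = 1e-8  # assumed small constant
safe = num / (den + eps)

unsafe_has_nan = bool(torch.isnan(unsafe).any())
safe_has_nan = bool(torch.isnan(safe).any())
```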
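
Error: RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.
Cause: The network contains several sub-networks, and two or more losses each need to update the parameters, for example by calling loss1.backward() and loss2.backward() separately. The two losses may share part of the graph, and PyTorch frees the saved computation graph once loss1.backward() finishes, so the graph is already gone when loss2.backward() runs.
Fix:
1. Call loss.backward(retain_graph=True) to keep the computation graph. This can easily lead to running out of memory (CUDA out of memory): PyTorch frees all buffers on every .backward() call, which is why the message suggests retaining the graph, but once the graph is retained the buffers are not freed, hence the OOM. Reference: https://blog.csdn.net/Mundane\_World/article/details/81038274
2. When the generator's gradients are not needed, pass the generated data through .detach() before feeding it to the discriminator. This splits the current graph and returns a new Variable detached from it, which never requires gradients. Reference: https://blog.csdn.net/u011276025/article/details/76997425
3. In my code, retain_graph=True ran out of memory and I could not find a place that needed .detach(). It turned out that my model was not re-initializing its hidden state on each training step. After model.zero_grad(), call model.hidden = model.init_hidden() to clear the LSTM hidden state and detach it from the history of the previous instance, so that earlier runs do not interfere. Without this re-initialization the error is raised.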
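
The shared-graph situation can be reproduced on a tiny example. The tensors are made up, and summing the losses into a single backward pass is an alternative not mentioned in the original.

```python
import torch

x = torch.ones(3, requires_grad=True)
shared = x * x               # part of the graph used by both losses
loss1 = shared.sum()
loss2 = (2 * shared).sum()

loss1.backward(retain_graph=True)  # keep the buffers for the second pass
loss2.backward()                   # would raise RuntimeError without retain_graph above
grad_two_passes = x.grad.clone()

# Alternative: combine the losses and backpropagate once.
x.grad = None
shared = x * x
(shared.sum() + (2 * shared).sum()).backward()
grad_single_pass = x.grad.clone()

# Detaching cuts the history, as when carrying a hidden state across steps.
carried = shared.detach()
```

Both approaches give the same gradients here. The single pass avoids holding the graph in memory between calls.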
