
Saving and Loading Models in PyTorch | Complete Code Included



The goal of this article is to show how to save a model and load it again, so that we can continue training after the last epoch and make predictions. If you are reading this, I assume you are familiar with the basics of deep learning and PyTorch.

Have you ever spent hours or days training a model, only to have it stop midway? Or been unhappy with your model's performance and wanted to keep training it? For many reasons, we need a flexible way to save and load our models.

Free cloud services such as Kaggle and Google Colab have idle timeouts that disconnect or interrupt your notebook once they trigger. Unless you are training only a handful of epochs on a GPU, the process takes time, so being able to save your model is a huge advantage that can save the day. For flexibility, I will save both the latest checkpoint and the best checkpoint.

This article uses the common Fashion-MNIST dataset and walks through a complete pipeline, from importing the data to making predictions. (Training in this article is done on Kaggle.)

Step 1: Preparation

  1. On Kaggle, the notebook you are working in is called __notebook__.ipynb by default.

  2. Create two directories to store the checkpoint and the best model:


   
# uncomment if you want to create directory checkpoint, best_model
%mkdir checkpoint best_model
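If you prefer plain Python to the notebook magic (for example, when running this outside Jupyter), a minimal equivalent sketch using only the standard library:

# pure-Python equivalent of the %mkdir magic above
import os
for d in ('checkpoint', 'best_model'):
    os.makedirs(d, exist_ok=True)  # exist_ok avoids an error when the notebook is re-run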

Step 2: Import libraries and create helper functions

Import the libraries


   
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import torch
import shutil
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np

   
# check if CUDA is available
use_cuda = torch.cuda.is_available()

The save function

save_ckp is created to save checkpoints, both the latest one and the best one. This gives us flexibility: you may be interested in the state of the latest checkpoint, or of the best one.


   
def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)

In our case, we want to save a checkpoint that lets us continue training the model from where we left off. Here is the information we need:

  • epoch: the number of times all of the training vectors have been used to update the weights

  • valid_loss_min: the minimum validation loss. We need this so that when we continue training we can start from this value rather than from np.Inf.

  • state_dict: the model's learned parameters; it holds a parameter tensor for each layer.

  • optimizer: the optimizer's state also needs to be saved, especially when using Adam as the optimizer. Adam is an adaptive learning rate method, meaning it maintains individual learning rates for different parameters; if we want to continue training, we need this state.
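Putting these four pieces together, a checkpoint is just a plain Python dictionary. A minimal sketch of what we will hand to save_ckp (the same dict is built inside the training loop in Step 5):

# assemble the checkpoint dict described above (built once per epoch in Step 5)
checkpoint = {
    'epoch': epoch + 1,                    # next epoch to resume from
    'valid_loss_min': valid_loss,          # best validation loss so far
    'state_dict': model.state_dict(),      # per-layer parameter tensors
    'optimizer': optimizer.state_dict(),   # optimizer state (e.g. Adam's moments)
}
save_ckp(checkpoint, False, './checkpoint/current_checkpoint.pt', './best_model/best_model.pt')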

The load function


   
def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path to the saved checkpoint
    model: model that we want to load checkpoint parameters into
    optimizer: optimizer we defined in previous training
    """
    # load check point
    checkpoint = torch.load(checkpoint_fpath)
    # initialize state_dict from checkpoint to model
    model.load_state_dict(checkpoint['state_dict'])
    # initialize optimizer from checkpoint to optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])
    # initialize valid_loss_min from checkpoint to valid_loss_min
    valid_loss_min = checkpoint['valid_loss_min']
    # return model, optimizer, epoch value, min validation loss
    return model, optimizer, checkpoint['epoch'], valid_loss_min.item()

load_ckp is created to load a model. It takes:

  • the path of the saved checkpoint

  • the model instance that we want to load the state into

  • the optimizer
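One caveat the original code does not cover: if the checkpoint was saved on a GPU machine and you later load it on a CPU-only machine, torch.load needs a map_location argument, otherwise it will try to restore CUDA tensors. A minimal sketch of the change inside load_ckp:

# remap CUDA tensors onto the CPU while loading the checkpoint
checkpoint = torch.load(checkpoint_fpath, map_location=torch.device('cpu'))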

Step 3: Import the Fashion-MNIST dataset and create the data loaders


   
# Define a transform to normalize the data
# Fashion-MNIST images are single-channel, so we pass one mean/std value each
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
# Download and load the training data
trainset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=True, transform=transform)
# Download and load the test data
testset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=False, transform=transform)
loaders = {
    'train': torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True),
    'test': torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True),
}
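As a quick sanity check, pull one batch from the loader; Fashion-MNIST images are 28x28 grayscale, so with batch_size=64 you should see the shapes below:

# inspect one batch to confirm the loaders work as expected
images, labels = next(iter(loaders['train']))
print(images.shape)   # torch.Size([64, 1, 28, 28])
print(labels.shape)   # torch.Size([64])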

Step 4: Define and create the model


   
# Define your network ( Simple Example )
class FashionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        input_size = 784
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))
        x = self.dropout(F.relu(self.fc4(x)))
        x = F.log_softmax(self.fc5(x), dim=1)
        return x

   
# Create the network, define the criterion and optimizer
model = FashionClassifier()
# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
print(model)

Model structure output:


   
FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)

Step 5: Train the network and save the model

The training function lets us set the epoch range, the resume state, and other parameters.

Define the loss function and the optimizer

Below we use the Adam optimizer with negative log-likelihood loss (NLLLoss); since the network outputs log-softmax class scores, this combination is equivalent to cross-entropy loss. We compute the loss and perform backpropagation.


   
# define loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
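A side note on this choice: NLLLoss applied to log-softmax outputs is mathematically identical to CrossEntropyLoss applied to raw logits, which is why we can speak of cross-entropy while the code uses NLLLoss. A quick sketch to verify (the logits and labels here are made-up values for illustration):

# NLLLoss(log_softmax(x)) gives the same value as CrossEntropyLoss(x)
logits = torch.randn(4, 10)               # fake batch: 4 samples, 10 classes
target = torch.tensor([1, 0, 4, 9])       # fake labels
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(torch.allclose(nll, ce))            # True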

Define the training method


   
def train(start_epochs, n_epochs, valid_loss_min_input, loaders, model, optimizer, criterion, use_cuda, checkpoint_path, best_model_path):
    """
    Keyword arguments:
    start_epochs -- epoch number to start training from
    n_epochs -- epoch number to stop training at
    valid_loss_min_input -- minimum validation loss seen so far (np.Inf for a fresh run)
    loaders -- dict holding the train/test DataLoaders
    model -- the network to train
    optimizer -- the optimizer
    criterion -- the loss function
    use_cuda -- whether to move batches to the GPU
    checkpoint_path -- path to save the latest checkpoint
    best_model_path -- path to save the best checkpoint
    returns trained model
    """
    # initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input
    for epoch in range(start_epochs, n_epochs + 1):
        # initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
        ###################
        # train the model #
        ###################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            # clear the gradients of all optimized variables
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()
            # perform a single optimization step (parameter update)
            optimizer.step()
            # record the running average of the training loss
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
        ######################
        # validate the model #
        ######################
        model.eval()
        for batch_idx, (data, target) in enumerate(loaders['test']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # update the running average of the validation loss
            valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.data - valid_loss))
        # note: train_loss and valid_loss are already per-batch running averages,
        # so dividing by the dataset size again makes the printed values very small
        train_loss = train_loss / len(loaders['train'].dataset)
        valid_loss = valid_loss / len(loaders['test'].dataset)
        # print training/validation statistics
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch,
            train_loss,
            valid_loss
        ))
        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch + 1,
            'valid_loss_min': valid_loss,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        # save the latest checkpoint every epoch
        save_ckp(checkpoint, False, checkpoint_path, best_model_path)
        # save the checkpoint as the best model if validation loss has decreased
        if valid_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}). Saving model ...'.format(valid_loss_min, valid_loss))
            save_ckp(checkpoint, True, checkpoint_path, best_model_path)
            valid_loss_min = valid_loss
    # return trained model
    return model

Train the model

trained_model = train(1, 3, np.Inf, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")

Output:


   
Epoch: 1 	Training Loss: 0.000010 	Validation Loss: 0.000044
Validation loss decreased (inf --> 0.000044). Saving model ...
Epoch: 2 	Training Loss: 0.000007 	Validation Loss: 0.000040
Validation loss decreased (0.000044 --> 0.000040). Saving model ...
Epoch: 3 	Training Loss: 0.000007 	Validation Loss: 0.000040
Validation loss decreased (0.000040 --> 0.000040). Saving model ...

Let's look at a few of the parameters we used above:

  • start_epochs: the value of the epoch to start training from

  • n_epochs: the value of the epoch to stop training at

  • valid_loss_min_input = np.Inf: the initial minimum validation loss; for a fresh run we start from infinity

  • checkpoint_path: the full path where the latest checkpoint state is saved during training

  • best_model_path: the full path where the best checkpoint state is saved during training

Verify that the model was saved

  • List all files in the best_model directory:

%ls ./best_model/

Output:

best_model.pt

  • List all files in the checkpoint directory:

%ls ./checkpoint/

Output:

current_checkpoint.pt

Step 6: Load the model

Reconstruct the model


   
model = FashionClassifier()
# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
print(model)

Output:


   
FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)

Define the optimizer and the checkpoint file path


   
# define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# define checkpoint saved path
ckp_path = "./checkpoint/current_checkpoint.pt"

Load the model using the load_ckp function


   
# load the saved checkpoint
model, optimizer, start_epoch, valid_loss_min = load_ckp(ckp_path, model, optimizer)

I printed out the values returned by load_ckp to make sure everything was correct.


   
  1. print( "model = ", model)
  2. print( "optimizer = ", optimizer)
  3. print( "start_epoch = ", start_epoch)
  4. print( "valid_loss_min = ", valid_loss_min)
  5. print( "valid_loss_min = {:.6f}".format(valid_loss_min))

Output:


   
model =  FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)
optimizer =  Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
start_epoch =  4
valid_loss_min =  3.952759288949892e-05
valid_loss_min = 0.000040

With all the required information loaded, we can continue training from epoch = 4. Previously, we trained the model from epoch 1 to 3.

Step 7: Continue training and/or run inference

Continue training

We can keep using the train function to train our model, supplying the checkpoint values we got from the load_ckp function above.

trained_model = train(start_epoch, 6, valid_loss_min, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")

Output:


   
Epoch: 4 	Training Loss: 0.000006 	Validation Loss: 0.000040
Epoch: 5 	Training Loss: 0.000006 	Validation Loss: 0.000037
Validation loss decreased (0.000040 --> 0.000037). Saving model ...
Epoch: 6 	Training Loss: 0.000006 	Validation Loss: 0.000036
Validation loss decreased (0.000037 --> 0.000036). Saving model ...

  • Note: the epochs now start at 4 and end at 6 (start_epoch = 4)

  • The validation loss carries on from the previous training checkpoint

  • At epoch 3, the minimum validation loss was 0.000040

  • Here, the minimum validation loss starts at 0.000040 rather than at inf (a resume-if-exists pattern is sketched below)
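In practice you may want a notebook that resumes automatically whenever a checkpoint exists, so a restarted Kaggle session picks up where it left off. A minimal sketch of that pattern, built from the helpers defined above (os.path.exists is the only new piece):

import os

ckp_path = './checkpoint/current_checkpoint.pt'
start_epoch, valid_loss_min = 1, np.Inf   # defaults for a fresh run
if os.path.exists(ckp_path):
    # a previous run left a checkpoint behind: resume from it
    model, optimizer, start_epoch, valid_loss_min = load_ckp(ckp_path, model, optimizer)
trained_model = train(start_epoch, 6, valid_loss_min, loaders, model, optimizer,
                      criterion, use_cuda,
                      './checkpoint/current_checkpoint.pt', './best_model/best_model.pt')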

Model inference

Before running inference, you must call model.eval() to set the dropout and batch normalization layers to evaluation mode. Failing to do so will yield inconsistent inference results.

trained_model.eval()

   
test_acc = 0.0
for samples, labels in loaders['test']:
    with torch.no_grad():
        # move to GPU if available
        if use_cuda:
            samples, labels = samples.cuda(), labels.cuda()
        output = trained_model(samples)
        # calculate accuracy
        pred = torch.argmax(output, dim=1)
        correct = pred.eq(labels)
        test_acc += torch.mean(correct.float())
print('Accuracy of the network on {} test images: {}%'.format(len(testset), round(test_acc.item() * 100.0 / len(loaders['test']), 2)))

Output:

Accuracy of the network on 10000 test images: 86.58%

Where to find the output/saved files of your Kaggle notebook:

In your Kaggle notebook, scroll down to the bottom of the page. The files saved by the steps above appear there.

Full code: https://www.kaggle.com/vortanasay/saving-loading-and-cont-training-model-in-pytorch


   

Reposted from: https://blog.csdn.net/weixin_38739735/article/details/114317581