The purpose of this article is to show how to save a model and load it, so that we can continue training after the last epoch and make predictions. If you are reading this article, I assume you are familiar with the basics of deep learning and PyTorch.

Have you ever been in the situation where you spent hours or days training your model and then it stopped halfway through? Or you were not satisfied with your model's performance and wanted to keep training? For many reasons, we may want a flexible way to save and load our models.

Many free cloud services such as Kaggle and Google Colab have an idle timeout: once it is hit, the notebook is disconnected or interrupted. Unless you are training for only a few epochs on a GPU, this process takes time, so being able to save your model gives you a huge advantage and can save the day. For flexibility, I will save both the latest checkpoint and the best checkpoint.

This article uses the popular Fashion_MNIST_data dataset, and we will write a complete pipeline, from importing the data to making predictions. (Training in this article is done on Kaggle.)
Step 1: Preparation
In Kaggle, by default, the notebook file you are working in is called __notebook__.ipynb.
Create two directories to store the checkpoints and the best model:
# uncomment if you want to create directory checkpoint, best_model
%mkdir checkpoint best_model
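If you prefer to avoid notebook magics, a plain-Python equivalent is a one-liner (a minimal sketch, not in the original notebook):

import os

# create the two directories; exist_ok avoids an error if they already exist
os.makedirs("checkpoint", exist_ok=True)
os.makedirs("best_model", exist_ok=True)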
Step 2: Import the libraries and create helper functions

Import the libraries
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import torch
import shutil
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np

# check if CUDA is available
use_cuda = torch.cuda.is_available()
The save function

save_ckp is created to save a checkpoint, both the latest one and the best one. This creates flexibility: you may be interested in the state of the latest checkpoint, or in the best checkpoint.
def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is the best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)
In our case, we want to save a checkpoint that allows us to use this information to continue training. Here is the information we need:

epoch: the number of times all of the training vectors have been used to update the weights
valid_loss_min: the minimum validation loss; we need it so that when we continue training we can start from this value rather than from np.Inf
state_dict: the model's parameters; it contains the parameter matrices of each layer
optimizer: the optimizer's state also needs to be saved, especially when using Adam. Adam is an adaptive learning rate method, that is, it computes individual learning rates for different parameters, and we need those values if we want to continue training.
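Putting these together, a minimal sketch of what such a checkpoint dictionary looks like when handed to save_ckp (the names epoch, valid_loss, model and optimizer refer to the training loop we will write in Step 5):

# sketch: assemble the checkpoint dictionary described above and save it
checkpoint = {
    'epoch': epoch + 1,                   # next epoch to run
    'valid_loss_min': valid_loss,         # best validation loss so far
    'state_dict': model.state_dict(),     # per-layer parameter tensors
    'optimizer': optimizer.state_dict(),  # optimizer state (e.g. Adam moments)
}
save_ckp(checkpoint, False, './checkpoint/current_checkpoint.pt', './best_model/best_model.pt')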
The load function
def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path of the saved checkpoint
    model: model that we want to load checkpoint parameters into
    optimizer: optimizer we defined in previous training
    """
    # load checkpoint
    checkpoint = torch.load(checkpoint_fpath)
    # initialize state_dict from checkpoint to model
    model.load_state_dict(checkpoint['state_dict'])
    # initialize optimizer from checkpoint to optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])
    # initialize valid_loss_min from checkpoint to valid_loss_min
    valid_loss_min = checkpoint['valid_loss_min']
    # return model, optimizer, epoch value, min validation loss
    return model, optimizer, checkpoint['epoch'], valid_loss_min.item()
load_ckp is created to load the model. It needs:

the location of the saved checkpoint
the instance of the model that we want to load the state into
the optimizer
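One caveat worth noting (not covered in the original code): torch.load restores tensors to the device they were saved from. If the checkpoint was saved on a GPU machine and you later load it on a CPU-only machine, pass map_location so the tensors are remapped, for example:

# assumption: checkpoint was saved on GPU, loading on CPU-only hardware
checkpoint = torch.load(checkpoint_fpath, map_location=torch.device('cpu'))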
Step 3: Import the Fashion_MNIST_data dataset and create the data loaders
# Define a transform to normalize the data
# (FashionMNIST images have a single channel, so we pass one mean and one
#  std value; three-channel values would fail on grayscale input)
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=True, transform=transform)

# Download and load the test data
testset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=False, transform=transform)

loaders = {
    'train': torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True),
    'test': torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True),
}
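As a quick sanity check (a sketch, not part of the original notebook), you can pull one batch from the train loader and confirm the shapes:

# inspect one batch; FashionMNIST images are 1x28x28, batch size is 64
images, labels = next(iter(loaders['train']))
print(images.shape)  # torch.Size([64, 1, 28, 28])
print(labels.shape)  # torch.Size([64])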
Step 4: Define and create the model
# Define your network (simple example)
class FashionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        input_size = 784
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        # flatten the input image into a 784-dimensional vector
        x = x.view(x.shape[0], -1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))
        x = self.dropout(F.relu(self.fc4(x)))
        x = F.log_softmax(self.fc5(x), dim=1)
        return x

# Create the network, define the criterion and optimizer
model = FashionClassifier()

# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()

print(model)
Model structure output:
FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)
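For reference, a small sketch (not in the original) that counts the trainable parameters of this architecture:

# total number of trainable parameters in the network
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 575,050 for the layer sizes above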
Step 5: Train the network and save the model

The training function lets us set the epoch values, among other parameters.

Define the loss function and the optimizer

Below we use the Adam optimizer and the negative log-likelihood loss; since the network outputs log class scores via log_softmax, this amounts to cross entropy over the class scores. We compute the loss and perform backpropagation.
# define loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
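A side note on the choice of criterion: NLLLoss applied to log_softmax output is mathematically the same as CrossEntropyLoss applied to raw logits. A quick sketch to verify this (random values, for illustration only):

# log_softmax + NLLLoss == CrossEntropyLoss on the raw scores
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 4, 9])
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
loss_ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.isclose(loss_nll, loss_ce))  # tensor(True)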
Define the training method
def train(start_epochs, n_epochs, valid_loss_min_input, loaders, model, optimizer, criterion, use_cuda, checkpoint_path, best_model_path):
    """
    Keyword arguments:
    start_epochs -- epoch number to start training from
    n_epochs -- epoch number to stop training at
    valid_loss_min_input -- initial minimum validation loss (np.Inf for a fresh run)
    loaders -- dict of train/test DataLoaders
    model -- the network to train
    optimizer -- the optimizer
    criterion -- the loss function
    use_cuda -- whether to train on GPU
    checkpoint_path -- path to save the latest checkpoint
    best_model_path -- path to save the best checkpoint

    returns trained model
    """
    # initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input

    for epoch in range(start_epochs, n_epochs+1):
        # initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0

        ###################
        # train the model #
        ###################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            ## find the loss and update the model parameters accordingly
            # clear the gradients of all optimized variables
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()
            # perform a single optimization step (parameter update)
            optimizer.step()
            # record the running average of the training loss
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))

        ######################
        # validate the model #
        ######################
        model.eval()
        for batch_idx, (data, target) in enumerate(loaders['test']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # update the running average of the validation loss
            valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.data - valid_loss))

        # note: the running averages above are already mean batch losses;
        # dividing by the dataset size again scales them down, which is why
        # the printed losses below are on the order of 1e-5 (kept as in the
        # original so the output matches)
        train_loss = train_loss/len(loaders['train'].dataset)
        valid_loss = valid_loss/len(loaders['test'].dataset)

        # print training/validation statistics
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch,
            train_loss,
            valid_loss
        ))

        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch + 1,
            'valid_loss_min': valid_loss,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }

        # save the latest checkpoint every epoch
        save_ckp(checkpoint, False, checkpoint_path, best_model_path)

        # save the model as the best checkpoint if validation loss has decreased
        if valid_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min, valid_loss))
            # save checkpoint as best model
            save_ckp(checkpoint, True, checkpoint_path, best_model_path)
            valid_loss_min = valid_loss

    # return trained model
    return model
Train the model
trained_model = train(1, 3, np.Inf, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")
Output:
Epoch: 1 	Training Loss: 0.000010 	Validation Loss: 0.000044
Validation loss decreased (inf --> 0.000044).  Saving model ...
Epoch: 2 	Training Loss: 0.000007 	Validation Loss: 0.000040
Validation loss decreased (0.000044 --> 0.000040).  Saving model ...
Epoch: 3 	Training Loss: 0.000007 	Validation Loss: 0.000040
Validation loss decreased (0.000040 --> 0.000040).  Saving model ...
Let's look at a few of the parameters we used above:

start_epoch: the starting epoch value for training
n_epochs: the ending epoch value for training
valid_loss_min_input: the initial minimum validation loss, np.Inf for a fresh run
checkpoint_path: full path to save the state of the latest checkpoint of the training
best_model_path: full path to save the state of the best checkpoint of the training
Verify that the model was saved

List all the files in the best_model directory:

%ls ./best_model/

Output:

best_model.pt

List all the files in the checkpoint directory:

%ls ./checkpoint/

Output:

current_checkpoint.pt
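You can also open the saved file and confirm it contains the four fields we put into the checkpoint dictionary (a quick sketch):

# inspect the keys stored in the latest checkpoint
ckpt = torch.load("./checkpoint/current_checkpoint.pt")
print(list(ckpt.keys()))  # ['epoch', 'valid_loss_min', 'state_dict', 'optimizer']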
Step 6: Load the model

Reconstruct the model
model = FashionClassifier()

# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()

print(model)
Output:
FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)
Define the optimizer and the checkpoint file path
# define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# define checkpoint saved path
ckp_path = "./checkpoint/current_checkpoint.pt"
Load the model using the load_ckp function
# load the saved checkpoint
model, optimizer, start_epoch, valid_loss_min = load_ckp(ckp_path, model, optimizer)
I printed out the values returned by load_ckp to make sure everything was correct.
print("model = ", model)
print("optimizer = ", optimizer)
print("start_epoch = ", start_epoch)
print("valid_loss_min = ", valid_loss_min)
print("valid_loss_min = {:.6f}".format(valid_loss_min))
Output:
model =  FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)
optimizer =  Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
start_epoch =  4
valid_loss_min =  3.952759288949892e-05
valid_loss_min = 0.000040
After loading all the information we need, we can continue training, starting from epoch 4. Previously, we trained the model from epoch 1 to 3.
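Here we loaded the latest checkpoint because we want to resume training. If all you need are the best-performing weights for inference, a sketch along the same lines (using the best_model.pt saved earlier; no optimizer state is required):

# load only the weights from the best checkpoint, for inference
best_model = FashionClassifier()
if use_cuda:
    best_model = best_model.cuda()
best_ckpt = torch.load("./best_model/best_model.pt")
best_model.load_state_dict(best_ckpt['state_dict'])
best_model.eval()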
Step 7: Continue training and/or run inference

Continue training

We can continue to train our model with the train function, supplying the checkpoint values we got from the load_ckp function above.
trained_model = train(start_epoch, 6, valid_loss_min, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")
Output:
Epoch: 4 	Training Loss: 0.000006 	Validation Loss: 0.000040
Epoch: 5 	Training Loss: 0.000006 	Validation Loss: 0.000037
Validation loss decreased (0.000040 --> 0.000037).  Saving model ...
Epoch: 6 	Training Loss: 0.000006 	Validation Loss: 0.000036
Validation loss decreased (0.000037 --> 0.000036).  Saving model ...
Note: the epochs now start at 4 and end at 6 (start_epoch = 4)

The validation loss continues from the last training checkpoint:

At epoch 3, the minimum validation loss was 0.000040
Here, the minimum validation loss starts at 0.000040 instead of inf
Model inference

Before running inference, you must call model.eval() to set the dropout and batch normalization layers to evaluation mode. Failing to do so will yield inconsistent inference results.
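To see why this matters, a quick sketch (not in the original): in train mode dropout randomly zeroes activations, so two forward passes on the same input usually disagree; in eval mode dropout is disabled and the output is deterministic.

x = torch.randn(1, 784)
if use_cuda:
    x = x.cuda()
with torch.no_grad():
    trained_model.train()
    print(torch.equal(trained_model(x), trained_model(x)))  # usually False: dropout is active
    trained_model.eval()
    print(torch.equal(trained_model(x), trained_model(x)))  # True: dropout disabled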
trained_model.eval()
test_acc = 0.0
for samples, labels in loaders['test']:
    with torch.no_grad():
        # move to GPU if available
        if use_cuda:
            samples, labels = samples.cuda(), labels.cuda()
        output = trained_model(samples)
        # calculate accuracy
        pred = torch.argmax(output, dim=1)
        correct = pred.eq(labels)
        test_acc += torch.mean(correct.float())
print('Accuracy of the network on {} test images: {}%'.format(len(testset), round(test_acc.item()*100.0/len(loaders['test']), 2)))
Output:
Accuracy of the network on 10000 test images: 86.58%
Where to find the output/saved files of a Kaggle notebook:

In your Kaggle notebook, you can scroll down to the bottom of the page. The files saved in the previous steps are listed there.
Full code: https://www.kaggle.com/vortanasay/saving-loading-and-cont-training-model-in-pytorch
Source: https://blog.csdn.net/weixin_38739735/article/details/114317581