Building, Training, and Running Inference with a CNN-Based Model
Preface
This is the final part of the series: building and training a neural network for urban sound audio classification. For the course outline and dataset preparation, see my earlier posts:
1. PyTorch for Audio + Music Processing (1): Course Overview
2. PyTorch for Audio + Music Processing (2/3/4/5/6/7): Building the dataset and extracting audio features
This installment covers:
08 Implementing a CNN network
Building a CNN model with a VGG-like structure
09 Training urban sound classifier
Training the urban sound audio classification model
10 Predictions with sound classifier
Implementing inference with the trained classifier
I. Building the CNN Model
The model is built as follows:
1. Four convolutional blocks (conv1 through conv4), each containing Conv2d, ReLU, and MaxPool2d
2. A flatten layer
3. A fully connected linear layer
4. Softmax
The code is as follows:
from torch import nn


class CNNNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        # 4 conv blocks / flatten / linear / softmax
        self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(
                in_channels=32,
                out_channels=64,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(
                in_channels=64,
                out_channels=128,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(128 * 5 * 4, 10)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        logits = self.linear(x)
        predictions = self.softmax(logits)
        return predictions
The network structure, printed with torchsummary:
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 16, 66, 46] 160
ReLU-2 [-1, 16, 66, 46] 0
MaxPool2d-3 [-1, 16, 33, 23] 0
Conv2d-4 [-1, 32, 35, 25] 4,640
ReLU-5 [-1, 32, 35, 25] 0
MaxPool2d-6 [-1, 32, 17, 12] 0
Conv2d-7 [-1, 64, 19, 14] 18,496
ReLU-8 [-1, 64, 19, 14] 0
MaxPool2d-9 [-1, 64, 9, 7] 0
Conv2d-10 [-1, 128, 11, 9] 73,856
ReLU-11 [-1, 128, 11, 9] 0
MaxPool2d-12 [-1, 128, 5, 4] 0
Flatten-13 [-1, 2560] 0
Linear-14 [-1, 10] 25,610
Softmax-15 [-1, 10] 0
================================================================
Total params: 122,762
Trainable params: 122,762
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 1.83
Params size (MB): 0.47
Estimated Total Size (MB): 2.31
----------------------------------------------------------------
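A minimal sketch of how this summary can be reproduced, assuming the torchsummary package is installed (the input size (1, 64, 44), i.e. channels × mel bands × time frames, matches the mel-spectrogram features discussed next):

from torchsummary import summary

cnn = CNNNetwork()
# input_size is (channels, n_mels, time frames); device="cpu" avoids requiring a GPU
summary(cnn, input_size=(1, 64, 44), device="cpu")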
Input/output shapes and parameter counts
Take the first convolutional block as an example. The features extracted from the audio earlier via the mel-spectrogram transform form a tensor of shape 64×44 (64 mel bands × 44 time frames, one channel).
The block is defined as:
self.conv1 = nn.Sequential(
    nn.Conv2d(
        in_channels=1,
        out_channels=16,
        kernel_size=3,
        stride=1,
        padding=2
    ),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2)
)
Computing the output shape
With kernel_size=3, stride=1, and padding=2, the output size along each dimension is (in + 2×padding − kernel)/stride + 1, i.e. (64 + 4 − 3)/1 + 1 = 66 and (44 + 4 − 3)/1 + 1 = 46, so the feature map becomes 66×46.
out_channels=16 means there are 16 convolution kernels, each convolved with the input to produce one output channel, so the output has 16 channels.
The output tensor of the Conv2d layer therefore has shape 16×66×46.
MaxPool2d with kernel_size=2 then halves each spatial dimension, giving a final shape of 16×33×23.
Computing the parameter count
Each kernel is 3×3 and its weights are shared across spatial positions, so with a single input channel each kernel has 3×3×1 = 9 weights.
With 16 kernels that is 16×9 = 144 weights; adding one bias per kernel gives 16×9 + 16 = 160 parameters. More generally, params = out_channels × (kernel² × in_channels) + out_channels; for conv2 this gives 32×(3×3×16) + 32 = 4,640, matching the summary above.
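A quick sanity check of these formulas against the torchsummary table above (a self-contained sketch; the helper names conv_out and conv_params are my own):

def conv_out(size, kernel=3, stride=1, padding=2):
    # output size of a Conv2d along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(in_ch, out_ch, kernel=3):
    # kernel weights per output channel, plus one bias per output channel
    return out_ch * (kernel * kernel * in_ch) + out_ch

h, w = 64, 44
for in_ch, out_ch in [(1, 16), (16, 32), (32, 64), (64, 128)]:
    h, w = conv_out(h), conv_out(w)  # after Conv2d
    print(f"conv: {out_ch}x{h}x{w}, params={conv_params(in_ch, out_ch)}")
    h, w = h // 2, w // 2            # after MaxPool2d(kernel_size=2)
    print(f"pool: {out_ch}x{h}x{w}")
# ends at 128x5x4, which flattens to 2560 -- hence nn.Linear(128 * 5 * 4, 10)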
II. Training the Model
Creating the dataloader

from torch.utils.data import DataLoader


def create_data_loader(train_data, batch_size):
    # train_data is the UrbanSoundDataset defined in the earlier posts;
    # batch_size is the number of samples per training batch
    # (shuffle is left at its default of False here; shuffle=True is common for training)
    train_dataloader = DataLoader(train_data, batch_size=batch_size)
    return train_dataloader
Training a single epoch

def train_single_epoch(model, data_loader, loss_fn, optimiser, device):
    # model is the CNN defined above; loss_fn is the loss function,
    # optimiser the optimisation method, and device the training device
    for input, target in data_loader:
        # fetch a batch of training data and labels from the iterator
        input, target = input.to(device), target.to(device)

        # calculate loss
        prediction = model(input)           # forward pass
        loss = loss_fn(prediction, target)  # loss from model output and labels

        # backpropagate error and update weights
        optimiser.zero_grad()  # zero the gradients; training uses mini-batches, and
                               # without zeroing they would accumulate across batches
        loss.backward()        # backpropagation computes the gradients
        optimiser.step()       # the optimiser updates the weights from the gradients

    print(f"loss: {loss.item()}")
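Note that this prints only the loss of the last batch in each epoch. A variant (my own extension, not part of the course code) that reports the average loss over the whole epoch might look like this:

def train_single_epoch_avg(model, data_loader, loss_fn, optimiser, device):
    running_loss, n_batches = 0.0, 0
    for input, target in data_loader:
        input, target = input.to(device), target.to(device)
        loss = loss_fn(model(input), target)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        running_loss += loss.item()
        n_batches += 1
    print(f"avg loss: {running_loss / n_batches}")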
Training for multiple epochs

def train(model, data_loader, loss_fn, optimiser, device, epochs):
    for i in range(epochs):
        print(f"Epoch {i + 1}")
        train_single_epoch(model, data_loader, loss_fn, optimiser, device)
        print("---------------------------")
    print("Finished training")
The overall training script

import torch
import torchaudio

if __name__ == "__main__":
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    print(f"Using {device}")

    # define the mel-spectrogram transform with torchaudio,
    # to be handed to the UrbanSoundDataset below
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE,
        n_fft=1024,
        hop_length=512,
        n_mels=64
    )

    # instantiate the dataset
    usd = UrbanSoundDataset(ANNOTATIONS_FILE,
                            AUDIO_DIR,
                            mel_spectrogram,
                            SAMPLE_RATE,
                            NUM_SAMPLES,
                            device)

    # build the dataloader via create_data_loader defined above
    train_dataloader = create_data_loader(usd, BATCH_SIZE)

    # construct model and assign it to device
    cnn = CNNNetwork().to(device)
    print(cnn)

    # initialise loss function + optimiser
    loss_fn = nn.CrossEntropyLoss()  # cross-entropy loss
    optimiser = torch.optim.Adam(cnn.parameters(),
                                 lr=LEARNING_RATE)

    # train model
    train(cnn, train_dataloader, loss_fn, optimiser, device, EPOCHS)

    # save model
    torch.save(cnn.state_dict(), "feedforwardnet.pth")
    print("Trained feed forward net saved at feedforwardnet.pth")
Final training output:
Epoch 1
loss: 2.241577625274658
---------------------------
Epoch 2
loss: 2.2747385501861572
---------------------------
Epoch 3
loss: 2.3089897632598877
---------------------------
Epoch 4
loss: 2.348045587539673
---------------------------
Epoch 5
loss: 2.315420150756836
---------------------------
Epoch 6
loss: 2.3148367404937744
---------------------------
Epoch 7
loss: 2.31473708152771
---------------------------
Epoch 8
loss: 2.3141160011291504
---------------------------
Epoch 9
loss: 2.3157730102539062
---------------------------
Epoch 10
loss: 2.3171067237854004
---------------------------
Finished training
Trained feed forward net saved at feedforwardnet.pth
Process finished with exit code 0
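Note that the loss stays around 2.3 ≈ ln(10), the loss of a uniform guess over 10 classes, and never really decreases. A likely contributor is that forward() applies Softmax before nn.CrossEntropyLoss, which already applies log-softmax internally, so the output is effectively softmaxed twice and the gradients are flattened. A minimal sketch of the usual fix (my adjustment, not part of the course code) is to return raw logits from forward() and apply softmax only when probabilities are needed at inference time:

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        logits = self.linear(x)
        return logits  # nn.CrossEntropyLoss expects raw logits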
III. Model Inference
Defining the class_mapping
The model outputs a class index, so we define a mapping from index (in order) to class name; the entries follow the classes defined in the UrbanSoundDataset.
class_mapping = [
    "air_conditioner",
    "car_horn",
    "children_playing",
    "dog_bark",
    "drilling",
    "engine_idling",
    "gun_shot",
    "jackhammer",
    "siren",
    "street_music"
]
The prediction function

def predict(model, input, target, class_mapping):
    model.eval()
    # eval() is required: it switches the model to evaluation mode, so layers such as
    # BatchNorm and Dropout are frozen to their trained behaviour (running statistics,
    # no random dropping) instead of their training-time behaviour
    with torch.no_grad():
        predictions = model(input)
        # Tensor (1, 10) -> [ [0.1, 0.01, ..., 0.6] ]
        predicted_index = predictions[0].argmax(0)
        predicted = class_mapping[predicted_index]  # the model's predicted class
        expected = class_mapping[target]            # the ground-truth class
    return predicted, expected
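A usage sketch wiring this together (my own wiring, following the pattern of the training script above and assuming the dataset usd from that script): load the saved weights, take one sample, add a batch dimension, and call predict:

cnn = CNNNetwork()
state_dict = torch.load("feedforwardnet.pth")
cnn.load_state_dict(state_dict)

# take one sample from the UrbanSoundDataset and add a batch
# dimension: (1, 64, 44) -> (1, 1, 64, 44)
input, target = usd[0]
input.unsqueeze_(0)

predicted, expected = predict(cnn, input, target, class_mapping)
print(f"Predicted: '{predicted}', expected: '{expected}'")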
Summary
This PyTorch for Audio + Music Processing series has covered, end to end:
1. Processing, loading, and mel-spectrogram feature extraction for audio datasets with torchaudio
2. Building a basic CNN classification model
3. Training a PyTorch model and running predictions
The logic is clear and the explanations are detailed, which makes it a good entry point. As the author notes, though, the course only introduces the basic framework and the general approach to this class of problem, and the network used is a very basic VGG-like structure; interested readers can try more state-of-the-art models and richer features to improve performance.
Reposted from: https://blog.csdn.net/rain2211/article/details/128340960