Data Analysis Course Project: A Prevention System for Wildfires and Illegal Straw Burning

2020-05-14 20:56

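Contents

Background and requirements
Getting started

1. Data exploration
2. Data preprocessing
3. Modeling and model evaluation

1. SVM
2. MLP
3. Random forest
4. Trying a CNN
5. Comparing SVM and CNN
6. Can AdaBoost save some face for traditional machine learning?

4. Summary

Rant time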

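Background and requirements

The data are BMP images of 40*40 pixels, roughly like the following:

Dataset directory structure

Directory 0 holds images without smoke/fire, and directory 1 holds images with smoke/fire.

To summarize:
Task

The goal of this project is to build an image classifier that recognizes smoke/fire from the provided images.

Workflow

Data exploration
Data preprocessing
Modeling
Model evaluation
Summary

Requirements

Modeling: use two traditional machine-learning algorithms (SVM, MLP), plus a CNN and AdaBoost

Model evaluation:
accuracy > 0.85, AUC > 0.9

Analyze the false-alarm rate and the miss rate, explore which of the two matters more, and how to improve it

Compare the CNN with the traditional machine-learning models in both efficiency and quality

Getting started

First, import the libraries we will probably need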

import pandas as pd
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt
from PIL import Image
import cv2
import seaborn as sns
from skimage.io import imread, imshow
import warnings 
warnings.filterwarnings('ignore')


import tensorflow as tf
import time
import os

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

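1. Data exploration

First, let's see what the images look like. Opening the directories and checking by hand shows that they are all BMP files.

This format is not very common in everyday use, so I looked up an introduction and some background:

An introduction to BMP files and how to use them in Python

Then open a few files to take a look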

import struct

basepath='./fogs'
nofire=basepath+'/0'
fire=basepath+'/1'

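def view(filepath):
    # Read the basic BMP header fields, then decode and display the image
    with open(filepath,'rb') as f: # open the file in binary mode
        f_type=f.read(2).decode('ascii',errors='replace') # file type: 2 bytes ('BM' for a bitmap)
        file_size_byte=f.read(4) # file size: 4 bytes
        f.seek(f.tell()+4) # skip the 4 reserved bytes
        file_ofset_byte=f.read(4) # offset of the pixel data: 4 bytes
        f.seek(f.tell()+4) # skip the 4-byte info-header size field
        file_wide_byte=f.read(4) # width: 4 bytes
        file_height_byte=f.read(4) # height: 4 bytes
        f.seek(f.tell()+2) # skip the 2-byte color-plane count
        file_bitcount_byte=f.read(2) # bits per pixel: 2 bytes


    # Convert the raw bytes to integers. BMP fields are little-endian, and '<' also fixes the
    # field sizes (native 'l' is 8 bytes on many 64-bit platforms and would fail on 4 bytes).
    f_size,=struct.unpack('<I',file_size_byte)
    f_ofset,=struct.unpack('<I',file_ofset_byte)
    f_wide,=struct.unpack('<i',file_wide_byte)
    f_height,=struct.unpack('<i',file_height_byte)
    f_bitcount,=struct.unpack('<H',file_bitcount_byte)
    print("Type:",f_type,"Size:",f_size,"Pixel data offset:",f_ofset,"Width:",f_wide,"Height:",f_height,"Bits per pixel:",f_bitcount)
    # The calls below use the TensorFlow 1.x API (tf.gfile, tf.Session)
    image_raw_data = tf.gfile.FastGFile(filepath, 'rb').read()
    with tf.Session() as sess:
        # Decode the BMP bytes into the corresponding 3-D tensor.
        # TensorFlow also provides tf.image.decode_jpeg and tf.image.decode_png for those formats.
        # The decoded result is a tensor, so it must be evaluated explicitly before its values are used.
        img_data = tf.image.decode_bmp(image_raw_data)

        # Display the decoded image with pyplot.
        plt.imshow(img_data.eval())
        plt.axis('off')
        plt.show()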
      
        
view(nofire+'/0-00013.BMP')
view(fire+'/1-00013.BMP')
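
The header layout read above can also be checked without any image file. The following is a minimal sketch, assuming the common 14-byte file header followed by a BITMAPINFOHEADER; the function name parse_bmp_header and the hand-built demo header are hypothetical and only serve as an illustration.

```python
import struct

def parse_bmp_header(raw):
    """Parse the 14-byte BMP file header and the start of the info header."""
    # '<' means little-endian with fixed field sizes, independent of the platform.
    # 2s: signature, I: file size, 4x: reserved bytes, I: pixel-data offset
    signature, file_size, data_offset = struct.unpack_from('<2sI4xI', raw, 0)
    # I: info-header size, i: width, i: height, H: color planes, H: bits per pixel
    header_size, width, height, planes, bit_count = struct.unpack_from('<IiiHH', raw, 14)
    return {
        'type': signature.decode('ascii'),
        'file_size': file_size,
        'data_offset': data_offset,
        'width': width,
        'height': height,
        'bit_count': bit_count,
    }

# Hypothetical header of a 40x40, 24-bit image: 54 header bytes + 40*40*3 pixel bytes
demo_header = (struct.pack('<2sI4xI', b'BM', 54 + 40 * 40 * 3, 54)
               + struct.pack('<IiiHH', 40, 40, 40, 1, 24))
info = parse_bmp_header(demo_header)
print(info)
```

Reading all fields with one format string per header avoids the manual seek arithmetic, which is where off-by-two mistakes tend to creep in.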

Count the images in each directory

# Count how many images the two directories contain
def get_filesnum(filepath):
    file_count = 0
    for dirpath, dirnames, filenames in os.walk(filepath):
        file_count += len(filenames)
    # Return the directory that was passed in; the loop variable dirpath would be the last directory visited
    return filepath,file_count

nofire_path,nofire_nums=get_filesnum(nofire)
fire_path,fire_nums=get_filesnum(fire)
print(nofire_path,"image count:",nofire_nums)
print(fire_path,"image count:",fire_nums)

plt.figure(figsize=(8,6))
plt.bar([1,2],[nofire_nums,fire_nums])
plt.title("Number of smoke/fire and non-smoke/fire images")
plt.xticks([1,2],['No smoke/fire','Smoke/fire'])
plt.show()

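2. Data preprocessing

The usual workflow

Data cleaning
Data integration
Data transformation
Data reduction

These images contain no real outliers and need no integration, so only data transformation and data reduction are required.

Data transformation: normalization

Data reduction: image feature extraction

Good feature extraction is a prerequisite for building a good model later, so the preprocessing step deserves care.

Remembering an earlier water-quality project, I first tried color moments as features, but the results were poor, so I switched to other feature-extraction methods.

Next I tried using all pixels as features and then extracting principal components with PCA.

A note here: the above are ideas I tried earlier, but neither color moments nor raw pixels gave good models. My advice is to extract features first, build an untuned model, and look at how it performs. If the accuracy is far below the required 85%, the feature extraction is probably the problem, and it is time to consider a different method.

The feature-extraction methods I tried were color moments, all pixels, HOG, and finally the average pixel value (the mean of the three color channels at each pixel). I ended up using the average pixel value. The reference links from our instructor include the dark-channel dehazing algorithm, but after trying it I found the features of the processed images were still not very good, so I did not use it. (Extracting average-pixel features after dehazing might still improve the model; I look forward to someone trying it.)
Dark-channel dehazing: principle and implementation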

# Earlier attempt: color moments
# NOTE: the divisor 10000 below is the pixel count of a 100x100 image; for 40x40 images it would be 1600
# def img_extract(filepath):
#     #output_path='./result.csv'
#     result=[]
#     imglist=os.listdir(filepath)
#     for i in range(len(imglist)):

#         # Open the image
#         img=Image.open(filepath+'/'+imglist[i])

#         # Extract color features
#         rgb=np.array(img)/[255.0,255.0,255.0]
#         #print(rgb)
#         # First-order color moment
#         first_order=1.0*(rgb.sum(axis=0).sum(axis=0))/10000
#         err=rgb-first_order
#         #print(first_order)

#         # Second-order color moment
#         second_order=np.sqrt(1.0*(np.power(err,2)).sum(axis=0).sum(axis=0)/10000)

#         # Third-order color moment
#         third_order=1.0*(pow(err,3).sum(axis=0).sum(axis=0))/10000
#         third_order=np.cbrt(abs(third_order))*-1.0
#         #print(third_order)

#         res=np.concatenate((first_order,second_order,third_order))
#         result.append(res)

#     # Column names for saving to a CSV file
#     names=['R first moment','G first moment','B first moment',
#            'R second moment','G second moment','B second moment',
#            'R third moment','G third moment','B third moment']

#     df=pd.DataFrame(result,columns=names)
#     #print(df)
#     return df
#     #df.to_csv(output_path,encoding='utf-8',index=False)


# Earlier attempt: HOG features
# Define the hog object with the chosen parameters; everything else keeps its default
# hog = cv2.HOGDescriptor((32,32),(16,16),(16,16),(16,16),30)


# def getHOG_3dims(pic_name):
#     img = cv2.imread(pic_name)
#     test_hog = hog.compute(img,(10,10),(16,16)).reshape((-1,))
#     return test_hog




# def img_extract(filepath):
#     #output_path='./result.csv'
#     result=[]
#     imglist=os.listdir(filepath)
#     for i in range(len(imglist)):
#         try:
#             img_array = getHOG_3dims(filepath+'/'+imglist[i])
#             result.append(img_array[:50])
#         except:
#             result.append([])
#     df=pd.DataFrame(result)
#     #print(df)
#     return df
#     #df.to_csv(output_path,encoding='utf-8',index=False)
    

# Extract features using the average pixel value
def img_extract(filepath):
    #output_path='./result.csv'
    result=[]
    imglist=os.listdir(filepath)
    for name in imglist:
        try:
            image = imread(filepath+'/'+name)
            # Average the first three channels at every pixel (40x40 image -> 40x40 matrix)
            feature_matrix = image[:,:,:3].astype(float).mean(axis=2)
            # Flatten to a 1600-dimensional vector; this raises if the image is not 40x40
            features = np.reshape(feature_matrix, (40*40))
            result.append(features)
        except Exception:
            # Unreadable or malformed images become empty rows, which are dropped later
            result.append([])
    df=pd.DataFrame(result)
    #print(df)
    return df
    #df.to_csv(output_path,encoding='utf-8',index=False)

    
nofire_df=img_extract(nofire)
nofire_df['label']=0
nofire_df.head()

fire_df=img_extract(fire)
fire_df['label']=1
fire_df.head()

# Merge the two DataFrames (DataFrame.append was removed in pandas 2.0, so use pd.concat)

data=pd.concat([nofire_df,fire_df],ignore_index=True)
data.info()
data.head()

# Save the data once it looks fine
data.to_csv('./期末大作业平均像素值.csv',encoding='utf-8',index=False)

data=pd.read_csv('./期末大作业平均像素值.csv')
data.info()
data.head()

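# Drop rows that contain missing values
data=data.dropna(axis=0, how='any')

The empty rows appear because extracting the average pixel values raised exceptions for some images, which were replaced by empty rows; we remove them here.

Next, extract X and Y, standardize X, and extract principal components

# Take all remaining rows and the 1600 pixel columns, so that X stays aligned with Y
X=data.iloc[:,0:1600]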
Y=data['label']
print(X.shape,Y.shape)

# Extract principal components with PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X= scaler.fit_transform(X)

# PCA: keep enough components to explain 99.9% of the variance
pca=PCA(n_components=0.999)
X=pca.fit_transform(X)

X
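
The code above fits the scaler and PCA on the whole dataset before the train/test split. A common alternative, which is not what this post does, is to fit both on the training split only by putting them in a scikit-learn Pipeline. The following is a minimal sketch on synthetic data; X_demo, y_demo and the other names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the pixel features (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2)

# The scaler and PCA are fitted on the training split only when pipe.fit is called
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.999), SVC())
pipe.fit(Xtr, ytr)

demo_acc = pipe.score(Xte, yte)
print("components kept:", pipe.named_steps["pca"].n_components_)
print("test accuracy:", demo_acc)
```

With this layout the test split never influences the scaling or the principal components, so the test score is a cleaner estimate.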

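3. Modeling and model evaluation

Before modeling, I strongly recommend uploading the dataset you just saved to Kaggle and writing the modeling code in a Kaggle notebook in parallel. Hyperparameter tuning can take a long time to run, and with Kaggle your own computer and the Kaggle notebook can work on two models at the same time. Kaggle also offers a weekly quota of free GPU and even TPU time, which makes building and running deep-learning networks very pleasant. In short, why not take the free lunch? (Many people wonder how I got this done so quickly; using one computer as if it were two is the secret.)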

# Split the dataset
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve, auc


X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
print(X_train.shape,X_test.shape)

# Plot the confusion matrix
def plot_confusion_matrix(confusion_mat):
    df_cm = pd.DataFrame(confusion_mat)
    ax = sns.heatmap(df_cm,annot=True,fmt='.20g')
    ax.set_title('Confusion matrix')
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    plt.show()

1. SVM

SVM with default parameters

# Build an SVM model
from sklearn import metrics
from sklearn.svm import SVC

svm=SVC()
svm.fit(X_train,y_train)

train_pred=svm.predict(X_train)
test_pred=svm.predict(X_test)


print('---------- Results on the training set ----------')
train_acc=metrics.accuracy_score(y_train,train_pred)
train_cm=metrics.confusion_matrix(y_train,train_pred)
train_report=metrics.classification_report(y_train,train_pred)
print("Training accuracy:",train_acc)
print("-----------------")
print("Training confusion matrix:",train_cm)
print("-----------------")
print("Training classification report:",train_report)



print('---------- Results on the test set ----------')
test_acc=metrics.accuracy_score(y_test,test_pred)
test_cm=metrics.confusion_matrix(y_test,test_pred)
test_report=metrics.classification_report(y_test,test_pred)
print("Test accuracy:",test_acc)
print("-----------------")
print("Test confusion matrix:",test_cm)
print("-----------------")
print("Test classification report:",test_report)

After tuning

# Tune the hyperparameters with a grid search
from sklearn.model_selection import GridSearchCV

# parameters =[{'kernel': ['rbf','poly','linear'],'C': [0.3,0.5,0.7],'gamma':[0.1,0.05,0.01]}]
# Note: 'degree' only affects the 'poly' kernel, so it has no effect with 'rbf' here
parameters =[{'kernel': ['rbf'],'C':[2,3,4,5],'degree':[1]}]
clf=GridSearchCV(estimator=SVC(),param_grid=parameters,cv=5,scoring='accuracy')
clf.fit(X_train,y_train)

print("Best hyperparameters:",clf.best_params_)
print("Best cross-validation score:",clf.best_score_)

# Use the best model
best_svm=clf.best_estimator_
train_pred=best_svm.predict(X_train)
test_pred=best_svm.predict(X_test)


print('---------- Results on the training set ----------')
train_acc=metrics.accuracy_score(y_train,train_pred)
train_cm=metrics.confusion_matrix(y_train,train_pred)
train_report=metrics.classification_report(y_train,train_pred)
print("Training accuracy:",train_acc)
print("-----------------")
print("Training confusion matrix:",train_cm)
print("-----------------")
print("Training classification report:",train_report)



print('---------- Results on the test set ----------')
svm_acc=metrics.accuracy_score(y_test,test_pred)
svm_cm=metrics.confusion_matrix(y_test,test_pred)
svm_report=metrics.classification_report(y_test,test_pred)
print("SVM test accuracy: {:.2f}%".format(svm_acc*100))
print("-----------------")
print("SVM test confusion matrix:\n",svm_cm)
print("-----------------")
print("SVM test classification report:\n",svm_report)
plot_confusion_matrix(svm_cm)

# Plot the ROC curve
def draw_ROC_curve(y_test,y_predict):
    false_positive_rate,true_positive_rate,thresholds=roc_curve(y_test, y_predict)
    roc_auc=auc(false_positive_rate, true_positive_rate)
    plt.title('ROC')
    plt.plot(false_positive_rate, true_positive_rate,'b',label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.ylabel('TPR')
    plt.xlabel('FPR')
    plt.show()

# Pass continuous decision scores rather than hard 0/1 predictions, so the curve covers more than one threshold
draw_ROC_curve(y_test,best_svm.decision_function(X_test))
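
An ROC curve is traced by sweeping a threshold over continuous scores, so the AUC computed from hard 0/1 predictions can differ from the AUC computed from scores. The following is a minimal sketch with made-up numbers; the scores array stands in for hypothetical decision_function outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical decision scores (positive means "predicted fire")
scores = np.array([-1.2, -0.3, 0.1, 0.05, 0.8, 1.5])
# Hard labels obtained by thresholding the scores at zero
hard = (scores > 0).astype(int)

auc_scores = roc_auc_score(y_true, scores)
auc_hard = roc_auc_score(y_true, hard)
print("AUC from scores:", auc_scores)
print("AUC from hard labels:", auc_hard)
```

For an SVC the scores come from decision_function; for models with predict_proba, the probability of the positive class plays the same role.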

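I deliberately posted the code in the same separate cells I wrote it in, to make a point: keep each helper function in its own code cell. A project like this is not finished in one day. If you restart the notebook and a later model needs a helper that shares a cell with earlier code, you have to rerun all of that earlier code and waste a lot of time. It is a small detail, but an important one for efficiency.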

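# Rows of the confusion matrix are true labels and columns are predictions (1 = smoke/fire)
# False-alarm rate: non-fire images predicted as fire, FP / (FP + TN)
miscalculation_rate=svm_cm[0][1]/svm_cm[0].sum()
# Miss rate: fire images predicted as non-fire, FN / (FN + TP)
leakage_rate=svm_cm[1][0]/svm_cm[1].sum()

print("SVM false-alarm rate: {:.2f}%".format(miscalculation_rate*100))
print("SVM miss rate: {:.2f}%".format(leakage_rate*100))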
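
For reference, scikit-learn's confusion_matrix puts true labels on the rows and predictions on the columns, so with 1 meaning smoke/fire the two rates can be read off as below. This is a minimal sketch; the helper name alarm_and_miss_rates and the example matrix are made up.

```python
import numpy as np

def alarm_and_miss_rates(cm):
    """Return (false-alarm rate, miss rate) for a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, with 1 = fire.
    """
    cm = np.asarray(cm, dtype=float)
    tn, fp = cm[0]   # true non-fire row
    fn, tp = cm[1]   # true fire row
    return fp / (fp + tn), fn / (fn + tp)

# Made-up matrix: 100 non-fire images (20 false alarms), 100 fire images (5 missed)
demo_cm = [[80, 20],
           [5, 95]]
false_alarm_rate, miss_rate = alarm_and_miss_rates(demo_cm)
print(false_alarm_rate, miss_rate)
```

Dividing by the row totals makes each rate independent of how many images of the other class happen to be in the test set.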

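While tuning, I found that adding the gamma hyperparameter reduces the miss rate, but it raises the false-alarm rate and lowers the accuracy. The reason is discussed in the summary at the end.

Below is the result after adding gamma; the code is essentially the same, so I will not paste it again.

The false-alarm rate went up, but the miss rate dropped a lot, by nearly half.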
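
How gamma changes an RBF SVM can be explored with a sketch like the one below. It runs on synthetic data, not the smoke images, and the gamma values are arbitrary choices for illustration; it simply records how many support vectors each setting ends up with.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (assumption: a stand-in for the PCA features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

support_counts = {}
for gamma in (0.01, 0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", C=3, gamma=gamma).fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    support_counts[gamma] = int(clf.n_support_.sum())

for gamma, count in support_counts.items():
    print(f"gamma={gamma}: {count} support vectors out of {len(X_demo)}")
```

Comparing these counts with the training and test accuracy for each gamma is one way to see when the model starts memorizing the training set.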

2. MLP

Most of the code is the same; only the model changes

from sklearn.neural_network import MLPClassifier

# Default parameters
mlp = MLPClassifier()
mlp.fit(X_train,y_train)
train_pred=mlp.predict(X_train)
test_pred=mlp.predict(X_test)


print('---------- Results on the training set ----------')
train_acc=metrics.accuracy_score(y_train,train_pred)
train_cm=metrics.confusion_matrix(y_train,train_pred)
train_report=metrics.classification_report(y_train,train_pred)
print("Training accuracy:",train_acc)
print("-----------------")
print("Training confusion matrix:",train_cm)
print("-----------------")
print("Training classification report:",train_report)



print('---------- Results on the test set ----------')
test_acc=metrics.accuracy_score(y_test,test_pred)
test_cm=metrics.confusion_matrix(y_test,test_pred)
test_report=metrics.classification_report(y_test,test_pred)
print("Test accuracy:",test_acc)
print("-----------------")
print("Test confusion matrix:",test_cm)
print("-----------------")
print("Test classification report:",test_report)

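# Set a few parameters
mlp=MLPClassifier(solver='sgd', activation='relu', alpha=1,hidden_layer_sizes=(20,20),max_iter=500,random_state=1)
mlp.fit(X_train,y_train)
train_pred=mlp.predict(X_train)
test_pred=mlp.predict(X_test)

The rest of the code is identical, so I will not paste it again

The L2 penalty is controlled by the alpha parameter. Even so, the MLP's results are only mediocre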

3. Random forest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=50,max_depth=20,random_state=0)
rfc.fit(X_train,y_train)
train_pred=rfc.predict(X_train)
test_pred=rfc.predict(X_test)

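Both decision trees and random forests overfit easily depending on tree depth and the number of leaves. By now you have probably also noticed that wrapping the repeated reporting code in a function would greatly reduce the amount of code and make it easier for others to read. I was too lazy to do it, but I recommend that you do. There are other classic classifiers as well, such as logistic regression, KNN, and naive Bayes, which I will not try one by one here.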
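
The wrapping suggested above could look like the following minimal sketch. The function name evaluate and the tiny demo data are made up for illustration; any fitted scikit-learn classifier with a predict method should work in its place.

```python
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_train, y_train, X_test, y_test, name="model"):
    """Print accuracy, confusion matrix and classification report for both splits."""
    results = {}
    for split, X_part, y_part in (("training", X_train, y_train),
                                  ("test", X_test, y_test)):
        pred = model.predict(X_part)
        acc = metrics.accuracy_score(y_part, pred)
        cm = metrics.confusion_matrix(y_part, pred)
        print(f"---------- {name} on the {split} set ----------")
        print(f"Accuracy: {acc:.4f}")
        print("Confusion matrix:\n", cm)
        print("Classification report:\n",
              metrics.classification_report(y_part, pred))
        results[split] = (acc, cm)
    return results

# Tiny made-up example: one feature, two well-separated classes
demo_X_train, demo_y_train = [[0], [1], [2], [3]], [0, 0, 1, 1]
demo_X_test, demo_y_test = [[0.5], [2.5]], [0, 1]
demo_model = DecisionTreeClassifier(random_state=0).fit(demo_X_train, demo_y_train)
demo_results = evaluate(demo_model, demo_X_train, demo_y_train,
                        demo_X_test, demo_y_test, name="demo tree")
```

Returning the accuracy and confusion matrix as well as printing them lets later cells reuse the numbers, for example to compute false-alarm and miss rates.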

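4. Trying a CNN

The CNN here uses a VGGNet-style architecture; for details, see the paper:
VGG paper (translated)
As for building the model, there are many references online; you only need to tweak the parameters and reuse them. Deep learning does not seem to be covered much in undergraduate courses; most of it is scattered knowledge picked up out of personal interest. Anyway, for this model I feed the whole image in as input, one-hot encode the labels, then train the model, predict, and evaluate it.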
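
As a small illustration of the one-hot step mentioned above, here is a sketch in plain NumPy (the post's own code uses scikit-learn's OneHotEncoder instead). The labels_demo array is made up.

```python
import numpy as np

labels_demo = np.array([0, 1, 1, 0])

# Row k of the 2x2 identity matrix is the one-hot vector of class k
one_hot = np.eye(2)[labels_demo]
print(one_hot)

# argmax along the class axis recovers the integer labels,
# which is also how network outputs are turned back into class predictions
recovered = one_hot.argmax(axis=1)
print(recovered)
```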

# Import the required modules
from keras.models import Sequential
from keras.layers import (Activation, BatchNormalization, Conv2D, Dense,
                          Dropout, Flatten, MaxPooling2D)
from keras.initializers import TruncatedNormal
from keras import backend as K

class SimpleVGGNet:
    @staticmethod
    def build(width, height, depth, classes):
        model = Sequential()
        inputShape = (height, width, depth)
        chanDim = -1

        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1

        # CONV => RELU => POOL
        model.add(Conv2D(32, (3, 3), padding="same",
            input_shape=inputShape,kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))

        # (CONV => RELU) * 2 => POOL
        model.add(Conv2D(64, (3, 3), padding="same",kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(64, (3, 3), padding="same",kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        #model.add(Dropout(0.25))

        # (CONV => RELU) * 3 => POOL
        model.add(Conv2D(128, (3, 3), padding="same",kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(128, (3, 3), padding="same",kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(128, (3, 3), padding="same",kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))

        # Fully connected layer
        model.add(Flatten())
        model.add(Dense(512,kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.6))

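        # Output layer. The original uses a sigmoid activation here; softmax is the usual pairing with categorical_crossentropy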
        model.add(Dense(classes,kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01)))
        model.add(Activation("sigmoid"))

        return model

# Import the required utilities
from sklearn.preprocessing import LabelBinarizer,OneHotEncoder
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
import argparse
import random
import pickle


# Read the data and labels
print("------ Reading the data ------")
data = []
labels = []


image_types = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")

def list_images(basePath, contains=None):
    # Return the paths of all valid image files
    return list_files(basePath, validExts=image_types, contains=contains)

def list_files(basePath, validExts=None, contains=None):
    # Walk the image directory and yield the path of every image
    for (rootDir, dirNames, filenames) in os.walk(basePath):
        # Loop over the file names in the current directory
        for filename in filenames:
            # if the contains string is not none and the filename does not contain
            # the supplied string, then ignore the file
            if contains is not None and filename.find(contains) == -1:
                continue
 
            # Find the last "." to determine the file extension
            ext = filename[filename.rfind("."):].lower()
 
            # Check whether the file is an image and should be processed
            if validExts is None or ext.endswith(validExts):
                # Build the image path
                imagePath = os.path.join(rootDir, filename)
                yield imagePath



# Collect the image paths for later reading
imagePaths = sorted(list(list_images('./fogs')))
random.seed(42)
random.shuffle(imagePaths)

# Read every image
for imagePath in imagePaths:
    # Read the image data
    image = cv2.imread(imagePath)
    if image is None:
        # cv2.imread returns None for files it cannot decode; skip them
        continue
    image = cv2.resize(image, (40,40))
    data.append(image)
    # The label is the name of the parent directory (0 or 1)
    label = imagePath.split(os.path.sep)[-2]
    labels.append(int(label))

# Scale the pixel values to [0, 1]
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)

onehot= OneHotEncoder() 
onehot.fit(labels.reshape(-1,1))
labels =onehot.transform(labels.reshape(-1,1)).toarray()

# Split the dataset
(trainX, testX, trainY, testY) = train_test_split(data,labels, test_size=0.25, random_state=42)


# Data augmentation
aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
    height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True, fill_mode="nearest")

# Build the convolutional neural network
model = SimpleVGGNet.build(width=40, height=40, depth=3,classes=2)

# Initial hyperparameters
INIT_LR = 0.01
EPOCHS = 30
BS = 32

# Loss function and model compilation
print("------ Preparing to train the network ------")
opt = SGD(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="categorical_crossentropy", optimizer=opt,metrics=["accuracy"])

# Train the network
H = model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),
    validation_data=(testX, testY), steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS)
"""
H = model.fit(trainX, trainY, validation_data=(testX, testY),
    epochs=EPOCHS, batch_size=32)
"""


# Test
print("------ Testing the network ------")
predictions = model.predict(testX, batch_size=32)

# Plot the training curves
N = np.arange(0, EPOCHS)
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H.history["loss"], label="train_loss")
plt.plot(N, H.history["val_loss"], label="val_loss")
plt.plot(N, H.history["accuracy"], label="train_acc")
plt.plot(N, H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()

Network architecture

Learning curves: the loss oscillates, but it still trends downward by the end

pred=predictions.argmax(axis=1)
testY=testY.argmax(axis=1)

test_acc=metrics.accuracy_score(testY,pred)
test_cm=metrics.confusion_matrix(testY,pred)
test_report=metrics.classification_report(testY,pred)
print("CNN test accuracy: {:.2f}%".format(test_acc*100))
print("-----------------")
print("CNN test confusion matrix:\n",test_cm)
print("-----------------")
print("CNN test classification report:\n",test_report)


plot_confusion_matrix(test_cm)

5. Comparing SVM and CNN

# First define a decorator that measures elapsed time
def clock(func):
    def clocked(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__+' took (s):', end - start)
        return result
    return clocked



# Start the comparison
@clock
def svm_test():
    predict=best_svm.predict(X_test)
@clock
def cnn_test():
    predict=model.predict(testX)

    
svm_test()
cnn_test()

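In this run, the CNN was still faster than the SVM at prediction time. On image tasks, traditional machine-learning algorithms really cannot match deep learning, which is probably why deep learning has become so popular in recent years.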
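
A single timed call can be noisy, so one option is to repeat the call and keep the best time. The following is a minimal sketch of such a variant of the decorator; clock_repeat and demo_workload are hypothetical names, and the workload is a stand-in for a model's predict call.

```python
import functools
import time

def clock_repeat(repeats=5):
    """Decorator factory: run the function `repeats` times and report the best wall time."""
    def decorator(func):
        @functools.wraps(func)  # keep the original function name and docstring
        def clocked(*args, **kwargs):
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()  # monotonic, high-resolution clock
                result = func(*args, **kwargs)
                best = min(best, time.perf_counter() - start)
            print(f"{func.__name__} best of {repeats}: {best:.6f} s")
            return result
        return clocked
    return decorator

@clock_repeat(repeats=3)
def demo_workload():
    # Stand-in for something like model.predict(X)
    return sum(i * i for i in range(10_000))

demo_result = demo_workload()
```

For a fair comparison, both models should also be timed on the same number of samples, since prediction time grows with the size of the input.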

6. Can AdaBoost save some face for traditional machine learning?

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

bdt = AdaBoostClassifier(DecisionTreeClassifier(),
                         algorithm="SAMME",
                         n_estimators=20, learning_rate=0.8)
bdt.fit(X_train,y_train)
train_pred=bdt.predict(X_train)
test_pred=bdt.predict(X_test)


print('---------- Results on the training set ----------')
train_acc=metrics.accuracy_score(y_train,train_pred)
train_cm=metrics.confusion_matrix(y_train,train_pred)
train_report=metrics.classification_report(y_train,train_pred)
print("Training accuracy:",train_acc)
print("-----------------")
print("Training confusion matrix:",train_cm)
print("-----------------")
print("Training classification report:",train_report)



print('---------- Results on the test set ----------')
test_acc=metrics.accuracy_score(y_test,test_pred)
test_cm=metrics.confusion_matrix(y_test,test_pred)
test_report=metrics.classification_report(y_test,test_pred)
print("Test accuracy:",test_acc)
print("-----------------")
print("Test confusion matrix:",test_cm)
print("-----------------")
print("Test classification report:",test_report)

# What if we use a random forest instead of a decision tree as the base classifier? After all, decision trees overfit easily

bdt = AdaBoostClassifier(RandomForestClassifier(),algorithm="SAMME",n_estimators=20, learning_rate=0.8)
bdt.fit(X_train,y_train)
train_pred=bdt.predict(X_train)
test_pred=bdt.predict(X_test)


print('---------- Results on the training set ----------')
train_acc=metrics.accuracy_score(y_train,train_pred)
train_cm=metrics.confusion_matrix(y_train,train_pred)
train_report=metrics.classification_report(y_train,train_pred)
print("Training accuracy:",train_acc)
print("-----------------")
print("Training confusion matrix:",train_cm)
print("-----------------")
print("Training classification report:",train_report)

print('---------- Results on the test set ----------')
test_acc=metrics.accuracy_score(y_test,test_pred)
test_cm=metrics.confusion_matrix(y_test,test_pred)
test_report=metrics.classification_report(y_test,test_pred)
print("Test accuracy:",test_acc)
print("-----------------")
print("Test confusion matrix:",test_cm)
print("-----------------")
print("Test classification report:",test_report)

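Even without further tuning, it is clear that AdaBoost still overfits badly, and that needs to be addressed. On the positive side, almost all of its errors are false alarms, and it misses even fewer fires than the CNN. Under a "better a thousand false alarms than one missed fire" philosophy, this model is therefore acceptable: a false alarm at worst sends someone out to check, whereas a missed fire can cause irreversible damage.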
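
One general way to trade false alarms for fewer misses, not used in this post, is to lower the decision threshold applied to the predicted probability of fire. The following is a minimal sketch with made-up labels and probabilities; miss_and_false_alarm is a hypothetical helper.

```python
import numpy as np

def miss_and_false_alarm(y_true, prob_fire, threshold):
    """Return (miss rate, false-alarm rate) when predicting fire for prob >= threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob_fire) >= threshold).astype(int)
    missed = np.sum((y_true == 1) & (pred == 0))        # fires predicted as non-fire
    false_alarms = np.sum((y_true == 0) & (pred == 1))  # non-fires predicted as fire
    return missed / np.sum(y_true == 1), false_alarms / np.sum(y_true == 0)

# Made-up labels and predicted probabilities of the fire class
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]
p_demo = [0.1, 0.2, 0.4, 0.6, 0.3, 0.55, 0.7, 0.9]

rates = {}
for t in (0.5, 0.25):
    rates[t] = miss_and_false_alarm(y_demo, p_demo, t)
    print(f"threshold={t}: miss rate={rates[t][0]:.2f}, false-alarm rate={rates[t][1]:.2f}")
```

In this toy example the lower threshold removes the missed fire at the cost of one more false alarm, which is the trade-off described above.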

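4. Summary

Of the three classic machine-learning models built here (SVM, MLP, and random forest), the SVM performed best, with an accuracy above 0.86. Adding the gamma parameter to the SVM increased the number of support vectors and pushed the model toward overfitting, which served to lower the miss rate.
I then tried a CNN and AdaBoost and compared the CNN with the SVM on both model quality and prediction speed. The CNN was better on both, reaching an accuracy of 91.38% and an AUC of 0.92. Finally, AdaBoost overfits, but its miss rate is especially low: at worst someone has to go and check a few extra alarms, and a fire image is unlikely to be missed.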

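Rant time

There have been far too many end-of-term assignments lately. I have lost track of how long it has been since I last reviewed anything, and I am not in great shape either. Tonight I will finish the report for my web course project, and tomorrow I start studying seriously!!!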

Reposted from: https://blog.csdn.net/shelgi/article/details/106125712
