小言_互联网's Blog

Machine Learning: Heart Disease Prediction with KNN and LMKNN


1. Introduction and Environment Setup

KNN generally refers to the nearest-neighbor algorithm. The K-nearest-neighbor (KNN, K-NearestNeighbor) classification algorithm is one of the simplest methods in data-mining classification. LMKNN is the local mean-based k-nearest-neighbor classification algorithm.
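Before the full implementations below, the KNN decision rule can be sketched in a few lines: compute the distance from the query to every training point, keep the k nearest, and take a majority vote on their labels. A minimal sketch on hypothetical 2D toy points:

```python
import math
from collections import Counter

# Hypothetical toy data: two small clusters in 2D.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = [0, 0, 0, 1, 1]

def knn_predict(query, k=3):
    # sort (distance, label) pairs and vote among the k nearest
    dists = sorted((math.dist(query, p), label) for p, label in zip(train_X, train_y))
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

print(knn_predict((0.15, 0.15)))  # near the first cluster -> 0
print(knn_predict((0.95, 1.05)))  # near the second cluster -> 1
```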

This experiment uses Google Colab and Google Drive; the notebook file has the .ipynb suffix and can be used directly. The setup is very lightweight with nothing to install or configure: Colab is a notebook-style web editor, and Google Drive is Google's cloud storage, used here to hold the data. Sign in to both (search for the sign-up pages yourself) and mount your Drive inside Colab. Note that Colab disconnects after a period of inactivity.

 

First click the file icon on the left, then click the third icon at the top to mount Google Drive.


  
```python
from google.colab import drive
drive.mount('/content/drive')
```

Then run this code (the first time, a dialog pops up asking for confirmation). Afterwards a drive folder appears in the file list; its contents are the files in your Google Drive.

Now the files inside can be used. I prepared a heart-failure clinical records table (heart_failure_clinical_records_dataset.csv) for this experiment; its storage location is shown in the figure above.

Also import the basic libraries:


  
```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
```

2. Algorithm Analysis

2.1 The KNN Algorithm


  
```python
import scipy.spatial
from collections import Counter

class KNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def distance(self, X1, X2):
        return scipy.spatial.distance.euclidean(X1, X2)

    def predict(self, X_test):
        final_output = []
        for i in range(len(X_test)):
            # distance from the query point to every training point
            d = []
            votes = []
            for j in range(len(self.X_train)):
                dist = self.distance(self.X_train[j], X_test[i])
                d.append([dist, j])
            d.sort()
            d = d[0:self.k]  # keep the k nearest neighbors
            for dist, j in d:
                votes.append(self.y_train[j])
            # majority vote among the k neighbors
            ans = Counter(votes).most_common(1)[0][0]
            final_output.append(ans)
        return final_output

    def score(self, X_test, y_test):
        predictions = self.predict(X_test)
        value = 0
        for i in range(len(y_test)):
            if predictions[i] == y_test[i]:
                value += 1
        return value / len(y_test)
```

2.2 The LMKNN Algorithm


  
```python
import scipy.spatial
import numpy as np
from operator import itemgetter
from collections import Counter

class LMKNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def distance(self, X1, X2):
        return scipy.spatial.distance.euclidean(X1, X2)

    def predict(self, X_test):
        final_output = []
        myclass = list(set(self.y_train))
        for i in range(len(X_test)):
            # distance from the query point to every training point, with its label
            eucDist = []
            for j in range(len(self.X_train)):
                dist = self.distance(self.X_train[j], X_test[i])
                eucDist.append([dist, j, self.y_train[j]])
            eucDist.sort()
            # collect, for each class, that class's k nearest neighbors of the query
            minimum_dist_per_class = []
            for c in myclass:
                minimum_class = []
                for di in range(len(eucDist)):
                    if len(minimum_class) != self.k:
                        if eucDist[di][2] == c:
                            minimum_class.append(eucDist[di])
                    else:
                        break
                minimum_dist_per_class.append(minimum_class)
            # indices of those neighbors in the training set
            indexData = []
            for a in range(len(minimum_dist_per_class)):
                temp_index = []
                for j in range(len(minimum_dist_per_class[a])):
                    temp_index.append(minimum_dist_per_class[a][j][1])
                indexData.append(temp_index)
            # local mean (centroid) of each class's k nearest neighbors
            centroid = []
            for a in range(len(indexData)):
                transposeData = self.X_train[indexData[a]].T
                tempCentroid = []
                for j in range(len(transposeData)):
                    tempCentroid.append(np.mean(transposeData[j]))
                centroid.append(tempCentroid)
            centroid = np.array(centroid)
            # assign the class whose local centroid is closest to the query
            eucDist_final = []
            for b in range(len(centroid)):
                dist = scipy.spatial.distance.euclidean(centroid[b], X_test[i])
                eucDist_final.append([dist, myclass[b]])
            sorted_eucDist_final = sorted(eucDist_final, key=itemgetter(0))
            final_output.append(sorted_eucDist_final[0][1])
        return final_output

    def score(self, X_test, y_test):
        predictions = self.predict(X_test)
        value = 0
        for i in range(len(y_test)):
            if predictions[i] == y_test[i]:
                value += 1
        return value / len(y_test)
```

2.2.1 Code Comparison of the Two Algorithms

The only difference between the two lies in the prediction function (predict). LMKNN's basic principle is that, when making the classification decision, it uses the local mean vector of each class's k nearest neighbors to classify the query pattern; you can see that this strengthens the local model. The similarities and differences in the final results are discussed at the end.
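The difference can be illustrated on a hypothetical toy example: instead of voting among the k nearest neighbors overall, LMKNN finds, within each class separately, the query's k nearest neighbors, averages them into a local centroid, and picks the class whose centroid is closest. A minimal sketch of that decision rule:

```python
import numpy as np

# Hypothetical toy data: two classes in 2D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [1.2, 0.9], [0.9, 1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.6, 0.5])
k = 2

def lmknn_predict(X, y, q, k):
    best_class, best_dist = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        # k nearest neighbors of the query *within this class*
        idx = np.argsort(np.linalg.norm(Xc - q, axis=1))[:k]
        centroid = Xc[idx].mean(axis=0)       # local mean vector
        dist = np.linalg.norm(centroid - q)   # distance to the local centroid
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class

print(lmknn_predict(X, y, query, k))
```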

2.3 Data Processing

2.3.1 Importing the Data


  
```python
train_path = r'drive/My Drive/Colab Notebooks/Dataset/heart_failure_clinical_records_dataset.csv'
data_train = pd.read_csv(train_path)
data_train.head()
```

Check whether any data is missing; a quick check shows there is none (all columns report 0%):


  
```python
for col in data_train.columns:
    print(col, str(round(100 * data_train[col].isnull().sum() / len(data_train), 2)) + '%')
```

2.3.2 Data Preprocessing


  
```python
data_train.drop('time', axis=1, inplace=True)
print(data_train.columns.tolist())
```

['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'DEATH_EVENT']

This lists all the remaining columns of data_train. Next, take DEATH_EVENT as the label and eight of the columns as features, then normalize the features to [0, 1]:


  
```python
label_train = data_train['DEATH_EVENT'].to_numpy()
fitur_train = data_train[['anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction',
                          'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium']].to_numpy()
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(fitur_train)
fitur_train_normalize = scaler.transform(fitur_train)
print(fitur_train_normalize[0])
```

[0. 0.07131921 0. 0.09090909 1. 0.29082313 0.15730337 0.48571429]
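For reference, MinMaxScaler does nothing more than rescale each column by its own minimum and maximum, x' = (x - min) / (max - min), which is why every value printed above lands in [0, 1]. A small self-contained check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler maps each column to [0, 1] via (x - col_min) / (col_max - col_min).
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(X)

# the same transform computed by hand
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(scaled, manual))  # the two computations agree
```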

2.4 Classification with K = 3

Train KNN and LMKNN separately under 10-fold stratified cross-validation, and print each model's mean accuracy:


  
```python
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10, random_state=None, shuffle=True)
kf.get_n_splits(fitur_train_normalize)
acc_LMKNN_heart = []
acc_KNN_heart = []
for train_index, test_index in kf.split(fitur_train_normalize, label_train):
    knn = KNN(3)
    lmknn = LMKNN(3)
    X_train, X_test = fitur_train_normalize[train_index], fitur_train_normalize[test_index]
    y_train, y_test = label_train[train_index], label_train[test_index]
    knn.fit(X_train, y_train)
    prediction = knn.score(X_test, y_test)
    acc_KNN_heart.append(prediction)
    lmknn.fit(X_train, y_train)
    result = lmknn.score(X_test, y_test)
    acc_LMKNN_heart.append(result)
print(np.mean(acc_KNN_heart))
print(np.mean(acc_LMKNN_heart))
```

0.7157471264367816

0.7022988505747125

2.5 Sensitivity to the Neighborhood Size K

Sweep k from 2 to 14 and print all the results. The GPU doesn't appear to have been used; the run took about 20 seconds.


  
```python
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10, random_state=None, shuffle=True)
kf.get_n_splits(fitur_train_normalize)
K = range(2, 15)
result_KNN_HR = []
result_LMKNN_HR = []
for k in K:
    acc_LMKNN_heart = []
    acc_KNN_heart = []
    for train_index, test_index in kf.split(fitur_train_normalize, label_train):
        knn = KNN(k)
        lmknn = LMKNN(k)
        X_train, X_test = fitur_train_normalize[train_index], fitur_train_normalize[test_index]
        y_train, y_test = label_train[train_index], label_train[test_index]
        knn.fit(X_train, y_train)
        prediction = knn.score(X_test, y_test)
        acc_KNN_heart.append(prediction)
        lmknn.fit(X_train, y_train)
        result = lmknn.score(X_test, y_test)
        acc_LMKNN_heart.append(result)
    result_KNN_HR.append(np.mean(acc_KNN_heart))
    result_LMKNN_HR.append(np.mean(acc_LMKNN_heart))
print('KNN : ', result_KNN_HR)
print('LMKNN : ', result_LMKNN_HR)
```

KNN : [0.6624137931034483, 0.7127586206896551, 0.7088505747126437, 0.7057471264367816, 0.689080459770115, 0.6886206896551724, 0.682528735632184, 0.6826436781609195, 0.69183908045977, 0.6791954022988506, 0.6722988505747127, 0.669080459770115, 0.6588505747126436]

LMKNN : [0.6922988505747126, 0.695977011494253, 0.6885057471264368, 0.6722988505747127, 0.6889655172413793, 0.6652873563218391, 0.6924137931034483, 0.6926436781609195, 0.7051724137931035, 0.6691954022988506, 0.689080459770115, 0.6857471264367816, 0.682528735632184]
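Since shuffle=True with random_state=None, each run reshuffles the folds and the exact numbers vary. Still, a quick way to read the two printed lists is to pick, for each model, the K with the highest mean accuracy. A small sketch, using rounded values from the run above:

```python
# Rounded accuracies from the run above, aligned with K = range(2, 15).
ks = list(range(2, 15))
knn_acc = [0.662, 0.713, 0.709, 0.706, 0.689, 0.689, 0.683,
           0.683, 0.692, 0.679, 0.672, 0.669, 0.659]
lmknn_acc = [0.692, 0.696, 0.689, 0.672, 0.689, 0.665, 0.692,
             0.693, 0.705, 0.669, 0.689, 0.686, 0.683]

best_k_knn = ks[knn_acc.index(max(knn_acc))]        # K with the highest KNN accuracy
best_k_lmknn = ks[lmknn_acc.index(max(lmknn_acc))]  # K with the highest LMKNN accuracy
print(best_k_knn, best_k_lmknn)  # prints: 3 10
```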

Plot the results: blue is KNN, green is LMKNN.


  
```python
import matplotlib.pyplot as plt

plt.plot(range(2, 15), result_KNN_HR)
plt.plot(range(2, 15), result_LMKNN_HR, color="green")
plt.ylabel('Accuracy')
plt.xlabel('K')
plt.show()
```

Other machine-learning datasets such as Iris can also be used; that is not demonstrated here.

3. Summary

Besides this heart-disease experiment, the original work ran four experiments. The final result figures are as follows (KNN in blue, LMKNN in yellow):

A simple reading of the results: when the number of classes is larger, LMKNN achieves higher accuracy.


Source: GitHub - baguspurnama98/lmknn-python: Local Mean K Nearest Neighbor with Python


Reprinted from: https://blog.csdn.net/m0_62237233/article/details/128933295