Feidao's Blog (飞道的博客)

Predicting FIFA World Cup Matches


Contents

1. Download the dataset

2. Data preprocessing

3. Model training and selection

4. Prediction


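1. Download the dataset

Download the dataset from Kaggle: FIFA World Cup | Kaggle

The code in this post reads the file WorldCupMatches.csv from that dataset, placed under the datasets/ directory.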

2. Data preprocessing

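The reprocess_dataset() function preprocesses the data. It labels each match with its winner, keeps only matches involving one of the 32 teams in the 2022 World Cup, and reduces the table to three columns: Home Team Name, Away Team Name, and winning_team. The winning_team label is 2 for a home win, 1 for a draw, and 0 for an away win. The result is saved to datasets/raw_train_data.csv.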
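The 2/1/0 labelling can be illustrated on a few rows. This is only a minimal sketch: the scores are made up for illustration (they are not taken from the dataset), and it uses np.select as a compact alternative to the step-by-step assignments in the full code.

```python
import numpy as np
import pandas as pd

# Toy matches with made-up scores (assumption: for illustration only)
matches = pd.DataFrame({
    "Home Team Name": ["France", "Brazil", "Japan"],
    "Away Team Name": ["Mexico", "Spain", "Ghana"],
    "Home Team Goals": [2, 1, 0],
    "Away Team Goals": [0, 1, 3],
})

home = matches["Home Team Goals"]
away = matches["Away Team Goals"]

# 2 = home win, 0 = away win, anything else (equal goals) = 1 for a draw
matches["winning_team"] = np.select([home > away, home < away], [2, 0], default=1)

print(matches[["Home Team Name", "Away Team Name", "winning_team"]])
```

Each row ends up with exactly one of the three labels, which is the target the classifiers are trained on later.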
 

 

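The save_dataset() function vectorizes the preprocessed data: DictVectorizer one-hot encodes the home and away team names into the feature matrix X, and winning_team becomes the label y. The fitted vectorizer is saved with joblib so that the same encoding can be reused at prediction time.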

The full code is as follows:


  
    import os

    import joblib
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    root_path = "models"


    def reprocess_dataset():
        # Load the raw match results
        results = pd.read_csv('datasets/WorldCupMatches.csv', encoding='gbk')

        # Establish who the winner of each match is
        winner = []
        for i in range(len(results['Home Team Name'])):
            if results['Home Team Goals'][i] > results['Away Team Goals'][i]:
                winner.append(results['Home Team Name'][i])
            elif results['Home Team Goals'][i] < results['Away Team Goals'][i]:
                winner.append(results['Away Team Name'][i])
            else:
                winner.append('Draw')
        results['winning_team'] = winner

        # Add a goal-difference column
        results['goal_difference'] = np.absolute(results['Home Team Goals'] - results['Away Team Goals'])

        # Narrow to the teams participating in the World Cup; there are 32 teams in 2022
        worldcup_teams = ['Qatar', 'Germany', 'Denmark', 'Brazil', 'France', 'Belgium', 'Serbia',
                          'Spain', 'Croatia', 'Switzerland', 'England', 'Netherlands', 'Argentina', 'Iran',
                          'Korea Republic', 'Saudi Arabia', 'Japan', 'Uruguay', 'Ecuador', 'Canada',
                          'Senegal', 'Poland', 'Portugal', 'Tunisia', 'Morocco', 'Cameroon', 'USA',
                          'Mexico', 'Wales', 'Australia', 'Costa Rica', 'Ghana']
        df_teams_home = results[results['Home Team Name'].isin(worldcup_teams)]
        df_teams_away = results[results['Away Team Name'].isin(worldcup_teams)]
        df_teams = pd.concat((df_teams_home, df_teams_away))
        # drop_duplicates() returns a new frame, so the result must be assigned;
        # otherwise matches between two listed teams are counted twice
        df_teams = df_teams.drop_duplicates()
        print(df_teams.count())

        # Drop the columns that will not be used to predict match outcomes
        df_teams_new = df_teams[['Home Team Name', 'Away Team Name', 'winning_team']]
        print(df_teams_new.head())

        # Build the prediction label: winning_team is 2 if the home team won,
        # 1 if it was a draw, and 0 if the away team won
        df_teams_new = df_teams_new.reset_index(drop=True)
        df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Home Team Name'], 'winning_team'] = 2
        df_teams_new.loc[df_teams_new.winning_team == 'Draw', 'winning_team'] = 1
        df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Away Team Name'], 'winning_team'] = 0
        print(df_teams_new.count())
        df_teams_new.to_csv('datasets/raw_train_data.csv', encoding='gbk', index=False)


    def save_dataset():
        df_teams_new = pd.read_csv('datasets/raw_train_data.csv', encoding='gbk')
        feature = df_teams_new[['Home Team Name', 'Away Team Name']]

        # One-hot encode the home and away team names
        vec = DictVectorizer(sparse=False)
        print(feature.to_dict(orient='records'))
        X = vec.fit_transform(feature.to_dict(orient='records'))
        X = X.astype('int')
        print("===")
        # get_feature_names() was removed in newer scikit-learn releases
        print(vec.get_feature_names_out())

        # Use a 1-D label array, which is what the classifiers expect
        y = df_teams_new['winning_team'].astype('int').to_numpy()
        print(X.shape)
        print(y.shape)

        # Save the fitted vectorizer and the training arrays
        os.makedirs(root_path, exist_ok=True)
        joblib.dump(vec, root_path + "/vec.joblib")
        np.savez('datasets/train_data', x=X, y=y)


    if __name__ == '__main__':
        reprocess_dataset()
        save_dataset()

3. Model training and selection

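Several classical machine learning models were trained on the same data, each with a 70/30 train/test split. Their accuracies compare as follows: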

Model                 Training Accuracy   Test Accuracy
Logistic Regression   67.40%              61.60%
SVM                   67.30%              62.70%
Naive Bayes           65.50%              63.80%
Random Forest         90.80%              65.50%
XGB                   75.30%              62.00%

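The random forest model has the highest accuracy on the test set (65.50%), so we use it for prediction. Note that its training accuracy (90.80%) is far above its test accuracy, which points to overfitting, and that all five models fall within about four percentage points of each other on the test set.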
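The selection rule and the train/test gap can be read straight off the table. The short sketch below only re-uses the accuracies reported above; the variable names are illustrative.

```python
# (training accuracy, test accuracy) in percent, copied from the table above
results = {
    "Logistic Regression": (67.4, 61.6),
    "SVM": (67.3, 62.7),
    "Naive Bayes": (65.5, 63.8),
    "Random Forest": (90.8, 65.5),
    "XGB": (75.3, 62.0),
}

# Model with the highest test accuracy
best = max(results, key=lambda name: results[name][1])

# Gap between training and test accuracy; a large gap suggests overfitting
gaps = {name: round(train - test, 1) for name, (train, test) in results.items()}

print("Best on test set:", best)
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:20s} gap = {gap:.1f} points")
```

Ranking by test accuracy picks the random forest, while ranking by gap would favour Naive Bayes, so the choice depends on how much weight is given to generalization.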

The full training code is below:


  
    import os

    import joblib
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from xgboost import XGBClassifier

    # Same directory as vec.joblib from the preprocessing step
    root_path = "models"


    def get_dataset():
        train_data = np.load('datasets/train_data.npz')
        return train_data


    def split_dataset(train_data):
        X = train_data['x']
        y = train_data['y'].ravel()  # flatten in case y was saved as a column
        # Separate train and test sets
        return train_test_split(X, y, test_size=0.30, random_state=42)


    def fit_and_report(model, name, file_name, train_data):
        # Fit the model, save it, and print training and test accuracy
        X_train, X_test, y_train, y_test = split_dataset(train_data)
        model.fit(X_train, y_train)
        joblib.dump(model, root_path + '/' + file_name)
        score = model.score(X_train, y_train)
        score2 = model.score(X_test, y_test)
        print(name, "Training set accuracy: ", '%.3f' % score)
        print(name, "Test set accuracy: ", '%.3f' % score2)
        return model, X_test, y_test


    def train_by_LogisticRegression(train_data):
        fit_and_report(LogisticRegression(), "LogisticRegression",
                       'LogisticRegression_model.joblib', train_data)


    def train_by_svm(train_data):
        model = svm.SVC(kernel='linear', verbose=True, probability=True)
        fit_and_report(model, "SVM", 'svm_model.joblib', train_data)


    def train_by_naive_bayes(train_data):
        fit_and_report(MultinomialNB(), "naive_bayes", 'naive_bayes_model.joblib', train_data)


    def train_by_random_forest(train_data):
        model = RandomForestClassifier(criterion='gini', max_features='sqrt')
        fit_and_report(model, "random forest", 'random_forest_model.joblib', train_data)


    def train_by_xgb(train_data):
        model = XGBClassifier(use_label_encoder=False)
        model, X_test, y_test = fit_and_report(model, "xgb", 'xgb_model.joblib', train_data)
        # Per-class precision, recall and F1 on the test set
        y_pred = model.predict(X_test)
        report = classification_report(y_test, y_pred, output_dict=True)
        # show_confusion_matrix(y_test, y_pred)
        print(report)


    def show_confusion_matrix(y_true, y_pred, pic_name="confusion_matrix"):
        confusion = confusion_matrix(y_true=y_true, y_pred=y_pred)
        print(confusion)
        sns.heatmap(confusion, annot=True, cmap='Blues',
                    xticklabels=['0', '1', '2'], yticklabels=['0', '1', '2'], fmt='d')
        plt.xlabel('Predicted class')
        plt.ylabel('Actual Class')
        plt.title(pic_name)
        # plt.savefig('pic/' + pic_name)
        plt.show()


    if __name__ == '__main__':
        os.makedirs(root_path, exist_ok=True)
        train_data = get_dataset()
        train_by_LogisticRegression(train_data)
        train_by_svm(train_data)
        train_by_naive_bayes(train_data)
        train_by_random_forest(train_data)
        train_by_xgb(train_data)
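Since the test accuracies above come from a single 70/30 split and differ by only a few points, one way to check how stable the ranking is would be k-fold cross-validation. This is not part of the original post, only a sketch of that alternative: the data here are randomly generated one-hot team pairs with random labels (an assumption, standing in for datasets/train_data.npz), so the printed scores are meaningless beyond showing the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real training arrays (assumption: random data)
rng = np.random.default_rng(0)
n_samples, n_teams = 300, 8
home = rng.integers(0, n_teams, n_samples)
away = rng.integers(0, n_teams, n_samples)
X = np.zeros((n_samples, 2 * n_teams), dtype=int)
X[np.arange(n_samples), home] = 1            # one-hot home team
X[np.arange(n_samples), n_teams + away] = 1  # one-hot away team
y = rng.integers(0, 3, n_samples)            # labels 0, 1, 2

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {cv_scores[name].mean():.3f} "
          f"(std {cv_scores[name].std():.3f})")
```

With the real data, X and y would be loaded from train_data.npz instead, and the standard deviation across folds gives a rough sense of whether a two-point difference between models is meaningful.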

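4. Prediction

Running the prediction code below, the model predicts that Ecuador beats Qatar and that England beats Iran. In each call the first team is passed as the home team.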


  
    [ 2]
    [[ 0.05       0.22033333 0.72966667]]
    Probability of  Ecuador  winning: 0.730
    Probability of Draw: 0.220
    Probability of  Qatar  winning: 0.050
    [ 2]
    [[ 0.02342857 0.21770455 0.75886688]]
    Probability of  England  winning: 0.759
    Probability of Draw: 0.218
    Probability of   Iran  winning: 0.023
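The three numbers in each probability row are ordered by class label. As a small sketch of how to read them, the snippet below takes the Ecuador vs Qatar row from the output above and maps each column to an outcome, assuming the model's classes_ attribute is [0, 1, 2] (which holds when all three labels appear in the training data).

```python
import numpy as np

# Probability row copied from the Ecuador vs Qatar output above
proba = np.array([[0.05, 0.22033333, 0.72966667]])

# Assumed to match model.classes_ for a model trained on labels 0, 1, 2
classes = np.array([0, 1, 2])
outcome = {0: "away win", 1: "draw", 2: "home win"}

for cls, p in zip(classes, proba[0]):
    print(f"{outcome[int(cls)]:9s} {p:.3f}")

# The predicted class is the one with the highest probability
predicted = outcome[int(classes[np.argmax(proba[0])])]
print("Predicted:", predicted)
```

Looking up positions through classes_ rather than hard-coding the column index is a little safer if a model is ever trained on data where one outcome is missing.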

Full code:


  
    import joblib

    worldcup_teams = ['Qatar', 'Germany', 'Denmark', 'Brazil', 'France', 'Belgium', 'Serbia',
                      'Spain', 'Croatia', 'Switzerland', 'England', 'Netherlands', 'Argentina', 'Iran',
                      'Korea Republic', 'Saudi Arabia', 'Japan', 'Uruguay', 'Ecuador', 'Canada',
                      'Senegal', 'Poland', 'Portugal', 'Tunisia', 'Morocco', 'Cameroon', 'USA',
                      'Mexico', 'Wales', 'Australia', 'Costa Rica', 'Ghana']

    # Directory that holds vec.joblib and the trained models
    root_path = "models"


    def verify_team_name(team_name):
        # Only the 32 teams of the 2022 World Cup are accepted
        return team_name in worldcup_teams


    def predict(model_dir=root_path + '/LogisticRegression_model.joblib', team_a='France', team_b='Mexico'):
        if not verify_team_name(team_a):
            print(team_a, ' is not correct')
            return
        if not verify_team_name(team_b):
            print(team_b, ' is not correct')
            return

        model = joblib.load(model_dir)

        # team_a is encoded as the home team, team_b as the away team,
        # using the vectorizer fitted during preprocessing
        input_x = [{'Home Team Name': team_a, 'Away Team Name': team_b}]
        vec = joblib.load(root_path + "/vec.joblib")
        input_x = vec.transform(input_x)

        result = model.predict(input_x)
        print(result)

        # Probability columns follow the class order 0, 1, 2:
        # away win, draw, home win
        result1 = model.predict_proba(input_x)
        print(result1)
        print('Probability of ', team_a, ' winning:', '%.3f' % result1[0][2])
        print('Probability of Draw:', '%.3f' % result1[0][1])
        print('Probability of ', team_b, ' winning:', '%.3f' % result1[0][0])


    if __name__ == '__main__':
        team_a = 'Ecuador'
        team_b = 'Qatar'
        predict(root_path + '/random_forest_model.joblib', team_a, team_b)
        team_a = 'England'
        team_b = 'Iran'
        predict(root_path + '/random_forest_model.joblib', team_a, team_b)


Reposted from: https://blog.csdn.net/keeppractice/article/details/128022027