Sentiment Analysis (2): A Naive Bayes Implementation with scikit-learn
  
 
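In the previous post, "Sentiment Analysis (1): A Naive Bayes Implementation with NLTK", we covered naive Bayes classification with NLTK. This post implements naive Bayes classification again, this time with scikit-learn.
The code for this post has been uploaded to my GitHub; feel free to download it if you need it.
1. Import packages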
import pandas as pd
import sys
sys.path.append("..") # Adds higher directory to python modules path.
from NLPmoviereviews.data import load_data_sent
from NLPmoviereviews.utilities import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import cross_validate
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
2. Load data
We again use the imdb_reviews dataset provided by tensorflow-datasets.
# load data
X_train, y_train, X_test, y_test = load_data_sent(percentage_of_sentences=10)
# create dataframe from data
d = {'text': X_train, 'sentiment': y_train}
df = pd.DataFrame(d)
df.head()

The original dataset provides 25,000 training examples; taking 10% gives 2,500.
# check shape
df.shape

# check class balance (it's pretty balanced)
df.sentiment.value_counts()

df.text[0]

3. Data preprocessing
Removing custom stop words raises the score, though the gain is negligible (about 0.002%).
# remove custom stop-words (improves accuracy)
def rm_custom_stops(sentence):
    '''
    Custom stop word remover
    Parameters:
        sentence (str): a string of words
    Returns:
        cleaned_sentence (str): the sentence with custom stop words removed
    '''
    words = sentence.split()
    stop_words = {'br', 'movie', 'film'}
    cleaned_words = [w for w in words if w not in stop_words]
    return ' '.join(cleaned_words)
# clean text data
df['text'] = df.text.apply(preprocessing)
df['text'] = df.text.apply(rm_custom_stops)
df.head()
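As an aside, the same custom stop words can also be dropped at vectorization time through the `stop_words` parameter of CountVectorizer. The sketch below is only an illustration on two made-up sentences, not the pipeline used in this post:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, made up for illustration (not taken from the IMDB data)
docs = ["great movie br br loved the film", "boring film br terrible plot"]

# Same custom stop words as rm_custom_stops, handed to the vectorizer instead
custom_stops = ["br", "movie", "film"]
vec = CountVectorizer(stop_words=custom_stops)
vec.fit(docs)

# The learned vocabulary no longer contains the custom stop words
vocab = sorted(vec.vocabulary_)
print(vocab)
```

Either approach removes the tokens before counting; keeping the step in a separate function, as above, makes the cleaned text visible in the dataframe.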

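Note: tuning the n-gram range does not seem to help much here. An n-gram is a sequence of n consecutive tokens; increasing n lets the model take a word's neighbors into account, which in principle helps it capture the word's meaning in context.
Examples:
- vectorizer = CountVectorizer(ngram_range = (2,2)) # gives a lower accuracy of 78%
- vectorizer = CountVectorizer(ngram_range = (5,5)) # gives an even lower accuracy of 50%
Convert the text to vectors with the bag-of-words model.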
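To see what `ngram_range` actually changes, here is a small sketch on a single made-up sentence. It lists the features CountVectorizer extracts for unigrams, bigrams, and trigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["this movie was not good"]  # toy sentence, made up for illustration

# ngram_range=(n, n) keeps only n-grams of exactly length n
for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n))
    vec.fit(text)
    print(n, sorted(vec.vocabulary_))

bigrams = sorted(CountVectorizer(ngram_range=(2, 2)).fit(text).vocabulary_)
trigrams = sorted(CountVectorizer(ngram_range=(3, 3)).fit(text).vocabulary_)
```

A plausible reading of the accuracy drop is that long n-grams rarely repeat across reviews, so with `(5,5)` the test reviews share very few features with the training reviews. A range such as `(1,2)` keeps the unigrams and adds bigrams on top.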
# vectorize text (convert collection of texts to a matrix of token counts)
vectorizer = CountVectorizer()
X_train_count = vectorizer.fit_transform(df.text)
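For readers new to the bag-of-words model, the following minimal sketch shows the matrix that `fit_transform` produces on two made-up documents: one row per document, one column per vocabulary term, and each cell holding a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good plot", "bad plot"]  # toy documents, made up

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix of shape (n_docs, n_terms)

# Column order follows the indices stored in vocabulary_
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense = X.toarray().tolist()

print(features)  # column labels
print(dense)     # per-document term counts
```

Word order is discarded; only the counts survive, which is exactly the input a multinomial naive Bayes model expects.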
Process the test data.
# process test data
d_test = {'text': X_test}
df_test = pd.DataFrame(d_test)      # create dataframe
df_test['text'] = df_test.text.apply(preprocessing)     # preprocess
df_test['text'] = df_test.text.apply(rm_custom_stops)
X_test_count = vectorizer.transform(df_test.text)     # vectorize
4. Cross-validate the model
The scikit-learn documentation lists 5 naive Bayes algorithms:
| Full name | Import |
|---|---|
| Gaussian Naive Bayes | from sklearn.naive_bayes import GaussianNB | 
| Multinomial Naive Bayes | from sklearn.naive_bayes import MultinomialNB | 
| Complement Naive Bayes | from sklearn.naive_bayes import ComplementNB | 
| Bernoulli Naive Bayes | from sklearn.naive_bayes import BernoulliNB | 
| Categorical Naive Bayes | from sklearn.naive_bayes import CategoricalNB | 
# initialize & cross validate a basic model
naivebayes = MultinomialNB()
cv_nb = cross_validate(naivebayes,
                       X_train_count,
                       y_train,
                       scoring = "accuracy")
# evaluate accuracy
cv_nb['test_score'].mean()
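The sketch below shows what `cross_validate` returns, using synthetic count data rather than the IMDB features (the array shapes and random seed are arbitrary assumptions). By default it runs 5 folds and works on clones of the estimator, so the estimator passed in is not fitted afterwards.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative counts standing in for a document-term matrix
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 6))
y = np.array([0, 1] * 20)  # balanced toy labels

estimator = MultinomialNB()
cv = cross_validate(estimator, X, y, cv=5, scoring="accuracy")

print(sorted(cv.keys()))        # timing arrays plus the per-fold scores
print(cv["test_score"].shape)   # one accuracy value per fold

# The original estimator was cloned for each fold and remains unfitted
still_unfitted = not hasattr(estimator, "classes_")
```

This is why a separate `fit` call on the training data is still needed before scoring on the test set.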

Fit the model that was just cross-validated.
# fit model
naivebayes.fit(X_train_count, y_train)

5. Evaluate the model
# get accuracy score
naivebayes.score(X_test_count, y_test)

# Plot confusion matrix
disp = ConfusionMatrixDisplay.from_estimator(naivebayes,
                             X_test_count, y_test,
                             cmap="Blues");
# 160 false positives, 323 false negatives
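Counts such as the false positives and false negatives noted above can also be read off numerically instead of from the plot. A minimal sketch with made-up labels, assuming a binary problem with classes 0 and 1:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 1]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```

Rows of the matrix are true classes and columns are predicted classes, so a false positive is a true 0 predicted as 1.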
Note: ConfusionMatrixDisplay.from_estimator requires scikit-learn 1.0 or later.
Solution: ConfusionMatrixDisplay.from_estimator missing #21775
According to the official documentation, scikit-learn 1.0 and later cannot be installed on Python 3.6.
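If upgrading is not an option, one way to cope is to check for the method at runtime and fall back to computing the matrix directly. This is a sketch of that idea on made-up labels, not the fix described in the linked issue:

```python
import sklearn
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# from_estimator only exists on scikit-learn 1.0 and later
has_from_estimator = hasattr(ConfusionMatrixDisplay, "from_estimator")
print(sklearn.__version__, has_from_estimator)

# Fallback available on older versions as well: compute the matrix yourself
y_true = [0, 1, 1, 0]  # toy labels, made up
y_pred = [0, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# The matrix can then be drawn with ConfusionMatrixDisplay(cm).plot(),
# which requires matplotlib to be installed.
```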

 
With this issue resolved, we can continue.
# print classification report
Y_predict = naivebayes.fit(X_train_count, y_train).predict(X_test_count)
print(classification_report(y_test, Y_predict))
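When individual numbers from the report are needed in code, `classification_report` can return a dictionary instead of a string. A small sketch on made-up labels:

```python
from sklearn.metrics import classification_report

# Toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)

# Keys are the class labels as strings; here class 1 has tp=3, fp=1, fn=1
precision_pos = report["1"]["precision"]  # tp / (tp + fp)
recall_pos = report["1"]["recall"]        # tp / (tp + fn)
print(precision_pos, recall_pos, report["accuracy"])
```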

6. Optimize the model
6.1 Build word vectors with TF-IDF
For a detailed introduction to TF-IDF, see my post "[Natural Language Processing] BOW and TF-IDF Explained".
vectorizer = TfidfVectorizer(max_df=0.3) # ignore terms that appear in more than 30% of documents
X_train_vec = vectorizer.fit_transform(df.text)
X_test_vec = vectorizer.transform(df_test.text)
model = MultinomialNB() # multinomial naive Bayes classifier
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)
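The effect of `max_df` is easiest to see on a tiny corpus. In this sketch (made-up documents, with a threshold of 0.5 chosen only for the demonstration), a term is dropped when the share of documents containing it is strictly greater than the threshold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "plot" appears in 3 of 4 documents, "good" in 2 of 4
docs = ["plot good", "plot bad", "plot dull", "acting good"]

all_terms = sorted(TfidfVectorizer().fit(docs).vocabulary_)
kept = sorted(TfidfVectorizer(max_df=0.5).fit(docs).vocabulary_)

print(all_terms)  # full vocabulary
print(kept)       # vocabulary after dropping terms with document frequency > 0.5
```

Here "plot" (75% of documents) is removed while "good" (exactly 50%) is kept. Note that `max_df` is about document frequency, not the raw number of occurrences.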

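6.2 Use the ComplementNB model
ComplementNB is a variant of MultinomialNB that implements the Complement Naive Bayes (CNB) algorithm. CNB is an adaptation of the standard Multinomial Naive Bayes (MNB) algorithm that is particularly suited to imbalanced datasets and often outperforms MultinomialNB on text classification. Specifically, CNB uses statistics from the complement of each class to compute the model's weights. The inventors of CNB showed empirically that its parameter estimates are more stable than those of MNB.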
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(df.text)
X_test_vec = vectorizer.transform(df_test.text)
model = ComplementNB()
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)
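As a rough illustration of what "statistics from the complement of each class" means, the sketch below computes per-class and complement term counts by hand on a made-up count matrix, then fits ComplementNB on the same data. The hand computation is for intuition only; it omits smoothing and the log-weight step that the library applies.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Toy count matrix: rows are documents, columns are terms (made up)
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])

# Per-class term counts, the statistics MNB is built on
class_counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])

# Complement counts: for each class, term counts over all OTHER classes
complement_counts = X.sum(axis=0) - class_counts

model = ComplementNB().fit(X, y)
pred = model.predict(np.array([[1, 0, 0]]))  # a document containing only term 0

print(class_counts)
print(complement_counts)
print(pred)
```

With only two classes, the complement of one class is simply the other class, so the difference from MNB shows up mainly with many or imbalanced classes.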

6.3 Remove HTML tags
import re
# function to remove html tags from text
def cleanHtml(review):
    cleanreg = re.compile('<.*?>')
    cleantxt = re.sub(cleanreg, ' ', review)
    return cleantxt
# load data
X_train, y_train, X_test, y_test = load_data_sent(percentage_of_sentences=10)
# process train data
df = pd.DataFrame({'text': X_train})
df.text = df.text.apply(cleanHtml)
df['text'] = df.text.apply(preprocessing)
df['text'] = df.text.apply(rm_custom_stops)
# process test data
df_test = pd.DataFrame({'text': X_test})
df_test.text = df_test.text.apply(cleanHtml)
df_test['text'] = df_test.text.apply(preprocessing)
df_test['text'] = df_test.text.apply(rm_custom_stops)
# modelling
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(df.text)
X_test_vec = vectorizer.transform(df_test.text)
model = ComplementNB()
model.fit(X_train_vec, y_train)
model.score(X_test_vec, y_test)
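To make the tag-stripping step concrete, here is a small sketch applying the same regular expression to a made-up review fragment. The extra whitespace-collapsing line is an addition for readability, not part of the original function:

```python
import re

# Non-greedy pattern: each tag is matched separately, not first "<" to last ">"
TAG_RE = re.compile(r"<.*?>")

sample = "Great cast.<br /><br />Weak <i>third</i> act."  # toy text, made up

cleaned = TAG_RE.sub(" ", sample)      # every tag becomes a single space
collapsed = " ".join(cleaned.split())  # optional: squeeze repeated spaces

print(collapsed)
```

Once tags such as `<br />` are removed this way, the 'br' entry in the custom stop-word list presumably has little left to match.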
 
7. Getting the word-frequency distribution (supplementary)
# create a new column with words and word counts
vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()
def wordcounts(s):
    c = {}
    if analyzer(s):
        d = {}
        w = vectorizer.fit_transform([s]).toarray()
        vc = vectorizer.vocabulary_
        for k,v in vc.items():
            d[v]=k # d -> index:word 
        for index,i in enumerate(w[0]):
            c[d[index]] = i # c -> word:count
    return  c
df['Word Counts'] = df.text.apply(wordcounts)
df.head()
Example output of w: [[1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1]]
Example output of vc: {'big': 3, 'step': 34, 'surprisingly': 36, 'enjoyable': 5, 'original': 23, 'sequel': 29, 'isnt': 12, 'nearly': 19, 'fun': 9, 'part': 24, 'one': 21, 'instead': 10, 'spend': 32, 'much': 18, 'time': 40, 'plot': 26, 'development': 4, 'tim': 39, 'thomerson': 38, 'still': 35, 'best': 2, 'thing': 37, 'series': 30, 'wisecrack': 45, 'tone': 41, 'entry': 7, 'performance': 25, 'adequate': 1, 'script': 28, 'let': 14, 'action': 0, 'merely': 16, 'routine': 27, 'mildly': 17, 'interest': 11, 'need': 20, 'lot': 15, 'silly': 31, 'laugh': 13, 'order': 22, 'stay': 33, 'entertain': 6, 'trancers': 42, 'unfortunately': 43, 'far': 8, 'watchable': 44}

import operator
first_review = df['Word Counts'].iloc[0]
sorted_by_value = sorted(first_review.items(), key=operator.itemgetter(1),reverse=True)
print(sorted_by_value )
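A shorter route to the same kind of per-review counts is to feed the vectorizer's analyzer into `collections.Counter`. This is an alternative sketch, not the method used above; `wordcounts_counter` is a hypothetical helper name and the sample sentence is made up:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() gives the tokenizer + preprocessing without fitting anything
analyzer = CountVectorizer().build_analyzer()

def wordcounts_counter(s):
    """Return a word -> count mapping for one string (hypothetical helper)."""
    return Counter(analyzer(s))

counts = wordcounts_counter("fun sequel not nearly fun like the original")
top = counts.most_common(1)  # most frequent word first

print(counts)
print(top)
```

`Counter.most_common()` replaces the explicit `sorted(..., key=operator.itemgetter(1), reverse=True)` call, and an empty string simply yields an empty Counter.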

Reposted from: https://blog.csdn.net/be_racle/article/details/128763410
 
					