Natural Language Processing (NLP) is a core toolset for text analysis and mining, built for enterprises and developers of all kinds. It is designed to help users process text efficiently, and it has been applied with good results across many business scenarios in e-commerce, entertainment, justice, public security, finance, healthcare, electric power, and other industries.
1. Project Background
In any industry, user feedback on a product matters a great deal. User reviews can be used to determine users' sentiment polarity.
Take online shopping, the most common case today: for shoppers, reviews support better purchase decisions; for sellers, classifying reviews by sentiment polarity and then clustering the text to surface frequently mentioned strengths and weaknesses points the way to product improvements.
This case study focuses on determining the sentiment polarity of product reviews. The figure below shows some reviews of a phone model on an e-commerce platform:
2. Dataset
This dataset of reviews for a phone model contains 2 attributes (Comment and Class) and 8,186 samples in total.
Use the read_excel function from Pandas to read the xls dataset file; note that the file encoding is gb18030. The code is as follows:
```python
import pandas as pd

# Read the dataset (note: the `encoding` argument of read_excel
# is not accepted by recent pandas versions; drop it there)
data = pd.read_excel("data.xls", encoding='gb18030')
print(data.head())
```
The first few rows of the dataset look like this:
Inspect the dataset: its shape, its column names, and the number of samples per class:
```python
# Shape of the dataset
print(data.shape)

# Column names
print(data.columns.values)

# Number of records per class
print(data['Class'].value_counts())
```
The output is as follows:
```
(8186, 2)

array([u'Comment', u'Class'], dtype=object)

 1    3042
-1    2657
 0    2487
Name: Class, dtype: int64
```
3. Data Preprocessing
Now we need to turn the text in the Comment column into a numeric matrix representation, i.e., map the text into a feature space.
First, segment the Chinese text with jieba (its HMM model is enabled by default). The imports:
```python
# Import the Chinese word segmentation library jieba
import jieba
import numpy as np
```
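As a quick illustration of what the segmenter produces (the sample sentence below is made up, not taken from the dataset):

```python
import jieba

sample = "这个手机屏幕很漂亮"
# jieba.cut returns a generator of tokens; HMM is enabled by default
print(" ".join(jieba.cut(sample)))
# likely output: 这个 手机 屏幕 很 漂亮
```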
Next, segment the text of each sample in the dataset; when a missing value is encountered, fill it with u"还行 一般吧" ("okay, average"). The code:
```python
cutted = []
for row in data.values:
    try:
        raw_words = " ".join(jieba.cut(row[0]))
        cutted.append(raw_words)
    except AttributeError:
        # non-string (missing) comment: report it and fill with a neutral phrase
        print(row[0])
        cutted.append(u"还行 一般吧")

cutted_array = np.array(cutted)

# Build a new DataFrame whose Comment field holds the segmented text
data_cutted = pd.DataFrame({
    'Comment': cutted_array,
    'Class': data['Class']
})
```
Read back and inspect the preprocessed data:

```python
print(data_cutted.head())
```

Part of the resulting dataset looks like this:
To see the high-frequency words more intuitively, we visualize the text with the third-party library wordcloud. Import it as follows:
```python
# Import the third-party library wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
```
For the positive, neutral, and negative review texts, build a WordCloud object and draw a word cloud for each. Positive reviews:
```python
# Positive reviews
# (font_path must point to a locally available font file; a CJK-capable
# font is needed to render Chinese words)
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]))
plt.axis('off')
plt.imshow(wc)
plt.show()
```
The positive-review word cloud looks like this:
Neutral reviews:

```python
# Neutral reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]))
plt.axis('off')
plt.imshow(wc)
plt.show()
```
The neutral-review word cloud looks like this:
Negative reviews:

```python
# Negative reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]))
plt.axis('off')
plt.imshow(wc)
plt.show()
```
The negative-review word cloud looks like this:
Judging from the word clouds' frequency statistics, words such as "手机" (phone), "就是" (just), "屏幕" (screen), and "收到" (received) are useless for distinguishing the classes and only introduce bias. Such uninformative words therefore need to be filtered out and collected in the stop-word file stopwords.txt. The code:
```python
# Read the stop-word file
import codecs

with codecs.open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [item.strip() for item in f]

# Print the first 200 stop words on one line
for item in stopwords[0:200]:
    print(item, end=' ')
```
The stop words print as follows:
Use the extract_tags function from jieba's analyse module to extract the top-20 keywords of the positive, neutral, and negative texts.
```python
# Register the stop-word file so keyword extraction filters stop words
import jieba.analyse

jieba.analyse.set_stop_words('stopwords.txt')
```
Positive-review keyword analysis:
```python
# Positive-review keywords
keywords_pos = jieba.analyse.extract_tags(
    ''.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)
for item in keywords_pos:
    print(item, end=' ')
```
The top-20 positive keywords:

```
不错 正品 赠品 五分 发货 东西 满意 机子 喜欢 收到 很漂亮 充电 好评 很快 卖家 速度 评价 流畅 快递 物流
```
Neutral-review keyword analysis:
```python
# Neutral-review keywords
keywords_med = jieba.analyse.extract_tags(
    ''.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)
for item in keywords_med:
    print(item, end=' ')
```
The top-20 neutral keywords:

```
充电 不错 发热 外观 感觉 电池 机子 问题 赠品 有点 无线 发烫 换货 软件 快递 安卓 内存 退货 知道 售后
```
Negative-review keyword analysis:
```python
# Negative-review keywords
keywords_neg = jieba.analyse.extract_tags(
    ''.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)
for item in keywords_neg:
    print(item, end=' ')
```
The top-20 negative keywords:

```
差评 售后 垃圾 赠品 退货 问题 换货 充电 降价 发票 充电器 东西 刚买 发热 无线 机子 死机 收到 质量 15
```
With the steps above, preprocessing of the dataset is complete for the moment. In Chinese text analysis and sentiment analysis, preprocessing consists mainly of word segmentation: only a segmented text dataset can proceed to the vectorization step and meet the input requirements of a model.
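As a bridge to the next section, here is a minimal sketch (with a made-up two-review corpus) of why the space-joined tokens matter: sklearn's default analyzer does not understand Chinese word boundaries, so it is the jieba pre-segmentation plus space-joining that lets the vectorizers build a sensible vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["手机 很 好用", "手机 太 差 了"]   # already segmented, space-joined
vec = CountVectorizer()
X = vec.fit_transform(docs)

# note: the default token_pattern keeps only tokens of 2+ characters,
# so single-character words (很, 太, 差, 了) are dropped here
print(vec.get_feature_names_out())   # ['好用' '手机']
print(X.toarray())                   # [[1 1] [0 1]]
```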
4. SVM-Based Sentiment Classification Model
The segmented text dataset must be vectorized before it can be fed into a classification model for computation.
We use the sklearn library for vectorization: stop words are removed, and the text is mapped into the feature space via tf and tf-idf weighting.
Here, tf is the term frequency, i.e., the number of times a term occurs in a given review after segmentation; df is the number of reviews that contain the term; and N is the total number of reviews. A logarithm (as in the idf factor $\log(N/df)$) is used to moderate the influence of the tf and df values.
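A minimal sketch (toy three-review corpus, illustrative only) of how the tf and tf-idf weightings referred to here differ in sklearn:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["手机 不错 不错", "手机 很差", "手机 流畅"]   # segmented toy reviews

counts = CountVectorizer().fit_transform(docs)               # raw term counts
tf     = TfidfVectorizer(use_idf=False).fit_transform(docs)  # normalized tf only
tfidf  = TfidfVectorizer().fit_transform(docs)               # tf weighted by idf

# "手机" occurs in every review (df = N), so idf pushes its weight down
# relative to rarer terms such as "不错" or "流畅"
print(counts.toarray())
print(tfidf.toarray().round(2))
```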
We implement the SVM algorithm directly with functions from the sklearn library, selecting the SVM variants listed in the code below (SVC with a linear kernel, LinearSVC, and SGDClassifier).
For convenience, we create a text sentiment analysis class, CommentClassifier, to implement the modeling process:
__init__ is the class initializer; its parameters classifier_type and vector_type specify the type of classification model and the type of vectorization method, respectively. The fit() function implements the vectorization and model-building process.
The implementation is as follows:
```python
# Vectorization methods
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# SVM models
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Train/test splitting and cross-validation
# (older sklearn exposed these under sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Evaluation metrics
from sklearn import metrics


# Text sentiment classification class: CommentClassifier
class CommentClassifier:
    def __init__(self, classifier_type, vector_type):
        self.classifier_type = classifier_type  # classifier: one of three SVM variants
        self.vector_type = vector_type          # vectorizer: 0/1 model, TF model, or TF-IDF model

    def fit(self, train_x, train_y, max_df):
        list_text = list(train_x)

        # Vectorization method: 0 - 0/1, 1 - TF, 2 - TF-IDF
        # (max_df must be passed by keyword; positionally it would be taken
        # as the `input` argument. The original uses plain counts for the
        # 0/1 model; a strict 0/1 model would also pass binary=True.)
        if self.vector_type == 0:
            self.vectorizer = CountVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)
        elif self.vector_type == 1:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3), use_idf=False).fit(list_text)
        else:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)

        self.array_trainx = self.vectorizer.transform(list_text)
        self.array_trainy = train_y

        # Classification model: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier (three SVM models)
        if self.classifier_type == 1:
            self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(self.array_trainx, self.array_trainy)
        elif self.classifier_type == 2:
            self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)
        else:
            self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)

    def predict_value(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_predict = self.model.predict(self.array_testx)
        return array_predict

    def predict_proba(self, test_x):
        # note: SVC supports predict_proba only when built with probability=True
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_score = self.model.predict_proba(self.array_testx)
        return array_score
```
- Use the train_test_split() function to split the data into a training set (80%) and a test set (20%).
- Build lists of candidate values for the classifier_type and vector_type parameters, representing the chosen vectorization methods and classification models.
- For every combination of vectorization method and classification model, output the evaluation results: the confusion matrix and a report with the three metrics Precision, Recall, and F1-score (defined below).
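For reference when reading the reports below: with per-class true positives (TP), false positives (FP), and false negatives (FN), the three metrics are

$$\text{precision}=\frac{TP}{TP+FP},\qquad \text{recall}=\frac{TP}{TP+FN},\qquad F_1=\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$

and the "avg / total" row is the support-weighted average across the classes.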
The implementation is as follows:
```python
# Split into training and test sets (80% / 20%)
train_x, test_x, train_y, test_y = train_test_split(
    data_cutted['Comment'].ravel().astype('U'),
    data_cutted['Class'].ravel(),
    test_size=0.2, random_state=4)

classifier_list = [1, 2, 3]
vector_list = [0, 1, 2]

for classifier_type in classifier_list:
    for vector_type in vector_list:
        commentCls = CommentClassifier(classifier_type, vector_type)
        # max_df is set to 0.98
        commentCls.fit(train_x, train_y, 0.98)
        if classifier_type == 0:
            # kept from the original, but never reached with
            # classifier_list = [1, 2, 3] (and SVC would need
            # probability=True for predict_proba)
            value_result = commentCls.predict_value(test_x)
            proba_result = commentCls.predict_proba(test_x)
        else:
            value_result = commentCls.predict_value(test_x)
        print(classifier_type, vector_type)
        print('classification report')
        print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))
        print('confusion matrix')
        print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))
```
The output is as follows:
```
1 0
classification report
             precision    recall  f1-score   support

         -1       0.68      0.62      0.65       519
          0       0.55      0.49      0.52       485
          1       0.75      0.86      0.80       634

avg / total       0.67      0.68      0.67      1638

confusion matrix
[[324 130  65]
 [131 236 118]
 [ 24  64 546]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.71      0.74      0.72       519
          0       0.58      0.54      0.56       485
          1       0.84      0.85      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[385 109  25]
 [145 263  77]
 [ 15  80 539]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.58      0.52      0.55       485
          1       0.84      0.86      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[386 106  27]
 [151 254  80]
 [ 14  76 544]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.70      0.62      0.66       519
          0       0.56      0.51      0.54       485
          1       0.76      0.88      0.82       634

avg / total       0.68      0.69      0.68      1638

confusion matrix
[[320 135  64]
 [122 248 115]
 [ 16  57 561]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.69      0.73      0.71       519
          0       0.61      0.48      0.54       485
          1       0.81      0.91      0.86       634

avg / total       0.71      0.73      0.72      1638

confusion matrix
[[377 108  34]
 [154 233  98]
 [ 12  44 578]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.61      0.50      0.55       485
          1       0.83      0.91      0.87       634

avg / total       0.72      0.73      0.73      1638

confusion matrix
[[383 108  28]
 [154 241  90]
 [ 13  43 578]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.69      0.69      0.69       519
          0       0.58      0.47      0.52       485
          1       0.79      0.90      0.84       634

avg / total       0.70      0.71      0.70      1638

confusion matrix
[[359 118  42]
 [148 228 109]
 [ 14  47 573]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.60      0.49      0.54       485
          1       0.81      0.88      0.84       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[386  96  37]
 [152 240  93]
 [ 13  66 555]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.65      0.75      0.69       519
          0       0.63      0.49      0.55       485
          1       0.83      0.86      0.85       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[389  98  32]
 [169 236  80]
 [ 45  41 548]]
```
Judging from these results, the combination of tf-idf vectorization and the LinearSVC model works best, with an f1-score of 0.73.
The confusion matrices show that most misclassifications involve the neutral and negative classes. We can therefore drop the neutral reviews from the original dataset. The code:
```python
data_bi = data_cutted[data_cutted['Class'] != 0]
data_bi['Class'].value_counts()
```
The output:
```
 1    3042
-1    2658
Name: Class, dtype: int64
```
Run the classification models again; the results are as follows:
```
1 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.79      0.84       537
          1       0.83      0.92      0.87       603

avg / total       0.86      0.86      0.86      1140

confusion matrix
[[425 112]
 [ 48 555]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.87      0.92      0.90       537
          1       0.93      0.88      0.90       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[496  41]
 [ 71 532]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.88      0.91       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[497  40]
 [ 70 533]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.80      0.85       537
          1       0.84      0.92      0.88       603

avg / total       0.87      0.86      0.86      1140

confusion matrix
[[431 106]
 [ 48 555]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 43 560]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.93      0.91      0.92       537
          1       0.92      0.94      0.93       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[488  49]
 [ 39 564]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.92      0.82      0.87       537
          1       0.86      0.94      0.90       603

avg / total       0.89      0.88      0.88      1140

confusion matrix
[[443  94]
 [ 38 565]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 41 562]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.89      0.91       603

avg / total       0.91      0.91      0.91      1140

confusion matrix
[[497  40]
 [ 67 536]]
```
After removing the neutral reviews, every combination of vectorizer and classifier improves markedly. This also shows that the models can separate positive reviews effectively.
The dataset also suffers from inaccurate labeling, concentrated in the neutral class. When people write reviews, they tend to give a positive rating unless something went wrong; a neutral rating usually signals dissatisfaction, so the wording leans toward negative sentiment. Reviews are also highly subjective, and many comments that read as negative are labeled neutral in the dataset. Splitting reviews into positive, neutral, and negative is therefore not fully objective: the boundary between neutral and negative is blurry, which makes the recognition rate hard to improve.
5. Unsupervised Classification Model Based on doc2vec (from word2vec)
The open-source text vectorization tool word2vec can learn deeper feature representations of text. Its word vectors support arithmetic, e.g.:
w2v(woman) - w2v(man) + w2v(king) ≈ w2v(queen)
doc2vec, built on word2vec, represents each document as a single vector, and the cosine distance between two document vectors measures how similar they are. We can therefore compute, for each review, its distance to an extremely positive review and its distance to an extremely negative review.
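For reference, cosine similarity is just the normalized dot product of two vectors; a minimal numpy sketch with made-up vectors:

```python
import numpy as np

def cos_sim(a, b):
    # cos(a, b) = a.b / (|a||b|), in [-1, 1]; larger means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

review_vec = np.array([0.9, 0.1, 0.3])   # hypothetical doc2vec vector of a review
pos_vec    = np.array([1.0, 0.0, 0.2])   # hypothetical vector of an extreme positive review
print(cos_sim(review_vec, pos_vec))      # close to 1.0 -> very similar
```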
In this case study's dataset, the two exemplars look like this (already segmented):
- Positive exemplar: 快 就是 手感 满意 也好 喜欢 也 流畅 很 服务态度 实用 超快 挺快 用着 速度 礼品 也不错 非常好 挺好 感觉 才来 还行 好看 也快 不错的 送了 非常不错 超级 赞 好多东西 很实用 各方面 挺好的 很多 漂亮 配件 还不错 也多 特意 慢 满分 好用 非常漂亮......
- Negative exemplar: 不多说 上当 差差 刚用 服务差 一点也不 不要 简直 还是去 实体店 大家 保证 不肯 生气 开发票 磨损 后悔 印记 网 什么破 烂烂 左边 失效 太 骗 掉价 走下坡路 不说了 彻底 三星手机 自营 几次 真心 别的 看完 简单说 机会 这是 生气了 触动 缝隙 冲动了 失望......
We implement the doc2vec model with the third-party library gensim.
The implementation is as follows:
```python
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

train_x = data_bi['Comment'].ravel()
train_y = data_bi['Class'].ravel()

# Tag each document in train_x with a "TRAIN_<i>" label
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(TaggedDocument(v.split(" "), [label]))
    return labelized

train_x = labelizeReviews(train_x, "TRAIN")

# Build the Doc2Vec model
# (written against gensim 1.x/2.x; gensim 4+ renames size/iter to
# vector_size/epochs and requires total_examples and epochs in train())
size = 300
all_data = []
all_data.extend(train_x)

model = Doc2Vec(min_count=1, window=8, size=size, sample=1e-4,
                negative=5, hs=0, iter=5, workers=8)
model.build_vocab(all_data)

# Train for 10 passes
for epoch in range(10):
    model.train(train_x)

# pos / neg hold each review's cosine similarity to an extremely positive
# review (tagged TRAIN_0) and an extremely negative review (tagged TRAIN_1)
pos = []
neg = []

for i in range(0, len(train_x)):
    pos.append(model.docvecs.similarity("TRAIN_0", "TRAIN_{}".format(i)))
    neg.append(model.docvecs.similarity("TRAIN_1", "TRAIN_{}".format(i)))

# Write pos and neg back into the data as columns PosSim and NegSim
data_bi[u'PosSim'] = pos
data_bi[u'NegSim'] = neg
```
The model's training log looks like this (first pass shown; the remaining passes are near-identical):
```
2017-05-27 14:30:28,393 : INFO : collecting all words and their counts
2017-05-27 14:30:28,394 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-27 14:30:28,593 : INFO : collected 10545 word types and 5700 unique tags from a corpus of 5700 examples and 482148 words
2017-05-27 14:30:28,595 : INFO : Loading a fresh vocabulary
2017-05-27 14:30:28,649 : INFO : min_count=1 retains 10545 unique words (100% of original 10545, drops 0)
2017-05-27 14:30:28,650 : INFO : min_count=1 leaves 482148 word corpus (100% of original 482148, drops 0)
2017-05-27 14:30:28,705 : INFO : deleting the raw counts dictionary of 10545 items
2017-05-27 14:30:28,706 : INFO : sample=0.0001 downsamples 217 most-common words
2017-05-27 14:30:28,707 : INFO : downsampling leaves estimated 108356 word corpus (22.5% of prior 482148)
2017-05-27 14:30:28,709 : INFO : estimated required memory for 10545 words and 300 dimensions: 38560500 bytes
2017-05-27 14:30:28,784 : INFO : resetting layer weights
2017-05-27 14:30:29,120 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:29,121 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:30,176 : INFO : PROGRESS: at 10.24% examples, 72316 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:31,211 : INFO : PROGRESS: at 29.96% examples, 91057 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:32,218 : INFO : PROGRESS: at 66.30% examples, 126742 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,231 : INFO : PROGRESS: at 86.00% examples, 122698 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,571 : INFO : worker thread finished; awaiting finish of 7 more threads
...
2017-05-27 14:30:33,722 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:33,724 : INFO : training on 2410740 raw words (570332 effective words) took 4.6s, 124032 effective words/s
... (nine more training passes with near-identical output) ...
2017-05-27 14:31:01,962 : INFO : training on 2410740 raw words (570826 effective words) took 3.0s, 191072 effective words/s
```
Finally, visualize the classification result:
```python
from matplotlib import pyplot as plt

label = data_bi['Class'].ravel()
values = data_bi[['PosSim', 'NegSim']].values

# Scatter each review by (similarity to the positive exemplar,
# similarity to the negative exemplar), colored by its class label
plt.scatter(values[:, 0], values[:, 1], c=label, alpha=0.4)
plt.show()
```
The result:
As the figure shows, positive and negative reviews can largely be separated by a straight line (blue: negative; red: positive).
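To put a number on that observation, one could fit a linear classifier on the two similarity features; a hedged sketch (this check is not part of the original post):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = data_bi[['PosSim', 'NegSim']].values
y = data_bi['Class'].ravel()

# Mean cross-validated accuracy of a straight-line decision boundary
# in the (PosSim, NegSim) plane
clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=5).mean())
```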
This approach is completely different from the traditional pipeline: it uses no word-frequency or sentiment-lexicon features. Its advantages are:
- It maps the dataset into an extremely low-dimensional space: just two dimensions.
- It is an unsupervised learning method, so the raw training data needs no labels.
- It generalizes: the same method works in other domains. One only needs to find an extremely positive and an extremely negative exemplar for that domain, convert them and all the texts to be classified into vectors with doc2vec, and compute the distances (see the sketch after this list).
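A sketch of that transfer recipe using gensim's infer_vector, which embeds unseen documents with a trained Doc2Vec model (the new review and the helper below are illustrative; TRAIN_0/TRAIN_1 are the exemplar tags used above):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed a new, already-segmented review with the trained model
new_review = "手机 很 流畅 非常 满意"
new_vec = model.infer_vector(new_review.split(" "))

# Compare against the extreme positive / negative exemplar vectors
pos_sim = cos_sim(new_vec, model.docvecs["TRAIN_0"])
neg_sim = cos_sim(new_vec, model.docvecs["TRAIN_1"])
print("positive" if pos_sim > neg_sim else "negative")
```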
Reposted from: https://blog.csdn.net/m0_38106923/article/details/115532908