
Sentiment Analysis of the Harry Potter Novels with Python



Preparing the Data

The data we have is one novel per txt file, but we want to analyze it chapter by chapter (the first element of a list should hold chapter 1's content, the second element chapter 2's, and so on), which is a job for regular expressions.

For example, let's look at the chapter headings in 01-Harry Potter and the Sorcerer's Stone.txt. Opening the txt file

and searching through it, we find that every chapter heading follows the same pattern:

[Chapter][space][integer][newline \n][English title, possibly containing spaces][newline \n]

To warm up with regular expressions, let's design a pattern that extracts the chapter headings:


   
import re

raw_text = open("data/01-Harry Potter and the Sorcerer's Stone.txt").read()
pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
re.findall(pattern, raw_text)

   
['Chapter 1\nThe Boy Who Lived\n',
 'Chapter 2\nThe Vanishing Glass\n',
 'Chapter 3\nThe Letters From No One\n',
 'Chapter 4\nThe Keeper Of The Keys\n',
 'Chapter 5\nDiagon Alley\n',
 'Chapter 7\nThe Sorting Hat\n',
 'Chapter 8\nThe Potions Master\n',
 'Chapter 9\nThe Midnight Duel\n',
 'Chapter 10\nHalloween\n',
 'Chapter 11\nQuidditch\n',
 'Chapter 12\nThe Mirror Of Erised\n',
 'Chapter 13\nNicholas Flamel\n',
 'Chapter 14\nNorbert the Norwegian Ridgeback\n',
 'Chapter 15\nThe Forbidden Forest\n',
 'Chapter 16\nThrough the Trapdoor\n',
 'Chapter 17\nThe Man With Two Faces\n']
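Notice that Chapter 6 is missing from the matches: its title, "The Journey from Platform Nine and Three-Quarters", contains hyphens, which the character class [a-zA-Z ]+ cannot match. A more permissive pattern that accepts any characters on the title line would catch it; a minimal sketch on a toy excerpt (hypothetical text, not the real file):

```python
import re

# Toy excerpt: Chapter 6's title contains hyphens, which [a-zA-Z ]+ cannot match
sample = ("Chapter 5\nDiagon Alley\nsome chapter text\n"
          "Chapter 6\nThe Journey from Platform Nine and Three-Quarters\nmore text\n")

# [^\n]+ accepts any title line, punctuation included
pattern = r'Chapter \d+\n[^\n]+\n'
print(re.findall(pattern, sample))
```

The trade-off is that [^\n]+ is laxer, so it relies on chapter headings always occupying exactly the two lines the pattern expects.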

Now that the basic regex works, we want to be more precise. I prepared a test string whose chapter headings mimic the real novel's, only much shorter and easier to reason about. The test data contains 5 chapters, so after splitting, the resulting list should have length 5, with the first element holding chapter 1's content, the second element chapter 2's, and so on.


   
import re

test = """Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,
Chapter 2\nThe Vanishing Glass\nFor a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.
Chapter 3\nThe Letters From No One\nThe traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.
Chapter 4\nThe Keeper Of The Keys\nHe didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.
Chapter 5\nDiagon Alley\nIt was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak."""
# Get the chapter contents as a list (the first element is chapter 1's content, the second is chapter 2's, and so on)
# Filtering out empty strings keeps the list length equal to the expected number of chapters
chapter_contents = [c for c in re.split(r'Chapter \d+\n[a-zA-Z ]+\n', test) if c]
chapter_contents

   
['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,\n',
 'For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.\n',
 'The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.\n',
 'He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.\n',
 'It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak.']

We can now build a chapter-contents list for each Harry Potter novel,

which means the real text analysis can begin.
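Since re.findall returns the headings and re.split returns the bodies in the same order, the two can also be zipped into a title-to-content mapping. A minimal sketch on an inlined two-chapter string (hypothetical toy data, not the real novel files):

```python
import re

text = ("Chapter 1\nThe Boy Who Lived\nbody of chapter one\n"
        "Chapter 2\nThe Vanishing Glass\nbody of chapter two\n")
pattern = r'Chapter \d+\n[a-zA-Z ]+\n'

# Headings, flattened to single-line keys like 'Chapter 1: The Boy Who Lived'
titles = [t.strip().replace('\n', ': ') for t in re.findall(pattern, text)]
# Bodies, with the empty leading fragment filtered out
bodies = [b for b in re.split(pattern, text) if b]
chapters = dict(zip(titles, bodies))
print(chapters)
```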

Data Analysis

Comparing Chapter Counts


   
import re
import matplotlib.pyplot as plt

colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x axis: novel titles
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]
# y axis: chapter counts
chapter_nums = []
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapter_contents = [c for c in re.split(pattern, raw_text) if c]
    chapter_nums.append(len(chapter_contents))

# Canvas size
plt.figure(figsize=(20, 10))
# Figure title, font size, bold
plt.title('Chapter Number of Harry Potter', fontsize=25, weight='bold')
# Colored bar chart
plt.bar(harry_potter_names, chapter_nums, color=colors)
# Tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Chapter Number', rotation=25, fontsize=20, weight='bold')
plt.show()

The chart shows that the last four novels in the series have more chapters (not a terribly useful finding, but good practice).

Lexical Richness

The measure used here is the ratio of total words to distinct words. A 100-word passage with no repeated words scores 100/100 = 1,

while a 100-word passage built from only 20 distinct words scores 100/20 = 5. Under this measure, a lower score means a richer vocabulary.
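To make the definition concrete, here is the same ratio computed on a toy sentence with plain whitespace tokenization (the full script below uses NLTK tokenization and stemming instead):

```python
sentence = "the cat sat on the mat and the dog sat too"
words = sentence.lower().split()
# Ratio of total tokens to distinct tokens
richness = len(words) / len(set(words))
print(len(words), len(set(words)), richness)  # 11 tokens, 8 distinct -> 1.375
```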


   
import re
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

plt.style.use('fivethirtyeight')
colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x axis: novel titles
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]
# y axis: lexical richness per novel
richness_of_words = []
stemmer = SnowballStemmer("english")
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    words = word_tokenize(raw_text)
    words = [stemmer.stem(w.lower()) for w in words]
    wordset = set(words)
    richness = len(words) / len(wordset)
    richness_of_words.append(richness)

# Canvas size
plt.figure(figsize=(20, 10))
# Figure title, font size, bold
plt.title('The Richness of Word in Harry Potter', fontsize=25, weight='bold')
# Colored bar chart
plt.bar(harry_potter_names, richness_of_words, color=colors)
# Tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Richness of Words', rotation=25, fontsize=20, weight='bold')
plt.show()

Sentiment Analysis

To trace the emotional arc of the series we use VADER, via the off-the-shelf vaderSentiment library. Its polarity_scores function returns:

  • neg: negative score

  • neu: neutral score

  • pos: positive score

  • compound: overall sentiment score


   
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
test = 'i am so sorry'
analyzer.polarity_scores(test)

{'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}

   
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x axis: running chapter index across the series
chapter_indexes = []
# y axis: per-chapter sentiment score
compounds = []
analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # Average the sentence-level compound scores within each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# Canvas size
plt.figure(figsize=(20, 10))
# Figure title, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
# Line chart
plt.plot(chapter_indexes, compounds, color='#A040A0')
# Tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
plt.show()

The curve is quite jagged; to smooth out the fluctuations, we define a moving-average helper.


   
import numpy as np
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Smoothing helper: simple moving average
def movingaverage(value_series, window_size):
    window = np.ones(int(window_size)) / float(window_size)
    return np.convolve(value_series, window, 'same')

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x axis: running chapter index across the series
chapter_indexes = []
# y axis: per-chapter sentiment score
compounds = []
analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # Average the sentence-level compound scores within each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# Canvas size
plt.figure(figsize=(20, 10))
# Figure title, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
# Raw per-chapter scores plus the smoothed curve
plt.plot(chapter_indexes, compounds, color='red')
plt.plot(movingaverage(compounds, 10), color='black', linestyle=':')
# Tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
plt.show()
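One caveat worth checking: np.convolve with mode 'same' pads the series with implicit zeros, so the smoothed curve dips artificially near both ends. A minimal sketch on a constant series makes the edge effect visible:

```python
import numpy as np

def movingaverage(value_series, window_size):
    # Same helper as above: uniform window, zero-padded at the edges
    window = np.ones(int(window_size)) / float(window_size)
    return np.convolve(value_series, window, 'same')

smoothed = movingaverage([1.0] * 10, 4)
print(smoothed)  # interior values stay 1.0; the first and last values dip below 1
```

If the edge dips matter for interpretation, mode 'valid' (which returns a shorter series with no padding) is an alternative.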



Reposted from: https://blog.csdn.net/weixin_38008864/article/details/103590609