小言_互联网's blog

How do proxy IPs help a crawler scrape the Douban short reviews of 你好,李焕英 (Hi, Mom)?


Douban short-review link for 你好,李焕英 (Hi, Mom):
https://movie.douban.com/subject/34841067/comments?start=20&limit=20&status=P&sort=new_score

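Breaking down the URL to scrape:
34841067: the movie ID
start=20: offset of the first comment on the page (0 for page 1, 20 for page 2, and so on)
limit=20: number of comments per page
Code: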

# i is the 1-based page number, so page 1 starts at offset 0
url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' % (movie_id, (i - 1) * 20)
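The page-to-offset mapping described above can be sketched as a small helper. This is only an illustration, assuming 1-based page numbers; `build_comment_url` and `PAGE_SIZE` are hypothetical names that do not appear in the original code.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # matches limit=20 in the URL above


def build_comment_url(movie_id, page):
    """Return the short-review URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    query = urlencode({
        "start": (page - 1) * PAGE_SIZE,  # offset of the first comment on this page
        "limit": PAGE_SIZE,
        "status": "P",
        "sort": "new_score",
    })
    return "https://movie.douban.com/subject/%s/comments?%s" % (movie_id, query)


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(build_comment_url("34841067", page))
```

Building the query with `urlencode` keeps the parameters in one place, which makes it harder to produce a negative or misaligned `start` value.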
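Press F12 in Google Chrome to open the developer tools, inspect the page source, and find the markup for the short reviews to see which div and tag contain them:
The reviews sit under div[id='comments'], inside each div[class='comment-item'], in the first span[class='short']. Extract the review text with a regular expression: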
url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
      % (movie_id, (i - 1) * 20)
req = requests.get(url, headers=headers)
req.encoding = 'utf-8'
# Capture the text inside each span[class='short']
comments = re.findall(r'<span class="short">(.*)</span>', req.text)
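The extraction step can be tried offline before any request is sent. The sketch below runs a similar pattern against made-up markup: `SAMPLE_HTML` is a hypothetical sample shaped like the structure described above, not real Douban output, and `extract_short_comments` is an illustrative name. It uses a non-greedy group with `re.S`, a small variation on the pattern above, so that two spans on one line or a review spanning several lines would still be split correctly.

```python
import re
from html import unescape

# Hypothetical sample markup, shaped like the div/span structure described above
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item">
    <span class="short">Funny and moving.</span>
  </div>
  <div class="comment-item">
    <span class="short">Cried at the end &amp; laughed too.</span>
  </div>
</div>
"""

# Non-greedy capture; re.S lets '.' match newlines inside a review
SHORT_RE = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_short_comments(html):
    """Return the text of every span[class='short'], with HTML entities decoded."""
    return [unescape(text).strip() for text in SHORT_RE.findall(html)]


if __name__ == "__main__":
    for comment in extract_short_comments(SAMPLE_HTML):
        print(comment)
```

A regular expression is enough for markup this regular; if the page structure changes often, an HTML parser would probably be more robust.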

Full code:

import re
from PIL import Image
import requests
import jieba
import matplotlib.pyplot as plt
import numpy as np
from os import path
from wordcloud import WordCloud, STOPWORDS

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}
d = path.dirname(__file__)


def spider_comment(movie_id, page):
    """
    Scrape short reviews.
    :param movie_id: movie ID
    :param page: number of pages to scrape, starting from the first
    :return: list of review texts
    """
    comment_list = []
    for i in range(1, page + 1):
        # Page i starts at offset (i - 1) * 20
        url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' \
              % (movie_id, (i - 1) * 20)
        req = requests.get(url, headers=headers)
        req.encoding = 'utf-8'
        # Accumulate across pages instead of overwriting the list on each pass
        comment_list.extend(re.findall(r'<span class="short">(.*)</span>', req.text))
        print("Current page: %s, total comments: %s" % (i, len(comment_list)))
    return comment_list


def wordcloud(comment_list):
    wordlist = jieba.lcut(' '.join(comment_list))
    text = ' '.join(wordlist)
    print(text)
    # Open the mask image with PIL and convert it to a NumPy array
    background_image = np.array(Image.open(path.join(d, "wordcloud.png")))
    cloud = WordCloud(
        font_path="simsun.ttc",
        background_color="white",
        mask=background_image,  # mask image that shapes the cloud
        stopwords=STOPWORDS,
        width=2852,
        height=2031,
        margin=2,
        max_words=6000,  # maximum number of words to show
        # stopwords={'企业'},  # custom stop words, which are left out of the cloud
        max_font_size=250,  # largest font size
        random_state=1,  # random seed, for a reproducible layout and colors
        scale=1)  # scaling factor for the rendered image
    # Generate the word cloud from the text
    cloud.generate(text)
    cloud.to_image()
    cloud.to_file("cloud.png")
    plt.imshow(cloud)
    plt.axis("off")
    plt.show()


# Entry point
if __name__ == '__main__':
    movie_id = '34841067'
    page = 11
    comment_list = spider_comment(movie_id, page)
    wordcloud(comment_list)
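The title asks about proxy IPs, while the script sends every request from the local address. As a hedged sketch of how a proxy could be plugged in: `requests.get` accepts a `proxies` mapping, and the helpers below only build and rotate such mappings. The names `build_proxies` and `pick_proxy`, and the `127.0.0.1` addresses, are placeholders assumed for illustration; a real proxy provider supplies its own host, port, and credentials.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a proxies mapping in the form requests accepts (hypothetical helper)."""
    auth = ""
    if username and password:
        # Percent-encode credentials so characters like '@' do not break the URL
        auth = "%s:%s@" % (quote(username, safe=""), quote(password, safe=""))
    proxy_url = "http://%s%s:%d" % (auth, host, port)
    return {"http": proxy_url, "https": proxy_url}


def pick_proxy(pool, attempt):
    """Rotate through a list of proxies mappings round-robin (hypothetical helper)."""
    return pool[attempt % len(pool)]


if __name__ == "__main__":
    # Placeholder addresses, not real proxy servers
    pool = [build_proxies("127.0.0.1", 8080), build_proxies("127.0.0.1", 8081)]
    for attempt in range(3):
        print(pick_proxy(pool, attempt))
    # Inside spider_comment, the request could then become:
    #   requests.get(url, headers=headers, proxies=pick_proxy(pool, i), timeout=10)
```

Spreading pages across several exit addresses is the usual reason to pair a crawler with proxies, since it lowers the request rate seen from any single IP.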

Parts of this article are sourced from the internet; please get in touch for removal in case of infringement.


Reposted from: https://blog.csdn.net/zhimaHTTP/article/details/114086365