1 Introduction
In this article, 知识追寻者 (the author) picks a best-selling lipstick on Taobao and analyzes its review data. After working through it, you will be able to use the word-cloud library and basic word segmentation to do simple data analysis.
2 Logging in to Taobao and Scraping the Lipstick Reviews
URL: https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9
First, log in to Taobao and open the product page.
Next, open the Network tab in the browser developer tools and locate the review request as follows:
copy a distinctive phrase from the review section and paste it into the Network search box to pinpoint the review URL.
Click the matched URL and note its key fields:
- request URL
- referer
- cookie
- User-Agent
Copy everything except the URL into the request headers, for example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'referer': 'https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9',
    'cookie': '.........'
}
The analysis needs a large sample to back it up, so we scrape 250 pages of reviews. currentPage is the page number; we only need to change it between requests. The URL to fetch looks like this:
https://rate.tmall.com/list_detail_rate.htm?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3&currentPage=1&.....
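Since only currentPage varies between requests, the paginated URL can be built from the captured parameters. A minimal sketch, using only the query parameters visible in the URL above (the remaining parameters captured in DevTools would be appended the same way):

```python
from urllib.parse import urlencode

# Base endpoint and parameters copied from the request captured in DevTools
BASE = 'https://rate.tmall.com/list_detail_rate.htm'
PARAMS = {
    'itemId': '594188372494',
    'spuId': '1439508724',
    'sellerId': '4144020062',
    'order': '3',
}

def page_url(page):
    # merge the fixed parameters with the page number being requested
    query = dict(PARAMS, currentPage=str(page))
    return BASE + '?' + urlencode(query)

print(page_url(1))
# → https://rate.tmall.com/list_detail_rate.htm?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3&currentPage=1
```

The `page_url` helper is hypothetical, named here for illustration; the full script below simply concatenates strings, which works just as well.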
Then send the requests, scrape the data, and save it to Excel. The full code:
# -*- coding: utf-8 -*-
import re
import requests
import pandas as pd
import time
from urllib import error
rate_list = []
classify = []
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
'referer':'https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9',
'cookie':'........'
}
for page in range(1,250,1):
try:
front = 'https://rate.tmall.com/list_detail_rate.htm?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3¤tPage='
rear = '&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvwvvEvbQvU9CkvvvvvjiPn25p6jtbn2LwzjivPmPvsjYRR2M96jDvP259AjibRsujvpvhvvpvv8wCvvpvvUmmvphvC9v9vvCvpbyCvm9vvvvvphvvvvvv96Cvpv3Zvvm2phCvhRvvvUnvphvppvvv96CvpCCvkphvC99vvOCgo8yCvv9vvUmgOg9MyvyCvhQUaGyvClsWa4AU%2B2DkLuc61WkwVzBO0f0DyBvOJ1kHsX7veC6AxYjxAfyp%2B3%2BIaNoxfBAKfvDrgjc6%2BulsbdmxfwkK5kx%2Fgj7QD46w2QhvCPMMvvvtvpvhvvvvvv%3D%3D&needFold=0&_ksTS=1585445591007_822&callback=jsonp823'
url = front + str(page) + rear
data = requests.get(url,headers=headers).text
rate = re.findall('"rateContent":"(.*?)","fromMall"',data)
clazz = re.findall('"auctionSku":"(.*?)","anony"',data)
rate_list.append(rate)
classify.append(clazz)
time.sleep(8)
print('当前页%s'% page)
except error.URLError as e:
print(e)
frame = pd.DataFrame()
frame['评论'] = rate_list
frame['分类'] = classify
frame.to_excel('口红评论分类.xlsx')
After a long wait, the data is ready.
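Note that the scraper stores one list of reviews per page, so each Excel cell holds the string repr of a list when read back. The analysis script below strips the brackets by hand; an alternative sketch (with a hypothetical two-cell sample standing in for the real column) is to parse the cells back into Python lists with `ast.literal_eval`:

```python
import ast

# Sample cell values as they come back from the Excel file: each is the
# string repr of a per-page list of reviews (hypothetical sample data)
cells = ["['好用', '颜色漂亮']", "['很滋润']"]

comments = []
for cell in cells:
    # literal_eval safely parses the stringified list back into a real list
    comments.extend(ast.literal_eval(cell))

print(comments)  # → ['好用', '颜色漂亮', '很滋润']
```

This avoids splitting on commas inside the review text, though it assumes the stored lists are valid Python literals (no unescaped quotes in the reviews).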
3 Analyzing the Data
# -*- coding: utf-8 -*-
import re
import jieba  # word segmentation library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator  # word-cloud library

frame = pd.read_excel('../口红评论分类.xlsx')
values = frame['评论'].values.tolist()
segments = []
for value in values:
    # each cell is a stringified list of reviews; strip the brackets and split
    slic = value.replace('[', '', 1).replace(']', '', 1)
    for val in slic.split(','):
        seg_list = jieba.cut(val, cut_all=False)
        segments.append(seg_list)
words = []
# keep only Chinese characters, digits, and ASCII letters
for segment in segments:
    for seg in segment:
        sub_str = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", seg)
        if sub_str != '':
            words.append(sub_str)
# count word frequencies
word_count = pd.Series(data=words).value_counts()
wc = WordCloud(font_path=r"C:\Windows\WinSxS\amd64_microsoft-windows-font-truetype-dengxian_31bf3856ad364e35_10.0.18362.1_none_2f009e78b33b73a9\Dengb.ttf",
               background_color='white', width=350,
               height=276, max_font_size=80,
               max_words=1000)
# build the cloud from the top 100 words
wc.fit_words(word_count[:100])
# load the background image
image = Image.open(r'C:\mydata\generator\py\main.jpg')
graph = np.array(image)
# generate a color palette from the background image
image_color = ImageColorGenerator(graph)
# recolor the cloud with that palette
wc.recolor(color_func=image_color)
wc.to_file('lipstick.png')
# name the figure
plt.figure("口红评论")
# display the word cloud as an image
plt.imshow(wc)
# hide the axes
plt.axis("off")
plt.show()
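The `re.sub` filter in the script above can be checked in isolation. A minimal example (with made-up sample strings) showing what it keeps and what it drops:

```python
import re

# Same filter as in the script: strip everything that is not a Chinese
# character (\u4e00-\u9fa5), a digit (0-9), or an ASCII letter (A-Z, a-z)
pattern = u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])"

samples = ["好看！！", "颜色 很正~", "mac 999"]
cleaned = [re.sub(pattern, "", s) for s in samples]
print(cleaned)  # → ['好看', '颜色很正', 'mac999']
```

Punctuation (including fullwidth punctuation), whitespace, and emoji are all removed, so segmentation artifacts do not end up in the word count.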
The generated image is shown below. Overall, the buyers are quite satisfied with this lipstick and like it.
4 References
[Word cloud] https://blog.csdn.net/csdn2497242041/article/details/77175112
[Word cloud] https://blog.csdn.net/FontThrone/article/details/72782499
[Chinese regex matching] https://blog.csdn.net/jlulxg/article/details/84650683
Reposted from: https://blog.csdn.net/youku1327/article/details/105179988