How to Scrape the Answers and Comments of a Zhihu Question?

We will use the question "Why hasn't traditional Chinese medicine gained outside recognition?" (为什么中医没有得到外界认可?) as an example to discuss how to scrape the answers and comments of a Zhihu question.

Scraping web data usually involves the following three steps.

Step 1: Page analysis. Find the address from which the data you want is actually served, and work out the pattern behind those URLs.

Step 2: Fetch the data, then clean and organize it into structured form.

Step 3: Store the data for later analysis.

The sections below cover each step in detail.

I. Page Analysis

Open the page we want to scrape in the Chrome browser:

https://www.zhihu.com/question/370697253

Press F12 to open the developer tools, click the "Network" tab, then select the "XHR" filter.

First click the small circle icon on the left to clear the request list, which makes the target request easier to find, then press "F5" to refresh the page.

In the list, find the URL that serves the answer data; clicking it shows the JSON-formatted response in the "Preview" panel.

Now observe the URL for each page of data.

Page 1:

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default

Page 2:

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default

Page 3:

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=10&platform=desktop&sort_by=default

Comparing them, everything is identical except the value of the offset parameter, which starts at 0 and grows by 5 per page (matching limit=5). On the last page, the paging -> is_end property in the JSON is true, so it can serve as the stopping condition.
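
To make the rule concrete, here is a minimal sketch that prints the URL of the first few pages. The long include parameter is abbreviated as "..." and must be copied in full from the URLs above when actually making requests.

# Minimal sketch of the pagination rule: only offset changes, in steps of 5.
# NOTE: '...' stands for the full include parameter shown above.
base = ('https://www.zhihu.com/api/v4/questions/370697253/answers'
        '?include=...&limit=5&offset={0}&platform=desktop&sort_by=default')

for page in range(3):
    print(base.format(page * 5))  # offset = 0, 5, 10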

That completes the page analysis for the answers. Next, let us analyze the comments attached to each answer.

Following the same steps as above, find the real network address that serves the comments.

The URL for each page of comments looks like this:

Page 1:

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=0&status=open

Page 2:

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=20&status=open

Page 3:

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=40&status=open

"1014424784" is the id of this particular answer; different answers have different ids. The URLs above all belong to the comments of the same answer: once again, only the value of the offset parameter differs, starting at 0 and growing by 20 per page (matching limit=20). On the last page, the paging -> is_end property in the JSON is true.
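
As a sketch of how these pages can be walked until is_end becomes true (using the example answer id above; note that Zhihu may reject requests that lack the User-Agent header used in the full code below):

import requests

# Walk the comment pages of one answer until the API reports the last page.
url_tpl = ('https://www.zhihu.com/api/v4/answers/1014424784/root_comments'
           '?order=normal&limit=20&offset={0}&status=open')
offset = 0
while True:
    page = requests.get(url_tpl.format(offset), timeout=(3, 7)).json()
    print(len(page['data']), 'comments at offset', offset)
    if page['paging']['is_end']:  # True once the last page is reached
        break
    offset += 20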

II. Common Libraries

(1) requests

requests sends network requests and returns the response data.

Official documentation:

https://docs.python-requests.org/zh_CN/latest/user/quickstart.html
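
A minimal example of how it is used (the User-Agent value here is only illustrative):

import requests

# Send a GET request and inspect the response.
res = requests.get('https://www.zhihu.com/question/370697253',
                   headers={'User-Agent': 'Mozilla/5.0'},
                   timeout=(3, 7))
print(res.status_code)  # e.g. 200 on success
print(len(res.text))    # size of the response body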

(2) json

JSON is a lightweight data-interchange format: a text format completely independent of any programming language. Back-end applications typically wrap their response data in JSON.

Official documentation:

https://docs.python.org/zh-cn/3.7/library/json.html
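
A minimal example of the two core calls for working with JSON responses:

import json

# loads: JSON string -> Python objects; dumps: the reverse direction.
raw = '{"paging": {"is_end": false}, "data": []}'
obj = json.loads(raw)
print(obj['paging']['is_end'])              # False
print(json.dumps(obj, ensure_ascii=False))  # back to a JSON string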

(3) lxml

lxml is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it.

Official documentation:

https://lxml.de/index.html
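
A minimal example of the extraction pattern used in the code below, which pulls the text nested inside <p> tags out of an HTML fragment:

from lxml import etree

# Parse an HTML fragment and collect all text nested inside <p> tags.
html = '<div><p>first paragraph</p><p>second</p></div>'
print(''.join(etree.HTML(html).xpath('//p//text()')))  # -> 'first paragraphsecond'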

Since space here is limited, these scraping-related libraries will each be introduced in separate posts.

III. Complete Code

The GetAnswers function scrapes the answers to the question.

After structuring, each answer record has the following fields: the answer's ID (answer_id), the author's name (author), the posting time (created_time), and the answer body (content).

The GetComments function scrapes the comments on each answer.

After structuring, each comment record has the following fields: the comment's ID (answer_id_comment_id), the author's name (author), the posting time (created_time), and the comment body (content).

All of this data is stored in the file "知乎评论.csv". Note that the file shows garbled Chinese characters when opened in Excel; for a solution, see the earlier post "How to fix the 'gbk' codec can't encode error when writing CSV in Python 3?".
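
One common fix is to write the file with a UTF-8 byte-order mark so that Excel detects the encoding automatically, for example:

import csv

# 'utf-8-sig' prepends a BOM, which lets Excel recognize UTF-8 CSV files.
with open('知乎评论.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'created_time', 'author', 'content'])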

import requests
import json
import time
import csv
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36',
}

# All rows (answers and comments alike) go into a single CSV file.
csvfile = open('知乎评论.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(csvfile)
writer.writerow(['id', 'created_time', 'author', 'content'])


def GetAnswers():
    i = 0  # answer offset; advances by 5 (the page size) per iteration
    while True:
        url = 'https://www.zhihu.com/api/v4/questions/370697253/answers' \
              '?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%' \
              '2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%' \
              '2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%' \
              '2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%' \
              '2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%' \
              '2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%' \
              '2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={0}&platform=desktop&' \
              'sort_by=default'.format(i)

        # Crude retry loop: keep requesting until the call succeeds.
        state = 1
        while state:
            try:
                res = requests.get(url, headers=headers, timeout=(3, 7))
                state = 0
            except requests.RequestException:
                continue

        res.encoding = 'utf-8'
        jsonAnswer = json.loads(res.text)
        is_end = jsonAnswer['paging']['is_end']  # True on the last page

        for data in jsonAnswer['data']:
            l = list()
            answer_id = str(data['id'])
            l.append(answer_id)
            l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(data['created_time'])))
            l.append(data['author']['name'])
            # Keep only the text nested inside <p> tags of the answer's HTML body.
            l.append(''.join(etree.HTML(data['content']).xpath('//p//text()')))
            writer.writerow(l)
            print(l)

            # Fetch comments only when commenting is open and comments exist.
            if not data['admin_closed_comment'] and data['can_comment']['status'] and data['comment_count'] > 0:
                GetComments(answer_id)

        i += 5
        print('Fetched page {0} of answers'.format(int(i / 5)))

        if is_end:
            break

        time.sleep(1)


def GetComments(answer_id):
    j = 0  # comment offset; advances by 20 (the page size) per iteration
    while True:
        url = 'https://www.zhihu.com/api/v4/answers/{0}/root_comments?order=normal&limit=20&offset={1}&status=open'.format(
            answer_id, j)

        # Crude retry loop: keep requesting until the call succeeds.
        state = 1
        while state:
            try:
                res = requests.get(url, headers=headers, timeout=(3, 7))
                state = 0
            except requests.RequestException:
                continue

        res.encoding = 'utf-8'
        jsonComment = json.loads(res.text)
        is_end = jsonComment['paging']['is_end']

        for data in jsonComment['data']:
            l = list()
            comment_id = str(answer_id) + "_" + str(data['id'])
            l.append(comment_id)
            l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(data['created_time'])))
            l.append(data['author']['member']['name'])
            l.append(''.join(etree.HTML(data['content']).xpath('//p//text()')))
            writer.writerow(l)
            print(l)

            # Child comments (replies) are nested under each root comment.
            for child_comments in data['child_comments']:
                l.clear()
                l.append(str(comment_id) + "_" + str(child_comments['id']))
                l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(child_comments['created_time'])))
                l.append(child_comments['author']['member']['name'])
                l.append(''.join(etree.HTML(child_comments['content']).xpath('//p//text()')))
                writer.writerow(l)
                print(l)
        j += 20
        if is_end:
            break

        time.sleep(1)


GetAnswers()
csvfile.close()

IV. Summary

This post is the material presented at the 75th academic seminar of the Big Data and Philosophy & Social Sciences Laboratory. If you are interested, reply "资料下载" ("download materials") to our WeChat account to get the source code, along with the 8,000-plus records scraped from this question.


Reprinted from: https://blog.csdn.net/LSGO_MYP/article/details/116240402