Visit the page we are about to scrape. A word of warning: don't linger too long. The content is rich and entertaining, and half a day can vanish before you notice.
Look at the information the page contains: [title], [play count], [danmaku count], [uploader name], and so on.
The usual routine: press F12 and find where this data sits in the page source.
👆 The list of the 100 entries on the daily ranking
👆 Where the data sits inside each item
Once the structure is clear, we can start writing the crawler. First, the libraries it needs; install any you are missing with pip (pip install ***):
- BeautifulSoup4 (parses the HTML page)
- requests (sends the HTTP requests)
- datetime (appends the date to the file at the end)
- json (handles JSON-formatted data)
- time (adds a pause after each loop iteration to reduce the load on the server)
- os (file operations)
url = 'https://www.bilibili.com/ranking/all/0/0/3'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
- url is the address of the page we want data from
- headers is the request header; sending one makes the script look like a normal browser visit and lowers the odds of being flagged as a crawler (to find your own: in Chrome or Edge, type about:version into the address bar and copy the "User Agent" value)
- if response.status_code == 200 means the returned page is only parsed when the site responds normally (a slightly more defensive version of this request is sketched just below)
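If you would rather have a failed request raise an error than be skipped silently, the same request can be written like this; a minimal sketch of my own, not code from the original post:

import requests
from bs4 import BeautifulSoup

url = 'https://www.bilibili.com/ranking/all/0/0/3'
headers = {'User-Agent': 'Mozilla/5.0 ...'}  # paste your own user-agent string from about:version
response = requests.get(url, headers=headers, timeout=10)  # a timeout keeps a hung connection from stalling the script
response.raise_for_status()  # raises HTTPError for any 4xx/5xx response
soup = BeautifulSoup(response.text, 'html.parser')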
Now locate the data:
for item in soup.find_all(attrs={"class": "rank-item"}):
    no = item.find(attrs={"class": "num"})  # rank on the daily chart
    title = item.find(attrs={"class": "title"})
    web = item.find(attrs={"class": "detail"})
    web_detail = web.find('a')
    web_detail = web_detail['href'].replace("//space.bilibili.com/", "")  # the uploader's personal account ID
    tt = title.text.replace(",", "")  # the video title
    for details in item.find_all(attrs={"class": "data-box"}):  # drill into the elements whose class is data-box
        for detail in details:
            if detail.string is None:
                continue
            detail.string  # the play count, danmaku count, and uploader name
CSV treats every "," as a column separator, so the commas are stripped from tt before the title is written; otherwise a title containing a comma would spill into the next column and break the row layout.
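An alternative that keeps the commas is to let Python's built-in csv module quote the fields; a minimal sketch (not what the code in this post does):

import csv

# csv.writer automatically quotes any field containing a comma,
# so titles can keep their punctuation without corrupting the table
with open('day_rank.csv', 'w', encoding='gb18030', errors='ignore', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title", "plays", "danmaku", "uploader", "uploader id"])
    writer.writerow(["1", "a title, with a comma", "123456", "789", "someone", "270308437"])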
With all the data located, the next step is writing it into a CSV file.
The complete code:
from bs4 import BeautifulSoup
import requests
import datetime

date = datetime.datetime.now().strftime('%Y-%m-%d')
with open('F:/crawler_data/Bilibili/day_rank/day_rank_' + date + '.csv', 'w', encoding='gb18030', errors='ignore') as file:
    file.write("rank,title,plays,danmaku,uploader,uploader id\n")
    url = 'https://www.bilibili.com/ranking?spm_id_from=333.158.b_7072696d61727950616765546162.3'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48'}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # write the IDs of the 100 uploaders on the daily ranking into a txt file
        file_id_list = open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "w")
        for item in soup.find_all(attrs={"class": "rank-item"}):
            no = item.find(attrs={"class": "num"})  # rank on the daily chart
            title = item.find(attrs={"class": "title"})
            web = item.find(attrs={"class": "detail"})
            web_detail = web.find('a')
            web_detail = web_detail['href'].replace("//space.bilibili.com/", "")  # the uploader's personal account ID
            file_id_list.write(web_detail + "\n")
            tt = title.text.replace(",", "")  # the video title, commas stripped
            file.write("{},{}".format(no.text, tt))
            for details in item.find_all(attrs={"class": "data-box"}):
                for detail in details:
                    if detail.string is None:
                        continue
                    file.write(",")
                    file.write("{}".format(detail.string))  # the play count, danmaku count, and uploader name
            file.write(",")
            file.write("{}".format(web_detail))
            file.write("\n")
            print("---- fetching info for uploader id " + web_detail + " ----")
        file.write(date)
        print("---- done ----")
        file_id_list.close()
The scraped data
Analyzing the video data of each uploader on the daily ranking
While scraping the ranking, I saved the ID of every uploader on the daily chart into a txt file; an example of its contents follows.
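The format is one uploader ID per line (that is what file_id_list writes in the first script), so the file reads something like this, using the mid from the API examples below as an illustrative value:

270308437
(…and one line for each of the other 99 uploaders)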
At first I wanted to use a plain requests call, but the response came back without the data I needed.
Some digging showed that an uploader's personal page is loaded dynamically, so a different approach is required:
F12 → Network → refresh → select XHR → among the many requests, find
👆 The video-list JSON: https://api.bilibili.com/x/space/arc/search?mid=270308437&pn=1&ps=25&jsonp=jsonp
The uploader's follower JSON 👉 https://api.bilibili.com/x/relation/stat?vmid=270308437&jsonp=jsonp
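Such an endpoint can be checked straight from Python. A quick probe of the follower endpoint, using the same mid as above and the field names this post relies on later:

import json
import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # as before, borrow your browser's user-agent string
r = requests.get('https://api.bilibili.com/x/relation/stat?vmid=270308437&jsonp=jsonp', headers=headers)
stat = json.loads(r.text)
print(stat['data']['follower'])  # the follower count sits under data -> follower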
That completes the preparation. The libraries this part uses:
- import requests
- import time
- import datetime
- import os
- import json
Read the IDs from the txt file prepared earlier:
data = []  # holds the IDs of the 100 uploaders on the daily ranking
for line in open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "r"):  # read from the txt file
    line = line[:-1]  # drop the trailing newline
    data.append(line)
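As an aside, the same read can be written a little more safely (my variant, not the original): a with block closes the file handle, and strip() also copes with a last line that has no trailing newline:

with open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "r") as f:
    data = [line.strip() for line in f]  # one uploader ID per line, whitespace trimmed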
Because the script scrapes more than one uploader, the URLs are built from each ID like this:
for j in data:
    up_detail = 'https://api.bilibili.com/x/space/acc/info?mid=%s&jsonp=jsonp' % j
    up_fans = 'https://api.bilibili.com/x/relation/stat?vmid=%s&jsonp=jsonp' % j
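For example, with j = '270308437' the second template expands to https://api.bilibili.com/x/relation/stat?vmid=270308437&jsonp=jsonp, exactly the follower URL spotted in the XHR tab above.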
The responses come back as JSON, which the json library parses:
response_fans = requests.get(up_fans, headers=headers)
response_detail = requests.get(up_detail, headers=headers)
text_fans = json.loads(response_fans.text)
text_detail = json.loads(response_detail.text)
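(requests can also decode JSON itself: response_fans.json() is equivalent to json.loads(response_fans.text).)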
Get the uploader's follower count:
res_fans = text_fans['data']
follower = str(res_fans['follower'])  # the uploader's follower count
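For reference, the slice of the relation/stat payload that this code touches looks roughly like the following; the value is illustrative and the real response carries more fields:

{"data": {"follower": 1234567}}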
Get the number of pages of videos:
res_page = text_page['data']['page']
page = (res_page['count'] + 29) // 30  # total video count at 30 per page, rounded up
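A quick sanity check on the arithmetic: at 30 videos per page, an uploader with 61 videos spans three pages, and (61 + 29) // 30 == 3; one with 25 videos gives (25 + 29) // 30 == 1, so the single page is still fetched. The complete code below then walks the pages with range(1, page + 1).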
Get each video's title, av number, comment count, play count, and duration:
res = text['data']['list']['vlist']
for item in res:
    title = str(item['title'])  # video title
    av = str(item['aid'])  # video av number
    comment = str(item['comment'])  # video comment count
    play = str(item['play'])  # video play count
    video_length = str(item['length'])  # video duration
The complete code:
import requests
import json
import time
import datetime
import os

date = datetime.datetime.now().strftime('%Y-%m-%d')
data = []  # holds the IDs of the 100 uploaders on the daily ranking
for line in open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "r"):
    line = line[:-1]  # drop the trailing newline
    data.append(line)
path = 'F:/crawler_data/Bilibili/up_detail/' + date + '/'
isExists = os.path.exists(path)
if not isExists:
    os.makedirs(path)
    print("directory created")
for j in data:
    with open('F:/crawler_data/Bilibili/up_detail/' + date + '/' + j + '.csv', 'w', encoding='gb18030', errors='ignore') as file:
        print("-- scraping video info for uploader id " + j + " --")
        file.write("av,title,comments,length,plays")
        file.write("\n")
        up_detail = 'https://api.bilibili.com/x/space/acc/info?mid=%s&jsonp=jsonp' % j
        up_fans = 'https://api.bilibili.com/x/relation/stat?vmid=%s&jsonp=jsonp' % j
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.50'}
        response_fans = requests.get(up_fans, headers=headers)
        response_detail = requests.get(up_detail, headers=headers)  # profile info; fetched but not used further below
        text_fans = json.loads(response_fans.text)
        text_detail = json.loads(response_detail.text)
        res_fans = text_fans['data']
        follower = str(res_fans['follower'])  # the uploader's follower count
        url_page = 'https://api.bilibili.com/x/space/arc/search?mid=%s&ps=30&tid=0&pn=1&keyword=&order=pubdate&jsonp=jsonp' % j
        response_page = requests.get(url_page, headers=headers)
        text_page = json.loads(response_page.text)
        res_page = text_page['data']['page']
        page = (res_page['count'] + 29) // 30  # total video count at 30 per page, rounded up
        for i in range(1, page + 1):
            url = 'https://api.bilibili.com/x/space/arc/search?mid=%s&ps=30&tid=0&pn=%s&keyword=&order=pubdate&jsonp=jsonp' % (j, i)
            response = requests.get(url, headers=headers)
            text = json.loads(response.text)
            res = text['data']['list']['vlist']
            print("------ scraping page " + str(i) + " ------")
            for item in res:
                title = str(item['title']).replace(",", "")  # video title, commas stripped so they don't break the CSV
                av = str(item['aid'])  # video av number
                comment = str(item['comment'])  # video comment count
                play = str(item['play'])  # video play count
                video_length = str(item['length'])  # video duration
                file.write("{},{},{},{},{}".format(av, title, comment, video_length, play))
                file.write("\n")
                print("----- scraping info for video av " + av + " -----")
        print("----- done -----")
        file.write("{}".format(follower))
    time.sleep(5)  # pause between uploaders to go easy on the server
A sample of the scraped data
Scraping stage: done
Reposted from: https://blog.csdn.net/qq_43201710/article/details/104643688