One article to make you a lifelong meme-battle king. From now on, Mom never has to worry about me running out of images for a meme battle. Tell me that isn't slick; this move will make everyone jealous. Next time a WeChat group breaks into a meme battle, you won't have to steal other people's images.
Bored after work? Let's do a round of meme battling. As usual, we head to the Doutula site:
https://www.doutula.com/article/list
Press F12 to open DevTools and refresh the page; you can see the request list there.
First, let's grab the User-Agent from it:
header = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
}
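As a quick sanity check (a minimal sketch, not part of the original script; the status-code print is just for illustration), you can fire one request with this header against the list page and confirm it comes back:

import requests

# reuse the header defined above
resp = requests.get('https://www.doutula.com/article/list', headers=header)
print(resp.status_code)  # 200 means the page came back fine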
Click into any image to reach its detail page. We want the meme set's title; the page URL looks like:
https://www.doutula.com/article/detail/4531369
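The detail page's markup isn't shown here, so treat this as a sketch under the assumption that the set's name appears in the page's <title> tag; it just fetches the detail page and prints that title:

import requests
from bs4 import BeautifulSoup

detail_url = 'https://www.doutula.com/article/detail/4531369'
resp = requests.get(detail_url, headers=header)        # reuse the header defined above
soup = BeautifulSoup(resp.content, 'lxml')
print(soup.title.string if soup.title else '')         # assumed: the set's name is in <title>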
Import the usual packages:
import random
import requests
from bs4 import BeautifulSoup
import urllib
import os
For now we'll only crawl the first and second pages:
BASE_URL = 'https://www.doutula.com/photo/list/?page='
URL_LIST = []
for x in range(1, 3):  # pages 1 and 2
    REAL_URL = BASE_URL + str(x)
    URL_LIST.append(REAL_URL)
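After this loop, URL_LIST holds the two list-page URLs, https://www.doutula.com/photo/list/?page=1 and https://www.doutula.com/photo/list/?page=2. Bump the upper bound of range() if you want more pages.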
The main crawler code:
def get_url(url):
    # Rotate through a few User-Agents so the requests look less like a bot
    my_headers = [
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
    ]
    header = {
        "User-Agent": random.choice(my_headers)
    }
    resp = requests.get(url, headers=header)              # fetch the list page
    soup = BeautifulSoup(resp.content, "lxml")            # parse it with lxml
    img_list = soup.find_all('img', class_='img-responsive lazy image_dta')  # locate the meme <img> tags
    os.makedirs('./doutufile', exist_ok=True)             # make sure the output folder exists
    for img in img_list:
        imgurl = img['data-original']                     # real image URL sits in data-original (lazy loading)
        pic = requests.get(imgurl, headers=header).content
        filename = imgurl.split('/')[-1]                  # keep the original file name so pages don't overwrite each other
        with open('./doutufile/' + filename, 'wb') as f:
            f.write(pic)

# Crawl every page we queued up
for url in URL_LIST:
    get_url(url)
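If you raise the page count, it's worth pacing the downloads. This is just a sketch (time.sleep and the random delay are my addition, not part of the original script) that replaces the plain loop above:

import time

for url in URL_LIST:
    get_url(url)
    time.sleep(random.uniform(1, 3))  # wait 1-3 seconds between pages to go easy on the server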
The crawl succeeded; here are the images.
That's it. Congratulations, and may you become a meme-battle king soon.
If you need the source code, follow the WeChat public account 志学Python and reply 斗图帝 to get the download link.
Reposted from: https://blog.csdn.net/qq_36772866/article/details/106009547