Feidao's Blog

Web Scraping (110): One Article to Make You a Lifelong Meme-Battle Champion


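One article to make you a lifelong meme-battle champion. Mom no longer has to worry about me running out of images in a meme battle. Slick, right? The next time a meme battle breaks out in a WeChat group, there is no need to swipe other people's images.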

With some free time after work, let's grab a batch of memes. As usual, we start from the Doutula website:

https://www.doutula.com/article/list

Press F12 to open the browser's developer tools and refresh the page; the list of requests made by the page appears.

First, copy the User-Agent from the request headers:

header={
  'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
}
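
As a minimal sketch of how such a header dictionary ends up on a request, the snippet below builds a request object with the standard library's `urllib.request` and reads the header back without sending anything over the network. The article itself sends requests with `requests.get(url, headers=header)`; `urllib` is swapped in here only so the check runs offline.

```python
import urllib.request

# Same User-Agent string as copied from the browser.
header = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
}

# Build (but do not send) a request that carries the header.
req = urllib.request.Request('https://www.doutula.com/article/list', headers=header)

# urllib stores header names via str.capitalize(), hence 'User-agent'.
sent_ua = req.get_header('User-agent')
print(sent_ua)
```

Without a browser-like User-Agent, some sites reject or throttle requests from scripts, which is presumably why the header is copied first.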

Click any image to open its detail page. We need the title of the meme pack; the page link looks like this:

https://www.doutula.com/article/detail/4531369
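
The article does not show how the title is read from the detail page. One possible sketch, assuming the title can be taken from the page's `<title>` element (the real markup may keep it elsewhere), uses the standard library's `html.parser`. The names `TitleParser` and `extract_title` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == 'title' and self.title is None:
            self._in_title = True
            self.title = ''

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False


def extract_title(html):
    """Return the stripped <title> text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


# Made-up HTML for illustration; the real detail page's markup may differ.
sample_html = '<html><head><title> Funny cat pack </title></head><body></body></html>'
print(extract_title(sample_html))
```

With BeautifulSoup, which the article uses later, the equivalent lookup would be `soup.title.string`, under the same assumption about where the title lives.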

Import the commonly used packages:

import random
import requests
from bs4 import BeautifulSoup
import urllib
import os

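For now we crawl only the first and second pages of the photo list. Because range() excludes its stop value, range(1, 3) is needed to cover pages 1 and 2.

BASE_URL = 'https://www.doutula.com/photo/list/?page='
URL_LIST = []
for x in range(1, 3):  # pages 1 and 2
    REAL_URL = BASE_URL + str(x)
    URL_LIST.append(REAL_URL)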
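
Because `range()` excludes its stop value, an off-by-one is easy to make when choosing which pages to crawl. A small helper with inclusive bounds makes the intent explicit. The name `build_page_urls` is hypothetical; `BASE_URL` is the one from the article.

```python
BASE_URL = 'https://www.doutula.com/photo/list/?page='


def build_page_urls(first_page, last_page):
    """Return list-page URLs for first_page..last_page, both inclusive."""
    # range() excludes its stop value, so add 1 to include last_page.
    return [BASE_URL + str(page) for page in range(first_page, last_page + 1)]


URL_LIST = build_page_urls(1, 2)
print(URL_LIST)
```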

The main scraping code:





def get_url(url):
    # Pool of User-Agent strings; one is picked at random per call
    my_headers = [
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
    ]
    header = {
        "User-Agent": random.choice(my_headers)
    }
    resp = requests.get(url, headers=header)  # fetch the list page
    soup = BeautifulSoup(resp.content, "lxml")  # parse the HTML with lxml
    # Find the lazy-loaded image tags by their class attribute
    IMG_LIST = soup.find_all('img', 'img-responsive lazy image_dta')
    os.makedirs('./doutufile', exist_ok=True)  # create the output folder if missing
    # Numbering restarts at 1 on every call, so a later page overwrites an earlier one
    num = 1
    for img in IMG_LIST:
        imgurl = img['data-original']  # the real image URL is in data-original
        pic = requests.get(imgurl, headers=header).content
        with open('./doutufile/' + str(num) + '.jpg', 'wb') as f:
            f.write(pic)
        num = num + 1
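
The loop above numbers files from 1 on every call and always uses a `.jpg` suffix, so crawling a second page would overwrite the first page's files, and GIFs would be saved with the wrong extension. A minimal sketch of one way around this follows; `build_filename` and the example URL are hypothetical and not part of the original code.

```python
import os
from urllib.parse import urlparse


def build_filename(page, index, img_url, folder='./doutufile'):
    """Return a save path that is unique per page and keeps the real extension."""
    # Take the extension from the URL path, ignoring any query string.
    ext = os.path.splitext(urlparse(img_url).path)[1]
    if not ext:
        ext = '.jpg'  # fallback when the URL carries no extension
    return os.path.join(folder, '{}_{}{}'.format(page, index, ext))


# Hypothetical image URL, for illustration only.
print(build_filename(2, 1, 'https://example.com/images/abc123.gif'))
```

Inside the download loop, the `open()` call could then use `build_filename(page, num, imgurl)`, with the page number passed into `get_url` as an extra argument. Note also that the listing only defines `get_url`; something like `for url in URL_LIST: get_url(url)` is still needed to start the crawl.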

The crawl succeeded. Here are some of the images:


That's it. Here's to you becoming a meme-battle champion soon.

If you need the source code, follow the WeChat official account 《志学Python》 and reply 《斗图帝》 in its chat to get the download link.


Reposted from: https://blog.csdn.net/qq_36772866/article/details/106009547