A crawler that doesn't crawl girl pictures is not a good crawler. -- Lu Xun
As usual, the first step in scraping the pictures is to analyze the page's DOM.
The img element here is just the cover image; if we only wanted covers we could stop right there. But we want every picture in each album, so what we actually grab is the a link to the detail page, e.g. mmonly.cc/mmtp/xgmn/181.
import requests
from lxml import html

# Walk every list page and print the detail-page link of each album.
for page in range(1, 852):
    url = 'http://www.mmonly.cc/mmtp/list_9_%s.html' % page
    response = requests.get(url, verify=False).text
    selector = html.fromstring(response)
    # Each album sits in a div.ABox; its <a> points at the detail page.
    imgEle = selector.xpath('//div[@class="ABox"]/a')
    for img in imgEle:
        imgUrl = img.xpath('@href')[0]
        print(imgUrl)
That gives us all of the album links: 24 per list page, 20,000+ in total. XPath does the extraction here.
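If the two XPath calls look opaque, here is a minimal sketch of the same queries run against a made-up HTML fragment (the markup below is an assumption, modeled on what the site's list page presumably looks like):

from lxml import html

snippet = '''
<div class="ABox">
    <a href="http://www.mmonly.cc/mmtp/xgmn/181.html">
        <img src="cover.jpg">
    </a>
</div>
'''

selector = html.fromstring(snippet)
links = selector.xpath('//div[@class="ABox"]/a')  # one element per album box
for a in links:
    print(a.xpath('@href')[0])                    # the detail-page link, not the cover img

With the album links in hand, the full script below walks each album and downloads every page's image.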
import os
import time
import urllib.request  # plain `import urllib` is not enough for urlretrieve in Python 3

import requests
from lxml import html

os.mkdir('meizi')  # create the meizi folder on the first run; comment this out if it already exists

for page in range(1, 852):
    url = 'http://www.mmonly.cc/mmtp/list_9_%s.html' % page
    print(url)
    response = requests.get(url, verify=False).text
    selector = html.fromstring(response)
    imgEle = selector.xpath('//div[@class="ABox"]/a')
    print(len(imgEle))
    for index, img in enumerate(imgEle):
        imgUrl = img.xpath('@href')[0]
        response = requests.get(imgUrl, verify=False).text
        selector = html.fromstring(response)
        # The page counter in the title tells us how many images the album has.
        pageEle = selector.xpath('//div[@class="wrapper clearfix imgtitle"]/h1/span/span[2]/text()')[0]
        print(pageEle)
        # The "down-btn" link holds the full-size image URL; save page 1 first.
        imgE = selector.xpath('//a[@class="down-btn"]/@href')[0]
        imgName = '%s_%s_1.jpg' % (page, index + 1)
        coverPath = '%s/meizi/%s' % (os.getcwd(), imgName)
        urllib.request.urlretrieve(imgE, coverPath)

        # Pages 2..N of an album follow the pattern 181_2.html, 181_3.html, ...
        for page_2 in range(2, int(pageEle) + 1):
            url = imgUrl.replace('.html', '_%s.html' % page_2)
            response = requests.get(url).text
            selector = html.fromstring(response)
            imgSrc = selector.xpath('//a[@class="down-btn"]/@href')[0]
            print(imgSrc)
            imgName = '%s_%s_%s.jpg' % (page, index + 1, page_2)
            coverPath = '%s/meizi/%s' % (os.getcwd(), imgName)
            urllib.request.urlretrieve(imgSrc, coverPath)
            time.sleep(2)
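For clarity, the inner loop leans on the site's per-page URL pattern. A tiny illustration, using the example album URL from earlier (the exact URL is assumed):

# Pages 2..N of an album are derived from the album URL by suffixing _<n>:
imgUrl = 'http://www.mmonly.cc/mmtp/xgmn/181.html'
for n in range(2, 5):
    print(imgUrl.replace('.html', '_%s.html' % n))
# -> http://www.mmonly.cc/mmtp/xgmn/181_2.html  (and so on)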
This site has no real anti-scraping measures; to avoid putting too much load on the server, a 2-second sleep is added at the end of each inner loop iteration.
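If you want to be politer still, and more robust against the occasional failed request, one option is a small download helper that reuses a single connection and retries with a sleep between attempts. A minimal sketch; the User-Agent header, retry count, and delay are illustrative choices, not anything the site is known to require:

import time
import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'  # assumed; many sites reject the default UA

def fetch(url, retries=3, delay=2):
    """GET a URL with simple retry-and-sleep throttling."""
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            print('retry %s for %s: %s' % (attempt + 1, url, e))
            time.sleep(delay)
    return None

def save_image(url, path, delay=2):
    """Download one image to disk, then sleep to keep server load low."""
    resp = fetch(url)
    if resp is not None:
        with open(path, 'wb') as f:
            f.write(resp.content)
    time.sleep(delay)

urllib.request.urlretrieve works fine too; the session-based version mainly saves reconnecting for every one of the 20,000+ downloads.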
The result is a folder full of downloaded images (the original post showed a screenshot here).
Reprinted from: https://blog.csdn.net/qq_36772866/article/details/105384220