For work I needed some photos of beautiful women to use as illustrations, so I searched online and found the Win4000 site (美桌网). Under its meinvtag*** tags there are high-resolution photos that fit my needs exactly, so I wrote a simple crawler to download them.
First, look at how the site is organized. Take the meinvtag2 tag as an example: there are 5 pages in total, with URLs like http://www.win4000.com/meinvtag2_1.html, where the trailing number is the page number. This is the URL pattern crawlers like best, and each page can simply get its own thread. From a page's HTML source you can extract the URL of each photo album, and opening an album lets you view its pictures one by one. The image-page URLs are just as simple, e.g. http://www.win4000.com/meinv198397_2.html, but there is no direct way to know how many pictures an album contains. My approach is to loop the image id from 1 to 50 and break out of the loop as soon as a page returns no image. The code is below:
import requests
from bs4 import BeautifulSoup
import threading


def download_img_from_url(path, url):
    with open(path, 'wb') as f:
        f.write(requests.get(url).content)


def get_BS(url):
    # fetch a page and return its parsed soup, or None on an HTTP error
    html = requests.get(url)
    try:
        html.raise_for_status()
        return BeautifulSoup(html.text, "lxml")
    except requests.RequestException:
        return None


def download(i):
    page_url = page_url_format.format(i)
    bs = get_BS(page_url)
    lists = bs.find_all("div", {'class': 'tab_box'})
    tags = lists[1].find_all('a')  # the second tab_box holds the album links
    for tag in tags:
        album_url = tag.get('href')
        # drop the trailing ".html" and append "_" so image ids can be attached
        album_url = album_url[0:-5] + '_'
        for id in range(1, 51):  # probe ids 1 to 50; stop at the first missing page
            img_page_url = album_url + str(id) + ".html"
            # print(img_page_url)
            bs2 = get_BS(img_page_url)
            if bs2:
                img_url = bs2.find("img", class_='pic-large').get('data-original')
                name = img_page_url[28:-5]  # e.g. "198397_2" from .../meinv198397_2.html
                download_img_from_url(save_path.format(name), img_url)
            else:
                break


page_url_format = 'http://www.win4000.com/meinvtag2_{}.html'
save_path = 'D:\\image\\{}.jpg'

threads = []
for i in range(1, 6):  # one thread per page
    # pass i as an argument; a bare lambda would close over the loop variable
    thread = threading.Thread(target=download, args=(i,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
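One pitfall worth flagging in the original listing: target=lambda: download(i) makes every thread close over the same loop variable i, so a thread that starts late can see a later value; passing args=(i,), as above, pins the page number to each thread. If you would rather not manage threads by hand, the standard library's concurrent.futures gives the same one-thread-per-page behavior. A minimal sketch, assuming the download() function and the 5 pages from the script above:

from concurrent.futures import ThreadPoolExecutor

# one worker per page, equivalent to the five hand-rolled threads above;
# leaving the with block waits for every worker to finish
with ThreadPoolExecutor(max_workers=5) as pool:
    pool.map(download, range(1, 6))

The with block joins the workers on exit, so the explicit join() loop is no longer needed.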
Reprinted from: https://blog.csdn.net/w632782651/article/details/105632240