Last time I scraped Suning's book listings I was too lazy to work out the book prices, so today I finally sorted the prices out.
The price is generated dynamically, but a little time in the network panel cracks it: the price request starts out as one long string of parameters, and by trimming pieces and retrying you end up with a single request that still returns the price — see the screenshot below.
I then noticed a string of digits in the page source (the sku, which is generally used to identify a product; I'd seen the term before while reading a Django project). Extract it, drop it into the request URL, and you get the book's price.
Below is the code.
Spider code:
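Before the full spider, that URL-building step can be sketched on its own. The HTML fragment and the two sku numbers below are made up for illustration; only the `|||||` separator and the URL template match what the spider actually uses:

```python
import re

# Illustrative fragment; in the real page source the <em> tag carries a
# datasku="...|||||..." attribute with two ids packed together
html = '<em class="prive-tag" datasku="12259156778|||||0070187279"></em>'

# Pull out the attribute value and split on the "|||||" marker
m = re.search(r'datasku="([^"]+)"', html)
sku, vendor = m.group(1).split("|||||")

# Assemble the price-API URL the same way the spider does
price_url = "http://ds.suning.com/ds/generalForTile/{}_-781-2-{}-1--".format(sku, vendor)
print(price_url)
```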
```python
# -*- coding: utf-8 -*-
import json

import scrapy


class SuningSpider(scrapy.Spider):
    name = 'suning'
    allowed_domains = ['suning.com']
    start_urls = ['https://list.suning.com/1-502325-0.html']

    def parse(self, response):
        """Extract the basic info for each book."""
        # One <li> per book
        li_list = response.xpath('//div[@id="filter-results"]/ul/li')
        for li in li_list:
            book_item = {}
            book_item["book_title"] = li.xpath('.//p[@class="sell-point"]/a/text()').extract_first()
            book_item["book_url"] = "https:" + li.xpath('.//p[@class="sell-point"]/a/@href').extract_first()
            # Build the price URL from the two ids packed into the datasku attribute
            datasku = li.xpath('.//p[@class="prive-tag"]/em/@datasku').extract_first().split("|||||")
            book_item["book_price_url"] = "http://ds.suning.com/ds/generalForTile/{}_-781-2-{}-1--".format(datasku[0], datasku[1])
            yield scrapy.Request(
                book_item["book_price_url"],
                callback=self.book_price,
                meta={"book_item": book_item}
            )
        # Build the next-page URL
        page_num = response.xpath('//div[@id="bottom_pager"]/a[2]/text()').extract_first()
        next_url = "https://list.suning.com/1-502325-{}-0-0-0-0-14-0-4.html".format(page_num)
        # Only follow it if a "next page" link actually exists
        text = response.xpath('//a[@id="nextPage"]/@title').extract_first()
        if text is not None:
            yield scrapy.Request(next_url, callback=self.parse)

    def book_price(self, response):
        """Extract the book's price from the JSON response."""
        book_item = response.meta["book_item"]
        rsp_dict = json.loads(response.body.decode())
        book_item["价格"] = rsp_dict["rs"][0]["price"]  # "价格" = price
        yield book_item
```
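`book_price` assumes the price endpoint returns JSON shaped like the `rs[0].price` access above. The parsing can be checked stand-alone; the payload here is made up purely to exercise that code path, not real Suning data:

```python
import json

# Made-up sample body; only the rs[0].price path matters to the spider
body = b'{"rs": [{"price": "39.50"}]}'

rsp_dict = json.loads(body.decode())
price = rsp_dict["rs"][0]["price"]
print(price)  # prints 39.50
```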
settings.py — sets up scrapy-redis URL deduplication plus a random User-Agent pool:
```python
# Log level
LOG_LEVEL = "WARNING"
# scrapy-redis dedup filter: drops requests whose fingerprint has already been seen
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# scrapy-redis scheduler: keeps the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the queue and fingerprints in Redis after the crawl finishes
SCHEDULER_PERSIST = True
# Redis address
REDIS_URL = "redis://192.168.1.101:6379"
# Default User-Agent (overridden per request by the middleware below)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
# Ignore robots.txt
ROBOTSTXT_OBEY = False
# Item pipelines
ITEM_PIPELINES = {
    'SN.pipelines.SnPipeline': 300,
}
# Download delay
DOWNLOAD_DELAY = 0.4
DOWNLOADER_MIDDLEWARES = {
    'SN.middlewares.UserAgent': 543,
    'SN.middlewares.aa': 544,  # checks the random User-Agent was applied; give it its own priority so ordering is deterministic
}
# User-Agent pool
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
```
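The RFPDupeFilter works by hashing each request into a fingerprint and keeping the set of seen fingerprints in Redis, so duplicate URLs are dropped even across runs. A stdlib-only sketch of the idea (simplified — the real filter also canonicalizes the URL and stores the set in Redis, not in memory):

```python
import hashlib

def fingerprint(method, url):
    """Hash method + url into a stable hex digest (simplified stand-in
    for the request fingerprint scrapy-redis stores in Redis)."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

seen = set()
for url in ["https://list.suning.com/1-502325-0.html",
            "https://list.suning.com/1-502325-0.html",   # duplicate, gets dropped
            "https://list.suning.com/1-502325-2-0-0-0-0-14-0-4.html"]:
    fp = fingerprint("GET", url)
    if fp in seen:
        print("skip duplicate:", url)
    else:
        seen.add(fp)
print(len(seen))  # 2
```

This is why the spider can freely construct next-page URLs itself: any URL already requested is filtered out before it hits the downloader.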
pipelines.py — save the items to MongoDB:
```python
# -*- coding: utf-8 -*-
from pymongo import MongoClient


class SnPipeline:
    def open_spider(self, spider):
        # Open one client for the whole crawl instead of one per item
        client = MongoClient(host="127.0.0.1", port=27017)
        self.collection = client["sn_book_price"]["sn_book_price"]

    def process_item(self, item, spider):
        # Insert a copy so insert_one's added _id doesn't end up on the item
        self.collection.insert_one(dict(item))
        print(item)
        return item
```
middlewares.py — set a random User-Agent, plus a small middleware to verify it was actually applied:
```python
# -*- coding: utf-8 -*-
import random


class UserAgent:
    def process_request(self, request, spider):
        # Pick a random User-Agent from the pool defined in settings.py
        ua = random.choice(spider.settings.get("UA_LIST"))
        request.headers["User-Agent"] = ua


class aa:
    def process_response(self, request, response, spider):
        # Print the User-Agent that was actually sent, to confirm the random one took effect
        print(request.headers["User-Agent"])
        return response
```
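The flow of the two middlewares can be exercised without Scrapy by faking the request object. `FakeRequest` and the two-entry UA list below are demo stand-ins, not Scrapy classes or the real pool:

```python
import random

# Abbreviated UA pool just for the demo
UA_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) demo-ua-1",
    "Mozilla/5.0 (X11; Linux x86_64) demo-ua-2",
]

class FakeRequest:
    """Stand-in for scrapy.Request; only the headers dict matters here."""
    def __init__(self):
        self.headers = {}

req = FakeRequest()
# What UserAgent.process_request does on the way out ...
req.headers["User-Agent"] = random.choice(UA_LIST)
# ... and what the aa middleware prints on the way back
print(req.headers["User-Agent"])
```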
Run screenshot
Screenshot of the data saved in MongoDB
And the Redis state on the Ubuntu box
Project structure screenshot
The hardest part of the whole thing was working out the price, since it is generated dynamically. Pagination was the other wrinkle: the next-page URL is also generated dynamically, but its pattern is obvious, and since URL deduplication is in place I could simply construct the URLs myself. On the first run I had no anti-blocking measures and got blocked, so I added the random User-Agent and the download delay. Crawl a lot and you still get noticed eventually; I never got around to proxy IPs — next time for sure.
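Constructing those next-page URLs is just string formatting over the pattern observed on the site — the page number is the only part that changes:

```python
# Listing URL template observed on the site; {} is the page number
template = "https://list.suning.com/1-502325-{}-0-0-0-0-14-0-4.html"

# Build the first few page URLs by hand; the dedup filter drops any repeats
next_urls = [template.format(n) for n in range(1, 4)]
for u in next_urls:
    print(u)
```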
Reposted from: https://blog.csdn.net/qq_44657868/article/details/106199251