
Scraping Suning Book Prices with scrapy + redis + mongodb


When I scraped Suning's book listings earlier, I skipped the prices because I didn't feel like analyzing how they were loaded. Today I finally dealt with the book prices.

The price is generated dynamically, but it only took a little time to work out. The price request URL started as one long string of parameters; by deleting parameters bit by bit and retesting, I whittled it down to the one request that matters. See the screenshot below.

Then I noticed a string of digits in the page source labeled sku (generally used to identify a product; I'd run into the term before while reading a Django project). Extract it, drop it into the request URL, and you get the book's price.
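To make the trick concrete, here is a small sketch. The `|||||` separator and the JSON shape are assumptions lifted from the spider code later in this post; the sku and shop-code digits here are made up:

```python
import json

# Hypothetical walk-through of building the price-API URL from the page's
# datasku attribute and reading the price out of the JSON it returns.
datasku = "12755639846|||||0070088595"   # made-up value of the em/@datasku attribute
sku, shop_code = datasku.split("|||||")
price_url = "http://ds.suning.com/ds/generalForTile/{}_-781-2-{}-1--".format(sku, shop_code)
print(price_url)

# Assumed shape of the JSON body the price endpoint returns:
sample_body = '{"rs": [{"price": "25.80"}]}'
price = json.loads(sample_body)["rs"][0]["price"]
print(price)  # → 25.80
```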

The code is below.

Spider code

# -*- coding: utf-8 -*-
import scrapy
import json


class SuningSpider(scrapy.Spider):
    name = 'suning'
    allowed_domains = ['suning.com']
    start_urls = ['https://list.suning.com/1-502325-0.html']

    def parse(self, response):
        """Collect the basic info for each book"""
        # one <li> per book
        li_list = response.xpath('//div[@id="filter-results"]/ul/li')
        for li in li_list:
            book_item = {}
            book_item["book_title"] = li.xpath('.//p[@class="sell-point"]/a/text()').extract_first()
            book_item["book_url"] = "https:" + li.xpath('.//p[@class="sell-point"]/a/@href').extract_first()

            # build the price-API URL from the two parts of the datasku attribute
            datasku = li.xpath('.//p[@class="prive-tag"]/em/@datasku').extract_first().split("|||||")
            book_item["book_price_url"] = "http://ds.suning.com/ds/generalForTile/{}_-781-2-{}-1--".format(datasku[0], datasku[1])
            yield scrapy.Request(
                book_item["book_price_url"],
                callback=self.book_price,
                meta={"book_item": book_item}
            )

        # build the next-page URL
        page_num = response.xpath('//div[@id="bottom_pager"]/a[2]/text()').extract_first()
        next_url = "https://list.suning.com/1-502325-{}-0-0-0-0-14-0-4.html".format(page_num)
        # only follow it if the "next page" link exists
        text = response.xpath('//a[@id="nextPage"]/@title').extract_first()

        if text is not None:
            yield scrapy.Request(next_url, callback=self.parse)

    def book_price(self, response):
        """Parse the book price out of the JSON response"""
        book_item = response.meta["book_item"]
        rsp_dict = json.loads(response.body.decode())
        book_item["book_price"] = rsp_dict["rs"][0]["price"]
        yield book_item


settings.py code — the scrapy-redis dedup and scheduler setup, plus a random User-Agent list

# log level
LOG_LEVEL = "WARNING"
# dedup class: fingerprints requests and stores them in Redis so duplicate URLs are dropped
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis-backed scheduler (shared request queue)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# keep the request queue and dedup set in Redis between runs
SCHEDULER_PERSIST = True
# Redis address
REDIS_URL = "redis://192.168.1.101:6379"
# default User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
# whether to obey robots.txt
ROBOTSTXT_OBEY = False
# item pipelines
ITEM_PIPELINES = {
    'SN.pipelines.SnPipeline': 300,
}
# download delay (seconds)
DOWNLOAD_DELAY = 0.4

DOWNLOADER_MIDDLEWARES = {
    'SN.middlewares.UserAgent': 543,
    'SN.middlewares.aa': 544,  # logs the outgoing User-Agent to verify the random UA was applied
}
# User-Agent pool for the random-UA middleware
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
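For context on what the `RFPDupeFilter` line above buys us: scrapy-redis reduces each request to a fingerprint (Scrapy hashes the method, canonical URL, and body) and keeps the fingerprints in a Redis set, so URLs that get queued twice are silently dropped. A toy stdlib-only illustration of the idea — the real fingerprint covers more than method + URL:

```python
import hashlib

# Simplified sketch of request dedup by fingerprinting. The real
# RFPDupeFilter hashes more request fields and stores the set in Redis.
def fingerprint(method, url):
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

seen = set()
for url in ["https://list.suning.com/1-502325-0.html",
            "https://list.suning.com/1-502325-1-0-0-0-0-14-0-4.html",
            "https://list.suning.com/1-502325-0.html"]:  # duplicate of the first
    fp = fingerprint("GET", url)
    if fp in seen:
        print("dropped duplicate:", url)
    else:
        seen.add(fp)
print(len(seen))  # → 2
```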

pipelines.py code — save items to MongoDB

# -*- coding: utf-8 -*-
from pymongo import MongoClient


class SnPipeline:
    def open_spider(self, spider):
        # connect once when the spider starts, instead of once per item
        self.client = MongoClient(host="127.0.0.1", port=27017)
        self.db = self.client["sn_book_price"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db.sn_book_price.insert_one(dict(item))  # insert the item into MongoDB
        print(item)
        return item

middlewares.py code — set a random User-Agent, plus a middleware to check that the User-Agent was actually applied

# -*- coding: utf-8 -*-
import random


class UserAgent:
    def process_request(self, request, spider):
        # pick a random User-Agent from the UA_LIST setting for every request
        ua = random.choice(spider.settings.get("UA_LIST"))
        request.headers["User-Agent"] = ua


class aa:
    def process_response(self, request, response, spider):
        # print the User-Agent that was actually sent, to confirm the random UA took effect
        print(request.headers["User-Agent"])
        return response
Run screenshot


Screenshot of the data saved in MongoDB

And the Redis side, on Ubuntu

Project structure screenshot

The hardest part of the whole thing was analyzing the book price, since it is generated dynamically. The other tricky bit was pagination: the next-page URL is also generated dynamically, but its pattern is obvious, and since request dedup is in place I could simply construct the URLs myself. The first time I ran it I had no anti-blocking measures and got blocked, so I added the random User-Agent and the download delay. Crawl a lot and you are easily detected; I never got around to proxy IPs — next time for sure.
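Since the post ends wishing for proxy IPs: a minimal sketch of what a Scrapy proxy middleware could look like, assuming a hypothetical PROXY_LIST setting (the address below is made up). Scrapy's built-in HttpProxyMiddleware picks up the proxy for a request from request.meta["proxy"]:

```python
import random

# Minimal sketch of a random-proxy downloader middleware. PROXY_LIST is a
# hypothetical setting, e.g. PROXY_LIST = ["http://1.2.3.4:8080", ...];
# the addresses are assumptions, not real proxies.
class RandomProxy:
    def process_request(self, request, spider):
        proxy = random.choice(spider.settings.get("PROXY_LIST"))
        request.meta["proxy"] = proxy  # Scrapy routes the request through this proxy
```

Like the UserAgent middleware above, it would still need to be registered in DOWNLOADER_MIDDLEWARES to take effect.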

Reposted from: https://blog.csdn.net/qq_44657868/article/details/106199251