
Scrapy framework --- taking flight on the road to web scraping (Part 9)


Recap of Part 8 --- for a systematic walk-through, see Parts 1 through 8.

The Scrapy framework

  • Five major components + workflow + common commands

    【1】Five major components (plus two middleware layers)
        1.1) Engine
        1.2) Spider
        1.3) Scheduler
        1.4) Downloader
        1.5) Item Pipeline
        1.6) Downloader Middlewares
        1.7) Spider Middlewares
        
    【2】Workflow
        2.1) The Engine asks the Spider for URLs and hands them to the Scheduler to be enqueued
        2.2) The Scheduler dequeues requests and passes them through the Downloader Middlewares to the Downloader
        2.3) Once the Downloader has a response, it is passed back through the Spider Middlewares to the Spider
        2.4) The Spider extracts data:
           a) extracted data is handed to the Pipeline for processing
           b) URLs that need following are handed back to the Scheduler to be enqueued, and the cycle repeats
        
    【3】Common commands
        3.1) scrapy startproject <project_name>
        3.2) scrapy genspider <spider_name> <domain>
        3.3) scrapy crawl <spider_name>
    

The complete workflow of a Scrapy project

  • Full workflow

    【1】scrapy startproject Tencent
    【2】cd Tencent
    【3】scrapy genspider tencent tencent.com
    【4】items.py (define the structure of the scraped data)
        import scrapy
        class TencentItem(scrapy.Item):
            name = scrapy.Field()
            address = scrapy.Field()

    【5】tencent.py (write the spider file)
        import scrapy
        from ..items import TencentItem
        
        class TencentSpider(scrapy.Spider):
            name = 'tencent'
            allowed_domains = ['tencent.com']
            start_urls = ['http://tencent.com/']
            def parse(self,response):
                item = TencentItem()
                xxx
                yield item
    
    【6】pipelines.py (process the data)
        class TencentPipeline(object):
            def process_item(self,item,spider):
                return item
            
    【7】settings.py (global configuration)
        LOG_LEVEL = ''  # DEBUG < INFO < WARNING < ERROR < CRITICAL
        LOG_FILE = ''
        FEED_EXPORT_ENCODING = ''

    【8】run.py
        from scrapy import cmdline
        cmdline.execute('scrapy crawl tencent'.split())
    

Things you must remember

  • Know these by heart

    【1】Response object attributes and methods (see the sketch after this block)
        1.1) response.text : the response body as a string
        1.2) response.body : the response body as bytes
        1.3) response.xpath('')
        1.4) response.xpath('').extract() : extract text content, serializing every element of the list to a Unicode string
        1.5) response.xpath('').extract_first() : extract the first text item in the list
        1.6) response.xpath('').get() : extract the first text item in the list (equivalent to extract_first())

    【2】Common variables in settings.py
        2.1) Set the log level
             LOG_LEVEL = ''
        2.2) Write logs to a file (nothing printed to the terminal)
             LOG_FILE = ''
        2.3) Set the data export encoding (mainly for JSON files)
             FEED_EXPORT_ENCODING = 'utf-8'
        2.4) Set the User-Agent
             USER_AGENT = ''
        2.5) Set the maximum concurrency (default 16)
             CONCURRENT_REQUESTS = 32
        2.6) Download delay (how long to wait between page requests)
             DOWNLOAD_DELAY = 1
        2.7) Request headers
             DEFAULT_REQUEST_HEADERS = {'User-Agent':'Mozilla/'}
        2.8) Register item pipelines
             ITEM_PIPELINES = {'project_dir.pipelines.ClassName': priority}
        2.9) Cookies (the setting ships commented out; uncommenting it as either True or False enables a cookie mechanism - see the cookie section below)
             COOKIES_ENABLED = False
        2.10) Storage paths for unstructured data
             IMAGES_STORE = '/home/tarena/images/'
             FILES_STORE = '/home/tarena/files/'
        2.11) Register downloader middlewares
            DOWNLOADER_MIDDLEWARES = {'project_name.middlewares.ClassName': 200}

    【3】Log levels
        DEBUG < INFO < WARNING < ERROR < CRITICAL
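
    A minimal sketch of how the response methods above are typically used inside a spider's parse() callback; the XPath expressions and the yielded dict are hypothetical placeholders, not taken from any project in this post:

        def parse(self, response):
            print(response.status)                              # HTTP status code, e.g. 200
            print(response.text[:100])                          # body as str
            titles = response.xpath('//h2/a/text()').extract()  # list of every match
            first = response.xpath('//h2/a/text()').get()       # first match, or None
            for title in titles:
                yield {'title': title}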
    

Ways to start a spider

  • Startup methods

    【1】Method 1: based on start_urls
        1.1) Each URL in the spider file's start_urls list is handed to the Scheduler and enqueued,
        1.2) and the response object returned by the Downloader is handed to the spider's parse(self,response) method
    
    【2】Method 2 (see the sketch after this list)
        Override the start_requests() method, build the URLs there, and hand each one to the callback you specify
        2.1) Remove the start_urls variable
        2.2) def start_requests(self):
                 # build the URLs to crawl and hand them to the Scheduler with scrapy.Request()
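
    A minimal sketch of Method 2, assuming a hypothetical spider with a page-numbered URL pattern (the domain, URL template and callback name are made up for illustration):

        import scrapy

        class DemoSpider(scrapy.Spider):
            name = 'demo'
            allowed_domains = ['example.com']
            base_url = 'http://example.com/list?page={}'   # hypothetical URL pattern

            def start_requests(self):
                # generate the URLs ourselves instead of relying on start_urls
                for page in range(1, 4):
                    yield scrapy.Request(url=self.base_url.format(page),
                                         callback=self.parse_page)

            def parse_page(self, response):
                # extract data here
                pass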
    

Persisting the data

  • MySQL - MongoDB - JSON - CSV

    *************************** Storing in MySQL / MongoDB ****************************
    【1】Define the relevant variables in settings.py
    【2】Create a new pipeline class in pipelines.py and import the settings module (a MongoDB sketch follows this block)
        def open_spider(self,spider):
            # runs once when the spider starts - used to open the database connection

        def process_item(self,item,spider):
            # processes each scraped item
            return item

        def close_spider(self,spider):
            # runs once when the spider closes - used to close the database connection

    【3】Register the pipeline in settings.py
        ITEM_PIPELINES = {'':200}
    
    【Note】 process_item() must return item
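
    A minimal MongoDB pipeline sketch following steps 1-3 above. The settings names (MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET) are hypothetical settings.py variables, and pymongo is assumed to be installed:

        import pymongo
        from .settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET  # hypothetical variables

        class MongoPipeline(object):
            def open_spider(self, spider):
                # runs once at startup: connect to MongoDB
                self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
                self.db = self.conn[MONGO_DB]
                self.myset = self.db[MONGO_SET]

            def process_item(self, item, spider):
                # one document per item; dict(item) turns the Item into a plain dict
                self.myset.insert_one(dict(item))
                return item

            def close_spider(self, spider):
                # runs once at shutdown: close the connection
                self.conn.close()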
    
    ******************************** Storing in JSON / CSV files ***********************
    scrapy crawl maoyan -o maoyan.csv
    scrapy crawl maoyan -o maoyan.json
    【Note】
        When exporting to JSON, add this variable in settings.py : FEED_EXPORT_ENCODING = 'utf-8'
    

Multi-level page scraping: the spider file

  • Multi-level page strategy

    【Scenario 1】Scraping only the first-level page
    """
    First-level page: name (name), hobbies (likes)
    """
    import scrapy
    from ..items import OneItem
    class OneSpider(scrapy.Spider):
        name = 'One'
        allowed_domains = ['www.one.com']
        start_urls = ['http://www.one.com']
        def parse(self,response):
            # create the item object; since no further requests go to the scheduler, it can be created outside the loop
            item = OneItem()
            dd_list = response.xpath('//dd')
            for dd in dd_list:
                item['name'] = dd.xpath('./text()').get()
                item['likes'] = dd.xpath('./text()').get()
                
                yield item
    
    
    【Scenario 2】Scraping second-level (detail) pages
    """
    First-level page: name (name), detail-page link (url) - needs to be followed
    Second-level page: detail-page content (content)
    """
    import scrapy
    from ..items import TwoItem
    
    class TwoSpider(scrapy.Spider):
        name = 'two'
        allowed_domains = ['www.two.com']
        start_urls = ['http://www.two.com/']
        def parse(self,response):
            """一级页面解析函数,提取 name 和 url(详情页链接,需要继续请求)"""
            dd_list = response.xpath('//dd')
            for dd in dd_list:
                # there are further requests to enqueue with the scheduler, so create the item object inside the loop
                item = TwoItem()
                item['name'] = dd.xpath('./text()').get()
                item['url'] = dd.xpath('./@href').get()
                
                yield scrapy.Request(
                    url=item['url'],meta={'item':item},callback=self.detail_page)
                
        def detail_page(self,response):
            item = response.meta['item']
            item['content'] = response.xpath('//text()').get()
            
            yield item
            
                
    【Scenario 3】Scraping third-level pages
    """
    First-level page: name (one_name), detail-page link (one_url) - needs to be followed
    Second-level page: name (two_name), download-page link (two_url) - needs to be followed
    Third-level page: the actual content we want (content)
    """
    import scrapy
    from ..items import ThreeItem
    
    class ThreeSpider(scrapy.Spider):
        name = 'three'
        allowed_domains = ['www.three.com']
        start_urls = ['http://www.three.com/']
        
        def parse(self,response):
            """一级页面解析函数 - one_name、one_url"""
            dd_list = response.xpath('//dd')
            for dd in dd_list:
                # more requests go to the scheduler, so now is the moment to create the item object!
                item = ThreeItem()
                item['one_name'] = dd.xpath('./text()').get()
                item['one_url'] = dd.xpath('./@href').get()
                yield scrapy.Request(
                    url=item['one_url'],meta={'meta_1':item},callback=self.parse_two)
                
        def parse_two(self,response):
            """二级页面解析函数: two_name、two_url"""
            meta1_item = response.meta['meta_1']
            li_list = response.xpath('//li')
            for li in li_list:
                # more requests are about to be enqueued with the scheduler, so create a fresh item object here!
                item = ThreeItem()
                item['two_name'] = li.xpath('./text()').get()
                item['two_url'] = li.xpath('./@href').get()
                item['one_name'] = meta1_item['one_name']
                item['one_url'] = meta1_item['one_url']
                # hand the request to the scheduler to be enqueued
                yield scrapy.Request(
                    url=item['two_url'],meta={'meta_2':item},callback=self.detail_page)
                
        def detail_page(self,response):
            """三级页面解析: 具体内容content"""
            item = response.meta['meta_2']
            # extracting the final content now - no more requests go to the scheduler, so there is no need to create another item object
            item['content'] = response.xpath('//text()').get()
            
            # hand the item to the pipeline
            yield item
    

Scraping Tencent recruitment job listings

  • 1. Create the project + spider file

    scrapy startproject Tencent
    cd Tencent
    scrapy genspider tencent careers.tencent.com
    
    # First-level page (postId):
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1566266592644&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn
    
    # Second-level page (name + category + duties + requirements + location + time)
    https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1566266695175&postId={}&language=zh-cn
    
  • 2. Define the data structure to scrape

    import scrapy
    
    class TencentItem(scrapy.Item):
        # name + category + duties + requirements + location + time
        job_name = scrapy.Field()
        job_type = scrapy.Field()
        job_duty = scrapy.Field()
        job_require = scrapy.Field()
        job_address = scrapy.Field()
        job_time = scrapy.Field()
        # link to the specific job posting
        job_url = scrapy.Field()
        post_id = scrapy.Field()
    
  • 3. The spider file

    # -*- coding: utf-8 -*-
    import scrapy
    from urllib import parse
    import requests
    import json
    from ..items import TencentItem
    
    
    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        allowed_domains = ['careers.tencent.com']
        # frequently used variables
        one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1566266592644&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1566266695175&postId={}&language=zh-cn'
        headers = {'User-Agent': 'Mozilla/5.0'}
        keyword = input('Enter a job category: ')
        keyword = parse.quote(keyword)
    
        # override start_requests()
        def start_requests(self):
            total = self.get_total()
            # generate URLs for every page of the first-level listing and hand them to the scheduler
            for index in range(1,total+1):
                url = self.one_url.format(self.keyword,index)
                yield scrapy.Request(url=url,callback=self.parse_one_page)
    
        # get the total number of pages
        def get_total(self):
            url = self.one_url.format(self.keyword, 1)
            html = requests.get(url=url, headers=self.headers).json()
            count = html['Data']['Count']
            total = count//10 if count%10==0 else count//10 + 1
    
            return total
    
        def parse_one_page(self, response):
            html = json.loads(response.text)
            for one in html['Data']['Posts']:
                # URLs here need to be enqueued with the scheduler - so create an item object!
                item = TencentItem()
                item['post_id'] = one['PostId']
                item['job_url'] = self.two_url.format(item['post_id'])
                # one item object created - hand its request to the scheduler to be enqueued
                yield scrapy.Request(url=item['job_url'],meta={'item':item},callback=self.detail_page)
    
        def detail_page(self,response):
            """二级页面: 详情页数据解析"""
            item = response.meta['item']
            # convert the response body to Python data types
            html = json.loads(response.text)
            # name + category + duties + requirements + location + time
            item['job_name'] = html['Data']['RecruitPostName']
            item['job_type'] = html['Data']['CategoryName']
            item['job_duty'] = html['Data']['Responsibility']
            item['job_require'] = html['Data']['Requirement']
            item['job_address'] = html['Data']['LocationName']
            item['job_time'] = html['Data']['LastUpdateTime']
    
            # one complete record has now been extracted and there are no further requests for the scheduler - hand it to the pipeline
            yield item
    
  • 4. Create the database and table in advance - MySQL

    create database tencentdb charset utf8;
    use tencentdb;
    create table tencenttab(
    job_name varchar(500),
    job_type varchar(200),
    job_duty varchar(5000),
    job_require varchar(5000),
    job_address varchar(100),
    job_time varchar(100)
    )charset=utf8;
    
  • 5. The pipeline file

    class TencentPipeline(object):
        def process_item(self, item, spider):
            return item
    
    import pymysql
    from .settings import *
    
    class TencentMysqlPipeline(object):
        def open_spider(self,spider):
            """爬虫项目启动时,连接数据库1次"""
            self.db = pymysql.connect(MYSQL_HOST,MYSQL_USER,MYSQL_PWD,MYSQL_DB,charset=CHARSET)
            self.cursor = self.db.cursor()
    
        def process_item(self,item,spider):
            ins='insert into tencenttab values(%s,%s,%s,%s,%s,%s)'
            job_li = [
                item['job_name'],
                item['job_type'],
                item['job_duty'],
                item['job_require'],
                item['job_address'],
                item['job_time']
            ]
            self.cursor.execute(ins,job_li)
            self.db.commit()
    
            return item
    
        def close_spider(self,spider):
            """爬虫项目结束时,断开数据库1次"""
            self.cursor.close()
            self.db.close()
    
  • 6. settings.py

    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 0.5
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0',
    }
    ITEM_PIPELINES = {
       'Tencent.pipelines.TencentPipeline': 300,
       'Tencent.pipelines.TencentMysqlPipeline': 500,
    }
    # MySQL-related variables
    MYSQL_HOST = 'localhost'
    MYSQL_USER = 'root'
    MYSQL_PWD = '123456'
    MYSQL_DB = 'tencentdb'
    CHARSET = 'utf8'
    

Scraping the Daomu Biji (The Grave Robbers' Chronicles) novel

  • Goal

    【1】URL : http://www.daomubiji.com/
    【2】Requirement: scrape the full text of every chapter of Daomu Biji on the target site and save it to local files, e.g.
        ./data/novel/盗墓笔记1:七星鲁王宫/七星鲁王_第一章_血尸.txt
        ./data/novel/盗墓笔记1:七星鲁王宫/七星鲁王_第二章_五十年后.txt
    
  • Preparation: the XPath expressions

    【1】First-level page - volume titles and links:
        1.1) base XPath matching the list of a nodes:  '//li[contains(@id,"menu-item-20")]/a'
        1.2) volume title: './text()'
        1.3) volume link: './@href'

    【2】Second-level page - chapter titles and links
        2.1) base XPath matching the list of article nodes: '//article'
        2.2) chapter title: './a/text()'
        2.3) chapter link: './a/@href'

    【3】Third-level page - novel content
        3.1) list of p nodes: '//article[@class="article-content"]/p/text()'
        3.2) join them with join(): ' '.join(['p1','p2','p3',''])
    

Project implementation

  • 1. Create the project and spider file

    scrapy startproject Daomu
    cd Daomu
    scrapy genspider daomu www.daomubiji.com
    
  • 2. Define the data structure to scrape - items.py

    import scrapy
    
    class DaomuItem(scrapy.Item):
        # 1. first-level page title + link
        parent_title = scrapy.Field()
        parent_url = scrapy.Field()
        # 2. second-level page title + link
        son_title = scrapy.Field()
        son_url = scrapy.Field()
        # 3. directory to save into
        directory = scrapy.Field()
        # 4. novel content
        content = scrapy.Field()
    
  • 3. Implement the scraping in the spider file - daomu.py

    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import DaomuItem
    import os
    
    class DaomuSpider(scrapy.Spider):
        name = 'daomu'
        allowed_domains = ['www.daomubiji.com']
        start_urls = ['http://www.daomubiji.com/']
    
        def parse(self, response):
            """一级页面解析函数"""
            # base XPath
            a_list = response.xpath('//li[contains(@id,"menu-item-20")]/a')
            for a in a_list:
                # more requests need to go to the scheduler here - create the item object!
                item = DaomuItem()
                item['parent_title'] = a.xpath('./text()').get()
                item['parent_url'] = a.xpath('./@href').get()
                item['directory'] = './novel/{}/'.format(item['parent_title'])
                if not os.path.exists(item['directory']):
                    os.makedirs(item['directory'])
    
                # hand the request to the scheduler to be enqueued
                yield scrapy.Request(url=item['parent_url'], meta={'meta_1': item}, callback=self.detail_page)
    
        def detail_page(self,response):
            """二级页面解析: 提取小章节名称、链接"""
            meta_1 = response.meta['meta_1']
            # base XPath: get the list of node objects for all chapters
            article_list = response.xpath('//article')
            for article in article_list:
                # more requests go to the scheduler here - create the item object!
                item = DaomuItem()
                item['son_title'] = article.xpath('./a/text()').get()
                item['son_url'] = article.xpath('./a/@href').get()
                item['parent_title'] = meta_1['parent_title']
                item['parent_url'] = meta_1['parent_url']
                item['directory'] = meta_1['directory']
    
                # hand to the scheduler: create one item, yield one request
                yield scrapy.Request(url=item['son_url'],meta={'meta_2':item},callback=self.get_content)
    
        def get_content(self,response):
            """三级页面解析函数: 获取小说内容"""
            item = response.meta['meta_2']
            # p_list: ['paragraph 1','paragraph 2','paragraph 3']
            p_list = response.xpath('//article[@class="article-content"]//p/text()').extract()
            content = '\n'.join(p_list)
            item['content'] = content
    
            # no further requests for the scheduler, so no new item object - hand it straight to the pipeline
            yield item
    
  • 4. Process the data in the pipeline file - pipelines.py

    class DaomuPipeline(object):
        def process_item(self, item, spider):
            filename = item['directory'] + item['son_title'].replace(' ','_') + '.txt'
            with open(filename,'w') as f:
                f.write(item['content'])
    
            return item
    
  • 5. Global configuration - settings.py

    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 0.5
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0'
    }
    ITEM_PIPELINES = {
       'Daomu.pipelines.DaomuPipeline': 300,
    }
    

The image pipeline (360 Images scraping case study)

  • Goal

    【1】URL
        1.1) www.so.com -> Images -> Beauty
        1.2) https://image.so.com/z?ch=beauty
    
    【2】Image save path
        ./images/xxx.jpg
    
  • Capture the network traffic

    【1】Analysis shows the site loads its content dynamically via Ajax
    【2】Capture packets with F12 to find the JSON URL and the query-string parameters
        2.1) url = 'https://image.so.com/zjl?ch=beauty&sn={}&listtype=new&temp=1'
        2.2) 查询参数
             ch: beauty
             sn: 0 # sn keeps changing: 0 30 60 90 120 ...
             listtype: new
             temp: 1
    

Project implementation

  • 1. Create the project and spider file

    scrapy startproject So
    cd So
    scrapy genspider so image.so.com
    
  • 2. Define the data structure to scrape (items.py)

    img_url = scrapy.Field()
    img_title = scrapy.Field()
    
  • 3. Spider file: scrape the image links + titles

    import scrapy
    import json
    from ..items import SoItem
    
    class SoSpider(scrapy.Spider):
        name = 'so'
        allowed_domains = ['image.so.com']
        # override start_requests()
        url = 'https://image.so.com/zjl?ch=beauty&sn={}&listtype=new&temp=1'
    
        def start_requests(self):
            for sn in range(0,91,30):
                full_url = self.url.format(sn)
                # throw it to the scheduler to be enqueued
                yield scrapy.Request(url=full_url,callback=self.parse_image)
    
        def parse_image(self,response):
            html = json.loads(response.text)
            item = SoItem()
            for img_dict in html['list']:
                item['img_url'] = img_dict['qhimg_url']
                item['img_title'] = img_dict['title']
    
                yield item
    
  • 4. The pipeline file (pipelines.py)

    from scrapy.pipelines.images import ImagesPipeline
    import scrapy
    
    class SoPipeline(ImagesPipeline):
        # override get_media_requests()
        def get_media_requests(self, item, info):
            yield scrapy.Request(url=item['img_url'],meta={'name':item['img_title']})
    
        # override file_path() to customize the file name
        def file_path(self, request, response=None, info=None):
            img_link = request.url
            # request.meta attribute (set in get_media_requests())
            filename = request.meta['name'] + '.' + img_link.split('.')[-1]
            return filename
    
  • 5. Global configuration (settings.py)

    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 0.1
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0',
    }
    ITEM_PIPELINES = {
       'So.pipelines.SoPipeline': 300,
    }
    IMAGES_STORE = 'D:/AID1910/spider_day09_code/So/images/'
    
  • 6. Run the spider (run.py)

    from scrapy import cmdline
    
    cmdline.execute('scrapy crawl so'.split())
    

Summary: how to use the image pipeline

【1】Spider file: yield the image link to the pipeline
【2】Pipeline file:
   from scrapy.pipelines.images import ImagesPipeline
   class XxxPipeline(ImagesPipeline):
        def get_media_requests(self,xxx):
            pass
        
        def file_path(self,xxx):
            pass

【3】In settings.py:
   IMAGES_STORE = 'absolute path'

Summary: how to use the files pipeline

【1】Spider file: yield the file link to the pipeline
【2】Pipeline file (a fuller sketch follows):
   from scrapy.pipelines.files import FilesPipeline
   class XxxPipeline(FilesPipeline):
        def get_media_requests(self,xxx):
            pass
        
        def file_path(self,xxx):
            return filename
        
【3】In settings.py:
   FILES_STORE = 'absolute path'
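
A minimal, runnable files-pipeline sketch filling in the skeleton above; the item fields (file_url, file_title) and the class name are hypothetical placeholders, not from any project in this post:

   from scrapy.pipelines.files import FilesPipeline
   import scrapy

   class DemoFilesPipeline(FilesPipeline):
       def get_media_requests(self, item, info):
           # hypothetical field: item['file_url'] holds the link yielded by the spider
           yield scrapy.Request(url=item['file_url'], meta={'title': item['file_title']})

       def file_path(self, request, response=None, info=None):
           # name the file after the title attached in get_media_requests()
           ext = request.url.split('.')[-1]
           return request.meta['title'] + '.' + ext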

Scrapy - POST requests

  • Method + parameters

    scrapy.FormRequest(
        url=posturl,
        formdata=formdata,
        callback=self.parse
    )
    

Youdao Translate case study

  • 1. Create the project + spider file

    scrapy startproject Youdao
    cd Youdao
    scrapy genspider youdao fanyi.youdao.com
    
  • 2. items.py

    result = scrapy.Field()
    
  • 3. youdao.py

    # -*- coding: utf-8 -*-
    import scrapy
    import time
    import random
    from hashlib import md5
    import json
    from ..items import YoudaoItem
    
    class YoudaoSpider(scrapy.Spider):
        name = 'youdao'
        allowed_domains = ['fanyi.youdao.com']
        word = input('Enter the word to translate: ')
    
        def start_requests(self):
            post_url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
            salt, sign, ts = self.get_salt_sign_ts(self.word)
            formdata = {
                      'i': self.word,
                      'from': 'AUTO',
                      'to': 'AUTO',
                      'smartresult': 'dict',
                      'client': 'fanyideskweb',
                      'salt': salt,
                      'sign': sign,
                      'ts': ts,
                      'bv': 'cf156b581152bd0b259b90070b1120e6',
                      'doctype': 'json',
                      'version': '2.1',
                      'keyfrom': 'fanyi.web',
                      'action': 'FY_BY_REALTlME'
                }
            # send the POST request
            yield scrapy.FormRequest(url=post_url,formdata=formdata)
    
        def get_salt_sign_ts(self, word):
            # salt
            salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
            # sign
            string = "fanyideskweb" + word + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
            s = md5()
            s.update(string.encode())
            sign = s.hexdigest()
            # ts
            ts = str(int(time.time() * 1000))
            return salt, sign, ts
    
        def parse(self, response):
            item = YoudaoItem()
            html = json.loads(response.text)
            item['result'] = html['translateResult'][0][0]['tgt']
    
            yield item
    
  • 4. pipelines.py

    class YoudaoPipeline(object):
        def process_item(self, item, spider):
            print('Translation result:', item['result'])
            return item
    
  • 5. settings.py

    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'WARNING'
    COOKIES_ENABLED = False
    DEFAULT_REQUEST_HEADERS = {
          "Cookie": "OUTFOX_SEARCH_USER_ID=970246104@10.169.0.83; OUTFOX_SEARCH_USER_ID_NCOO=570559528.1224236; _ntes_nnid=96bc13a2f5ce64962adfd6a278467214,1551873108952; JSESSIONID=aaae9i7plXPlKaJH_gkYw; td_cookie=18446744072941336803; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1565689460872",
          "Referer": "http://fanyi.youdao.com/",
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    }
    ITEM_PIPELINES = {
       'Youdao.pipelines.YoudaoPipeline': 300,
    }
    

Three ways to add cookies in Scrapy

【1】Modify settings.py
    1.1) COOKIES_ENABLED = False  -> uncomment it; Scrapy then uses the cookie placed in the request headers
    1.2) DEFAULT_REQUEST_HEADERS = {}   add the Cookie here

【2】Use the cookies parameter
    2.1) settings.py: COOKIES_ENABLED = True # when set to True, Scrapy looks at the cookies argument of Request()
    2.2) def start_requests(self):
             yield scrapy.Request(url=url,cookies={},callback=xxx)
             yield scrapy.FormRequest(url=url,formdata=formdata,cookies={},callback=xxx)

【3】Set up a downloader middleware (see the sketch below)
    3.1) settings.py: COOKIES_ENABLED = True  # Scrapy looks at the cookies attribute of each Request()
    3.2) middlewares.py
         def process_request(self,request,spider):
             request.cookies={}
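
A minimal sketch of approach 3, assuming COOKIES_ENABLED = True in settings.py. The class name, the raw cookie string and the parsing helper are hypothetical, for illustration only:

    # middlewares.py
    class CookieDownloaderMiddleware(object):
        # a raw Cookie header copied from the browser (placeholder value)
        raw_cookie = 'key1=value1; key2=value2'

        def process_request(self, request, spider):
            # turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'} and attach it to the request
            request.cookies = dict(kv.split('=', 1) for kv in self.raw_cookie.split('; '))

    # settings.py
    DOWNLOADER_MIDDLEWARES = {'project_name.middlewares.CookieDownloaderMiddleware': 300}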

Using scrapy shell

  • Definition

    【1】A tool for debugging your spider
    【2】An interactive shell that lets you test Scrapy code quickly (mainly XPath expressions) without running the spider
    
  • Basic usage

    # scrapy shell <URL>
    *1. request.url     : the request URL
    *2. request.headers : request headers (dict)
    *3. request.meta    : passes item data between callbacks, sets a proxy (dict)
    
    4. response.text    : str
    5. response.body    : bytes
    6. response.xpath('')
    7. response.status  : HTTP status code
      
    # Available helpers
    shelp() : help
    fetch(request) : fetch a new response for the given request and update all related objects
    view(response) : open the given response in the local web browser for inspection
    
  • scrapy.Request() parameters (example below)

    1. url
    2. callback
    3. headers
    4. meta : pass data between callbacks, set a proxy
    5. dont_filter : whether to skip request filtering (the duplicate filter / allowed_domains check)
       default is False, i.e. requests are checked against allowed_domains['']
    6. cookies
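
    A small illustrative call combining these parameters, assumed to sit inside a parse callback where item already exists; the URL, proxy address and callback name are hypothetical:

        yield scrapy.Request(
            url='http://example.com/page/2',                          # hypothetical URL
            callback=self.parse_page,                                 # hypothetical callback
            headers={'User-Agent': 'Mozilla/5.0'},
            meta={'item': item, 'proxy': 'http://1.2.3.4:8080'},      # pass data + set a proxy
            dont_filter=True,                                         # skip the duplicate/offsite filters
            cookies={'key': 'value'},
        )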
    

Setting up middleware (random User-Agent)

  • Switching among a few User-Agents

    【1】Method 1 : settings.py
        1.1) USER_AGENT = ''
        1.2) DEFAULT_REQUEST_HEADERS = {}

    【2】Method 2 : in the spider file
        yield scrapy.Request(url,callback=callback_name,headers={})
    
  • Switching among many User-Agents (middleware in middlewares.py)

    【1】Ways to get a User-Agent
        1.1) Method 1 : create a useragents.py holding many User-Agent strings and switch randomly with the random module (see the sketch after this list)
        1.2) Method 2 : install the fake_useragent module (sudo pip3 install fake_useragent)
            from fake_useragent import UserAgent
            agent = UserAgent().random
            
    【2】Create the middleware class in middlewares.py
        from fake_useragent import UserAgent

        class RandomUseragentMiddleware(object):
            def process_request(self,request,spider):
                agent = UserAgent().random
                request.headers['User-Agent'] = agent
                
    【3】Register this downloader middleware in settings.py
        DOWNLOADER_MIDDLEWARES = {'' : priority}
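
    A minimal sketch of Method 1 above, assuming a hypothetical useragents.py placed next to middlewares.py (the UA strings are shortened placeholders):

        # useragents.py
        ua_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) ...',
            'Mozilla/5.0 (X11; Linux x86_64) ...',
        ]

        # middlewares.py
        import random
        from .useragents import ua_list

        class RandomUseragentMiddleware(object):
            def process_request(self, request, spider):
                # pick a random UA for every outgoing request
                request.headers['User-Agent'] = random.choice(ua_list)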
    

Setting up middleware (random proxy)

class RandomProxyDownloaderMiddleware(object):
    def process_request(self,request,spider):
        # attach a proxy to every outgoing request
        request.meta['proxy'] = xxx
        
    def process_exception(self,request,exception,spider):
        # if the request fails (e.g. a dead proxy), resubmit it so another proxy gets a chance
        return request
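
A fuller sketch of the same idea with a hypothetical proxy pool (the addresses are placeholders; in practice they would come from a file or an API):

    import random

    proxy_list = ['http://1.1.1.1:8080', 'http://2.2.2.2:8080']   # hypothetical proxies

    class RandomProxyDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # attach a randomly chosen proxy to every outgoing request
            request.meta['proxy'] = random.choice(proxy_list)

        def process_exception(self, request, exception, spider):
            # on failure, resubmit the request so it is retried with a different proxy
            return request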

  • Exercise

    Youdao Translate: add the cookie to the Scrapy project by way of a middleware
    

Things to think about (answers in Part 10)

【1】URL : http://www.1ppt.com/xiazai/
【2】Goal:
    2.1) Scrape all the PPT files, on every page, under every category
    2.2) Save them as: /home/tarena/ppts/工作总结PPT/xxx
                       /home/tarena/ppts/个人简历PPT/xxx
【Hint】: use FilesPipeline and override its methods
   

Reposted from: https://blog.csdn.net/qq_37397335/article/details/106201942