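Writing the spider body
class XXXSpider(scrapy.Spider):  # the spider class must inherit from scrapy's Spider class
name: the spider's identifier, used on the command line; it must be unique when the project contains several spiders
allowed_domains: the domains the spider is allowed (restricted) to crawl
start_urls: the URL(s) the crawl starts from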

def parse(self, response):
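When the downloader hands a response back to the engine, the engine forwards it to the spider, and this method is where it arrives (step ⑥ in the architecture diagram).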

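Crawling Gushiwen (古诗文网)

Clicking "next page" changes the URL to https://www.gushiwen.org/default_2.aspx
Replacing the 2 with 1 leads to the first page, which gives us the URL pattern for crawling.

Put https://www.gushiwen.org/default_1.aspx into start_urls in the spider file as the starting page.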
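The URL pattern noted above can be written down as a small helper. This is only a sketch: the function name page_url and the constant PAGE_TEMPLATE are made up for illustration, and it assumes the pattern default_N.aspx holds for every page number the site exposes.

```python
# Template taken from the URLs observed above; only the page number varies.
PAGE_TEMPLATE = "https://www.gushiwen.org/default_{page}.aspx"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return PAGE_TEMPLATE.format(page=page)


# The spider's start_urls would then hold just the first page.
start_urls = [page_url(1)]
print(start_urls[0])
```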

Running the spider

    def parse(self, response):
        print("=" * 30)
        print(type(response))
        print("=" * 30)

The spider can be run from the Terminal panel in PyCharm.

Command: scrapy crawl <spider name>

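Checking the result

The parse function prints information about the crawled response:

==============================
<class 'scrapy.http.response.html.HtmlResponse'>
==============================

The source of HtmlResponse shows that it inherits from TextResponse, which in turn inherits from Response, so methods such as xpath and css can be used to extract content.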

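Setting up a launcher file

Typing the command by hand (or recalling it with the arrow keys) on every run gets tedious.
A file can take the place of the manual input: whenever the spider needs to run, just run that file.

Create a start.py file at the outer level of the project.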
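The original post does not show the contents of start.py, so the following is one possible version, not the author's file. It relies on scrapy.cmdline.execute, which accepts the same argument list the scrapy command would receive. The spider name "gushiwen" and the helper names crawl_argv and main are placeholders.

```python
# start.py: a hypothetical launcher placed at the outer level of the project.


def crawl_argv(spider_name):
    """Build the argument list equivalent to typing `scrapy crawl <name>`."""
    return ["scrapy", "crawl", spider_name]


def main(spider_name):
    """Start the named spider exactly as the command line would."""
    # Imported lazily so crawl_argv() can be used without Scrapy installed.
    from scrapy import cmdline

    cmdline.execute(crawl_argv(spider_name))


# In the real file, finish with a call such as:
#     main("gushiwen")   # "gushiwen" stands for the spider's `name` attribute
```

Running start.py from PyCharm then has the same effect as typing the command in the terminal.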

Crawling the poem content

Getting the container

def parse(self, response):
    contentLeft = response.xpath("//div[@class = 'left']")  # in PyCharm, Ctrl+Alt+V generates a variable for the return value

def parse(self, response):
    contentLeft = response.xpath("//div[@class = 'left']")
    print("=" * 30)
    print(type(contentLeft))
    print("=" * 30)

Output

SelectorList inherits from list, so every list method can be used on it.

Getting the poem bodies

def parse(self, response):
    contentLeft = response.xpath("//div[@class = 'left']//div[@class = 'sons']")
    contentList = contentLeft
    for poemContent in contentList:
        title = poemContent.xpath(".//div[@class = 'cont']//b/text()").get().strip()  # get() returns the first match as a str, i.e. the extracted text
        dynasty = poemContent.xpath(".//div[@class = 'cont']//p[@class = 'source']//a[1]//text()").get()
        name = poemContent.xpath(".//div[@class = 'cont']//p[@class = 'source']//a[2]//text()").get()
        poemContext = poemContent.xpath(".//div[@class = 'contson']//text()").getall()
        poemContext = "".join(poemContext).strip()  # join the list into one string and trim the surrounding whitespace

strip(): removes extra whitespace from both ends of the text
getall(): returns the text of every matching node under this level as a list of strings
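To make the two helpers concrete, here is a minimal sketch that works on plain strings instead of a live response. The fragment list is a made-up sample of what getall() might return, and clean and join_fragments are hypothetical helper names. One thing worth knowing: get() returns None when nothing matches, so calling .strip() directly on its result can raise an AttributeError; the clean helper guards against that.

```python
def clean(text):
    """Strip surrounding whitespace; tolerate the None that get() gives on no match."""
    return text.strip() if text is not None else ""


def join_fragments(fragments):
    """Join the strings returned by getall() and trim both ends of the result."""
    return "".join(fragments).strip()


# Made-up sample of text nodes, similar in shape to a getall() result.
fragments = ["\n床前明月光，", "疑是地上霜。", "\n举头望明月，", "低头思故乡。\n"]

print(clean("  静夜思\n"))
print(join_fragments(fragments))
```

Note that strip() only touches the two ends: the line break in the middle of the joined poem is kept.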

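Writing the data to disk

items.py defines the custom data types (item classes) the project uses.
Initially the file contains the following: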

class ModuleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

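The comment makes it clear enough: a field is declared with a statement of the form name = scrapy.Field().
So we can define our own fields and send the resulting items on to pipelines.py.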

class ModuleItem(scrapy.Item):
    title = scrapy.Field()
    dynasty = scrapy.Field()
    name = scrapy.Field()
    poemContext = scrapy.Field()
    pass

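In the spider file, add the following statements inside the parse(self, response) method:

# import the ModuleItem class from items.py
#     from module.module.items import ModuleItem
#     ...
            poem = ModuleItem(title=title, dynasty=dynasty, name=name, poemContext=poemContext)
            yield poem

yield poem
turns the function into a generator; each item is handed back to the engine (step ⑦ in the architecture diagram), which passes it on to the pipeline (step ⑧).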
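What "turns the function into a generator" means can be seen without Scrapy at all. In this sketch parse is a stand-in that takes a list of strings instead of a response, and the sample titles are made up. Calling it runs nothing; the body only advances when a consumer (in Scrapy, the engine) asks for the next item.

```python
import inspect


def parse(rows):
    """Stand-in for a spider's parse(): yields one dict per row."""
    for row in rows:
        yield {"title": row.strip()}


items = parse(["  静夜思 ", " 春晓  "])

is_generator = inspect.isgenerator(items)  # the call returned a generator object
first = next(items)                        # runs the body up to the first yield
rest = list(items)                         # resumes right after that yield

print(is_generator, first, rest)
```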

Using pipelines, part 1

pipelines.py mainly uses four functions

import json

class ModulePipeline(object):
    def __init__(self):  # constructor, runs first
        self.fp = open("poems.json", 'w', encoding='utf-8')
    def open_spider(self, spider):  # the first hook called once the spider is opened
        print("Spider started...")
    def process_item(self, item, spider):  # handles each item passed in from the spider file
        item_json = json.dumps(dict(item), ensure_ascii=False)  # json.dumps converts a dict to a str; json.loads converts a str back to a dict
        self.fp.write(item_json + "\n")
        return item
    def close_spider(self, spider):  # called just before the spider is closed
        self.fp.close()  # the crawl is over
        print("Spider finished")

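item_json = json.dumps(dict(item), ensure_ascii=False)
ensure_ascii=True (the default) escapes every non-ASCII character as a \uXXXX sequence; only with False is Chinese text written as-is, so remember to turn it off.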
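The effect of the flag is easy to check with the standard library alone. The sample dictionary below is made up for illustration.

```python
import json

# Made-up sample item, shaped like the fields defined in items.py.
item = {"title": "静夜思", "dynasty": "唐代"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the Chinese characters

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary with json.loads; the flag only changes how the text looks in the file.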

As for why dict(item) cannot simply be item, I still have not figured it out; print(item) displays what looks like a dict...
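A likely explanation for the question above: scrapy.Item behaves like a dict (it implements the mutable mapping interface and prints like one), but it is not a subclass of dict, and json.dumps only knows how to serialize real dict objects. The sketch below reproduces the situation with a stand-in class; FakeItem is hypothetical and is not Scrapy's implementation.

```python
import json
from collections.abc import MutableMapping


class FakeItem(MutableMapping):
    """Dict-like stand-in: a mapping that is NOT a dict subclass."""

    def __init__(self, **fields):
        self._values = dict(fields)

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

    def __repr__(self):
        return repr(self._values)  # prints exactly like a dict


item = FakeItem(title="静夜思")

try:
    json.dumps(item)
    direct_ok = True
except TypeError:
    direct_ok = False  # json does not accept arbitrary mapping types

converted = json.dumps(dict(item), ensure_ascii=False)
print(direct_ok, converted)
```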

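For pipelines.py to take effect, the following block in settings.py must also be uncommented (Ctrl+/):

ITEM_PIPELINES = {
    'module.pipelines.ModulePipeline': 300,  # 300 is the priority; the smaller the value, the earlier the pipeline runs
}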
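The priority only matters once more than one pipeline is enabled. As a rough illustration of "lower values run first", the sketch below sorts a settings dictionary by priority; the second entry, CleanupPipeline, is invented for the example and does not exist in the project, and execution_order is a hypothetical helper, not part of Scrapy.

```python
# Hypothetical settings: the second pipeline is invented for illustration only.
ITEM_PIPELINES = {
    "module.pipelines.ModulePipeline": 300,
    "module.pipelines.CleanupPipeline": 100,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by priority value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


print(execution_order(ITEM_PIPELINES))
```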

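Using pipelines, part 2

Data can also be saved with JsonItemExporter and JsonLinesItemExporter. JsonItemExporter writes all items into a single JSON array, which only becomes valid once finish_exporting() closes it when the spider ends; such a file has to be parsed as a whole, so large datasets are costly in memory.
JsonLinesItemExporter writes each item to disk as its own line as soon as it arrives, so it uses little memory (honestly, I do not know how this differs from the method implemented above; if you know, please tell me).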
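On the question in parentheses: as far as the output goes, the hand-written pipeline from part 1 already produces the JSON Lines format (one object per line), so the exporter mostly saves the manual dict() and json.dumps calls. The real contrast is between JSON Lines and a single JSON array, which the standard-library sketch below illustrates with made-up sample items.

```python
import io
import json

# Made-up sample items.
items = [
    {"title": "静夜思", "dynasty": "唐代"},
    {"title": "春晓", "dynasty": "唐代"},
]

# One JSON array: only valid once the closing bracket has been written.
array_text = json.dumps(items, ensure_ascii=False)

# JSON Lines: every line is a complete document and can be written at once.
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item, ensure_ascii=False) + "\n")
lines_text = buffer.getvalue()

# A JSON Lines file can be read back one line at a time.
parsed = [json.loads(line) for line in lines_text.splitlines()]
print(parsed == json.loads(array_text))
```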

Straight to the code; comparing it with the previous version is enough.

from scrapy.exporters import JsonLinesItemExporter
class ModulePipeline(object):
    def __init__(self):
        self.fp = open("poems.json", 'wb')  # the exporter writes bytes, so open the file in 'wb' mode
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
    def open_spider(self, spider):
        print("Spider started...")
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
    def close_spider(self, spider):
        self.fp.close()  # the crawl is over
        print("Spider finished")

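Crawling multiple pages

This site only lets you crawl ten pages at most, so treat it as a small exercise.
Look at the markup on page 9 first: the <a> tag has an href attribute holding (half of) the link to page 10.

Now look at page 10.

On page 10 the <a> tag no longer has an href attribute, which means this is the last page.

So for a multi-page crawl, the spider must use yield when handing items to the pipeline: yield suspends the function instead of ending it.

Fetch the next page's URL (@href); if something comes back, crawl the next page by calling parse recursively.

scrapy.Request(self.main_domain + next_url, callback=self.parse)
The first argument is the next address, i.e. the domain plus the second half obtained from @href.
The second argument is the method to call next; here it is parse itself (remember not to write parse(): the function is passed as a value, not called).
If no next-page URL (@href) can be found, the last page has been reached and a plain return ends the function.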
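The "domain plus href" step can also be done with urllib.parse.urljoin, which copes with both relative and absolute href values. The sketch below is an alternative to the string concatenation above, not the post's own code: next_page_url is a hypothetical helper, the href value "/default_10.aspx" is an assumed example of what the page contains, and the XPath in the comment is a placeholder for the real selector.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not href:
        return None  # no href on the <a> tag: nothing left to crawl
    return urljoin(current_url, href)


# Inside parse(), the tail could then look like this:
#     next_href = response.xpath("<XPath of the next-page link>/@href").get()
#     url = next_page_url(response.url, next_href)
#     if url is not None:
#         # pass the method itself (self.parse), not the result of calling it
#         yield scrapy.Request(url, callback=self.parse)

print(next_page_url("https://www.gushiwen.org/default_9.aspx", "/default_10.aspx"))
print(next_page_url("https://www.gushiwen.org/default_10.aspx", None))
```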

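This is my first post. I last studied web scraping half a year ago, and by the time I wanted to use it I had forgotten everything. People in my chat group told me to blog while learning, so that even after a long gap I can recall it when I need it again.

Link: 1QXoQ2J-uvQVd6ytZ6GsoKg
Extraction code: tj9m
Here is the source code on Baidu Netdisk, since the copy on my own computer may go missing at some point XD

Reposted from: https://blog.csdn.net/qq_40595682/article/details/101797355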

Scrapy framework usage (for review) (September 30, 2019)