
Python crawler framework: a simple Scrapy tutorial


The Scrapy framework

Installation on Windows

  - Twisted is a dependency; if it fails to install via pip, download the matching .whl file
  - pip3 install wheel
  - pip3 install ****.whl
  - pip3 install pywin32
For comparison, the corresponding Django commands:

        django-admin startproject mysite
        cd mysite
        python manage.py startapp app01

    
Scrapy

    # Create a project
    scrapy startproject sp1

    sp1
        - sp1
            - spiders/           directory holding the spiders
            - middlewares.py     middlewares
            - items.py           data templates used to structure scraped data, like a Django model
            - pipelines.py       persistence (see the sketch below)
            - settings.py        configuration file
        - scrapy.cfg             the project's top-level configuration

    # Create a spider
    cd sp1
    scrapy genspider example example.com

    # Run a spider (from inside the project directory)
    scrapy crawl example
    scrapy crawl example --nolog    # suppress log output

    # List all spiders in the project
    scrapy list
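
A minimal sketch of how items.py, pipelines.py and settings.py fit together. The title field, the Sp1Item/Sp1Pipeline names and the items.json output file are assumptions for illustration, not part of the original project:

# items.py - data template, similar to a Django model
import scrapy

class Sp1Item(scrapy.Item):
    title = scrapy.Field()   # hypothetical field


# pipelines.py - persistence: process_item is called for every yielded item
import json

class Sp1Pipeline(object):
    def open_spider(self, spider):
        self.f = open('items.json', 'a', encoding='utf-8')   # hypothetical output file

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()


# settings.py - register the pipeline (lower number = earlier in the chain)
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
}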

Scrapy uses the Twisted asynchronous networking library to handle network communication.

# A quick example
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class DigSpider(scrapy.Spider):
    name = "dig"  # start this spider with: scrapy crawl dig
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/', ]
    has_request_set = {}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Declare a browser object here (e.g. a Selenium webdriver) if pages
        # must be rendered by a real browser before parsing.
        # self.bro = webdriver.Chrome()

    def closed(self, spider):
        # Close the browser object when the spider finishes.
        # In a downloader middleware's process_response you can reach the
        # browser via spider.bro, take the rendered source from
        # bro.page_source, and pass it as the body argument of
        # HtmlResponse(url, body, request, encoding).
        pass

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # HtmlXPathSelector structures the HTML and exposes selector methods
        page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                obj = Request(url=page_url, method='GET', callback=self.parse)
                yield obj

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key
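
The comments in __init__/closed above refer to a downloader-middleware pattern for browser-rendered pages. A minimal sketch, assuming the spider stores a Selenium webdriver as self.bro; the SeleniumMiddleware name is hypothetical and the class would need to be registered in DOWNLOADER_MIDDLEWARES in settings.py:

# middlewares.py
from scrapy.http import HtmlResponse

class SeleniumMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro              # browser object declared on the spider
        bro.get(request.url)          # let the browser render the page
        page_text = bro.page_source   # rendered page source
        # Replace the downloaded response with one built from the rendered source
        return HtmlResponse(url=request.url, body=page_text,
                            encoding='utf-8', request=request)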

HtmlXPathSelector provides HTML-parsing functionality similar to BeautifulSoup.
Usage:

from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="link.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="link1.html">second item</a></div> 
    </body>
</html>
"""
# // selects descendants at any depth, / selects direct children
# First wrap the HTML string in a response object
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Selector object
# hxs = HtmlXPathSelector(response)  # deprecated
# hxs = Selector(response).xpath("//a")  # all a tags
# hxs = Selector(response).xpath("//a[@id]")  # all a tags that have an id attribute
# hxs = Selector(response).xpath("//a[@id='i1']")  # all a tags whose id is i1
# hxs = Selector(response).xpath('//a[@href="link.html"][@id="i1"]')  # multiple attribute conditions
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')  # href contains "link"
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href starts with "link"
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')  # regex match via re:test
# Use extract() / extract_first() to turn selectors into strings
# hxs = Selector(response).xpath("//a/text()")[0].extract()  # text of the first a tag
# hxs = Selector(response).xpath('//a/@href').extract_first()  # href of the first a tag
# print(hxs)

# Iterating over a real page's structure (e.g. dig.chouti.com); the sample
# HTML above has no #content-list div, so this loop would match nothing there.
item_list = response.xpath("//div[@id='content-list']/div[@class='item']")
for item in item_list:
    text = item.xpath('.//a/text()').extract_first()
    href = item.xpath('.//a/@href').extract_first()
    print(href, text.strip())

Reposted from: https://blog.csdn.net/fei347795790/article/details/100559371