
Scrapy Crawler: Batch-Scraping Lianjia Housing Data for Cities Across China, So You Can Stop Worrying About Housing!


Table of Contents

  • 1. Preface

  • 2. Environment Setup

  • 3. Annotated Code Walkthrough

  • 4. Supporting Screenshots

  • 5. Full Code

  • 6. Results

1. Preface


   
This article scrapes Lianjia's second-hand housing (ershoufang) listings. Once you have worked through it, you should be able to scrape Lianjia's other sections on your own, such as rentals and new homes.

Without further ado: open the Lianjia home page and click Beijing.

You arrive at the page shown below, which lists the cities of every province in China; clicking a city takes you to a page scoped to that city.

Click the second-hand housing (二手房) tab to reach the listings page. Notice that the link is simply the original city URL with /ershoufang/ appended, so later on we can build these links by string concatenation.

Note, however, that entries like the one below are not what we want and must be excluded.

For crawling multiple pages, the URL pattern is as follows; it is easy to spot at a glance (see the sketch below).
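As a minimal sketch of that pattern (the city URL here is only an illustrative placeholder; any city link taken from the /city/ index works the same way):

    base_url = "https://bj.lianjia.com"        # illustrative city URL from the /city/ index page
    page_url = base_url + "/ershoufang/"       # second-hand listings for that city
    for i in range(3):                         # first three result pages
        url = page_url + "pg" + str(i + 1)     # .../ershoufang/pg1, /ershoufang/pg2, /ershoufang/pg3
        print(url)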

2. Environment Setup

Create the database.
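If you prefer to do this from Python rather than a MySQL client, a minimal sketch with pymysql looks like the following (the root/root credentials are only an assumption matching the pipeline code later in this post; adjust them to your own setup):

    import pymysql

    # Assumed local MySQL credentials; change to match your environment
    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="root", charset="utf8mb4")
    with conn.cursor() as cursor:
        # Create the database that the pipeline below writes into
        cursor.execute("CREATE DATABASE IF NOT EXISTS lianjia DEFAULT CHARACTER SET utf8mb4")
    conn.close()

After that, run the table creation statement below inside the lianjia database.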

Table creation statement:


   
    CREATE TABLE `lianjia` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `city` varchar(100) DEFAULT NULL,
      `money` varchar(100) DEFAULT NULL,
      `address` varchar(100) DEFAULT NULL,
      `house_pattern` varchar(100) DEFAULT NULL,
      `house_size` varchar(100) DEFAULT NULL,
      `house_degree` varchar(100) DEFAULT NULL,
      `house_floor` varchar(100) DEFAULT NULL,
      `price` varchar(50) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=212 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Create the Scrapy project (for example with scrapy startproject Lianjia followed by scrapy genspider lianjia lianjia.com), then add a start.py at the project root so the spider can be launched from an IDE.


start.py


   
    from scrapy import cmdline

    cmdline.execute("scrapy crawl lianjia".split())
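Running start.py is equivalent to executing scrapy crawl lianjia from the command line inside the project directory; it simply makes the spider convenient to launch and debug from an IDE.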

3. Annotated Code Walkthrough

lianjia.py


   
    # -*- coding: utf-8 -*-
    import scrapy
    import time
    from Lianjia.items import LianjiaItem


    class LianjiaSpider(scrapy.Spider):
        name = 'lianjia'
        allowed_domains = ['lianjia.com']
        # Index page that links to every province and city
        start_urls = ['https://www.lianjia.com/city/']

        def parse(self, response):
            # See Figure 1: find the ul tag whose class is city_list_ul, then get all li tags under it
            ul = response.xpath("//ul[@class='city_list_ul']/li")
            # Iterate; each province is one li tag
            for li in ul:
                # See Figure 2: get the li tags of all cities under this province
                data_ul = li.xpath(".//ul/li")
                # Iterate to get each city
                for li_data in data_ul:
                    # See Figure 3: get each city's URL and name
                    city = li_data.xpath(".//a/text()").get()
                    # Append /ershoufang/ to build the second-hand listings link
                    page_url = li_data.xpath(".//a/@href").get() + "/ershoufang/"
                    # Crawl multiple pages
                    for i in range(3):
                        url = page_url + "pg" + str(i + 1)
                        print(url)
                        yield scrapy.Request(url=url, callback=self.pageData, meta={"info": city})

        def pageData(self, response):
            print("=" * 50)
            # Get the city name passed along in meta
            city = response.meta.get("info")
            # See Figure 4: find the ul tag whose class is sellListContent, then get all li tags under it
            detail_li = response.xpath("//ul[@class='sellListContent']/li")
            # Iterate over the listings
            for page_li in detail_li:
                # See Figure 5: check the class value to skip the extra ad entries
                if page_li.xpath("@class").get() == "list_app_daoliu":
                    continue
                # See Figure 6: total price of the house
                money = page_li.xpath(".//div[@class='totalPrice']/span/text()").get()
                money = str(money) + "万"
                # See Figure 7: address
                address = page_li.xpath(".//div[@class='positionInfo']/a/text()").get()
                # See Figure 8: get the full house description text and split it
                house_data = page_li.xpath(".//div[@class='houseInfo']/text()").get().split("|")
                # Layout
                house_pattern = house_data[0]
                # Floor area
                house_size = house_data[1].strip()
                # Renovation level
                house_degree = house_data[3].strip()
                # Floor
                house_floor = house_data[4].strip()
                # Unit price, see Figure 9
                price = page_li.xpath(".//div[@class='unitPrice']/span/text()").get().replace("单价", "")
                time.sleep(0.5)
                item = LianjiaItem(city=city, money=money, address=address, house_pattern=house_pattern,
                                   house_size=house_size, house_degree=house_degree,
                                   house_floor=house_floor, price=price)
                yield item
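To make the splitting of the houseInfo text concrete, here is a small sketch using a made-up listing string (real listings vary in wording; this only illustrates why the spider picks indices 0, 1, 3 and 4):

    # Hypothetical houseInfo text; actual listings differ, this only illustrates the split
    house_info = "2室1厅 | 89.5平米 | 南 北 | 精装 | 中楼层(共18层) | 2010年建 | 板楼"
    house_data = house_info.split("|")
    house_pattern = house_data[0]          # layout (index 0)
    house_size = house_data[1].strip()     # floor area (index 1)
    house_degree = house_data[3].strip()   # renovation level (index 3)
    house_floor = house_data[4].strip()    # floor (index 4)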

4. Supporting Screenshots

Figure 1 (the city_list_ul element on the city index page)

Figure 2 (the city li tags nested under each province)

Figure 3 (a city's link URL and name)

Figure 4 (the sellListContent list on the listings page)

Figure 5 (the list_app_daoliu ad entry that gets skipped)

Figure 6 (the totalPrice element)

Figure 7 (the positionInfo element)

Figure 8 (the houseInfo element)

Figure 9 (the unitPrice element)


5. Full Code

lianjia.py


   
    # -*- coding: utf-8 -*-
    import scrapy
    import time
    from Lianjia.items import LianjiaItem


    class LianjiaSpider(scrapy.Spider):
        name = 'lianjia'
        allowed_domains = ['lianjia.com']
        start_urls = ['https://www.lianjia.com/city/']

        def parse(self, response):
            ul = response.xpath("//ul[@class='city_list_ul']/li")
            for li in ul:
                data_ul = li.xpath(".//ul/li")
                for li_data in data_ul:
                    city = li_data.xpath(".//a/text()").get()
                    page_url = li_data.xpath(".//a/@href").get() + "/ershoufang/"
                    for i in range(3):
                        url = page_url + "pg" + str(i + 1)
                        print(url)
                        yield scrapy.Request(url=url, callback=self.pageData, meta={"info": city})

        def pageData(self, response):
            print("=" * 50)
            city = response.meta.get("info")
            detail_li = response.xpath("//ul[@class='sellListContent']/li")
            for page_li in detail_li:
                if page_li.xpath("@class").get() == "list_app_daoliu":
                    continue
                money = page_li.xpath(".//div[@class='totalPrice']/span/text()").get()
                money = str(money) + "万"
                address = page_li.xpath(".//div[@class='positionInfo']/a/text()").get()
                # Get the full house description text and split it
                house_data = page_li.xpath(".//div[@class='houseInfo']/text()").get().split("|")
                # Layout
                house_pattern = house_data[0]
                # Floor area
                house_size = house_data[1].strip()
                # Renovation level
                house_degree = house_data[3].strip()
                # Floor
                house_floor = house_data[4].strip()
                # Unit price
                price = page_li.xpath(".//div[@class='unitPrice']/span/text()").get().replace("单价", "")
                time.sleep(0.5)
                item = LianjiaItem(city=city, money=money, address=address, house_pattern=house_pattern,
                                   house_size=house_size, house_degree=house_degree,
                                   house_floor=house_floor, price=price)
                yield item

items.py


   
    # -*- coding: utf-8 -*-
    import scrapy


    class LianjiaItem(scrapy.Item):
        # City
        city = scrapy.Field()
        # Total price
        money = scrapy.Field()
        # Address
        address = scrapy.Field()
        # Layout
        house_pattern = scrapy.Field()
        # Floor area
        house_size = scrapy.Field()
        # Renovation level
        house_degree = scrapy.Field()
        # Floor
        house_floor = scrapy.Field()
        # Unit price
        price = scrapy.Field()

pipelines.py


   
    import pymysql


    class LianjiaPipeline:
        def __init__(self):
            dbparams = {
                'host': '127.0.0.1',
                'port': 3306,
                'user': 'root',          # database user
                'password': 'root',      # database password
                'database': 'lianjia',   # database name
                'charset': 'utf8'
            }
            # Initialize the database connection
            self.conn = pymysql.connect(**dbparams)
            self.cursor = self.conn.cursor()
            self._sql = None

        def process_item(self, item, spider):
            # Execute the insert statement
            self.cursor.execute(self.sql, (item['city'], item['money'], item['address'],
                                           item['house_pattern'], item['house_size'],
                                           item['house_degree'], item['house_floor'], item['price']))
            self.conn.commit()  # commit
            return item

        @property
        def sql(self):
            if not self._sql:
                # Insert statement
                self._sql = """
                    insert into lianjia(id, city, money, address, house_pattern, house_size, house_degree, house_floor, price)
                    values(null, %s, %s, %s, %s, %s, %s, %s, %s)
                """
                return self._sql
            return self._sql
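A small addition worth considering (not part of the original code, just a sketch): close the cursor and connection when the spider finishes, so the MySQL connection is released cleanly. Scrapy calls close_spider on a pipeline automatically when the crawl ends.

    # Optional method to add to LianjiaPipeline (sketch)
    def close_spider(self, spider):
        # Release the database resources when the crawl finishes
        self.cursor.close()
        self.conn.close()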

settings.py


   
    # -*- coding: utf-8 -*-
    BOT_NAME = 'Lianjia'

    SPIDER_MODULES = ['Lianjia.spiders']
    NEWSPIDER_MODULE = 'Lianjia.spiders'

    LOG_LEVEL = "ERROR"

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'Lianjia (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36 Edg/84.0.522.63"
    }

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Lianjia.middlewares.LianjiaSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Lianjia.middlewares.LianjiaDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'Lianjia.pipelines.LianjiaPipeline': 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

6. Results

The full dataset is far larger than the 518 rows shown here; I stopped the crawl after a short while, so this is just a demonstration.

- END -



Reposted from: https://blog.csdn.net/lyc2016012170/article/details/110153167