小言_互联网的博客

A Scraper for Shenzhen Bus Routes


       I recently needed a scraper for a course project, so I found a tutorial online and, learning as I went, wrote one that collects the routes and stop information for Shenzhen's bus lines. To keep a record of my original approach in case I need to reuse or improve the scraper later, I am writing it up in this post.

       The scraper is built mainly on Requests + BeautifulSoup: Requests provides the functions for fetching pages, and BeautifulSoup parses them so we can quickly locate the information we need in the returned HTML. Beyond that, the os library is imported for output, and pandas converts the collected information into the format needed for a .csv file. The full code is below:


import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

kv = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}#browser identity sent in the request headers

def getHTMLText(url): #fetch a page and return its text
    try:
        coo = 'thw=cn; v=0; cna=5X1VFf9fTXQCATGNwJx/mYM8; t=0c7d094551823e1719118c805f9e3725; cookie2=112db93e4fac2151b08a825efb50cff4; _tb_token_=e5b3755745e50; lgc=jhcatharnice; dnk=jhcatharnice; tracknick=jhcatharnice; tg=0; uc3=id2=UU20sZyBQC8Xew%3D%3D&lg2=Vq8l%2BKCLz3%2F65A%3D%3D&nk2=CdsbI0szN7jN44qS&vt3=F8dByuHZ4aYXVENr0EQ%3D; csg=e3608010; skt=68182e0838039878; existShop=MTU2OTY4MTAwOQ%3D%3D; _cc_=UIHiLt3xSw%3D%3D; enc=RoEaVC%2FoJDN9u%2FVTHODuMp7ya5g7uIO8uGLYLuEVPlGEZ%2B0v8mjhGMVlRC5BBXOWQnV%2FIV1guEmy6QphjwEYhQ%3D%3D; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; hng=CN%7Czh-CN%7CCNY%7C156; _m_h5_tk=6824bbc20aae3928b8b83cfa1467330a_1570685537172; _m_h5_tk_enc=aba98fcd08acb92d9fa0598d7259b9a5; mt=ci=-1_0; uc1=cookie14=UoTbnV5ry%2B5y8g%3D%3D&cookie15=UtASsssmOIJ0bQ%3D%3D; JSESSIONID=D585DBCC43FB5FC4D65B9B6369951931; l=cBrebiBVqBRoLH4yBOfZlurza77T0CRflsPzaNbMiIB19mCaCd326HwBgG3wL3QQE9fEFexzzRH22RFeW94Z9KbgjKtrCyCl.; isg=BPr6FvfGc-oFXP-Ym_wquJXcSyYcq36F-wHz-gT3Bg0t95Yx7jlvlbjNR8OO5_Yd'#raw cookie string copied from the browser
        cookies = {}
        for line in coo.split(';'):
            name,value = line.strip().split('=', 1)
            cookies[name] = value
        r = requests.get(url,cookies = cookies,headers = kv)
        r.raise_for_status()#raise an exception for non-2xx responses
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillNetworkList(url,html):#collect every bus route and its stop sequence
    glist = []
    soup = BeautifulSoup(html,"html.parser")
    all_a = soup.find(name = 'div',attrs = {'class':'list'}).find_all('a')
    print(all_a[0].string)
    for a in all_a:#iterate over the route groups
        href = a['href'] #href attribute of the <a> tag
        next_url = url + href
        next_html = getHTMLText(next_url)
        next_soup = BeautifulSoup(next_html,"html.parser")
        nextall_a = next_soup.find(name = 'div',attrs = {'class':'list clearfix'}).find_all('a')
        print(nextall_a[0].string)
        for next_a in nextall_a:#each concrete bus route and its stops
            title = next_a.text#route name
            href1 = next_a['href']
            bus_url = url + href1
            bus_html = getHTMLText(bus_url)
            bus_soup = BeautifulSoup(bus_html,'html.parser')
            bus_name = title
            bus_station = bus_soup.find_all(name = 'ol')
            name_list = []#stops on this route; name_list[0] is the route title
            name_list.append(title)
            for station_list in bus_station:#collect the stop names
                s = station_list.find_all(name = 'a')
                for station_name in s:
                    if station_name.text in name_list:
                        break
                    name_list.append(station_name.text)
            for i in range(1,len(name_list)-1):#pair consecutive stops into edges, once per route
                glist.append((name_list[i],name_list[i+1]))
            print(name_list)
    return glist

def SaveText(glist,filename,mode = 'w'):#write the edge list to a .csv file
    name=['source','target']
    test=pd.DataFrame(columns=name,data=glist)
    test.to_csv(filename,encoding='gbk',mode=mode,index=False)#honor the mode argument and drop the row index

    
def main():
    start_url =  "https://shenzhen.8684.cn/"
    start_html = getHTMLText(start_url)#fetch the start page
    glist = fillNetworkList(start_url,start_html)#collect routes and stop edges
    SaveText(glist,'/Users/apple/desktop/Spiderforexercise/QuanzhouBusStation.csv')#save the results

main()
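The core of fillNetworkList is the step that turns each route's ordered station list into consecutive (source, target) pairs. That pairing logic, isolated into a small standalone sketch (the station names here are made up for illustration):

```python
def stations_to_edges(name_list):
    # name_list[0] is the route title; stops begin at index 1,
    # so pairing runs from index 1 up to the second-to-last entry
    return [(name_list[i], name_list[i + 1]) for i in range(1, len(name_list) - 1)]

route = ['Line M1', 'Stop A', 'Stop B', 'Stop C']
print(stations_to_edges(route))  # [('Stop A', 'Stop B'), ('Stop B', 'Stop C')]
```

Running the pairing once per route, after the station list is complete, yields each edge exactly once.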
 

       That is all of the code; the environment is Python 3.7. The scraper first fetches each page with the get method from Requests, setting a user-agent and a cookie to get past basic anti-scraping checks, then parses the page with BeautifulSoup's BeautifulSoup function, and finally saves the extracted information to a .csv file. Pulling out the needed information requires inspecting the page source to see which tags hold it. Analysis of the site shows that the details of each bus route sit inside a <div class="list clearfix"> tag, while the list of all bus routes is kept in a <div class="list"> tag. So it is enough to collect the URL of each route and then parse out the route's details. This is a fairly simple scraper; a later step could be to look up the real-world location of each stop, which would give the actual route map.
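The two div selectors described above can be exercised on a small hand-written HTML fragment (hypothetical markup that only mimics the site's structure; note that BeautifulSoup matches a multi-class string like 'list clearfix' against the exact class attribute value):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: route groups in <div class="list">,
# concrete routes in <div class="list clearfix">.
html = '''
<div class="list"><a href="/line1">Lines starting with 1</a></div>
<div class="list clearfix"><a href="/x-101">Bus 101</a><a href="/x-102">Bus 102</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
groups = soup.find('div', attrs={'class': 'list'}).find_all('a')
routes = soup.find('div', attrs={'class': 'list clearfix'}).find_all('a')
print([(a.text, a['href']) for a in routes])  # [('Bus 101', '/x-101'), ('Bus 102', '/x-102')]
```

Each route's text gives its name and each href, joined onto the base URL, gives the page with the stop list.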

 
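As a quick check of the output format, the pandas conversion in SaveText can be reproduced on a couple of hand-made edges; calling to_csv with no path returns the CSV as a string instead of writing a file:

```python
import pandas as pd

# Two sample edges in the same (source, target) shape the scraper produces
glist = [('Stop A', 'Stop B'), ('Stop B', 'Stop C')]
df = pd.DataFrame(columns=['source', 'target'], data=glist)
csv_text = df.to_csv(index=False)  # index=False drops the row-number column
print(csv_text)
```

The resulting file has a 'source,target' header row followed by one edge per line, which is a convenient format to feed into graph tools later.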


Reprinted from: https://blog.csdn.net/NOtargetSaltyfish/article/details/102526158