Python Daily Practice (22): Scraping and Analyzing Python Job Data from Lagou (2020 Edition)

2020-09-14 15:11

1. Example Description

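Python has surged in popularity in recent years. Many students have started learning it, and working professionals are eager to switch careers, with high hopes for this emerging field. Setting the enthusiasm aside, this article takes a rational look at the current demand for Python positions.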

2. Data Acquisition

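Data source: a crawler collects, from the Python channel on Lagou, the job postings for each city, together with the company, job perks, salary, and related fields. The full list of fields appears in the table definition in the example code.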

2.1 Technical Points

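The requests module fetches the Python job information from Lagou pages and extracts the needed content; sqlalchemy creates the data table and saves the content to the database; pyecharts, matplotlib, and seaborn then visualize the data.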

2.2 Crawling Approach

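Request the URL https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput= and parse the response with a regular expression to get every city and its code.
Use requests.session to visit the URL https://www.lagou.com/jobs/list_python/p-city_2?px=default#filterBox (2 stands for Beijing; to crawl another city, substitute its code) so that the cookies are stored in the session object. Then use the same session to request https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false (the city parameter is the URL-encoded name 北京, Beijing). Note that this is a POST request and that the site checks the Referer header as an anti-crawling measure, so the request headers must include Referer.
Parse the returned data.
Do not crawl a large amount of data in one go. Lagou limits the number of requests, and after a certain number the program will stop on its own. It is better to request the cities in batches of ten or so. If your IP gets flagged, you can also buy a dynamic proxy service and send the requests through it.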
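The steps above can be sketched without touching the network. This is only an illustration under stated assumptions: `SAMPLE_HTML` is a made-up fragment of the listing page (the code 2 for 北京 comes from the article, the code for 上海 is invented), and the helper names `parse_city_codes`, `build_ajax_request`, and `batches` are hypothetical.

```python
import json
import re
from urllib.parse import quote

# Hypothetical fragment of the listing page; the real page is far larger
# and the 上海 code is invented for this example.
SAMPLE_HTML = 'global.cityNumMap = {"北京": 2, "上海": 3};'


def parse_city_codes(html):
    """Extract the city-name -> code mapping embedded in the page's JavaScript."""
    match = re.search(r'global\.cityNumMap = (\{.*?\});', html)
    return json.loads(match.group(1)) if match else {}


def build_ajax_request(city, code, page):
    """Return (url, headers, form data) for one result page, without sending it."""
    url = ("https://www.lagou.com/jobs/positionAjax.json"
           "?px=default&city=%s&needAddtionalResult=false" % quote(city))
    # The Referer points at the listing page of the same city
    headers = {"Referer": "https://www.lagou.com/jobs/list_python/p-city_%s?px=default" % code}
    data = {"first": "true", "pn": page, "kd": "python"}
    return url, headers, data


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


city_codes = parse_city_codes(SAMPLE_HTML)
url, headers, data = build_ajax_request("北京", city_codes["北京"], page=1)
print(url)
print(headers["Referer"])
print(batches(list(city_codes), 1))
```

Building the request pieces in a pure function keeps the Referer and the POST form data in one place, and `batches` gives a simple way to crawl ten or so cities per run.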

2.3 Example Code

The crawler code is as follows:

import requests
import re
import json
import time
import multiprocessing
from lagou_spider.handle_insert_data import lagou_mysql


class HandleLaGou(object):
    def __init__(self):
        self.lagou_session = requests.Session()
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
        }
        self.city_dict = dict()  # maps each city name to its code
        self.city_name_list = list()  # names of all cities
        # Dynamic-proxy API endpoint and current proxy address (not used elsewhere in this listing)
        self.api_url = "http://dynamic.goubanjia.com/dynamic/get/4fb8e6f11b87aab615e95c55691b669e.html"
        self.ip_port = ""

    # Fetch the codes of all cities nationwide
    def handle_city_code(self):
        city_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
        city_result = self.handle_request(method='GET', url=city_url)
        city_code = re.search(r'global.cityNumMap = (.*);', city_result)
        if city_code:
            self.city_dict = json.loads(city_code.group(1))
            # list() so the names are a real list rather than a dict view
            self.city_name_list = list(self.city_dict.keys())
        # Clear the cookies picked up by this request
        self.lagou_session.cookies.clear()

    def handle_city(self, city):
        first_request_url = "https://www.lagou.com/jobs/list_python/p-city_%s?px=default" % self.city_dict[city]
        # Pass the city along so that handle_request can refresh the cookies if rate-limited
        first_response = self.handle_request(method="GET", url=first_request_url, info=city)
        try:
            total_page = int(re.search(r'span totalNum.*?(\d+)</span>', first_response).group(1))
            print(city)
        # Some cities have no postings for this job, so re.search returns None; skip them
        except AttributeError:
            return
        else:
            for i in range(1, total_page + 1):
                # Form parameters sent with the POST request
                data = {
                    'first': 'true',
                    'pn': i,
                    'kd': 'python'
                }
                page_url = "https://www.lagou.com/jobs/positionAjax.json?px=default&city=%s&needAddtionalResult=false" % city
                referer_url = "https://www.lagou.com/jobs/list_python/p-city_%s?px=default" % self.city_dict[city]
                # The Referer header is required, otherwise the anti-crawling check rejects the request
                self.headers["Referer"] = referer_url
                response = self.handle_request(method="POST", url=page_url, data=data, info=city)
                lagou_data = json.loads(response)
                job_list = lagou_data["content"]["positionResult"]["result"]
                for job in job_list:
                    lagou_mysql.insert_item(job)

    def handle_request(self, method, url, data=None, info=None):
        while True:
            if method == "GET":
                response = self.lagou_session.get(url=url, headers=self.headers)
            elif method == "POST":
                response = self.lagou_session.post(url=url, headers=self.headers, data=data)
            else:
                raise ValueError("unsupported method: %s" % method)
            response.encoding = "utf8"
            # "频繁" ("too frequent") appears in the response when requests are being rate-limited
            if "频繁" in response.text:
                print(response.text)
                # Clear the cookies first, then fetch fresh ones from the city's listing page
                self.lagou_session.cookies.clear()
                if info in self.city_dict:  # info is None when no city was given
                    first_request_url = "https://www.lagou.com/jobs/list_python/p-city_%s?px=default" % self.city_dict[info]
                    self.lagou_session.get(url=first_request_url, headers=self.headers)
                time.sleep(10)
                continue
            return response.text


if __name__ == '__main__':
    lagou = HandleLaGou()
    lagou.handle_city_code()
    # pool = multiprocessing.Pool(5)  # only needed for the commented-out parallel version below
    # city_list = list(set(lagou.city_name_list))
    # Crawl one small batch of cities per run to stay under the request limit
    city_list = ["宁波", "常州", "沈阳", "石家庄", "昆明", "南昌",
                 "南宁", "哈尔滨", "海口", "中山", "惠州", "贵阳", "长春", "太原", "嘉兴", "泰安", "昆山", "烟台", "兰州", "泉州"]

    for city_name in city_list:
        lagou.handle_city(city_name)
        # pool.apply_async(lagou.handle_city, args=(city_name,))
    # pool.close()  # close the pool; it accepts no new tasks afterwards
    # pool.join()  # wait for the worker processes to finish
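The crawler above indexes straight into `lagou_data["content"]["positionResult"]["result"]`, which raises if the site returns an error page or a payload without those keys. A minimal defensive sketch follows; `extract_jobs` is a hypothetical helper and `sample` is a made-up, heavily trimmed response body.

```python
import json


def extract_jobs(response_text):
    """Return the list of job dicts from a positionAjax.json response, or [] if absent."""
    try:
        node = json.loads(response_text)
    except ValueError:  # not JSON, for example an HTML error page
        return []
    # Walk down content -> positionResult -> result, stopping at anything unexpected
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    return node if isinstance(node, list) else []


# Hypothetical, heavily trimmed response body
sample = json.dumps({"content": {"positionResult": {"result": [
    {"positionId": 1, "city": "北京", "salary": "15k-25k"}]}}})

print(extract_jobs(sample))
print(extract_jobs("<html>busy</html>"))
```

Returning an empty list lets the page loop skip a bad response instead of aborting the whole city.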

The code that creates the database table is as follows:

from sqlalchemy import create_engine, Integer, String, Float, Column
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

# 1. Initialize the database connection
engine = create_engine("mysql+pymysql://root:mysql@127.0.0.1:3306/lagou_spider?charset=utf8")
# Working with the database requires a session
DBSession = sessionmaker(bind=engine)  # create the DBSession class
Base = declarative_base()  # create the base class for mapped objects


# Lagoutables must inherit from the base class created above
class Lagoutables(Base):
    # Table name
    __tablename__ = 'lagou_data1'
    # Table structure
    id = Column(Integer, primary_key=True, autoincrement=True)  # id: primary key, auto-increment
    positionID = Column(Integer, nullable=False)  # position ID, must not be null
    longitude = Column(Float, nullable=True)  # longitude
    latitude = Column(Float, nullable=True)  # latitude
    positionName = Column(String(length=50), nullable=False)  # position name
    workYear = Column(String(length=20), nullable=False)  # years of experience
    education = Column(String(length=20), nullable=False)  # education
    jobNature = Column(String(length=20), nullable=True)  # job nature
    financeStage = Column(String(length=30), nullable=True)  # company type (financing stage)
    companySize = Column(String(length=30), nullable=True)  # company size
    industryField = Column(String(length=30), nullable=True)  # business field
    city = Column(String(length=10), nullable=False)  # city
    positionAdvantage = Column(String(length=200), nullable=True)  # position tags
    companyShortName = Column(String(length=50), nullable=True)  # company short name
    companyFullName = Column(String(length=200), nullable=True)  # company full name
    district = Column(String(length=20), nullable=True)  # district where the company is located
    companyLabelList = Column(String(length=200), nullable=True)  # company benefit tags
    salary = Column(String(length=20), nullable=False)  # salary
    crawl_date = Column(String(length=20), nullable=False)  # crawl date


if __name__ == '__main__':
    # Create the table
    Lagoutables.metadata.create_all(engine)

The code that stores the data in the database is as follows:

from lagou_spider.create_lagou_table import Lagoutables
from lagou_spider.create_lagou_table import DBSession
import time


class HandleLagouData(object):
    def __init__(self):
        self.mysql_session = DBSession()  # instantiate the session

    # Store one job item
    def insert_item(self, item):
        date = time.strftime("%Y-%m-%d", time.localtime())  # today
        # The record to store
        data = Lagoutables(
            positionID=item['positionId'],  # position ID
            longitude=item['longitude'],  # longitude
            latitude=item['latitude'],  # latitude
            positionName=item['positionName'],  # position name
            workYear=item['workYear'],  # years of experience
            education=item['education'],  # education
            jobNature=item['jobNature'],  # job nature
            financeStage=item['financeStage'],  # company type (financing stage)
            companySize=item['companySize'],  # company size
            industryField=item['industryField'],  # business field
            city=item['city'],  # city
            positionAdvantage=item['positionAdvantage'],  # position tags
            companyShortName=item['companyShortName'],  # company short name
            companyFullName=item['companyFullName'],  # company full name
            district=item['district'],  # district where the company is located
            # Company benefit tags; "or []" guards against a missing (None) label list
            companyLabelList=','.join(item['companyLabelList'] or []),
            salary=item['salary'],  # salary
            crawl_date=date  # crawl date
        )
        # Before storing, check whether this position was already saved today
        query_result = self.mysql_session.query(Lagoutables).filter(
            Lagoutables.crawl_date == date,
            Lagoutables.positionID == item['positionId']).first()
        if query_result:
            # "This position already exists"
            print('该岗位信息已存在%s:%s:%s' % (item['positionId'], item['city'], item['positionName']))
        else:
            self.mysql_session.add(data)  # insert the record
            self.mysql_session.commit()  # commit to the database
            # "New position added"
            print('新增岗位信息%s' % item['positionId'])

lagou_mysql = HandleLagouData()
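The query-then-insert check above keeps one row per position per crawl date. As an alternative sketch, the same rule can be pushed into the database with a unique constraint. Note the swap: this example uses the standard-library sqlite3 module and an in-memory database rather than the article's SQLAlchemy and MySQL setup, and the table `lagou_demo` and the sample job are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lagou_demo (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        positionID INTEGER NOT NULL,
        city TEXT NOT NULL,
        crawl_date TEXT NOT NULL,
        UNIQUE (positionID, crawl_date)
    )
""")


def insert_item(conn, item, crawl_date):
    """Insert one job; return True if a row was added, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO lagou_demo (positionID, city, crawl_date) VALUES (?, ?, ?)",
        (item["positionId"], item["city"], crawl_date),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint made SQLite skip the insert
    return cur.rowcount == 1


job = {"positionId": 1001, "city": "北京"}  # hypothetical job item
first = insert_item(conn, job, "2020-09-14")
second = insert_item(conn, job, "2020-09-14")    # same position, same day
next_day = insert_item(conn, job, "2020-09-15")  # same position, new day
print(first, second, next_day)
```

With the constraint in place the duplicate check needs no separate SELECT, and it still holds if several crawler processes write at the same time.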

3. Analysis of Lagou Job Data

3.1 Current Demand

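Sample size: 3,995 postings. Most postings ask for 1-5 years of experience (67% of the sample), with 3-5 years in highest demand. Python only became popular in the last few years, so employers do not yet ask for especially long experience; most require 1-5 years. So if you want to work with Python, it is not too late.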
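The sample size and the 67% figure can be checked directly from the experience counts used in the pie chart in this section. A small sketch, where the variable names are chosen only for this example:

```python
# Experience categories and counts, taken from the pie-chart data in this section
labels = ['1-3年', '3-5年', '不限', '5-10年', '应届毕业生', '1年以下', '10年以上']
counts = [1016, 1642, 452, 499, 346, 24, 16]

total = sum(counts)
by_label = dict(zip(labels, counts))

# Share of postings asking for 1-3 or 3-5 years of experience
one_to_five = by_label['1-3年'] + by_label['3-5年']
one_to_five_share = 100 * one_to_five / total

# Category with the most postings
top_label = max(by_label, key=by_label.get)

print(total)
print(round(one_to_five_share))
print(top_label)
```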

The pyecharts code for the pie chart is as follows:

from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.render import make_snapshot
from snapshot_selenium import snapshot


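def pie_base() -> Pie:
    # Experience categories: 1-3 yrs, 3-5 yrs, no requirement, 5-10 yrs,
    # new graduates, under 1 yr, over 10 yrs
    labels = ['1-3年', '3-5年', '不限', '5-10年', '应届毕业生', '1年以下', '10年以上']
    counts = [1016, 1642, 452, 499, 346, 24, 16]
    c = (
        Pie()
        .add("", [list(z) for z in zip(labels, counts)])
        # Title: "Share of postings by experience"
        .set_global_opts(title_opts=opts.TitleOpts(title="岗位经验占比分析"))
        # Label each slice as "name: count"
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c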


make_snapshot(snapshot, pie_base().render(), "bar0.png")

3.2 Industry Distribution

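A company can belong to several industries. During data processing, a company that belongs to N industries is counted as N companies. Demand for Python is high in mobile internet, finance, and e-commerce, and lower in traditional industries. A wave of companies specializing in data services has also emerged, and demand there is considerable as well.
The code is as follows: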

import pandas as pd
import matplotlib.pyplot as plt

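df = pd.read_excel('lagou.xlsx')
df1 = df.head(20)  # first 20 rows
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
x = df1['业务方向']  # column "business field"
y = df1['数量']  # column "count"
# Widen the left margin so the labels fit
plt.subplots_adjust(left=0.2)
# Hide the tick marks on the bottom and left axes
plt.tick_params(bottom=False, left=False)
# One tick per bar
plt.yticks(range(len(df1)))
# Chart title: "2020 Python industry comparison"
plt.title('2020年Python行业对比分析')
plt.barh(x, y, color='Turquoise')  # turquoise bars
plt.savefig("test.png")
plt.show()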
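The counting rule described in this section (a company in N industries counts N times) could be sketched as follows. The sample rows are made up, and the separators between industry names are an assumption about how the `industryField` values are written; `count_industries` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical industryField values; the separators are an assumption
industry_fields = ["移动互联网,金融", "电商", "金融、数据服务", None, "移动互联网 电商"]


def count_industries(values):
    """Count each industry once per posting that lists it."""
    counter = Counter()
    for value in values:
        if not value:  # skip missing fields
            continue
        # Split on commas, enumeration commas, pipes, slashes, or whitespace
        for name in re.split(r"[,，、|/\s]+", value):
            if name:
                counter[name] += 1
    return counter


counts = count_industries(industry_fields)
for name, n in counts.most_common():
    print(name, n)
```

The resulting name and count pairs are what a two-column sheet such as the one read by the plotting code would contain.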

3.3 Company Development Stage

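In terms of company development stage, large, established companies account for most of the Python postings: mature companies that need no financing and listed companies together make up about half of the sample (2,006 of 3,995).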

The matplotlib code for the pie chart is as follows:

from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
plt.figure(figsize=(5, 3))  # figure size

# Stages: mature (no financing needed), listed, startup (unfunded), growth (series A),
# growth (series B), startup (angel round), mature (series C), mature (series D and later)
labels = ['成熟型(不需要融资)', '上市公司', '初创型(未融资)', '成长型(A轮)', '成长型(B轮)', '初创型(天使轮)', '成熟型(C轮)', '成熟型(D轮及以上)']
sizes = [1161, 845, 775, 368, 328, 199, 180, 139]
# Color of each wedge
colors = ['slateblue', 'green', 'magenta', 'cyan', 'darkorange', 'lawngreen', 'pink', 'gold']
plt.pie(sizes,  # data to plot
        labels=labels,  # wedge labels
        colors=colors,  # custom fill colors
        autopct='%.1f%%',  # percentage format, one decimal place
        # radius=1,  # pie radius
        pctdistance=0.85,
        startangle=180,
        textprops={'fontsize': 9, 'color': 'k'},  # text label properties
        wedgeprops={'width': 0.4, 'edgecolor': 'k'})  # ring width and edge color
plt.title('python岗位公司规模情况分析')  # title as in the original: "Python postings by company size"
plt.savefig('公司规模.png')
plt.show()

3.4 Education Requirements

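Python positions currently require mainly a bachelor's degree or above: about 78% of postings (3,119 of 3,995) ask for a bachelor's degree, while few require a master's degree or doctorate (215 in total). Artificial-intelligence roles tend to set higher education requirements. Exceptional candidates with a junior-college diploma (大专) can also qualify.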

The seaborn code for the simple bar chart is as follows:

import seaborn as sns
import matplotlib.pyplot as plt

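sns.set_style('darkgrid')
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
# Bachelor's, no requirement, junior college, master's, doctorate
x = ["本科", "不限", "大专", "硕士", "博士"]
y = [3119, 221, 440, 209, 6]
sns.barplot(x=x, y=y)  # keyword arguments; newer seaborn versions reject positional x and y
plt.title("Python岗位学历分布对比")  # "Python postings by education requirement"
plt.legend(["学历"])  # legend entry: "education"
plt.savefig("柱状图.png")
plt.show()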

3.5 Job Perks

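Readers with more time can go on to analyze other data, such as salary and skill requirements. The approach in this article carries over to other recruitment sites as well. The content above is for technical study and exchange only. Do not collect the data for commercial use; otherwise you bear the consequences yourself, and the blogger accepts no responsibility. If anything here infringes your rights, contact the blogger to have it removed. Writing this took effort, so thanks for reading.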

Reposted from: https://blog.csdn.net/xw1680/article/details/108145168
