Crawler Request Modules

2020-05-19 18:30

1. urllib.request

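1.1 Versions

Python 2: urllib2 and urllib
Python 3: urllib and urllib2 were merged; requests are made through urllib.request

1.2 Common methods

urllib.request.urlopen("URL"): sends a request to the site and returns the response
response.read() returns the body as bytes
response.read().decode("utf-8") returns the body as a str
urllib.request.Request("URL", headers=dict): builds a request object with custom headers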

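1.3 Methods of the response object

read() reads the content of the server's response
getcode() returns the HTTP status code
geturl() returns the URL the data actually came from (useful when the request was redirected)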

import urllib.request

url = 'https://www.yuque.com/docs/share/95002244-0097-4da2-b44c-53a9f69ce3a0?#'
# response is the response object
response = urllib.request.urlopen(url)
print(response.getcode(), response.geturl())

This prints the HTTP status code followed by the final URL.

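Workflow:

Build a request object with Request()

Get the response object with urlopen()

Read the content from the response object with read().decode('utf-8')

Example: fetching the Baidu home page (https://www.baidu.com/)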

import urllib.request

url = 'https://www.baidu.com/'

headers = {  # User-Agent copied from the browser while on any Baidu page
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

# Build the request object
req = urllib.request.Request(url, headers=headers)

# Get the response object
response = urllib.request.urlopen(req)

# Read and decode the response body
html = response.read().decode('utf-8')

print(html)

2. urllib.parse

URL encoding
urlencode(): takes a dict
quote(): takes a single string

2.1 urlencode()
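As a minimal sketch of urlencode(): it takes a dict of query parameters and returns a percent-encoded query string. The search URL and the parameter names wd and pn below are assumptions chosen for illustration; they do not come from the original post.

```python
from urllib.parse import urlencode

# Hypothetical query parameters for a search page
params = {'wd': '爬虫', 'pn': 10}

# Percent-encodes each value as UTF-8 and joins the pairs with '&'
query = urlencode(params)
url = 'https://www.baidu.com/s?' + query

print(query)
print(url)
```

Non-ASCII values such as Chinese keywords cannot appear raw in a URL, which is why the dict is encoded before it is appended.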

2.2 quote()
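A short sketch of quote(), which encodes a single string rather than a dict, so the parameter name has to be written into the URL by hand. The keyword and the URL are illustrative assumptions.

```python
from urllib.parse import quote, unquote

keyword = '爬虫'  # hypothetical search keyword

# Percent-encode one string (UTF-8 by default)
encoded = quote(keyword)
url = 'https://www.baidu.com/s?wd=' + encoded

# unquote() reverses the encoding
decoded = unquote(encoded)

print(encoded)
print(url)
print(decoded)
```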

3. Request methods

3.1 GET

Characteristic: the query parameters are visible in the URL
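A GET request can be sketched as follows: the encoded parameters are appended to the URL and no data argument is passed. The endpoint, the parameter and the shortened User-Agent are assumptions for illustration; the request is built but not sent, so the snippet runs without network access.

```python
import urllib.request
from urllib.parse import urlencode

base_url = 'https://www.baidu.com/s'   # assumed search endpoint
params = {'wd': 'python'}              # hypothetical query parameter

# For GET, the encoded parameters go into the URL after '?'
url = base_url + '?' + urlencode(params)

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent
req = urllib.request.Request(url, headers=headers)

print(req.get_method())  # no data argument, so the method is GET
print(req.full_url)

# Sending the request needs network access:
# html = urllib.request.urlopen(req).read().decode('utf-8')
```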

3.2 POST

Pass the data argument to Request():
- urllib.request.Request(url, data=data, headers=headers)
- data: the form data must be submitted as bytes, not str
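The bytes requirement can be sketched like this: urlencode() produces a str, so it is encoded to bytes before being handed to Request(). The endpoint and the form fields are assumptions for illustration; the request is only constructed, not sent.

```python
import urllib.request
from urllib.parse import urlencode

url = 'http://www.httpbin.org/post'        # assumed test endpoint
form = {'i': 'hello', 'doctype': 'json'}   # hypothetical form fields

# urlencode() returns a str; encode() turns it into the bytes Request expects
data = urlencode(form).encode('utf-8')

headers = {'User-Agent': 'Mozilla/5.0'}    # placeholder User-Agent
req = urllib.request.Request(url, data=data, headers=headers)

print(type(data))
print(req.get_method())  # supplying data switches the method to POST

# Sending the request needs network access:
# html = urllib.request.urlopen(req).read().decode('utf-8')
```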

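4. The requests module

4.1 Installation

pip install requests
Or install it from within the IDE:
- File | Settings | Project: Python_Space | Python Interpreter
- Click the '+' button, type the name of the module to install, then click Install Package at the bottom

4.2 Common requests methods

requests.get(url)
requests.post(url, data=None, headers=...)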

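4.3 Attributes of the response object

response.text returns the body decoded to a Unicode string (str)
response.content returns the body as raw bytes
response.content.decode('utf-8') decodes the bytes manually
response.url returns the URL
response.encoding = 'encoding name' sets the encoding used to decode response.text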
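The relationship between content, text and encoding can be sketched without a network call. The snippet below uses a plain bytes value as a stand-in for response.content (an assumption made purely for illustration) and shows why decoding with the wrong encoding yields garbled text.

```python
# Stand-in for response.content: raw bytes as a server might send them (UTF-8)
content = '年龄'.encode('utf-8')

# Equivalent of response.content.decode('utf-8'): bytes -> str with the right encoding
text_ok = content.decode('utf-8')

# Decoding the same bytes with the wrong encoding produces garbled characters,
# similar to what response.text shows when response.encoding is wrong
text_bad = content.decode('iso-8859-1')

print(text_ok)
print(text_ok == text_bad)
```

This is why setting response.encoding before reading response.text, or decoding response.content by hand, is the usual fix for garbled Chinese output.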

4.4 Sending a POST request

Using Youdao Translate as the example

import json

import requests

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

key = input("Enter the text to translate: ")

data = {
    'i': key,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15883243260471',
    'sign': '1c78fd284dd4167388b13dd5a1823367',
    'ts': '1588324326047',
    'bv': '70244e0061db49a9ee62d341c5fed82a',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_REALTlME',
}

response = requests.post(url, data=data, headers=headers)  # send the POST request

response.encoding = 'utf-8'

html = response.text  # type(html) is <class 'str'>
# print(html) gives, for example:
# {"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"age","tgt":"年龄"}]]}
# Convert it to a dict so the result is easy to extract.

# json.loads() converts the str in html into a dict
result = json.loads(html)

# print(type(result))  # <class 'dict'>

print(result['translateResult'][0][0]['tgt'])

4.5 Setting a proxy with requests

To use a proxy with requests, pass the proxies argument to the request method (get/post).
Proxy sites:
Xici free proxy IPs: http://www.xicidaili.com/
Kuaidaili: http://www.kuaidaili.com/
Dailiyun: http://www.dailiyun.com/

# Proxy sites
#   Xici free proxy IPs: http://www.xicidaili.com/
#   Kuaidaili: http://www.kuaidaili.com/
#   Dailiyun: http://www.dailiyun.com/ (more cumbersome to use)
import requests

# Configure the proxy; pick a working proxy IP from one of the sites above
proxy = {
    'http': '116.196.85.150:3128'
}

url = 'http://www.httpbin.org/ip'

res = requests.get(url, proxies=proxy)  # test the proxy

print(res.text)

4.6 cookie

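cookie: identifies the user through information recorded on the client side.
HTTP is a stateless protocol: the interaction between client and server is limited to a single request/response exchange, and the connection is closed afterwards. On the next request the server treats the client as a new one. For the server to know that a request comes from the same user as before, the client's information has to be stored somewhere.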
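The mechanism can be sketched with the standard library: the server hands out a cookie in a Set-Cookie header, and the client sends it back in a Cookie header on the next request. The header value, the cookie name sessionid and the endpoint are hypothetical; the request is built but not sent.

```python
import urllib.request
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value received in an earlier response
set_cookie = 'sessionid=abc123; Path=/; HttpOnly'

# Parse it on the client side
jar = SimpleCookie()
jar.load(set_cookie)

# Build the Cookie header that the next request sends back
cookie_header = '; '.join(f'{name}={morsel.value}' for name, morsel in jar.items())

req = urllib.request.Request(
    'http://www.httpbin.org/cookies',   # assumed test endpoint
    headers={'Cookie': cookie_header},
)

print(cookie_header)
print(req.get_header('Cookie'))
```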

4.7 session

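session: identifies the user through information recorded on the server side. Here "session" means a conversation between the client and the server.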
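On the client side, requests provides requests.Session, which carries headers and cookies across requests so the server can tie them to the same session. The sketch below is a minimal illustration under stated assumptions: the cookie value, the shortened User-Agent and the endpoint are hypothetical, and the request is only prepared, not sent.

```python
import requests

session = requests.Session()

# Settings on the session apply to every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0'})   # placeholder User-Agent
session.cookies.set('sessionid', 'abc123')               # hypothetical cookie value

# Prepare (but do not send) a request to inspect what the session would transmit
req = requests.Request('GET', 'http://www.httpbin.org/cookies')  # assumed test endpoint
prepared = session.prepare_request(req)

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])

# Sending it needs network access:
# response = session.send(prepared)
```

In normal use, session.get() and session.post() do the preparing and sending in one step, and cookies set by the server are stored on the session automatically.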

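4.8 Handling untrusted SSL certificates

What is an SSL certificate?
- An SSL certificate is a kind of digital certificate, similar to an electronic copy of a driving licence, passport or business licence. Because it is installed on the server, it is also called an SSL server certificate. It follows the SSL protocol and is issued by a trusted certificate authority (CA) after the CA has verified the server's identity. It provides server authentication and encryption of the data in transit.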

import requests

url ='https://inv-veri.chinatax.gov.cn/'

# response = requests.get(url)
# print(response.text)
# Raises requests.exceptions.SSLError: HTTPSConnectionPool(host='inv-veri.chinatax.gov.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1056)')))

response = requests.get(url, verify=False)  # skip certificate verification

print(response.text)  # succeeds

5. requests source code analysis

An analysis of the source code behind the requests module's API

Reposted from: https://blog.csdn.net/qq_42149144/article/details/105841811
