1. Installation
pip install bs4
# The 'lxml' parser used below is a separate package: pip install lxml
2. Import
from bs4 import BeautifulSoup as bs
3. Loading an HTML document
soup = bs(doc, 'lxml')
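# doc is a string containing an HTML document; missing tags are completed automatically. 'lxml' selects the parser used for the document; the parser bundled with Python is 'html.parser'.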
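The automatic completion mentioned above can be seen with a small sketch. The markup string here is a made-up example, and the sketch uses 'html.parser' instead of 'lxml' only so that it runs without an extra install; the exact repaired output can differ between parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately incomplete markup: <p>, <body> and <html> are never closed.
doc = "<html><body><p>Hello, <b>world</b>"

# 'html.parser' ships with Python, so no extra package is needed for this sketch.
soup = BeautifulSoup(doc, "html.parser")

# Serializing the tree shows the closing tags that the parser filled in.
repaired = str(soup)
print(repaired)
```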
4. Converting the document tree to a string
soup.prettify()
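A minimal sketch of what prettify() produces, assuming a tiny made-up document: it returns a str with one node per line, indented by nesting depth.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for illustration.
soup = BeautifulSoup("<div><p>text</p></div>", "html.parser")

# prettify() returns a string; it does not modify the tree.
pretty = soup.prettify()
print(pretty)

# Each tag and each text node ends up on its own line.
lines = [line.strip() for line in pretty.splitlines()]
```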
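5. Finding elements with BeautifulSoup
(1) find()
Finds a single element node: returns the first node that matches, or None if nothing matches.
(2) find_all()
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
self indicates that this is a method of the class; name is the tag name to search for; attrs is a dictionary of the element's attributes; recursive defaults to True, which searches the entire subtree under the node; …
(3) find_all() returns a list in which every element is a bs4.element.Tag object; find() returns a single Tag.
(4) Getting the text an element contains: tag.text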
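The difference between find() and find_all() can be sketched as follows. The HTML fragment and the variable names are assumptions made for illustration, and 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this sketch.
html = """
<ul>
  <li class="item">one</li>
  <li class="item hot">two</li>
  <li>three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                                   # a single Tag
all_items = soup.find_all("li")                           # list-like result of Tags
classed = soup.find_all("li", attrs={"class": "item"})    # filter by attribute
limited = soup.find_all("li", limit=2)                    # stop after two matches
missing = soup.find("table")                              # no match: None

texts = [li.text for li in all_items]
print(first.text, texts, len(classed), len(limited), missing)
```

Note that class is a multi-valued attribute, so the filter {"class": "item"} also matches the element whose class is "item hot".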
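6. Traversing the document tree
tag.parent: the parent node of tag
tag.children: all direct child nodes of tag, including element nodes, text nodes, and other node types
tag.descendants: all descendant nodes of tag, including element nodes, text nodes, and other node types
tag.next_sibling: the sibling node immediately after tag
tag.previous_sibling: the sibling node immediately before tag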
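A short sketch of these navigation attributes, assuming a compact made-up fragment. The markup is written without whitespace between tags on purpose: whitespace would otherwise show up as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with no whitespace between tags.
html = "<div><p>a</p><p>b<span>c</span></p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

parent_name = first_p.parent.name                       # name of the parent element
child_names = [c.name for c in soup.div.children]       # direct children only
# descendants also yields text nodes, so keep only Tag objects here
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]
next_text = first_p.next_sibling.text                   # text of the second <p>
prev = first_p.previous_sibling                         # nothing before the first <p>

print(parent_name, child_names, descendant_tags, next_text, prev)
```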
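7. Finding elements with CSS selectors
(1) tag.select(css)
: tag is an element node in the HTML document
General form of a selector: tagName[attName=value]
Every part is optional: the element name, the attribute name, and the attribute value
(2) Attribute syntax:
[attName]
: selects every element that has the given attribute
[attName=value]
: selects every element whose attribute has exactly the given value
[attName^=value]
: matches every element whose attribute value starts with value
[attName$=value]
: matches every element whose attribute value ends with value
[attName*=value]
: matches every element whose attribute value contains value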
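The attribute selectors above can be tried on a small fragment. The links below are made-up examples, and 'html.parser' is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical links used only for this sketch.
html = """
<a href="https://example.com/index.html">home</a>
<a href="https://example.com/logo.png">logo</a>
<a href="/about.html">about</a>
<a name="top">anchor</a>
"""
soup = BeautifulSoup(html, "html.parser")

has_href = soup.select("a[href]")                  # attribute is present
exact = soup.select('a[href="/about.html"]')       # exact value
starts = soup.select('a[href^="https"]')           # value starts with
ends = soup.select('a[href$=".html"]')             # value ends with
contains = soup.select('a[href*="example"]')       # value contains

for name, result in [("has", has_href), ("exact", exact), ("starts", starts),
                     ("ends", ends), ("contains", contains)]:
    print(name, [a.text for a in result])
```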
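(3) Traversal:
When a selector names several nodes, separate them with a space or a combinator:
soup.select("div p")
: all p nodes that are descendants of a div
soup.select("div > p")
: all p nodes that are direct children of a div
soup.select("div ~ p")
: all p nodes that follow a div as siblings at the same level
soup.select("div + p")
: only the p node that immediately follows a div as its next sibling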
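The four combinators can be compared side by side. This is a minimal sketch on a made-up fragment; the element names and text are assumptions chosen so that each selector returns a different result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <p> inside the div (one nested deeper), two after it.
html = (
    "<section>"
    "<div><p>inner</p><span><p>deep</p></span></div>"
    "<p>first</p><h2>gap</h2><p>second</p>"
    "</section>"
)
soup = BeautifulSoup(html, "html.parser")

descendants = [p.text for p in soup.select("div p")]    # any depth below div
children = [p.text for p in soup.select("div > p")]     # direct children only
siblings = [p.text for p in soup.select("div ~ p")]     # all later siblings
adjacent = [p.text for p in soup.select("div + p")]     # immediately following sibling

print(descendants, children, siblings, adjacent)
```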
9. Character encoding issues
import urllib.request
from bs4 import BeautifulSoup as bs
from bs4 import UnicodeDammit
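# url is the address of the page to fetch and must be defined beforehand
data = urllib.request.urlopen(url)
data = data.read()
# Try gbk first, then utf-8, when decoding the raw bytes
dammit = UnicodeDammit(data, ['gbk', 'utf-8'])
data = dammit.unicode_markup
soup = bs(data, 'lxml')
tags = soup.select("div[class='attribute-value'] span.....")
for tag in tags:
    print(tag)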
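The decoding step can be tried offline, without a network request. The sketch below assumes a made-up GBK-encoded byte string standing in for the bytes returned by read().

```python
from bs4 import UnicodeDammit

# Hypothetical page content, encoded as GBK to imitate bytes read from a response.
raw = "<p>兰州天气</p>".encode("gbk")

# The listed encodings are tried in order before any automatic detection.
dammit = UnicodeDammit(raw, ["gbk", "utf-8"])
text = dammit.unicode_markup          # the decoded str
encoding = dammit.original_encoding   # the encoding that succeeded

print(encoding, text)
```

Putting 'gbk' first matters here: the candidates are tried in the order given, and the first one that decodes the bytes without error is used.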
8. Example: scraping the 7-day forecast for Lanzhou from the China Weather site (weather.com.cn)
import urllib.request
from bs4 import BeautifulSoup
'''
Character encoding conversion
data=data.read()
dammit = UnicodeDammit(data,['gbk','utf-8'])
data = dammit.unicode_markup
'''
url = "http://www.weather.com.cn/weather/101160101.shtml"
headers = {
'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Mobile Safari/537.36'
}
req = urllib.request.Request(url=url,headers=headers)
print(type(req))
data = urllib.request.urlopen(req)
print(type(data))
data = data.read()
print(type(data))
data = data.decode()
print(type(data))
soup = BeautifulSoup(data,'lxml')
lis = soup.select("ul[class='t clearfix'] li")
print(lis)
for li in lis:
    try:
        # Query the current item li, not the whole result list lis
        dtime = li.select("h1")[0].text
        weather = li.select("p[class='wea']")[0].text
        tem = li.select("p[class='tem'] span")[0].text + "/" + li.select("p[class='tem'] i")[0].text
        print(dtime, weather, tem)
    except Exception as err:
        print(err)
Output:
<class 'urllib.request.Request'>
<class 'http.client.HTTPResponse'>
<class 'bytes'>
<class 'str'>
[<li class="sky skyid lv3 on">
<h1>20日(今天)</h1>
<big class="png40"></big>
<big class="png40 n01"></big>
<p class="wea" title="多云">多云</p>
<p class="tem">
<i>8℃</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv2">
<h1>21日(明天)</h1>
<big class="png40 d00"></big>
<big class="png40 n00"></big>
<p class="wea" title="晴">晴</p>
<p class="tem">
<span>26℃</span>/<i>9℃</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="E" title="东风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv1">
<h1>22日(后天)</h1>
<big class="png40 d01"></big>
<big class="png40 n01"></big>
<p class="wea" title="多云">多云</p>
<p class="tem">
<span>29℃</span>/<i>11℃</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="SE" title="东南风"></span>
</em>
<i>3-4级转<3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>23日(周五)</h1>
<big class="png40 d07"></big>
<big class="png40 n07"></big>
<p class="wea" title="小雨">小雨</p>
<p class="tem">
<span>23℃</span>/<i>9℃</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="NE" title="东北风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>24日(周六)</h1>
<big class="png40 d07"></big>
<big class="png40 n07"></big>
<p class="wea" title="小雨">小雨</p>
<p class="tem">
<span>14℃</span>/<i>7℃</i>
</p>
<p class="win">
<em>
<span class="NE" title="东北风"></span>
<span class="NE" title="东北风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>25日(周日)</h1>
<big class="png40 d02"></big>
<big class="png40 n01"></big>
<p class="wea" title="阴转多云">阴转多云</p>
<p class="tem">
<span>23℃</span>/<i>8℃</i>
</p>
<p class="win">
<em>
<span class="NE" title="东北风"></span>
<span class="NE" title="东北风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv2">
<h1>26日(周一)</h1>
<big class="png40 d01"></big>
<big class="png40 n07"></big>
<p class="wea" title="多云转小雨">多云转小雨</p>
<p class="tem">
<span>26℃</span>/<i>8℃</i>
</p>
<p class="win">
<em>
<span class="W" title="西风"></span>
<span class="NE" title="东北风"></span>
</em>
<i><3级</i>
</p>
<div class="slid"></div>
</li>]
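In the output above, the first item (today) has no span with a daytime high, only an i element with the low, so indexing select(...)[0] on it raises IndexError. One possible way to handle that, sketched here on two items copied and trimmed from the output, is select_one(), which returns None when nothing matches. The helper names text_or_none and parse_forecast are hypothetical, and 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Two items taken from the output above, trimmed to the fields of interest.
sample = """
<ul class="t clearfix">
<li class="sky skyid lv3 on">
<h1>20日(今天)</h1>
<p class="wea" title="多云">多云</p>
<p class="tem">
<i>8℃</i>
</p>
</li>
<li class="sky skyid lv2">
<h1>21日(明天)</h1>
<p class="wea" title="晴">晴</p>
<p class="tem">
<span>26℃</span>/<i>9℃</i>
</p>
</li>
</ul>
"""


def text_or_none(node):
    """Return the stripped text of a Tag, or None when the Tag is missing."""
    return node.text.strip() if node is not None else None


def parse_forecast(html):
    """Collect date, weather, high and low for every forecast item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("ul[class='t clearfix'] li"):
        rows.append({
            "date": text_or_none(li.select_one("h1")),
            "weather": text_or_none(li.select_one("p[class='wea']")),
            "high": text_or_none(li.select_one("p[class='tem'] span")),
            "low": text_or_none(li.select_one("p[class='tem'] i")),
        })
    return rows


rows = parse_forecast(sample)
for row in rows:
    print(row)
```

Returning None for a missing field keeps every day in the result, whereas the try/except approach skips the whole item when one field is absent.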
Reposted from: https://blog.csdn.net/qq_43636709/article/details/115915511