Feidao's Blog (飞道的博客)

BeautifulSoup in Detail



1. Installation

pip install beautifulsoup4 lxml

2. Importing

from bs4 import BeautifulSoup as bs

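3. Loading an HTML document

soup = bs(doc, 'lxml')
# doc is a string containing an HTML document; the parser fills in missing or unclosed tags.
# 'lxml' selects the parser (it requires the lxml package); the parser bundled with Python is 'html.parser'.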
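
To make the tag completion concrete, here is a minimal sketch. It assumes bs4 is installed and uses 'html.parser' so that nothing beyond the standard library parser is needed; the markup string is made up for illustration. With 'lxml' the result would additionally be wrapped in html and body tags.

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: neither <div> nor <p> is closed.
doc = "<div><p>hello"

# 'html.parser' ships with Python, so no extra package is required.
soup = BeautifulSoup(doc, "html.parser")

# The open tags are closed when the tree is built.
print(soup)  # <div><p>hello</p></div>
```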

4. Converting the document tree to a string

soup.prettify()
# returns the tree as an indented string
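
A small sketch of what prettify() gives back, assuming a made-up one-line list as input. It returns a new string with one node per line and leaves the tree itself unchanged.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

# prettify() returns a str; the soup object is not modified.
pretty = soup.prettify()
print(pretty)

# The compact form is still available through str().
compact = str(soup)
print(compact)
```

Because prettify() inserts line breaks and indentation around text nodes, it is best used for reading the structure, not for producing markup to serve.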

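5. Finding elements with BeautifulSoup

(1) find(): looks up a single element node and returns the first node that matches, or None if nothing matches.
(2) find_all(): returns every node that matches.

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

self indicates that this is a method of a class; name is the tag name to look for; attrs is a dictionary of attributes the element must have; recursive defaults to True, which searches the whole subtree below the node (False restricts the search to direct children); …
(3) find_all() returns a list (a bs4.element.ResultSet) in which every item is a bs4.element.Tag object; find() returns a single Tag.
(4) To get the text an element contains: tag.text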
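
The difference between find() and find_all() can be sketched as follows. The HTML snippet and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/news">News</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                             # a single Tag
navs = soup.find_all("a", attrs={"class": "nav"})  # list of matching Tags
limited = soup.find_all("a", limit=2)              # stop after two matches
missing = soup.find("table")                       # no match, so None

print(first.text)                  # Home
print([a["href"] for a in navs])   # ['/home', '/news']
print(len(limited), missing)       # 2 None
```

Since find() can return None, check the result before reading .text from it.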

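6. Traversing the document tree

tag.parent: the parent node of tag
tag.children: all direct child nodes of tag, including element nodes, text nodes, and other node types
tag.descendants: all descendant nodes of tag, including element nodes, text nodes, and other node types
tag.next_sibling: the sibling node immediately after tag
tag.previous_sibling: the sibling node immediately before tag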
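
The navigation attributes above can be tried on a small tree. This is only a sketch: the markup is made up and written without whitespace between tags, so every sibling is an element.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First</p><p>Second <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

parent_name = first_p.parent.name                 # 'div'
children = [c.name for c in soup.div.children]    # direct children only
descendants = list(soup.div.descendants)          # tags and text nodes, all levels
next_sib = first_p.next_sibling                   # <p>Second <b>bold</b></p>
prev_sib = first_p.previous_sibling               # <h1>Title</h1>

print(parent_name, children, len(descendants))
print(next_sib.name, prev_sib.name)
```

On real pages the tags are usually separated by line breaks, and those line breaks are text nodes too, so next_sibling often returns a whitespace string instead of the next element.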

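7. Finding elements with CSS selectors

(1) tag.select(css): tag is an element node in the HTML document
General form of a selector: tagName[attName=value], where the element name, the attribute, and the attribute value are each optional
(2) Attribute selectors:
[attName]: selects every element that has the given attribute
[attName=value]: selects every element whose attribute equals value
[attName^=value]: matches every element whose attribute value starts with value
[attName$=value]: matches every element whose attribute value ends with value
[attName*=value]: matches every element whose attribute value contains value
(3) Combinators:
When a selector contains several nodes, separate them with spaces:
soup.select("div p"): all p nodes that are descendants of a div
soup.select("div > p"): all p nodes that are direct children of a div
soup.select("div ~ p"): all p sibling nodes that follow a div at the same level
soup.select("div + p"): the p sibling node that immediately follows a div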
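
The selectors listed above can be compared side by side on one document. The markup, the example.com and example.org URLs, and the helper texts() are all invented for this sketch.

```python
from bs4 import BeautifulSoup

html = (
    "<main>"
    '<div id="a"><p class="intro">p1</p><section><p>p2</p></section></div>'
    "<p>p3</p><p>p4</p>"
    '<a href="https://example.com/x.pdf">pdf</a>'
    '<a href="http://example.org/">org</a>'
    "</main>"
)
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Hypothetical helper: text of every element matched by selector."""
    return [t.text for t in soup.select(selector)]

print(texts("div p"))                   # ['p1', 'p2']  any descendant
print(texts("div > p"))                 # ['p1']        direct children only
print(texts("div ~ p"))                 # ['p3', 'p4']  all following siblings
print(texts("div + p"))                 # ['p3']        the adjacent sibling only
print(texts("a[href]"))                 # ['pdf', 'org']
print(texts("a[href^='https']"))        # ['pdf']
print(texts("a[href$='.pdf']"))         # ['pdf']
print(texts("a[href*='example.org']"))  # ['org']
print(texts("p[class='intro']"))        # ['p1']
```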

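8. Character encoding

    import urllib.request
    from bs4 import BeautifulSoup as bs
    from bs4 import UnicodeDammit

    # url is the address of the page to fetch; it must be defined before this point
    data = urllib.request.urlopen(url)
    data = data.read()  # raw bytes
    # Try the listed encodings in order and convert the bytes to a Unicode string
    dammit = UnicodeDammit(data, ['gbk', 'utf-8'])
    data = dammit.unicode_markup
    soup = bs(data, 'lxml')
    # Placeholder selector: replace the attribute value and the trailing dots with a real selector
    tags = soup.select("div[class='attribute-value'] span.....")
    for tag in tags:
        print(tag)

9. Example: scraping the 7-day forecast for Lanzhou from China Weather (weather.com.cn)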


import urllib.request
from bs4 import BeautifulSoup
'''
Character encoding conversion (optional; requires "from bs4 import UnicodeDammit"):
    data=data.read()
    dammit = UnicodeDammit(data,['gbk','utf-8'])
    data = dammit.unicode_markup
'''

url = "http://www.weather.com.cn/weather/101160101.shtml"
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Mobile Safari/537.36'
}
req = urllib.request.Request(url=url, headers=headers)
print(type(req))
data = urllib.request.urlopen(req)
print(type(data))
data = data.read()
print(type(data))
data = data.decode()  # bytes -> str, UTF-8 by default
print(type(data))
soup = BeautifulSoup(data, 'lxml')
# Each <li> under <ul class="t clearfix"> holds one day's forecast
lis = soup.select("ul[class='t clearfix'] li")
print(lis)
for li in lis:
    try:
        # Query the current item (li), not the whole result list (lis)
        dtime = li.select("h1")[0].text
        weather = li.select("p[class='wea']")[0].text
        tem = li.select("p[class='tem'] span")[0].text + "/" + li.select("p[class='tem'] i")[0].text
        print(dtime, weather, tem)
    except Exception as err:
        print(err)

>>>>>> Output:
<class 'urllib.request.Request'>
<class 'http.client.HTTPResponse'>
<class 'bytes'>
<class 'str'>
[<li class="sky skyid lv3 on">
<h1>20日(今天)</h1>
<big class="png40"></big>
<big class="png40 n01"></big>
<p class="wea" title="多云">多云</p>
<p class="tem">
<i>8</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv2">
<h1>21日(明天)</h1>
<big class="png40 d00"></big>
<big class="png40 n00"></big>
<p class="wea" title="晴"></p>
<p class="tem">
<span>26</span>/<i>9</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="E" title="东风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv1">
<h1>22日(后天)</h1>
<big class="png40 d01"></big>
<big class="png40 n01"></big>
<p class="wea" title="多云">多云</p>
<p class="tem">
<span>29</span>/<i>11</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="SE" title="东南风"></span>
</em>
<i>3-4级转&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>23日(周五)</h1>
<big class="png40 d07"></big>
<big class="png40 n07"></big>
<p class="wea" title="小雨">小雨</p>
<p class="tem">
<span>23</span>/<i>9</i>
</p>
<p class="win">
<em>
<span class="E" title="东风"></span>
<span class="NE" title="东北风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>24日(周六)</h1>
<big class="png40 d07"></big>
<big class="png40 n07"></big>
<p class="wea" title="小雨">小雨</p>
<p class="tem">
<span>14</span>/<i>7</i>
</p>
<p class="win">
<em>
<span class="NE" title="东北风"></span>
<span class="NE" title="东北风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv3">
<h1>25日(周日)</h1>
<big class="png40 d02"></big>
<big class="png40 n01"></big>
<p class="wea" title="阴转多云">阴转多云</p>
<p class="tem">
<span>23</span>/<i>8</i>
</p>
<p class="win">
<em>
<span class="NE" title="东北风"></span>
<span class="NE" title="东北风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>, <li class="sky skyid lv2">
<h1>26日(周一)</h1>
<big class="png40 d01"></big>
<big class="png40 n07"></big>
<p class="wea" title="多云转小雨">多云转小雨</p>
<p class="tem">
<span>26</span>/<i>8</i>
</p>
<p class="win">
<em>
<span class="W" title="西风"></span>
<span class="NE" title="东北风"></span>
</em>
<i>&lt;3</i>
</p>
<div class="slid"></div>
</li>]
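
In the output above, the first li (today) has no span inside p class="tem", so indexing select(...)[0] on it raises IndexError. The sketch below shows one way to tolerate the missing element. It runs on two abridged li items copied from that output instead of the live page, and the names sample and parse_day are made up for the example.

```python
from bs4 import BeautifulSoup

# Two <li> items copied (abridged) from the output above.
sample = """
<ul class="t clearfix">
<li class="sky skyid lv3 on">
<h1>20日(今天)</h1>
<p class="wea" title="多云">多云</p>
<p class="tem">
<i>8</i>
</p>
</li>
<li class="sky skyid lv1">
<h1>22日(后天)</h1>
<p class="wea" title="多云">多云</p>
<p class="tem">
<span>29</span>/<i>11</i>
</p>
</li>
</ul>
"""

def parse_day(li):
    """Hypothetical helper: return (date, weather, temperature) for one <li>."""
    date = li.select_one("h1").get_text(strip=True)
    weather = li.select_one("p[class='wea']").get_text(strip=True)
    high = li.select_one("p[class='tem'] span")  # None when the <span> is absent
    low = li.select_one("p[class='tem'] i")
    # Keep only the temperature parts that are actually present.
    parts = [t.get_text(strip=True) for t in (high, low) if t is not None]
    return date, weather, "/".join(parts)

soup = BeautifulSoup(sample, "html.parser")
rows = [parse_day(li) for li in soup.select("ul[class='t clearfix'] li")]
for row in rows:
    print(*row)
```

select_one() returns None when nothing matches, which makes the missing case easy to test for, whereas select(...)[0] fails with an exception.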



Reposted from: https://blog.csdn.net/qq_43636709/article/details/115915511