Web Scraping Notes 20: Mouse Action Chains, More on selenium, Headless Browser Mode, Scraping Maoyan Movies, Scraping JD

2021-06-01 09:02

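I. Mouse Action Chains
Sometimes an operation on a page takes several steps. In that case you can use the ActionChains class to queue the steps and run them together.
Steps:
(1) from selenium.webdriver import ActionChains
(2) actions = ActionChains(driver)  # instantiate
(3) queue the actions
(4) actions.perform()  # submit the action chain

Things to watch out for:
1. Never forget to submit the action chain with perform(); nothing runs until you do.
2. Be careful with click operations inside the chain: click() without an argument clicks at the current mouse position, so either move to the element first or pass the element to click().

Example actions:
click_and_hold(element): press the mouse button on the element without releasing it.
context_click(element): right-click.
double_click(element): double-click.
For more methods, see: http://selenium-python.readthedocs.io/api.html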

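from selenium import webdriver
from selenium.webdriver import ActionChains

# Written against Selenium 3: the find_element_by_* helpers used below
# were removed in Selenium 4.3.
driver = webdriver.Chrome(r'C:\Users\Administrator\Desktop\chromedriver_win32\chromedriver.exe')
driver.get('https://www.baidu.com/')

# Locate the search box
inputTag = driver.find_element_by_id('kw')

# Locate the "百度一下" (search) button
buttonTag = driver.find_element_by_id('su')

# Instantiate the action chain
actions = ActionChains(driver)

# Type text into the search box
actions.send_keys_to_element(inputTag, 'python')

# Pause for one second inside the chain; a time.sleep() here would run
# before perform() and would not delay the queued actions
actions.pause(1)

# Move the mouse over the button, then click at the current mouse position
actions.move_to_element(buttonTag)
actions.click()

# Submit the action chain
actions.perform()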


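II. More on selenium
1. driver.page_source returns the HTML source of the page.
selenium extracts data from the page as finally rendered in the browser (the front-end DOM), which may differ from the raw HTTP response body.
2. find(), called on the page source string, checks whether a substring occurs in the HTML: it returns the index of the first occurrence (a non-negative integer) if the substring is present, and -1 if it is not.

This is useful for pagination (see Section V, Scraping JD, below).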
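As a small illustration of the return values of find(), here is a sketch that runs on plain strings instead of a live driver.page_source. The HTML fragment and the helper name contains are made up for the example.

```python
def contains(html: str, needle: str) -> bool:
    """Return True if needle occurs anywhere in html (hypothetical helper)."""
    # str.find returns the index of the first match, or -1 when there is none
    return html.find(needle) != -1


# Made-up fragment standing in for driver.page_source
sample_html = '<ul><li><a href="?offset=10">下一页</a></li></ul>'

print(sample_html.find('<ul>'))       # match at the very start of the string
print(sample_html.find('上一页'))      # substring is absent
print(contains(sample_html, '下一页'))
```

When only presence matters, `'下一页' in sample_html` expresses the same check more directly.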

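3. element.get_attribute(): gets the value of an attribute of a node (element).
For example, to get the first image on https://maoyan.com/board/4:

Inspecting the page shows that the value of the img tag's src attribute is the address of that image.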

from selenium import webdriver
import time

driver = webdriver.Chrome(r'C:\Users\Administrator\Desktop\chromedriver_win32\chromedriver.exe')
driver.get('https://maoyan.com/board/4')
time.sleep(1)

# Locate the img tag with an XPath copied from the browser's developer tools
img_tag = driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/a/img[2]')
print(img_tag.get_attribute('src'))  # Get the value of the img tag's src attribute

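Result: the script prints the URL of the image. Opening that URL in a new browser tab displays the image.

4. element.text: gets the text content of a node (element), including the text of its child and descendant nodes.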

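III. Headless Browser Mode (enable it only after the program already works with a visible browser)

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(r'C:\Users\01\Desktop\chromedriver.exe',options=options)

Because headless mode is enabled, calling driver.get('https://maoyan.com/board/4') does not open a visible browser window for that page.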

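IV. Scraping Maoyan Movies
Goal: use selenium to scrape the Maoyan Top 100 (https://maoyan.com/board/4) and collect, for each movie, its rank, title, starring actors, release date, and score.

Step 1: Identify the target URL, https://maoyan.com/board/4, analyze the page structure, and choose a suitable technique.
Analysis shows that all the data sits inside a dl tag, and each dd tag inside it is one movie.

So we can locate every dd tag with XPath and print its text to see what the content looks like.

We handle one page of movies first, then the remaining pages.
Pagination:
On the first page, the pager contains a "下一页" (next page) link.

On page 10, the last page, the next-page link is gone.

So we can locate the link with find_element_by_link_text('下一页'), which matches a hyperlink by its visible text. Alternatively, the all-purpose XPath approach works too: driver.find_element_by_xpath('//*[@id="app"]/div/div/div[2]/ul/li[8]/a[text()="下一页"]').click()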

from selenium import webdriver


options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(r'C:\Users\01\Desktop\chromedriver.exe',options=options)
driver.get('https://maoyan.com/board/4')

def get_one_page():
    dd_lst = driver.find_elements_by_xpath('//*[@id="app"]/div/div/div[1]/dl/dd')
    for dd in dd_lst:
        # .text returns the text of the dd node together with its child and descendant nodes
        # Print it first to check whether the output follows a pattern; if it does, parse it
        # print(dd.text)
        # print('*'*80)
        one_film_info_lst = dd.text.split('\n')
        item = {}
        try:
            item['rank'] = one_film_info_lst[0].strip()
            item['name'] = one_film_info_lst[1].strip()
            item['actor'] = one_film_info_lst[2].strip()
            item['time'] = one_film_info_lst[3].strip()
            item['score'] = one_film_info_lst[4].strip()
        except IndexError:
            # Fewer lines than expected: keep the fields collected so far
            pass
        print(item)

while True:

    get_one_page()

    try:
        # If the next-page link cannot be found, an exception is raised,
        # which means this is the last page
        driver.find_element_by_link_text('下一页').click()
    except Exception:
        driver.quit()
        break
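
The field-by-field parsing in get_one_page() can also be written without a try block. The sketch below pairs field names with lines using zip, which simply stops at the shorter of the two. The helper name parse_film_text and the sample text are made up for illustration, and the sketch assumes dd.text yields one field per line in the order rank, title, actors, release date, score.

```python
FIELDS = ('rank', 'name', 'actor', 'time', 'score')


def parse_film_text(text: str) -> dict:
    """Map the lines of one dd.text string onto named fields (hypothetical helper)."""
    # Drop blank lines and surrounding whitespace before pairing
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    # zip stops at the shorter input, so missing trailing fields are just omitted
    return dict(zip(FIELDS, lines))


# Made-up sample in the assumed layout, not real scraped data
sample = '1\nSome Film\n主演：Actor A,Actor B\n上映时间：2000-01-01\n9.5'
print(parse_film_text(sample))
```

If a dd element has fewer lines than expected, the result is a shorter dict rather than an exception, which matches what the try/except version prints.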


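V. Scraping JD

XPath for the search box: //*[@id="key"]
XPath for the search button: //*[@id="search"]/div/div[2]/button

Step 1: Page analysis

All the data sits inside a ul tag, and each li tag under it corresponds to one book.

Scrolling down the page makes it load more data.

So after entering the page, we scroll to the very bottom, wait a moment for the new elements to load, and only then scrape.

(Each page loads 30 items at first and another 30 once we scroll, so a page actually holds 60 items.)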

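How do we scroll?

We use a method that executes JavaScript.

driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

execute_script: executes a JavaScript statement. (It also comes in handy later when a button cannot be clicked directly.)
In scrollTo(x, y), 0 is the horizontal offset and document.body.scrollHeight is the full height of the page content, so the call scrolls to the bottom of the page.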
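
A single scroll followed by a fixed sleep can miss content on slow connections. One possible variation, sketched below, repeats the scroll until the page height stops growing. The function name scroll_to_bottom and the parameters pause and max_rounds are hypothetical; the only driver method it relies on is execute_script.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops changing; return the final height."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        # Jump to the bottom of the content that is currently loaded
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        # Give lazily loaded items time to arrive
        time.sleep(pause)
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            # Nothing new was appended, so we have reached the real bottom
            break
        last_height = new_height
    return last_height
```

With a live session this would be called as scroll_to_bottom(driver) right before collecting the li elements. max_rounds caps the loop in case the page keeps growing.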

from selenium import webdriver
import time

class JdSpider():
    def __init__(self):
        # Configure Chrome options
        self.options = webdriver.ChromeOptions()
        # --headless runs the browser without a visible window
        self.options.add_argument('--headless')
        self.driver = webdriver.Chrome(r'C:\Users\Administrator\Desktop\chromedriver_win32\chromedriver.exe',options=self.options)
        self.driver.get('https://www.jd.com/')
        # Locate the search box, type the keyword "爬虫书" (web scraping books),
        # then click the search button
        self.driver.find_element_by_xpath('//*[@id="key"]').send_keys('爬虫书')
        time.sleep(1)
        self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()
        time.sleep(1)

    def parse_html(self):

        # After entering the page, scroll all the way to the bottom
        # 0 is the horizontal offset
        # document.body.scrollHeight is the full height of the page content
        self.driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(3)

        # Extract the data for each book; do not forget the trailing /li
        li_lst = self.driver.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
        for li in li_lst:
            print(li.text)
            print('*'*50)

    def main(self):
        self.parse_html()

if __name__ == '__main__':
    spider = JdSpider()
    spider.main()

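Result: the text of each li is unstructured and follows no consistent pattern. So unlike the previous example, here we need XPath to locate the specific fields we want, such as price and publisher.
For example:

Price: li.find_element_by_xpath('.//div[@class="p-price"]/strong').text.strip()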

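After finishing the first page, we need to move on to the next page.
On the last page, the next-page button cannot be clicked.

Its attribute there is: class="pn-next disabled"
The normal next-page button has: class="pn-next"

So when driver.page_source.find('pn-next disabled') returns -1, we are on a normal page and can click the next-page button to continue. The button can be located with XPath: //*[@id="J_bottomPage"]/span[1]/a[9]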
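
The stop condition can be checked in isolation on plain strings. The sketch below wraps it in a helper; the name is_last_page and the two HTML fragments are made up for illustration, and only the class values come from the page analysis above.

```python
DISABLED_MARKER = 'pn-next disabled'


def is_last_page(page_source: str) -> bool:
    """Return True when the next-page button is marked as disabled (hypothetical helper)."""
    # find() returns -1 when the marker is absent, i.e. on a normal page
    return page_source.find(DISABLED_MARKER) != -1


# Made-up fragments standing in for driver.page_source
normal_page = '<a class="pn-next" href="javascript:;">next</a>'
last_page = '<a class="pn-next disabled" href="javascript:;">next</a>'

print(is_last_page(normal_page))
print(is_last_page(last_page))
```

In the spider, the loop would keep clicking the next-page button while is_last_page(self.driver.page_source) is False.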

from selenium import webdriver
import time


class JdSpider():
    def __init__(self):
        # Configure Chrome options
        self.options = webdriver.ChromeOptions()
        # --headless runs the browser without a visible window
        self.options.add_argument('--headless')
        self.driver = webdriver.Chrome(r'C:\Users\Administrator\Desktop\chromedriver_win32\chromedriver.exe',options=self.options)
        self.driver.get('https://www.jd.com/')
        # Locate the search box, type the keyword "爬虫书" (web scraping books),
        # then click the search button
        self.driver.find_element_by_xpath('//*[@id="key"]').send_keys('爬虫书')
        time.sleep(1)
        self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()
        time.sleep(1)


    def parse_html(self):

        # After entering the page, scroll all the way to the bottom
        # 0 is the horizontal offset
        # document.body.scrollHeight is the full height of the page content
        self.driver.execute_script(
            'window.scrollTo(0,document.body.scrollHeight)'
        )
        time.sleep(3)

        # Extract the data for each book; do not forget the trailing /li
        li_lst = self.driver.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
        for li in li_lst:
            # print(li.text)
            # print('*'*50)
            try:
                item = {}
                # XPaths starting with . are relative to the current li
                item['price'] = li.find_element_by_xpath('.//div[@class="p-price"]/strong').text.strip()
                item['name'] = li.find_element_by_xpath('.//div[@class="p-name"]/a/em').text.strip()
                item['commit'] = li.find_element_by_xpath('.//div[@class="p-commit"]/strong').text.strip()
                item['shop'] = li.find_element_by_xpath('.//div[@class="p-shopnum"]/a').text.strip()
                print(item)
            except Exception as e:
                # A missing field raises; report it and move on to the next li
                print(e)


    def main(self):
        while True:
            self.parse_html()
            # Keep clicking "next page" until the button is marked as disabled
            if self.driver.page_source.find('pn-next disabled') == -1:
                self.driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[9]').click()
                time.sleep(1)
            else:
                self.driver.quit()
                break

if __name__ == '__main__':
    spider = JdSpider()
    spider.main()


Reposted from: https://blog.csdn.net/weixin_49167820/article/details/117404584
