Web Scraping Project 6: Scraping Tencent Video (小言_互联网的博客)


2020-09-07 00:21

Scraping Tencent Video

Goal
Project Preparation
Site Analysis
Anti-Scraping Analysis
Code Implementation
Results

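Goal

Scrape Tencent Video to obtain the link to a TV series or movie, then pass that link to a parsing interface in order to watch VIP videos.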

Project Preparation

Software: PyCharm
Third-party libraries: requests, fake_useragent, selenium, lxml
Site: https://v.qq.com/

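Site Analysis

Open the site.
Search for 庆余年, then search for a second title (here 沉睡魔咒2) and compare the two result URLs:

https://v.qq.com/x/search/?q=%E5%BA%86%E4%BD%99%E5%B9%B4&stag=102&smartbox_ab=
https://v.qq.com/x/search/?q=%E6%B2%89%E7%9D%A1%E9%AD%94%E5%92%922&stag=102&smartbox_ab=

Only the value after q= changes. It is the title that was entered, percent-encoded as UTF-8.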
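
The observation above can be reproduced with the standard library: `urllib.parse.quote` yields the same percent-encoded UTF-8 form seen in the address bar. This is only a minimal sketch. The helper name `build_search_url` is an assumption made for illustration and is not part of the original project; the `stag=102` value is simply copied from the URLs above.

```python
from urllib.parse import quote, unquote

# Query template copied from the two search URLs shown above.
SEARCH_TEMPLATE = 'https://v.qq.com/x/search/?q={}&stag=102&smartbox_ab='


def build_search_url(name):
    """Return the search URL for a title (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the title.
    return SEARCH_TEMPLATE.format(quote(name))


# Encoding the first title reproduces the first URL above.
print(build_search_url('庆余年'))
# Decoding the q= value of the second URL recovers the second title.
print(unquote('%E6%B2%89%E7%9D%A1%E9%AD%94%E5%92%922'))
```

Encoding the title explicitly also keeps characters such as `&` or spaces in a title from breaking the query string.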

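Anti-Scraping Analysis

Repeated requests from the same IP address risk getting that address blocked. As a basic countermeasure, fake_useragent is used here to generate a random User-Agent request header, so the requests do not announce themselves with the default python-requests User-Agent. This varies only the header, not the IP address.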
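
To show what a randomized User-Agent header amounts to, here is a small standard-library sketch. It does not use fake_useragent: the `USER_AGENTS` pool and the `random_headers` helper are assumptions made for illustration, and the strings in the pool are only examples of the usual browser format.

```python
import random

# Illustrative pool of browser-style User-Agent strings (an assumption,
# standing in for the values fake_useragent would supply).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.1.2 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) '
    'Gecko/20100101 Firefox/79.0',
]


def random_headers():
    """Return a headers dict with one randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


headers = random_headers()
print(headers['User-Agent'])
```

The resulting dict can be passed as the `headers` argument of `requests.get`, in the same way `self.headers` is used in the code of this post.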

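Code Implementation

1. Import the required third-party libraries and define a class that inherits from object. Give it an __init__ method and a main method, each taking self as its first parameter.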

import requests
from lxml import etree
from selenium import webdriver
from fake_useragent import UserAgent


class tencent_movie(object):
    def __init__(self):
        # verify_ssl is an option of older fake_useragent releases;
        # drop it if your installed version rejects it.
        ua = UserAgent(verify_ssl=False)
        # A single random User-Agent is picked once per spider instance.
        self.headers = {'User-Agent': ua.random}

    def main(self):
        pass


if __name__ == '__main__':
    spider = tencent_movie()
    spider.main()

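2. Send the request and fetch the page.

    def get_html(self, url):
        # Fetch the page with the random User-Agent header.
        response = requests.get(url, headers=self.headers, timeout=10)
        # Decode the raw bytes explicitly as UTF-8.
        html = response.content.decode('utf-8')
        return html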

3. Parse the page and extract the link.

    def parse_html(self, html):
        target = etree.HTML(html)
        # Links of the search results; the first one is taken as the match
        links = target.xpath('//h2[@class="result_title"]/a/@href')
        host = links[0]
        res = requests.get(host, headers=self.headers)
        con = res.content.decode('utf-8')
        new_html = etree.HTML(con)
        first_select = int(input('1. TV series\n2. Movie\n'))
        if first_select == 1:
            # Title (episode number) of each episode
            titles = new_html.xpath('//div[@class="mod_episode"]/span/a/span/text()')
            # Link of each episode
            new_links = new_html.xpath('//div[@class="mod_episode"]/span/a/@href')
            for title in titles:
                print('Episode %s' % title)
            select = int(input('Which episode do you want to watch? (enter a number) '))
            new_link = new_links[select - 1]
            last_host = 'https://api.akmov.net/?url=' + new_link
        else:
            # For a movie, the result link can be used directly
            last_host = 'https://api.akmov.net/?url=' + host  # parsing API + link
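
The two episode XPath queries above select every `a` element under `div.mod_episode/span`, taking its `href` and the text of its inner `span`. The sketch below illustrates that selection on a made-up, well-formed fragment. As an assumption for portability it uses the standard library's `xml.etree.ElementTree` rather than lxml, so the attribute and the text are read in Python instead of through `/@href` and `/text()`. The sample markup, the example.com links, and the `extract_episodes` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the episode list the XPath targets.
SAMPLE = """
<html><body>
  <div class="mod_episode">
    <span><a href="https://example.com/ep1.html"><span>1</span></a></span>
    <span><a href="https://example.com/ep2.html"><span>2</span></a></span>
  </div>
</body></html>
"""


def extract_episodes(markup):
    """Return (title, link) pairs for each episode anchor."""
    root = ET.fromstring(markup.strip())
    episodes = []
    # Same path as in the post, limited to ElementTree's XPath subset.
    for a in root.findall(".//div[@class='mod_episode']/span/a"):
        label = a.find('span')
        title = (label.text or '').strip() if label is not None else ''
        episodes.append((title, a.get('href')))
    return episodes


for title, link in extract_episodes(SAMPLE):
    print('Episode %s: %s' % (title, link))
```

Real pages are rarely well-formed XML, which is why the post parses them with `etree.HTML` from lxml; the stand-in above only demonstrates which nodes the path selects.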

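4. Open the URL in an automated browser (this code continues inside parse_html).

        # executable_path is the Selenium 3 way to point at chromedriver;
        # Selenium 4 passes the path through a Service object instead.
        self.driver = webdriver.Chrome(executable_path=r'C:\Users\acer\AppData\Local\Google\Chrome\Application\chromedriver.exe')
        self.driver.maximize_window()  # maximize the browser window
        self.driver.get(last_host)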
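
The URL handed to the browser is built in step 3 by indexing the episode list with `select - 1`. In Python an input of 0 becomes index -1 and silently picks the last episode, while a number that is too large raises IndexError. A minimal sketch of a bounds-checked version follows. The helper name `build_player_url` is an assumption made for illustration; the prefix is the one used in this post.

```python
# Parsing-interface prefix as used in this post.
PARSER_PREFIX = 'https://api.akmov.net/?url='


def build_player_url(links, choice):
    """Return the player URL for a 1-based episode choice (hypothetical helper)."""
    # Reject 0, negatives, and numbers past the last episode up front,
    # instead of letting negative indexing pick the wrong episode.
    if not 1 <= choice <= len(links):
        raise ValueError('episode must be between 1 and %d' % len(links))
    return PARSER_PREFIX + links[choice - 1]


episode_links = ['https://example.com/ep1.html', 'https://example.com/ep2.html']
print(build_player_url(episode_links, 2))
```

The example.com links are placeholders; in the spider the list would be the `new_links` result from step 3.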

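5. The main function, which ties the calls together.

    def main(self):
        name = input('Enter the name of a TV series or movie: ')
        # The title is inserted as-is; requests percent-encodes non-ASCII characters.
        url = 'https://v.qq.com/x/search/?q={}&stag=0&smartbox_ab='.format(name)
        html = self.get_html(url)
        self.parse_html(html)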

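Results

The browser then opens the URL automatically.

The video can now be watched.
TV series demo

Here, too, the browser is opened automatically.
The complete code is as follows: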

import requests
from lxml import etree
from selenium import webdriver
from fake_useragent import UserAgent


class tencent_movie(object):
    def __init__(self):
        # verify_ssl is an option of older fake_useragent releases;
        # drop it if your installed version rejects it.
        ua = UserAgent(verify_ssl=False)
        # A single random User-Agent is picked once per spider instance.
        self.headers = {'User-Agent': ua.random}

    def get_html(self, url):
        # Fetch the search page and decode it as UTF-8.
        response = requests.get(url, headers=self.headers, timeout=10)
        html = response.content.decode('utf-8')
        return html

    def parse_html_tengxun(self, html):
        target = etree.HTML(html)
        # Links of the search results; the first one is taken as the match
        links = target.xpath('//h2[@class="result_title"]/a/@href')
        host = links[0]
        res = requests.get(host, headers=self.headers)
        con = res.content.decode('utf-8')
        new_html = etree.HTML(con)
        first_select = int(input('1. TV series\n2. Movie\n'))
        if first_select == 1:
            # Title (episode number) and link of each episode
            titles = new_html.xpath('//div[@class="mod_episode"]/span/a/span/text()')
            new_links = new_html.xpath('//div[@class="mod_episode"]/span/a/@href')
            for title in titles:
                print('Episode %s' % title)
            select = int(input('Which episode do you want to watch? (enter a number) '))
            new_link = new_links[select - 1]
            last_host = 'https://api.akmov.net/?url=' + new_link
        else:
            # For a movie, the result link can be used directly
            last_host = 'https://api.akmov.net/?url=' + host
        # executable_path is the Selenium 3 way to point at chromedriver;
        # Selenium 4 passes the path through a Service object instead.
        self.driver = webdriver.Chrome(executable_path=r'C:\Users\acer\AppData\Local\Google\Chrome\Application\chromedriver.exe')
        self.driver.maximize_window()
        self.driver.get(last_host)

    def main(self):
        name = input('Enter the name of a TV series or movie: ')
        url = 'https://v.qq.com/x/search/?q={}&stag=0&smartbox_ab='.format(name)
        html = self.get_html(url)
        self.parse_html_tengxun(html)


if __name__ == '__main__':
    spider = tencent_movie()
    spider.main()

Disclaimer: for personal study and reference only.

Reposted from: https://blog.csdn.net/qq_44862120/article/details/107531719
