[Python Web Scraping] Multithreading Basics (小言_互联网's blog)

2020-06-07 14:10

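What multithreading means

A process can be thought of as a unit of a program that runs independently.

For example:

Opening a browser starts a browser process.
Opening a text editor starts a text editor process.

A single process can handle many things at once.

For example:

A browser can open many pages in separate tabs: some play music, some play video, some play animations, and they all run at the same time without interfering with each other.

How can so many tasks run at the same time?

Each task corresponds to the execution of a thread.

A process is a collection of threads and consists of one or more threads.
A thread is the smallest unit the operating system schedules for execution, and the smallest unit of execution within a process.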
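
To make the process/thread relationship concrete, here is a minimal sketch (the helper name collect_ids is made up for illustration). It starts a few threads and records the process ID and thread name each one sees; all threads should report the same process ID, since they live inside one process.

```python
import os
import threading


def collect_ids(n_threads=3):
    """Start n_threads threads and record (process id, thread name) for each."""
    seen = []

    def worker():
        # Every thread runs inside the same process, so os.getpid() is identical.
        seen.append((os.getpid(), threading.current_thread().name))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


ids = collect_ids()
print(f'distinct process ids: {len({pid for pid, _ in ids})}')
print(f'distinct thread names: {len({name for _, name in ids})}')
```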

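Concurrency and parallelism

Concurrency

At any given instant only one instruction is executing, but the instructions of multiple threads are executed in rapid rotation. At the macro level the threads appear to run simultaneously; at the micro level a single processor is just switching between the threads and executing them in turn.

Concurrency can exist on both single-processor and multi-processor systems; a single core is enough to achieve it.

Parallelism

At the same instant, multiple instructions execute simultaneously on multiple processors. Parallelism depends on having multiple processors (or cores): at both the macro and the micro level, the threads really do execute at the same moment.

Parallelism exists only on multi-processor systems; a computer whose processor has a single core cannot achieve it.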

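When to use multithreading

Within a program, some operations are slow or involve waiting.

For example:

Waiting for a database query to return its result
Waiting for a web page to respond

With a single thread:
The program must wait for the operation to finish before it can do anything else, even though the processor is free to do other work while the thread is waiting.

With multiple threads:
While one thread is waiting, the processor can run another thread, which improves overall efficiency.

A web crawler is a very typical example.
After a crawler sends a request to a server, it has to wait some time for the response to come back. This kind of task is an IO-bound task.

But not every task is IO-bound.
Another kind is the compute-bound task, also called a CPU-bound task, which needs the processor the whole time it is running.

If we start multiple threads here, the processor just switches from one CPU-bound task to another; it never idles and is always busy computing. Multithreading therefore does not improve overall throughput, and the extra thread switching can even make things slower.

If the tasks are not all CPU-bound, multithreading can improve the overall efficiency of the program. For IO-bound work such as web crawling in particular, it can greatly improve overall crawling speed.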
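
The effect on IO-bound work can be sketched without touching the network. In the example below, fake_request is a made-up stand-in that simply sleeps, assuming a real request would spend a similar amount of time waiting for the server. It is an illustration of the idea, not a benchmark.

```python
import threading
import time


def fake_request(delay):
    """Hypothetical stand-in for a network request: the thread just waits."""
    time.sleep(delay)


def run_sequential(delays):
    """Issue the fake requests one after another and return the elapsed time."""
    start = time.perf_counter()
    for d in delays:
        fake_request(d)
    return time.perf_counter() - start


def run_threaded(delays):
    """Issue each fake request in its own thread and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request, args=(d,)) for d in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


delays = [0.2] * 5
print(f'sequential: {run_sequential(delays):.2f}s')
print(f'threaded:   {run_threaded(delays):.2f}s')
```

The sequential run takes about the sum of the delays, while the threaded run takes about as long as the single longest delay, because the waits overlap.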

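Multithreading in Python

The module for multithreading in Python is threading, which ships with the standard library.

There are two ways to create threads with threading:

Creating a child thread directly with Thread
First, you can create a thread with the Thread class. Pass the function to run as the target argument; if that function takes arguments, supply them through Thread's args parameter.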

import threading
import time


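def target(second):
    # Runs in the child thread: report, sleep for `second` seconds, report again.
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

# Start two child threads that sleep for 1 s and 5 s respectively.
for i in [1, 5]:
    thread = threading.Thread(target=target, args=(i,))
    thread.start()

# No join(): the main thread reaches this line without waiting for the children.
print(f'Threading {threading.current_thread().name} is ended')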

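Output (on Python 3.10 and later the default thread names also include the target function, e.g. Thread-1 (target)):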

Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading MainThread is ended
Threading Thread-1 is ended
Threading Thread-2 is ended

To make the main thread wait until the child threads have finished before it exits, call join on each child thread object:

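# Replaces the for loop in the previous script; target() and the imports are unchanged.
threads = []

for i in [1, 5]:
    thread = threading.Thread(target=target, args=(i,))
    threads.append(thread)
    thread.start()

# Block the main thread until every child thread has finished.
for thread in threads:
    thread.join()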

Output:

Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended
Threading MainThread is ended

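Creating a child thread by subclassing Thread
You can also create a thread by subclassing Thread and putting the code the thread should execute in the class's run method. The example above can be rewritten equivalently as: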

import threading
import time


class MyThread(threading.Thread):
    def __init__(self, second):
        super().__init__()
        self.second = second

    def run(self):
        # run() is called in the new thread once start() is invoked.
        print(f'Threading {threading.current_thread().name} is running')
        print(f'Threading {threading.current_thread().name} sleep {self.second}s')
        time.sleep(self.second)
        print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')

threads = []

for i in [1, 5]:
    thread = MyThread(i)
    threads.append(thread)
    thread.start()
    
for thread in threads:
    thread.join()
    
print(f'Threading {threading.current_thread().name} is ended')

Output:

Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 1s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading Thread-1 is ended
Threading Thread-2 is ended
Threading MainThread is ended

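Daemon threads

Threads can be marked as daemon threads. A daemon thread is treated as "unimportant": the Python program exits once all non-daemon threads (including the main thread) have finished, and any daemon thread still running at that point is terminated abruptly.

In Python, a thread is made a daemon by setting its daemon attribute (or passing daemon=True to the constructor) before start() is called. The older setDaemon method does the same thing but has been deprecated since Python 3.10:

import threading
import time


def target(second):
    print(f'Threading {threading.current_thread().name} is running')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is running')
t1 = threading.Thread(target=target, args=(2,))
t1.start()
t2 = threading.Thread(target=target, args=(5,))
t2.daemon = True  # must be set before start()
t2.start()
print(f'Threading {threading.current_thread().name} is ended')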

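Output (on Python 3.10 and later the default thread names also include the target function, e.g. Thread-1 (target)):

Threading MainThread is running
Threading Thread-1 is running
Threading Thread-1 sleep 2s
Threading Thread-2 is running
Threading Thread-2 sleep 5s
Threading MainThread is ended
Threading Thread-1 is ended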

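No join is called here. Thread-2 never prints its "is ended" line because the program exits as soon as Thread-1, the last non-daemon thread, finishes. If both t1 and t2 call join, the main thread still waits for each child thread to finish before exiting, whether or not it is a daemon thread.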
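
The last point can be checked with a small sketch. The helper name run_daemon_and_join is made up for illustration; the thread is created with the daemon=True constructor argument and then joined, so the caller waits for it even though it is a daemon.

```python
import threading
import time


def run_daemon_and_join(second):
    """Start a daemon thread, join it, and report what happened."""
    finished = []

    def worker():
        time.sleep(second)
        finished.append(second)  # only reached if the thread is allowed to finish

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join()  # blocks until worker() returns, daemon or not
    return t.daemon, t.is_alive(), finished


daemon, alive, finished = run_daemon_and_join(0.2)
print(f'daemon={daemon}, alive={alive}, finished={finished}')
# prints: daemon=True, alive=False, finished=[0.2]
```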

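Mutex locks

The threads in a process share its resources. Suppose a process has a global variable count used as a counter, and we create many threads that each add 1 to count when they run. The code looks like this: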

import threading
import time


count = 0

class MyThread(threading.Thread):
    def run(self):
        global count
        temp = count + 1    # read the shared counter
        time.sleep(0.001)   # give other threads time to read the same stale value
        count = temp        # write back, possibly overwriting other threads' updates

threads = []

for _ in range(1000):
    thread = MyThread()
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    
print(f'Final count: {count}')

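Output (the exact number varies from run to run):

Final count: 69

Because count is shared, each thread reads the current value of count when it executes temp = count + 1. Some of these threads run concurrently, so different threads may read the same value of count. As a result, some threads' increments are overwritten and lost, and the final result is smaller than the expected 1000.

So when multiple threads read and modify the same data at the same time, the result is unpredictable. To avoid this, the threads must be synchronized, which can be done by protecting the data with a lock: threading.Lock.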

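Protecting data with a lock

Before a thread operates on the data, it must first acquire the lock. Other threads that find the lock held cannot proceed and wait until it is released. Only after the holder releases the lock can another thread acquire it, modify the data, and then release it in turn.

This guarantees that only one thread operates on the data at a time, so threads no longer read and modify the same data simultaneously.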
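
The description above can be sketched as follows. This is a minimal version of the counter example, written with plain functions instead of a Thread subclass; the names increment and run_with_lock are made up for illustration. The with statement acquires the lock on entry and releases it on exit, even if an exception is raised.

```python
import threading
import time

count = 0
lock = threading.Lock()


def increment():
    global count
    with lock:               # acquire the lock; released automatically on exit
        temp = count + 1
        time.sleep(0.001)    # other threads now wait instead of reading a stale value
        count = temp


def run_with_lock(n=1000):
    """Reset the counter, run n incrementing threads, and return the final count."""
    global count
    count = 0
    threads = [threading.Thread(target=increment) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count


print(f'Final count: {run_with_lock()}')
# prints: Final count: 1000
```

The price of correctness is that the read, sleep, and write steps now run one thread at a time, so this version takes noticeably longer than the unprotected one.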

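Problems with Python multithreading

GIL stands for Global Interpreter Lock.

With multiple threads in Python (CPython), each thread executes as follows:

Acquire the GIL
Execute the thread's code
Release the GIL

So a thread that wants to run must first obtain the GIL. Think of the GIL as a pass, of which there is only one per Python process. A thread that cannot get the pass is not allowed to run. As a result, even on a multi-core machine, only one of the threads in a Python process can execute at any given moment.

For IO-bound tasks such as crawling, this has little impact. For CPU-bound tasks, however, because of the GIL, multithreading may actually end up slower overall than a single thread.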
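
A rough way to observe this on your own machine is sketched below, assuming a standard CPython build with the GIL enabled. The function names are made up for illustration, and the timings depend heavily on the machine and interpreter version, so no expected numbers are given.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; returns the number of iterations performed."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def timed(fn):
    """Return how long fn() takes to run, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def single_thread(n):
    # Do the work twice, one call after the other, in the current thread.
    count_down(n)
    count_down(n)


def two_threads(n):
    # Do the same amount of work, split across two threads.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


N = 2_000_000
print(f'single thread: {timed(lambda: single_thread(N)):.2f}s')
print(f'two threads:   {timed(lambda: two_threads(N)):.2f}s')
```

With the GIL in place, the two-thread version is not expected to be about twice as fast, as it would be if the threads ran in parallel; it typically takes roughly as long as the single-thread version, and sometimes longer.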

Reference: https://kaiwu.lagou.com/course/courseInfo.htm?courseId=46#/detail/pc?id=1666

Reposted from: https://blog.csdn.net/weixin_45961774/article/details/106530146
