
[Python] [Crawler] Scraping 5,000 Novel Chapters: Problems Encountered and How I Solved Them


Analyzing the Crawler Problems

Recap

I previously wrote a multi-threaded crawler for a novel site. It works as follows:

First scrape the novel's index page to collect all chapter information (chapter titles and their reading links), then use a worker pool (pool = Pool(50)) to fetch each chapter's body through its link and save it as a local Markdown file. (See run01.py at the end of this post.)
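The skeleton of that flow looks roughly like this (a minimal sketch rather than the full run01.py: the index URL is a placeholder, and the encoding handling, logging, User-Agent rotation and ad stripping are all left out):

from multiprocessing import Pool  # run01.py uses a process pool; a thread pool also works

import requests
import bs4

BASE = "https://www.example.com"  # placeholder site root


def chapter_links(index_url):
    # collect absolute chapter links from the book's index page
    soup = bs4.BeautifulSoup(requests.get(index_url, timeout=10).text, "lxml")
    return [BASE + a["href"] for a in soup.find_all("a") if a.get("href")]


def save_chapter(url):
    # fetch one chapter and write it out as a Markdown file
    soup = bs4.BeautifulSoup(requests.get(url, timeout=10).text, "lxml")
    title = soup.find("h1").get_text(strip=True)
    body = soup.find("div", attrs={"class": "showtxt"}).get_text("\n")
    with open(title + ".md", "w", encoding="utf-8") as f:
        f.write("## " + title + "\n" + body)


if __name__ == "__main__":
    links = chapter_links(BASE + "/book/1/")[:101]  # cap the number of chapters for a test run
    with Pool(50) as pool:                          # 50 workers fetch chapters concurrently
        pool.map(save_chapter, links)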

Scraping 100 chapters took 10 seconds.

Limited to 101 chapters, the run took 9 seconds from start to finish.

Redis + MongoDB, without multithreading

I recently learned Redis and MongoDB, and the assignment was to push the chapter links into Redis after collecting them, then read the links back from Redis to do the actual scraping. (See run02.py at the end of this post.)
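The handoff is simple: one function pushes every chapter URL into a Redis list, another pops them one at a time, fetches the chapter with run01's getContent(), and writes it into MongoDB. A minimal sketch (assuming Redis and MongoDB on their default local ports; the key and collection names here are illustrative, not the ones in run02.py):

import redis
import pymongo

import run01  # the post's run01.py, which provides getContent(url) -> (title, content)

r = redis.StrictRedis()
collection = pymongo.MongoClient()["book"]["chapters"]


def enqueue(chapter_paths):
    # producer: push every chapter path (e.g. "/31/31596/8403973.html") into a Redis list
    for path in chapter_paths:
        r.lpush("url_queue", path)


def drain():
    # consumer: pop paths one at a time and store each chapter in MongoDB
    while r.llen("url_queue") > 0:
        path = r.lpop("url_queue").decode()
        title, content = run01.getContent("https://www.lingdianksw8.com" + path)
        if title and content:
            collection.insert_one({"title": title, "content": content})

Because drain() fetches strictly one chapter after another, every request waits for the previous network round trip, which is exactly why this version turned out so slow.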

...No need to keep testing: fetching one chapter at a time is just too slow!

Scraping 101 chapters took two minutes!

Redis + MongoDB + multithreading

101 chapters in just 8 seconds!

4,012 chapters in 1 minute 10 seconds!
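Two things make this version fast and still ordered: a thread pool does the fetching, and an index is pushed into a second Redis list alongside each URL, so chapters that finish in arbitrary order can be re-sorted when reading them back from MongoDB. A rough sketch of the idea (key and collection names are illustrative; run03.py at the end of the post is the full version, including retrying failed chapters):

from multiprocessing.dummy import Pool  # thread pool

import redis
import pymongo

import run01  # provides getContent(url) -> (title, content)

r = redis.StrictRedis()
collection = pymongo.MongoClient()["book"]["chapters"]


def enqueue(chapter_paths):
    for i, path in enumerate(chapter_paths):
        r.lpush("url_queue", path)
        r.lpush("sort_queue", i)  # remember each chapter's position


def crawl(job):
    title, content = run01.getContent("https://www.lingdianksw8.com" + job["path"])
    if title and content and title != "InternetError":
        collection.insert_one({"isort": job["isort"], "title": title, "content": content})
    else:
        # put failed chapters back so a later pass can retry them
        r.lpush("url_queue", job["path"])
        r.lpush("sort_queue", job["isort"])


def run():
    jobs = []
    while r.llen("url_queue") > 0:
        jobs.append({"path": r.lpop("url_queue").decode(),
                     "isort": int(r.lpop("sort_queue").decode())})
    with Pool(50) as pool:
        pool.map(crawl, jobs)
    # read the chapters back in their original order
    for doc in collection.find().sort("isort", pymongo.ASCENDING):
        print(doc["title"])

Storing isort as an integer keeps the final sort() trivial; run03.py stores it as a string instead, which is why mongoQ.py later sorts with a collation that has numericOrdering enabled.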

Problems and Solutions

Rather than type everything out, I recorded videos and posted them on Bilibili (search for 萌狼蓝天 there).

[爬狼] Python Crawler Experience Sharing, Part 1: A Quick Tour of the Code Files

[爬狼] Python Crawler Experience Sharing, Part 2: Handling Encoding Problems (a simplified sketch of the idea follows this list)

[爬狼] Python Crawler Experience Sharing, Part 3: Keeping Chapters in Order When Scraping with Multiple Threads

[爬狼] Python Crawler Experience Sharing, Part 4: What to Do When You Get Blocked for Scraping Too Often

For the rest, browse my Bilibili homepage.
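On the encoding topic from Part 2: the target site serves GBK, while requests often guesses ISO-8859-1 from the response headers, so the core of the fix is decoding the raw bytes yourself instead of trusting resp.text. A simplified sketch of the idea (the full handling, with its extra fallbacks, is in getCode()/recoding() in the listings below):

import requests


def fetch_text(url):
    resp = requests.get(url, timeout=10)
    if resp.encoding and resp.encoding.lower() in ("gbk", "gb2312", "gb18030"):
        return resp.text  # requests already picked a usable encoding
    try:
        return resp.content.decode("gbk")
    except UnicodeDecodeError:
        # GB18030 is a superset of GBK and usually rescues the stray bytes
        return resp.content.decode("gb18030", errors="ignore")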

Code (2022-10-20)

run01.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
import datetime
import re
import random
from multiprocessing import Pool

import requests
import bs4
import os

os.environ['NO_PROXY'] = "www.lingdianksw8.com"


# Append a line to the log file
def Log_text(lx="info", *text):
    lx.upper()  # note: the result is discarded, so the tag is written as-is
    with open("log.log", "a+", encoding="utf-8") as f:
        f.write("\n[" + str(datetime.datetime.now()) + "]" + "[" + lx + "]")
        for i in text:
            f.write(i)


# Debug output
def log(message, i="info"):
    if type(message) == type(""):
        i.upper()  # note: the result is discarded
        print("[", i, "] [", str(type(message)), "]", message)
    elif type(message) == type([]):
        count = 0
        for j in message:
            print("[", i, "] [", str(count), "] [", str(type(message)), "]", j)
            count += 1
    else:
        print("[", i, "] [", str(type(message)), "]", end=" ")
        print(message)


# Fetch page source
def getCode(url, methods="post"):
    """
    Fetch the page source.
    :param methods: HTTP method to use
    :param url: book index page URL
    :return: page source
    """
    # Request headers: rotate a random User-Agent
    user_agent = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    ]
    headers = {
        'User-Agent': random.choice(user_agent),
        # "user-agent": user_agent[random.randint(0, len(user_agent) - 1)]
    }
    # Fetch the page source
    result = requests.request(methods, url, headers=headers, allow_redirects=True)
    log("cookie" + str(result.cookies.values()))
    tag = 0
    log("Initial page encoding: " + str(result.encoding))
    if result.encoding != "gbk":
        log("Page encoding is not gbk, re-decoding is needed", "warn")
        tag = 1
    try:
        result = requests.request(methods, url, headers=headers, allow_redirects=True, cookies=result.cookies)
    except:
        return "InternetError"
    result_text = result.text
    # print(result_text)
    if tag == 1:
        result_text = recoding(result)
        log("Re-decoding done, text is now decoded as gbk")
    return result_text


def recoding(result):
    try:
        result_text = result.content.decode("gbk", errors='ignore')
    except:
        # UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 6917
        try:
            result_text = result.content.decode("").encode("unicode_escape").decode("gbk", errors='ignore')
        except:
            try:
                result_text = result.content.decode("gb18030", errors='ignore')
            except:
                result_text = result.text
    return result_text


# Parse the data
def getDict(code):
    """
    Parse the page source and return the chapter data as a list of dicts.
    :param code: page source
    :return: List
    """
    # Narrow the scope with a regex first
    code = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    # log(code)
    # obj = bs4.BeautifulSoup(markup=code, features="html.parser")
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    # log(obj.find_all("a"))
    # The debug output above shows that find_all returns a list
    tag = obj.find_all("a")
    log("Number of <a> tags: " + str(len(tag)))
    result = []
    count = 0
    for i in range(len(tag)):
        count += 1
        link = tag[i]["href"]
        text = tag[i].get_text()
        result.append({"title": text, "link": "https://www.lingdianksw8.com" + link})
    return result


# Chapter content
def getContent(url):
    code = getCode(url, "get")
    if code == "InternetError":
        return "InternetError", ""
    try:
        code = code.replace("<br />", "\n")
        code = code.replace("&nbsp;", " ")
        code = code.replace("\xa0", " ")  # assumed: the original replaced a non-breaking space here
    except Exception as e:
        # AttributeError: 'tuple' object has no attribute 'replace'
        Log_text("error", "[run01-161~163]" + str(e))
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(code)
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    titile = obj.find_all("h1")[0].text
    try:
        content = obj.find_all("div", attrs={"class": "showtxt"})[0].text
    except:
        return None, None
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(content)
    # log(content)
    try:
        g = re.findall(
            "(:.*?https://www.lingdianksw8.com.*?天才一秒记住本站地址:www.lingdianksw8.com。零点看书手机版阅读网址:.*?.com)",
            content, re.S)[0]
        log(g)
        content = content.replace(g, "")
    except:
        Log_text("error", "Failed to strip the ad text! Chapter " + titile + " (" + url + ")")
    log(content)
    return titile, content


def docToMd(name, title, content):
    with open(name + ".md", "w+", encoding="utf-8") as f:
        f.write("## " + title + "\n" + content)  # the original wrote "/n"; "\n" is assumed to be the intent
    return 0


# Worker function for the pool - fetch one chapter by its link
def thead_getContent(link):
    # Fetch the chapter content from its link
    Log_text("info", "Trying to fetch " + str(link))
    title, content = getContent(str(link))  # get the title and body from the chapter page
    Log_text("success", "Fetched chapter " + title)
    docToMd(title, title, content)
    Log_text("success", "Wrote chapter " + title)


# Put the steps together
def run(url):
    with open("log1.log", "w+", encoding="utf-8") as f:
        f.write("")
    Log_text("info", "Start fetching the book index page...")
    code = getCode(url)
    Log_text("success", "Index page source fetched, parsing...")
    index = getDict(code)  # returns [{"title": chapter name, "link": chapter URL}]
    links = []
    # lineCount limits how many chapters to scrape
    lineCount = 0
    for i in index:
        if lineCount > 10:
            break
        lineCount += 1
        links.append(i["link"])
    print("Link status")
    print(type(links))
    print(links)
    Log_text("success", "Index page parsed and data prepared, fetching chapter content...")
    pool = Pool(50)  # pool of 50 workers
    pool.map(thead_getContent, links)


if __name__ == '__main__':
    start = datetime.datetime.today()
    Log_text("=== [Log] [multithreaded] Starting a new test =|=|=|= " + str(start))
    run(r"https://www.lingdianksw8.com/31/31596")
    # getContent("http://www.lingdianksw8.com/31/31596/8403973.html")
    end = datetime.datetime.today()
    Log_text("=== [Log] [multithreaded] Test finished =|=|=|= " + str(end))
    Log_text("=== [Log] [multithreaded] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

run02.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to fetch the chapter links and store them in Redis.
2. Read the chapter links back from Redis and scrape them.
"""
import re
import pymongo
from lxml import html
import run01 as xrilang
import redis
import datetime

client = redis.StrictRedis()


def getLinks():
    xrilang.Log_text("=== [Log] Start fetching chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    title_list = selector.xpath("//dd/a/text()")
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # clear everything in Redis so repeated runs don't create duplicate data
    xrilang.Log_text("=== [Log] Start fetching titles")
    for title in title_list:
        xrilang.log(title)
        client.lpush('title_queue', title)
    xrilang.Log_text("=== [Log] Start fetching chapter links")
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("=== [Log] Finished fetching chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent():
    xrilang.Log_text("=== [Log] Start fetching chapter content")
    database = pymongo.MongoClient()['book']
    collection = database['myWifeSoBeautifull']
    startTime = datetime.datetime.today()
    xrilang.log("Start " + str(startTime))
    linkCount = 0
    datas = []
    while client.llen("url_queue") > 0:
        # cap the number of chapters for this test run
        if linkCount > 10:
            break
        linkCount += 1
        url = client.lpop("url_queue").decode()
        title = client.lpop("title_queue").decode()
        xrilang.log(url)
        # fetch the chapter content and collect it for the database
        content_url = "https://www.lingdianksw8.com" + url
        name, content = xrilang.getContent(content_url)
        if name != None and content != None:
            datas.append({"title": title, "name": name, "content": content})
    collection.insert_many(datas)


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Starting a new test =|=|=|= " + str(start))
    getLinks()
    getContent()
    end = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Test finished =|=|=|= " + str(end))
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

run03.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to fetch the chapter links and store them in Redis.
2. Read the chapter links back from Redis and scrape them.
"""
import re
import time
from multiprocessing.dummy import Pool

import pymongo
from lxml import html
import run01 as xrilang
import redis
import datetime

client = redis.StrictRedis()
database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']


def getLinks():
    xrilang.Log_text("=== [Log] Start fetching chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # clear everything in Redis so repeated runs don't create duplicate data
    xrilang.Log_text("=== [Log] Start fetching chapter links")
    i = 0
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
        i += 1
        client.lpush('sort_queue', i)  # keep an index so the multi-threaded crawl can be re-ordered later
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("=== [Log] Finished fetching chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent(durl):
    url = durl["url"]
    isort = durl["isort"]
    content_url = "https://www.lingdianksw8.com" + url
    title, content = xrilang.getContent(content_url)
    if title != "InternetError":
        if title != None and content != None:
            xrilang.log("Fetched " + title + " successfully")
            collection.insert_one({"isort": isort, "title": title, "content": content})
        else:
            # push the failed chapter back into Redis for a later retry
            client.lpush('url_queue', url)
            client.lpush('sort_queue', isort)  # keep the ordering index in sync
            # back off before retrying (note: time.sleep() takes seconds, so 1000 waits
            # far longer than the 5 seconds the original comment claimed)
            time.sleep(1000)
    else:
        # push the failed chapter back into Redis for a later retry
        client.lpush('url_queue', url)
        client.lpush('sort_queue', isort)  # keep the ordering index in sync
        # back off before retrying (see the note above: 5000 seconds, not 5)
        time.sleep(5000)


def StartGetContent():
    xrilang.Log_text("=== [Log] Start fetching chapter content")
    startTime = datetime.datetime.today()
    xrilang.log("Start " + str(startTime))
    urls = []
    # xrilang.log(client.llen("url_queue"))
    while client.llen("url_queue") > 0:
        url = client.lpop("url_queue").decode()
        isort = client.lpop("sort_queue").decode()
        # urls.append(url)
        urls.append({"url": url, "isort": isort})
    # xrilang.log(urls)
    pool = Pool(500)  # thread pool with 500 workers
    pool.map(getContent, urls)
    endTime = datetime.datetime.today()
    xrilang.log("[End] " + str(endTime))
    xrilang.Log_text("=== [Log] Finished fetching chapters, elapsed " + str(endTime - startTime))


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Starting a new test =|=|=|= " + str(start))
    getLinks()
    StartGetContent()
    end = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Test finished =|=|=|= " + str(end))
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

mongoQ.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/10/20
import pymongo

database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']
# isort was stored as a string, so a collation with numericOrdering is needed to sort it numerically
result = collection.find().collation({"locale": "zh", "numericOrdering": True}).sort("isort")
with open("list.txt", "a+", encoding="utf-8") as f:
    for i in result:
        f.write(i["isort"] + " " + i["title"] + "\n")

Code (2022-10-19)

run01.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
import datetime
import re
import random
from multiprocessing import Pool

import requests
import bs4
import os

os.environ['NO_PROXY'] = "www.lingdianksw8.com"


# Append a line to the log file
def Log_text(lx="info", *text):
    lx.upper()  # note: the result is discarded, so the tag is written as-is
    with open("log.log", "a+", encoding="utf-8") as f:
        f.write("\n[" + str(datetime.datetime.now()) + "]" + "[" + lx + "]")
        for i in text:
            f.write(i)


# Debug output
def log(message, i="info"):
    if type(message) == type(""):
        i.upper()  # note: the result is discarded
        print("[", i, "] [", str(type(message)), "]", message)
    elif type(message) == type([]):
        count = 0
        for j in message:
            print("[", i, "] [", str(count), "] [", str(type(message)), "]", j)
            count += 1
    else:
        print("[", i, "] [", str(type(message)), "]", end=" ")
        print(message)


# Fetch page source
def getCode(url, methods="post"):
    """
    Fetch the page source.
    :param methods: HTTP method to use
    :param url: book index page URL
    :return: page source
    """
    # Request headers: rotate a random User-Agent
    user_agent = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    ]
    headers = {
        'User-Agent': random.choice(user_agent),
        # "user-agent": user_agent[random.randint(0, len(user_agent) - 1)]
    }
    # Fetch the page source
    result = requests.request(methods, url, headers=headers, allow_redirects=True)
    log("cookie" + str(result.cookies.values()))
    tag = 0
    log("Initial page encoding: " + str(result.encoding))
    if result.encoding == "gbk" or result.encoding == "ISO-8859-1":
        log("Page encoding is not UTF-8, re-decoding is needed", "warn")
        tag = 1
    try:
        result = requests.request(methods, url, headers=headers, allow_redirects=True, cookies=result.cookies)
    except:
        # returning a tuple here is what triggers the AttributeError noted in getContent below
        return "InternetError", ""
    result_text = result.text
    # print(result_text)
    if tag == 1:
        result_text = recoding(result)
        log("Re-decoding done, text is now decoded as gbk")
    return result_text


def recoding(result):
    try:
        result_text = result.content.decode("gbk", errors='ignore')
    except:
        # UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 6917
        try:
            result_text = result.content.decode("").encode("unicode_escape").decode("gbk", errors='ignore')
        except:
            try:
                result_text = result.content.decode("gb18030", errors='ignore')
            except:
                result_text = result.text
    return result_text


# Parse the data
def getDict(code):
    """
    Parse the page source and return the chapter data as a list of dicts.
    :param code: page source
    :return: List
    """
    # Narrow the scope with a regex first
    code = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    # log(code)
    # obj = bs4.BeautifulSoup(markup=code, features="html.parser")
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    # log(obj.find_all("a"))
    # The debug output above shows that find_all returns a list
    tag = obj.find_all("a")
    log("Number of <a> tags: " + str(len(tag)))
    result = []
    count = 0
    for i in range(len(tag)):
        count += 1
        link = tag[i]["href"]
        text = tag[i].get_text()
        result.append({"title": text, "link": "https://www.lingdianksw8.com" + link})
    return result


# Chapter content
def getContent(url):
    code = getCode(url, "get")
    try:
        code = code.replace("<br />", "\n")
        code = code.replace("&nbsp;", " ")
        code = code.replace("\xa0", " ")  # assumed: the original replaced a non-breaking space here
    except Exception as e:
        # AttributeError: 'tuple' object has no attribute 'replace'
        Log_text("error", "[run01-161~163]" + str(e))
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(code)
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    titile = obj.find_all("h1")[0].text
    try:
        content = obj.find_all("div", attrs={"class": "showtxt"})[0].text
    except:
        return None, None
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(content)
    # log(content)
    try:
        g = re.findall(
            "(:.*?https://www.lingdianksw8.com.*?天才一秒记住本站地址:www.lingdianksw8.com。零点看书手机版阅读网址:.*?.com)",
            content, re.S)[0]
        log(g)
        content = content.replace(g, "")
    except:
        Log_text("error", "Failed to strip the ad text! Chapter " + titile + " (" + url + ")")
    log(content)
    return titile, content


def docToMd(name, title, content):
    with open(name + ".md", "w+", encoding="utf-8") as f:
        f.write("## " + title + "\n" + content)  # the original wrote "/n"; "\n" is assumed to be the intent
    return 0


# Worker function for the pool - fetch one chapter by its link
def thead_getContent(link):
    # Fetch the chapter content from its link
    Log_text("info", "Trying to fetch " + str(link))
    title, content = getContent(str(link))  # get the title and body from the chapter page
    Log_text("success", "Fetched chapter " + title)
    docToMd(title, title, content)
    Log_text("success", "Wrote chapter " + title)


# Put the steps together
def run(url):
    with open("log1.log", "w+", encoding="utf-8") as f:
        f.write("")
    Log_text("info", "Start fetching the book index page...")
    code = getCode(url)
    Log_text("success", "Index page source fetched, parsing...")
    index = getDict(code)  # returns [{"title": chapter name, "link": chapter URL}]
    links = []
    # lineCount limits how many chapters to scrape
    lineCount = 0
    for i in index:
        if lineCount > 100:
            break
        lineCount += 1
        links.append(i["link"])
    print("Link status")
    print(type(links))
    print(links)
    Log_text("success", "Index page parsed and data prepared, fetching chapter content...")
    pool = Pool(50)  # pool of 50 workers
    pool.map(thead_getContent, links)


if __name__ == '__main__':
    start = datetime.datetime.today()
    Log_text("=== [Log] [multithreaded] Starting a new test =|=|=|= " + str(start))
    run(r"https://www.lingdianksw8.com/31/31596")
    # getContent("http://www.lingdianksw8.com/31/31596/8403973.html")
    end = datetime.datetime.today()
    Log_text("=== [Log] [multithreaded] Test finished =|=|=|= " + str(end))
    Log_text("=== [Log] [multithreaded] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

run02.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to fetch the chapter links and store them in Redis.
2. Read the chapter links back from Redis and scrape them.
"""
import re
import pymongo
from lxml import html
import run01 as xrilang
import redis
import datetime

client = redis.StrictRedis()


def getLinks():
    xrilang.Log_text("=== [Log] Start fetching chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    title_list = selector.xpath("//dd/a/text()")
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # clear everything in Redis so repeated runs don't create duplicate data
    xrilang.Log_text("=== [Log] Start fetching titles")
    for title in title_list:
        xrilang.log(title)
        client.lpush('title_queue', title)
    xrilang.Log_text("=== [Log] Start fetching chapter links")
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("=== [Log] Finished fetching chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent():
    xrilang.Log_text("=== [Log] Start fetching chapter content")
    database = pymongo.MongoClient()['book']
    collection = database['myWifeSoBeautifull']
    startTime = datetime.datetime.today()
    xrilang.log("Start " + str(startTime))
    linkCount = 0
    datas = []
    while client.llen("url_queue") > 0:
        # cap the number of chapters for this test run
        if linkCount > 10:
            break
        linkCount += 1
        url = client.lpop("url_queue").decode()
        title = client.lpop("title_queue").decode()
        xrilang.log(url)
        # fetch the chapter content and collect it for the database
        content_url = "https://www.lingdianksw8.com" + url
        name, content = xrilang.getContent(content_url)
        if name != None and content != None:
            datas.append({"title": title, "name": name, "content": content})
    collection.insert_many(datas)


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Starting a new test =|=|=|= " + str(start))
    getLinks()
    getContent()
    end = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Test finished =|=|=|= " + str(end))
    xrilang.Log_text("=== [Log] [redis+MongoDB, no multithreading] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

run03.py


   
# -*- coding: UTF-8 -*-
# Developer: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Date: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to fetch the chapter links and store them in Redis.
2. Read the chapter links back from Redis and scrape them.
"""
import re
import time
from multiprocessing.dummy import Pool

import pymongo
from lxml import html
import run01 as xrilang
import redis
import datetime

client = redis.StrictRedis()
database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']


def getLinks():
    xrilang.Log_text("=== [Log] Start fetching chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # clear everything in Redis so repeated runs don't create duplicate data
    xrilang.Log_text("=== [Log] Start fetching chapter links")
    i = 0
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
        i += 1
        client.lpush('sort_queue', i)  # keep an index so the multi-threaded crawl can be re-ordered later
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("=== [Log] Finished fetching chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent(durl):
    url = durl["url"]
    isort = durl["isort"]
    content_url = "https://www.lingdianksw8.com" + url
    title, content = xrilang.getContent(content_url)
    if title != None and content != None:
        if (title != "InternetError"):
            xrilang.log("Fetched " + title + " successfully")
            collection.insert_one({"isort": isort, "title": title, "content": content})
        else:
            # push the failed chapter back into Redis for a later retry
            client.lpush('url_queue', url)
            client.lpush('sort_queue', isort)  # keep the ordering index in sync
            # back off before retrying (note: time.sleep() takes seconds, so 5000 waits
            # far longer than the 5 seconds the original comment claimed)
            time.sleep(5000)


def StartGetContent():
    xrilang.Log_text("=== [Log] Start fetching chapter content")
    startTime = datetime.datetime.today()
    xrilang.log("Start " + str(startTime))
    urls = []
    # xrilang.log(client.llen("url_queue"))
    while client.llen("url_queue") > 0:
        url = client.lpop("url_queue").decode()
        isort = client.lpop("sort_queue").decode()
        # urls.append(url)
        urls.append({"url": url, "isort": isort})
    # xrilang.log(urls)
    pool = Pool(500)  # thread pool with 500 workers
    pool.map(getContent, urls)
    endTime = datetime.datetime.today()
    xrilang.log("[End] " + str(endTime))
    xrilang.Log_text("=== [Log] Finished fetching chapters, elapsed " + str(endTime - startTime))


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Starting a new test =|=|=|= " + str(start))
    getLinks()
    StartGetContent()
    end = datetime.datetime.today()
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Test finished =|=|=|= " + str(end))
    xrilang.Log_text("=== [Log] [redis+MongoDB+multithreading] Test finished =|=|=|= elapsed " + str(end - start))
    print("")

Reposted from: https://blog.csdn.net/ks2686/article/details/127436518