
Python Web Scraping in Practice: Scraping Novels

August 8, 2019 • Python

Scraping the Zongheng Chinese Novel Site (纵横中文网)

Here we use the requests library and the BeautifulSoup library; both can be installed with pip:

  • Installing requests
pip install requests
  • Installing BeautifulSoup
pip install beautifulsoup4
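After installing, a quick sanity check (a minimal sketch) confirms both libraries import and parse correctly:

```python
# Verify that both libraries are installed and importable
import requests
from bs4 import BeautifulSoup

print(requests.__version__)

# BeautifulSoup should parse a trivial document
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # hello
```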

Next, we fetch the page's source with the requests library's get method:

import requests

# Send a browser-like User-Agent so the site doesn't reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
r = requests.get("http://book.zongheng.com/store/c0/c0/b0/u0/p0/v9/s9/t0/u0/i1/ALL.html", headers=headers)

print(r.text)

By inspecting the page source we can find the CSS classes that hold each book's information.
(screenshot: pc.png)
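As a self-contained sketch of that parsing step: the HTML snippet below is made up for illustration, while the class names (`bookinfo`, `bookilnk`, `bookintro`) are the ones used in the scraper later in this post.

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the structure of one book entry on the listing page
html = """
<div class="bookinfo">
  <a href="#">示例小说</a>
  <div class="bookilnk"><a href="#">某作者</a><span>玄幻</span><span>连载中</span></div>
  <div class="bookintro">这是一段简介。</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
info = soup.find("div", class_="bookinfo")

# The first <a> inside the entry is the book title
print(info.find("a").text)                                  # 示例小说
# The author is the first <a> inside the "bookilnk" div
print(info.find("div", class_="bookilnk").find("a").text)   # 某作者
```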
Looking at the address bar, only the p0 segment of the URL changes when turning pages, so we can define a loop. Here we scrape the first 100 pages:

for x in range(0, 101):
    print(x)
    # %d inserts the page number in decimal
    url = "http://book.zongheng.com/store/c0/c0/b0/u0/p%d/v9/s9/t0/u0/i1/ALL.html" % x
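One thing to watch with %-formatting here: `%d` renders the integer in decimal, while `%x` renders it in hexadecimal, so page 10 would silently become `pa` in the URL. A quick check:

```python
# %d formats the page number in decimal; %x would format it in hex
base = "http://book.zongheng.com/store/c0/c0/b0/u0/p%d/v9/s9/t0/u0/i1/ALL.html"

url = base % 10
print(url)            # ...p10/...
print("p%x" % 10)     # pa  <- the hex pitfall
```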

Then we save the results to a file using open:

with open("小说信息.txt", "a", encoding="utf-8") as f:
    f.write(books)
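A practical note: append each record with a trailing newline so consecutive books don't run together in the file. A minimal sketch (信息示例.txt is a made-up filename for this example):

```python
# Append one record per line; "a" mode creates the file if it doesn't exist
records = ["名字:书一 作者:甲", "名字:书二 作者:乙"]

with open("信息示例.txt", "a", encoding="utf-8") as f:
    for rec in records:
        f.write(rec + "\n")  # the newline keeps each book on its own line

with open("信息示例.txt", encoding="utf-8") as f:
    print(f.read())
```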

And with that we're done. This is only a rough outline of the approach; the detailed steps are left for you to work through yourself.

Finally, the complete code:

import time
import requests
from bs4 import BeautifulSoup

def get_content(url, headers):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    time.sleep(3)  # pause between pages to be polite to the server
    # Each book on the listing page sits in a div with class "bookinfo"
    bookinfo = soup.find_all("div", class_="bookinfo")
    for x in bookinfo:
        print('*' * 30)
        bookname = x.find("a").text
        book_zuozhe = x.find("div", class_="bookilnk").find("a").text
        book_time = x.find("div", class_="bookilnk").find_all("span")[1].text
        book_intro = x.find("div", class_="bookintro").text
        print("名字:", bookname)
        print("作者:", book_zuozhe)
        print("简介:", book_intro)
        print("状态:", book_time)
        # Join the fields with separators and end with a newline per record
        books = "名字:" + bookname + " 作者:" + book_zuozhe + " 简介:" + book_intro + " 状态:" + book_time + "\n"
        with open("信息.txt", "a", encoding="utf-8") as f:
            f.write(books)

def main():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
        }
    for x in range(0, 101):
        print(x)
        # %d inserts the page number in decimal
        url = "http://book.zongheng.com/store/c0/c0/b0/u0/p%d/v9/s9/t0/u0/i1/ALL.html" % x
        get_content(url, headers)

if __name__ == '__main__':
    main()