[Python Crawler] Zongheng Chinese Network (纵横中文网): A Hands-On Python Project

Published: 2023-10-14 09:30

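Table of Contents

Preface
- Previous Posts
- Final Result
- Development Environment
- ✨Analyzing the Web Page
- Approach
- Implementation Steps
- Results
- Full Code

Preface

As a fellow Python learner, today I will walk you through a small hands-on project: using Python to save the information you need from a website onto your own computer with very little effort.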

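Previous Posts

A recap of earlier posts

[Python] A Guide to Using Dictionaries (Super Detailed): How Will You Keep Up with Everyone Else If You Skip It
[Python Tutorial] A Step-by-Step Guide to Connecting to MySQL with the pymysql Module for Create, Read, Update, and Delete
Selenium Automated Testing in Practice: Saving Bilibili Data to Excel
In the Time My Roommate Played One Game, I Built a Selenium Automated Test and Saved the Data to MySQL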

Final Result

Here is what the finished project produces.
[Image 1]
[Image 2]

Development Environment

PyCharm
Python 3.8

Main Modules

requests
BeautifulSoup
csv

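✨Analyzing the Web Page

Before writing any code, the first step is to analyze the web page and determine whether it is static or dynamic. Knowing what you are dealing with makes the job easier, helps you avoid the hard parts of scraping, and saves time. Open the page, right-click and choose Inspect, then search for a keyword. The information can be found, so we can tentatively conclude that the site is static and scrape it with the ordinary approach.
[Image 3]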
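The check described above can also be done in code by looking for the keyword in the raw response text. This is only a minimal sketch: the function name `keyword_in_source` and both HTML strings are made up for illustration and are not taken from the site.

```python
def keyword_in_source(html: str, keyword: str) -> bool:
    """Return True when the keyword appears in the raw HTML text."""
    return keyword in html


# Hypothetical responses: one carries the data in its markup,
# the other is an empty shell that JavaScript would fill in later.
static_page = '<div class="rank_d_b_name"><a href="#">Sample Novel</a></div>'
dynamic_page = '<div id="app"></div><script src="/bundle.js"></script>'

print(keyword_in_source(static_page, "Sample Novel"))
print(keyword_in_source(dynamic_page, "Sample Novel"))
```

If the keyword is present in the raw text, the data probably does not need a browser to render it.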
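Now that we know the page is static, let's keep looking for anything else we need. For example, turning the page shows that the number after p in the URL changes, which means pagination only requires changing that number.
/p2/
/p3/
/p4/
[Image]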
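Building the list of page URLs might look like the sketch below. It assumes the query-string form of the URL used in the full code at the end of this post (`?rt=1&d=1&p=...`); the names `BASE_URL` and `page_urls` are hypothetical helpers introduced here.

```python
from typing import List

# URL taken from the full code listing; only the value of p changes per page
BASE_URL = "http://www.zongheng.com/rank/details.html"


def page_urls(first: int = 1, last: int = 10) -> List[str]:
    """Return one ranking-page URL per page number, inclusive on both ends."""
    return [f"{BASE_URL}?rt=1&d=1&p={page}" for page in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```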
The novels all sit in the individual div tags under the div with class="rankpage_box", so once BeautifulSoup retrieves those tags we can extract the information we need.
[Image 4]
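To make the selection concrete, here is a small sketch that runs BeautifulSoup over a hand-written HTML fragment. The fragment is hypothetical and only imitates the class names mentioned in this post, and it uses the built-in `html.parser` so that lxml is not required. The function name `parse_rank_page` is also made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above
SAMPLE_HTML = """
<div class="rankpage_box">
  <div class="rank_d_list borderB_c_dsh clearfix">
    <a href="http://example.com/book/1"><img src="http://example.com/1.jpg"></a>
    <div class="rank_d_b_name"><a href="http://example.com/book/1">Sample Novel A</a></div>
    <div class="rank_d_b_ticket">120</div>
  </div>
  <div class="rank_d_list borderB_c_dsh clearfix">
    <a href="http://example.com/book/2"><img src="http://example.com/2.jpg"></a>
    <div class="rank_d_b_name"><a href="http://example.com/book/2">Sample Novel B</a></div>
    <div class="rank_d_b_ticket">95</div>
  </div>
</div>
"""


def parse_rank_page(html):
    """Return one dictionary per novel found inside the rankpage_box div."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="rankpage_box")
    if box is None:
        return []
    books = []
    # Matching on a single class name also matches tags that carry extra classes
    for entry in box.find_all("div", class_="rank_d_list"):
        name_link = entry.find("div", class_="rank_d_b_name").find("a")
        books.append({
            "title": name_link.get_text(strip=True),
            "image": entry.find("img").get("src"),
            "tickets": entry.find(class_="rank_d_b_ticket").get_text(strip=True),
            "detail_url": entry.find("a").get("href"),
        })
    return books


for book in parse_rank_page(SAMPLE_HTML):
    print(book)
```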

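Approach

1. Decide on the target site and the entry URL
2. Parse the entry page to get every novel's title and the href of its detail page
3. Send a request to each detail-page URL collected
4. Extract the required information from each detail page
5. Save all of the information to a CSV file that Excel can open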

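Implementation Steps

Import the required libraries and send the request.
Note: most major sites now have anti-scraping measures, so the crawler should disguise itself as a browser, which makes it less likely that the site flags the visit as coming from a crawler. To do this, set a request header and copy the browser's User-Agent string into the code.
[Image 5]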
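One detail is easy to get wrong here: the second positional argument of `requests.get` is `params`, not `headers`, so the header dictionary has to be passed by keyword. The sketch below prepares two requests without sending them to show the difference. The names `HEADERS`, `URL`, `right`, and `wrong` are just for this example.

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    )
}
URL = "http://www.zongheng.com/rank/details.html"

# Passed as headers: the User-Agent ends up in the request headers
right = requests.Request("GET", URL, headers=HEADERS).prepare()

# Passed as params (what a second positional argument to requests.get means):
# the dictionary is encoded into the query string instead
wrong = requests.Request("GET", URL, params=HEADERS).prepare()

print(right.headers.get("User-Agent"))
print(wrong.url)
```

Nothing is sent over the network in this sketch; `prepare()` only builds the request so it can be inspected.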
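Once we have the page source, instantiate a BeautifulSoup object, find the div for every novel, and loop over them to extract the fields we need. Here that is the title, cover image, monthly ticket count, and detail-page URL.
[Image 6]
With the detail-page URL in hand, send a request to it, instantiate another BeautifulSoup object, and extract the remaining information from the detail page. Each dictionary is appended to a list, which the function returns at the end.
[Image 7]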
Call the other functions to do the rest of the work. Here the information is saved to a CSV file, which Excel can open.
[Image 8]
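Because the file is written once per page in append mode, it helps to write the header row only the first time. The following is a minimal sketch of that idea using `csv.DictWriter`; the English field names and the helper `append_rows` are assumptions made for the example, not the column names used in the full code.

```python
import csv
import os

# Hypothetical column names for the example
FIELDS = ["title", "image", "tickets", "rating", "summary"]


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Calling `append_rows` once per page then yields a single header followed by all the rows.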
Saving the covers
[Image 9]
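Saving a cover comes down to writing the downloaded bytes to a numbered file inside a folder that is created on demand. Here is a small sketch of that step on its own, using `pathlib`; the function `save_cover` and its parameters are hypothetical, and the bytes would normally come from the `.content` of a response.

```python
from pathlib import Path


def save_cover(data: bytes, folder: str, index: int) -> Path:
    """Write image bytes to <folder>/<index>.jpg and return the file path."""
    target_dir = Path(folder)
    # Create the folder if it is missing; do nothing if it already exists
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{index}.jpg"
    target.write_bytes(data)
    return target
```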
Finally, the main function implements pagination, covering 10 pages in total.
[Image 10]

Results

[Image 11]

It is best to run the code on a stable network connection.
[Image 12]

Full Code

import requests
from bs4 import BeautifulSoup
import csv
import os.path

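num = 1  # running counter used to name the saved cover images
class spider(object):
    # Special (magic) method, called when the object is created
    def __init__(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}
        # Instance attribute, reused by every request
        self.header = headers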


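    def content(self,url):
        try:
            # Send the request; the headers must be passed by keyword because
            # the second positional argument of requests.get is params
            response = requests.get(url,headers=self.header)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            print(e)
        # Reached on a non-200 status or an exception
        return None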


    def get_data(self,response):
        # List that collects one dictionary per novel
        all_list = []
        # Instantiate the parser
        soup = BeautifulSoup(response,'lxml')
        # All novel divs on the ranking page
        all_data = soup.find('div',class_="rankpage_box").find_all('div',class_="rank_d_list borderB_c_dsh clearfix")
        # Loop over them and extract the information
        for i in all_data:
            item = {}   # dictionary for one novel
            item['title'] = i.find('div',class_="rank_d_b_name").find('a').text
            item['images'] = i.find('a').find('img').get('src')
            item['num'] = i.find(class_="rank_d_b_ticket").text
            # Detail-page URL
            detalis = i.find('a').get('href')


            # Request the detail page
            new_rsponse = requests.get(url=detalis,headers=self.header).text
            # Instantiate the parser
            new_di = BeautifulSoup(new_rsponse,'lxml')
            # Extract the information, falling back to 'NOT' when a tag is missing
            try:
                item['scroc'] = new_di.find(class_="nums").text.replace(' ','')
            except AttributeError:
                item['scroc'] = 'NOT'
            try:
                item['manages'] = new_di.find(class_="book-dec Jbook-dec hide").find('p').text
            except AttributeError:
                item['manages'] = 'NOT'
            # Get the cover image URL from the dictionary
            images = item.get('images')
            # Download and save the cover
            self.save_images(images)
            # Add this novel to the result list
            all_list.append(item)
        return all_list


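    def save_csv(self,all_list):
        # Write the header only when the file is new or empty, so that
        # appending page after page does not repeat it
        need_header = not os.path.exists('纵横网.csv') or os.path.getsize('纵横网.csv') == 0
        # Open the file in append mode
        with open('纵横网.csv',mode='a+',newline='',encoding='utf-8')as f:
            # Columns: title, cover image, monthly tickets, rating info, description
            writer = csv.DictWriter(
                f,fieldnames=['书名','图片','月票','评分信息','详情']
            )
            if need_header:
                writer.writeheader()
            # Write the rows
            for i in all_list:
                writer.writerow(
                    {
                        '书名':i['title'],
                        '图片': i['images'],
                        '月票': i['num'],
                        '评分信息': i['scroc'],
                        '详情': i['manages'],
                    }
                )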


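    def save_images(self,images):
        global num
        # Create the output folder on the first call
        if not os.path.exists('./纵横小说/'):
            os.mkdir('./纵横小说/')
        # Request the image; headers are passed by keyword, not as params
        images_re = requests.get(images,headers=self.header).content
        # Save it under the next number
        with open('./纵横小说/' + str(num) + '.jpg',mode='wb')as f:
            f.write(images_re)
            num += 1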


    def main(self):
        for i in range(1,11):
            url = f'http://www.zongheng.com/rank/details.html?rt=1&d=1&p={i}'
            print(f'================Saving page {i}=============')
            response = self.content(url)
            # content() returns None when the request fails, so skip that page
            if response is None:
                continue
            all_list = self.get_data(response)
            self.save_csv(all_list)


if __name__ == '__main__':
    mood = spider()
    mood.main()
