Writing a Crawler to Download Good-Looking Wallpapers from WeChat Official Accounts

Tech Sharing, 2025-01-21


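Preface

Many years ago, when I was still at university, I wrote a similar article, though back then I was scraping good-looking wallpapers from a game's official website.

Recently WeChat Official Accounts keep recommending wallpapers to me, and quite a few of them look good. Saving them one by one is too tedious, so I decided to write a crawler to download them automatically.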

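Features of this crawler

Here is a brief list of the features involved in this project. Not all of them are covered in this article, which focuses on the crawler itself.

If anyone is interested in the other features, I will share them later.

Fetch all articles of a given Official Account
Download the wallpapers in those articles that match the rules
Filter out unrelated images, such as small "follow us" icons
Data persistence (trying out an async ORM and a lightweight NoSQL database)
Image analysis (dimensions, perceptual hash, file MD5)
Every step shows a progress bar while running, which is very friendly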

Related crawler articles

Over the past few years I have written quite a few crawler-related articles.

Project structure

As usual, I use pdm for dependency management.

The project uses the following dependencies:

dependencies = [
    "requests>=2.32.3",
    "bs4>=0.0.2",
    "loguru>=0.7.3",
    "tqdm>=4.67.1",
    "tinydb>=4.8.2",
    "pony>=0.7.19",
    "tortoise-orm[aiosqlite]>=0.23.0",
    "orjson>=3.10.14",
    "aerich[toml]>=0.8.1",
    "pillow>=11.1.0",
    "imagehash>=4.3.1",
]

There is also one dev dependency, used to inspect the database (I tried a lightweight NoSQL database that has no visual tooling).

[dependency-groups]
dev = [
    "jupyterlab>=4.3.4",
]

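Data persistence

Every time I do a project like this I try a different persistence approach.

For relational databases, last time I used the peewee ORM.

I later found that its main problem was the lack of automatic migrations (it may support them by now, but I used it several years ago).

Otherwise it was fine and good enough.

This time I started without any persistence, but several shutdowns wiped out my progress, and writing a pile of rules to match what had already been done was a real pain.

So I ended up refactoring everything.

I tried tinydb (a single-file document NoSQL database), pony (a relational ORM), and tortoise-orm, in that order.

I finally chose tortoise-orm because its syntax is very similar to the Django ORM, and I did not feel like leaving my comfort zone.

Model definitions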

from tortoise.models import Model
from tortoise import fields


class Article(Model):
    id = fields.IntField(primary_key=True)
    raw_id = fields.TextField()
    title = fields.TextField()
    url = fields.TextField()
    created_at = fields.DatetimeField()
    updated_at = fields.DatetimeField()
    html = fields.TextField()
    raw_json = fields.JSONField()

    def __str__(self):
        return self.title


class Image(Model):
    id = fields.IntField(primary_key=True)
    article = fields.ForeignKeyField('models.Article', related_name='images')
    url = fields.TextField()
    is_downloaded = fields.BooleanField(default=False)
    downloaded_at = fields.DatetimeField(null=True)
    local_file = fields.TextField(null=True)
    size = fields.IntField(null=True, description='unit: bytes')
    width = fields.IntField(null=True)
    height = fields.IntField(null=True)
    image_hash = fields.TextField(null=True)
    md5_hash = fields.TextField(null=True)

    def __str__(self):
        return self.url

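These two models cover everything this project needs, and they leave room for follow-up features such as similar-image detection and image classification.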
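The image analysis code is not shown in this article. As a rough sketch only, the size and md5_hash fields of Image could be filled with something like the following. The helper name file_info and the chunk size are assumptions of this sketch, not part of the project.

```python
import hashlib
import os


def file_info(path: str, chunk_size: int = 64 * 1024) -> dict:
    """Return the size in bytes and the MD5 hex digest of a file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return {"size": os.path.getsize(path), "md5_hash": md5.hexdigest()}
```

Two files with the same MD5 digest are, barring collisions, byte-for-byte duplicates. Near-duplicates such as resized copies need the perceptual hash instead.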

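Fetching all articles of a given Official Account

This approach requires you to have an Official Account of your own.

The article list is obtained through the "hyperlink" feature in the Official Account editor.

See the references for the detailed steps.

Preparation

I will only mention the key points here. After opening the hyperlink menu, press F12 to capture the network traffic.

Look mainly at the /cgi-bin/appmsg endpoint, from which you need to extract:

Cookie
token
fakeid - the Official Account ID, base64-encoded

The first two change on every login. You could use selenium together with a local proxy to capture and refresh them automatically. For details, see an article I wrote earlier: Selenium爬虫实践（踩坑记录）之ajax请求抓包、浏览器退出 (Selenium crawler practice, pitfall notes: capturing AJAX requests and browser exit)

Code implementation

I wrapped the operations in a class.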

import math
import random
import time
from datetime import datetime

import requests
from loguru import logger
from tortoise.expressions import Q
from tqdm import tqdm

# Article is the tortoise-orm model defined above.


class ArticleCrawler:
    def __init__(self):
        self.url = "API URL, taken from the captured request"
        self.cookie = ""
        self.headers = {
            "Cookie": self.cookie,
            "User-Agent": "fill in a suitable UA",
        }
        self.payload_data = {}  # fill in from the captured request
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def fetch_html(self, url):
        """Fetch the HTML of one article; return None on failure."""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            logger.error(f"Failed to fetch HTML for {url}: {e}")
            return None

    @property
    def total_count(self):
        """Total number of articles, or None if the response is unexpected."""
        content_json = self.session.get(self.url, params=self.payload_data).json()
        try:
            count = int(content_json["app_msg_cnt"])
            return count
        except Exception as e:
            logger.error(e)
            logger.warning(f'response json: {content_json}')

        return None

    async def crawl_list(self, count, per_page=5):
        """Fetch the article list page by page and store it in the database."""
        logger.info(f'Fetching article list, total count: {count}')

        created_articles = []

        page = int(math.ceil(count / per_page))
        for i in tqdm(range(page), ncols=100, desc="Fetching article list"):
            payload = self.payload_data.copy()
            payload["begin"] = str(i * per_page)
            resp_json = self.session.get(self.url, params=payload).json()
            articles = resp_json["app_msg_list"]

            # Store the articles
            for item in articles:
                # Skip articles that already exist to avoid duplicate inserts
                if await Article.filter(raw_id=item['aid']).exists():
                    continue

                created_item = await Article.create(
                    raw_id=item['aid'],
                    title=item['title'],
                    url=item['link'],
                    created_at=datetime.fromtimestamp(item["create_time"]),
                    updated_at=datetime.fromtimestamp(item["update_time"]),
                    html='',
                    raw_json=item,
                )
                created_articles.append(created_item)

            time.sleep(random.uniform(3, 6))

        logger.info(f'created articles: {len(created_articles)}')

    async def crawl_all_list(self):
        count = self.total_count
        if count is None:
            logger.error('could not determine the article count')
            return
        # crawl_list is a coroutine function, so it has to be awaited here
        return await self.crawl_list(count)

    async def crawl_articles(self, fake=False):
        # Adjust this filter to your case; '壁纸' means "wallpaper"
        qs = (
            Article.filter(title__icontains='壁纸')
            .filter(Q(html='') | Q(html__isnull=True))
        )

        count = await qs.count()

        logger.info(f'Matching articles without HTML: {count}')

        if fake:
            return

        with tqdm(
                total=count,
                ncols=100,
                desc="⬇ Downloading articles",
                # Available colours: hex (#00ff00), BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE
                colour='green',
                unit="page",
                bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} pages [{rate_fmt}]',
        ) as pbar:
            async for article in qs:
                article: Article
                html = self.fetch_html(article.url)
                if html:
                    # html is a non-null field, so only save when the fetch succeeded
                    article.html = html
                    await article.save()
                pbar.update(1)
                time.sleep(random.uniform(2, 5))

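What does this code do?

Or rather, what does this class do?

Get the total number of articles of the given Official Account
Fetch the account's articles page by page, including title, URL, and content
Store the articles in the database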
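The paging arithmetic can be checked in isolation. The sketch below only reproduces how crawl_list derives the begin value for each request. page_offsets is a hypothetical helper name, not part of the project.

```python
import math


def page_offsets(count: int, per_page: int = 5) -> list[str]:
    """Return the `begin` values crawl_list would send, one per page."""
    pages = math.ceil(count / per_page)
    # The crawler sends `begin` as a string, so do the same here.
    return [str(i * per_page) for i in range(pages)]


print(page_offsets(12))  # ['0', '5', '10']
```

With 12 articles and 5 per page, three requests are sent, and the last page holds only the remaining 2 articles.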

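Code walkthrough

The key part is the crawl_list method.

The code is admittedly rough: there is no error handling, and every iteration hits the database, so performance is bound to be poor.

The right approach is to read the IDs of the articles already in the database up front, so the loop no longer queries the database on each iteration.

But since this is just a simple crawler, I did not bother to optimize it.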
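For reference, the optimization described above could look roughly like this. It is a sketch under assumptions, not the project's code: select_new_items is a hypothetical helper, and in the real crawler known_ids would be loaded once before the loop, for example with something like set(await Article.all().values_list('raw_id', flat=True)), which I have not tested here.

```python
def select_new_items(items: list[dict], known_ids: set[str]) -> list[dict]:
    """Return the items whose 'aid' is not yet known, and mark them as known."""
    new_items = []
    for item in items:
        aid = item["aid"]
        if aid in known_ids:
            continue  # already stored, no database query needed
        known_ids.add(aid)  # also guards against duplicates within one run
        new_items.append(item)
    return new_items


known = {"a1"}
page = [{"aid": "a1"}, {"aid": "a2"}, {"aid": "a2"}]
print(select_new_items(page, known))  # [{'aid': 'a2'}]
```

A set lookup is a constant-time in-memory operation, so the only database query left is the one that loads the IDs at the start.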

After each page, time.sleep(random.uniform(3, 6)) pauses for a random 3 to 6 seconds.

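Progress bar

I use the tqdm library for progress bars (the Python ecosystem seems to have simpler progress bar libraries, which I have used before, but most of them are wrappers around tqdm).

The bar_format parameter customizes the layout of the progress bar, for example to show the number of processed items, the total, and the processing rate.

{l_bar} is the left part of the bar, containing the description and the percentage.
{bar} is the progress bar itself.
{n_fmt}/{total_fmt} shows the current count and the total.
{rate_fmt} shows the processing rate.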

Parsing the page

So far we have only downloaded the article HTML. The image URLs still have to be extracted from it.

That calls for a parsing function.

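from bs4 import BeautifulSoup


def parse_html(html: str) -> list:
    soup = BeautifulSoup(html, 'html.parser')
    img_elements = soup.select('img.wxw-img')

    images = []

    for img_element in img_elements:
        # Use .get() so an <img> without data-src is skipped instead of raising KeyError
        img_url = img_element.get('data-src')
        if img_url:
            images.append(img_url)

    return images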

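A simple CSS selector is enough to extract the images.

Extracting images

Remember the Image model?

It has not been used so far.

This section extracts the images and stores them in the database.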

# Uses Q, tqdm, logger, and the Article/Image models imported or defined earlier.
async def extract_images_from_articles():
    # Adjust this query to your case; '壁纸' means "wallpaper"
    qs = (
        Article.filter(title__icontains='壁纸')
        .exclude(Q(html='') | Q(html__isnull=True))
    )

    article_count = await qs.count()

    with tqdm(
            total=article_count,
            ncols=100,
            desc="⬇ extract images from articles",
            colour='green',
            unit="article",
            bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} articles [{rate_fmt}]',
    ) as pbar:
        async for article in qs:
            article: Article
            images = parse_html(article.html)
            for img_url in images:
                if await Image.filter(url=img_url).exists():
                    continue

                await Image.create(
                    article=article,
                    url=img_url,
                )

            pbar.update(1)

    logger.info(f'article count: {article_count}, image count: {await Image.all().count()}')

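This function reads the articles from the database, extracts the image URLs from each article's HTML, and stores them all in the database.

This code has the same problem of querying the database repeatedly inside a loop, but I was too lazy to optimize it...

Downloading images

Similarly, I wrote an ImageCrawler class.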

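import os
import time

import requests
from loguru import logger

# `headers` and `extract_image_format_re` are not shown in this article;
# they are defined elsewhere in the project.


class ImageCrawler:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(headers)
        self.images_dir = os.path.join('output', 'images')
        os.makedirs(self.images_dir, exist_ok=True)

    def download_image(self, url):
        """Download one image; return its relative path, or None on failure."""
        img_path = os.path.join(self.images_dir, f'{time.time()}.{extract_image_format_re(url)}')
        img_fullpath = os.path.join(os.getcwd(), img_path)

        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()  # do not save an error page as an image
            with open(img_fullpath, 'wb') as f:
                f.write(response.content)

            return img_path
        except Exception as e:
            logger.error(e)

        return None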

This code is much simpler: it just downloads an image.

I use a timestamp as the file name.
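The helper that picks the file extension, extract_image_format_re, is not shown in this article. As a hedged sketch of what such a helper might do, the function below reads the wx_fmt query parameter and falls back to the mmbiz_xxx path segment. Both URL patterns are my assumptions about what WeChat image links typically look like, and guess_image_ext is a hypothetical name, not the project's function.

```python
import re
from urllib.parse import parse_qs, urlparse


def guess_image_ext(url: str, default: str = "jpg") -> str:
    """Guess a file extension from an image URL (assumed WeChat-style)."""
    parsed = urlparse(url)
    # Assumption: the format is usually given as ?wx_fmt=png
    fmt = parse_qs(parsed.query).get("wx_fmt", [""])[0].lower()
    if not fmt:
        # Assumption: otherwise it may appear in the path as /mmbiz_png/
        match = re.search(r"/mmbiz_([a-z0-9]+)/", parsed.path)
        fmt = match.group(1) if match else ""
    if not re.fullmatch(r"[a-z0-9]{2,5}", fmt):
        return default  # unknown or suspicious value, fall back
    return "jpg" if fmt == "jpeg" else fmt


print(guess_image_ext("https://mmbiz.qpic.cn/mmbiz_png/abc/640?wx_fmt=png"))  # png
```

Validating the value before using it in a file name keeps unexpected characters from the URL out of the local path.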

Actually collecting the images takes a bit more than that, though.

Next, a function that downloads the images:

async def download_images():
    images = await Image.filter(is_downloaded=False)

    if not images:
        logger.info('no images to download')
        return

    c = ImageCrawler()

    with tqdm(
            total=len(images),
            ncols=100,
            desc="⬇ download images",
            colour='green',
            unit="image",
            bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} images [{rate_fmt}]',
    ) as pbar:
        for image in images:
            image: Image
            img_path = c.download_image(image.url)
            if img_path:
                image.is_downloaded = True
                image.local_file = img_path
                await image.save()

            # Count failed downloads too, so the bar can reach 100%
            pbar.update(1)
            time.sleep(random.uniform(1, 3))

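It selects the images that have not been downloaded, downloads them, and then updates the database with each image's local path.

Running the program

Finally, the parts of the program need to be strung together like candied hawthorns on a stick.

This time I used async, so things are a little different.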

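from tortoise import run_async


async def main():
    # init() (not shown here) must initialize tortoise-orm before any query runs
    await init()
    await extract_images_from_articles()
    await download_images()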

Finally, call it at the program entry point.

if __name__ == '__main__':
    run_async(main())

run_async is provided by tortoise-orm. It waits for the async function to finish and cleans up the database connections.
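To make the pattern concrete, here is a minimal stdlib-only sketch of "run a coroutine, then always clean up". It is not tortoise-orm's actual implementation. run_and_cleanup, work, and close_connections are hypothetical names, and close_connections merely stands in for closing database connections.

```python
import asyncio


def run_and_cleanup(coro, cleanup):
    """Run coro to completion, then always await cleanup(), even on error."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        # The finally block runs whether coro succeeded or raised.
        loop.run_until_complete(cleanup())
        loop.close()


events = []


async def work():
    events.append("work")


async def close_connections():
    events.append("closed")


run_and_cleanup(work(), close_connections)
print(events)  # ['work', 'closed']
```

The point of the try/finally is that the cleanup still runs when the main coroutine raises, which is what you want for database connections.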

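Development log

I exported the git commit history and tidied it up a little into this development log table.

Date & Time	Message
2025-01-18 19:02:21	🍹Minor tweaks to image_crawler
2025-01-18 18:09:11	🍹Updated the cookie; added a fake option to crawl_articles; crawl_list now reports how many articles were updated when it finishes
2025-01-12 15:48:15	🥤Changed hash_size to 32; speed does not seem to change much
2025-01-12 15:13:06	🍟Added support for multiple hash algorithms
2025-01-12 15:00:43	🍕Image analysis script done; image info is now fully populated
2025-01-11 23:41:14	🌭Fixed a bug; it can now run all night downloading
2025-01-11 23:36:46	🍕Finished the image download logic (untested); added pillow and imagehash for image recognition later, downloading comes first.
2025-01-11 23:25:26	🥓Initial refactor of the image crawler, extracting image links from article HTML; want to use aerich for migrations, not done yet
2025-01-11 22:27:04	🍔Another feature done: fetch article HTML and store it in the database
2025-01-11 21:19:19	🥪Successfully converted article_crawler to use tortoise-orm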

How do you export a log like this?

Use a git command to export the commit history:

git log --pretty=format:"- %s (%ad)" --date=iso

This uses the markdown list format.

After generating it, adjust it into a table as needed.
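I did that last step by hand, but it can also be scripted. The following is a small sketch that turns the "- message (date)" lines into tab-separated table rows. log_to_rows and rows_to_table are hypothetical helper names, and the sketch assumes --date=iso prints dates like 2025-01-18 19:02:21 +0800.

```python
import re

# One line of `git log --pretty=format:"- %s (%ad)" --date=iso`
LINE_RE = re.compile(r"^- (?P<msg>.*) \((?P<date>[^()]*)\)$")


def log_to_rows(log_text: str) -> list[tuple[str, str]]:
    """Parse the exported log into (date_time, message) pairs."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Keep "YYYY-MM-DD HH:MM:SS" and drop the timezone offset.
            rows.append((match.group("date")[:19], match.group("msg")))
    return rows


def rows_to_table(rows: list[tuple[str, str]]) -> str:
    """Render the rows as a tab-separated table with a header line."""
    lines = ["Date & Time\tMessage"]
    lines += [f"{date}\t{msg}" for date, msg in rows]
    return "\n".join(lines)


sample = "- fix a bug (2025-01-11 23:41:14 +0800)"
print(rows_to_table(log_to_rows(sample)))
```

The date is matched as the last parenthesized group on the line, so commit messages that contain parentheses themselves are still parsed correctly.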

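Summary

There is not much to say about the crawler itself. Simple ones like this come easily, and, not to brag, I could write one in pretty much any language. Crawlers are introductory material in many programming courses, after all, so there is really nothing hard about them. What makes it interesting is pairing each crawler with something new to try, or running it on a different tech stack or even a different device (like the time I ran a crawler on a phone). Maybe someday I could run a crawler on a microcontroller? (Probably not feasible, since memory and storage are too small. A Raspberry Pi would work, but that is really a small server.)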