圆月山庄资源网 Design By www.vgjia.com
本文实例讲述了Python自定义scrapy中间模块避免重复采集的方法。分享给大家供大家参考。具体如下:
from scrapy import log from scrapy.http import Request from scrapy.item import BaseItem from scrapy.utils.request import request_fingerprint from myproject.items import MyItem class IgnoreVisitedItems(object): """Middleware to ignore re-visiting item pages if they were already visited before. The requests to be filtered by have a meta['filter_visited'] flag enabled and optionally define an id to use for identifying them, which defaults the request fingerprint, although you'd want to use the item id, if you already have it beforehand to make it more robust. """ FILTER_VISITED = 'filter_visited' VISITED_ID = 'visited_id' CONTEXT_KEY = 'visited_ids' def process_spider_output(self, response, result, spider): context = getattr(spider, 'context', {}) visited_ids = context.setdefault(self.CONTEXT_KEY, {}) ret = [] for x in result: visited = False if isinstance(x, Request): if self.FILTER_VISITED in x.meta: visit_id = self._visited_id(x) if visit_id in visited_ids: log.msg("Ignoring already visited: %s" % x.url, level=log.INFO, spider=spider) visited = True elif isinstance(x, BaseItem): visit_id = self._visited_id(response.request) if visit_id: visited_ids[visit_id] = True x['visit_id'] = visit_id x['visit_status'] = 'new' if visited: ret.append(MyItem(visit_id=visit_id, visit_status='old')) else: ret.append(x) return ret def _visited_id(self, request): return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
希望本文所述对大家的Python程序设计有所帮助。
圆月山庄资源网 Design By www.vgjia.com
广告合作:本站广告合作请联系QQ:858582 申请时备注:广告合作(否则不回)
免责声明:本站文章均来自网站采集或用户投稿,网站不提供任何软件下载或自行开发的软件! 如有用户或公司发现本站内容信息存在侵权行为,请邮件告知! 858582#qq.com
免责声明:本站文章均来自网站采集或用户投稿,网站不提供任何软件下载或自行开发的软件! 如有用户或公司发现本站内容信息存在侵权行为,请邮件告知! 858582#qq.com
圆月山庄资源网 Design By www.vgjia.com
暂无评论...
更新日志
2024年11月08日
2024年11月08日
- 雨林唱片《赏》新曲+精选集SACD版[ISO][2.3G]
- 罗大佑与OK男女合唱团.1995-再会吧!素兰【音乐工厂】【WAV+CUE】
- 草蜢.1993-宝贝对不起(国)【宝丽金】【WAV+CUE】
- 杨培安.2009-抒·情(EP)【擎天娱乐】【WAV+CUE】
- 周慧敏《EndlessDream》[WAV+CUE]
- 彭芳《纯色角3》2007[WAV+CUE]
- 江志丰2008-今生为你[豪记][WAV+CUE]
- 罗大佑1994《恋曲2000》音乐工厂[WAV+CUE][1G]
- 群星《一首歌一个故事》赵英俊某些作品重唱企划[FLAC分轨][1G]
- 群星《网易云英文歌曲播放量TOP100》[MP3][1G]
- 方大同.2024-梦想家TheDreamer【赋音乐】【FLAC分轨】
- 李慧珍.2007-爱死了【华谊兄弟】【WAV+CUE】
- 王大文.2019-国际太空站【环球】【FLAC分轨】
- 群星《2022超好听的十倍音质网络歌曲(163)》U盘音乐[WAV分轨][1.1G]
- 童丽《啼笑姻缘》头版限量编号24K金碟[低速原抓WAV+CUE][1.1G]