
How To Access All Scraped Items In Scrapy Item Pipeline?

I have an item with a rank field that has to be built by analyzing other item classes. I don't want to use a database or other backend to store them - I just need to access a

Solution 1:

I think signals might help. I did something similar here

https://github.com/dm03514/CraigslistGigs/blob/master/craigslist_gigs/pipelines.py

It seems kind of hacky, but in your spider you can create a property which will store all your scraped items. In your pipeline you can register a method to be called on the spider_closed signal. This method takes the spider instance as a parameter, so you can then access the spider property that contains all your scraped items.
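
A minimal sketch of this approach is shown below; the class names, the collected_items property, and the logging call are placeholder choices and not part of the original answer or the linked project.

import scrapy
from scrapy import signals


class MySpider(scrapy.Spider):
    # Hypothetical spider that exposes a property holding every scraped item.
    name = 'example'
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.collected_items = []

    def parse(self, response):
        yield {'title': response.css('title::text').get()}


class CollectItemsPipeline:
    # Appends each item to the spider property and registers a handler
    # for the spider_closed signal.

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def process_item(self, item, spider):
        spider.collected_items.append(item)
        return item

    def spider_closed(self, spider):
        # The handler receives the spider instance, so every scraped item
        # is available here, e.g. for computing the rank field.
        spider.logger.info('Collected %d items', len(spider.collected_items))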

Solution 2:

This pipeline will make sure all Items have a rank.

class MyPipeline(object):

    def process_item(self, item, spider):
        item['rank'] = item.get('rank') or '1'
        return item
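
Like any pipeline, it only runs if it is enabled in settings.py. Assuming the class lives in a module such as myproject/pipelines.py (the module path and priority below are placeholders), the entry would look like:

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}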

Solution 3:

You can collect all scraped items using Extensions and Signals.

from scrapy import signals


class ItemCollectorExtension:
    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        extension = cls()

        crawler.signals.connect(extension.add_item, signal=signals.item_scraped)
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)

        return extension

    def spider_closed(self):
        print(self.items)  # Replace with your code

    def add_item(self, item):
        self.items.append(item)

Now, every time a new item is successfully scraped, it is added to self.items. When the spider is closing and all items have been collected, the spider_closed method is called; here you can access all the collected items.

Don't forget to enable the Extension in settings.py.
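
For example, assuming the class above is saved in myproject/extensions.py (the module path and priority are assumptions, not part of the original answer):

EXTENSIONS = {
    'myproject.extensions.ItemCollectorExtension': 500,
}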
