Using SelectorLib with Scrapy

SelectorLib is just a Python package, so it works with Scrapy too. To demonstrate that, we’ll set up a spider in Scrapy to crawl https://scrapeme.live/shop

  1. Building the YML Files
    1. Listing Page
    2. Product Page
  2. Setting up Scrapy Project
  3. Spider Code
  4. Full Code
  5. Next

Building the YML Files

For this site, we need to extract data from two types of pages: a listing page and a product page. We can build the YAML files for both using the Chrome extension. Look at the Product Page and Listing Page examples.
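
If you want a quick feel for how a template built with the extension behaves from Python, an Extractor can also be loaded straight from a YAML string. Here is a minimal sketch with a hypothetical two-field template, just to show the API the rest of this post relies on:

from selectorlib import Extractor

# A tiny made-up template, similar in shape to what the Chrome extension exports
template = """
title:
    css: h1
    type: Text
link:
    css: a
    type: Link
"""

extractor = Extractor.from_yaml_string(template)
html = '<h1>Hello</h1><a href="https://example.com/more">More</a>'
print(extractor.extract(html))
# {'title': 'Hello', 'link': 'https://example.com/more'}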

Listing Page

Here is the YAML file we will use for the Listing Page - https://scrapeme.live/shop

product_page:
    css: a.woocommerce-LoopProduct-link
    multiple: true
    type: Link
next_page:
    css: a.next
    type: Link

The data it extracts has this structure

{
  "product_page": [
    "https://scrapeme.live/shop/Bulbasaur/",
    "https://scrapeme.live/shop/Ivysaur/",
    "https://scrapeme.live/shop/Venusaur/",
    "https://scrapeme.live/shop/Charmander/",
    "https://scrapeme.live/shop/Charmeleon/",
    "https://scrapeme.live/shop/Charizard/",
    "https://scrapeme.live/shop/Squirtle/",
    "https://scrapeme.live/shop/Wartortle/",
    "https://scrapeme.live/shop/Blastoise/",
    "https://scrapeme.live/shop/Caterpie/",
    "https://scrapeme.live/shop/Metapod/",
    "https://scrapeme.live/shop/Butterfree/",
    "https://scrapeme.live/shop/Weedle/",
    "https://scrapeme.live/shop/Kakuna/",
    "https://scrapeme.live/shop/Beedrill/",
    "https://scrapeme.live/shop/Pidgey/"
  ],
  "next_page": "https://scrapeme.live/shop/page/2/"
}
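
Before wiring this into Scrapy, you can check the template on its own. A minimal sketch, assuming the YAML above is saved as ListingPage.yml next to the script and the requests package is installed:

import requests
from selectorlib import Extractor

# Load the listing page template and run it against the live page
extractor = Extractor.from_yaml_file('ListingPage.yml')
response = requests.get('https://scrapeme.live/shop/')
data = extractor.extract(response.text)

print(data['next_page'])          # https://scrapeme.live/shop/page/2/
print(len(data['product_page']))  # 16 product links on the first page

The output should match the structure shown above.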

Product Page

For product pages with a structure similar to https://scrapeme.live/shop/Bulbasaur/, we use this YAML file

name:
    css: h1.product_title
    type: Text
image:
    css: .woocommerce-product-gallery__wrapper img
    type: Attribute
    attribute: src
price:
    css: 'p.price span.woocommerce-Price-amount'
    type: Text
short_description:
    css: 'div.woocommerce-product-details__short-description p'
    type: Text
stock:
    css: p.stock
    type: Text
sku:
    css: span.sku
    type: Text
categories:
    css: 'span.posted_in a'
    multiple: true
    type: Text
tags:
    css: 'span.tagged_as a'
    multiple: true
    type: Text
description:
    css: 'div.woocommerce-Tabs-panel.woocommerce-Tabs-panel--description p'
    type: Text
additional_information:
    css: 'table.shop_attributes tr'
    multiple: true
    type: Text
    children:
        info:
            css: th
            type: Text
        value:
            css: td
            type: Text
related_products:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        price:
            css: span.price
            type: Text
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link

It extracts data in this structure

{
  "name": "Bulbasaur",
  "image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
  "price": "£63.00",
  "short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "stock": "45 in stock",
  "sku": "4391",
  "categories": [
    "Pokemon",
    "Seed"
  ],
  "tags": [
    "bulbasaur",
    "Overgrow",
    "Seed"
  ],
  "description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "additional_information": [
    {
      "info": "Weight",
      "value": "15.2 kg"
    },
    {
      "info": "Dimensions",
      "value": "2 x 2 x 2 cm"
    }
  ],
  "related_products": [
    {
      "name": "Charmeleon",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
      "price": "£165.00",
      "url": "https://scrapeme.live/shop/Charmeleon/"
    },
    {
      "name": "Fearow",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
      "price": "£95.00",
      "url": "https://scrapeme.live/shop/Fearow/"
    },
    {
      "name": "Blastoise",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
      "price": "£76.00",
      "url": "https://scrapeme.live/shop/Blastoise/"
    }
  ]
}
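
Note how the fields defined with children (additional_information and related_products) come back as lists of nested dictionaries, so you can drill straight into them. A small sketch, again assuming ProductPage.yml sits next to the script and requests is installed:

import requests
from selectorlib import Extractor

# Load the product page template and run it against a single product page
extractor = Extractor.from_yaml_file('ProductPage.yml')
response = requests.get('https://scrapeme.live/shop/Bulbasaur/')
product = extractor.extract(response.text)

print(product['name'], product['price'])
for related in product['related_products']:
    print(related['name'], '->', related['url'])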

Setting up Scrapy Project

We’ll assume you already know how to install Scrapy and use it to create projects and spiders.

Create a Scrapy project, a spider, and a folder for the YAML files

scrapy startproject scrapeme_shop
cd scrapeme_shop
scrapy genspider scrapeme scrapeme.live
mkdir selectorlib_yaml

Copy the two YML files into the selectorlib_yaml folder.

Your scrapeme_shop folder should now look similar to this

├── scrapeme_shop
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── selectorlib_yaml
│   │   ├── ListingPage.yml
│   │   └── ProductPage.yml
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── scrapeme.py
└── scrapy.cfg

Spider Code

The code is concise.

# -*- coding: utf-8 -*-
import scrapy
import os 
import selectorlib

class ScrapemeSpider(scrapy.Spider):
    name = 'scrapeme'
    allowed_domains = ['scrapeme.live']
    start_urls = ['http://scrapeme.live/shop/']
    # Create Extractor for listing page
    listing_page_extractor = selectorlib.Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__),'../selectorlib_yaml/ListingPage.yml'))
    # Create Extractor for product page
    product_page_extractor = selectorlib.Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__),'../selectorlib_yaml/ProductPage.yml'))

    def parse(self, response):
        # Extract product links and the next page link using the listing page Extractor
        data = self.listing_page_extractor.extract(response.text)
        # next_page is None on the last page, so check the value rather than the key
        if data.get('next_page'):
            yield scrapy.Request(data['next_page'], callback=self.parse)
        for p in data.get('product_page') or []:
            yield scrapy.Request(p, callback=self.parse_product)

    def parse_product(self, response):
        # Extract product details using the product page Extractor
        product = self.product_page_extractor.extract(response.text)
        if product:
            yield product

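To run the spider and write the items to a file, the usual scrapy crawl scrapeme -o products.json from the project root works. If you would rather launch it from a plain Python script, here is a sketch using Scrapy's CrawlerProcess (the FEEDS setting needs Scrapy 2.1 or later, and the output path is just an example):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapeme_shop.spiders.scrapeme import ScrapemeSpider

# Run this from the project root so scrapy.cfg and settings.py are picked up
settings = get_project_settings()
settings.set('FEEDS', {'products.json': {'format': 'json'}})

process = CrawlerProcess(settings)
process.crawl(ScrapemeSpider)
process.start()  # blocks until the crawl finishes
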
Full Code

You can find the full Scrapy project on GitHub.

View Code on GitHub

Next

Using Formatters