Using SelectorLib with Scrapy
SelectorLib is just a Python package, so it works with Scrapy too. To demonstrate that, we’ll set up a Scrapy spider to crawl https://scrapeme.live/shop
Building the YML Files
For this site, we need to extract data from two types of pages: a listing page and a product page. We can build the YAML files for both using the Chrome extension; see the Product Page and Listing Page examples.
Listing Page
Here is the YAML file we will use for the Listing Page - https://scrapeme.live/shop
product_page:
    css: a.woocommerce-LoopProduct-link
    multiple: true
    type: Link
next_page:
    css: a.next
    type: Link
The data it extracts has this structure:
{
    "product_page": [
        "https://scrapeme.live/shop/Bulbasaur/",
        "https://scrapeme.live/shop/Ivysaur/",
        "https://scrapeme.live/shop/Venusaur/",
        "https://scrapeme.live/shop/Charmander/",
        "https://scrapeme.live/shop/Charmeleon/",
        "https://scrapeme.live/shop/Charizard/",
        "https://scrapeme.live/shop/Squirtle/",
        "https://scrapeme.live/shop/Wartortle/",
        "https://scrapeme.live/shop/Blastoise/",
        "https://scrapeme.live/shop/Caterpie/",
        "https://scrapeme.live/shop/Metapod/",
        "https://scrapeme.live/shop/Butterfree/",
        "https://scrapeme.live/shop/Weedle/",
        "https://scrapeme.live/shop/Kakuna/",
        "https://scrapeme.live/shop/Beedrill/",
        "https://scrapeme.live/shop/Pidgey/"
    ],
    "next_page": "https://scrapeme.live/shop/page/2/"
}
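Code that consumes this structure has to cope with a field being empty on some pages; the last listing page, for instance, has no next link. The sketch below is one way to handle that, assuming an unmatched selector shows up either as a missing key or as None. The helper name links_to_follow is hypothetical and not part of SelectorLib or Scrapy.

```python
from typing import Any, Dict, List, Optional, Tuple


def links_to_follow(data: Dict[str, Any]) -> Tuple[List[str], Optional[str]]:
    """Split listing-page data into product URLs and an optional next-page URL.

    Assumption: a field whose selector matched nothing is missing or None.
    """
    product_urls = data.get("product_page") or []
    next_page = data.get("next_page") or None
    return list(product_urls), next_page


# Sample shaped like the listing-page output above (shortened).
sample = {
    "product_page": [
        "https://scrapeme.live/shop/Bulbasaur/",
        "https://scrapeme.live/shop/Ivysaur/",
    ],
    "next_page": "https://scrapeme.live/shop/page/2/",
}
products, next_page = links_to_follow(sample)

# A hypothetical last page where neither selector matched anything.
last_products, last_next = links_to_follow({"product_page": None, "next_page": None})
```

Returning an empty list and None for the empty case lets the caller loop and test for truthiness without extra checks.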
Product Page
For product pages with a structure similar to https://scrapeme.live/shop/Bulbasaur/, we will use this YAML file:
name:
    css: h1.product_title
    type: Text
image:
    css: .woocommerce-product-gallery__wrapper img
    type: Attribute
    attribute: src
price:
    css: 'p.price span.woocommerce-Price-amount'
    type: Text
short_description:
    css: 'div.woocommerce-product-details__short-description p'
    type: Text
stock:
    css: p.stock
    type: Text
sku:
    css: span.sku
    type: Text
categories:
    css: 'span.posted_in a'
    multiple: true
    type: Text
tags:
    css: 'span.tagged_as a'
    multiple: true
    type: Text
description:
    css: 'div.woocommerce-Tabs-panel.woocommerce-Tabs-panel--description p'
    type: Text
additional_information:
    css: 'table.shop_attributes tr'
    multiple: true
    type: Text
    children:
        info:
            css: th
            type: Text
        value:
            css: td
            type: Text
related_products:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        price:
            css: span.price
            type: Text
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
It extracts data in this structure:
{
    "name": "Bulbasaur",
    "image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
    "price": "£63.00",
    "short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
    "stock": "45 in stock",
    "sku": "4391",
    "categories": [
        "Pokemon",
        "Seed"
    ],
    "tags": [
        "bulbasaur",
        "Overgrow",
        "Seed"
    ],
    "description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
    "additional_information": [
        {
            "info": "Weight",
            "value": "15.2 kg"
        },
        {
            "info": "Dimensions",
            "value": "2 x 2 x 2 cm"
        }
    ],
    "related_products": [
        {
            "name": "Charmeleon",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
            "price": "£165.00",
            "url": "https://scrapeme.live/shop/Charmeleon/"
        },
        {
            "name": "Fearow",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
            "price": "£95.00",
            "url": "https://scrapeme.live/shop/Fearow/"
        },
        {
            "name": "Blastoise",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
            "price": "£76.00",
            "url": "https://scrapeme.live/shop/Blastoise/"
        }
    ]
}
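Every value in this structure is a string, so fields such as price and stock usually need converting before they are stored or compared. The following is a minimal sketch of such post-processing using only the standard library. The helper names (parse_price, parse_stock, clean_product) are hypothetical, and the parsing rules are assumptions based on the sample values shown above, not on anything SelectorLib provides.

```python
import re
from decimal import Decimal
from typing import Any, Dict, Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Turn a price string such as '£63.00' into a Decimal; None if no number is found."""
    if not text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return Decimal(match.group()) if match else None


def parse_stock(text: Optional[str]) -> Optional[int]:
    """Turn '45 in stock' into 45; None when the text contains no number."""
    if not text:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def clean_product(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the extracted product with typed price and stock fields."""
    cleaned = dict(raw)
    cleaned["price"] = parse_price(raw.get("price"))
    cleaned["stock"] = parse_stock(raw.get("stock"))
    # Collapse the list of {"info", "value"} rows into a single mapping.
    rows = raw.get("additional_information") or []
    cleaned["additional_information"] = {row["info"]: row["value"] for row in rows}
    return cleaned


# Shortened sample taken from the product-page output above.
raw = {
    "name": "Bulbasaur",
    "price": "£63.00",
    "stock": "45 in stock",
    "additional_information": [
        {"info": "Weight", "value": "15.2 kg"},
        {"info": "Dimensions", "value": "2 x 2 x 2 cm"},
    ],
}
product = clean_product(raw)
```

In a Scrapy project, logic like this would typically live in an item pipeline rather than in the spider, which keeps the YAML files and the spider focused on extraction.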
Setting up Scrapy Project
We’ll assume you already know how to set up Scrapy and use it to create a project.
Create a scrapy project and a spider
scrapy startproject scrapeme_shop
cd scrapeme_shop
scrapy genspider scrapeme scrapeme.live
mkdir scrapeme_shop/selectorlib_yaml
Copy the two YML files into the selectorlib_yaml folder, naming them ListingPage.yml and ProductPage.yml.
Your scrapeme_shop folder should look similar to this:
├── scrapeme_shop
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── selectorlib_yaml
│   │   ├── ListingPage.yml
│   │   └── ProductPage.yml
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── scrapeme.py
└── scrapy.cfg
Spider Code
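The spider is short: it loads one Extractor per YAML file, follows the pagination and product links found on each listing page, and yields the data extracted from each product page.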
# -*- coding: utf-8 -*-
import os

import scrapy
import selectorlib


class ScrapemeSpider(scrapy.Spider):
    name = 'scrapeme'
    allowed_domains = ['scrapeme.live']
    start_urls = ['http://scrapeme.live/shop/']
    # Create Extractor for listing page
    listing_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ListingPage.yml'))
    # Create Extractor for product page
    product_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ProductPage.yml'))

    def parse(self, response):
        # Extract data using Extractor
        data = self.listing_page_extractor.extract(response.text)
        # Guard against a missing or empty field (e.g. no next link on the last page)
        if data.get('next_page'):
            yield scrapy.Request(data['next_page'], callback=self.parse)
        for p in data.get('product_page') or []:
            yield scrapy.Request(p, callback=self.parse_product)

    def parse_product(self, response):
        # Extract data using Extractor
        product = self.product_page_extractor.extract(response.text)
        if product:
            yield product
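The spider locates each YAML file by joining a relative '../selectorlib_yaml/...' segment onto the directory of the spider module. An equivalent approach, sketched here with pathlib, walks up from the spider file to the package folder explicitly. The helper name yaml_path and the example location under /tmp are hypothetical.

```python
from pathlib import Path


def yaml_path(spider_file: str, name: str) -> Path:
    """Return <package>/selectorlib_yaml/<name> for a module in <package>/spiders/."""
    # parent -> spiders/ ; parent.parent -> the package folder next to settings.py
    return Path(spider_file).parent.parent / "selectorlib_yaml" / name


# Example with a hypothetical location of spiders/scrapeme.py
listing = yaml_path(
    "/tmp/scrapeme_shop/scrapeme_shop/spiders/scrapeme.py", "ListingPage.yml"
)
```

Inside the spider one would pass __file__ as the first argument and hand str(yaml_path(__file__, 'ListingPage.yml')) to selectorlib.Extractor.from_yaml_file.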
Full Code
You can find the full Scrapy project on GitHub.