Formatting Fields using Formatters


Going back to our example Scrapy project, here is what a record we extracted looked like:

{
  "name": "Bulbasaur",
  "image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
  "price": "£63.00",
  "short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "stock": "45 in stock",
  "sku": "4391",
  "categories": [
    "Pokemon",
    "Seed"
  ],
  "tags": [
    "bulbasaur",
    "Overgrow",
    "Seed"
  ],
  "description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "additional_information": [
    {
      "info": "Weight",
      "value": "15.2 kg"
    },
    {
      "info": "Dimensions",
      "value": "2 x 2 x 2 cm"
    }
  ],
  "related_products": [
    {
      "name": "Charmeleon",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
      "price": "£165.00",
      "url": "https://scrapeme.live/shop/Charmeleon/"
    },
    {
      "name": "Fearow",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
      "price": "£95.00",
      "url": "https://scrapeme.live/shop/Fearow/"
    },
    {
      "name": "Blastoise",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
      "price": "£76.00",
      "url": "https://scrapeme.live/shop/Blastoise/"
    }
  ]
}

There are four prices in this record, each containing a £ sign and saved as a string. Let's remove the currency symbol and convert each price to a float. We will use a Formatter to do it.

Formatters

Formatters are written in Python. Below is a formatter that strips the £ from a field's text and returns the result as a float. We will put it in the same file as the Scrapy spider.

from selectorlib.formatter import Formatter

class Price(Formatter):
    def format(self, text):
        price = text.replace('£','').strip()
        return float(price)
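
The formatter above assumes the text always looks like "£63.00". As a hedged sketch, the same cleaning step can be made more defensive against thousands separators or empty text. The function name `parse_price` is hypothetical and is not part of selectorlib; it is written as a plain function so it can be tried without the library, and its body could be reused inside `Price.format`.

```python
import re

def parse_price(text):
    """Hypothetical helper: turn a price string such as '£1,165.00' into a float.

    Returns None when no number can be found in the text.
    """
    if not text:
        return None
    # Find the first number, allowing commas as thousands separators.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop the commas before converting to float.
    return float(match.group(0).replace(",", ""))

print(parse_price("£63.00"))       # 63.0
print(parse_price(" £1,165.00 "))  # 1165.0
print(parse_price(""))             # None
```

Returning None for unparseable text is a design choice made for this sketch; raising an error instead would surface bad selectors earlier.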

Let's modify our YAML to include the new formatter by adding format: Price to each price field.

name:
    css: h1.product_title
    type: Text
image:
    css: .woocommerce-product-gallery__wrapper img
    type: Attribute
    attribute: src
price:
    css: 'p.price span.woocommerce-Price-amount'
    type: Text
    format: Price
short_description:
    css: 'div.woocommerce-product-details__short-description p'
    type: Text
stock:
    css: p.stock
    type: Text
sku:
    css: span.sku
    type: Text
categories:
    css: 'span.posted_in a'
    multiple: true
    type: Text
tags:
    css: 'span.tagged_as a'
    multiple: true
    type: Text
description:
    css: 'div.woocommerce-Tabs-panel.woocommerce-Tabs-panel--description p'
    type: Text
additional_information:
    css: 'table.shop_attributes tr'
    multiple: true
    type: Text
    children:
        info:
            css: th
            type: Text
        value:
            css: td
            type: Text
related_products:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        price:
            css: span.price
            type: Text
            format: Price
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link

The formatters are passed to the extractor when it is initialised.

product_page_extractor = selectorlib.Extractor.from_yaml_file(
    os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ProductPage_with_Formatter.yml'),
    formatters=[Price],
)
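
To make the link between format: Price in the YAML and formatters = [Price] in Python easier to picture, here is a simplified illustration of a name-based lookup. It is not selectorlib's actual code: the `Formatter` base class below is a stand-in, and `build_registry` and `apply_format` are hypothetical names. It assumes only what the example above suggests, namely that the YAML value matches the formatter's class name.

```python
class Formatter:
    """Stand-in base class for illustration only (not selectorlib's Formatter)."""
    def format(self, text):
        raise NotImplementedError

class Price(Formatter):
    def format(self, text):
        price = text.replace('£', '').strip()
        return float(price)

def build_registry(formatters):
    # Map each class name to an instance so a YAML value like "Price" can be looked up.
    return {cls.__name__: cls() for cls in formatters}

def apply_format(registry, format_name, text):
    # Fields without a "format" key keep their extracted text unchanged.
    if format_name is None:
        return text
    if format_name not in registry:
        raise KeyError(f"No formatter named {format_name!r} was registered")
    return registry[format_name].format(text)

registry = build_registry([Price])
print(apply_format(registry, "Price", "£63.00"))   # 63.0
print(apply_format(registry, None, "45 in stock")) # 45 in stock
```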

The data from product_page_extractor would look like:

{
  "name": "Bulbasaur",
  "image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
  "price": 63.00,
  "short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "stock": "45 in stock",
  "sku": "4391",
  "categories": [
    "Pokemon",
    "Seed"
  ],
  "tags": [
    "bulbasaur",
    "Overgrow",
    "Seed"
  ],
  "description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
  "additional_information": [
    {
      "info": "Weight",
      "value": "15.2 kg"
    },
    {
      "info": "Dimensions",
      "value": "2 x 2 x 2 cm"
    }
  ],
  "related_products": [
    {
      "name": "Charmeleon",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
      "price": 165.00,
      "url": "https://scrapeme.live/shop/Charmeleon/"
    },
    {
      "name": "Fearow",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
      "price": 95.00,
      "url": "https://scrapeme.live/shop/Fearow/"
    },
    {
      "name": "Blastoise",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
      "price": 76.00,
      "url": "https://scrapeme.live/shop/Blastoise/"
    }
  ]
}
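
Because the prices are now floats, they can be compared and aggregated directly. The short sketch below uses a trimmed copy of the record above; the helper name `related_cheaper_than` is hypothetical and only serves the illustration.

```python
# Trimmed copy of the record shown above (only the fields needed here).
record = {
    "name": "Bulbasaur",
    "price": 63.00,
    "related_products": [
        {"name": "Charmeleon", "price": 165.00},
        {"name": "Fearow", "price": 95.00},
        {"name": "Blastoise", "price": 76.00},
    ],
}

def related_cheaper_than(record, limit):
    """Hypothetical helper: names of related products priced below `limit`."""
    return [p["name"] for p in record["related_products"] if p["price"] < limit]

# Collect all four prices and confirm none of them is still a string.
all_prices = [record["price"]] + [p["price"] for p in record["related_products"]]
assert all(isinstance(p, float) for p in all_prices)

print(related_cheaper_than(record, 100.0))  # ['Fearow', 'Blastoise']
print(max(all_prices))                      # 165.0
```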

Modified Scrapy Code


# -*- coding: utf-8 -*-
import scrapy
import os 
import selectorlib
from selectorlib.formatter import Formatter

class Price(Formatter):
    def format(self, text):
        price = text.replace('£','').strip()
        return float(price)

class ScrapemeSpider(scrapy.Spider):
    name = 'scrapeme_with_formatter'
    allowed_domains = ['scrapeme.live']
    start_urls = ['http://scrapeme.live/shop/']
    # Create Extractor for listing page
    listing_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ListingPage.yml'))
    # Create Extractor for product page, registering the Price formatter
    product_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ProductPage_with_Formatter.yml'),
        formatters=[Price])

    def parse(self, response):
        # Extract data using Extractor
        data = self.listing_page_extractor.extract(response.text)
        # A field with no match may be present with a value of None,
        # so test the value rather than the key
        if data.get('next_page'):
            yield scrapy.Request(data['next_page'], callback=self.parse)
        for p in data.get('product_page') or []:
            yield scrapy.Request(p, callback=self.parse_product)
    
    def parse_product(self, response):
        # Extract data using Extractor
        product = self.product_page_extractor.extract(response.text)
        if product: 
            yield product

You can get the full project from GitHub.