Formatting Fields using Formatters
Going back to our example Scrapy project, here is what one of the records we extracted looked like:
{
"name": "Bulbasaur",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
"price": "£63.00",
"short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
"stock": "45 in stock",
"sku": "4391",
"categories": [
"Pokemon",
"Seed"
],
"tags": [
"bulbasaur",
"Overgrow",
"Seed"
],
"description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
"additional_information": [
{
"info": "Weight",
"value": "15.2 kg"
},
{
"info": "Dimensions",
"value": "2 x 2 x 2 cm"
}
],
"related_products": [
{
"name": "Charmeleon",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
"price": "£165.00",
"url": "https://scrapeme.live/shop/Charmeleon/"
},
{
"name": "Fearow",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
"price": "£95.00",
"url": "https://scrapeme.live/shop/Fearow/"
},
{
"name": "Blastoise",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
"price": "£76.00",
"url": "https://scrapeme.live/shop/Blastoise/"
}
]
}
There are four prices in this record, each containing a £ symbol and saved as a string. Let's remove the currency symbol and convert each price to a float. We will use a Formatter to do it.
Formatters
Formatters are written in Python. Below is a formatter that strips the £ symbol from a field and converts the result to a float. We will put it in the same file as the Scrapy spider.
from selectorlib.formatter import Formatter


class Price(Formatter):
    def format(self, text):
        # Strip the currency symbol and surrounding whitespace, then convert to float
        price = text.replace('£', '').strip()
        return float(price)
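The formatter above works for the prices in this record, but `float()` raises a `ValueError` on a value with a thousands separator such as `£1,165.00`. Below is a minimal sketch of a more tolerant parsing helper that could be called from `Price.format`. The name `parse_price` is hypothetical (it is not part of selectorlib), and the sketch assumes prices use a comma as the thousands separator and a dot as the decimal point.

```python
import re


def parse_price(text):
    """Convert a price string such as '£1,165.00' to a float.

    Returns None when the text contains no number.
    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    # Remove thousands separators, then pick the first decimal number
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


print(parse_price('£63.00'))        # 63.0
print(parse_price(' £1,165.00 '))   # 1165.0
print(parse_price('Out of stock'))  # None
```

Returning `None` instead of raising keeps one malformed price from stopping the whole extraction; whether that is preferable depends on how strictly you want to validate the data.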
Let's modify our YAML to apply the new formatter to both price fields, using format: Price.
name:
    css: h1.product_title
    type: Text
image:
    css: .woocommerce-product-gallery__wrapper img
    type: Attribute
    attribute: src
price:
    css: 'p.price span.woocommerce-Price-amount'
    type: Text
    format: Price
short_description:
    css: 'div.woocommerce-product-details__short-description p'
    type: Text
stock:
    css: p.stock
    type: Text
sku:
    css: span.sku
    type: Text
categories:
    css: 'span.posted_in a'
    multiple: true
    type: Text
tags:
    css: 'span.tagged_as a'
    multiple: true
    type: Text
description:
    css: 'div.woocommerce-Tabs-panel.woocommerce-Tabs-panel--description p'
    type: Text
additional_information:
    css: 'table.shop_attributes tr'
    multiple: true
    type: Text
    children:
        info:
            css: th
            type: Text
        value:
            css: td
            type: Text
related_products:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        price:
            css: span.price
            type: Text
            format: Price
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
Formatters are passed to the extractor when it is initialised.
product_page_extractor = selectorlib.Extractor.from_yaml_file(
    os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ProductPage_with_Formatter.yml'),
    formatters=[Price],
)
The data from product_page_extractor would look like:
{
"name": "Bulbasaur",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
"price": 63.00,
"short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
"stock": "45 in stock",
"sku": "4391",
"categories": [
"Pokemon",
"Seed"
],
"tags": [
"bulbasaur",
"Overgrow",
"Seed"
],
"description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun’s rays, the seed grows progressively larger.",
"additional_information": [
{
"info": "Weight",
"value": "15.2 kg"
},
{
"info": "Dimensions",
"value": "2 x 2 x 2 cm"
}
],
"related_products": [
{
"name": "Charmeleon",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png",
"price": 165.00,
"url": "https://scrapeme.live/shop/Charmeleon/"
},
{
"name": "Fearow",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
"price": 95.00,
"url": "https://scrapeme.live/shop/Fearow/"
},
{
"name": "Blastoise",
"image": "https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png",
"price": 76.00,
"url": "https://scrapeme.live/shop/Blastoise/"
}
]
}
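One quick way to confirm that the formatter was applied is to check the type of every price in an extracted record. This is a minimal sketch, assuming `record` is a dict shaped like the output above (trimmed here to the relevant fields); `price_values` is a hypothetical helper, not part of selectorlib or Scrapy.

```python
def price_values(record):
    """Return the product's own price followed by the prices of its related products."""
    prices = [record.get('price')]
    # related_products may be missing or None, so fall back to an empty list
    for related in record.get('related_products') or []:
        prices.append(related.get('price'))
    return prices


# Trimmed sample record with the same shape as the formatted output
record = {
    'name': 'Bulbasaur',
    'price': 63.0,
    'related_products': [
        {'name': 'Charmeleon', 'price': 165.0},
        {'name': 'Fearow', 'price': 95.0},
        {'name': 'Blastoise', 'price': 76.0},
    ],
}

prices = price_values(record)
# Every price should now be a float rather than a string such as '£63.00'
assert all(isinstance(p, float) for p in prices)
print(prices)  # [63.0, 165.0, 95.0, 76.0]
```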
Modified Scrapy Code
# -*- coding: utf-8 -*-
import os

import scrapy
import selectorlib
from selectorlib.formatter import Formatter


class Price(Formatter):
    def format(self, text):
        # Strip the currency symbol and surrounding whitespace, then convert to float
        price = text.replace('£', '').strip()
        return float(price)


class ScrapemeSpider(scrapy.Spider):
    name = 'scrapeme_with_formatter'
    allowed_domains = ['scrapeme.live']
    start_urls = ['http://scrapeme.live/shop/']
    # Create Extractor for listing page
    listing_page_extractor = selectorlib.Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ListingPage.yml'))
    # Create Extractor for product page
    product_page_extractor = selectorlib.Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/ProductPage_with_Formatter.yml'), formatters=[Price])

    def parse(self, response):
        # Extract data using Extractor
        data = self.listing_page_extractor.extract(response.text)
        # A field may be None when nothing matches (e.g. no next page), so check the value
        if data.get('next_page'):
            yield scrapy.Request(data['next_page'], callback=self.parse)
        for p in data.get('product_page') or []:
            yield scrapy.Request(p, callback=self.parse_product)

    def parse_product(self, response):
        # Extract data using Extractor
        product = self.product_page_extractor.extract(response.text)
        if product:
            yield product
You can get the full project from GitHub.