Using SelectorLib to build an API for any website in less than 10 minutes
You can also use SelectorLib to build APIs that scrape websites in real time. Here is an example API, built using Python and aiohttp, that fetches product details from any product page on https://scrapeme.live/ with a page structure similar to https://scrapeme.live/shop/Bulbasaur/.
The YAML
We will reuse the YAML with formatters from Formatting Fields:
```yaml
name:
    css: h1.product_title
    type: Text
image:
    css: .woocommerce-product-gallery__wrapper img
    type: Attribute
    attribute: src
price:
    css: 'p.price span.woocommerce-Price-amount'
    type: Text
    format: Price
short_description:
    css: 'div.woocommerce-product-details__short-description p'
    type: Text
stock:
    css: p.stock
    type: Text
sku:
    css: span.sku
    type: Text
categories:
    css: 'span.posted_in a'
    multiple: true
    type: Text
tags:
    css: 'span.tagged_as a'
    multiple: true
    type: Text
description:
    css: 'div.woocommerce-Tabs-panel.woocommerce-Tabs-panel--description p'
    type: Text
additional_information:
    css: 'table.shop_attributes tr'
    multiple: true
    type: Text
    children:
        info:
            css: th
            type: Text
        value:
            css: td
            type: Text
related_products:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        price:
            css: span.price
            type: Text
            format: Price
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
```
The Code
```python
import asyncio
import aiohttp
from aiohttp import web
import selectorlib
from selectorlib.formatter import Formatter


class Price(Formatter):
    def format(self, text):
        # Strip the currency symbol and convert the price to a float
        price = text.replace('£', '').strip()
        return float(price)


product_page_extractor = selectorlib.Extractor.from_yaml_file(
    'ProductPage_with_Formatter.yml', formatters=[Price])


async def get_product_page(request):
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(verify_ssl=False)) as session:
        # Use .get() so a missing parameter returns the error message
        # instead of raising a KeyError
        product_url = request.rel_url.query.get('product_url')
        data = {'error': 'Please provide a URL'}
        if product_url:
            html = await fetch(session, product_url)
            data = product_page_extractor.extract(html)
        return web.json_response(data)


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


app = web.Application()
app.add_routes([web.get('/', get_product_page)])

if __name__ == '__main__':
    web.run_app(app)
```
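To see what the `Price` formatter does in isolation, here is the same stripping-and-conversion logic as a plain function (a minimal sketch; `parse_price` and the sample string are illustrative, not part of selectorlib):

```python
def parse_price(text):
    # Remove the pound sign and surrounding whitespace, then convert to float
    return float(text.replace('£', '').strip())

raw_price = '£63.00'  # sample value, as scraped from the page
print(parse_price(raw_price))  # 63.0
```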
Directory Structure
The directory structure should look like this:

```
aiohttp-api-example/
├── ProductPage_with_Formatter.yml
└── api.py
```
Running the API Server
From inside the folder aiohttp-api-example, run:

```shell
python3.7 api.py
```

The server should now start on http://localhost:8080.
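The handler reads the target page from the `product_url` query parameter. If you want to build such a request URL programmatically, a sketch using only the standard library (the host and port match the aiohttp defaults above):

```python
from urllib.parse import urlencode

base = 'http://localhost:8080/'
params = {'product_url': 'https://scrapeme.live/shop/Bulbasaur/'}
# urlencode percent-encodes the nested URL so it survives as a single parameter
request_url = base + '?' + urlencode(params)
print(request_url)
# http://localhost:8080/?product_url=https%3A%2F%2Fscrapeme.live%2Fshop%2FBulbasaur%2F
```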
Make a request to get the data
With the server running, load http://localhost:8080/?product_url=https://scrapeme.live/shop/Bulbasaur/ in your browser, and you should see data similar to:
```json
{
    "name": "Bulbasaur",
    "image": "https://scrapeme.live/wp-content/uploads/2018/08/001.png",
    "price": 63.0,
    "short_description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun\u2019s rays, the seed grows progressively larger.",
    "stock": "45 in stock",
    "sku": "4391",
    "categories": [
        "Pokemon",
        "Seed"
    ],
    "tags": [
        "bulbasaur",
        "Overgrow",
        "Seed"
    ],
    "description": "Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun\u2019s rays, the seed grows progressively larger.",
    "additional_information": [
        {
            "info": "Weight",
            "value": "15.2 kg"
        },
        {
            "info": "Dimensions",
            "value": "2 x 2 x 2 cm"
        }
    ],
    "related_products": [
        {
            "name": "Beedrill",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/015-350x350.png",
            "price": 168.0,
            "url": "https://scrapeme.live/shop/Beedrill/"
        },
        {
            "name": "Metapod",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/011-350x350.png",
            "price": 148.0,
            "url": "https://scrapeme.live/shop/Metapod/"
        },
        {
            "name": "Wartortle",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/008-350x350.png",
            "price": 123.0,
            "url": "https://scrapeme.live/shop/Wartortle/"
        }
    ]
}
```
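Because the endpoint returns plain JSON, any HTTP client can consume it. A minimal sketch of decoding the response in Python (`response_body` here is a trimmed sample of the output above, not a live request):

```python
import json

# Trimmed sample of the API response shown above
response_body = '''
{
    "name": "Bulbasaur",
    "price": 63.0,
    "categories": ["Pokemon", "Seed"]
}
'''

data = json.loads(response_body)
print(data['name'], data['price'])  # Bulbasaur 63.0
```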
Full Code
You can find the full project on GitHub.