YML Structure
Let's take a look at this fictional store that sells Pokemon: https://scrapeme.live/shop/
Here is a sample YML that SelectorLib accepts as input to extract the products listed on that page:
pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
Here pokemon is the main element. The elements name, price, image and url are inside it and are called the children of the pokemon element.
Every element starts with its name and can have these properties:
- css
- xpath
- type
- multiple
- children
- format
css (default: Blank)
The css selector for the element. In our example, the element called pokemon is in an li with the class product, so its selector is li.product.
xpath (default: Blank)
The xpath selector for the element. If we were to use xpath instead of a css selector for the element pokemon above, it would be //li[contains(@class,'product')].
Every element needs either a css or an xpath selector. If both xpath and css are defined, xpath takes precedence.
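The selector rule above can be sketched in plain Python. This is only an illustration: `pick_selector` is a hypothetical helper, not part of SelectorLib, and it assumes one element's properties have already been loaded into a dict.

```python
def pick_selector(element):
    """Return (kind, expression) for one element definition.

    Hypothetical helper for illustration; not SelectorLib's own code.
    """
    xpath = element.get("xpath")
    css = element.get("css")
    if xpath:
        # xpath takes precedence when both selectors are defined
        return ("xpath", xpath)
    if css:
        return ("css", css)
    raise ValueError("element needs either a css or an xpath selector")


both = {"css": "li.product", "xpath": "//li[contains(@class,'product')]"}
print(pick_selector(both))                   # the xpath entry is chosen
print(pick_selector({"css": "li.product"}))  # falls back to css
```

An element with neither property raises a ValueError in this sketch; how SelectorLib itself reports that case is not covered here.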
type (default: Text)
The type defines what kind of extraction needs to happen on the selected element. Here are the accepted types:
Text
This type of extraction extracts all the text content from the selected elements. If you have not specified a type, Text is used as the default.
Attribute
This type of extraction lets you extract a particular attribute, specified using the attribute property for the element. It is not usually required when you select with xpath, because an xpath expression can address the attribute directly (e.g. //img/@src), which a css selector cannot do.
Here is an example that extracts the src attribute of an img element:
image:
    css: img.attachment-woocommerce_thumbnail
    type: Attribute
    attribute: src
Link
This type is a shortcut for getting the href attribute from any link defined in the HTML with an <a> tag.
Example:
url:
    css: a.woocommerce-LoopProduct-link
    type: Link
HTML
The HTML type gives you the full HTML content of the element. This is useful when you need the HTML as is for some custom extraction or for checking a few conditions.
multiple (default: False)
If you need every match of an element's selector, set multiple to true. If you only need the first match, set multiple to false or leave it blank. For example, the element pokemon has multiple matches on the same page, so we have set multiple: true on it to get all of them.
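The effect of multiple on the shape of the result can be sketched as follows. `apply_multiple` is a hypothetical helper, not SelectorLib code, and returning None when nothing matches is an assumption made for this sketch.

```python
def apply_multiple(matches, multiple=False):
    """Shape a list of matches the way the `multiple` flag is described.

    Hypothetical helper for illustration only.
    """
    if multiple:
        # multiple: true keeps every match, as a list
        return list(matches)
    # multiple: false (or blank) keeps only the first match;
    # returning None for "no match" is an assumption of this sketch
    return matches[0] if matches else None


names = ["Bulbasaur", "Ivysaur", "Venusaur"]
print(apply_multiple(names, multiple=True))  # ['Bulbasaur', 'Ivysaur', 'Venusaur']
print(apply_multiple(names))                 # Bulbasaur
```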
children (default: Blank)
An element can have multiple child elements. In the example above, the parent element pokemon has these children: name, price, image and url. Each child element can also have children of its own, so elements can be nested. If an element has children, its type property is ignored.
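Because children can be nested to any depth, a definition is naturally walked recursively. The sketch below assumes the YML has already been loaded into nested dicts; `leaf_fields` is a hypothetical helper that lists the elements that actually produce values.

```python
def leaf_fields(config, prefix=""):
    """Yield dotted paths of elements that have no children.

    Hypothetical helper for illustration; not part of SelectorLib.
    """
    for name, element in config.items():
        path = f"{prefix}.{name}" if prefix else name
        children = element.get("children")
        if children:
            # A parent's own `type` is ignored; only its children yield values
            yield from leaf_fields(children, path)
        else:
            yield path


config = {
    "pokemon": {
        "css": "li.product",
        "multiple": True,
        "children": {
            "name": {"css": "h2.woocommerce-loop-product__title", "type": "Text"},
            "price": {"css": "span.woocommerce-Price-amount", "type": "Text"},
        },
    }
}
print(list(leaf_fields(config)))  # ['pokemon.name', 'pokemon.price']
```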
format
You can define custom formatters and use them for minor transformations on the extracted data. In Python, these formatters are defined as:
from selectorlib.formatter import Formatter

class Price(Formatter):
    def format(self, text):
        # Remove newlines and surrounding whitespace from the extracted text
        return text.replace('\n', '').strip()
Used in the YAML as
price:
    css: span.woocommerce-Price-amount
    type: Text
    format: Price
And passed to the Extractor when it is initialized:
formatters = Formatter.get_all()
Extractor.from_yaml_file('a.yaml', formatters=formatters)
Python Example
scrapeme_listing_page.yml
pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
extract.py
import re
from pprint import pprint

import requests
from selectorlib import Extractor, Formatter


# Define a formatter for Price
class Price(Formatter):
    def format(self, text):
        # Keep only the first decimal number found in the text
        price = re.findall(r'\d+\.\d+', text)
        if price:
            return price[0]
        return None


formatters = Formatter.get_all()
extractor = Extractor.from_yaml_file('./scrapeme_listing_page.yml', formatters=formatters)

# Download the HTML and use Extractor
r = requests.get('https://scrapeme.live/shop/')
data = extractor.extract(r.text)
pprint(data)
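The regular expression inside Price.format can be tried on its own, without downloading anything. The sketch below repeats the same pattern in a plain function; the names `parse_price` and `parse_price_with_commas` and the sample strings are made up for illustration. It also shows a limitation: a thousands separator splits the number, so the second function (an assumption, not from the original script) strips commas first.

```python
import re


def parse_price(text):
    """Same pattern as Price.format, as a plain function."""
    match = re.search(r'\d+\.\d+', text)
    return match.group(0) if match else None


def parse_price_with_commas(text):
    """Variant that drops thousands separators before matching."""
    return parse_price(text.replace(',', ''))


print(parse_price('£63.00'))                 # 63.00
print(parse_price('Free'))                   # None
print(parse_price('£1,234.00'))              # 234.00 (the comma splits the number)
print(parse_price_with_commas('£1,234.00'))  # 1234.00
```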
Running the scraper
$ python extract.py
The Data
{'pokemon': [{'image': 'https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png',
              'name': 'Bulbasaur',
              'price': '63.00',
              'url': 'https://scrapeme.live/shop/Bulbasaur/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png',
              'name': 'Ivysaur',
              'price': '87.00',
              'url': 'https://scrapeme.live/shop/Ivysaur/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/003-350x350.png',
              'name': 'Venusaur',
              'price': '105.00',
              'url': 'https://scrapeme.live/shop/Venusaur/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/004-350x350.png',
              'name': 'Charmander',
              'price': '48.00',
              'url': 'https://scrapeme.live/shop/Charmander/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png',
              'name': 'Charmeleon',
              'price': '165.00',
              'url': 'https://scrapeme.live/shop/Charmeleon/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/006-350x350.png',
              'name': 'Charizard',
              'price': '156.00',
              'url': 'https://scrapeme.live/shop/Charizard/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/007-350x350.png',
              'name': 'Squirtle',
              'price': '130.00',
              'url': 'https://scrapeme.live/shop/Squirtle/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/008-350x350.png',
              'name': 'Wartortle',
              'price': '123.00',
              'url': 'https://scrapeme.live/shop/Wartortle/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png',
              'name': 'Blastoise',
              'price': '76.00',
              'url': 'https://scrapeme.live/shop/Blastoise/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/010-350x350.png',
              'name': 'Caterpie',
              'price': '73.00',
              'url': 'https://scrapeme.live/shop/Caterpie/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/011-350x350.png',
              'name': 'Metapod',
              'price': '148.00',
              'url': 'https://scrapeme.live/shop/Kakuna/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/015-350x350.png',
              'name': 'Beedrill',
              'price': '168.00',
              'url': 'https://scrapeme.live/shop/Beedrill/'},
             {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png',
              'name': 'Pidgey',
              'price': '159.00',
              'url': 'https://scrapeme.live/shop/Pidgey/'}]}
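Since the result is a plain dict holding a list of records, it can be passed straight to standard-library tools. As a minimal sketch, the snippet below takes the first two records from the output above, writes them as CSV and sums the prices. `to_csv` is a hypothetical helper; in extract.py the `data` variable would come from `extractor.extract(...)` instead of being typed in.

```python
import csv
import io

# First two records copied from the output above
data = {'pokemon': [
    {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png',
     'name': 'Bulbasaur',
     'price': '63.00',
     'url': 'https://scrapeme.live/shop/Bulbasaur/'},
    {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png',
     'name': 'Ivysaur',
     'price': '87.00',
     'url': 'https://scrapeme.live/shop/Ivysaur/'},
]}


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(data['pokemon'], ['name', 'price', 'url', 'image'])
print(csv_text)

# Prices are extracted as strings, so convert them before doing arithmetic
total = sum(float(row['price']) for row in data['pokemon'])
print(total)  # 150.0
```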