Selectorlib
Selectorlib is a combination of two packages: a Chrome extension that lets you mark up data on websites and export a YAML file describing it, and a Python library that reads this YAML file and extracts the data you marked up on the page.
Why was it built
Selectorlib was built out of frustration. At Scrapehero, we build about a hundred spiders a week for many websites, and about 60% of them are for new websites, ones we have never worked with before.
These were relatively uncomplicated websites: the data we needed was right there in the HTML, and not a lot of transformation was required. Part of the frustration was that there were quite a lot of data points to grab from the website for each row of data. Some might be in a nested format, with parent and child elements grouped together in different combinations, and we needed to format each of those fields. There were two problems:
Problems
- Finding the XPaths / CSS selectors for, say, 30 data points on a page. This took about an hour to find, test, and set.
- Formatting nested data. Assume we are dealing with the reviews of a particular product, with each review having 5 types of ratings, each presented as text similar to 4.7 stars/5 stars. The data we need would look like 4.7, and it is buried 3 levels deep, e.g. Product -> Review -> Ratings -> Rating. It usually takes a few for loops, and maybe a few if/else conditions, to get to that data (see the sketch after this list). This takes almost another two hours.
All this for a relatively small scraper.
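To make the second problem concrete, here is a minimal sketch of the kind of hand-written parsing it involves, using Parsel (the Scrapy selector library we mention below). The selectors, field names, and rating format are hypothetical:

from parsel import Selector

def parse_products(html):
    # Hand-rolled nested parsing: hypothetical selectors for a
    # Product -> Review -> Ratings -> Rating hierarchy.
    sel = Selector(text=html)
    products = []
    for product in sel.css('div.product'):
        reviews = []
        for review in product.css('div.review'):
            ratings = {}
            for rating in review.css('span.rating'):
                name = rating.css('::attr(data-rating-name)').get()
                text = rating.css('::text').get() or ''
                # Turn "4.7 stars/5 stars" into 4.7
                if 'stars' in text:
                    ratings[name] = float(text.split('stars')[0].strip())
            reviews.append({'ratings': ratings})
        products.append({'reviews': reviews})
    return products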
Hunting for Solutions
It's during that time that we looked at some of the open source visual web scraping tools, like Portia for Scrapy and the Web Scraper Extension for Chrome (which was open source back then). There were more tools, but all of them required us to use their cloud platform. We already had a pretty good web crawling cloud of our own, built and perfected over the years to crawl about 3,000 pages a second (we don't usually do that; be nice to websites).
Portia was not maintained anymore, so we went ahead with WebScraper.io's Chrome extension: we marked up web pages, built functional scrapers with it, and wrote a tool to convert those "Sitemaps" into real scraper code in Python that we could run in our cloud for our customers. It worked for a few scrapers, but our team was very hesitant to adopt it, as they had to generate the code first, and then a lot of parsing was still required to get the data into the format we needed. This solved the first problem, but we still had the second.
Another problem that came up was that the developers never really liked the generated code. Once you generate something and the website changes, you need to start from scratch, and all the custom parsing code we had written is gone. We couldn't take what we wrote, put it back into the Web Scraper Extension, edit the selectors, and export it again.
We liked the interface of WebScraper.io, but it was limited in its data parsing capabilities, at least for our use cases. So we went ahead and built a library to do it ourselves, using some open source components from the Web Scraper Extension and Parsel from Scrapy.
Sort of a Solution
We like YAML. YAML was always readable, like Python code, and you can always convert it to JSON. So our engineers came up with this YAML structure and a Python library, where you name each element you need to extract and give it a selector and a formatter, and the library reads the HTML and exports the data after formatting it.
Then our front end developers built a Chrome Extension that can create this YAML file for a web page as you mark up the data and selectors. You can see what the data is going to look like right there in the browser as JSON, test your selectors against multiple pages, and check that all of them work. You can export this from your browser as a YAML file, put the file in your scraper's directory, read it, and then just pass in the HTML. We still haven't added formatters to the Chrome Extension yet, but for now you can write them in Python and reference them from the YAML, as sketched below.
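Here is a minimal sketch of what such a formatter might look like, following the Formatter interface from Selectorlib's documentation. The Rating class, its parsing logic, and the file name are illustrative:

from selectorlib import Extractor, Formatter

class Rating(Formatter):
    def format(self, text):
        # Illustrative logic: turn "4.7 stars/5 stars" into 4.7
        return float(text.split('stars')[0].strip())

# Collect all Formatter subclasses and hand them to the Extractor;
# the YAML can then reference one by class name on a field,
# e.g. `format: Rating`.
formatters = Formatter.get_all()
extractor = Extractor.from_yaml_file('selectors.yml', formatters=formatters)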
What can you do with it
An Example
For example, it takes this YAML string
pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
and the page source of that page (https://scrapeme.live/shop/), and turns it into a dict
from selectorlib import Extractor
import requests

# Fetch the page, then extract the marked-up fields from its HTML
r = requests.get('https://scrapeme.live/shop/')
e = Extractor.from_yaml_string("""
pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        url:
            css: a.woocommerce-LoopProduct-link
            type: Link
""")
data = e.extract(r.text)
print(data)
And here is the data:
[
    {
        'image': 'https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png',
        'name': 'Bulbasaur',
        'price': '63.00',
        'url': 'https://scrapeme.live/shop/Bulbasaur/'
    },
    {
        'image': 'https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png',
        'name': 'Ivysaur',
        'price': '87.00',
        'url': 'https://scrapeme.live/shop/Ivysaur/'
    },
    {
        'image': 'https://scrapeme.live/wp-content/uploads/2018/08/003-350x350.png',
        'name': 'Venusaur',
        'price': '105.00',
        'url': 'https://scrapeme.live/shop/Venusaur/'
    }
]
You can either create this YAML markup manually, by specifying CSS or XPath selectors, or use the Selectorlib Chrome Extension.
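For instance, a hand-written field can use an XPath selector instead of a CSS one. A hypothetical snippet:

name:
    xpath: //h2[contains(@class, "woocommerce-loop-product__title")]
    type: Text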
The Chrome Extension
You can use the Selectorlib Chrome Extension to mark up the data you need on web pages and export it in the YAML format that the Selectorlib Python library reads.
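Once exported, the YAML file is loaded the same way as the inline string in the example above. A small usage sketch, with illustrative file names:

from selectorlib import Extractor

# Read the YAML file exported from the Chrome Extension
extractor = Extractor.from_yaml_file('shop_selectors.yml')

# Pass in HTML fetched however you like
with open('page.html') as f:
    data = extractor.extract(f.read())
print(data)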