Getting Started with SelectorLib
Install Chrome Extension
Download and install the SelectorLib Chrome extension.
Mark up all the data
Create a template
Go to the website you want to mark up and extract data from, say scrapeme.live/shop.
Open the Developer Tools in Chrome (F12 or Ctrl + Shift + I).
Find “Selectorlib” in the toolbar.
Create a new template and give it any name. For now, call it “PokemonShop”.
You now have a template for a page to start adding elements.
Adding Elements
Let’s add a few elements, starting with all the Pokémon that are listed on the page.
Press + Add at the top left.
Concept of Groups & Multiple Instances
Let’s first add the product listing. Here is how it works: when a lot of data on the page is repeated, first select the repeating group, then add the other elements inside it as children. So for now, let’s select the products in the product list, AKA our Pokémon.
Give the element a name: Pokémon. In the selector input, click on Pick Element, then move over to the page, hover over the group, and click when you see the entire area highlighted in green.
If you are not able to get it, click on any element in the group, then click on the Enable Shortcuts bar in the bottom right corner. You get three shortcuts: S, P and C. S selects the element under the cursor without actually clicking on it, which is useful when a click would take you to another page. P selects the parent of the current element. C goes back down to the child element if you have moved up to a parent. These concepts are borrowed from WebScraper.io.
Pressing P a few times is another way to reach the parent element, our group.
Once you have that, click on done selecting. Since there are many instances of this item at the root level of the template, mark it as multiple by clicking on the checkbox. Then press save. You should see the data preview bar below the selector show a lot of the text inside the product group.
Adding Child Elements to the Group
Now, let’s add the name, price, product_link and image. Click on the + button on the Pokemon element you just added. This lets you add children to the element.
Type: Text
Let’s add one called Name and click on select element. You will see that the first element of the product listing is highlighted in yellow. Click on the name inside the yellow area and it should select it. Press done and you will see the data preview change to show just the name as multiple objects. You do not need to mark it as multiple, because at its current level, i.e. inside the product group, there is only one instance of it.
Let’s go ahead and add the price. It is just like the name.
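To make the group-and-children idea concrete, here is a rough standard-library analogy. It is not how SelectorLib works internally; the markup is a simplified, hypothetical stand-in for the shop page, and the two records reuse names and prices from the sample output later in this guide. The group selector matches every product, and each child lookup runs inside a single product, which is why the children do not need to be marked as multiple.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the shop page.
PAGE = """
<ul>
  <li class="product"><h2>Pikachu</h2><span class="price">£37.00</span></li>
  <li class="product"><h2>Raichu</h2><span class="price">£140.00</span></li>
</ul>
"""

root = ET.fromstring(PAGE)

pokemon = []
# The group, marked as multiple: match every product on the page.
for product in root.findall(".//li[@class='product']"):
    # Child lookups are scoped to one product, so each matches once.
    pokemon.append({
        "name": product.findtext("h2"),
        "price": product.findtext("span[@class='price']"),
    })

print(pokemon)
```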
Type: Link
Now, we have an interesting problem here. How do we get the URL to the product page? It’s not visible on the page. That’s what the Link type is for. If you select any a element and choose Link as the type, it pulls the href from it. Let’s add a new element called product_link, pick the selector, and find the a tag for the product. Pick Link as the type, and save. You should see the link along with the other data in the preview.
Type: Attribute
Similarly, for images you need to pick the Attribute type. A new input field called Attribute shows up; just choose src from the list, and save. The data should be there in the preview.
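Both the Link and Attribute types read a value out of a tag rather than its visible text. The sketch below shows the same idea using only Python's standard library; it is an illustration, not SelectorLib's implementation. The snippet is hypothetical, simplified markup, and the relative URLs are an assumption made here to show how they can be resolved against the page address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, simplified markup for a single product.
SNIPPET = """
<li class="product">
  <a class="woocommerce-LoopProduct-link" href="/shop/Pikachu/">
    <img class="attachment-woocommerce_thumbnail"
         src="/wp-content/uploads/2018/08/025-350x350.png">
    <h2>Pikachu</h2>
  </a>
</li>
"""


class AttributeCollector(HTMLParser):
    """Record the first href on an <a> and the first src on an <img>."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.found.setdefault("product_link", attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.found.setdefault("image", attrs["src"])


collector = AttributeCollector()
collector.feed(SNIPPET)

# Resolve relative URLs against the page they were found on.
base_url = "https://scrapeme.live/shop/"
record = {key: urljoin(base_url, value) for key, value in collector.found.items()}
print(record)
```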
Highlight and Data Preview
To see which fields are selected, press highlight on the top toolbar. You get a preview with boxes and labels around each data point you’ve marked.
Now let’s see if this works with the next page. Go to page 2 and try highlighting again; you should see the boxes again.
Press update data, and you will see that the data also updates. You can copy the data as JSON directly from the browser. It’s useful for quick extraction.
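If you paste the copied JSON into a script, the standard json module is enough to work with it. This is a minimal sketch that assumes the copied data has the same shape as the sample output shown later in this guide; the string below is a trimmed stand-in with two of those records.

```python
import json

# Stand-in for JSON copied from the data preview (assumed shape,
# trimmed to two records from this guide's sample output).
copied = """
{
  "pokemon": [
    {"name": "Pikachu", "price": "£37.00"},
    {"name": "Raichu", "price": "£140.00"}
  ]
}
"""

data = json.loads(copied)

# Each entry in the group is a plain dict keyed by element name.
names = [item["name"] for item in data["pokemon"]]
print(names)
```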
Export
Export this by clicking on Export and then Download. The exported file will have a name like PokemonShop.yml
The exported YAML file should look similar to this:
pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        product_link:
            css: a.woocommerce-LoopProduct-link
            type: Link
Python Code
Install SelectorLib for Python, along with Requests, which the script below uses to download pages
pip install selectorlib requests
Create a folder called pokemon_scraper
mkdir pokemon_scraper
cd pokemon_scraper
Move the file you just downloaded above to this folder.
mv ~/Downloads/PokemonShop.yml .
Create a new Python file and open it in an editor
touch scrape.py
Code
from selectorlib import Extractor
import requests
# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('PokemonShop.yml')
# Download the page using requests
r = requests.get('https://scrapeme.live/shop/')
# Pass the HTML of the page to the extractor to get the data
data = e.extract(r.text)
# Print the data
print(data)
# Download another page of similar structure
r = requests.get('https://scrapeme.live/shop/page/2/')
# Use the same extractor to get the data
data = e.extract(r.text)
# Print it again
print(data)
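Since the same extractor works on every listing page, the two requests above generalize to a loop. The sketch below is one possible way to do it. The names page_url and scrape_pages are hypothetical helpers, not part of SelectorLib, and the URL pattern is taken from the two addresses used in the script. Fetching and extracting are passed in as plain callables so the loop can be tried offline.

```python
from typing import Callable, Dict, Iterable, List

SHOP_URL = "https://scrapeme.live/shop/"


def page_url(page: int) -> str:
    """Page 1 is the shop root; later pages live under page/<n>/."""
    return SHOP_URL if page == 1 else f"{SHOP_URL}page/{page}/"


def scrape_pages(
    pages: Iterable[int],
    fetch: Callable[[str], str],
    extract: Callable[[str], Dict],
) -> List[Dict]:
    """Fetch each page, extract it, and merge the 'pokemon' lists."""
    results: List[Dict] = []
    for page in pages:
        html = fetch(page_url(page))
        data = extract(html)
        # Assumption: treat a missing or empty group as "no results".
        results.extend(data.get("pokemon") or [])
    return results


# Offline demo with stand-in callables (no network, no template needed).
def fake_fetch(url: str) -> str:
    return url


def fake_extract(html: str) -> Dict:
    return {"pokemon": [{"source": html}]}


print(scrape_pages([1, 2], fake_fetch, fake_extract))
```

In scrape.py, the stand-ins would be replaced with something like `lambda url: requests.get(url).text` for fetch and `e.extract` for extract.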
The data
{
    "pokemon": [
        {
            "name": "Pidgeotto",
            "price": "£84.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/017-350x350.png",
            "product_link": "https://scrapeme.live/shop/Pidgeotto/"
        },
        {
            "name": "Pidgeot",
            "price": "£185.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/018-350x350.png",
            "product_link": "https://scrapeme.live/shop/Pidgeot/"
        },
        {
            "name": "Rattata",
            "price": "£128.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/019-350x350.png",
            "product_link": "https://scrapeme.live/shop/Rattata/"
        },
        {
            "name": "Raticate",
            "price": "£60.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/020-350x350.png",
            "product_link": "https://scrapeme.live/shop/Raticate/"
        },
        {
            "name": "Spearow",
            "price": "£133.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/021-350x350.png",
            "product_link": "https://scrapeme.live/shop/Spearow/"
        },
        {
            "name": "Fearow",
            "price": "£95.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
            "product_link": "https://scrapeme.live/shop/Fearow/"
        },
        {
            "name": "Ekans",
            "price": "£55.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/023-350x350.png",
            "product_link": "https://scrapeme.live/shop/Ekans/"
        },
        {
            "name": "Arbok",
            "price": "£182.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/024-350x350.png",
            "product_link": "https://scrapeme.live/shop/Arbok/"
        },
        {
            "name": "Pikachu",
            "price": "£37.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/025-350x350.png",
            "product_link": "https://scrapeme.live/shop/Pikachu/"
        },
        {
            "name": "Raichu",
            "price": "£140.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/026-350x350.png",
            "product_link": "https://scrapeme.live/shop/Raichu/"
        },
        {
            "name": "Sandshrew",
            "price": "£82.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/027-350x350.png",
            "product_link": "https://scrapeme.live/shop/Sandshrew/"
        },
        {
            "name": "Sandslash",
            "price": "£155.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/028-350x350.png",
            "product_link": "https://scrapeme.live/shop/Sandslash/"
        },
        {
            "name": "Nidorina",
            "price": "£28.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/030-350x350.png",
            "product_link": "https://scrapeme.live/shop/Nidorina/"
        },
        {
            "name": "Nidoqueen",
            "price": "£106.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/031-350x350.png",
            "product_link": "https://scrapeme.live/shop/Nidoqueen/"
        },
        {
            "name": "Nidorino",
            "price": "£179.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/033-350x350.png",
            "product_link": "https://scrapeme.live/shop/Nidorino/"
        },
        {
            "name": "Nidoking",
            "price": "£31.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/034-350x350.png",
            "product_link": "https://scrapeme.live/shop/Nidoking/"
        }
    ]
}
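Every value in the extracted data is a string, so prices usually need a little cleanup before they can be compared or exported. The following is a minimal sketch using only the standard library, with two records copied from the output above. parse_price is a hypothetical helper that assumes prices look like "£37.00"; the CSV is written to an in-memory buffer so the example has no side effects.

```python
import csv
import io
from decimal import Decimal

# Two records copied from the output above.
data = {
    "pokemon": [
        {
            "name": "Pikachu",
            "price": "£37.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/025-350x350.png",
            "product_link": "https://scrapeme.live/shop/Pikachu/",
        },
        {
            "name": "Nidorina",
            "price": "£28.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/030-350x350.png",
            "product_link": "https://scrapeme.live/shop/Nidorina/",
        },
    ]
}


def parse_price(text: str) -> Decimal:
    """Turn a string such as '£37.00' into Decimal('37.00')."""
    return Decimal(text.replace("£", "").replace(",", "").strip())


# Copy each record, swapping the price string for a Decimal.
rows = [dict(item, price=parse_price(item["price"])) for item in data["pokemon"]]

cheapest = min(rows, key=lambda row: row["price"])
print(cheapest["name"])

# Swap the buffer for open("pokemon.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price", "image", "product_link"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```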