Getting Started with SelectorLib

  1. Install Chrome Extension
  2. Mark up all the data
    1. Create a template
    2. Adding Elements
      1. Concept of Groups & Multiple Instances
      2. Adding Child Elements to the Group
        1. Type: Text
        2. Type: Link
        3. Type: Attribute
    3. Highlight and Data Preview
    4. Export
  3. Python Code
    1. Code
  4. The data

Install Chrome Extension

Download and install the Selectorlib Chrome extension.

Install from Chrome Web Store

Mark up all the data

Create a template

Go to the website you want to mark up and extract data from, for example scrapeme.live/shop. Open the Developer Tools in Chrome (F12 or Ctrl + Shift + I) and find “Selectorlib” in the toolbar.

Create a new template and give it any name. For now, call it “PokemonShop”.

You now have a template for a page to start adding elements.

Adding Elements

Let’s add a few elements, starting with all the Pokémon that are listed on the page.

Press + Add on the top left.

Concept of Groups & Multiple Instances

Let’s first add the product listing. Here is how it works: when the same kind of data is repeated on a page, we first select the repeating group, and then add the other elements inside it as children. So for now, let’s select the products in the product list, that is, our Pokémon.

Give the element a name: Pokémon. On the selector input, click on Pick Element, then move over to the page, hover over the group, and click when you see the entire area highlighted in green.

If you are not able to select the group directly, click on any element inside the group, then click on the Enable Shortcuts bar in the bottom right corner. You get three shortcuts: S, P and C. P selects the parent of the current element. S selects an element without actually clicking on it, which is useful when clicking the element would take you to another page. C goes back down to the child element if you have moved up to a parent. These concepts are borrowed from WebScraper.io.

Pressing P a few times moves up through the parent elements until you reach the one you want, in this case our group.

Once you have that, click on done selecting. Since there are many instances of this item at the root level of the template, let’s mark it as multiple by clicking on the checkbox. Then press save. You should see the data preview bar below the selector show a lot of the text inside the product group.

Adding Child Elements to the Group

Now, let’s add the name, price, product_link and image. Click on the + button on the Pokémon element you just added. This lets you add children to that element.

Type: Text

Let’s add one called Name, and click on select element. You will see that the first element of the product listing is highlighted in yellow. Click on the name inside the yellow area and it should be selected. Press done and you will see the data preview change to show just the name for each product. You do not need to mark it as multiple, because at the current level (inside the product group) there is only one instance of it.

Let’s go ahead and add the price. It is just like the name.

Type: Link

Now, we have an interesting problem here. How do we get the URL of the product page? It’s not visible on the page. That’s what the Link type is for. If you select any a element and choose Link as the type, it pulls the href from it. Let’s add a new element called product_link, choose the selector, and find the a tag for the product. Pick Link as the type, and save. You should see the link along with the other data in the preview.

Type: Attribute

Similarly, for images you need to pick the Attribute type. A new input field called Attribute shows up; just choose src from the list, and save. The data should appear in the preview.
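
The Link and Attribute types both read an attribute of the selected element rather than its visible text: Link reads href, and Attribute reads whichever attribute you choose. The sketch below illustrates that idea using only Python's standard library. It is not how Selectorlib is implemented, and the helper names (AttributeCollector, collect) and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect one attribute from every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == self.tag:
            value = dict(attrs).get(self.attribute)
            if value is not None:
                self.values.append(value)


def collect(html, tag, attribute):
    """Return the given attribute of every matching tag, in document order."""
    parser = AttributeCollector(tag, attribute)
    parser.feed(html)
    return parser.values


# Made-up markup loosely modelled on a product listing
SAMPLE_HTML = """
<li class="product">
  <a href="https://example.com/shop/pikachu/">
    <img src="https://example.com/img/025.png" alt="Pikachu">
    <h2>Pikachu</h2>
  </a>
</li>
"""

links = collect(SAMPLE_HTML, "a", "href")    # the kind of value the Link type reads
images = collect(SAMPLE_HTML, "img", "src")  # the kind of value Attribute with src reads
print(links)
print(images)
```

Neither URL appears as visible text on the page, which is why a plain Text selector would not return it.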

Highlight and Data Preview

To see which fields are selected, press highlight on the top toolbar. You will see a preview with boxes and labels around each data point you’ve marked.

Now let’s see if this works with the next page. Go to page 2 and try highlighting again; you should see the boxes around the same fields there too.

Press update data, and you will see that the data also updates. You can copy the data as JSON directly from the browser. It’s useful for quick extraction.

Export

Export the template by clicking on Export and then Download. The exported file is named after the template, for example PokemonShop.yml.

The exported YAML file should look similar to this:

pokemon:
    css: li.product
    multiple: true
    type: Text
    children:
        name:
            css: h2.woocommerce-loop-product__title
            type: Text
        price:
            css: span.woocommerce-Price-amount
            type: Text
        image:
            css: img.attachment-woocommerce_thumbnail
            type: Attribute
            attribute: src
        product_link:
            css: a.woocommerce-LoopProduct-link
            type: Link


Python Code

Install the Selectorlib Python package

pip install selectorlib

Create a folder called pokemon_scraper

mkdir pokemon_scraper
cd pokemon_scraper

Move the file you just downloaded above to this folder.

mv ~/Downloads/PokemonShop.yml .

Create a new Python file and open it with an editor

touch scrape.py

Code

from selectorlib import Extractor
import requests

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('PokemonShop.yml')

# Download the page using requests
r = requests.get('https://scrapeme.live/shop/')
# Pass the HTML of the page to the extractor to get the data
data = e.extract(r.text)
# Print the data
print(data)

# Download another page of similar structure
r = requests.get('https://scrapeme.live/shop/page/2/')
# Use the same extractor to get the data
data = e.extract(r.text)
# Print it again
print(data)
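
Printing the dictionary is fine for a first check, but you will usually want to store the rows somewhere. One possible approach, using only the standard library, is to write each entry under the pokemon key as a CSV row. The helper name write_products_csv and the sample row are illustrative assumptions; the field names follow the template above.

```python
import csv
import io

FIELDS = ["name", "price", "image", "product_link"]


def write_products_csv(data, fileobj):
    """Write the entries under data['pokemon'] to fileobj as CSV rows."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    # Guard against a missing or empty group by falling back to an empty list
    for row in data.get("pokemon") or []:
        # Replace missing fields with an empty string so every row has all columns
        writer.writerow({field: row.get(field) or "" for field in FIELDS})


# A sample in the same shape as the extractor output
sample = {
    "pokemon": [
        {
            "name": "Pikachu",
            "price": "£37.00",
            "image": "https://scrapeme.live/wp-content/uploads/2018/08/025-350x350.png",
            "product_link": "https://scrapeme.live/shop/Pikachu/",
        }
    ]
}

buffer = io.StringIO()
write_products_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open('pokemon.csv', 'w', newline='', encoding='utf-8').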

The data

{
  "pokemon": [
    {
      "name": "Pidgeotto",
      "price": "£84.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/017-350x350.png",
      "product_link": "https://scrapeme.live/shop/Pidgeotto/"
    },
    {
      "name": "Pidgeot",
      "price": "£185.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/018-350x350.png",
      "product_link": "https://scrapeme.live/shop/Pidgeot/"
    },
    {
      "name": "Rattata",
      "price": "£128.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/019-350x350.png",
      "product_link": "https://scrapeme.live/shop/Rattata/"
    },
    {
      "name": "Raticate",
      "price": "£60.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/020-350x350.png",
      "product_link": "https://scrapeme.live/shop/Raticate/"
    },
    {
      "name": "Spearow",
      "price": "£133.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/021-350x350.png",
      "product_link": "https://scrapeme.live/shop/Spearow/"
    },
    {
      "name": "Fearow",
      "price": "£95.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/022-350x350.png",
      "product_link": "https://scrapeme.live/shop/Fearow/"
    },
    {
      "name": "Ekans",
      "price": "£55.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/023-350x350.png",
      "product_link": "https://scrapeme.live/shop/Ekans/"
    },
    {
      "name": "Arbok",
      "price": "£182.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/024-350x350.png",
      "product_link": "https://scrapeme.live/shop/Arbok/"
    },
    {
      "name": "Pikachu",
      "price": "£37.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/025-350x350.png",
      "product_link": "https://scrapeme.live/shop/Pikachu/"
    },
    {
      "name": "Raichu",
      "price": "£140.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/026-350x350.png",
      "product_link": "https://scrapeme.live/shop/Raichu/"
    },
    {
      "name": "Sandshrew",
      "price": "£82.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/027-350x350.png",
      "product_link": "https://scrapeme.live/shop/Sandshrew/"
    },
    {
      "name": "Sandslash",
      "price": "£155.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/028-350x350.png",
      "product_link": "https://scrapeme.live/shop/Sandslash/"
    },
    {
      "name": "Nidorina",
      "price": "£28.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/030-350x350.png",
      "product_link": "https://scrapeme.live/shop/Nidorina/"
    },
    {
      "name": "Nidoqueen",
      "price": "£106.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/031-350x350.png",
      "product_link": "https://scrapeme.live/shop/Nidoqueen/"
    },
    {
      "name": "Nidorino",
      "price": "£179.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/033-350x350.png",
      "product_link": "https://scrapeme.live/shop/Nidorino/"
    },
    {
      "name": "Nidoking",
      "price": "£31.00",
      "image": "https://scrapeme.live/wp-content/uploads/2018/08/034-350x350.png",
      "product_link": "https://scrapeme.live/shop/Nidoking/"
    }
  ]
}
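
The prices come back as text, exactly as they appear on the page. If you want to sort or compare them, they need to be converted to numbers first. The following is a minimal sketch of one way to do that; the function name parse_price is hypothetical, and it assumes prices use a dot as the decimal separator, as in the output above.

```python
from decimal import Decimal


def parse_price(text):
    """Turn a price string such as '£84.00' into a Decimal amount."""
    # Keep digits and the decimal point; drop currency symbols, spaces and commas
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


# Two entries in the same shape as the extracted data
sample = [
    {"name": "Pidgeotto", "price": "£84.00"},
    {"name": "Pikachu", "price": "£37.00"},
]

prices = {item["name"]: parse_price(item["price"]) for item in sample}
cheapest = min(prices, key=prices.get)
print(prices)
print(cheapest)
```

Decimal is used here instead of float so that amounts such as 84.00 are kept exactly.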