Web Scraping for Non Programmers

Web scraping for non programmers ITNIG | 25th September 2014 @algonpaje - www.quadrigram.com

description

Web scraping is a core skill in the current technology context, but it is usually done by highly skilled programmers. In this itnig talk, our intention was to show that you, the non-programmer, can also do web scraping to gather data easily from different sources. For this purpose, we invited a top data scientist with 9 years of experience in data visualization: Alberto González Paje, data scientist at Bestiario.

Transcript of Web Scraping for Non Programmers

Page 1: Web Scraping for Non Programmers

Web scraping for non programmers

ITNIG | 25th September 2014

@algonpaje - www.quadrigram.com

Page 2: Web Scraping for Non Programmers

Goal: Introduce non programmers to APIs and scraping concepts (*)

(*) In a simple way...

Page 3: Web Scraping for Non Programmers
Page 4: Web Scraping for Non Programmers

How? Using a few modules of a visual programming language called “Quadrigram”

Page 5: Web Scraping for Non Programmers

> Quadrigram is software designed to make the practice of data analysis and data visualization more universal

> It is designed to gather, shape, and share data

> It lets you prototype and share ideas rapidly, and produce compelling solutions with data in the form of interactive visualizations, animations, or dashboards

> The Quadrigram approach to data analysis and visualization is based on a visual programming language composed of around 500 modules

Page 6: Web Scraping for Non Programmers

Example 1: Getting financial information in real time

Page 7: Web Scraping for Non Programmers

> Data source: http://finance.yahoo.com/

Stock Ticker Input Box

Page 8: Web Scraping for Non Programmers

> Base URL: http://finance.yahoo.com/q?s=TEF.MC&ql=1/

1.- http://finance.yahoo.com/q?s=

2.- ticker (TEF.MC)

3.- &ql=1/

1 + 2 + 3 = Base URL

Page 9: Web Scraping for Non Programmers

1.- Building base URL using Quadrigram

1.1.- Module “Text” (String): “http://finance.yahoo.com/q?s=”

1.2.- Module “Text Entry Box”: Input the stock ticker (e.g. TEF.MC)

1.3.- Module “Text” (String): “&ql=1/”

1.4.- Module “Addition of 5 objects” concatenating 1, 2 and 3

... result = “http://finance.yahoo.com/q?s=TEF.MC&ql=1/”
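For readers curious how this step looks in a text-based language, here is a minimal Python sketch of the same concatenation. The name `build_quote_url` is hypothetical and only for illustration; the URL pieces are the ones shown on the slide, and the Yahoo Finance page may no longer respond to this address.

```python
# Hypothetical helper mirroring steps 1.1 to 1.4: join three text pieces into one URL.
PREFIX = "http://finance.yahoo.com/q?s="   # piece 1 (module "Text")
SUFFIX = "&ql=1/"                          # piece 3 (module "Text")


def build_quote_url(ticker: str) -> str:
    """Concatenate prefix + ticker + suffix, like "Addition of 5 objects"."""
    return PREFIX + ticker.strip() + SUFFIX


url = build_quote_url("TEF.MC")
print(url)
```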

Page 10: Web Scraping for Non Programmers

2.- Querying data

2.1.- Connect the output of “Addition of 5 Objects” (“http://finance.yahoo.com/q?s=TEF.MC&ql=1/”) to module “Query HTTP GET”

2.2.- Connect a “Periodic Pulse” module to “Query HTTP GET” to query the data every “X” seconds

... and so we get our HTML code, ready to be scraped
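The querying step can be sketched in Python as follows. This is only an illustration under assumptions: `http_get` and `poll` are hypothetical names, the loop is bounded to a fixed number of requests (unlike a pulse that runs indefinitely), and the `fetch` parameter exists so the polling logic can be tried without touching the network.

```python
import time
import urllib.request
from typing import Callable, Iterator


def http_get(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body as text (the "Query HTTP GET" step)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def poll(url: str, every_seconds: float, times: int,
         fetch: Callable[[str], str] = http_get) -> Iterator[str]:
    """Yield the page text `times` times, pausing between requests.

    A bounded stand-in for the "Periodic Pulse" module.
    """
    for i in range(times):
        if i > 0:
            time.sleep(every_seconds)  # wait "X" seconds before the next query
        yield fetch(url)
```

A call such as `for html in poll(url, every_seconds=30, times=3): ...` would hand each fresh copy of the page to the scraping step.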

Page 11: Web Scraping for Non Programmers

3.- Scraping data

3.1.- Analyse the code and look for a “left - content - right” pattern.

In this case, the pattern we are looking for is:

left = <span id="yfs_l84_tef.mc">

content = stock price (* real time when the market is open)

right = </span>

Page 12: Web Scraping for Non Programmers

3.- Scraping data

Page 13: Web Scraping for Non Programmers

3.- Scraping data

3.2.- Use “Scrape Text” module to extract data

“Scrape Text” inlets:

source text = HTML code (output of Query HTTP GET)

start sequence = <span id="yfs_l84_tef.mc">

end sequence = </span>

3.3.- Extract the stock price using “Extract Object from List” module
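The “left - content - right” idea can be sketched in a few lines of Python. The function below is a hypothetical stand-in for the “Scrape Text” module, not its actual implementation, and the HTML fragment and price are made up for illustration.

```python
def scrape_text(source: str, start: str, end: str) -> list[str]:
    """Return every piece of text found between `start` and `end`."""
    results = []
    position = 0
    while True:
        begin = source.find(start, position)
        if begin == -1:
            break                      # no more start sequences
        begin += len(start)
        finish = source.find(end, begin)
        if finish == -1:
            break                      # start found but never closed
        results.append(source[begin:finish])
        position = finish + len(end)
    return results


# Made-up HTML fragment; the price is not real data.
sample_html = '<div><span id="yfs_l84_tef.mc">12.34</span></div>'
prices = scrape_text(sample_html, '<span id="yfs_l84_tef.mc">', '</span>')
first_price = prices[0]   # the "Extract Object from List" step
```

The function returns a list because a pattern can match several times on a page, which is why a separate step is needed to pick out the single value.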

Page 14: Web Scraping for Non Programmers

Page 15: Web Scraping for Non Programmers

Example 2: Build a network of similarities using “The Echonest” API

Page 16: Web Scraping for Non Programmers

>Data source: http://developer.echonest.com/raw_tutorials/artist_api/raw_artist_02.html

Page 17: Web Scraping for Non Programmers

>BaseURL:

http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes

1.- http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=

2.- artist's name (“strokes”)

1 + 2 = Base URL

Page 18: Web Scraping for Non Programmers

1.- Building base URL using Quadrigram

1.1.- Module “Text” (String): “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=”

1.2.- Module “Text Entry Box”: Input the artist's name (e.g. strokes)

1.3.- Module “Addition of 5 objects” concatenating 1 and 2

... result = “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”
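Plain concatenation works for a one-word name like “strokes”, but a name containing spaces or symbols generally has to be encoded before it goes into a URL. A minimal Python sketch of that idea, assuming a hypothetical helper `build_similar_url` and a placeholder key supplied by the caller:

```python
from urllib.parse import urlencode

# Endpoint taken from the slide; the service may no longer be available.
BASE = "http://developer.echonest.com/api/v4/artist/similar"


def build_similar_url(api_key: str, artist: str) -> str:
    """Build the query URL, encoding the artist name safely."""
    query = urlencode({"api_key": api_key, "name": artist})
    return BASE + "?" + query


print(build_similar_url("YOUR_API_KEY", "the strokes"))
```

Here `urlencode` turns the space in “the strokes” into `+`, so the resulting address stays valid.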

Page 19: Web Scraping for Non Programmers

2.- Querying data

2.1.- Connect the output of “Addition of 5 Objects” (“http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”) to module “Query HTTP GET”

... and so we get the response text (in this case JSON rather than HTML)

Page 20: Web Scraping for Non Programmers

3.- Scraping data

3.2.- Use “Scrape Text” module to extract data

“Scrape Text” inlets:

source text = response text (output of Query HTTP GET)

start sequence = "name": "

end sequence = "},

... and we obtain the list of artists similar to the one we queried
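The start and end sequences above look like JSON punctuation, so the response is most likely JSON. As an alternative to text scraping (this is not what the talk does), a JSON parser can pull out the names without depending on exact delimiters. In the sketch below the sample response is made up, and its `response` / `artists` layout is an assumption about the API's format, not something the slides confirm.

```python
import json

# Made-up sample shaped like the text the scrape pattern implies.
# The "response" / "artists" keys are an assumption, not confirmed by the slides.
sample = """
{"response": {"artists": [
  {"id": "A1", "name": "Artist One"},
  {"id": "A2", "name": "Artist Two"}
]}}
"""


def similar_artist_names(response_text: str) -> list[str]:
    """Parse the JSON text and collect each artist's name."""
    data = json.loads(response_text)
    return [artist["name"] for artist in data["response"]["artists"]]


names = similar_artist_names(sample)
```

One practical difference: in a sample like this the last entry is followed by `"}` and `]` rather than `"},`, so a purely delimiter-based scrape may need extra care there, while the parser handles every entry the same way.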

Page 21: Web Scraping for Non Programmers

4.- Build a Network of similarities

4.1.- Use “Length of List” module to count how many similar artists there are

4.2.- Use “Create List with repeated Object” module to create a list that repeats “strokes” once per similar artist

4.3.- Create a Pair Table using “Create Custom Data Structure” module

4.4.- Convert the Pair Table to a Network using “Convert PairTable to Network” module
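The four steps above can be sketched in Python as follows. The function names are hypothetical, the artist names in the usage line are placeholders, and the network is represented as a plain dictionary of neighbours, which is only one of several reasonable choices.

```python
def build_pair_table(center: str, similar: list[str]) -> list[tuple[str, str]]:
    """Pair the queried artist with each similar artist (one row per link)."""
    count = len(similar)                  # 4.1 "Length of List"
    repeated = [center] * count           # 4.2 "Create List with repeated Object"
    return list(zip(repeated, similar))   # 4.3 pair table


def pairs_to_network(pairs: list[tuple[str, str]]) -> dict[str, set[str]]:
    """4.4: turn the pair table into an undirected network (node -> neighbours)."""
    network: dict[str, set[str]] = {}
    for a, b in pairs:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


pairs = build_pair_table("strokes", ["Artist One", "Artist Two"])
network = pairs_to_network(pairs)
```

Every row of the pair table becomes one link, so the queried artist ends up in the centre with each similar artist attached to it.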

Page 22: Web Scraping for Non Programmers

Page 23: Web Scraping for Non Programmers

More information: www.quadrigram.com

Page 24: Web Scraping for Non Programmers

Thank you!!!
