Scraping in 20 mins

Paul Bradshaw Leanpub.com/scrapingforjournalists * Scraping in 20 mins Friday, 13 July 2012


Transcript of Scraping in 20 mins

Page 1: Scraping in 20 mins

Paul Bradshaw
Leanpub.com/scrapingforjournalists

Scraping in 20 mins

Friday, 13 July 2012

Page 2: Scraping in 20 mins


Page 3: Scraping in 20 mins

Function (Parameters)

Page 4: Scraping in 20 mins

Function (Parameters)

=SUM(A2:A50)
=AVERAGE(B2:B300)
=COUNTIF(A10:A3000, "Smith")
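The same function(parameters) pattern carries over to a programming language. A minimal Python sketch of the three formulas above, assuming short made-up lists in place of the spreadsheet ranges (the variable names and values are illustrative, not from the slides):

```python
from statistics import mean

# Hypothetical stand-ins for spreadsheet ranges (values invented for illustration).
column_a = [10, 20, 30]
column_b = [4.0, 6.0]
surnames = ["Smith", "Jones", "Smith"]

total = sum(column_a)              # like =SUM(A2:A50)
average = mean(column_b)           # like =AVERAGE(B2:B300)
smiths = surnames.count("Smith")   # like =COUNTIF(A10:A3000, "Smith")

print(total, average, smiths)
```

In each case the function name comes first and the parameters go inside the brackets, just as in the spreadsheet.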

Page 5: Scraping in 20 mins

("string", index)

Page 6: Scraping in 20 mins

Tip: search for documentation

Page 7: Scraping in 20 mins

Tip: search for structure around data

Page 8: Scraping in 20 mins


Page 9: Scraping in 20 mins

//div[starts-with(@class, 'jobWrap')]
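To see what that XPath expression selects, here is a rough Python sketch. It assumes an invented, well-formed HTML snippet in place of a real jobs page, and it uses the standard library's xml.etree.ElementTree, whose XPath subset has no starts-with() function, so the same test is written as a plain Python filter:

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet standing in for a jobs page.
page = """
<html><body>
  <div class="jobWrap odd"><a href="/job/1">Reporter</a></div>
  <div class="jobWrap even"><a href="/job/2">Editor</a></div>
  <div class="sidebar">Not a job</div>
</body></html>
"""

root = ET.fromstring(page)

# Same idea as //div[starts-with(@class, 'jobWrap')]:
# visit every div and keep those whose class attribute begins with "jobWrap".
jobs = [div for div in root.iter("div")
        if div.get("class", "").startswith("jobWrap")]

titles = [div.find("a").text for div in jobs]
print(titles)
```

The "starts with" test matters because the class value differs from one job to the next ("jobWrap odd", "jobWrap even" in this made-up page), so an exact match would miss some of them.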

Page 10: Scraping in 20 mins

bit.ly/nrwscraper2

Page 11: Scraping in 20 mins

excelnotes.posterous.com/tag/importxml
/tag/importhtml

Page 12: Scraping in 20 mins


Page 15: Scraping in 20 mins

Things to know

• Libraries
• Functions
• Variables
• Lists or arrays ['Bob', 'Jane']
• Index
• String, integer, float
• If/Else
• For loops
• Operators
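Most of the items on that list fit into a few lines of Python. The sketch below is illustrative only: the function name describe and all the values are made up for the example.

```python
import math  # a library: ready-made functions you import


def describe(age):  # a function with one parameter
    # if/else picks one of two branches
    if age >= 18:
        return "adult"
    else:
        return "minor"


names = ["Bob", "Jane"]   # a variable holding a list
first = names[0]          # index: counting starts at 0
count = 2                 # integer
price = 2.5               # float
label = "total"           # string
total = count * price     # operators: * + - / == >=

results = []
for name in names:        # for loop: runs once per item in the list
    results.append(name.upper())

print(label, first, total, describe(21), math.sqrt(16), results)
```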

Page 16: Scraping in 20 mins

Following the data

• From String (URL) ->
• Variable (html) ->
• Variable (root) ->
• Variable containing a list (tds) ->
• Variable (td)
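That chain of variables can be sketched in Python as follows. The slides do not name a library, so this version assumes the standard library's xml.etree.ElementTree and uses an inline, well-formed HTML string in place of a downloaded page; the URL and the table contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# String (URL): hypothetical address, not actually fetched in this sketch.
url = "http://example.com/table.html"

# html: in a live scraper this text would be downloaded from url
# (for instance with urllib.request); here it is typed in by hand.
html = "<html><body><table><tr><td>Duarte</td><td>Sihl</td></tr></table></body></html>"

root = ET.fromstring(html)     # root: the parsed document tree
tds = list(root.iter("td"))    # tds: a list of every <td> element

cells = []
for td in tds:                 # td: one element from the list at a time
    cells.append(td.text)

print(cells)
```

ElementTree only accepts well-formed markup, so a real page would usually need a more forgiving HTML parser at the root step; the flow from string to list to single item stays the same.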

Page 17: Scraping in 20 mins

Looping through a list

• tds = ['Duarte', 'Sihl', 'Franzi', 'Paul']
• for td in tds
• The first time, td = Duarte
• The second time, td = Sihl
• Then td = Franzi
• Then td = Paul
• Then it has finished the loop!
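Run as Python, the loop described above looks roughly like this. The position counter and the seen list are additions for illustration, so each pass through the loop is visible:

```python
tds = ["Duarte", "Sihl", "Franzi", "Paul"]

seen = []
for position, td in enumerate(tds, start=1):
    # td takes each value in turn; position counts the passes
    print(position, td)
    seen.append(td)

print("Finished the loop!")
```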

Page 18: Scraping in 20 mins


Page 19: Scraping in 20 mins

Leanpub.com/scrapingforjournalists
@paulbradshaw
onlinejournalismblog.com
helpmeinvestigate.com
slideshare.net/onlinejournalist
linkedin.com/in/onlinejournalist