Screen scraping with ScraperWiki (Jindřich Mynarz)

Screen scraping with ScraperWiki. Big Clean, Prague, 19 March 2011. Jindřich Mynarz, NTK, SNM FF UK

Presentation "Screen scraping with ScraperWiki" from the Big Clean workshop. Would you like to know more? Find presentations, reports, conference videos, photos and much more in the NTK institutional repository at: http://repozitar.techlib.cz/?ln=en

Transcript of Screen scraping with ScraperWiki (Jindřich Mynarz)

Page 1: Screen scraping with ScraperWiki (Jindřich Mynarz)

Screen scraping with ScraperWiki

Big Clean, Prague, 19 March 2011

Jindřich Mynarz, NTK, SNM FF UK

Page 2

What is a "scraper"?

A scraper is "a computer program that converts web pages into data."

(http://scraperwiki.com/)

Page 3

Steps of a screen scraper

1. Downloading the information source (e.g. HTML)
2. Parsing
3. Extracting information
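The three steps can be sketched in Python using only the standard library. This is a minimal illustration, not the approach shown in the talk: the function names, the choice of extracting links, and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1: download the information source (here an HTML page)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 2: parse the HTML, collecting (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Step 3: extract the information of interest as plain data."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return [{"url": href, "title": text} for href, text in parser.links]


# Hypothetical page content, used here instead of a live download.
SAMPLE_HTML = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
records = extract_links(SAMPLE_HTML)
```

In a real scraper, `extract_links(download(url))` would chain all three steps; the sample string only keeps the sketch runnable offline.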

Page 4

Information extraction and parsing

• HTML
o HTMLTidy
o Document Object Model (DOM)

• text
o regular expressions
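The two approaches can be combined: walk the DOM to reach the right elements, then use a regular expression on the text inside them. The sketch below assumes the page has already been cleaned into well-formed XHTML (for example by HTMLTidy), so that Python's `xml.dom.minidom` can parse it. The markup, the `Price: … CZK` pattern and the function names are hypothetical.

```python
import re
from xml.dom import minidom

# Hypothetical input, assumed to be already tidied into well-formed XHTML.
TIDIED = """<html><body>
<table>
<tr><td>Item A</td><td>Price: 120 CZK</td></tr>
<tr><td>Item B</td><td>Price: 85 CZK</td></tr>
</table>
</body></html>"""

# Regular expression for the free-text part of a cell.
PRICE = re.compile(r"Price:\s*(\d+)\s*CZK")


def cell_text(node):
    """Concatenate the text nodes directly under a DOM element."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def extract_rows(xhtml):
    """Use the DOM to locate table cells, then a regex to pull out the number."""
    dom = minidom.parseString(xhtml)
    rows = []
    for tr in dom.getElementsByTagName("tr"):
        cells = [cell_text(td) for td in tr.getElementsByTagName("td")]
        if len(cells) < 2:
            continue  # skip rows that do not have the expected shape
        match = PRICE.search(cells[1])
        if match:
            rows.append((cells[0], int(match.group(1))))
    return rows


rows = extract_rows(TIDIED)
```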

Page 5

Responsible scraping

• A scraper's visit to a web page should be indistinguishable from a visit by a human.

• Visiting a website is like visiting someone at home.

• Be polite.

Page 6

Limit the number of HTTP requests

Page 7

Limit the number of HTTP requests

1. Limit the amount of downloaded data to only what you need.

2. Spread HTTP requests out over time.

3. Use a cache.
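Points 2 and 3 can be illustrated with a small wrapper that enforces a minimum delay between requests and remembers pages it has already fetched. This is a sketch under stated assumptions: the class `PoliteFetcher`, its parameters and the one-second default delay are invented for the example, and the cache lives only in memory.

```python
import time
from urllib.request import urlopen


def http_get(url):
    """Default downloader: a plain HTTP GET returning the body as bytes."""
    with urlopen(url) as response:
        return response.read()


class PoliteFetcher:
    """Fetch URLs with a minimum delay between requests and an in-memory cache."""

    def __init__(self, delay_seconds=1.0, downloader=http_get,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self.downloader = downloader
        self.clock = clock
        self.sleep = sleep
        self.cache = {}
        self._last_request = None

    def get(self, url):
        # Cached pages cost no HTTP request at all.
        if url in self.cache:
            return self.cache[url]
        # Wait until at least delay_seconds have passed since the last request.
        if self._last_request is not None:
            wait = self.delay_seconds - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)
        body = self.downloader(url)
        self._last_request = self.clock()
        self.cache[url] = body
        return body


# Demonstration with a stand-in downloader, so no real site is contacted.
requested = []


def fake_download(url):
    requested.append(url)
    return b"<html>" + url.encode() + b"</html>"


fetcher = PoliteFetcher(delay_seconds=0.05, downloader=fake_download)
for url in ["http://example.com/1", "http://example.com/1", "http://example.com/2"]:
    fetcher.get(url)
```

The second request for the same URL is answered from the cache, so only two downloads take place. A persistent cache (on disk or in a database) would also avoid repeated requests across scraper runs.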

Page 8

http://www.flickr.com/photos/dreamsjung/5244004907/

Page 9

Conditions for scraping

• Check whether you are entitled to use the website's content.
• Respect the licences of web page content.
• Respect robots.txt.
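Respecting robots.txt can be automated with `urllib.robotparser` from the Python standard library. The sketch below parses a hypothetical robots.txt from a string so that it runs offline. The rules, the user-agent name `example-scraper` and the URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a live scraper would instead call
# set_url("http://example.com/robots.txt") followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "example-scraper"  # hypothetical user-agent name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if robots.txt permits USER_AGENT to fetch the URL."""
    return robots.can_fetch(USER_AGENT, url)


public_ok = allowed("http://example.com/data/page1.html")
private_ok = allowed("http://example.com/private/page1.html")
```

A scraper would call `allowed(url)` before every download and skip any URL for which it returns `False`. Licence and usage terms still have to be checked by a person; robots.txt covers only crawler access.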

Page 10

Tools

• Needlebase
o http://needlebase.com/

• Yahoo! Query Language
o SELECT * FROM html WHERE url="http://example.com"
o http://developer.yahoo.com/yql/

• Google Spreadsheets
o importHtml()
o http://docs.google.com/

• ScraperWiki
o http://scraperwiki.com/

Page 11

ScraperWiki

• A wiki for screen scrapers that enables their collaborative development
• Hosting for scrapers
• Views of the collected data: formatting and basic data analysis
• Hosted database (SQLite)
• Tools for working with various formats: HTML, CSV, XLS, PDF
• Supported programming languages: Python, Ruby, PHP
• Data harvested by scrapers can be downloaded as CSV, XML, JSON, etc.
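The store-then-export workflow can be approximated locally with Python's built-in `sqlite3`, `csv` and `json` modules. This is not ScraperWiki's own API: the table layout, the `save` helper and the sample records are hypothetical, and an in-memory database stands in for the hosted one.

```python
import csv
import io
import json
import sqlite3

# In-memory database as a stand-in for the hosted SQLite store.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, title TEXT, url TEXT)")


def save(record):
    """Insert a record, or overwrite the existing row with the same id."""
    conn.execute(
        "INSERT OR REPLACE INTO records (id, title, url) VALUES (:id, :title, :url)",
        record,
    )
    conn.commit()


def export_json():
    """Return all records as a JSON array of objects."""
    rows = conn.execute("SELECT id, title, url FROM records ORDER BY id")
    return json.dumps([dict(row) for row in rows], ensure_ascii=False)


def export_csv():
    """Return all records as CSV text with a header row."""
    rows = conn.execute("SELECT id, title, url FROM records ORDER BY id")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "title", "url"])
    writer.writerows(tuple(row) for row in rows)
    return buffer.getvalue()


# Hypothetical records; re-saving id "1" updates the row instead of duplicating it.
save({"id": "1", "title": "First", "url": "http://example.com/1"})
save({"id": "2", "title": "Second", "url": "http://example.com/2"})
save({"id": "1", "title": "First (updated)", "url": "http://example.com/1"})
```

Keying rows on a unique identifier means a scraper can be re-run without piling up duplicate rows, which keeps the exported CSV and JSON clean.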

Page 12

Get involved

• A "market" for scrapers
o Demand for scrapers: bounties offered for data
o Calls to fix and better describe scrapers and the views built on the harvested data (tags, descriptions)

Page 13

Further information

About ScraperWiki
• http://scraperwiki.com/about/

Tutorials
• http://scraperwiki.com/help/tutorials/

Documentation
• http://scraperwiki.com/help/documentation/
