Data Collection and Web Crawling

Page 1

Data Collection and Web Crawling

Page 2

Overview

• Data-intensive applications are likely to be powered by some databases.

• How do you get the data into your database?
  – Your private, secret data source
  – Public data from the Internet

• In this tutorial, we will introduce how to collect data from the Internet.
  – Use APIs
  – Web Crawlers

Page 3

Collecting data from the Internet: Use APIs

• The easiest way to get data from the Internet.
• Steps:
  – 1. Make sure the data source provides APIs for data collection.
  – 2. Obtain an API key or other forms of authorization.
  – 3. Read the documentation.
  – 4. Coding.

Page 4

Collecting data from the Internet: Use APIs

• Example: Twitter Search API
• 1. Make sure the data source provides APIs for data collection.
  – “Search API is focused on relevance and not completeness”
  – “Requests to the Search API, hosted on search.twitter.com, do not count towards the REST API limit. However, all requests coming from an IP address are applied to a Search Rate Limit. The Search Rate Limit isn't made public to discourage unnecessary search usage and abuse, but it is higher than the REST Rate Limit. We feel the Search Rate Limit is both liberal and sufficient for most applications and know that many application vendors have found it suitable for their needs.”

Page 5

Collecting data from the Internet: Use APIs

• 2. Obtain an API key or other forms of authorization.
  – Read through https://dev.twitter.com/docs/auth/tokens-devtwittercom and get them.
• 3. Read the documentation.
  – We found a Java implementation of the Twitter API (Twitter4J) and read some documentation files and sample code at http://twitter4j.org/en/index.html

Page 6

Collecting data from the Internet: Use APIs

• 4. Coding
• Code based on the documentation and code samples.

• Refer to our sample code (DataCollection/TweetsCollector.java)
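As a rough illustration of what such code looks like, here is a minimal sketch of a tweet search using Twitter4J. It is not the provided TweetsCollector.java; the OAuth keys and the search keyword are placeholders you would replace with your own values from step 2.

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetsCollectorSketch {
    public static void main(String[] args) throws TwitterException {
        // Placeholder credentials: replace with the tokens obtained in step 2.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("YOUR_CONSUMER_KEY")
                .setOAuthConsumerSecret("YOUR_CONSUMER_SECRET")
                .setOAuthAccessToken("YOUR_ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("YOUR_ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // Run one search query and print the matching tweets.
        QueryResult result = twitter.search(new Query("data collection"));
        for (Status status : result.getTweets()) {
            System.out.println("@" + status.getUser().getScreenName() + ": " + status.getText());
        }
    }
}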

Page 7

Collecting data from the Internet: Web Crawlers

• However, other providers hosting the data you are interested in may not provide an API.
  – Example case: You want all movies’ information from IMDB, but IMDB doesn’t provide an API for programmers.
  – e.g. You want all the movie information found at a starting page, http://www.imdb.com/features/video/browse/
• You need to develop your own crawler.
• Prerequisites: an HTTP client and regular expressions.

Page 8

Collecting data from the Internet: Web Crawlers

• After browsing the website, you find out that each movie’s information can be found at http://www.imdb.com/title/tt******/, where ****** is the movie id.

• Pseudo Code:
  extract the movie ids from the starting page http://www.imdb.com/features/video/browse/
  for each id in {ids}:
      access http://www.imdb.com/title/tt-movieid/, store page content in d
      obtain the movie’s title t, year y, storyline s
      store (id, t, y, s) in database

Page 9

Collecting data from the Internet: Web Crawlers

• Selected useful Java methods:

• Read HTML files:

InputStream in = new URL(url).openConnection().getInputStream();
// Returns an InputStream that contains the source HTML content for url.

• Regex that finds specific patterns in a text:

Matcher m = Pattern.compile(regex).matcher(source_text);
while (m.find()) { String result = m.group(i); }
// In source_text, find the string(s) that match the pattern specified by regex;
// then store the i-th parenthesized group of regex.

• Wait for several seconds to reduce the risk of being detected and banned:

Thread.sleep((long) (1000 * Math.random() * k));
// Wait for 0 to k seconds.
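Putting these pieces together, the pseudocode from the previous page looks roughly like the sketch below. It is only an outline of the idea, not the provided MovieSpider.java: the regular expressions for the movie id, title, year, and storyline are illustrative placeholders (IMDB's real markup may differ), and the database step is replaced by printing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MovieSpiderSketch {
    // Download the page at url and return its HTML source as one string.
    static String fetch(String url) throws Exception {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openConnection().getInputStream(), "UTF-8"));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append("\n");
        }
        reader.close();
        return html.toString();
    }

    // Return the first captured group of regex in text, or null if there is no match.
    static String extract(String regex, String text) {
        Matcher m = Pattern.compile("(?mis)" + regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        // 1. Extract the movie ids from the starting page (the link pattern is an assumption).
        String startPage = fetch("http://www.imdb.com/features/video/browse/");
        List<String> ids = new ArrayList<String>();
        Matcher idMatcher = Pattern.compile("/title/tt(\\d+)/").matcher(startPage);
        while (idMatcher.find()) {
            ids.add(idMatcher.group(1));
        }

        // 2. Visit each movie page and pull out a few fields (these regexes are placeholders).
        for (String id : ids) {
            String page = fetch("http://www.imdb.com/title/tt" + id + "/");
            String title = extract("<span itemprop=\"name\">(.*?)</span>", page);
            String year = extract("/year/(\\d{4})/", page);
            String storyline = extract("<p itemprop=\"description\">(.*?)</p>", page);
            System.out.println(id + "\t" + title + "\t" + year + "\t" + storyline);

            // Wait for 0 to 5 seconds to reduce the risk of being detected and banned.
            Thread.sleep((long) (1000 * Math.random() * 5));
        }
    }
}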

Page 10

Regular Expression

• Regex – an advanced form of search.
  – A “normal search” only deals with finding fixed character sequences.
  – A regex can handle various patterns.

• An interactive tutorial:
  – http://regexone.com/

• A place to quickly test a written regex against a source text:
  – http://regexpal.com/

Page 11

Regular Expression

One of the most useful patterns for web crawlers:

<tag>(.*?)</tag>

matches everything surrounded by <tag> and </tag>.

Page 12

Example

html content:

<div class="txt-block" itemprop="actors" itemscope itemtype="http://schema.org/Person"> <h4 class="inline">Stars:</h4><name size=3>Ben Ziegler</name>, <name size=5>Glenna Hill</name>, <name size=4>Jason Woolfolk</name> <span class="ghost">|</span> <span class="see-more inline nobr"><a href="fullcredits?ref_=tt_ov_st_sm" itemprop='url'> See full cast and crew</a> &raquo; </span> </div>

Page 13

Example

• Match the three names surrounded by <name> tags:
  – <name size=\d>(.*?)</name>

Page 14

Example

• Convert this regex into a Java expression:

  – We use \\d instead of \d in order to escape the escape character “\” in a Java string literal.

  – The parentheses () control which group is extracted.

  – Feel the difference: what if we use (.*) instead of (.*?)? (See the example after the snippets below.)

Matcher m = Pattern.compile("(?mis)<name size=\\d>(.*?)</name>").matcher(html_content);
while (m.find()) { System.out.println("name: " + m.group(1)); }

Matcher m = Pattern.compile("(?mis)<name size=(\\d)>(.*?)</name>").matcher(html_content);
while (m.find()) { System.out.println("name: " + m.group(2)); }
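To answer the question above: (.*?) is lazy, so it stops at the first </name> and the loop prints the three names separately; (.*) is greedy, so it runs to the last </name> and the loop prints a single match that swallows the tags in between. A quick check against the same html_content:

// Greedy version: prints only one line,
//   name: Ben Ziegler</name>, <name size=5>Glenna Hill</name>, <name size=4>Jason Woolfolk
Matcher greedy = Pattern.compile("(?mis)<name size=\\d>(.*)</name>").matcher(html_content);
while (greedy.find()) { System.out.println("name: " + greedy.group(1)); }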

Page 15

Collecting data from the Internet: Web Crawlers

• A complete code sample is provided in:
  – DataCollection/MovieSpider.java

Page 16

Summary

Third-party APIs
  – Pros: convenient, easy to use; safe, won’t be blocked; fast
  – Cons: need to manage API keys; inflexible; limits on access

Your own web crawlers
  – Pros: very flexible; theoretically, you can collect anything you find
  – Cons: a lot of coding; may be blocked