Data Scraping Amit Sharma, Chenhao Tan - Cornell …Application programming interface •It is NOT...
Data Scraping
Been there, scraped that
Amit Sharma, Chenhao Tan
Why do you want to scrape data?
• It is cool to have some interesting data lying around
• Do research
– Is there a clear question in mind?
– What kind of data is needed?
– What degree of comprehensiveness is needed?
How do we scrape data?
• Processed datasets
– Stackoverflow, Wikipedia
• Small static websites
– Debate.org
• Large modern websites
– Application programming interface (API)
Application programming interface
• It is NOT for data scraping
• Respect the rate limit (of course, this is my view)
– Check the rate limit
– Add sleep between API calls
• Save all the raw data: disk is cheap, API calls are expensive
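The two habits above (sleep between calls, keep the raw responses) can be sketched as follows. This is a minimal illustration, not any particular site's API: `fetch` is a hypothetical function supplied by the caller, and the one-file-per-item layout and the `delay_seconds` default are assumptions.

```python
import json
import time
from pathlib import Path


def crawl(ids, fetch, out_dir, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(item_id) for each id, pausing between calls and saving raw responses."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, item_id in enumerate(ids):
        if i > 0:
            sleep(delay_seconds)  # pause between calls to stay under the rate limit
        raw = fetch(item_id)  # hypothetical API call supplied by the caller
        path = out_dir / f"{item_id}.json"
        # Store the response as returned; parsing can always be redone later from disk.
        path.write_text(json.dumps(raw), encoding="utf-8")
        saved.append(path)
    return saved
```

Passing `sleep` as a parameter is only there so the pause can be swapped out when trying the function without waiting.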
Case study: Twitter
• Started with search API
• Search change.org and other petition sites
Case study: Twitter
• Set up the scraping in a way that is easy to restart (keep logs, set up some ordering)
– Switched to the user view
– Get the most popular users from another dataset
– Get all the tweets from those users following an order
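One way to make a per-user crawl easy to restart is sketched below, assuming user ids are strings and that a plain text log with one finished id per line is enough. `fetch_tweets` is a hypothetical stand-in for the real API call; this is not the code used in the case study.

```python
from pathlib import Path


def load_done(log_path):
    """Return the set of user ids already recorded in the log."""
    path = Path(log_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def crawl_users(user_ids, fetch_tweets, log_path):
    """Process users in sorted order, skipping those already logged as done."""
    done = load_done(log_path)
    results = {}
    with open(log_path, "a", encoding="utf-8") as log:
        for user_id in sorted(user_ids):  # a fixed order makes progress easy to follow
            if user_id in done:
                continue  # finished in an earlier run
            results[user_id] = fetch_tweets(user_id)  # hypothetical API call
            log.write(user_id + "\n")  # record success only after the fetch returns
            log.flush()
    return results
```

If the script dies halfway, running it again with the same log file picks up at the first user that was not logged.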
Case study: reddit
• The internet is your friend
– http://www.redditanalytics.com/
– http://www.reddit.com/r/redditdev/comments/1hpicu/whats_this_syntaxcloudsearch_do/
Case study: reddit
• Sanity check and babysitting
[Figure: plot with a yearly axis, 2008 to 2012]
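A simple sanity check in this spirit is to count scraped items per year and look for years with nothing at all. The sketch below assumes each item carries a Unix timestamp in UTC; the function names are hypothetical and the year range is whatever the crawl is supposed to cover.

```python
from collections import Counter
from datetime import datetime, timezone


def counts_per_year(timestamps):
    """Count items per calendar year from Unix timestamps (UTC)."""
    return Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in timestamps
    )


def missing_years(counts, first_year, last_year):
    """Return the years in the inclusive range that have no items at all."""
    return [y for y in range(first_year, last_year + 1) if counts.get(y, 0) == 0]
```

An empty year in the middle of the range usually means the crawl silently skipped something, which is worth checking before any analysis.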
Case study: Last.fm
1. Research Question: How do preferences evolve in social networks?
Effects of social influences, homophily and other processes.
2. Is there a dataset already? Search, search…
3. What data attributes do I need?
Timestamped activity data, exposure data and friendship data. Last.fm provides all but one: it has timestamped listening data and love data, but only snapshot friendship data.
Case study: Last.fm
Biases, biases, biases…
Your sampling strategy will create biases.
Your research question will guide which biases to nurture (e.g. inactive users are not useful for studying temporal preferences, but critical for studying why users leave).
I needed information on friends for each user, and also a reasonably connected component, so I chose weighted BFS.
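The slides do not define "weighted BFS", so the sketch below is only one possible reading: a breadth-first crawl whose frontier is a priority queue, where a user's weight is the number of already-crawled users who list them as a friend. That weighting is an assumption chosen to favour a well-connected sample; `get_friends` is a hypothetical function returning a user's friend list.

```python
import heapq


def weighted_bfs(seed, get_friends, max_users):
    """Crawl outward from seed, visiting first the users seen most often as friends."""
    seen_count = {seed: 0}  # times each user appeared in a crawled friend list
    visited = []
    visited_set = set()
    heap = [(0, 0, seed)]  # (negative weight, insertion order, user)
    counter = 1
    while heap and len(visited) < max_users:
        neg_weight, _, user = heapq.heappop(heap)
        if user in visited_set or -neg_weight != seen_count[user]:
            continue  # stale entry: already crawled, or re-pushed with a higher weight
        visited.append(user)
        visited_set.add(user)
        for friend in get_friends(user):
            if friend in visited_set:
                continue
            seen_count[friend] = seen_count.get(friend, 0) + 1
            heapq.heappush(heap, (-seen_count[friend], counter, friend))
            counter += 1
    return visited
```

With equal weights the insertion counter breaks ties, so the order falls back to ordinary BFS.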
Case study: Last.fm
How much data do you need?
-- parallel programming
-- I first wanted to implement parallel BFS (!).
Data will never be perfect
-- robust error checking (RTFM!), email scripts
Think hard about data format
-- flat files, databases, JSON?
Contributions
-- data, code (why not a general library for data crawl?)
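For the error-checking point, a minimal retry wrapper might look like the sketch below. `fetch` is a hypothetical API call, the exponential backoff schedule is an assumption, and the email notification mentioned on the slide is not shown; failures are only written to the log here.

```python
import logging
import time


def fetch_with_retry(fetch, item_id, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try fetch(item_id) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(item_id)
        except Exception as exc:  # broad on purpose in this sketch; narrow it for a real API
            logging.warning("attempt %d for %s failed: %s", attempt, item_id, exc)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ... base_delay
    return None
```

Returning `None` instead of raising lets a long crawl keep going and leaves the failed ids in the log for a later pass.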
Our version of summary
• Think about what data you need
• Search for tips/existing solutions
• Start with a small, manageable size; at least estimate how long it may take
• Keep logs and the raw data
• Sanity check and babysitting
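Estimating how long a crawl may take is mostly arithmetic on the rate limit. The sketch below assumes a limit expressed as a number of calls per time window and a fixed number of items per call; all the figures in the usage note are made up for illustration.

```python
import math


def estimate_hours(total_items, items_per_call, calls_per_window, window_seconds):
    """Rough crawl duration in hours under a calls-per-window rate limit."""
    calls_needed = math.ceil(total_items / items_per_call)
    seconds = calls_needed / calls_per_window * window_seconds
    return seconds / 3600
```

For example, 1,000,000 items at 200 items per call with a hypothetical limit of 180 calls per 900 seconds needs 5,000 calls, which is 25,000 seconds, or a little under 7 hours.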