MongoDB and Web Scraping with the Gyes platform. MongoDB Atlanta 2013

Jesus Diaz (@infinithread, @elyisu)

description

Gyes is an aggregation platform for the Web. Gyes lets you develop, schedule, and troubleshoot data extraction programs (crawlers) that translate HTML content into structured data you can use later. Selecting the data model for the platform raised several challenges, due to the lack of structure in the scraped data and the need to provide meaningful and efficient access to it. Our third rewrite of the Gyes back end was built on MongoDB, and it has far exceeded our expectations. In this talk, I discuss some of the challenges we faced and how MongoDB addressed them, along with details of the implementation.

MongoDB and Web Scraping with the Gyes Platform

Jesus Diaz

@infinithread, @elyisu

MongoDB Atlanta 2013

What is Gyes?

• Think of the web as a huge source of unstructured data

• Most of that data has no web service or API layer for consuming it

• Significant value in thematic aggregation

  • Finance (Mint.com, Manilla.com)

  • Travel (Kayak.com)

  • Shopping (Nextag.com)

What is Gyes? (cont)

• Aggregation platform for the web

• SaaS or hosted

• Domain-specific Scrapers

• JavaScript + jQuery = JSON

• Oriented toward programmatic access to the data
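
The "JavaScript + jQuery = JSON" idea can be sketched as a function that takes a page's HTML and returns plain objects. The talk does not show a Gyes scraper, so everything here is an assumption: the `scrapeRates` name, the sample markup, and the output fields are hypothetical, and a regular expression stands in for the jQuery selectors a real scraper would use, so that the sketch runs without a DOM.

```javascript
// Hypothetical sample page: a table of bank deposit rates.
const sampleHtml = `
<table id="rates">
  <tr><td class="bank">Bank A</td><td class="rate">0.25</td></tr>
  <tr><td class="bank">Bank B</td><td class="rate">0.30</td></tr>
</table>`;

// Translate HTML rows into structured records.
// A real scraper would use jQuery selectors; the regex is only a stand-in.
function scrapeRates(html) {
  const rowPattern =
    /<td class="bank">([^<]*)<\/td>\s*<td class="rate">([^<]*)<\/td>/g;
  const records = [];
  for (const match of html.matchAll(rowPattern)) {
    records.push({ bank: match[1].trim(), rate: Number(match[2]) });
  }
  return records;
}

// The result is plain data that serializes directly to JSON.
console.log(JSON.stringify(scrapeRates(sampleHtml)));
```

The point of the sketch is the shape of the contract: HTML in, JSON-ready objects out, with no storage or consumption logic inside the scraper.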

Goals

• Decouple Data Extraction from Data Consumption

• Provide a Flexible Data Model

• Provide a Semi-structured Model to Access Scraped Data

Challenges: Data Storage

• Relational databases?

  • Lack of support for JSON

  • Tabular structure vs. data schema flexibility

• Key/value stores

  • Very flexible, but unable to query the data in more than one dimension
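
The key/value limitation can be illustrated with a small sketch. It uses an in-memory `Map` as a stand-in for a key/value store; the record shape, keys, and the `scanWhere` helper are hypothetical. Lookup by key is direct, but a question about any other field means scanning every value.

```javascript
// In-memory stand-in for a key/value store (hypothetical records).
const store = new Map([
  ["req-1", { crawler: "bank-a", status: "success", rate: 0.25 }],
  ["req-2", { crawler: "bank-b", status: "failed", rate: null }],
  ["req-3", { crawler: "bank-a", status: "success", rate: 0.3 }],
]);

// One dimension: direct lookup by key.
const byKey = store.get("req-2");

// Any other dimension: the store offers no help, so scan every value.
function scanWhere(kv, predicate) {
  return [...kv.values()].filter(predicate);
}

const successfulForBankA = scanWhere(
  store,
  (r) => r.crawler === "bank-a" && r.status === "success"
);
console.log(byKey.status, successfulForBankA.length);
```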

The MongoDB solution

• No tricks, store data as-is

• Flexible (structure of scraped data can change, MongoDB doesn't care)

• Powerful query mechanisms

• Scalable

• Again, store data as-is, consume as-is

Using MongoDB in Gyes

• One database per user

  • Data segregation

  • Avoid name conflicts

• Two collections per crawler

  • Permanent results (available to the API)

  • Temporary results (used while developing and tuning a crawler)
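
A minimal sketch of this layout follows. The slides do not give the actual naming scheme, so the `gyes_` prefix, the `_tmp` suffix, the lower-casing, and the `storageFor` helper are all hypothetical; only the one-database-per-user and two-collections-per-crawler structure comes from the talk.

```javascript
// Map a user and crawler to storage names (hypothetical naming scheme).
function storageFor(userId, crawlerName) {
  const crawler = crawlerName.toLowerCase();
  return {
    database: `gyes_${userId}`, // one database per user
    permanent: crawler, // results exposed through the API
    temporary: `${crawler}_tmp`, // results from development and tuning runs
  };
}

// Two users can reuse a crawler name without colliding,
// because each user's collections live in a separate database.
const aliceRates = storageFor("alice", "Rates");
const bobRates = storageFor("bob", "Rates");
console.log(aliceRates, bobRates);
```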

Gyes API

• Make it easy to consume data programmatically

• RESTful

• API data functions (latest, find) leverage MongoDB query capabilities

Case Study: Ubirates

• www.ubirates.com. Financial aggregation website (Japan)

• 10 banks (and counting)

• Gyes serves as both the aggregation platform and a BaaS (back end as a service): data is served via the API on page load

Case Study: Ubirates (cont)

find API call (POST)

URL: http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1

Body:

{
  q: { Status: 'success' },
  p: { CrawlerName: 1, Data: 1, _id: 0 }
}
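
As a sketch, the same call could be assembled as follows. The URL, query parameters, and `q`/`p` fields are taken from the slide; the `buildFindRequest` helper and the `Content-Type` header are assumptions, and the request is only constructed, not sent. The body on the slide uses JavaScript object-literal shorthand, so `JSON.stringify` is used here to produce strict JSON (whether the API requires strict JSON is not stated in the talk).

```javascript
// Endpoint exactly as shown on the slide.
const FIND_URL = "http://api.gyeslab.com/v1/find/ubirates/all";

// Build (but do not send) the POST request. Helper name is hypothetical.
function buildFindRequest({ apiKey, take, q, p }) {
  const url = new URL(FIND_URL);
  url.searchParams.set("apiKey", apiKey);
  url.searchParams.set("take", String(take));
  return {
    url: url.toString(),
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed header
    body: JSON.stringify({ q, p }), // q = query, p = projection
  };
}

const request = buildFindRequest({
  apiKey: "xxyy", // placeholder key from the slide
  take: 1,
  q: { Status: "success" },
  p: { CrawlerName: 1, Data: 1, _id: 0 },
});
console.log(request.url);
console.log(request.body);
```

The returned object has the shape that `fetch(request.url, request)` accepts, should the call actually need to be made.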

Case Study: Ubirates (cont)

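// Appears to target the legacy 1.x C# driver; assumes using System.Linq,
// MongoDB.Bson, MongoDB.Driver, and MongoDB.Driver.Builders.
// q and p are the query and projection from the request body.
var data = crawlers
    .Select(crawler => database.GetCollection(crawler.ToLower()))  // one collection per crawler
    .Select(collection =>
        collection.Find(q)                                  // filter with the caller's query
            .SetSortOrder(SortBy.Descending("RequestId"))   // highest RequestId first
            .SetFields(p)                                   // apply the caller's projection
            .Skip(skip)                                     // paging: skip, then take
            .Take(take)
            .ToJson(jsonWritterSettings));                  // serialize each result set to JSON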
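
To make the semantics of that pipeline concrete, here is a sketch that applies the same steps (filter, sort by `RequestId` descending, project, skip, take) to an in-memory array. It is an illustration only: the `find` function, the equality-only matching, the inclusion-only projection, and the sample documents are assumptions, and it is not how MongoDB evaluates queries.

```javascript
// Sample documents (hypothetical) shaped like stored crawler results.
const results = [
  { _id: 1, RequestId: 10, CrawlerName: "bank-a", Status: "success", Data: { rate: 0.25 } },
  { _id: 2, RequestId: 11, CrawlerName: "bank-a", Status: "failed", Data: null },
  { _id: 3, RequestId: 12, CrawlerName: "bank-a", Status: "success", Data: { rate: 0.3 } },
];

// Keep only documents whose fields equal every field in q.
const matches = (doc, q) => Object.keys(q).every((k) => doc[k] === q[k]);

// Keep only the fields marked 1 in p (inclusion-style projection).
const project = (doc, p) =>
  Object.fromEntries(
    Object.keys(p)
      .filter((k) => p[k] === 1 && k in doc)
      .map((k) => [k, doc[k]])
  );

function find(docs, q, p, skip, take) {
  return docs
    .filter((doc) => matches(doc, q)) // filter returns a copy, so sort is safe
    .sort((x, y) => y.RequestId - x.RequestId) // highest RequestId first
    .map((doc) => project(doc, p))
    .slice(skip, skip + take); // skip, then take
}

const latest = find(
  results,
  { Status: "success" },
  { CrawlerName: 1, Data: 1, _id: 0 },
  0,
  1
);
console.log(JSON.stringify(latest));
```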

What's Next

• Scale data repository + API

  • Sharding

  • Get data closer to users

• Query optimizations

  • Indexing

  • Caching

The End

Thanks!

@infinithread

www.infinithread.com

www.gyeslab.com