MongoDB and Web Scraping with the Gyes platform. MongoDB Atlanta 2013
MongoDB and Web Scraping with the Gyes Platform
Jesus Diaz, [email protected]
@infinithread, @elyisu
MongoDB Atlanta 2013
What is Gyes?
• Think of the web as a huge source of unstructured data
• No web service or API layer exists to consume most of that data
• Significant value on thematic aggregation
• Finance (Mint.com, Manilla.com)
• Travel (Kayak.com)
• Shopping (Nextag.com)
What is Gyes? (cont)
• Aggregation platform for the web
• SaaS or hosted
• Domain-specific Scrapers
• JavaScript + jQuery = JSON
• Oriented to providing programmatic access to the data
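A hypothetical sketch of what a Gyes-style domain-specific scraper produces. In Gyes the scraper runs JavaScript + jQuery against the page and emits JSON; here the jQuery selection is stubbed with plain objects so the output shape stands alone, and the field names (AccountName, Balance) are illustrative assumptions:

```javascript
function scrapeAccounts(rows) {
  // rows stands in for a jQuery selection, e.g. $('#accounts tr').toArray()
  return rows.map(function (row) {
    return {
      AccountName: row.name,            // e.g. $(row).find('.name').text()
      Balance: parseFloat(row.balance)  // e.g. $(row).find('.balance').text()
    };
  });
}

var scraped = scrapeAccounts([
  { name: 'Checking', balance: '1520.75' },
  { name: 'Savings',  balance: '98000.00' }
]);
console.log(JSON.stringify(scraped));
```

The point is the shape of the result: plain JSON documents, ready to be stored without any further mapping.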
Goals
• Decouple Data Extraction from Data Consumption
• Provide a Flexible Data Model
• Provide a Semi-structured Model to Access Scraped Data
Challenges: Data Storage
• Relational databases?
• Lack of support for JSON
• Tabular structure vs. data schema flexibility
• Key/value stores?
• Very flexible, but
• Unable to query the data in more than one dimension
The MongoDB solution
• No tricks, store data as-is
• Flexible (structure of scraped data can change, MongoDB doesn't care)
• Powerful query mechanisms
• Scalable
• Again, store data as-is, consume as-is
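A minimal sketch of "store as-is, consume as-is": two scrape results with different shapes sit in the same collection, and a query on a shared field still matches both. The `find` below is an in-memory stand-in for MongoDB's top-level equality matching, and the document shapes are assumed examples:

```javascript
var stored = [
  { CrawlerName: 'bankA', Status: 'success', Data: { Balance: 1520.75 } },
  { CrawlerName: 'bankB', Status: 'success', Data: { Rates: [0.1, 0.25] } }
];

// Roughly what db.results.find({ Status: 'success' }) would match
function find(docs, query) {
  return docs.filter(function (doc) {
    return Object.keys(query).every(function (key) {
      return doc[key] === query[key];
    });
  });
}

console.log(find(stored, { Status: 'success' }).length); // 2
```

Neither document had to fit a predeclared schema, yet both remain queryable.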
Using MongoDB in Gyes
• One database per user
• Data segregation
• Avoid name conflicts
• Two collections per crawler
• Permanent results (available to the API)
• Temporary results (developing and tuning crawler)
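The layout above can be sketched as a simple naming scheme: one database per user (segregation, no name conflicts) and two collections per crawler, one permanent and one temporary for development. The exact names here are assumptions for illustration, not Gyes's actual convention:

```javascript
function storageLayout(user, crawlerNames) {
  return {
    database: user,  // one database per user
    collections: crawlerNames.reduce(function (acc, crawler) {
      // two collections per crawler: permanent + temporary (dev/tuning)
      acc[crawler] = { permanent: crawler, temporary: crawler + '_tmp' };
      return acc;
    }, {})
  };
}

var layout = storageLayout('ubirates', ['bankA', 'bankB']);
console.log(JSON.stringify(layout));
```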
Gyes API
• Ease data consumption programmatically
• RESTful
• API data functions leverage MongoDB query capabilities (latest, find)
Case Study: Ubirates
• www.ubirates.com: a financial aggregation website (Japan)
• 10 banks (and counting)
• Gyes as aggregator platform and BaaS (data served via API upon page load)
Case Study: Ubirates (cont)
find API call (POST)
URL: http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1
Body:
{
  q: { Status: 'success' },
  p: { CrawlerName: 1, Data: 1, _id: 0 }
}
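The POST body maps almost directly onto a MongoDB query: q is the filter, p is the projection, and take from the query string becomes a limit; conceptually, db.&lt;crawler&gt;.find(q, p).limit(take). The translation function below is an illustrative assumption, not the actual Gyes server code:

```javascript
var body = {
  q: { Status: 'success' },
  p: { CrawlerName: 1, Data: 1, _id: 0 }
};
var take = 1; // from ?take=1 in the URL

// Hypothetical server-side translation of the API request into driver arguments
function toMongoQuery(body, take) {
  return { filter: body.q, projection: body.p, limit: take };
}

console.log(JSON.stringify(toMongoQuery(body, take)));
```

This pass-through is what lets the API "leverage MongoDB query capabilities" with almost no translation layer.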
Case Study: Ubirates (cont)
var data = crawlers
    .Select(crawler => database.GetCollection(crawler.ToLower()))
    .Select(collection =>
        collection.Find(q)
                  .SetSortOrder(SortBy.Descending("RequestId"))
                  .SetFields(p)
                  .Skip(skip)
                  .Take(take)
                  .ToJson(jsonWriterSettings));
What's Next
• Scale data repository + API
• Sharding
• Get data closer to users
• Query optimizations
• Indexing
• Caching
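Given the query shapes shown earlier, the indexing work would presumably start with something like the following (mongo-shell syntax of that era; the collection and field choices are assumptions based on the find call and sort above):

```
// Backs the { Status: 'success' } filter used by the find API call
db.bankA.ensureIndex({ Status: 1 })
// Backs SortBy.Descending("RequestId") in the consumption code
db.bankA.ensureIndex({ RequestId: -1 })
```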
The End
Thanks!
@infinithread
www.infinithread.com
www.gyeslab.com