MongoDB and Web Scraping with the Gyes platform. MongoDB Atlanta 2013

Post on 12-May-2015

Gyes is an aggregation platform for the Web. It lets you develop, schedule, and troubleshoot data extraction programs (crawlers) that translate HTML content into structured data you can use later. Selecting the data model for the platform raised several challenges: the scraped data lacks structure, yet the platform must provide meaningful and efficient access to it. Moving to MongoDB was our third rewrite of the Gyes back-end, and it has far exceeded expectations. In this talk, I would like to discuss some of the challenges we faced and how MongoDB addressed them. Details about implementation challenges are also shared.

MongoDB and Web Scraping with the Gyes Platform

Jesus Diaz

jesus.diaz@infinithread.com

@infinithread, @elyisu

MongoDB Atlanta 2013

What is Gyes?

• Think of the web as a huge source of unstructured data

• Most of that data has no web service or API layer for consuming it

• Significant value in thematic aggregation

• Finance (Mint.com, Manilla.com)

• Travel (Kayak.com)

• Shopping (Nextag.com)

What is Gyes? (cont)

• Aggregation platform for the web

• SaaS or hosted

• Domain-specific Scrapers

• JavaScript + jQuery = JSON

• Oriented toward programmatic access to the data
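As a rough illustration of the HTML-to-JSON step, here is a minimal sketch. Gyes crawlers are written in JavaScript with jQuery; this sketch swaps in Python's standard-library `html.parser` instead, and the sample markup, the field names (`currency`, `rate`), and the values are all invented for the example.

```python
import json
from html.parser import HTMLParser


class RateTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being parsed
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False


def scrape_rates(html):
    """Turn a two-column HTML table into a list of JSON-ready dicts."""
    parser = RateTableParser()
    parser.feed(html)
    return [{"currency": cells[0], "rate": float(cells[1])}
            for cells in parser.rows if len(cells) == 2]


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><td>USD</td><td>98.5</td></tr>
  <tr><td>EUR</td><td>129.1</td></tr>
</table>
"""

records = scrape_rates(SAMPLE_HTML)
print(json.dumps(records))
```

The point is only that the scraper's output is plain JSON, which is what lets the rest of the platform store and serve it without a fixed schema.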

Goals

• Decouple Data Extraction from Data Consumption

• Provide a Flexible Data Model

• Provide a Semi-structured Model to Access Scraped Data

Challenges: Data Storage

• Relational databases?

• Lack of support for JSON

• Tabular structure vs. data schema flexibility

• Key/value stores

• Very flexible, but

• Unable to query the data along more than one dimension
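The key/value limitation can be sketched with a plain dictionary standing in for the store. The field names `RequestId`, `CrawlerName`, and `Status` appear elsewhere in this talk; the records themselves, including the `failed` status value, are invented for the example.

```python
# Hypothetical scraped records, invented for this example.
records = [
    {"RequestId": 1, "CrawlerName": "bankA", "Status": "success"},
    {"RequestId": 2, "CrawlerName": "bankB", "Status": "failed"},
    {"RequestId": 3, "CrawlerName": "bankA", "Status": "failed"},
]

# Key/value view: each record is reachable through exactly one key.
by_request_id = {r["RequestId"]: r for r in records}

# Lookup along the key dimension is direct.
second = by_request_id[2]


def scan(store, **criteria):
    """Answer a question on any other dimension by visiting every value."""
    return [value for value in store.values()
            if all(value.get(field) == wanted
                   for field, wanted in criteria.items())]


# "Which bankA requests failed?" cannot use the key, so it costs a full scan.
failed_bank_a = scan(by_request_id, CrawlerName="bankA", Status="failed")
print(second["CrawlerName"], [r["RequestId"] for r in failed_bank_a])
```

A store that can filter on arbitrary fields moves this scan, and any indexing behind it, to the database side.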

The MongoDB solution

• No tricks, store data as-is

• Flexible (structure of scraped data can change, MongoDB doesn't care)

• Powerful query mechanisms

• Scalable

• Again, store data as-is, consume as-is

Using MongoDB in Gyes

• One database per user

• Data segregation

• Avoid name conflicts

• Two collections per crawler

• Permanent results (available to the API)

• Temporary results (developing and tuning crawler)
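The layout above could be expressed as a small naming helper. This is a sketch under assumptions: the talk does not say how databases or the temporary collection are named, so the `_tmp` suffix and the lowercased user name are hypothetical. Lowercasing the crawler name mirrors the `crawler.ToLower()` call in the talk's C# snippet.

```python
from typing import NamedTuple


class CrawlerStorage(NamedTuple):
    database: str    # one database per user
    permanent: str   # collection with results exposed through the API
    temporary: str   # collection with results from development and tuning runs


def storage_for(user: str, crawler: str, temp_suffix: str = "_tmp") -> CrawlerStorage:
    """Map a (user, crawler) pair to a database and its two collections.

    The suffix and the user-name lowercasing are assumptions made for
    this example, not documented Gyes behavior.
    """
    name = crawler.lower()
    return CrawlerStorage(database=user.lower(),
                          permanent=name,
                          temporary=name + temp_suffix)


# Two users may reuse a crawler name without colliding,
# because each user gets a separate database.
first = storage_for("ubirates", "BankA")
second = storage_for("otheruser", "BankA")
print(first)
print(second)
```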

Gyes API

• Make programmatic data consumption easy

• RESTful

• API data functions leverage MongoDB query capabilities (latest, find)

Case Study: Ubirates

• www.ubirates.com. Financial aggregation website (Japan)

• 10 banks (and counting)

• Gyes as aggregator platform and BaaS (data served via API upon page load)

Case Study: Ubirates (cont)

find API call (POST)

URL: http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1

Body:

{
  q: { Status: 'success' },
  p: { CrawlerName: 1, Data: 1, _id: 0 }
}
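A client could assemble this call roughly as follows. The sketch only builds the request and never sends it. The URL, the `apiKey` and `take` parameters, and the `q`/`p` body keys come from the slide. Reading the two path segments as an account and a crawler selector, sending strict JSON with quoted keys, and the `Content-Type` header are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "http://api.gyeslab.com/v1"  # taken from the slide


def build_find_request(account, crawler, api_key, query, projection, take=1):
    """Build (but do not send) a POST request for the find call.

    `account` and `crawler` are hypothetical names for the two path
    segments shown on the slide ("ubirates" and "all").
    """
    params = urlencode({"apiKey": api_key, "take": take})
    url = f"{API_ROOT}/find/{account}/{crawler}?{params}"
    body = json.dumps({"q": query, "p": projection}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


request = build_find_request(
    "ubirates", "all", "xxyy",
    query={"Status": "success"},
    projection={"CrawlerName": 1, "Data": 1, "_id": 0},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out because the service's current availability and response format are not covered in the talk.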

Case Study: Ubirates (cont)

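// For each crawler, query its collection and serialize the matching documents as JSON.
var data = crawlers
    .Select(crawler => database.GetCollection(crawler.ToLower()))
    .Select(collection =>
        collection.Find(q)                                 // q: filter (the request body's q)
            .SetSortOrder(SortBy.Descending("RequestId"))  // most recent request first
            .SetFields(p)                                  // p: projection (the request body's p)
            .Skip(skip)
            .Take(take)
            .ToJson(jsonWritterSettings)
    );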
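To make the semantics of that pipeline concrete, here is an in-memory Python analog: filter, sort by `RequestId` descending, skip, take, project, then serialize. It is not the Gyes implementation and does not talk to MongoDB. It handles only equality filters and simple inclusion projections, and the sample documents are invented.

```python
import json


def find_latest(documents, q, p, skip=0, take=1):
    """Filter, sort newest-first by RequestId, page, then project."""
    matched = [d for d in documents
               if all(d.get(field) == value for field, value in q.items())]
    matched.sort(key=lambda d: d.get("RequestId", 0), reverse=True)
    page = matched[skip:skip + take]

    # As in MongoDB, _id is returned unless the projection excludes it.
    include = [f for f, flag in p.items() if flag and f != "_id"]
    if p.get("_id", 1):
        include.insert(0, "_id")
    return [{f: d[f] for f in include if f in d} for d in page]


# Hypothetical per-crawler collections, invented for this example.
collections = {
    "banka": [
        {"_id": 1, "RequestId": 10, "CrawlerName": "BankA",
         "Status": "success", "Data": {"usd": 98.1}},
        {"_id": 2, "RequestId": 11, "CrawlerName": "BankA", "Status": "failed"},
        {"_id": 3, "RequestId": 12, "CrawlerName": "BankA",
         "Status": "success", "Data": {"usd": 98.5}},
    ],
    "bankb": [
        {"_id": 4, "RequestId": 7, "CrawlerName": "BankB",
         "Status": "success", "Data": {"usd": 98.3}},
    ],
}

crawlers = ["BankA", "BankB"]
q = {"Status": "success"}
p = {"CrawlerName": 1, "Data": 1, "_id": 0}

# One JSON string per crawler, mirroring the shape of the C# expression.
data = [json.dumps(find_latest(collections[c.lower()], q, p)) for c in crawlers]
print(data)
```

With `take=1`, each crawler contributes only its most recent successful result, which is what a page that shows current rates needs.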

What's Next

• Scale Data Repository + API

• Sharding

• Get data closer to users

• Query optimizations

• Indexing

• Caching

The End

Thanks!

@infinithread

www.infinithread.com

www.gyeslab.com