Inferno + Disco
Dancing with Big Data: Inferno + Disco
Disco
• Open Source Map Reduce Platform
• 50% Erlang, 50% Python (roughly)
• Jobs are written in Python
• No Java!
• http://discoproject.com/
XML
Why Disco?
Why Disco?
• Simplicity of Erlang Clusters
• Tag based distributed file system
• Minimal Dev-Ops Effort
• Small, readable source
• Small runtime footprint
Inferno
• Map / Reduce Framework
• Powered by Disco
• 100% Python (sorry)
• Developed at Chango
• Open Sourced in March 2012
Chango
• Advertising Technology Company
• Search Retargeting
• Real-time bidding
• Process 10,000,000,000 records / day
Erlang at Chango
• Couchbase
• Real-time bidding (200,000 / second)
• Disco
• 24 Nodes (2 TB per node)
Inferno
• Query DSL for your logs
• Automation
• E.g. Summarize to database: billions of records become 1000s of rows
• Distributed computing tasks
Logs
• Structured Logs
• Each line is valid JSON
• Replay / Reprocess Records
• Each line has a timestamp
• Each tag has a date
• Disco “chunks” plain text files
Example
{"time":"1330969562706","domain":"bighealthtree.com","campaign_id":11056,"search_term":"5 Signs of a Stroke You Don't Want to Ignore","size":"728x90","ip_address":"127.0.0.1"}
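A minimal sketch of reading one such log line in Python. It assumes the `time` field holds epoch milliseconds encoded as a string (the slides do not say so explicitly); `parse_record` is a hypothetical helper, not part of Inferno or Disco.

```python
import json
from datetime import datetime, timezone

# One record shaped like the example log line, written as strict JSON.
line = ('{"time":"1330969562706","domain":"bighealthtree.com",'
        '"campaign_id":11056,'
        '"search_term":"5 Signs of a Stroke You Don\'t Want to Ignore",'
        '"size":"728x90","ip_address":"127.0.0.1"}')


def parse_record(raw):
    """Decode one log line and attach a timezone-aware timestamp.

    Assumption: `time` is epoch milliseconds stored as a string.
    """
    record = json.loads(raw)
    millis = int(record["time"])
    record["timestamp"] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return record


record = parse_record(line)
print(record["domain"], record["timestamp"].date())
```

Because every line is an independent JSON document with its own timestamp, a file can be split at any line boundary and each piece replayed on its own.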
DEMO
Query DSL
• Rules
• Keysets
• Parts
Rules
• Automatic (Daemon Mode), Manual
• Data Source (DDFS tags)
• Date range selectors
• Processors
• Transformations
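To illustrate how a date range selector could map onto dated DDFS tags, here is a rough sketch. The tag naming scheme (`prefix:YYYY-MM-DD`) and the `tags_for_range` helper are assumptions made for illustration; the slides only state that each tag has a date.

```python
from datetime import date, timedelta


def tags_for_range(prefix, start, end):
    """Return one hypothetical tag name per day from start to end, inclusive."""
    days = (end - start).days
    return [
        f"{prefix}:{(start + timedelta(days=offset)).isoformat()}"
        for offset in range(days + 1)
    ]


# Hypothetical prefix; real tag names depend on how the logs were pushed to DDFS.
tags = tags_for_range("incoming:impression", date(2012, 3, 4), date(2012, 3, 6))
print(tags)
```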
Keysets
• At least one per Rule
• Have Key and Value “Parts”
• Multiple M / R ops on the same data
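The idea of running several map/reduce operations over the same data can be sketched in plain Python. This is not Inferno's API: the `keysets` dictionary layout, the field names, and `run_keysets` are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical records and keyset definitions.
records = [
    {"campaign_id": 11056, "domain": "a.com", "cost": 3},
    {"campaign_id": 11056, "domain": "b.com", "cost": 2},
    {"campaign_id": 20000, "domain": "a.com", "cost": 5},
]

keysets = {
    "by_campaign": {"key_parts": ["campaign_id"], "value_parts": ["cost"]},
    "by_domain": {"key_parts": ["domain"], "value_parts": ["cost"]},
}


def run_keysets(records, keysets):
    """Read the records once and build one summed result table per keyset."""
    results = {name: defaultdict(int) for name in keysets}
    for record in records:  # a single pass over the data
        for name, spec in keysets.items():
            key = tuple(record[part] for part in spec["key_parts"])
            for value_part in spec["value_parts"]:
                results[name][key] += record[value_part]
    return {name: dict(table) for name, table in results.items()}


summary = run_keysets(records, keysets)
print(summary)
```

Each keyset produces its own result table, so one scan of the logs can feed several summaries.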
Parts
• Key Parts are what you want to “map”, Value Parts are the “reduce” values
• Example: Count all the clicks for an ad on a particular site:
• Keys: ad_id, site_id
• Values: count (magic function)
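The click-count example can be approximated in a few lines of plain Python. This is a single-process sketch of the semantics only, not Inferno code; the sample records and the `summarize` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical click records using the key fields named above.
clicks = [
    {"ad_id": 1, "site_id": "a.com"},
    {"ad_id": 1, "site_id": "a.com"},
    {"ad_id": 1, "site_id": "b.com"},
    {"ad_id": 2, "site_id": "a.com"},
]


def summarize(records, key_parts):
    """Count how many records share each combination of key parts."""
    counts = Counter()
    for record in records:
        key = tuple(record[part] for part in key_parts)  # "map": build the key
        counts[key] += 1  # "reduce": add up the implicit count of 1
    return dict(counts)


result = summarize(clicks, ["ad_id", "site_id"])
print(result)
```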
Example
Process & Transform
• Field Transforms
• Select & Generate (Chain-able)
• Post Processors
• Input Streams (Extends Disco)
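One way to picture chainable select, transform, and generate steps is as Python generators feeding each other. The function names, signatures, and sample records below are hypothetical and only mimic the idea; they are not taken from Inferno.

```python
def select(records, predicate):
    """Pass through only the records that satisfy the predicate."""
    for record in records:
        if predicate(record):
            yield record


def transform_field(records, field, func):
    """Apply func to one field, leaving the input record untouched."""
    for record in records:
        changed = dict(record)
        changed[field] = func(changed[field])
        yield changed


def generate(records, expand):
    """Turn each record into zero or more derived records."""
    for record in records:
        yield from expand(record)


records = [
    {"domain": "Example.COM", "size": "728x90", "search_term": "red shoes"},
    {"domain": "other.com", "size": "300x250", "search_term": "blue hat"},
]

# Chain: keep 728x90 records, lowercase the domain, emit one row per search word.
selected = select(records, lambda r: r["size"] == "728x90")
normalized = transform_field(selected, "domain", str.lower)
rows = list(generate(
    normalized,
    lambda r: ({**r, "word": word} for word in r["search_term"].split()),
))
print(rows)
```

Because each step is lazy, records stream through the whole chain one at a time instead of being materialized between steps.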
Archiving
• Update the same tag with new data
• Blobs are tagged and never reprocessed
• Tag dates are used intelligently
• Schedule data processing
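The "tagged and never reprocessed" behaviour can be modelled with an in-memory dictionary standing in for DDFS. The tag names, the `run_and_archive` helper, and the dictionary itself are assumptions for illustration; real blobs would be moved between tags through Disco's own tooling.

```python
def run_and_archive(tags, source_tag, archive_tag, process):
    """Process every blob under source_tag, then move those blobs to archive_tag."""
    blobs = list(tags.get(source_tag, []))
    results = [process(blob) for blob in blobs]
    tags.setdefault(archive_tag, []).extend(blobs)  # remember what has been handled
    tags[source_tag] = []  # nothing left to pick up on the next run
    return results


# Hypothetical tag store: tag name -> list of blob names.
tags = {"incoming:clicks": ["blob-1", "blob-2"]}

first = run_and_archive(tags, "incoming:clicks", "processed:clicks", str.upper)
tags["incoming:clicks"].append("blob-3")  # new data arrives under the same tag
second = run_and_archive(tags, "incoming:clicks", "processed:clicks", str.upper)
print(first, second)
```

The second run only sees the blob that arrived after the first run, which is what lets a scheduled job update the same tag repeatedly without double counting.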
DEMO
Dedication
• Jimmy Ellis, the lead singer of the hit “Disco Inferno” from ‘70s R&B/funk group The Trammps.
• Died March 2012 in Rock Hill, South Carolina. He was 74.
• Find us and ask questions
• http://bitbucket.org/chango/inferno
• http://inferno.rtfd.org/
• https://groups.google.com/group/python-inferno