Our experience with NoSQL and MapReduce technologies
Grid Technology
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
Our experience with NoSQL and MapReduce technologies
Fabio Souto
IT Monitoring Working Group, 19th September 2011
Outline
• Objective
• Big data technologies
• Technologies reviewed
• Deployed infrastructure
• Current status
• Lessons learned
Problem and goal
• The SAM infrastructure for WLCG
  – monitors 400 sites and ~2,000 services daily
  – receives and stores ~600,000 metric results daily
  – computes statuses and hourly availabilities for services and sites
• SWAT is a system that gathers information about the configuration of WNs (worker nodes)
• Massive data generation makes storage, search, sharing, analytics and visualization difficult
• Objective: a proof of concept using big data technologies
Big Data Technologies
• NoSQL databases
  – Not relational, schema-free
  – Distributed
  – High availability
• MapReduce
  – Framework for processing huge datasets on clusters of computers
  – Takes advantage of data locality:
    • Moving computation is more efficient than moving data
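To make the MapReduce model concrete, here is a minimal in-process sketch in plain Python. It is not Hadoop code. The sample records, the function names and the definition of availability (fraction of OK results per service and hour) are assumptions chosen for illustration, not the SAM algorithm.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical metric results: (service, hour, status). Not real SAM data.
RESULTS = [
    ("svc-a", 0, "OK"),
    ("svc-a", 0, "CRITICAL"),
    ("svc-a", 1, "OK"),
    ("svc-b", 0, "OK"),
]


def map_result(record):
    """Map phase: emit ((service, hour), 1 or 0) for one metric result."""
    service, hour, status = record
    yield (service, hour), 1 if status == "OK" else 0


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_availability(key, values):
    """Reduce phase: fraction of OK results for one (service, hour) key."""
    return key, sum(values) / float(len(values))


def run_job(records):
    """Run map, shuffle and reduce sequentially on a single machine."""
    mapped = chain.from_iterable(map_result(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_availability(k, v) for k, v in grouped.items())


AVAILABILITY = run_job(RESULTS)
```

On a real cluster the map calls would run on the nodes that already hold the input blocks, which is where the data locality benefit comes from; only the much smaller grouped pairs travel over the network.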
Technologies reviewed
• NoSQL databases: ~140 different solutions; we focused on:
  – MongoDB
    • No durability (at the time of the study)
  – Cassandra
    • No single point of failure
    • Big and responsive community
• Apache Hadoop
  – De facto standard for big data
  – Framework for data-intensive applications
  – Used to write MapReduce jobs for Cassandra
Technologies reviewed II
• Hive and Pig
  – Ease the complexity of writing MapReduce
  – Initially not considered
    • Less efficient than pure Hadoop
  – Independent of the data source
    • We can change to HBase easily
  – Hive: SQL-like syntax
  – Pig: data flow language
    • Is not Turing complete (no loops, if-else…)
      – But can be embedded in Python code
      – It's possible to write custom functions in Python/Java
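As an illustration of a custom Pig function written in Python, the sketch below defines a small UDF. The function name `classify`, its thresholds and the status labels are hypothetical and do not come from the SAM code. Pig's Jython support normally supplies the `outputSchema` decorator; the fallback definition is an assumption added only so the file also runs under a plain Python interpreter.

```python
# Under Pig's Jython runtime the outputSchema decorator is provided for us.
# The fallback below only exists so this file can be run and tested alone.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            # Record the declared schema on the function for inspection.
            func.output_schema = schema
            return func
        return decorator


@outputSchema("status:chararray")
def classify(ok_count, total_count):
    """Turn result counts into a status label (hypothetical thresholds)."""
    if not total_count:
        return "UNKNOWN"
    ratio = float(ok_count) / total_count
    if ratio >= 0.9:
        return "OK"
    if ratio > 0:
        return "DEGRADED"
    return "CRITICAL"
```

Assuming the file is saved as `udfs.py` (a hypothetical name), a Pig script would typically load it with a statement such as `REGISTER 'udfs.py' USING jython AS sam;` and then call `sam.classify(...)` inside a `FOREACH ... GENERATE`. This is how branching logic that Pig Latin lacks can be moved into Python.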
Technologies reviewed III
• Hue
  – Set of Django apps to interact with Hadoop
• OpenTSDB
  – Open source time series database
  – Lack of flexibility
• Oozie
  – Job scheduler and workflow engine for Hadoop
Other Tools
• Msg-consume2db inserter:
  – WLCG Messaging infrastructure -> NoSQL
• sql2nosql-sync
  – SAM Oracle DB -> NoSQL
Deployed infrastructure
Deployed infrastructure
Current status
• SAM
  – DONE: running infrastructure that reads messaging and SAM data and launches Pig jobs to calculate availability
  – TODO:
    • Results tuning
    • Web interface to visualize the results
    • JSON/XML API to extract results
    • Unit testing
• SWAT
  – Early stage of development (~6 days)
  – Data collection
Lessons learned
• Use an abstraction layer on top of Hadoop
  – Writing pure MapReduce Hadoop apps is difficult and time-consuming
• Choose a solution with a responsive community:
  – The technology is at an early stage (unresolved bugs, undocumented functions), so you will need to get in touch with developers/users
• Big data needs a big platform
Lessons learned
• Keep up to date: new companies, technologies and tools are emerging
  – Twitter's real-time Hadoop, about to be released
  – Cascalog, a Hadoop data mining language
  – Big data distributions: Cloudera, DataStax, MapR…
Questions?