Big Data, Hadoop, NoSQL and more ...

© Orzota, Inc. 2013

description

I gave a series of seminars at the following colleges in Solapur: (1) Walchand Institute of Technology, Solapur; (2) Brahmdevdada Mane Institute of Technology, Solapur; (3) Orchid College of Engineering & Technology, Solapur; (4) SVERI's College of Engineering, Pandharpur. The seminars focused on what 'BigData' is and how the next generation of professionals should get ready for the BigData revolution.

Transcript of Big Data, Hadoop, NoSQL and more ...

Page 1: Big Data, Hadoop, NoSQL and more ...


Page 2: Big Data, Hadoop, NoSQL and more ...

Big Data, Hadoop, NoSQL and more …

Varad Meru
Software Development Engineer, Orzota, Inc.

[email protected]/in/vmeru

@vrdmr


Page 3: Big Data, Hadoop, NoSQL and more ...

About Orzota

• Mission: Make big data easy for consumption
• Offers Big Data/Hadoop Solutions and Software Services to companies
• Develops Software to help companies consume Big Data
• Founded in March 2012
• Headquartered in Silicon Valley, California
• Offshore offices in Chennai, India

Page 4: Big Data, Hadoop, NoSQL and more ...

About Orzota (contd.)

We work on
• Big Data
• Hadoop
• Cloud Technologies
• Data Science
• Products and Services
• Everything that it takes to be a valued Player.

Page 5: Big Data, Hadoop, NoSQL and more ...

About Orzota (contd.)

Community Development
• Occasional seminars by Architects, Engineers, Managers.
• We invite professionals and aspiring professionals to join Big Data / Hadoop communities in their geographies.
• Pune Hadoop User Group – Participant + Organizer.
• Chennai Hadoop User Group – Participant + Sponsor.

Page 6: Big Data, Hadoop, NoSQL and more ...

About Me

• Orzota, Inc.
  • Currently working with Hadoop, Mahout, Cloud, etc.
• Past Work Experience
  • Persistent Systems – Search, Recommendation Engines and User Behavior Analytics.
• Area of Interest
  • Data Science, Information Retrieval
  • Distributed Systems

Page 7: Big Data, Hadoop, NoSQL and more ...

Some of the Innovation Centers in the Technological World

Page 8: Big Data, Hadoop, NoSQL and more ...

Agenda

• Introduction to BigData
  • Technologies and Domain
• Hadoop EcoSystem
  • Introduction to MapReduce
  • Architecture – HDFS + MapReduce.
• NoSQL Databases
  • CAP Theorem
  • Different NoSQL Databases
• Other Trends

Page 9: Big Data, Hadoop, NoSQL and more ...

Big Data

Page 10: Big Data, Hadoop, NoSQL and more ...

Big Data

• What is Big Data?
• What does it mean to me?
• Why so much fuss in the industry?
• Who uses these technologies?
• How are they used in the Industry and Academia?
• When to start using them?
• How to learn them?

Page 11: Big Data, Hadoop, NoSQL and more ...

Big Data – 3 Vs

• Volume - Amassing terabytes, even petabytes, of information.
  • 12 terabytes of Tweets created each day.
  • 350 billion annual meter readings.
• Velocity - Sometimes 2 minutes is too late.
  • Scrutinize 5 million trade events.
  • 500 million daily call detail records.
• Variety - Big data is any type of data.
  • 80% data growth in images, video and documents.

“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
– Laney, Douglas. "The Importance of 'Big Data': A Definition"
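To get a feel for the volume and velocity figures quoted on this slide, the daily totals can be converted into average per-second rates. This is only a rough Python sketch: it assumes decimal terabytes (10^12 bytes) and a load spread evenly over the day, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(daily_total: float) -> float:
    """Average per-second rate for a daily total, assuming an even load."""
    return daily_total / SECONDS_PER_DAY


# Figures quoted on the slide; the byte conversion is an assumption.
tweet_bytes_per_day = 12 * 10**12    # 12 terabytes of Tweets per day
call_records_per_day = 500 * 10**6   # 500 million call detail records per day

tweet_mb_per_second = per_second(tweet_bytes_per_day) / 10**6
call_records_per_second = per_second(call_records_per_day)

print(f"Tweets: about {tweet_mb_per_second:.0f} MB/s")
print(f"Call records: about {call_records_per_second:.0f} per second")
```

Real traffic is bursty, so peak rates would be higher than these averages.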

Page 12: Big Data, Hadoop, NoSQL and more ...

Problem

• Store and Process Data for
  • Search Engines,
  • Recommendation Engines,
  • Fraud Detection,
  • Aadhaar (Govt. of India),
  • Spam Detection, etc.
• Also, in some cases, in real time (e.g. Facebook)

Page 13: Big Data, Hadoop, NoSQL and more ...

Solutions?

• Classical Solutions
  • Database + Programming Language (Java-Oracle, C#-SQL Server)
  • Data Warehouses – Teradata, Netezza, Microsoft PDW
• Legacy Network Systems
  • Novell
  • CORBA
  • Java RMI – RPC

Page 14: Big Data, Hadoop, NoSQL and more ...

Problems of the Solutions

• Problems with Classical Solutions
  • CAP Theorem, by Prof. Eric Brewer (Berkeley)
    • Choose any 2 of Consistency, Availability and Partition Tolerance
  • ACID Properties
    • For a small number of transactions, the cumulative overhead is still manageable.
    • For a very large number of transactions – Facebook posts?
  • Very High Licensing Fees.
  • Closed Source – Stick with the Company’s Eco-System.
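The CAP trade-off named on this slide can be illustrated with a toy model. The sketch below is hypothetical and deliberately simplistic: `TwoNodeStore` and its methods are invented for illustration and do not reflect how any real database handles partitions. It only shows that, once two replicas cannot talk to each other, a write must either be refused (giving up availability) or accepted on one side (giving up consistency).

```python
class Replica:
    """One copy of the data, held on a single node."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Hypothetical two-replica store, used only to illustrate the trade-off."""

    def __init__(self, prefer_consistency: bool):
        self.a, self.b = Replica(), Replica()
        self.prefer_consistency = prefer_consistency
        self.partitioned = False  # True when a and b cannot reach each other

    def write(self, value) -> bool:
        """Write through node a; return True if the write was accepted."""
        if not self.partitioned:
            self.a.value = self.b.value = value  # replicate to both nodes
            return True
        if self.prefer_consistency:
            return False          # refuse: consistent, but not available
        self.a.value = value      # accept locally: available, but replicas diverge
        return True

    def replicas_agree(self) -> bool:
        return self.a.value == self.b.value


cp = TwoNodeStore(prefer_consistency=True)
cp.write("v1")
cp.partitioned = True
cp_accepted = cp.write("v2")

ap = TwoNodeStore(prefer_consistency=False)
ap.write("v1")
ap.partitioned = True
ap_accepted = ap.write("v2")

print(cp_accepted, cp.replicas_agree())  # False True
print(ap_accepted, ap.replicas_agree())  # True False
```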

Page 15: Big Data, Hadoop, NoSQL and more ...

Solution to the Problems of the Solutions

• Focus on Problem Domain
  • What’s more important for your Solution?
    • Consistency, Availability, and Partition Tolerance
  • Which industries/companies already face similar problems?
  • How/Where to Collect Data?
• Technology Fields – Internet Companies
  • Hadoop, NoSQL Datastores
  • Open Source, Free and with Friendly Licenses.

Page 16: Big Data, Hadoop, NoSQL and more ...

Hadoop Eco-System

Page 17: Big Data, Hadoop, NoSQL and more ...

Introduction

• Started by Doug Cutting and Mike Cafarella for Nutch, an open-source search engine.
• Further developed at Yahoo! and Facebook, with contributions from people at many companies.
• Named after a little toy elephant owned by Doug’s son.
• Inspired by 2 research papers from Google
  • The Google File System – 2003
  • MapReduce – 2004

Page 18: Big Data, Hadoop, NoSQL and more ...

Introduction (contd.)

• Contains 3 modules
  • Distributed File System
  • MapReduce
  • Common (a Java library containing common functions used by both DFS and MapReduce)
• Apache Top Level Project
  • Hadoop’s Website – hadoop.apache.org
• Two Parallel Release Cycles – 1.x and 2.x

Page 19: Big Data, Hadoop, NoSQL and more ...

Introduction (contd.)

• A Rich Eco-System built around Hadoop
  • Hive – Large Scale Data Warehouse
  • HBase – NoSQL Database
  • Pig – A Data-flow language on top of Hadoop
  • Flume – Log Management for Hadoop
  • Oozie – Workflow framework
  • Mahout – Machine Learning Library on top of Hadoop
  • Vaidya – Performance benchmarking framework.
  • MRUnit – Unit testing framework for MapReduce Programs.
  • And many more …

Page 20: Big Data, Hadoop, NoSQL and more ...

MapReduce

MapReduce in 2 minutes –

Problem Statement – Sum of the doubles of a set of numbers.

Input array: 1 3 4 5 6 8 9 11 17 21

The intermediate array after processing: 2 6 8 10 12 16 18 22 34 42
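The problem on this slide can be written as a map step followed by a reduce step. The sketch below uses Python's built-in `map` and `functools.reduce` on a single machine, so it shows only the programming model, not Hadoop itself. The names `f` and `g` follow the f(x) and g(x) notation used in the slides; everything else is illustrative.

```python
from functools import reduce

numbers = [1, 3, 4, 5, 6, 8, 9, 11, 17, 21]  # the input array from the slide


def f(x: int) -> int:
    """Map step: double one input value."""
    return 2 * x


def g(total: int, value: int) -> int:
    """Reduce step: fold one mapped value into the running sum."""
    return total + value


intermediate = list(map(f, numbers))
result = reduce(g, intermediate, 0)

print(intermediate)  # [2, 6, 8, 10, 12, 16, 18, 22, 34, 42]
print(result)        # 170
```

Because `f` looks at one value at a time, the map step can run on many machines in parallel. Only the reduce step needs to see all the intermediate values.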

Page 21: Big Data, Hadoop, NoSQL and more ...

Introduction – contd.

Mapping Phase

• Splitting the input
• Sending the slaves (datanodes) the mapping code - f(x).
• Apply the f(x) method on the data split

[Diagram: Mapping Phase. The Master Node contains the code of the function to be applied to individual entries of the array, written in the map() method in Hadoop. The code f(x) is sent to the Slave Nodes, which apply the logic to their data piece; in our case the data piece is an entry from the array. Entries shown on the Slave Nodes: 1, 1, 9, 8, 6, 11, 4, 3, 17, 21.]

Page 22: Big Data, Hadoop, NoSQL and more ...

Introduction – contd.

Spill Phase

• Masternode directs the Mappers to send the processed f(x) output data to an intermediate location.
• Shuffle and Sorting

[Diagram: Spill Phase (Shuffle and Sort). The results of the processed data from the Slave Nodes are given to a specific node where the reducer function runs. Intermediate values shown: 2, 2, 18, 16, 12, 22, 8, 6, 34, 42.]

Page 23: Big Data, Hadoop, NoSQL and more ...

Introduction – contd.

Reduce Phase

• MasterNode (JobTracker) invokes the Reduce task once the spilling is over.
• Get location of the Spill output from MasterNode (Namenode).

[Diagram: Reducer Phase. The results of the processed data from the Slave Nodes are given to a specific node where the reducer function runs, producing g(x) = 162.]
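The three phases described on these slides (mapping, spill, reduce) can be imitated in a single Python process to show how the data moves. This is a simplified sketch, not Hadoop's implementation: the "nodes" are plain lists, the round-robin split and the function names are hypothetical, and the input values are an assumption chosen so that the total equals the g(x) = 162 shown on the slide.

```python
from typing import Callable


def split_input(data: list[int], num_nodes: int) -> list[list[int]]:
    """Splitting: deal the input out to the slave nodes round-robin."""
    return [data[i::num_nodes] for i in range(num_nodes)]


def map_phase(splits: list[list[int]], f: Callable[[int], int]) -> list[list[int]]:
    """Mapping: every node applies f(x) to each entry of its own split."""
    return [[f(x) for x in split] for split in splits]


def spill_phase(mapped: list[list[int]]) -> list[int]:
    """Spill: gather every node's output in one place, then sort it."""
    return sorted(value for node_output in mapped for value in node_output)


def reduce_phase(intermediate: list[int], g: Callable[[list[int]], int]) -> int:
    """Reduce: a single reducer applies g to the collected values."""
    return g(intermediate)


data = [1, 1, 9, 8, 6, 11, 4, 3, 17, 21]  # assumed input, chosen to total 162
splits = split_input(data, num_nodes=4)
mapped = map_phase(splits, lambda x: 2 * x)
intermediate = spill_phase(mapped)
total = reduce_phase(intermediate, sum)

print(splits)
print(intermediate)
print(total)  # 162
```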

Page 24: Big Data, Hadoop, NoSQL and more ...

MapReduce Programming

Steps involved in writing a MapReduce program

• Write the Mapper
• Write the Reducer
• Write the Driver

Life is simple until you start customizing and working on data cleansing.
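The Mapper, Reducer and Driver roles listed on this slide can be sketched in plain Python. In Hadoop these pieces are written against Hadoop's Java API; the stand-in below uses none of it, and the function names, the "sum" key and the sample input lines are all hypothetical. The mapper also skips non-numeric tokens, as a small example of the data cleansing the slide alludes to.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Emit ("sum", 2 * n) for every integer token in one input line."""
    for token in line.split():
        if token.isdigit():  # minimal data cleansing: skip non-numeric tokens
            yield "sum", 2 * int(token)


def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Combine all values that share a key into a single total."""
    return key, sum(values)


def driver(lines: Iterable[str]) -> dict[str, int]:
    """Wire the job together: map, group by key, then reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


result = driver(["1 3 4 5 6", "8 9 n/a 11", "17 21"])
print(result)  # {'sum': 170}
```

The driver is where the grouping by key happens in this sketch. In a real cluster that work is done by the framework between the map and reduce phases.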

Page 25: Big Data, Hadoop, NoSQL and more ...

Hadoop – Bird’s Eye View

[Diagram: a Name Node and a Job Tracker connected to many slave nodes, each running a DataNode (DN) and a TaskTracker (TT). Legend: DFS Message Path; MapReduce Processing Msg.]

Page 26: Big Data, Hadoop, NoSQL and more ...

NoSQL – Not Only SQL

Page 27: Big Data, Hadoop, NoSQL and more ...

Introduction

Non-Relational Databases

• Data Model not bound by a Schema.
• No Predetermined Schema, Run-Time Columns
• Sample Data
  • Twitter Streams
  • Web Forms
  • Sensor Networks

Page 28: Big Data, Hadoop, NoSQL and more ...

Schema-less Systems

Entry 1
{"name":"emp1"}

Entry 2
{"name":"emp2","e_id":"1","e_addr":"Cupertino"}

Entry 3
{"name":"emp3","e_id":"3"}

Entry 4
{"name":"emp4","e_id":"6","dob":"03-Sep-1964"}
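The four entries on this slide can be loaded as plain Python dictionaries to show what "schema-less" means in practice. This is a minimal sketch using only the data from the slide; the variable names and the `<no address>` placeholder are illustrative.

```python
entries = [
    {"name": "emp1"},
    {"name": "emp2", "e_id": "1", "e_addr": "Cupertino"},
    {"name": "emp3", "e_id": "3"},
    {"name": "emp4", "e_id": "6", "dob": "03-Sep-1964"},
]

# The set of "columns" is only known at run time, by looking at the data.
all_fields = sorted({field for entry in entries for field in entry})
print(all_fields)  # ['dob', 'e_addr', 'e_id', 'name']

# Readers must tolerate missing fields instead of relying on a fixed schema.
for entry in entries:
    print(entry["name"], entry.get("e_addr", "<no address>"))
```

In a relational table, every row would need the same columns, with NULLs wherever a value is missing. Here each entry simply carries the fields it has.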

Page 29: Big Data, Hadoop, NoSQL and more ...

Business Requirements

• High Writes, Low Reads – Sensor Networks, Large Hadron Collider, Click Logging.

• High Reads, Low Writes – Archival Storage.

• Don’t have any fixed Schema.

Open Question - Where Else?


Page 30: Big Data, Hadoop, NoSQL and more ...

NoSQL Types

• Key-Value Pair
  • Riak, Voldemort, etc.
• Document Oriented
  • CouchDB, MongoDB, etc.
• BigTable Implementations
  • Cassandra, Hypertable, HBase, etc.
• Graph oriented
  • Neo4j, etc.
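The four families on this slide differ mainly in the shape of the data they store. The sketch below uses plain Python structures as stand-ins to hint at those shapes; it does not use the API of any database named on the slide, and all keys, field names and relationships are made up for illustration.

```python
# Key-value: the store sees only an opaque value behind each key.
key_value = {"emp:2": b'{"name": "emp2", "e_addr": "Cupertino"}'}
raw_value = key_value["emp:2"]

# Document-oriented: the store understands the fields inside each document,
# so it can filter on them.
documents = [
    {"_id": 1, "name": "emp1"},
    {"_id": 2, "name": "emp2", "e_addr": "Cupertino"},
]
in_cupertino = [d["name"] for d in documents if d.get("e_addr") == "Cupertino"]

# BigTable-style: row key -> column family -> column -> value.
wide_rows = {
    "emp2": {"info": {"name": "emp2"}, "contact": {"addr": "Cupertino"}},
}
address = wide_rows["emp2"]["contact"]["addr"]

# Graph-oriented: nodes plus explicit relationships between them.
edges = [("emp2", "REPORTS_TO", "emp1"), ("emp3", "REPORTS_TO", "emp1")]
reports_of_emp1 = [
    src for src, rel, dst in edges if rel == "REPORTS_TO" and dst == "emp1"
]

print(in_cupertino, address, reports_of_emp1)
```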

Page 31: Big Data, Hadoop, NoSQL and more ...

Introduction

Source: http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/

Page 32: Big Data, Hadoop, NoSQL and more ...

Wake up - Conclusion Time

• BigData on the Rise

• Technology and the Domain

• Smart Engineers needed, with BigData skills

• Chance to develop niche areas of Expertise even before stepping into the Industry

• 3rd Year Students – Select your final year projects very carefully, with the tools mentioned in this Seminar

• 4th Year Students – Equip yourself with the necessary skills for better industry opportunities.

Page 33: Big Data, Hadoop, NoSQL and more ...


Recommendations

• I recommend aspiring professionals and young professionals read:

• How to Solve it by Computer – RG Dromey

• Code Complete 2 – Steve McConnell

• Advanced Programming in the Unix Environment – Richard Stevens

• Many Books on Hadoop, NoSQL Datastores, and Big Data in general.

… and many more

Page 34: Big Data, Hadoop, NoSQL and more ...

Questions?

Page 35: Big Data, Hadoop, NoSQL and more ...

Thank You

Contact Us at –
• Linkedin.com/company/orzota-inc-
• Twitter.com/orzota