Intro to Big Data Hadoop

26
Dr. Sandeep G. Deshmukh Introducti on to

Transcript of Intro to Big Data Hadoop

Page 1: Intro to Big Data Hadoop

Dr. Sandeep G. Deshmukh

Introduction to

Page 2: Intro to Big Data Hadoop

Contents

❑ Big Data ❑ Distributed Systems❑ Hadoop

➢ Hadoop Distributed File System (HDFS)➢ MapReduce

2

Page 3: Intro to Big Data Hadoop

Show of Hands

Page 4: Intro to Big Data Hadoop

Introduction to Big Data

Page 5: Intro to Big Data Hadoop

Big data is data that exceeds the processing capacity of

conventional database systems.

The data is too big, moves too fast, or doesn’t fit the strictures of

your database architectures.

To gain value from this data, you must choose an alternative way

to process it.

https://www.oreilly.com/ideas/what-is-big-data

Definition

Page 6: Intro to Big Data Hadoop

Quantity of data

Data sets too large to store and analyze using traditional databases

Volume

Page 7: Intro to Big Data Hadoop

Velocity

Speed at which data is generated

Speed at which data is moving around and analyzed

Analyze data while it is being generated without even putting it into databases

Page 8: Intro to Big Data Hadoop

Variety

Different types of data that we can use

Page 9: Intro to Big Data Hadoop

Veracity

Messiness or trustworthiness of the data

Volume makes up for quality

Eg. Tweets with spelling mistakes, short words ( u -> you, thr-> there)

Page 10: Intro to Big Data Hadoop

Value

Getting value out of Big Data!!!

Page 11: Intro to Big Data Hadoop

Definition

“Big data” is

high-volume, -velocity and -variety information assets

that demand cost-effective, innovative forms of information processing

for enhanced insight and decision making

By Gartner

Page 12: Intro to Big Data Hadoop

Definition

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate

Challenges include analysis, capture, data curation, search,sharing, storage, transfer, visualization, querying, updating and information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.

Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

Wikipedia

Page 13: Intro to Big Data Hadoop

Use Case: Big Data in Oil & Gas Drilling

http://analytics-magazine.org/images/stories/novdec12/big-data.jpg

Page 14: Intro to Big Data Hadoop

Use Case: Uber - Pay Surge Pricing if Battery is Low

Page 16: Intro to Big Data Hadoop

Distributed Systems

Page 17: Intro to Big Data Hadoop

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006

http://www.mypearsonstore.com/bookstore/distributed-systems-principles-and-paradigms-9780132392273?xid=PSED

Definition

Page 18: Intro to Big Data Hadoop

Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006

Page 19: Intro to Big Data Hadoop

Transparency Description

Access Hide differences in data representation and how a resource is accessed

Location Hide where a resource is located

Migration Hide that a resource may move to another location

Relocation Hide that a resource may be moved to another location while in use

Replication Hide that a resource is replicated

Concurrency Hide that a resource may be shared by several competitive users

Failure Hide the failure and recovery of a resource

Forms of Transparency in Distributed Systems

Page 20: Intro to Big Data Hadoop

● A distributed system consists of components (i.e., computers) that are autonomous

● Users (be they people or programs) think they are dealing with a single system. This means that one way or the other the autonomous components need to collaborate. How to establish this collaboration lies at the heart of developing distributed systems.

Page 21: Intro to Big Data Hadoop

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.

The components interact with each other in order to achieve a common goal.

Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.

Wikipedia

https://www.oreilly.com/ideas/what-is-big-data

Definition

Page 22: Intro to Big Data Hadoop

● Distributed Computing - Wikipedia

● Distributed computing

● Characteristics of distributed system

Further Reading

Page 23: Intro to Big Data Hadoop

Miscellaneous Concepts

Page 24: Intro to Big Data Hadoop

Big Data Primers: Size does matter

Page 25: Intro to Big Data Hadoop

Big Data Primers: Vertical Vs Horizontal Scaling

Vertical Scaling Horizontal Scaling

Page 26: Intro to Big Data Hadoop

Big Data Primers: The scale of infrastructure