Open Sourcing Big Data - sfd.fsug.ir · About Me Hadi Sotudeh - Information Technology...

Post on 21-Jul-2020


Open Sourcing Big Data

Hadi Sotudeh

About Me

Hadi Sotudeh - Information Technology

hsotudeh@ce.sharif.edu Ce.sharif.edu/~hsotudeh

About Us

Dr. Sharif

Big Data: From a Business & Managerial Perspective

Bigdata.blog.ir

About Us

Torob.ir Co-Founder: Ali Babei

About Us

B.S. Project: Distributed Real-Time Processing Crawler (DRPC) using Apache Storm

Dr. Goudarzi

Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

Dan Ariely

News

Big Data Definition

Is there any standard definition?

Big Data Definitions

• Gartner

• McKinsey

• ....

Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

McKinsey: Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

• Sensors

• Transactions

• GPS

• Email

• Social Network

• Sound Files

• Video

• Image

• Telescope

• Log

• Text

• ....

Data Sources

Tim Berners-Lee

Open Data Movement

Open Data:


State/Org Website

UAE http://government.ae/web/guest/uae-data

UK http://data.gov.uk

US http://data.gov

World Bank http://data.worldbank.org/

India http://data.gov.in

Russia http://opengovdata.ru

EU Open-data.Europa.eu/en/data

• Google.com/trends/explore

• Google.com/finance


Closed Data!


Social Networks


A Tweet

Edward Snowden

NSA

Log or Dark Data


Importance

Analytics is the discovery and communication of meaningful patterns in data.

Analytics

Types of Analytics

• Cube Analytics: multidimensional (Product, Date, Price); BI

• Predictive Analytics: statistics and machine learning (linear regression, data clustering, finding associations)
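
To make the two types of analytics concrete, here is a minimal sketch in plain Python. The sales records, the function names `roll_up` and `fit_line`, and all values are hypothetical and not taken from the talk. `roll_up` aggregates along one dimension of a small Product/Date/Price cube, and `fit_line` is an ordinary least-squares fit of a straight line, the simplest form of linear regression.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, date, price). Values are made up.
SALES = [
    ("phone", "2015-01-01", 300),
    ("phone", "2015-01-02", 320),
    ("laptop", "2015-01-01", 900),
    ("laptop", "2015-01-02", 950),
    ("phone", "2015-01-02", 310),
]


def roll_up(facts, dimension):
    """Cube-style aggregation: sum price along one dimension.

    dimension is 0 for product and 1 for date.
    """
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[2]
    return dict(totals)


def fit_line(xs, ys):
    """Predictive-style analytics: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


by_product = roll_up(SALES, 0)   # total price per product
by_date = roll_up(SALES, 1)      # total price per date
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Real BI tools and machine learning libraries do far more than this; the sketch only shows the difference between summarizing existing data along dimensions and fitting a model that can predict new values.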

Dimensions of Analytics Variants

• Real Time: ability to analyze the data instantly

• Batch: ability to provide insights several hours or days after a query is posted
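
The difference between the two variants can be sketched in a few lines of Python. The class `RunningAverage`, the function `batch_average`, and the sample readings are hypothetical names and values chosen for illustration. The real-time path updates its answer on every incoming event, while the batch path scans the stored data only when a query is posted.

```python
class RunningAverage:
    """Real-time style: keep a small state and update it per event."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # The current average is available immediately after each event.
        self.count += 1
        self.total += value
        return self.total / self.count


def batch_average(stored_values):
    """Batch style: scan the whole stored dataset when a query arrives."""
    return sum(stored_values) / len(stored_values)


readings = [10.0, 20.0, 30.0, 40.0]   # made-up sensor readings

live = RunningAverage()
live_results = [live.update(v) for v in readings]   # one answer per event
batch_result = batch_average(readings)              # one answer per query
```

On a large dataset the batch scan is what takes hours or days, while the per-event update stays cheap because it never revisits old data.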

TOOLS

Do It

Real Time

Problems

• Scaling is painful

• Poor fault-tolerance

• Coding is tedious

What We Want

• Guaranteed data processing

• Horizontal scalability

• Fault-tolerance

• “Just works”

What Is The Key?

Hadoop

Batch Oriented System

Storm

• Guaranteed data processing

• Horizontal scalability

• Fault-tolerance

• “Just works”

Use cases

Streams

Spouts

Bolts

Topology

Word Count
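
The word count example can be pictured with a small simulation of the concepts named on the preceding slides: a spout emitting a stream of tuples, bolts transforming it, and a topology wiring them together. This is a plain-Python sketch and does not use the Apache Storm API (real topologies are usually written in Java and run on a cluster). The function names and the sample sentences are assumptions made for illustration.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream; emits one sentence tuple at a time."""
    for sentence in ("the cow jumped over the moon",
                     "the man went to the store"):
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology():
    """Topology: spout -> split bolt -> count bolt, wired as a chain."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        latest[word] = count   # keep the most recent count per word
    return latest


word_counts = run_topology()
```

In Storm itself each spout and bolt would run as many parallel tasks across machines, and tuples for the same word would be routed to the same counting task; the single-process chain above only shows the data flow.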

Tuple Tree
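
Storm's guaranteed processing rests on tracking the tuple tree: every tuple derived from a spout tuple must be acknowledged before the spout tuple counts as fully processed. Storm does this with a single XOR value per spout tuple rather than storing the whole tree. The sketch below illustrates that idea in plain Python; the class `AckTracker` and its methods are hypothetical names, not the Storm API, and the fixed seed is only there to make the run repeatable.

```python
import random


class AckTracker:
    """Tracks one tuple tree with a single 64-bit XOR value."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.ack_val = 0

    def emit(self):
        # Each new tuple gets a random 64-bit id, forced to be non-zero here.
        tuple_id = self._rng.getrandbits(64) | 1
        self.ack_val ^= tuple_id     # XOR the id in when the tuple is created
        return tuple_id

    def ack(self, tuple_id):
        self.ack_val ^= tuple_id     # XOR the same id again when it is acked

    def fully_processed(self):
        # Every id has been XORed in twice, so the ids cancel out to zero.
        return self.ack_val == 0


tracker = AckTracker(seed=0)
root = tracker.emit()        # tuple emitted by the spout
child_a = tracker.emit()     # tuples emitted by a bolt, anchored to root
child_b = tracker.emit()

states = []
for tuple_id in (root, child_a, child_b):
    tracker.ack(tuple_id)
    states.append(tracker.fully_processed())
```

The tree is reported as complete only after the last outstanding tuple is acked. The memory cost stays constant per spout tuple no matter how large the tree grows, at the price of a very small chance that random ids cancel out early.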

Resources

• Book

• Apache Storm website

Conclusion

• Data, Data, and Data

• Data Gathering

• Analytics

• Visualization

• Action

• Bottleneck is Creativity not Technology

• Discover Use Cases