Open problems big_data_19_feb_2015_ver_0.1

Open Problems in Big Data Analytics: A Practitioner’s View. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus. Invited Talk, National Conference on Distributed Machine Learning, Feb 2015.

Transcript of Open problems big_data_19_feb_2015_ver_0.1

Page 1: Open problems big_data_19_feb_2015_ver_0.1

Open Problems in Big Data Analytics: A Practitioner’s View

Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus

Invited Talk, National Conference on Distributed Machine Learning, Feb 2015

Page 2: Open problems big_data_19_feb_2015_ver_0.1

Contents

• State of the art in Big Data Analytics
• Big Data Computations: Characterization
• Big Data pipelines: open problems

Page 3: Open problems big_data_19_feb_2015_ver_0.1

State of the Art in Big Data Analytics

• Start from business questions
• How quickly and accurately can we get answers?
• Data gets stored in HDFS
• Various frameworks to process data
  • Spark – machine learning
  • Giraph/GraphLab – graph processing
  • Storm – real-time processing

Page 4: Open problems big_data_19_feb_2015_ver_0.1

State of the Art in Big Data Analytics

• Is HDFS the right storage?
• Alternatives
  • Cassandra, MapR – M7, QFS, Cleversafe, Isilon, etc.

http://www.inktank.com/news-events/new/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

Page 5: Open problems big_data_19_feb_2015_ver_0.1

State of the Art in Big Data Analytics

• Is Spark the right platform for processing?
• Alternatives
  • Flink
  • Forge – a meta domain-specific language

Page 6: Open problems big_data_19_feb_2015_ver_0.1

State of the Art in Big Data Analytics

• Is Spark Streaming/Storm the right platform for stream processing?

Page 7: Open problems big_data_19_feb_2015_ver_0.1

7

Big Data Computations

[Figure: chart annotated with the text below; axis label: Computations/Operations]

Giant 1 (simple stats) is perfect for Hadoop 1.0.
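
Why simple statistics suit the MapReduce model can be illustrated with a minimal sketch, assuming the data is already split into partitions. Each map task reduces its partition to a small partial aggregate, and a single reduce step merges the partials. The function names (`map_partition`, `merge`, `finalize`) are hypothetical and are not part of Hadoop's API.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return (count, total, total_sq)


def merge(a, b):
    """Reduce step: partial aggregates combine by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(aggregate):
    """Turn the merged aggregate into mean and population variance.

    The sum-of-squares formula is simple but can lose precision on
    large, tightly clustered values; it is used here only for clarity.
    """
    count, total, total_sq = aggregate
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


def mean_and_variance(partitions):
    """One pass over the data: map every partition, then merge once."""
    partials = [map_partition(p) for p in partitions]
    return finalize(reduce(merge, partials))
```

Because `merge` is associative and commutative, the partials can be combined in any order, so the computation needs no iteration and no communication between map tasks.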

Giants 2 (linear algebra), 3 (N-body) and 4 (optimization): is Spark from UC Berkeley efficient?

Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.

One example is the social group-first approach for consumer churn analysis [2]
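
The algorithms listed above are iterative: every iteration makes a full pass over the same data set, which is where an in-memory platform can help compared with re-reading from disk on each pass. A minimal single-machine sketch of logistic regression with batch gradient descent shows that access pattern. It is plain Python, not Spark's API, and the names `train_logistic` and `predict` as well as the default learning rate and iteration count are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, iterations=200, lr=0.5):
    """Batch gradient descent for logistic regression.

    Each iteration scans the whole data set once, so the data is
    re-read `iterations` times in total.
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    n = len(points)
    for _ in range(iterations):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(points, labels):  # full pass over the data
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Update with the averaged gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return class 1 if the predicted probability is at least 0.5."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0
```

In a distributed setting the inner loop would be split across partitions and the per-partition gradients summed, but the repeated scan of the same data remains.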

Interactive/On-the-fly data processing – Storm.

OLAP – data cube operations. Dremel/Drill

Data sets – not embarrassingly parallel?

Deep Learning: Artificial Neural Networks/Deep Belief Networks

Machine vision from Google [3]

Speech analysis from Microsoft

Giant 5 – Graph processing – GraphLab, Pregel, Giraph
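
The vertex-centric model behind Pregel and Giraph can be sketched on a single machine: computation proceeds in supersteps, each vertex reads the messages sent to it in the previous superstep, updates its own state, and messages its neighbours only if that state changed. The sketch below labels connected components by propagating the smallest vertex id. The function name `min_label_components` and the data layout are assumptions for illustration and do not reflect any framework's API.

```python
def min_label_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for nb in neighbors[v]:
            inbox[nb].append(label[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: vertex stays halted
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                for nb in neighbors[v]:
                    outbox[nb].append(smallest)
        inbox = outbox
    return label
```

Unlike the one-pass statistics of Giant 1, the number of supersteps here depends on the graph's structure, which is one reason graph workloads are handled by dedicated frameworks.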

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.

[2] Yossi Richter, Elad Yom-Tov, Noam Slonim: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.

[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.

Page 8: Open problems big_data_19_feb_2015_ver_0.1

Big Data Pipelines

1. Nuance – incompleteness

2. Scale

3. Timeliness

4. Privacy

5. Human Loop

Page 9: Open problems big_data_19_feb_2015_ver_0.1

Big Data Pipelines: Data Acquisition

• Needle in a Haystack.

• BlinkDB?

• Automatic metadata discovery
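
The idea that BlinkDB points to, answering an aggregate query from a sample and reporting an error bound instead of scanning everything, can be sketched as follows. This is only the general sampling idea under simplifying assumptions (a uniform random sample and a normal-approximation 95% interval); it is not BlinkDB's API or its sample-selection strategy, and the name `approximate_mean` is hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval. Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error
```

A usage example: `approximate_mean(list(range(100000)), 1000)` reads 1% of the rows and returns an estimate together with its margin, trading accuracy for response time.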

Page 10: Open problems big_data_19_feb_2015_ver_0.1

Big Data Pipelines: Information Extraction

• Error models for data cleaning

• Multimedia data

Page 11: Open problems big_data_19_feb_2015_ver_0.1

Big Data Pipelines: Analytics

• Multi-dimensional data

Page 12: Open problems big_data_19_feb_2015_ver_0.1

The network to identify the individual digits from the input image

http://neuralnetworksanddeeplearning.com/chap1.html
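
The figure itself is not reproduced in this transcript. As a rough illustration of such a network, the sketch below runs a forward pass through a small feedforward network with sigmoid units, assuming 784 inputs (a 28×28 image) and 10 outputs (one per digit). The hidden-layer size of 30, the random weights and all function names are assumptions; the network is untrained, so its prediction is meaningless until the weights are learned.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_layer(n_in, n_out, rng):
    """Create random weights (n_out rows of n_in values) and biases."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases


def feedforward(layers, pixels):
    """Propagate the input through every layer and return the outputs."""
    activation = pixels
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


def classify(layers, pixels):
    """Pick the digit whose output unit has the highest activation."""
    outputs = feedforward(layers, pixels)
    return max(range(len(outputs)), key=outputs.__getitem__)


rng = random.Random(0)
network = [make_layer(784, 30, rng), make_layer(30, 10, rng)]
```

Calling `classify(network, pixels)` with a list of 784 grey-scale values between 0 and 1 returns a digit from 0 to 9.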

Copyright @Impetus Technologies, 2014

Page 13: Open problems big_data_19_feb_2015_ver_0.1

DLNs for Face Recognition

Page 14: Open problems big_data_19_feb_2015_ver_0.1

DLN for Face Recognition

http://www.slideshare.net/hammawan/deep-neural-networks

Page 15: Open problems big_data_19_feb_2015_ver_0.1

Success stories of DLNs

• Android voice recognition system – based on DLNs; improves accuracy by 25% compared to state of the art
• Microsoft Skype Translate software and digital assistant Cortana
• 1.2 million images, 1000 classes (ImageNet data) – error rate of 15.3%, better than state of the art at 26.1%

Page 16: Open problems big_data_19_feb_2015_ver_0.1

Success stories of DLNs (continued)

• SENNA system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing; comparable F1 score to the state of the art, with a huge speed advantage (5 days vs. a few hours)
• DLNs vs. TF-IDF: relevance search over 1 million documents, 3.2 ms vs. 1.2 s
• Robot navigation
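
For readers unfamiliar with the TF-IDF baseline mentioned on this slide, a minimal sketch of TF-IDF relevance search follows. It assumes whitespace tokenization, term frequency normalized by document length, and `log(N / df)` as the inverse document frequency; these choices and the function names are illustrative assumptions, not the setup behind the timings quoted above.

```python
import math
from collections import Counter


def build_index(documents):
    """Tokenize documents and compute per-term inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(documents)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    term_freqs = [Counter(tokens) for tokens in tokenized]
    return tokenized, term_freqs, idf


def search(query, documents):
    """Return ids of matching documents, most relevant first."""
    tokenized, term_freqs, idf = build_index(documents)
    query_terms = query.lower().split()
    scored = []
    for doc_id, (tokens, tf) in enumerate(zip(tokenized, term_freqs)):
        if not tokens:
            continue  # skip empty documents
        score = sum(tf[t] / len(tokens) * idf.get(t, 0.0)
                    for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

This sketch scores every document for each query; production search systems typically use an inverted index so that only documents containing a query term are touched.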

Page 17: Open problems big_data_19_feb_2015_ver_0.1
Page 18: Open problems big_data_19_feb_2015_ver_0.1

Conclusions

• Hadoop = HDFS + Map-Reduce
  • Useful for large scale embarrassingly parallel processing of data sets
  • Not so good for iterative, interactive computing.
• Beyond Hadoop Map-Reduce philosophy
  • Optimization and other problems.
  • Real-time computation
  • Processing specialized data structures

Page 19: Open problems big_data_19_feb_2015_ver_0.1

Thank You!

Mail • [email protected]

LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran

Blogs • blogs.impetus.com

Twitter • @a_vijaysrinivas.

Page 20: Open problems big_data_19_feb_2015_ver_0.1

References

• Divyakant Agrawal et al., Challenges and Opportunities with Big Data, Computing Research Association White Paper, available from http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• Vijay Srinivas Agneeswaran et al., Distributed Deep Learning over Spark, available at: http://www.datasciencecentral.com/profiles/blogs/implementing-a-distributed-deep-learning-network-over-spark