Java/Scala Lab: Roman Nikitchenko - Big Data - Big Pitfalls.

Roman Nikitchenko, 04.12.2014

Page 1:

Roman Nikitchenko, 04.12.2014

Page 2:

www.vitech.com.ua

Any real big data is just a DIGITAL LIFE FOOTPRINT

Page 3:

THE SAME IS ABOUT...

NOT ALL THINGS IN OUR LIFE ARE NICE

Page 4:

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

Page 5:

YARN

Page 6:

Don't shoot yourself in the foot with a BIG GUN!

Some aspects are more special.

The most dangerous things in Big Data

Basics

A couple of specific notes

Beware!

Page 7:

THE MOST SERIOUS BIG DATA FAILURE IS ...

NO DATA

Page 8:

NO DATA

NO MONEY

The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect.

Page 9:

WHERE ARE YOU?

Page 10:

DATA LAKE

Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

Page 11:

YOU ALWAYS HAVE AN OPTION

● We have developed our own online storage that lowers maintenance effort and stores anything.

Page 12:

The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code.

LIVE WITH IT

Page 13:

YOU ALWAYS HAVE AN OPTION

● We have a dedicated engineering roadmap for big data infrastructure development.

Page 14:

Why Hadoop?

[Figure: an equation-like graphic with the labels "x MAX", "+" and "=", and "BIG DATA" repeated ten times]

Use robust solutions

Page 15:

What is HADOOP?

● Hadoop is an open source framework for big data, covering both distributed storage and processing.

● Hadoop is reliable and fault tolerant without relying on hardware for these properties.

● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.

Page 16:

Hadoop: don't do it yourself

Page 17:

Option? Our experience is:

● Hortonworks is 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them.

● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014.

● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.

Page 18:

HBase motivation

Hadoop is...

● Designed for throughput, not for latency.

● HDFS blocks are expected to be large. There is an issue with lots of small files.

● Built on a write once, read many times ideology.

● MapReduce is not so flexible, and neither is any database built on top of it.

● How about real time?
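
The small-files issue on this slide can be made concrete with a little arithmetic. The sketch below estimates how much NameNode heap the file and block metadata would take for the same data volume stored as small or as large files. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and the 128 MB block size is a typical setting; both are assumptions, not numbers from the talk.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb per file or block object
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size: 128 MB


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def namenode_objects(total_bytes, file_size):
    """Count namespace objects: one per file plus one per block."""
    files = ceil_div(total_bytes, file_size)
    blocks_per_file = max(1, ceil_div(file_size, BLOCK_SIZE))
    return files * (1 + blocks_per_file)


def namenode_heap_bytes(total_bytes, file_size):
    """Rough metadata footprint in bytes under the assumptions above."""
    return namenode_objects(total_bytes, file_size) * BYTES_PER_OBJECT


TB = 1024 ** 4
small_files_heap = namenode_heap_bytes(10 * TB, 100 * 1024)   # 100 KB files
large_files_heap = namenode_heap_bytes(10 * TB, 1024 ** 3)    # 1 GB files
print(small_files_heap // 1024 ** 2, "MB vs", large_files_heap // 1024 ** 2, "MB")
```

Under these assumptions the same 10 TB costs the NameNode thousands of times more memory when it arrives as 100 KB files, which is one reason a store such as HBase is attractive for many small records.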

Page 19:

Uses commodity hardware...

The understanding of the word 'commodity' is growing

● 64 GB of RAM is considered a pretty small amount. 128 GB is a more and more common configuration.

● 2 CPUs with 6 cores each is considered commodity.

● 4 HDDs is a minimum. SSDs are used more and more often.

Page 20:

Virtualization

NOT SO REAL ELEPHANT

VIRTUALIZATION

Page 21:

CONCERNS

● Virtualization is possible for key nodes, but not for workers unless you are really big.

● Several nodes on a single physical host: what happens if this host fails?

● Loaded services on a VM: is it meaningful? Double duties?

Page 22:

Virtualization: practical case

● Apache ZooKeeper is a QUORUM-based service.

● If a host with 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure.

● Can you guarantee equal performance for all ZK service instances?

● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!

HOST

HOST

REAL EXAMPLE
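
The quorum argument on this slide can be checked mechanically. The sketch below assumes a majority quorum (more than half of the ensemble must be alive) and uses hypothetical instance and host names; it only models placement and does not talk to ZooKeeper.

```python
from collections import Counter


def quorum_size(ensemble_size):
    """Majority of the ensemble: the minimum number of live instances."""
    return ensemble_size // 2 + 1


def survives_any_single_host_failure(placement):
    """placement maps a ZK instance name to its physical host (hypothetical names)."""
    instances_per_host = Counter(placement.values())
    needed = quorum_size(len(placement))
    # Losing a host removes every instance that runs on it.
    return all(len(placement) - lost >= needed
               for lost in instances_per_host.values())


# Three VMs, but two of them share one physical host.
virtualized = {"zk1": "hostA", "zk2": "hostA", "zk3": "hostB"}
# Three instances on three separate physical hosts.
physical = {"zk1": "hostA", "zk2": "hostB", "zk3": "hostC"}

print(survives_any_single_host_failure(virtualized))
print(survives_any_single_host_failure(physical))
```

With two of three instances on one host, losing that host leaves one live instance against a required quorum of two, so the ensemble that looks tolerant to one failure on paper is not.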

Page 23:

YOU ALWAYS HAVE AN OPTION

● Indeed, there are lots of options with virtualization. The only concern is the ability to use your own brain.

Page 24:

HBase motivation

Need online storage for big data?

LATENCY, SPEED and all Hadoop properties.

Page 25:

NO SECONDARY INDEXES OUT OF THE BOX.
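
What the lack of secondary indexes means in practice: lookups by row key are cheap, while a query on any other field has to visit every row unless you maintain an index yourself. The sketch below illustrates this with plain dictionaries standing in for tables; it is not the HBase client API, and all names and data are hypothetical.

```python
from collections import defaultdict

# A plain dict stands in for a table: row key -> columns.
users = {
    "user#001": {"city": "Kyiv", "name": "Olga"},
    "user#002": {"city": "Lviv", "name": "Ivan"},
    "user#003": {"city": "Kyiv", "name": "Petro"},
}


def find_by_city_scan(table, city):
    """Without an index: visit every row (a full table scan)."""
    return sorted(key for key, row in table.items() if row["city"] == city)


def build_city_index(table):
    """A hand-maintained secondary index: city -> set of row keys."""
    index = defaultdict(set)
    for key, row in table.items():
        index[row["city"]].add(key)
    return index


city_index = build_city_index(users)
print(find_by_city_scan(users, "Kyiv"))
print(sorted(city_index["Kyiv"]))
```

The index answers the same question without touching unrelated rows, at the price of having to keep it in sync with every data change.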

Page 26:

YOU ALWAYS HAVE LOTS OF OPTIONS

● We have built our own search indexing technology.

Page 27:

INDEX ALTERNATIVE: SOLR

● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.

● But it can index ANYTHING. The search result is a document ID.

[Diagram labels: INDEX UPDATE, INDEX QUERY, Search responses]

An index update request is analyzed, tokenized, transformed... and the same goes for queries.
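
The idea that updates and queries pass through the same analysis, and that the index returns only document IDs, can be illustrated with a toy inverted index. This is a minimal sketch and not SOLR's API: the analyzer, the class and the document IDs are all hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # token -> document IDs

    def update(self, doc_id, text):
        """Index update: only tokens and the ID are kept, not the text."""
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def query(self, text):
        """Index query: analyzed the same way, answered with document IDs."""
        tokens = analyze(text)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = ToyIndex()
index.update("doc-1", "Big Data, Big Pitfalls")
index.update("doc-2", "Big elephants")
print(sorted(index.query("BIG data")))
```

Because the query "BIG data" is lowercased and tokenized exactly like the indexed text, it matches "doc-1"; the original document still has to be fetched from the real storage by that ID.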

Page 28:

● HBase handles online requests that change user data.

● The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.

● Indexes are built in SOLR, so HBase data is searchable.

Page 29:

HBase: Data and search integration

[Diagram, reconstructed from its labels]

● Client: sends data updates and search requests (HTTP), receives search responses. The user just puts (or deletes) data.

● HBase cluster (HBase regions): REPLICATION. Replication can be set up down to the column family level.

● Lily HBase NRT indexer: translates data changes into SOLR index updates.

● SOLR cloud: finally provides search.

● Apache ZooKeeper: does all coordination.

● HDFS: serves as the low-level file system.
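
The flow on this slide, data changes replicated per column family and translated into index updates, can be sketched as a small pure function. This is an illustration only: the event format, the field names and the function are hypothetical and do not reflect the actual Lily indexer or SOLR interfaces.

```python
def to_index_updates(changes, indexed_families):
    """Translate data change events into index update requests.

    Each change is a dict with the hypothetical keys op, row, family, value.
    """
    updates = []
    for change in changes:
        if change["family"] not in indexed_families:
            continue   # this column family is not replicated to the indexer
        if change["op"] == "put":
            updates.append(
                {"action": "add", "id": change["row"], "text": change["value"]}
            )
        elif change["op"] == "delete":
            updates.append({"action": "delete", "id": change["row"]})
    return updates


changes = [
    {"op": "put", "row": "r1", "family": "content", "value": "big data"},
    {"op": "put", "row": "r1", "family": "audit", "value": "not indexed"},
    {"op": "delete", "row": "r2", "family": "content", "value": None},
]
print(to_index_updates(changes, {"content"}))
```

The client never talks to the indexer: it only puts or deletes data, and the index follows from the change stream.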

Page 30:

ETL

LOAD YOUR DATA WITH CARE

Page 31:

ENTERPRISE DATA HUB

Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.

Page 32:

ETL & BD: main stages

[Diagram labels: SQL server with Table1, Table2, Table3, Table4; JOIN; Transform; Partition; three BIG DATA shards; stages EXTRACT, TRANSFORM, LOAD]

● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data?

● Big data storages are mostly non-relational. You have to map table relations into objects. Where do you put this complexity?
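
The two questions on this slide, how to partition and how to map relations into objects, can be sketched in a few lines. Hash partitioning by key is only one possible answer, and the table and field names below are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(key, shard_count):
    """Stable hash partitioning: the same key always lands on the same shard."""
    return zlib.crc32(str(key).encode("utf-8")) % shard_count


def to_documents(customers, orders):
    """Map a one-to-many table relation into nested objects."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    return [
        {
            "customer_id": c["customer_id"],
            "name": c["name"],
            "orders": orders_by_customer.get(c["customer_id"], []),
        }
        for c in customers
    ]


customers = [{"customer_id": 1, "name": "Olga"}, {"customer_id": 2, "name": "Ivan"}]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25},
    {"order_id": 11, "customer_id": 1, "total": 40},
]
for doc in to_documents(customers, orders):
    print(shard_for(doc["customer_id"], 3), doc)
```

Wherever this mapping runs, on the SQL side or on the big data side, is exactly the complexity the slide asks about.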

Page 33:

ETL & BD: complexity on SQL

[Diagram labels: SQL server with Table1, Table2, Table3, Table4; JOIN; one ETL stream; three BIG DATA shards]

● It's hard to transform SQL relationships into NoSQL objects: complex joins.

● Simple stream on the big data side, lower network traffic. HUGE load on SQL.

● What if you have several SQL servers and you need a 2 times faster import?

SQL dies on this

Page 34:

ETL & BD: complexity on BD side

[Diagram labels: SQL server with Table1, Table2, Table3, Table4; JOIN; four ETL streams; three BIG DATA shards]

● Simple streaming from SQL. Things like joins happen on the Big Data side.

● Even if you have 100 SQL servers, you have to scale only a single cluster.

● Network load is more intensive.

Much more scalable
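
Moving the join to the big data side can be sketched as follows: raw table streams are routed to shards by the join key, and each shard joins only its own records. This is a minimal in-memory illustration with hypothetical field names and integer keys, not a description of any particular ETL tool.

```python
from collections import defaultdict


def route(stream, key_field, shard_count):
    """Send each record of a raw table stream to a shard chosen by its join key."""
    shards = defaultdict(list)
    for record in stream:
        shards[record[key_field] % shard_count].append(record)
    return shards


def join_on_shard(left, right, key_field):
    """Local hash join: runs independently on every shard."""
    by_key = defaultdict(list)
    for r in right:
        by_key[r[key_field]].append(r)
    return [{**l, **r} for l in left for r in by_key.get(l[key_field], [])]


customers = [{"cid": 1, "name": "Olga"}, {"cid": 2, "name": "Ivan"}]
orders = [{"cid": 1, "total": 10}, {"cid": 2, "total": 5}, {"cid": 1, "total": 7}]

SHARDS = 2
customer_shards = route(customers, "cid", SHARDS)
order_shards = route(orders, "cid", SHARDS)
joined = [
    row
    for shard in range(SHARDS)
    for row in join_on_shard(customer_shards[shard], order_shards[shard], "cid")
]
print(joined)
```

The SQL servers only stream rows out; adding shards scales the join, at the cost of moving unjoined data over the network.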

Page 35:

YARN: future of Hadoop

● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy are real things.

● New distributed processing approaches: from now on, MapReduce is only one among other YARN applications.

Page 36:

The world's first ever DATA OS

A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

Page 37:

This is how retail agents often work.

YARN

Page 38:

This is how it often works.

YARN

[Diagram: "What can be reality" shows CPU and CPU CPU CPU; "YARN presents" shows CPU CPU CPU CPU]

It's about reservation. Indeed, you could have no resources left because of a service that is not aware of YARN.
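
The gap between what YARN presents and what is really available is simple bookkeeping. The sketch below assumes that YARN schedules against a configured capacity and does not see load from services running outside it; the function and its parameter names are hypothetical.

```python
def cpu_picture(physical_cores, yarn_vcores, non_yarn_busy_cores):
    """Compare what YARN advertises with what is really left on a node."""
    really_free = max(0, physical_cores - non_yarn_busy_cores)
    return {
        "yarn_presents": yarn_vcores,
        "really_free": really_free,
        "oversubscribed_by": max(0, yarn_vcores - really_free),
    }


# A 4-core node where a service outside YARN keeps 3 cores busy.
print(cpu_picture(physical_cores=4, yarn_vcores=4, non_yarn_busy_cores=3))
```

One way to close the gap is to configure the node's YARN capacity below the physical total, leaving room for everything that runs outside YARN.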

Page 39:

YOU ALWAYS HAVE AN OPTION

Page 40:

Apache Spark

● A better MapReduce, with at least some MapReduce elements that can be reused.

● New job models, not only Map and Reduce.

● Scala and Python APIs in addition to Java. Functional model support.

● Results can be passed through memory, including the final one.

Page 41:

● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs.

● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors.

● Persistence: int limitation of 2G. HUGE amount of memory as of today.

● You cannot partition data 'on the fly'. You have to guess the right way.
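
The 2G note combines with the partitioning note into a small planning calculation. The sketch below assumes the limit refers to the size of a single persisted block being bounded by a Java int (2^31 - 1 bytes), which is an interpretation of the slide; the headroom factor is a hypothetical safety margin.

```python
INT_MAX = 2 ** 31 - 1   # largest Java int, about 2 GB when counting bytes


def min_partitions(dataset_bytes, headroom=0.5):
    """Smallest partition count that keeps every partition under the limit.

    Assumes evenly sized partitions; headroom is a hypothetical safety factor
    because real data is usually skewed.
    """
    target = int(INT_MAX * headroom)
    return -(-dataset_bytes // target)   # ceiling division


TB = 1024 ** 4
print(min_partitions(1 * TB))
```

Since the partition count has to be chosen before the job runs, an estimate like this is the kind of up-front guess the last bullet refers to.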

Page 42:

● Dynamic, faster to start up, resource reuse.

● Unified management infrastructure such as logging.

Your cluster is ready for the next tasks

MapReduce + Spark

YARN

Page 43:

It is simply too good to wait...

Page 44:

TRUST ME ;-)

Page 45:

Share your knowledge!

DO NOT HIDE YOUR EXPERIENCE

Page 46:

Questions and discussion