MetaScale is a subsidiary of Sears Holdings Corporation


Page 1: MetaScale is a subsidiary of Sears Holdings Corporation


The 3 Ts of Hadoop

Wuheng Luo
Ankur Gupta
06.2013

Page 2: MetaScale is a subsidiary of Sears Holdings Corporation

The 3 Ts of Hadoop

3-Stage Circular Process of Enterprise Big Data

Page 3: MetaScale is a subsidiary of Sears Holdings Corporation

What are the 3Ts?

3Ts = Transfer, Transform, and Translate

A new enterprise big data pattern to bring disruptive change to conventional ETL

To leverage Hadoop for streamlining data processes

To move toward real-time analytics

Page 4: MetaScale is a subsidiary of Sears Holdings Corporation

The 3Ts Goal

To simplify enterprise data processing and reduce the latency of turning enterprise data from raw form into products of discovery, so as to better support business decisions.

Page 5: MetaScale is a subsidiary of Sears Holdings Corporation

The 3Ts One Liners

Transfer: Once the Hadoop system is in place, a mandate is needed to immediately and continuously capture and deliver all enterprise data, from all data sources, through all data systems, to Hadoop, and store the data under HDFS.

Transform: When source data is in, clean, standardize, and convert the data through dimensional modeling. Data transformation should be performed in place within Hadoop, without moving the data out again for integration reasons.

Translate: Finish the data flow cycle by turning analytical data aggregated in Hadoop into data products of business wisdom. Use batch and streaming tools built on top of Hadoop to interact with data scientists and end users.
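The three one-liners above can be sketched as a tiny pipeline. This is only an in-memory illustration in plain Python, not Hadoop code: the sample records, the field names, and the functions `transfer`, `transform`, and `translate` are all hypothetical, and the list stands in for data stored under HDFS.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in Hadoop (not from the slides).
RAW_SALES = [
    {"store": " s01 ", "sku": "A1", "amount": "19.99"},
    {"store": "S01", "sku": "a1", "amount": "5.01"},
    {"store": "S02", "sku": "B7", "amount": "bad"},  # dirty row
    {"store": "S02", "sku": "B7", "amount": "10.00"},
]

def transfer(source_rows):
    """Transfer: land every source record unchanged (stand-in for an HDFS write)."""
    return list(source_rows)

def transform(landed_rows):
    """Transform: clean and standardize, dropping rows whose amount cannot be parsed."""
    cleaned = []
    for row in landed_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({
            "store": row["store"].strip().upper(),
            "sku": row["sku"].strip().upper(),
            "amount": amount,
        })
    return cleaned

def translate(clean_rows):
    """Translate: aggregate into a small data product (sales total per store)."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[row["store"]] += row["amount"]
    return {store: round(total, 2) for store, total in totals.items()}

sales_by_store = translate(transform(transfer(RAW_SALES)))
print(sales_by_store)
```

Each stage consumes the output of the previous one without the data leaving the "hub", which is the point the one-liners make about keeping the flow inside Hadoop.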

Page 6: MetaScale is a subsidiary of Sears Holdings Corporation

Hadoop as Enterprise Data Hub

“Data Hub” is not a new concept, but:

Conventional Data Hub | Hadoop Enterprise Data Hub
RDBMS or EDW based | Hadoop ecosystem based
No consistent architectural style: ODS, MDM, messaging or publish-subscribe, etc. | 3-phased architecture to cover full enterprise data flow cycle from data source to data products
Relies heavily on ETL | 3Ts-driven
Intermediate, partial, siloed | True center of enterprise data
… | …

Page 7: MetaScale is a subsidiary of Sears Holdings Corporation


Sourcing Data into Hadoop


Continuously capture all enterprise data at the earliest possible touch points, deliver the data from all sources, through all source data systems, to Hadoop, and store the data under HDFS.

Page 8: MetaScale is a subsidiary of Sears Holdings Corporation



To gain a distinctive competitive capability, enterprises need to build an integrated data infrastructure as the foundation for big data analytics. Use Hadoop as THE centralized enterprise data repository, and make it the grand destination for all enterprise source data.

Page 9: MetaScale is a subsidiary of Sears Holdings Corporation


(3 Ts’) Transfer vs. (ETL’s) Extract

Traditional ETL - Extract | Hadoop - Transfer
Bottom-up | Top-down
Task/project specific | Enterprise-wide mandate
Passive | Proactive
Data is not available when needed | Data is ready when needed
Same datasets are moved around again and again, with no value added | Move the data once, and use it many times, each time with value increased

Page 10: MetaScale is a subsidiary of Sears Holdings Corporation



Before | After
Isolated, disconnected in various siloed data/file systems | Consolidated and centralized in Hadoop
Monolithically segmented | Heterogeneous, diverse, huge
Separated and partial | Federated and holistic

Page 11: MetaScale is a subsidiary of Sears Holdings Corporation



Always do a data gap analysis first

Fork the ingestion into both batch and streaming paths if needed

Have a delivery plan for the data feed

Synchronize data changes between the source systems and Hadoop
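One way to picture the last point, synchronizing changes between a source system and its copy in Hadoop, is to replay captured change records against the landed snapshot in order. The sketch below is an assumption, not the authors' method: the change-record shape `(seq, op, key, record)` and the function `apply_changes` are hypothetical, and plain dictionaries stand in for tables.

```python
def apply_changes(snapshot, changes):
    """Return a new snapshot with source-side changes applied in sequence order.

    snapshot: dict mapping primary key -> record (the copy held in Hadoop)
    changes:  iterable of (seq, op, key, record) tuples captured at the source
    """
    synced = dict(snapshot)  # leave the caller's snapshot untouched
    for seq, op, key, record in sorted(changes, key=lambda change: change[0]):
        if op == "delete":
            synced.pop(key, None)
        elif op in ("insert", "update"):
            synced[key] = record
        else:
            raise ValueError(f"unknown operation: {op}")
    return synced

hadoop_copy = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
source_changes = [
    (3, "delete", 2, None),              # arrives first, but applies last
    (1, "update", 1, {"name": "Anne"}),
    (2, "insert", 3, {"name": "Cy"}),
]
synced_copy = apply_changes(hadoop_copy, source_changes)
print(synced_copy)
```

Sorting by sequence number means changes that arrive out of order are still applied in the order they happened at the source.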

Page 12: MetaScale is a subsidiary of Sears Holdings Corporation


Integrating Data within Hadoop


Keep the data flowing beyond the ingest phase by transforming the data from dirty to clean, from raw to standardized, and from transactional to analytical, all within Hadoop.

Page 13: MetaScale is a subsidiary of Sears Holdings Corporation



As the latency from raw data to business insight becomes the focal point of enterprise data analytics, use Hadoop as the data integration platform to perform in-place data transformation.

Page 14: MetaScale is a subsidiary of Sears Holdings Corporation



Partition enterprise-wide standardized data and job-specific analytical data in HDFS, and retain history.

Use dimensional modeling to transform and standardize, and make dimensional data the atomic unit of enterprise data.

Identify all enterprise data entities, and add the finest-grain attributes to each entity as dimensional data.

Take a bottom-up approach, but also think about data usage across the enterprise rather than binding the data to a specific task.
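The first point above, keeping enterprise-wide standardized data apart from job-specific analytical data while retaining history, could be reflected in a directory layout. The slides do not prescribe one, so everything here is an assumption for illustration: the zone names, the `/data` root, the `load_date=` naming, and the function `partition_path`.

```python
import posixpath
from datetime import date

# Hypothetical zone names; the slides only distinguish the two kinds of data.
ZONES = ("standardized", "analytical")

def partition_path(zone, dataset, load_date, root="/data"):
    """Build a date-partitioned directory path for one load of a dataset."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return posixpath.join(root, zone, dataset, f"load_date={load_date.isoformat()}")

paths = [
    partition_path("standardized", "customer_dim", date(2013, 6, 1)),
    partition_path("standardized", "customer_dim", date(2013, 6, 2)),
    partition_path("analytical", "churn_features", date(2013, 6, 2)),
]
for path in paths:
    print(path)
```

Because each load gets its own dated directory, earlier loads stay in place and history is retained rather than overwritten.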

Page 15: MetaScale is a subsidiary of Sears Holdings Corporation


(3 Ts’) Transform vs. (ETL’s) Transform

Transform in ETL / ELT | Transform in 3 Ts
in vitro, outside Hadoop | in vivo, within Hadoop
Use Hadoop as rental space | Use Hadoop as integration platform
Non-value adding data movement in between data storage and transformation | Data is transformed while flowing from one partition to another under HDFS
High latency | Low latency
Network bottleneck | Data locality

Page 16: MetaScale is a subsidiary of Sears Holdings Corporation


Making Data Products out of Hadoop


Turn analytical data into data products of business wisdom using home-made or commercial analytics tools built on top of Hadoop. Business decisions supported by data products will help generate more new data, thus starting a new round of the enterprise data flow cycle…

Page 17: MetaScale is a subsidiary of Sears Holdings Corporation



Low-latency big data analytics requires the right platform and tools

Use Hadoop as the platform of choice for enterprise data analytics because of its openness and flexibility

Choose analytical tools that are flexible, agile, interactive and user friendly

Page 18: MetaScale is a subsidiary of Sears Holdings Corporation



Big data analytics takes a team effort

Include statisticians, data scientists and developers

Utilize both generic and Hadoop specific technologies

Consider both batch and streaming based approaches

Provide access to pre-computed view and on-the-fly query

Use both home-made and Hadoop-based commercial tools

Use web-based, mobile friendly UI

Visualize
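The difference between a pre-computed view and an on-the-fly query, mentioned in the list above, can be shown with a minimal sketch. It runs in memory on a hypothetical event list; the records and the functions `build_view`, `query_view`, and `query_on_the_fly` are illustrative names, not part of any Hadoop tool.

```python
from collections import Counter

# Hypothetical cleaned event records; in practice these would live in Hadoop.
EVENTS = [
    {"store": "S01", "category": "tools"},
    {"store": "S01", "category": "apparel"},
    {"store": "S02", "category": "tools"},
    {"store": "S02", "category": "tools"},
]

def build_view(events):
    """Batch step: pre-compute event counts per (store, category)."""
    return Counter((event["store"], event["category"]) for event in events)

def query_view(view, store, category):
    """Fast lookup against the pre-computed view."""
    return view.get((store, category), 0)

def query_on_the_fly(events, predicate):
    """Ad hoc query: scan all records and count those matching the predicate."""
    return sum(1 for event in events if predicate(event))

view = build_view(EVENTS)
tools_at_s02 = query_view(view, "S02", "tools")
all_tools = query_on_the_fly(EVENTS, lambda event: event["category"] == "tools")
print(tools_at_s02, all_tools)
```

The view answers only the questions it was built for, but answers them quickly; the on-the-fly query accepts any predicate at the cost of a full scan, which is why the slide suggests offering both.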

Page 19: MetaScale is a subsidiary of Sears Holdings Corporation

The 3 Ts of Hadoop

Continuous Iteration of Enterprise Data Flow

Page 20: MetaScale is a subsidiary of Sears Holdings Corporation

Thank You!

For further information:


[email protected]
