Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Alexey Gayduk, Roman Nikolaenko

description

Want to hear about a project that uses a technology stack of Hadoop for distributed data processing and storage, Katta for distributed storage and processing of Lucene indexes, and MongoDB for storing unstructured data? We would like to share our real-world experience with this combination: the problems we ran into and how we solved them. For example, one problem is using third-party libraries in Hadoop Map/Reduce: it all seems obvious, but how do you do it cleanly and conveniently? Or how do you launch a Hadoop job from a web application rather than from the console, and monitor its execution? Then there is the problem of storing and processing unstructured data in MySql. What data did we keep there, and why did we decide to use MongoDB? And why do we use Katta after all? All these problems and their solutions come from a real business idea, and we will tell you about all of it.

Transcript of Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Page 1: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Alexey Gayduk, Roman Nikolaenko

Page 2: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

E-commerce

Page 3: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

E-commerce

● Amazon
● Walmart
● Target
● Macy’s

Page 4: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta
Page 5: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Intelligence

Page 6: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Product

Page 7: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

What are the characteristics?

Page 8: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

● Size
● Material
● Pocket Style
● Weave type
● Hem Style
● Cleaning
● Fit

Page 9: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Target
Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket Pockets
● Cleaning: Machine Wash Cold
● Rise: Low Rise Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Page 10: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Target
Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket Pockets
● Cleaning: Machine Wash Cold
● Rise: Low Rise Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart
Wrangler® Womens Bootcut Jean - Grey

Page 11: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Target
Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket Pockets
● Cleaning: Machine Wash Cold
● Rise: Low Rise Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart
Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash,Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton,Spandex

Page 12: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

What problems do we solve?

● Data capture
● Processing and storage
● Reports

Page 13: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Data Capture

Crawler

● Distributed: uses EC2 for crawling
● Failover: a failed node is replaced with another

Page 14: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Data Capture

JSON

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

Page 15: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Processing and Storage

● Distributed data storage
● Distributed data processing

Page 16: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

HDFS (Hadoop Distributed File System)

● Files are stored as blocks
● Write once, read many times
● Reliability by replication
● One central point of access to files

http://dev-time.org/?p=893

Page 17: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Data Matching

Target
Wrangler Jeans - Women’s Bootcut Jeans

● Weave Type: Denim
● Pocket Style: 5 Pocket Pockets
● Cleaning: Machine Wash Cold
● Rise: Low Rise Rise
● Fit: 3, Mid Waist
● Decorative Details: Top Stitching
● Protective Features: Stretch
● Hem Style: Finished Hems

Walmart
Wrangler® Womens Bootcut Jean - Grey

● Weave Type: Denim
● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
● Fabric Care Instructions: Machine Wash,Tumble Dry
● Decorative Details: Top Stitching
● Fabric Content: Cotton,Spandex

Page 18: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Data Matching

Wrangler Jeans - Women’s Bootcut Jeans

Wrangler® Womens Bootcut Jean - Grey

How to match?

Page 19: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Katta (Lucene index storage)

● Distributed storage of Lucene indices
● Makes serving large or high-load indices easy
● Failover
● Data replication
● Easy to scale
● Plays well with a Hadoop cluster

Page 20: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

It's time to...

Roman Nikolaenko

Page 21: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Crawled Data

● Target
● Amazon
● eBay
● Sears
● Walmart

Page 22: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Crawled Data

"Human" Cloud

Page 23: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

"Human" Cloud

Data Storage for Reports

Page 24: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Data Storage for Reports

Page 25: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

"Human" Cloud

Page 26: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Hadoop

Page 27: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Crawled Data

Hadoop

Data Storage for Reports

Data Flow

Task Chain

Page 28: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Task Chain

{
  "Name": "Wrangler® Womens Bootcut Jean - Grey",
  "Weave Type": "Denim",
  "Pockets": "2 hip pockets, 1 watch pocket, 2 front scoop pockets",
  "Fabric Care Instructions": "Machine Wash,Tumble Dry",
  "Decorative Details": "Top Stitching",
  "Fabric Content": "Cotton,Spandex"
}

{
  "NAME": "Womens Bootcut Jean",
  "MFGR_NAME": "WRANGLER",
  "COLOR": "GREY",
  "WEAVE_TYPE": "DENIM",
  "POKETS_TYPES": "2 HIP|1 WATCH|2 FRONT SCOOP",
  "CARE_INSTRUCTIONS": "MACHINE WASH|TUMBLE DRY",
  "DECORATIVE_DETAILS": "Top Stitching",
  "CONTENT": "COTTON, SPANDEX",
  "FINGERPRINT": "Womens Bootcut Jean!WRANGLER",
  "FINGERPRINT_HASH": "3902152632",
  "MAS_PROD_ID": "72312"
}

Page 29: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Task Chain

0. Load data to HDFS
1. Attribute Name Transformation
2. Attribute Values Normalization
3. Create Fingerprints
4. Make Mappings
5. Load data to Data Storage for Reports
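Steps 1 to 3 of the chain can be illustrated on a single record. The following is only a minimal sketch: the name mapping, the normalization rule (upper-case, comma lists joined with "|"), and the use of CRC32 for the fingerprint hash are assumptions made for illustration, since the slides do not say how the real tasks are implemented. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.CRC32;

class TaskChainSketch {

    // Assumed mapping from crawled attribute names to canonical names (step 1).
    static final Map<String, String> NAME_MAP = Map.of(
            "Weave Type", "WEAVE_TYPE",
            "Fabric Care Instructions", "CARE_INSTRUCTIONS",
            "Fabric Content", "CONTENT");

    // Step 1: rename known attributes, keep unknown ones unchanged.
    static Map<String, String> transformNames(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            out.put(NAME_MAP.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    // Step 2 (assumed rule): trim, upper-case, and join comma-separated parts with '|'.
    static String normalizeValue(String value) {
        StringBuilder sb = new StringBuilder();
        for (String part : value.split(",")) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(part.trim().toUpperCase(Locale.ROOT));
        }
        return sb.toString();
    }

    // Step 3: the slide shows a fingerprint of the form NAME!MFGR_NAME.
    static String fingerprint(String name, String manufacturer) {
        return name + "!" + manufacturer;
    }

    // Assumption: CRC32 stands in for whatever hash the real system uses.
    static long fingerprintHash(String fingerprint) {
        CRC32 crc = new CRC32();
        crc.update(fingerprint.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Records from different retailers that produce the same fingerprint become candidates for the same master product in the mapping step.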

Page 30: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Task Chain

Page 31: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Task Chain

● Control
● Coordination
● Monitoring
● Comfort

Page 32: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Project Manager

[Diagram: a PROJECT is made up of Tasks, and each Task has a Task Type.]

Page 33: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

PROJECT

● TASK: http://example0.com/dataloader/service/
● TASK: http://example1.com/transformation/service/
● TASK: http://example2.com/normalization/service/
● TASK: http://example1.com/fingerprint/service/
● TASK: http://example3.com/lookup/service/
● TASK: http://example4.com/reportingLoader/service/

Page 34: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

REST API:(JSON as DTO)

GET PARAMETERS of TASK

START TASK

GET STATUS of TASK

STOP TASK
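The four operations can be read as a small service contract that every task type implements. The sketch below shows them as a plain Java interface with an in-memory implementation, without any HTTP layer. The interface, the status values, and the parameter names are hypothetical; the slides only name the operations.

```java
import java.util.List;
import java.util.Map;

class TaskServiceSketch {

    // Assumed status values; the slides do not list them.
    enum Status { NOT_STARTED, RUNNING, STOPPED }

    /** The four operations from the slide, as plain Java methods. */
    interface TaskService {
        List<String> getParameters();               // GET PARAMETERS of TASK
        void start(Map<String, String> parameters); // START TASK
        Status getStatus();                         // GET STATUS of TASK
        void stop();                                // STOP TASK
    }

    /** Hypothetical in-memory task; a real service would launch actual processing. */
    static class InMemoryTask implements TaskService {
        private final List<String> required;
        private Status status = Status.NOT_STARTED;

        InMemoryTask(List<String> required) {
            this.required = List.copyOf(required);
        }

        public List<String> getParameters() {
            return required;
        }

        public void start(Map<String, String> parameters) {
            // Refuse to start if a declared parameter is missing.
            for (String name : required) {
                if (!parameters.containsKey(name)) {
                    throw new IllegalArgumentException("missing parameter: " + name);
                }
            }
            status = Status.RUNNING;
        }

        public Status getStatus() {
            return status;
        }

        public void stop() {
            status = Status.STOPPED;
        }
    }
}
```

Because every task type exposes the same contract, a coordinator only needs the service URL to drive any step of the chain.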

Page 35: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

REST API:(JSON as DTO)

GET PARAMETERS of TASK

START TASK

GET STATUS of TASK

STOP TASK

TASK_TYPE: http://example0.com/dataloader/service/

Page 36: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

REST API:

[Diagram: several Tasks use the same Task Type through its REST API.]

Task Type: http://example0.com/dataloader/service/

Page 37: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Project Manager

[Diagram: Task 1 and Task 2 each refer to a Task Type by its Service URL; the Web Service behind that URL runs Hadoop or local processing.]

Page 38: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Web Service: Start MapReduce

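// Blocking start: returns only after the job has finished.
Job job = getMapReduceJob();
job.waitForCompletion(true);

OR

// Non-blocking start: returns immediately, so the job can be polled afterwards.
job.submit();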

Page 39: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Web Service: MapReduce monitoring

job.isComplete();

job.isSuccessful();

job.mapProgress();

job.reduceProgress();
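These four calls are enough to answer a GET STATUS request for a job started with job.submit(). A minimal sketch follows, assuming a status string of our own design; JobHandle is a hypothetical stand-in that mirrors only the four methods named on the slide, so the example does not depend on the Hadoop libraries (in Hadoop itself these calls can also throw IOException).

```java
class JobMonitorSketch {

    /** Hypothetical stand-in for the four Hadoop Job calls named on the slide. */
    interface JobHandle {
        boolean isComplete();
        boolean isSuccessful();
        float mapProgress();    // 0.0 to 1.0
        float reduceProgress(); // 0.0 to 1.0
    }

    // Assumed status format: SUCCEEDED, FAILED, or RUNNING with an averaged percentage.
    static String describe(JobHandle job) {
        if (job.isComplete()) {
            return job.isSuccessful() ? "SUCCEEDED" : "FAILED";
        }
        int percent = Math.round((job.mapProgress() + job.reduceProgress()) / 2 * 100);
        return "RUNNING " + percent + "%";
    }

    // Helper that builds a handle with fixed answers, for trying out describe().
    static JobHandle fixed(final boolean complete, final boolean successful,
                           final float map, final float reduce) {
        return new JobHandle() {
            public boolean isComplete() { return complete; }
            public boolean isSuccessful() { return successful; }
            public float mapProgress() { return map; }
            public float reduceProgress() { return reduce; }
        };
    }
}
```

Averaging map and reduce progress is one simple choice; a service could just as well report the two values separately.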

Page 40: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

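Web Service: MapReduce & Third Party Libraries

// Use ':' as the classpath separator, matching the cluster-side classpath property.
System.setProperty("path.separator", ":");
Configuration config = getConfig();
FileSystem fileSystem = getFS();
// Copy the library to HDFS, then register it on the job classpath.
fileSystem.copyFromLocalFile(source, destination);
DistributedCache.addArchiveToClassPath(destination, config, fileSystem);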

Page 41: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Web Service: MapReduce & Third Party Libraries

Job configuration file on cluster will contain:

mapred.cache.archives = hdfs://namenode.com:9000/distributedCache/gson-1.7.1.jar,...
mapred.job.classpath.archives = /distributedCache/gson-1.7.1.jar:...

Page 42: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Reporting Application

Page 43: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Dashboard

Page 44: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta
Page 45: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Get Report REST API

Reports Data Storage Facade

Create Report REST API

MongoDB cluster

Page 46: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

http://spf13.com/post/mongodb-and-hadoop

Page 47: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

MongoDB cluster

CLIENT_546 collection:
...
{"report_type":"PROD COUNT", "snapshot_time":"2011-01-15", "AMAZON":"500", "TARGET":"300", "WALMART":"900"}
...
{"report_type":"PRICE COMPARE", "MAS_PROD_ID":"72312", "snapshot_time":"2011-02-15", "NAME":"Womens Bootcut Jean", "MFGR_NAME":"WRANGLER", "AMAZON_PRICE":"50", "TARGET_PRICE":"45", "WALMART_PRICE":"55"}
...

Page 48: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

UI For Client

Reports Data Storage Facade

JSON query JSON report
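The query-to-report step can be pictured as field-equality matching over the documents of one client's collection. The sketch below does this in memory with plain maps instead of a MongoDB driver; the class, the method, and the exact-match semantics are assumptions for illustration, not the facade's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReportFacadeSketch {

    // Returns every document whose fields equal all fields given in the query.
    static List<Map<String, String>> find(List<Map<String, String>> collection,
                                          Map<String, String> query) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : collection) {
            boolean matches = true;
            for (Map.Entry<String, String> condition : query.entrySet()) {
                if (!condition.getValue().equals(doc.get(condition.getKey()))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

Since documents of different report types share one collection and have different fields, a query such as {"report_type":"PRICE COMPARE"} selects only the rows of that report.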

Page 49: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Reducer Reducer

JSON report row

Hadoop

Reports Data Storage Facade

Page 50: Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta

Contacts

Alexey Gayduk
oleksiy_gayduk
http://www.linkedin.com/pub/alexey-gayduk/4/39b/a31

Roman Nikolaenko
roman_jd_nikolaenko
http://ua.linkedin.com/pub/roman-nikolaienko/2b/413/431