HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website


Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' websites. This case study reviews the business case as well as key decisions regarding schema selection and cluster configuration. We also discuss implementation challenges and the insights gained.

Transcript of HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website

1

Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website

HBaseCon 2012

Applications Track – Case Study

2

Who Are We?

Suraj Varma, Director of Technology Implementation, Gap Inc Direct (GID), San Francisco, CA (IRC: svarma)

Gupta Gogula, Director-IT & Domain Architect of Catalog Management & Distribution, Gap Inc Direct (GID), San Francisco, CA

3

Agenda - Case Study

Problem Domain

HBase Schema Specifics

HBase Cluster Specifics

Learnings & Challenges

4

[Timeline diagram: incoming traffic by market and brand as the catalog grew: 2005 new site launch (US), 2007 Piperlime, 2008 Universality, 2009 Athleta, 2010 CA & EU markets, all backed by application servers and databases]

5

Problem Domain

Evolution of the GID apparel catalog
2005 – three independent brands in the US
2010 – five integrated brands across the US, CA, and EU

Rapid expansion of the apparel catalog

However, each brand / market combination necessitated a separate logical catalog database

6

What We Wanted …

Single catalog store for all brands/markets
Horizontally scalable over time
Cross-brand business features

Access the data store directly, to take advantage of inventory awareness of items

Minimal caching, only for optimization (keeping caches in sync is a problem)

Highly Available

7

Initial Explorations

Sharded RDBMS, Memcached, etc.
Significant effort was required; still had scalability limits

Non-relational alternatives considered

HBase POC (early 2010)
Promising results; decided to move ahead

8

Why HBase?

Strong consistency model
Server-side filters
Automatic sharding, distribution, failover
Hadoop integration out of the box

General purpose: other use cases outside of the catalog

Strong community!

9

Architecture Diagram

[Architecture diagram: incoming requests for catalog data hit backend services, which read from the HBase cluster; near real-time inventory updates, pricing updates, and item updates flow into the cluster as mutations]

10

Cluster Traffic Patterns

Read mostly: website traffic

Write / delete bursts: sync MR jobs, catalog publish (being phased out in favor of near real-time updates from originating systems), MR jobs on the live cluster

Continuous writes: inventory updates

11

Data Model & Access Patterns

Hierarchical data (primarily)
SKU -> Style lookups (child -> parent)
Cross-brand sell (sibling <-> sibling)

Rows: ~100 KB average size, 1000-5000 columns, sparse rows

Data access patterns
Full product graph in one read
Single path of the graph from root to leaf node
Search via secondary indices
Large feed files

12

Primary Access Patterns

[Diagram: read the full product graph vs. read a single path / edge]

13

HBase Schema Management

Built a custom “bean to schema mapper”
POJO graph <-> HBase qualifiers
Flexibility to shorten column qualifiers
Flexibility to change schema qualifiers (per environment / developer)

<…>
<association>one-to-many</association>
<prefix>SC</prefix>
<uniqueId>colorCd</uniqueId>
<beanName>styleColorBean</beanName>
<…>

14

Schema Example – Hierarchy

Qualifier pattern: <PP>_<id1>_<QQ>_<id2>_<RR>_<id3>_<name>, where PP is the parent prefix, QQ the child, and RR the grandchild

Examples:
cf1:VAR_1_SC_0012_colorCd
cf2:VAR_1_SC_0012_SCIMG_10_path

Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME
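As an illustration of this convention, here is a minimal sketch using the classic HBase client API (not the team's custom mapper). The table name "catalog", the written value, and the helper method are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AncestorQualifierPut {
    // Build a qualifier of the form <PP>_<id1>_<QQ>_<id2>_<name>,
    // e.g. VAR_1_SC_0012_colorCd (ancestor ids embedded in the name).
    static byte[] qualifier(String parentPrefix, String parentId,
                            String childPrefix, String childId, String attr) {
        return Bytes.toBytes(parentPrefix + "_" + parentId + "_"
                + childPrefix + "_" + childId + "_" + attr);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "catalog");    // table name is assumed
        Put put = new Put(Bytes.toBytes("KEY_5555"));  // row key is illustrative
        put.add(Bytes.toBytes("cf1"),
                qualifier("VAR", "1", "SC", "0012", "colorCd"),
                Bytes.toBytes("0012"));
        table.put(put);
        table.close();
    }
}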

15

Schema – Lookups

Secondary index: <id3> => RR ; QQ ; PP
FilterList with the (RR, QQ, PP) ids to get a thin-slice path (see the sketch below)

[Diagram: secondary-index row KEY_5555 holding the ancestor ids (e.g. 4444, 333, 22)]

Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
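The deck says only "FilterList with (RR, QQ, PP) ids", without showing the filters themselves. Below is a rough sketch using stock ColumnPrefixFilters; the table name, family, and id values are assumptions, and a shallower prefix also matches sibling subtrees, so the real code likely used tighter prefixes or a regex-based qualifier filter.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.util.Bytes;

public class ThinSliceRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "catalog");

        // Ids previously resolved from the secondary-index row for <id3>.
        String pp = "VAR_1", qq = "SC_0012", rr = "SCIMG_10";

        // Keep only qualifiers along the PP -> QQ -> RR path (approximation:
        // the shorter prefixes also admit other descendants of PP and QQ).
        FilterList path = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        path.addFilter(new ColumnPrefixFilter(Bytes.toBytes(pp + "_" + qq + "_" + rr + "_")));
        path.addFilter(new ColumnPrefixFilter(Bytes.toBytes(pp + "_" + qq + "_")));
        path.addFilter(new ColumnPrefixFilter(Bytes.toBytes(pp + "_")));

        Get get = new Get(Bytes.toBytes("KEY_5555"));  // product row key is illustrative
        get.setFilter(path);
        Result result = table.get(get);
        System.out.println("cells returned: " + result.size());
        table.close();
    }
}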

16

Schema – Future Dates, Large Files

“Publish at Midnight”
Future-dated PUTs
Get/Scan with a time range

Large feed files
Sharded into smaller chunks, < 2 MB per cell

[Diagram: row KEY_nnnn with chunk columns S_1, S_2, S_3, S_4]

Pattern: SHARDED CHUNKS
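Both patterns map onto standard client calls. A hedged sketch follows; the table name, family, qualifiers, timestamps, and chunk size are assumptions, not taken from the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FutureDatedAndChunked {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "catalog");
        byte[] cf = Bytes.toBytes("cf1");

        // Future-dated PUT: write the cell with a timestamp of midnight tonight
        // (approximated here as "now + 24h" for brevity).
        long midnight = System.currentTimeMillis() + 24L * 60 * 60 * 1000;
        Put futurePut = new Put(Bytes.toBytes("KEY_5555"));
        futurePut.add(cf, Bytes.toBytes("price"), midnight, Bytes.toBytes("49.99"));
        table.put(futurePut);

        // Readers see only cells whose timestamps fall inside "now"'s time range,
        // so the future-dated cell becomes visible at midnight without another write.
        Get get = new Get(Bytes.toBytes("KEY_5555"));
        get.setTimeRange(0, System.currentTimeMillis());
        Result current = table.get(get);

        // Large feed file sharded into chunks of at most ~2 MB, one per qualifier S_1..S_n.
        byte[] feed = new byte[5 * 1024 * 1024];   // stand-in for the feed payload
        int chunkSize = 2 * 1024 * 1024;
        Put feedPut = new Put(Bytes.toBytes("KEY_nnnn"));
        for (int i = 0, n = 1; i < feed.length; i += chunkSize, n++) {
            int len = Math.min(chunkSize, feed.length - i);
            byte[] chunk = new byte[len];
            System.arraycopy(feed, i, chunk, 0, len);
            feedPut.add(cf, Bytes.toBytes("S_" + n), chunk);
        }
        table.put(feedPut);
        table.close();
    }
}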

17

HBase Cluster

16 slave nodes (RegionServer + TaskTracker + DataNode), 8 & 16 GB RAM

3 master nodes (HMaster, ZooKeeper, JobTracker, NameNode), 8 GB RAM

NameNode failover via NFS

18

Configurations – Block Cache, GC

Block cache
Maximize the block cache: hfile.block.cache.size = 0.6

Garbage collection
MSLAB enabled
-XX:CMSInitiatingOccupancyFraction tuned so CMS starts earlier
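The block cache and MSLAB settings live in hbase-site.xml, while the CMS flag is a JVM option exported through HBASE_OPTS in hbase-env.sh. A sketch using the 0.6 value from the slide; the remaining values are assumptions.

<!-- hbase-site.xml -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.6</value>   <!-- fraction of RegionServer heap given to the block cache -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>  <!-- MSLAB reduces memstore-driven heap fragmentation -->
</property>

<!-- hbase-env.sh (illustrative):
     export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
       -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly" -->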

19

Configurations – Timeouts

Quick recovery on node failure: the default timeouts are too large

ZooKeeper: zookeeper.session.timeout
Region Server: hbase.rpc.timeout
Data Node: dfs.heartbeat.recheck.interval / heartbeat.recheck.interval
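These are server-side settings; a sketch of where they would go, with illustrative values (the deck gives no numbers):

<!-- hbase-site.xml -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>  <!-- detect a dead RegionServer sooner than the default -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>30000</value>  <!-- client RPCs to a failed node give up faster -->
</property>

<!-- hdfs-site.xml: dfs.heartbeat.recheck.interval / heartbeat.recheck.interval
     (the property name varies across Hadoop versions) governs how quickly
     a dead DataNode is declared -->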

20

Learnings – Regions

Block cache size tuning: watch block cache churn

Hot row scenarios: perf tests & phased rollouts

Hot region issues: perf tests & pre-split regions (see the sketch after this slide)

Filters: CPU-intensive; profiling needed
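Pre-splitting happens at table creation time. A minimal sketch with the classic admin API; the table name, families, and split keys are illustrative, not the team's actual values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("catalog");  // table name assumed
        desc.addFamily(new HColumnDescriptor("cf1"));
        desc.addFamily(new HColumnDescriptor("cf2"));

        // Pre-split on key ranges observed during perf tests so a catalog publish
        // spreads across RegionServers instead of hammering a single region.
        byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("KEY_2000"),
                Bytes.toBytes("KEY_4000"),
                Bytes.toBytes("KEY_6000"),
                Bytes.toBytes("KEY_8000")
        };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}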

21

Learnings – Monitoring, Hardware

Monitoring is crucial
Layer by layer -> where is the bottleneck?
Metrics to target optimization & tuning
Troubleshooting

Non-uniform hardware
Sub-optimal region distribution
Hefty boxes lightly loaded

22

Learnings – Miscellaneous

M/R jobs running on the live cluster
Have an impact, so they cannot run at full throttle; go easy …

Feature enablement: phase it in
Don't turn on several features together
Easier identification of potential hot regions / rows, overloaded RegionServers, etc.

23

Phasing In Features

[Diagram: incoming requests and backend services (inventory, pricing, and item updates) in front of the HBase cluster; feature “A” enabled adds “N” req/sec, feature “B” enabled adds a further “K” req/sec]

Enable features individually to measure impact and tune the cluster accordingly

24

Challenges – Search, Transactions

Search
No out-of-the-box secondary indexes; custom solution with Solr

Transactions
Only row-level atomicity, but you can't pack everything into a single row
Atomic cross-row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+)

25

Challenges – Optimal Schema

Orthogonal access patterns
Optimize for the most frequently used pattern

Filters
May suffice, with early-out configurations
Impact CPU usage

Duplicating data for every access pattern
Too drastic; effort to keep all copies in sync

26

Challenges - Backups

Rebuild from source data
Takes time … but no data loss

Export / import based backups
Faster … but stale; another MR job on the live cluster

Better options in future releases …

27

Gap Inc Direct

We’re hiring! http://www.gapinc.com