
Transcript
Page 1: Large-scale Web Apps @ Pinterest

Large Scale Web Apps @Pinterest (Powered by Apache HBase)

May 5, 2014

Page 2: Large-scale Web Apps @ Pinterest

Pinterest is a visual discovery tool for collecting the things you love, and discovering related content along the way.

What is Pinterest ?

Page 3: Large-scale Web Apps @ Pinterest
Page 4: Large-scale Web Apps @ Pinterest

Challenges @scale

• 100s of millions of pins/repins per month
• Billions of requests per week
• Millions of daily active users
• Billions of pins
• One of the largest discovery tools on the internet

Page 5: Large-scale Web Apps @ Pinterest

Storage stack @Pinterest

• MySQL
• Redis (persistence and for cache)
• MemCache (consistent hashing)

[Diagram: the sharding logic lives in the app tier, which manually shards data across the storage backends]
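The pain point is that this sharding logic is application code. As a rough illustration (not Pinterest's actual code; class and host names are made up), manual sharding means every caller routes through something like:

public final class ShardRouter {
    private final String[] shardHosts;   // e.g. "mysql-shard-0007", fixed at provisioning time

    public ShardRouter(String[] shardHosts) {
        this.shardHosts = shardHosts;
    }

    /** Pick the physical MySQL host that owns this user's data. */
    public String hostFor(long userId) {
        // Clear the sign bit so the modulo is always non-negative.
        int shard = (int) ((userId & Long.MAX_VALUE) % shardHosts.length);
        return shardHosts[shard];
    }
}

Every feature has to carry this routing, and re-sharding means touching it everywhere - part of what the next slide's "easily add/remove nodes" point addresses.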

Page 6: Large-scale Web Apps @ Pinterest

Why HBase?

• High write throughput - unlike MySQL/B-Tree, writes never seek on disk
• Seamless integration with Hadoop
• Distributed operation
  - Fault tolerance
  - Load balancing
  - Easily add/remove nodes

Non-technical reasons
• Large active community
• Large-scale online use cases

Page 7: Large-scale Web Apps @ Pinterest

Outline

• Features powered by HBase
• SaaS (Storage as a Service)
  - MetaStore
  - HFile Service (Terrapin)
• Our HBase setup - optimizing for high availability & low latency

Page 8: Large-scale Web Apps @ Pinterest

Applications/Features

• Offline
  - Analytics
  - Search indexing
  - ETL/Hadoop workflows
• Online
  - Personalized feeds
  - Rich Pins
  - Recommendations



Page 9: Large-scale Web Apps @ Pinterest

Personalized Feeds

WHY HBASE? Write-heavy load due to pin fanout: each new pin is written into the feed of every follower (see the sketch below).

[Diagram: a user's feed is assembled from the users they follow plus recommended pins]
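A rough sketch of what this write fanout looks like against HBase (0.94-era client API; the column family and row-key layout are assumptions for illustration, not the actual feed schema):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedFanout {
    /** Write one new pin into the feed row of every follower. */
    public static void fanOut(HTable feeds, long pinId, byte[] pinBlob,
                              List<Long> followerIds) throws IOException {
        List<Put> puts = new ArrayList<Put>(followerIds.size());
        long reverseTs = Long.MAX_VALUE - System.currentTimeMillis(); // newest entries sort first
        for (long followerId : followerIds) {
            byte[] row = Bytes.add(Bytes.toBytes(followerId), Bytes.toBytes(reverseTs));
            Put put = new Put(row);
            put.add(Bytes.toBytes("f"), Bytes.toBytes(pinId), pinBlob);
            puts.add(put);
        }
        feeds.put(puts); // one batched burst of writes per pin
    }
}

A single popular pin turns into many thousands of such Puts, which is why sustained write throughput mattered more here than a read-optimized B-tree.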

Page 10: Large-scale Web Apps @ Pinterest

Rich Pins

WHY HBASE? Bloom filters make negative hits (lookups for keys that are not stored) cheap.
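A bloom filter is a small per-HFile structure kept in memory that can say "this row is definitely not in this file", so a lookup for a missing key never has to read data blocks from disk. A minimal sketch of enabling it on a column family (the family name is illustrative; in 0.94 the enum is nested as StoreFile.BloomType, this uses the 0.96-era name):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class RichPinsSchema {
    /** Column family with row-level bloom filters, making negative hits cheap. */
    public static HColumnDescriptor richPinsFamily() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBloomFilterType(BloomType.ROW);
        return cf;
    }
}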

Page 11: Large-scale Web Apps @ Pinterest

Recommendations

WHY HBASE? Seamless data transfer from Hadoop (sketched after the diagram).

[Diagram: recommendations are generated on a Hadoop 1.0 cluster, copied by DistCP jobs to a Hadoop 2.0 cluster, and served from an HBase + Hadoop 2.0 serving cluster]
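The standard way to move Hadoop output into HBase without going through the write path is bulk loading: the job writes HFiles (e.g. via HFileOutputFormat), DistCP copies them to the serving cluster, and the bulk-load tool adopts them into the table. A minimal sketch with an assumed table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadRecommendations {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "recommendations");   // illustrative table name
        // args[0]: directory of HFiles already copied onto the serving cluster's HDFS
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[0]), table);
        table.close();
    }
}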

Page 12: Large-scale Web Apps @ Pinterest

SaaS

• Large number of feature requests
• 1 cluster per feature
• Scaling with organizational growth
• Need for “defensive” multi-tenant storage
• Previous solutions reaching their limits

Page 13: Large-scale Web Apps @ Pinterest

MetaStore I

• Key-value store on top of HBase
• 1 HBase table per feature, with salted keys (sketched below)
• Pre-split tables
• Table-level rate limiting (online/offline reads/writes)
• No scan support
• Simple client API


string getValue(string feature, string key, boolean online);
void setValue(string feature, string key, string value, boolean online);
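A minimal sketch of the salting idea: a one-byte, hash-derived prefix spreads otherwise hot or sequential keys across the pre-split regions, so no single region server absorbs a feature's whole load. The bucket count is an assumption and has to match the table's split points:

import org.apache.hadoop.hbase.util.Bytes;

public final class SaltedKeys {
    private static final int BUCKETS = 32; // assumed; must equal the number of pre-split regions

    /** Row key for a feature's table: one salt byte + the caller's key. */
    public static byte[] salt(String key) {
        byte bucket = (byte) ((key.hashCode() & 0x7fffffff) % BUCKETS);
        return Bytes.add(new byte[] { bucket }, Bytes.toBytes(key));
    }
}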

Page 14: Large-scale Web Apps @ Pinterest

MetaStore II

[Diagram: clients issue gets/sets over Thrift to the MetaStore Thrift server, which applies salting and rate limiting against a primary and a secondary HBase cluster kept in sync by master/master replication; the MetaStore config (rate limits, primary cluster) lives in ZooKeeper, which notifies the server of changes]
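The slides do not say how the rate limiting is implemented; as a hypothetical sketch, the Thrift server could keep one limiter per feature/operation pair, with the limits themselves coming from the ZooKeeper-backed config in the diagram (Guava's RateLimiter is used here purely for illustration):

import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ConcurrentHashMap;

public final class TableRateLimiter {
    private final ConcurrentHashMap<String, RateLimiter> limiters =
            new ConcurrentHashMap<String, RateLimiter>();

    /** Returns false if the request should be throttled. */
    public boolean allow(String feature, String op, double permitsPerSecond) {
        String key = feature + "/" + op;            // e.g. "rich_pins/online_read"
        RateLimiter limiter = limiters.get(key);
        if (limiter == null) {
            limiters.putIfAbsent(key, RateLimiter.create(permitsPerSecond));
            limiter = limiters.get(key);
        }
        return limiter.tryAcquire();
    }
}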

Page 15: Large-scale Web Apps @ Pinterest

HFile Service (Terrapin)

• Solve the bulk upload problem
• HBase-backed solution
  - Bulk upload + major compact
  - Major compact to delete old data
• Design solution from scratch using a mashup of:
  - HFile
  - HBase BlockCache
  - Avoid compactions
  - Low-latency key-value lookups (see the sketch below)
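To make the "mashup" concrete, here is an illustrative (not Terrapin's actual code) point lookup served straight from an HFile using HBase's own reader and block cache, with no region server and no compactions in the path. This follows the 0.94-era HFile interface, which has changed in later releases:

import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class HFileLookup {
    public static byte[] get(Path hfilePath, byte[] key) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
        try {
            HFileScanner scanner = reader.getScanner(true /* cacheBlocks */, true /* pread */);
            if (scanner.seekTo(key) == 0) {           // 0 means an exact match was found
                ByteBuffer value = scanner.getValue();
                byte[] result = new byte[value.remaining()];
                value.get(result);
                return result;
            }
            return null;                              // negative lookup
        } finally {
            reader.close();
        }
    }
}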


Page 16: Large-scale Web Apps @ Pinterest

High Level Architecture I

[Diagram: ETL/batch jobs load/reload HFiles onto Amazon S3; HFile servers, each holding multiple HFiles, pull them from S3; a client library/service issues key/value lookups against the HFile servers]

Page 17: Large-scale Web Apps @ Pinterest

High Level Architecture II

• Each HFile server runs 2 processes
  - Copier: pulls HFiles from S3 to local disk
  - Supershard: serves multiple HFile shards to clients
• ZooKeeper
  - Detecting live servers
  - Coordinating loading/swapping of new data
  - Enabling clients to detect availability of new data (see the sketch after this list)
• Loader module (replaces DistCP)
  - Triggers the new data copy
  - Triggers the swap through ZooKeeper
  - Updates ZooKeeper and notifies clients
• Client library understands sharding
• Old data deleted by a background process
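A hypothetical sketch of the "clients detect availability of new data" piece: the loader writes the current shard layout/version to a znode and every client keeps a watch on it. The znode path and payload format are made up for illustration:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DataVersionWatcher implements Watcher {
    private static final String ZNODE = "/terrapin/fileset/current";   // illustrative path
    private final ZooKeeper zk;
    private volatile byte[] currentLayout;   // opaque shard-mapping blob written by the loader

    public DataVersionWatcher(String quorum) throws Exception {
        this.zk = new ZooKeeper(quorum, 30000, this);
        refresh();
    }

    private void refresh() throws Exception {
        // Reading with 'this' as the watcher re-registers the watch each time.
        currentLayout = zk.getData(ZNODE, this, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh();   // pick up the new shard mapping after a swap
            } catch (Exception e) {
                // A real client would retry with backoff and alert on repeated failure.
            }
        }
    }
}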


Page 18: Large-scale Web Apps @ Pinterest

Salient Features

• Multi-tenancy through namespacing
• Pluggable sharding functions - modulus, range & more
• HBase BlockCache
• Multiple clusters for redundancy
• Speculative execution across clusters for low latency (see the sketch below)
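The speculative-execution point can be sketched with plain Java concurrency: issue the same lookup against both redundant clusters and take whichever answers first, trading extra load for lower tail latency. The cluster interface here is a stand-in, not a real Pinterest or HBase class:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpeculativeReader {
    /** Stand-in for a per-cluster key/value client. */
    public interface KeyValueCluster {
        byte[] get(byte[] key) throws Exception;
    }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    public byte[] read(final KeyValueCluster primary, final KeyValueCluster secondary,
                       final byte[] key) throws Exception {
        List<Callable<byte[]>> lookups = new ArrayList<Callable<byte[]>>();
        lookups.add(new Callable<byte[]>() {
            public byte[] call() throws Exception { return primary.get(key); }
        });
        lookups.add(new Callable<byte[]>() {
            public byte[] call() throws Exception { return secondary.get(key); }
        });
        return pool.invokeAny(lookups);   // first successful result wins; the other is cancelled
    }
}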


Page 19: Large-scale Web Apps @ Pinterest

Setting up for Success

• Many online use cases/applications
• Optimize for:
  - Low MTTR (high availability)
  - Low latency (performance)


Page 20: Large-scale Web Apps @ Pinterest

MTTR - I

DataNode states: LIVE -> STALE after 20 sec -> DEAD after a further 9 min 40 sec


• Stale nodes avoided (config sketch below):
  - as candidates for reads
  - as candidate replicas for writes
  - during lease recovery
• Copying of under-replicated blocks starts when a node is marked as “Dead”

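The stale-node behaviour is controlled by the HDFS settings introduced in HDFS-3703/HDFS-3912 (referenced on the next slide). They normally live in hdfs-site.xml on the NameNode; shown programmatically here for brevity, with the 20-second interval matching the timeline above:

import org.apache.hadoop.conf.Configuration;

public class StaleNodeSettings {
    public static Configuration apply(Configuration conf) {
        conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L); // LIVE -> STALE
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);  // skip stale nodes for reads
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true); // skip stale nodes as write replicas
        return conf;
    }
}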

Page 21: Large-scale Web Apps @ Pinterest

MTTR - II

[Diagram: recovery pipeline of Failure Detection -> Lease Recovery -> Log Split -> Recover Regions, completing in under 2 minutes; failure detection relies on a 30 sec ZooKeeper session timeout, and the HDFS stages on HDFS-4721, HDFS-3703 and HDFS-3912]

• Avoid stale nodes at each point of the recovery process
• Multi-minute timeouts ==> multi-second timeouts
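For the failure-detection stage, the 30-second figure corresponds to the region server's ZooKeeper session timeout, normally set in hbase-site.xml; a sketch of the equivalent programmatic setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailureDetection {
    public static Configuration apply() {
        Configuration conf = HBaseConfiguration.create();
        // The master notices a dead region server after one session timeout,
        // down from the multi-minute default of that era.
        conf.setInt("zookeeper.session.timeout", 30 * 1000);
        return conf;
    }
}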

Page 22: Large-scale Web Apps @ Pinterest

Simulate, Simulate, Simulate

Simulate “pull the plug” failures and “tail -f” the logs
• kill -9 both the DataNode and the RegionServer
  - causes connection refused errors
• kill -STOP both the DataNode and the RegionServer
  - causes socket timeouts
• Blackhole hosts using iptables
  - connect timeouts + “No route to host”
  - most representative of AWS failures

Page 23: Large-scale Web Apps @ Pinterest

Performance

Configuration tweaks
• Small block size, 4K-16K
• Prefix compression to cache more - when data is in the key, close to 4X reduction for some data sets
• Separation of RPC handler threads for reads vs writes
• Short-circuit local reads
• HBase-level checksums (HBASE-5074)
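Illustrative settings for some of the tweaks above; the column family and values are examples, not Pinterest's actual schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class PointReadTuning {
    public static HColumnDescriptor family() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBlocksize(8 * 1024);                          // small blocks, in the 4K-16K range
        cf.setDataBlockEncoding(DataBlockEncoding.PREFIX);  // prefix compression of keys
        return cf;
    }

    public static Configuration serverConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("dfs.client.read.shortcircuit", true);       // short-circuit local reads
        conf.setBoolean("hbase.regionserver.checksum.verify", true); // HBase-level checksums (HBASE-5074)
        return conf;
    }
}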

Hardware
• SATA (m1.xl/c1.xl) and SSD (hi1.4xl)
• Choose based on the limiting factor:
  - Disk space - pick SATA for max GB/$$
  - IOPs - pick SSD for max IOPs/$$; clusters with heavy reads or heavy compaction activity

Page 24: Large-scale Web Apps @ Pinterest

Performance (SSDs)

HFile read performance
• Turn off block cache for data blocks - reduces GC + heap fragmentation
• Keep block cache on for index blocks
• Increase “dfs.client.read.shortcircuit.streams.cache.size” from 100 to 10,000 (with short-circuit reads)
• Approx. 3X improvement in read throughput
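A sketch of those read-path settings: disabling BLOCKCACHE on a family stops data blocks from being cached (index and bloom blocks still are), which is what cuts GC and heap fragmentation when data lives on SSD; the streams cache size is the HDFS client property named above. The family name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;

public class SsdReadTuning {
    public static HColumnDescriptor family() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBlockCacheEnabled(false);   // data blocks are read straight from SSD
        return cf;
    }

    public static Configuration clientConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 10000); // default was 100
        return conf;
    }
}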

Write performance
• WAL contention when the client sets AutoFlush=true
• HBASE-8755
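HBASE-8755 is the server-side fix (a reworked WAL write/sync thread model). On the client side, a general way to reduce per-Put overhead - not necessarily what Pinterest did - is to buffer writes by turning off autoflush (0.94-era API; table and family names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
    public static void write(Configuration conf) throws IOException {
        HTable table = new HTable(conf, "feeds");      // illustrative table
        table.setAutoFlush(false);                     // buffer Puts on the client
        table.setWriteBufferSize(2 * 1024 * 1024);     // flush roughly every 2 MB
        for (int i = 0; i < 10000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
            table.put(put);                            // goes to the buffer, not the wire
        }
        table.flushCommits();                          // send whatever remains
        table.close();
    }
}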

Page 25: Large-scale Web Apps @ Pinterest

In the Pipeline...

• Building a graph database on HBase
• Disaster recovery - snapshot + incremental backup + restore
• Off-heap cache - reduce GC overhead and make better use of hardware
• Read path optimizations

Page 26: Large-scale Web Apps @ Pinterest

And we are Hiring !!