
Transcript
Page 1: Large-scale Web Apps @ Pinterest

Large Scale Web Apps @Pinterest (Powered by Apache HBase)

May 5, 2014

Page 2: Large-scale Web Apps @ Pinterest

Pinterest is a visual discovery tool for collecting the things you love, and discovering related content along the way.

What is Pinterest ?

Page 3: Large-scale Web Apps @ Pinterest
Page 4: Large-scale Web Apps @ Pinterest

Challenges @scale

• 100s of millions of pins/repins per month
• Billions of requests per week
• Millions of daily active users
• Billions of pins
• One of the largest discovery tools on the internet

Page 5: Large-scale Web Apps @ Pinterest

Storage stack @Pinterest

• MySQL
• Redis (persistence and for cache)
• MemCache (consistent hashing)

[Diagram: the sharding logic lives in the app tier, which manually shards data across the storage backends]
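The pain point is that this sharding logic is application code. As a rough illustration (not Pinterest's actual code; class and host names are made up), manual sharding means every caller routes through something like:

public final class ShardRouter {
    private final String[] shardHosts;   // e.g. "mysql-shard-0007", fixed at provisioning time

    public ShardRouter(String[] shardHosts) {
        this.shardHosts = shardHosts;
    }

    /** Pick the physical MySQL host that owns this user's data. */
    public String hostFor(long userId) {
        // Clear the sign bit so the modulo is always non-negative.
        int shard = (int) ((userId & Long.MAX_VALUE) % shardHosts.length);
        return shardHosts[shard];
    }
}

Every feature has to carry this routing, and re-sharding means touching it everywhere - part of what the next slide's "easily add/remove nodes" point addresses.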

Page 6: Large-scale Web Apps @ Pinterest

Why HBase?

• High write throughput - unlike MySQL/B-Tree, writes never seek on disk
• Seamless integration with Hadoop
• Distributed operation
  - Fault tolerance
  - Load balancing
  - Easily add/remove nodes

Non-technical reasons
• Large active community
• Large-scale online use cases

Page 7: Large-scale Web Apps @ Pinterest

Outline

• Features powered by HBase
• SaaS (Storage as a Service)
  - MetaStore
  - HFile Service (Terrapin)
• Our HBase setup - optimizing for high availability & low latency

Page 8: Large-scale Web Apps @ Pinterest

Applications/Features

• Offline
  - Analytics
  - Search indexing
  - ETL/Hadoop workflows
• Online
  - Personalized feeds
  - Rich Pins
  - Recommendations



Page 9: Large-scale Web Apps @ Pinterest

Personalized Feeds

WHY HBASE? Write-heavy load due to pin fanout: each new pin is written into the feed of every follower (see the sketch below).

[Diagram: a user's feed is assembled from the users they follow plus recommended pins]
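A rough sketch of what this write fanout looks like against HBase (0.94-era client API; the column family and row-key layout are assumptions for illustration, not the actual feed schema):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedFanout {
    /** Write one new pin into the feed row of every follower. */
    public static void fanOut(HTable feeds, long pinId, byte[] pinBlob,
                              List<Long> followerIds) throws IOException {
        List<Put> puts = new ArrayList<Put>(followerIds.size());
        long reverseTs = Long.MAX_VALUE - System.currentTimeMillis(); // newest entries sort first
        for (long followerId : followerIds) {
            byte[] row = Bytes.add(Bytes.toBytes(followerId), Bytes.toBytes(reverseTs));
            Put put = new Put(row);
            put.add(Bytes.toBytes("f"), Bytes.toBytes(pinId), pinBlob);
            puts.add(put);
        }
        feeds.put(puts); // one batched burst of writes per pin
    }
}

A single popular pin turns into many thousands of such Puts, which is why sustained write throughput mattered more here than a read-optimized B-tree.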

Page 10: Large-scale Web Apps @ Pinterest

Rich Pins

WHY HBASE? Bloom filters make negative hits (lookups for keys that are not stored) cheap.
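A bloom filter is a small per-HFile structure kept in memory that can say "this row is definitely not in this file", so a lookup for a missing key never has to read data blocks from disk. A minimal sketch of enabling it on a column family (the family name is illustrative; in 0.94 the enum is nested as StoreFile.BloomType, this uses the 0.96-era name):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class RichPinsSchema {
    /** Column family with row-level bloom filters, making negative hits cheap. */
    public static HColumnDescriptor richPinsFamily() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBloomFilterType(BloomType.ROW);
        return cf;
    }
}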

Page 11: Large-scale Web Apps @ Pinterest

Recommendations

WHY HBASE? Seamless data transfer from Hadoop (sketched after the diagram).

[Diagram: recommendations are generated on a Hadoop 1.0 cluster, copied by DistCP jobs to a Hadoop 2.0 cluster, and served from an HBase + Hadoop 2.0 serving cluster]
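The standard way to move Hadoop output into HBase without going through the write path is bulk loading: the job writes HFiles (e.g. via HFileOutputFormat), DistCP copies them to the serving cluster, and the bulk-load tool adopts them into the table. A minimal sketch with an assumed table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadRecommendations {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "recommendations");   // illustrative table name
        // args[0]: directory of HFiles already copied onto the serving cluster's HDFS
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[0]), table);
        table.close();
    }
}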

Page 12: Large-scale Web Apps @ Pinterest

SaaS

• Large number of feature requests
• 1 cluster per feature
• Scaling with organizational growth
• Need for “defensive” multi-tenant storage
• Previous solutions reaching their limits

Page 13: Large-scale Web Apps @ Pinterest

MetaStore I

• Key-value store on top of HBase
• 1 HBase table per feature, with salted keys (sketched below)
• Pre-split tables
• Table-level rate limiting (online/offline reads/writes)
• No scan support
• Simple client API


string getValue(string feature, string key, boolean online);
void setValue(string feature, string key, string value, boolean online);
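A minimal sketch of the salting idea: a one-byte, hash-derived prefix spreads otherwise hot or sequential keys across the pre-split regions, so no single region server absorbs a feature's whole load. The bucket count is an assumption and has to match the table's split points:

import org.apache.hadoop.hbase.util.Bytes;

public final class SaltedKeys {
    private static final int BUCKETS = 32; // assumed; must equal the number of pre-split regions

    /** Row key for a feature's table: one salt byte + the caller's key. */
    public static byte[] salt(String key) {
        byte bucket = (byte) ((key.hashCode() & 0x7fffffff) % BUCKETS);
        return Bytes.add(new byte[] { bucket }, Bytes.toBytes(key));
    }
}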

Page 14: Large-scale Web Apps @ Pinterest

MetaStore II

[Diagram: clients issue gets/sets over Thrift to the MetaStore Thrift server, which applies salting and rate limiting against a primary and a secondary HBase cluster kept in sync by master/master replication; the MetaStore config (rate limits, primary cluster) lives in ZooKeeper, which notifies the server of changes]
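The slides do not say how the rate limiting is implemented; as a hypothetical sketch, the Thrift server could keep one limiter per feature/operation pair, with the limits themselves coming from the ZooKeeper-backed config in the diagram (Guava's RateLimiter is used here purely for illustration):

import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ConcurrentHashMap;

public final class TableRateLimiter {
    private final ConcurrentHashMap<String, RateLimiter> limiters =
            new ConcurrentHashMap<String, RateLimiter>();

    /** Returns false if the request should be throttled. */
    public boolean allow(String feature, String op, double permitsPerSecond) {
        String key = feature + "/" + op;            // e.g. "rich_pins/online_read"
        RateLimiter limiter = limiters.get(key);
        if (limiter == null) {
            limiters.putIfAbsent(key, RateLimiter.create(permitsPerSecond));
            limiter = limiters.get(key);
        }
        return limiter.tryAcquire();
    }
}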

Page 15: Large-scale Web Apps @ Pinterest

HFile Service (Terrapin)

• Solve the bulk upload problem
• HBase-backed solution
  - Bulk upload + major compact
  - Major compact to delete old data
• Design solution from scratch using a mashup of:
  - HFile
  - HBase BlockCache
  - Avoid compactions
  - Low-latency key-value lookups (see the sketch below)
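To make the "mashup" concrete, here is an illustrative (not Terrapin's actual code) point lookup served straight from an HFile using HBase's own reader and block cache, with no region server and no compactions in the path. This follows the 0.94-era HFile interface, which has changed in later releases:

import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class HFileLookup {
    public static byte[] get(Path hfilePath, byte[] key) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
        try {
            HFileScanner scanner = reader.getScanner(true /* cacheBlocks */, true /* pread */);
            if (scanner.seekTo(key) == 0) {           // 0 means an exact match was found
                ByteBuffer value = scanner.getValue();
                byte[] result = new byte[value.remaining()];
                value.get(result);
                return result;
            }
            return null;                              // negative lookup
        } finally {
            reader.close();
        }
    }
}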


Page 16: Large-scale Web Apps @ Pinterest

High Level Architecture I

[Diagram: ETL/batch jobs load/reload HFiles onto Amazon S3; HFile servers, each holding multiple HFiles, pull them from S3; a client library/service issues key/value lookups against the HFile servers]

Page 17: Large-scale Web Apps @ Pinterest

High Level Architecture II

• Each HFile server runs 2 processes
  - Copier: pulls HFiles from S3 to local disk
  - Supershard: serves multiple HFile shards to clients
• ZooKeeper
  - Detecting live servers
  - Coordinating loading/swapping of new data
  - Enabling clients to detect availability of new data (see the sketch after this list)
• Loader module (replaces DistCP)
  - Triggers the new data copy
  - Triggers the swap through ZooKeeper
  - Updates ZooKeeper and notifies clients
• Client library understands sharding
• Old data deleted by a background process
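A hypothetical sketch of the "clients detect availability of new data" piece: the loader writes the current shard layout/version to a znode and every client keeps a watch on it. The znode path and payload format are made up for illustration:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DataVersionWatcher implements Watcher {
    private static final String ZNODE = "/terrapin/fileset/current";   // illustrative path
    private final ZooKeeper zk;
    private volatile byte[] currentLayout;   // opaque shard-mapping blob written by the loader

    public DataVersionWatcher(String quorum) throws Exception {
        this.zk = new ZooKeeper(quorum, 30000, this);
        refresh();
    }

    private void refresh() throws Exception {
        // Reading with 'this' as the watcher re-registers the watch each time.
        currentLayout = zk.getData(ZNODE, this, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh();   // pick up the new shard mapping after a swap
            } catch (Exception e) {
                // A real client would retry with backoff and alert on repeated failure.
            }
        }
    }
}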


Page 18: Large-scale Web Apps @ Pinterest

Salient Features

• Multi-tenancy through namespacing
• Pluggable sharding functions - modulus, range & more
• HBase BlockCache
• Multiple clusters for redundancy
• Speculative execution across clusters for low latency (see the sketch below)
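The speculative-execution point can be sketched with plain Java concurrency: issue the same lookup against both redundant clusters and take whichever answers first, trading extra load for lower tail latency. The cluster interface here is a stand-in, not a real Pinterest or HBase class:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpeculativeReader {
    /** Stand-in for a per-cluster key/value client. */
    public interface KeyValueCluster {
        byte[] get(byte[] key) throws Exception;
    }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    public byte[] read(final KeyValueCluster primary, final KeyValueCluster secondary,
                       final byte[] key) throws Exception {
        List<Callable<byte[]>> lookups = new ArrayList<Callable<byte[]>>();
        lookups.add(new Callable<byte[]>() {
            public byte[] call() throws Exception { return primary.get(key); }
        });
        lookups.add(new Callable<byte[]>() {
            public byte[] call() throws Exception { return secondary.get(key); }
        });
        return pool.invokeAny(lookups);   // first successful result wins; the other is cancelled
    }
}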


Page 19: Large-scale Web Apps @ Pinterest

Setting up for Success

• Many online use cases/applications
• Optimize for:
  - Low MTTR (high availability)
  - Low latency (performance)


Page 20: Large-scale Web Apps @ Pinterest

MTTR - I

DataNode states: LIVE -> STALE after 20 sec -> DEAD after a further 9 min 40 sec


• Stale nodes avoided (config sketch below):
  - as candidates for reads
  - as candidate replicas for writes
  - during lease recovery
• Copying of under-replicated blocks starts when a node is marked as “Dead”

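The stale-node behaviour is controlled by the HDFS settings introduced in HDFS-3703/HDFS-3912 (referenced on the next slide). They normally live in hdfs-site.xml on the NameNode; shown programmatically here for brevity, with the 20-second interval matching the timeline above:

import org.apache.hadoop.conf.Configuration;

public class StaleNodeSettings {
    public static Configuration apply(Configuration conf) {
        conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L); // LIVE -> STALE
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);  // skip stale nodes for reads
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true); // skip stale nodes as write replicas
        return conf;
    }
}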

Page 21: Large-scale Web Apps @ Pinterest

MTTR - II

[Diagram: recovery pipeline of Failure Detection -> Lease Recovery -> Log Split -> Recover Regions, completing in under 2 minutes; failure detection relies on a 30 sec ZooKeeper session timeout, and the HDFS stages on HDFS-4721, HDFS-3703 and HDFS-3912]

• Avoid stale nodes at each point of the recovery process
• Multi-minute timeouts ==> multi-second timeouts
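For the failure-detection stage, the 30-second figure corresponds to the region server's ZooKeeper session timeout, normally set in hbase-site.xml; a sketch of the equivalent programmatic setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailureDetection {
    public static Configuration apply() {
        Configuration conf = HBaseConfiguration.create();
        // The master notices a dead region server after one session timeout,
        // down from the multi-minute default of that era.
        conf.setInt("zookeeper.session.timeout", 30 * 1000);
        return conf;
    }
}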

Page 22: Large-scale Web Apps @ Pinterest

Simulate, Simulate, Simulate

Simulate “pull the plug” failures and “tail -f” the logs
• kill -9 both the DataNode and the RegionServer
  - causes connection refused errors
• kill -STOP both the DataNode and the RegionServer
  - causes socket timeouts
• Blackhole hosts using iptables
  - connect timeouts + “No route to host”
  - most representative of AWS failures

Page 23: Large-scale Web Apps @ Pinterest

Performance

Configuration tweaks
• Small block size, 4K-16K
• Prefix compression to cache more - when data is in the key, close to 4X reduction for some data sets
• Separation of RPC handler threads for reads vs writes
• Short-circuit local reads
• HBase-level checksums (HBASE-5074)
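Illustrative settings for some of the tweaks above; the column family and values are examples, not Pinterest's actual schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class PointReadTuning {
    public static HColumnDescriptor family() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBlocksize(8 * 1024);                          // small blocks, in the 4K-16K range
        cf.setDataBlockEncoding(DataBlockEncoding.PREFIX);  // prefix compression of keys
        return cf;
    }

    public static Configuration serverConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("dfs.client.read.shortcircuit", true);       // short-circuit local reads
        conf.setBoolean("hbase.regionserver.checksum.verify", true); // HBase-level checksums (HBASE-5074)
        return conf;
    }
}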

Hardware
• SATA (m1.xl/c1.xl) and SSD (hi1.4xl)
• Choose based on the limiting factor:
  - Disk space - pick SATA for max GB/$$
  - IOPs - pick SSD for max IOPs/$$; clusters with heavy reads or heavy compaction activity

Page 24: Large-scale Web Apps @ Pinterest

Performance (SSDs)

HFile read performance
• Turn off block cache for data blocks - reduces GC + heap fragmentation
• Keep block cache on for index blocks
• Increase “dfs.client.read.shortcircuit.streams.cache.size” from 100 to 10,000 (with short-circuit reads)
• Approx. 3X improvement in read throughput
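A sketch of those read-path settings: disabling BLOCKCACHE on a family stops data blocks from being cached (index and bloom blocks still are), which is what cuts GC and heap fragmentation when data lives on SSD; the streams cache size is the HDFS client property named above. The family name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;

public class SsdReadTuning {
    public static HColumnDescriptor family() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBlockCacheEnabled(false);   // data blocks are read straight from SSD
        return cf;
    }

    public static Configuration clientConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 10000); // default was 100
        return conf;
    }
}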

Write performance
• WAL contention when the client sets AutoFlush=true
• HBASE-8755
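HBASE-8755 is the server-side fix (a reworked WAL write/sync thread model). On the client side, a general way to reduce per-Put overhead - not necessarily what Pinterest did - is to buffer writes by turning off autoflush (0.94-era API; table and family names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
    public static void write(Configuration conf) throws IOException {
        HTable table = new HTable(conf, "feeds");      // illustrative table
        table.setAutoFlush(false);                     // buffer Puts on the client
        table.setWriteBufferSize(2 * 1024 * 1024);     // flush roughly every 2 MB
        for (int i = 0; i < 10000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
            table.put(put);                            // goes to the buffer, not the wire
        }
        table.flushCommits();                          // send whatever remains
        table.close();
    }
}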

Page 25: Large-scale Web Apps @ Pinterest

In the Pipeline...

• Building a graph database on HBase
• Disaster recovery - snapshot + incremental backup + restore
• Off-heap cache - reduce GC overhead and make better use of hardware
• Read path optimizations

Page 26: Large-scale Web Apps @ Pinterest

And we are Hiring !!