Cassandra Summit 2014: Huge Online Genealogical Database Driven By Cassandra
1 © 2014 by Intellectual Reserve, Inc. All rights reserved.
Huge Online Genealogical Database
Driven By Cassandra
Cassandra Summit 2014
John Sumsion
2
Outline
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra
3
What is FamilySearch?
FamilySearch.org website
Very large single pedigree (Family Tree)
Largest collection of free genealogical records
Largest genealogical library
Family History Department of the Church of Jesus Christ of Latter-day Saints (known as Mormons)
4
Why does FamilySearch exist?
Visit http://mormon.org/family-history/
5
Family Tree
Where it fits
[Diagram: Records, Indexing, Family Tree, Memories, Community]
6
Record Preservation
Neglect
Time
Disasters (e.g. WWII)
7
Record Preservation (continued)
• 100 million images published online / year
8
Indexing
3.5 billion indexed records – 35M / month
Turns this… …into this!
9
Memories
10
Community
11
Family Tree
Where it fits
[Diagram: Records, Indexing, Family Tree, Memories, Community]
12
Family Tree Data
Family Tree:
• 900M+ person records, open-edit
• 500M+ relationships, open-edit
• 8.4B change log entries, 100M+ per quarter
• Dynamic OLTP system
• Data-dependent performance issues
13
Family Tree: Example 9 Gen Pedigree
• up to 511 person slots
• Dynamic content!
14
Family Tree: Example Pedigree App
• 31+ persons per section
• Dynamic content!
15
Family Tree: Example Ancestor Page
• 10+ persons in families
• 100-1000+ changes
• Dynamic content!
16
Family Tree: Example Change History
• 100-1000+ changes
• Dynamic content!
17
Contents
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra
18
Performance & Scale
• Slow page views
  • pedigree (500-3000ms for 3 generations)
  • change history (2000+ms for first page of changes)
  • large family view
• Query problems
  • relationships connect persons, range scan by person id
  • every person => person traversal is a 200-300M btree scan (global index)
  • every change history query is an 8+B btree scan (global index)
19
Performance & Scale
• Query performance problems
[Diagram: Pedigree lookup Person → Relationship → Person is a wide range scan; Change History lookup is a wide range scan]
20
Cassandra Reimplementation
• selected Cassandra after extensive testing
• full data scale proof-of-concept & tests
• required: new data model (performance)
• required: new consistency model (critical!)
21
Cassandra Reimplementation
• event-sourced data model – journal / views
• new data model – no indexes
• new consistency model – satisfies consistency
[Diagram: journal P1 (JE #8) with P1 Views A and B; journal P2 (JE #6) with P2 Views A and B]
22
Cassandra Reimplementation
• denormalized relationships
[Diagram: persons P1 and P2 with relationships R1, R2, R3, R5, R4]
23
Cassandra Reimplementation
• denormalized relationships
[Diagram: persons P1 and P2 with relationships R1, R2, R3, R5, R4; R2 and R3 shown twice]
24
Cassandra Reimplementation
• denormalized relationships
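• exact duplication allows bidirectional traversal
[Diagram: Person/Rels to Person/Rels traversal, contrasted with the wide query Person → Relationship → Person; persons P1 and P2 with relationships R1, R2, R3, R5, R4; R2 and R3 shown twice]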
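The duplication idea on this slide can be illustrated with a minimal Python sketch. It assumes an in-memory dict stands in for per-person Cassandra partitions; `RelStore`, `add_relationship`, and `neighbors` are hypothetical names, not the FamilySearch schema.

```python
from collections import defaultdict


class RelStore:
    """Toy stand-in for per-person partitions (hypothetical, for illustration)."""

    def __init__(self):
        # partition key (person id) -> {relationship id: relationship record}
        self.partitions = defaultdict(dict)

    def add_relationship(self, rel_id, person_a, person_b, kind):
        record = {"id": rel_id, "a": person_a, "b": person_b, "kind": kind}
        # Write an exact copy under both persons so either side can be read
        # with a single partition lookup and no global index.
        self.partitions[person_a][rel_id] = dict(record)
        self.partitions[person_b][rel_id] = dict(record)

    def neighbors(self, person_id):
        # One partition read; works the same from either end of a relationship.
        found = set()
        for rel in self.partitions[person_id].values():
            found.add(rel["b"] if rel["a"] == person_id else rel["a"])
        return found


store = RelStore()
store.add_relationship("R2", "P1", "P2", "parent-child")
store.add_relationship("R1", "P1", "P0", "couple")
```

The cost of this layout is that every relationship edit has to be written twice, which is why the later slides pair it with a batch-written journal.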
25
Cassandra Reimplementation
• change history is a core feature
• denormalized change history
• optimizes for displaying recent changes
[Diagram: JE #8 in the P1 journal and in the P1 Change History View]
• P1 journal: 1000s of changes (spread over multiple Cassandra cells)
• P1 Change History View: last 100-1000 changes (local to a single Cassandra cell)
26
Contents
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra
27
Journal-based Consistency Model
Rough Process Flow: Command → Journal → Views
• Command: captures edits safely
• Journal: stores edits canonically
• Views: view-optimized summations
28
Journal-based Consistency Model
Command
• write-once with quorum
• application to journal requires 3 tables: pending / completed / aborted
• idempotent application to journal
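The talk does not spell out how the three tables are used, so the following is only a sketch under assumptions: a command sits in `pending` until it has been copied into the journal, then moves to `completed`, and `apply` can be retried safely. `CommandLog` and its methods are hypothetical names, and plain dicts and sets stand in for Cassandra tables.

```python
import uuid


class CommandLog:
    """In-memory stand-in for the pending / completed / aborted tables."""

    def __init__(self):
        self.pending = {}
        self.completed = set()
        self.aborted = set()
        self.journal = {}  # (partition key, command id) -> payload

    def submit(self, payload, affected):
        command_id = str(uuid.uuid1())
        # Write-once capture of the user action before touching the journal.
        self.pending[command_id] = (payload, tuple(affected))
        return command_id

    def apply(self, command_id):
        # Safe to call any number of times, including after an interruption.
        if command_id in self.completed or command_id in self.aborted:
            return
        payload, affected = self.pending[command_id]
        for key in affected:
            # Keyed by command id, so a repeated write overwrites itself.
            self.journal[(key, command_id)] = payload
        self.completed.add(command_id)
        del self.pending[command_id]

    def abort(self, command_id):
        self.pending.pop(command_id, None)
        self.aborted.add(command_id)
```

Because journal rows are keyed by the command id, replaying a half-finished command produces the same rows instead of duplicates.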
29
Journal-based Consistency Model
Command Schema
• key: command v1 uuid (as text)
• value: blob (binary json)
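A minimal sketch of building one command row in that shape. The function name `new_command_row` is hypothetical, and UTF-8 encoded JSON is used here only as a stand-in because the talk does not name the binary JSON format.

```python
import json
import uuid


def new_command_row(edit):
    """Build a (key, value) pair shaped like the command schema above."""
    key = str(uuid.uuid1())  # version 1 (time-based) UUID, stored as text
    # Assumption: plain UTF-8 JSON stands in for the unspecified binary JSON.
    value = json.dumps(edit, sort_keys=True).encode("utf-8")
    return key, value


# The payload contents below are made up for illustration.
key, value = new_command_row({"attribution": {}, "change": "example"})
```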
30
Journal-based Consistency Model
Journal
• write-once with quorum & C* batch
• denormalized byte-exact across affected persons & relationships
• each entry stored in separate cell (compaction required for fast journal reads)
31
Journal-based Consistency Model
Journal
• CmRDT (commutative replicated data type)
• partitions converge without conflict because of unique uuid
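Why unique UUIDs make convergence conflict-free can be shown with a small sketch. It assumes each replica's journal is a dict from command UUID to write-once bytes; `merge_journals` and the placeholder keys are hypothetical.

```python
def merge_journals(*replicas):
    """Union journal entries from several replicas, keyed by command UUID."""
    merged = {}
    for replica in replicas:
        for command_uuid, entry in replica.items():
            # Entries are write-once, so the same UUID always carries the same
            # bytes; setdefault keeps the first copy and skips duplicates.
            merged.setdefault(command_uuid, entry)
    return merged


# Placeholder keys stand in for real command UUIDs.
replica_1 = {"uuid-a": b"edit-1", "uuid-b": b"edit-2"}
replica_2 = {"uuid-b": b"edit-2", "uuid-c": b"edit-3"}
```

Merging in either order, or merging a replica with itself, gives the same result, which is the property the slide relies on.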
32
Journal-based Consistency Model
Partition Key | Command UUID   | Content (blob)
KWZ3-P71      | eda6f967-0955… | { "attribution": {}, … } (binary json)
KWZ3-P71      | 6af8d90c-8f3a… | { "attribution": {}, … } (binary json)
KCDT-J59      | fd35ac61-7def… | { "attribution": {}, … } (binary json)
KCDT-J59      | b2db2fa5-da5f… | { "attribution": {}, … } (binary json)
33
Journal-based Consistency Model
View
• multiple views for multiple uses (person, person card, change history)
• populated by applying journal entries
• incrementally updated in steady state
• not canonical data, can be recalculated
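The relationship between incremental updates and full recalculation can be sketched as a fold over the journal. The field names, sample values, and last-write-wins rule below are made up for illustration; the real view logic is not described in the talk.

```python
from functools import reduce


def apply_entry(view, entry):
    """Fold one journal entry into a 'current state' person view (toy rules)."""
    updated = dict(view)
    updated[entry["field"]] = entry["value"]
    return updated


def rebuild_view(journal):
    # Full recalculation from the canonical journal.
    return reduce(apply_entry, journal, {})


journal = [
    {"field": "name", "value": "John"},
    {"field": "birth", "value": "1850"},
]
view = rebuild_view(journal)

# Steady state: a new entry is applied to the stored view only.
new_entry = {"field": "name", "value": "John Smith"}
journal.append(new_entry)
view = apply_entry(view, new_entry)
```

As long as the incremental result equals `rebuild_view(journal)`, the view can be discarded and recomputed at any time.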
34
Journal-based Consistency Model
[Diagram: P1 journal and P1 Views A and B]
35
Journal-based Consistency Model
[Diagram: JE #8 in the P1 journal; copies of JE #8 in P1 Views A and B]
36
Journal-based Consistency Model
[Diagram: P1 journal with JE #8; P1 Views A and B, each holding JE #8, alongside A (new) and B (new)]
37
Journal-based Consistency Model
[Diagram: P1 journal and P1 Views A and B]
38
Journal-based Consistency Model
View
• views have same schema as journal
• journal entries are written to view for incremental refresh
• core of the consistency model
39
Journal-based Consistency Model
View
• CvRDT (convergent replicated data type)
• partitions converge with conflict; resolved by full view refresh from canonical journal
• steady state: one view of a given type per entity
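A rough sketch of that resolution rule, assuming a conflict shows up as anything other than exactly one stored view of a given type for an entity. How conflicts are actually detected is not stated in the talk; `resolve_views` and `count_changes` are hypothetical.

```python
def resolve_views(candidate_views, journal, rebuild):
    """Return the single view an entity should have.

    candidate_views: view copies found after replicas converge.
    rebuild: function that recomputes a view from the canonical journal.
    """
    if len(candidate_views) == 1:
        # Steady state: exactly one view of this type, use it as stored.
        return candidate_views[0]
    # Zero or several copies: do not try to merge them, recompute instead.
    return rebuild(journal)


def count_changes(journal):
    # Toy view: number of journal entries for the entity.
    return {"changes": len(journal)}


journal = ["je-1", "je-2", "je-3"]
```

Since views are not canonical, throwing away the conflicting copies loses nothing that the journal cannot reproduce.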
40
Journal-based Consistency Model
[Diagram: P1 journal with JE #8; P1 Views A and B, each holding JE #8, alongside A (new) and B (new)]
41
Journal-based Consistency Model
• Performance & Scale
  • lookup by partition key only, no indexes
  • any cross-entity change happens in duplicate on all
  • stored “current-state” views – cheapest possible read
  • custom views – tunable to different use cases
  • disposable views – able to tweak view over time
42
Journal-based Consistency Model
• Business Rule Enforcement
  • Read / Write / Read & Revert
  • pre-command checks prevent invalid changes
  • write with appropriate quorum ensures consistent write
  • post-command checks prevent business-rules conflicts
  • administrative revert marks command as “not applicable” and thereby causes full refresh which ignores changes
  • appropriate quorum: depending on the change, either LOCAL_QUORUM or EACH_QUORUM
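The Read / Write / Read & Revert sequence might look roughly like the sketch below. It assumes a toy rule (at most `MAX_PARENTS` active commands per entity) and an in-memory store; the function names and the rule are hypothetical, and the quorum writes are only noted in comments.

```python
MAX_PARENTS = 2  # toy business rule, for illustration only


def active_commands(store):
    # Commands marked "not applicable" are ignored when state is computed.
    return [c for c in store["journal"] if c["id"] not in store["reverted"]]


def pre_check(store):
    # Read: refuse a change that is already invalid against current state.
    return len(active_commands(store)) < MAX_PARENTS


def write(store, command):
    # Write: in Cassandra this would use LOCAL_QUORUM or EACH_QUORUM.
    store["journal"].append(command)


def post_check_or_revert(store, command):
    # Read & revert: a concurrent writer may have broken the rule meanwhile.
    if len(active_commands(store)) > MAX_PARENTS:
        store["reverted"].add(command["id"])
        return "reverted"
    return "applied"


store = {"journal": [{"id": "c1"}], "reverted": set()}
# Two writers both pass the pre-check before either one writes.
both_passed = pre_check(store) and pre_check(store)
write(store, {"id": "c2"})
write(store, {"id": "c3"})
first = post_check_or_revert(store, {"id": "c2"})
second = post_check_or_revert(store, {"id": "c3"})
```

No lock is taken at any point; the post-command read is what catches the race between the two writers.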
43
Journal-based Consistency Model
• Strong consistency
  • command store – atomic capture of a single user action
  • command handling – idempotent writes to journal, picked up later even if interrupted
  • no global lock needed for optimistic concurrency
• Read after write
  • consistency ONE for normal reads
  • quorum when the client knows it’s refreshing after write
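The reason a quorum read is needed right after a quorum write follows from the usual replica-overlap arithmetic: a read is guaranteed to touch a replica that acknowledged the write when read acks plus write acks exceed the replication factor. The sketch below assumes a replication factor of 3, which the talk does not state, and uses hypothetical helper names.

```python
def quorum(replication_factor):
    # Majority of replicas.
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    # At least one replica is in both sets when the counts sum past the total.
    return write_acks + read_acks > replication_factor


def read_consistency(refreshing_after_write):
    # Mirrors the bullets above: ONE normally, quorum right after a write.
    return "QUORUM" if refreshing_after_write else "ONE"


RF = 3  # assumed replication factor, not stated in the talk
```

With `RF = 3`, a quorum write followed by a read at ONE may hit a replica that has not yet received the write, which is acceptable for normal page views but not for a refresh after an edit.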
44
Journal-based Consistency Model
• Journal / View Concerns
  • native support for change history
  • no journal tombstones in steady state – write-once
  • blob schema implementable on any db engine that supports two-level keys (partition, composite)
  • consistency model implementable on any db engine that supports batches & quorum writes/reads
  • view tombstones on every write, biggest concern
  • leveled compaction?
  • WISH: size-tiered compaction with data locality hoisting
45
Contents
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra
46
Experience with Cassandra
• tested Community 1.2 and 2.0
• fantastic performance
• easy cloud setup
• great developer response
• easy to bulk load through CQL3
• harder to get running inside AWS VPC
47
Experience with Cassandra
• Bulk import experience
  • 8.4B change log records => 5.8B journal entries (2.5TB lzo)
  • ‘hi1.4xlarge’ cluster (2x 1TB SSDs)
  • import through CQL was fast enough
  • 11h to import 5-node cluster (5h on 30-node cluster)
  • 140k writes / sec, fed from 128 writer threads
  • 20 records / unlogged batch write, 1-2k record size
  • minimal post-import compaction (size-tiered)
  • ended up with 3.5-4TB on C* disk after import
• OpsCenter – great visibility for tuning
• Community – harder to automate repairs, etc.
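A quick arithmetic check on these figures, plus a sketch of grouping records into batches of 20. The talk does not say which cluster size the 140k writes / sec figure refers to; the helper names are hypothetical.

```python
from itertools import islice


def batches(records, size=20):
    """Yield lists of up to `size` records, one list per unlogged batch."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_hours(total_records, writes_per_second):
    return total_records / writes_per_second / 3600


hours = import_hours(5.8e9, 140_000)  # about 11.5 hours
```

At 140k writes / sec, 5.8B journal entries take roughly 11.5 hours, which is in line with the 11h quoted for the 5-node import.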
48
Experience with Cassandra
• Full-scale load test experience
  • got to 25x our peak hourly load on 25-28-node cluster
  • production peak load included significant write load
  • working-set size was about 2M persons in a month
  • enabled row cache, ran almost entirely without disk access
  • bottlenecked on interconnect socket w/ round robin client
  • got 50% boost from token-aware, round robin client
• OpsCenter – great visibility for tuning
• Large SSD cluster – able to handle repair during scale tests
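Token-aware routing helps because the client sends each request straight to a node that owns the partition, instead of to an arbitrary coordinator that must forward it over the interconnect. The sketch below is a simplified model with one token per node and MD5 as a stand-in hash; it is not the driver's implementation, and `Ring` and `token_for` are hypothetical names.

```python
import bisect
import hashlib


def token_for(partition_key):
    # Assumption: MD5 stands in for the cluster's partitioner hash.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; kept sorted by token for lookup.
        self.entries = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]

    def owner(self, partition_key):
        # First node whose token is >= the key's token, wrapping around.
        index = bisect.bisect_left(self.tokens, token_for(partition_key))
        return self.entries[index % len(self.entries)][1]


step = 2**64 // 4
ring = Ring({f"node{i}": i * step for i in range(4)})
coordinator = ring.owner("KWZ3-P71")
```

A token-aware client would pick `coordinator` (or another replica for the same token range) rather than cycling through every node in the cluster.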
49
Experience with Cassandra
[Chart: current system vs. cassandra impl (1x, 10x, 20x)]
50
Experience with Cassandra
[Chart: current system vs. cassandra impl (1x, 10x, 20x), LOG SCALE!]
51
Current Status
• still working on implementation & rollout
• migration, reconciliation, integration…
• consistency model code separate
52
Contents
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra
Questions?
53
Contact Info
John Sumsion
Sr. Software Engineer
@jdsumsion
Thanks to the team at FamilySearch! esp. Randy & James for doing the model
Thanks to the awesome presenters & organizers at #CassandraSummit!