Mozscape: NoSQL at Terabyte Scale
Phil Smith, Software Engineer
What We Do
SEO & Inbound Marketing Metrics
www.opensiteexplorer.org
Collect back links across the web
Compute metrics estimating value
Serve links and metrics with API and OSE
How We Do
~25-30 billion pages per month
20 Crawler machines
~256 MB/sec aggregate download rate
Crawl the Web
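As a rough sanity check, the crawl figures above can be related to each other with a little arithmetic. This is a back-of-envelope sketch only: it assumes 1 MB = 10^6 bytes, a 30-day month, and a download rate sustained around the clock, none of which is stated on the slides. The function names are made up for illustration.

```python
# Back-of-envelope check of the crawl figures.
# Assumptions (not from the talk): 1 MB = 10**6 bytes, 30-day month,
# and the aggregate rate is sustained 24 hours a day.
MB = 10**6
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def bytes_per_month(rate_mb_per_sec):
    """Total bytes downloaded in one month at a constant rate."""
    return rate_mb_per_sec * MB * SECONDS_PER_MONTH


def avg_bytes_per_page(rate_mb_per_sec, pages_per_month):
    """Average download size per crawled page."""
    return bytes_per_month(rate_mb_per_sec) / pages_per_month


total = bytes_per_month(256)
print(f"{total / 10**12:.0f} TB downloaded per month")
for billions in (25, 30):
    per_page = avg_bytes_per_page(256, billions * 10**9)
    print(f"{billions} billion pages -> {per_page / 1000:.1f} KB per page")
print(f"{256 / 20:.1f} MB/sec per crawler machine")
```

Under those assumptions the numbers work out to roughly 660 TB downloaded per month, or about 22 to 27 KB per page, which suggests the three figures on the slide are mutually consistent.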
How We Do
1:5 to 1:50 Compression Ratios
Aggregates are Parallelized Linear Scans
Communication Avoided where Possible
Compute Aggregates and Metrics
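The idea of an aggregate as a parallelized linear scan with minimal communication can be sketched as follows. This is an illustration, not Mozscape's code: the inbound-link count, the `(source, target)` record shape, and the function names are all assumptions. Each worker scans its own chunk front to back and the only data exchanged is the small per-chunk result.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk):
    """Linear scan over one chunk: count inbound links per target."""
    counts = Counter()
    for _source, target in chunk:
        counts[target] += 1
    return counts


def aggregate(chunks, workers=4):
    """Scan chunks independently, then merge the small partial results."""
    # Workers share nothing while scanning; communication happens only
    # once, when the per-chunk counters are merged below.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical link records, already split into two chunks.
chunks = [
    [("a.com", "x.com"), ("b.com", "x.com")],
    [("c.com", "y.com"), ("a.com", "y.com"), ("d.com", "x.com")],
]
link_counts = aggregate(chunks)
print(link_counts)
```

Threads keep the sketch short; for CPU-bound scans in CPython a process pool, or separate machines each owning a set of chunks, would be the more realistic choice.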
How We Do
~12 TB per Release in Amazon S3
6 m2.4xlarge Instances for Cache
~28k Requests per Minute
Surface with a Read-Only API
Observations and Strategy
Billions of Small, Similar Records
De-normalization Avoids Complex Joins
Batch-style Emphasizes Spatial Locality
Data Layout
Column-Orientation exploits Locality
Broken into 5GB chunks for S3
~64KB Compression Runs within each chunk
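A column-oriented layout with independently compressed runs might look something like the sketch below. It is a guess at the general shape, not the actual file format: the fixed-width 64-bit values, the `(first_row, payload)` run tuples, and the function names are assumptions, and `zlib` stands in for the real compressor because it ships with Python. The 5GB chunking for S3 is left out.

```python
import struct
import zlib

RUN_BYTES = 64 * 1024  # target uncompressed size of one compression run


def write_column(values, run_bytes=RUN_BYTES):
    """Pack one column of unsigned 64-bit ints into compressed runs."""
    per_run = run_bytes // 8  # 8 bytes per value
    runs = []
    for start in range(0, len(values), per_run):
        block = values[start:start + per_run]
        raw = struct.pack(f"<{len(block)}Q", *block)
        # Each run is compressed on its own, so it can be read on its own.
        runs.append((start, zlib.compress(raw)))
    return runs


def read_run(run):
    """Decompress a single run without touching the rest of the column."""
    _first_row, payload = run
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


# Hypothetical (id, metric_a, metric_b) records.
rows = [(i, i % 7, 1000 + i) for i in range(20000)]
columns = [list(col) for col in zip(*rows)]  # row-oriented -> column-oriented
stored = [write_column(col) for col in columns]
print([len(runs) for runs in stored])
```

Storing each column contiguously puts similar values next to each other, which is what lets both the scans and the compressor benefit from locality.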
Compression
Tuned to Overcome Disk Read Bound
By-Column, Run & Gap Encoding on LZO
Customized Pipelines per Column
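Run encoding, gap encoding, and per-column pipelines can be illustrated with a small sketch. Everything specific here is an assumption: the column names `page_id` and `status`, the JSON serialization, and the use of `zlib` as a stand-in for LZO (which is not in the Python standard library). The point is only that each column picks the encoders that suit its data before a general-purpose compressor runs.

```python
import json
import zlib


def gap_encode(values):
    """Store each value as the gap from its predecessor (suits sorted IDs)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def gap_decode(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out


def run_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out


def run_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]


# Hypothetical per-column pipelines: (encoders, matching decoders).
PIPELINES = {
    "page_id": ([gap_encode], [gap_decode]),
    "status": ([run_encode], [run_decode]),
}


def compress_column(name, values):
    encoders, _ = PIPELINES[name]
    for encode in encoders:
        values = encode(values)
    return zlib.compress(json.dumps(values).encode())  # zlib stands in for LZO


def decompress_column(name, blob):
    _, decoders = PIPELINES[name]
    values = json.loads(zlib.decompress(blob))
    for decode in reversed(decoders):
        values = decode(values)
    return values


ids = [100, 103, 104, 110]
statuses = [200, 200, 200, 404]
print(gap_encode(ids), run_encode(statuses))
```

Gap encoding turns large, increasing IDs into small numbers, and run encoding shrinks columns dominated by a few repeated values; both leave the byte-level compressor with more regular input.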
Job Control
Each Stage has Parallel, Idempotent Tasks
Tasks are Processes with a simple Command Line
stdout and exit code are logged to track state
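A minimal sketch of this style of job control is shown below, assuming a local log directory and a hypothetical `run_task` helper; the file naming and the skip-if-already-succeeded check are illustrative choices, not details from the talk. Because tasks are idempotent, rerunning one after a failure is safe, and the logged exit code is enough to tell which tasks still need to run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(name, argv, log_dir):
    """Run one task as a child process and log its stdout and exit code."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    exit_file = log_dir / f"{name}.exit"
    # A logged exit code of 0 means an earlier run finished; skip the task.
    if exit_file.exists() and exit_file.read_text().strip() == "0":
        return 0
    result = subprocess.run(argv, capture_output=True, text=True)
    (log_dir / f"{name}.stdout").write_text(result.stdout)
    exit_file.write_text(str(result.returncode))
    return result.returncode


with tempfile.TemporaryDirectory() as logs:
    # Stand-in task: any command-line program would do here.
    argv = [sys.executable, "-c", "print('scanned chunk 0')"]
    first = run_task("scan-0", argv, logs)
    second = run_task("scan-0", argv, logs)  # skipped: already succeeded
    logged = (Path(logs) / "scan-0.stdout").read_text()
    print(first, second, logged.strip())
```

Keeping tasks as plain processes means any of them can also be launched by hand from a shell when debugging a failed stage.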
Checkpoints
[Diagram: timeline along a Time axis showing Table Scan, Checkpoint, and Barrier steps, with S3 as a labeled element]
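One possible reading of the diagram is that each table scan stage runs to completion behind a barrier and progress is then recorded in a checkpoint, so a restarted pipeline resumes after the last finished stage. The sketch below illustrates that reading only; the local JSON file stands in for a checkpoint in S3, and `run_stages` and its file format are hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_stages(stages, checkpoint_path, workers=4):
    """Run stages in order; each stage is a list of zero-argument tasks."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text())["done"] if path.exists() else 0
    executed = []
    for index, tasks in enumerate(stages):
        if index < done:
            continue  # finished before the last checkpoint
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Barrier: list() blocks until every task in the stage returns.
            list(pool.map(lambda task: task(), tasks))
        # Checkpoint: record that this stage is complete.
        path.write_text(json.dumps({"done": index + 1}))
        executed.append(index)
    return executed


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    log = []
    stages = [
        [lambda: log.append("scan-1a"), lambda: log.append("scan-1b")],
        [lambda: log.append("scan-2")],
    ]
    first = run_stages(stages, checkpoint)
    second = run_stages(stages, checkpoint)  # nothing left to do
    print(first, second, sorted(log))
```

If a task raises, the stage's checkpoint is never written, so the next run repeats that stage from the start; this only works because the tasks are idempotent.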
Indexing
Columns have BDBs indexing by ID
Subset of IDs map to Compression Runs
Decompress Run and Scan to find Record
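The lookup path described above, a sparse index that points at a compression run followed by a scan inside the decompressed run, can be sketched as follows. This is an assumption-laden illustration: a sorted Python list with `bisect` stands in for the BDB (presumably Berkeley DB) index, `zlib` stands in for the real compressor, and the record layout and run size are invented.

```python
import bisect
import struct
import zlib

RECORD = struct.Struct("<QI")  # hypothetical record: 64-bit ID, 32-bit metric


def build(records, per_run=4):
    """Build (index_ids, runs) from records sorted by ID."""
    index_ids, runs = [], []
    for start in range(0, len(records), per_run):
        block = records[start:start + per_run]
        index_ids.append(block[0][0])  # only the first ID of each run is indexed
        runs.append(zlib.compress(b"".join(RECORD.pack(*r) for r in block)))
    return index_ids, runs


def lookup(index_ids, runs, wanted):
    """Find the run that could hold `wanted`, decompress it, and scan it."""
    pos = bisect.bisect_right(index_ids, wanted) - 1
    if pos < 0:
        return None  # smaller than every indexed ID
    raw = zlib.decompress(runs[pos])
    for rec_id, metric in RECORD.iter_unpack(raw):
        if rec_id == wanted:
            return metric
    return None


records = [(i * 10, i * i) for i in range(10)]  # IDs 0, 10, ..., 90
index_ids, runs = build(records)
print(index_ids, lookup(index_ids, runs, 50), lookup(index_ids, runs, 55))
```

Indexing only a subset of IDs keeps the index small relative to the data, at the cost of decompressing and scanning one run per lookup.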
Physical Deployment
Crawlers run in Colo for white-listed IPs
Batch Process and API layer in EC2
The API could run in a colo too, but ELB + Auto Scaling are nice
Questions?
We’re Hiring!