Indexing big data in the cloud

description

Amazon Web Services offers a quick and easy way to build a scalable search platform. That flexibility is especially useful when an initial data load requires hardware that is no longer needed afterward for day-to-day searching and adding new documents. This presentation covers one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case, 150 worker nodes generated 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.

Transcript of Indexing big data in the cloud

Page 1: Indexing big data in the cloud

Indexing Big Data in the Cloud

Page 2: Indexing big data in the cloud

Me

Scott Stults

Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java

Page 3: Indexing big data in the cloud

Eric

Page 4: Indexing big data in the cloud

Big Data

Page 5: Indexing big data in the cloud

Big Data Wrangler

Page 6: Indexing big data in the cloud

How?

Address a Real Project

Be Agile

Make Small Mistaeks Fast

Succeed BIG

Page 7: Indexing big data in the cloud

USPTO Goals

Prototype Search UX

Prove Solr:

Scales

Integrates

Excels

Page 8: Indexing big data in the cloud

Scale?

Page 9: Indexing big data in the cloud

Our Approach

KISS

YAGNI

(This space intentionally left blank)

Page 10: Indexing big data in the cloud

Minimal Flair

Page 11: Indexing big data in the cloud

Record Everything!

Page 12: Indexing big data in the cloud

Some Numbers

Doc Count 1.1 Million

Zip Files 313

Docs per Zip File 4,000

Zip File Size 75M

File Size 300M

Page 13: Indexing big data in the cloud

Testing

Start some servers

Process a batch

Check the clock
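
The test loop on this slide can be sketched in shell. This is a minimal sketch, not the project's own script: `process_batch`, `time_batch`, and `estimate_seconds` are hypothetical names, and the estimate assumes batches are spread evenly across nodes and all take about the same time.

```shell
# Hypothetical stand-in for the real work of indexing one zip file.
process_batch() {
  :  # replace with the actual unzip / transform / post-to-Solr steps
}

# Time a single batch and print the elapsed whole seconds.
time_batch() {
  start=$(date +%s)
  process_batch "$@"
  end=$(date +%s)
  echo $(( end - start ))
}

# estimate_seconds BATCH_SECS TOTAL_BATCHES NODES
# Prints the estimated wall-clock seconds for the busiest node.
estimate_seconds() {
  per_node=$(( ($2 + $3 - 1) / $3 ))   # ceiling division: batches per node
  echo $(( per_node * $1 ))
}
```

For example, `estimate_seconds 600 313 50` prints 4200: with 313 zip files across 50 nodes, the busiest node handles 7 batches, so ten-minute batches would take about 70 minutes. The 600-second batch time is an illustrative figure, not a measurement from the talk.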

Page 14: Indexing big data in the cloud

start_nodes

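# Launch $MAX_NODES m1.large workers and save the launch output to ~/run-output.
# Each block-device mapping has the form device=[snapshot]:[size in GiB]:[delete on termination].
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}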
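
Because `start_nodes` saves the launch output to `~/run-output`, that file could later be used to release the workers when the job is done. The following is a sketch under assumptions, not something shown on the slides: it assumes the tab-delimited output of the classic EC2 API tools, where each launched instance appears on a line starting with `INSTANCE` followed by its instance ID. The names `instance_ids` and `stop_nodes` are hypothetical.

```shell
# Print the instance IDs recorded in a saved ec2-run-instances output file.
# Assumes lines of the form: INSTANCE <instance-id> <ami-id> ...
instance_ids() {
  awk '$1 == "INSTANCE" { print $2 }' "$1"
}

# Terminate every instance listed in the given file (default: ~/run-output).
stop_nodes() {
  ids=$(instance_ids "${1:-$HOME/run-output}")
  [ -n "$ids" ] || return 0   # nothing recorded, nothing to terminate
  # Intentionally unquoted so each ID becomes its own argument.
  ec2-terminate-instances $ids
}
```

Keeping the ID extraction in its own function makes it possible to check what would be terminated before calling `stop_nodes`.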

Page 15: Indexing big data in the cloud

Gut Check

How fast can we do this?

What can we do in parallel?

Page 16: Indexing big data in the cloud

Scaling

Raise our instance limit

xargs -P

GNU parallel
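
Both tools named on this slide cap how many jobs run at once. Here is a minimal sketch with `xargs -P`, assuming one job per zip file. The function name `index_all` and the `echo` placeholder are hypothetical and stand in for the real per-file work.

```shell
# index_all DIR [JOBS]
# Run one job per .zip file in DIR, with at most JOBS (default 4) at a time.
index_all() {
  find "$1" -name '*.zip' -print0 |
    xargs -0 -n 1 -P "${2:-4}" sh -c 'echo "indexed $1"' _
}
```

Calling `index_all /data/zips 8` would process every zip file under the (hypothetical) `/data/zips` directory, eight at a time. With GNU parallel, the second stage of the pipeline could instead be written as `parallel -0 -j 8 echo indexed {}`.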

Page 17: Indexing big data in the cloud

Shortcomings

SSH?

Error recovery

One Solr

Page 18: Indexing big data in the cloud

Alternatives

CloudFormation

Puppet / Chef

Multiple Cores / Shards

Hadoop

Page 19: Indexing big data in the cloud

Success

Page 20: Indexing big data in the cloud

Victory Lap

Page 21: Indexing big data in the cloud

Instances / Time

Page 22: Indexing big data in the cloud

Thank You

https://github.com/sstults/patent-indexing

@scottstults #o19s