Indexing big data in the cloud

Post on 29-Aug-2014

Amazon Web Services offers a quick and easy way to build a scalable search platform. That flexibility is especially useful when an initial data load requires hardware that is no longer needed for day-to-day searching and adding new documents. This presentation covers one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case, 150 worker nodes generated 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.

Transcript of Indexing big data in the cloud

Indexing Big Data in the Cloud

Me

Scott Stults

Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java

Eric

Big Data

Big Data Wrangler

How?

Address a Real Project

Be Agile

Make Small Mistaeks Fast

Succeed BIG

USPTO Goals

Prototype Search UX

Prove Solr:

Scales

Integrates

Excels

Scale?

Our Approach

KISS

YAGNI

(This space intentionally left blank)

Minimal Flair

Record Everything!
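
One way to act on this slide is to have every worker append a timestamped line for each batch it touches, so that progress can be reconstructed afterwards. The sketch below is an assumption about what such a log might look like, not the talk's actual script: the function name `record_event`, the `LOG_FILE` variable, and the tab-separated field layout are all hypothetical.

```shell
# Hypothetical progress log: one tab-separated line per event.
LOG_FILE="${LOG_FILE:-/tmp/indexing-progress.log}"

record_event() {
  # $1 = batch name (for example a zip file), $2 = status (started, done, failed)
  # Fields written: UTC timestamp, node name, batch, status.
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$1" "$2" >> "$LOG_FILE"
}

record_event batch-001.zip started
record_event batch-001.zip done
```

Because every line carries the node name and a timestamp, logs collected from many workers can simply be concatenated and sorted.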

Some Numbers

Doc Count: 1.1 Million

Zip Files: 313

Docs per Zip File: 4,000

Zip File Size: 75M

File Size: 300M
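
These figures can be turned into rough totals with a little shell arithmetic. This is a back-of-the-envelope sketch under stated assumptions: "M" is read as megabytes, "File Size" is taken to be the size of each zip file once unpacked, and the node count of 50 comes from the talk description rather than from this slide.

```shell
ZIP_FILES=313
ZIP_SIZE_MB=75      # "Zip File Size 75M", read as megabytes (assumption)
FILE_SIZE_MB=300    # "File Size 300M", assumed to be the unzipped size
NODES=50            # peak worker count quoted in the talk description

total_zip_mb=$(( ZIP_FILES * ZIP_SIZE_MB ))
total_raw_mb=$(( ZIP_FILES * FILE_SIZE_MB ))
zips_per_node=$(( (ZIP_FILES + NODES - 1) / NODES ))   # ceiling division

echo "compressed total:   ${total_zip_mb} MB"
echo "uncompressed total: ${total_raw_mb} MB"
echo "zip files per node: ${zips_per_node}"
```

Under those assumptions the load is roughly 23 GB compressed and 94 GB unpacked, with at most 7 zip files per worker when 50 nodes share the work.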

Testing

Start some servers

Process a batch

Check the clock
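
The "check the clock" step can be as simple as wrapping the batch command in a timer. The following is a minimal sketch, assuming a wrapper named `time_batch` (a hypothetical name) and using `sleep 1` as a stand-in for a real batch.

```shell
time_batch() {
  # Run the given command and report wall-clock seconds on stderr,
  # so the command's own stdout stays clean.
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start )) exit_status=$status" >&2
  return "$status"
}

# Stand-in for processing one batch.
time_batch sleep 1
```

The wrapper passes the command's exit status through, so a failed batch is still visible to whatever script calls it.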

start_nodes

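# Launch $MAX_NODES m1.large instances and save the command output
# (which lists the new instances) to ~/run-output.
# Mapping format: device=[snapshot-id]:[size]:[delete-on-termination]
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}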
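
Since `start_nodes` saves the launch output, a matching teardown step can read the instance IDs back from that file when the job is done. The sketch below is not from the talk: the names `instance_ids` and `stop_nodes` are hypothetical, and it assumes the saved output contains whitespace-separated lines that begin with `INSTANCE` followed by the instance ID, as the legacy EC2 command line tools print.

```shell
RUN_OUTPUT="${RUN_OUTPUT:-$HOME/run-output}"

instance_ids() {
  # Print the second field of every line whose first field is INSTANCE.
  awk '$1 == "INSTANCE" { print $2 }' "$1"
}

stop_nodes() {
  # Terminate every instance recorded by start_nodes.
  # The command substitution is left unquoted on purpose so that each
  # instance ID becomes a separate argument.
  ec2-terminate-instances $(instance_ids "$RUN_OUTPUT")
}
```

Calling `stop_nodes` after the last batch finishes hands the workers back to the cloud; nothing is terminated until that function is invoked.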

Gut Check

How fast can we do this?

What can we do in parallel?

Scaling

Raise our instance limit

xargs -P

GNU parallel
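
As an illustration of the `xargs -P` option named on this slide, the sketch below fans a list of batch names out to a fixed number of concurrent jobs. The batch names, the `MAX_JOBS` variable, and the `run_parallel` function are hypothetical, and the inline `echo` stands in for whatever command actually indexes a batch.

```shell
MAX_JOBS="${MAX_JOBS:-4}"

run_parallel() {
  # Reads one batch name per line on stdin and runs up to MAX_JOBS at once.
  # The inline "echo" is a placeholder for a real per-batch command.
  xargs -n 1 -P "$MAX_JOBS" sh -c 'echo "indexed $1"' _
}

printf '%s\n' batch-001.zip batch-002.zip batch-003.zip | run_parallel

# Roughly equivalent with GNU parallel, if it is installed:
#   parallel -j "$MAX_JOBS" echo indexed {} ::: batch-001.zip batch-002.zip batch-003.zip
```

Jobs finish in whatever order they complete, so the output lines are not guaranteed to match the input order.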

Shortcomings

SSH?

Error recovery

One Solr
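
Error recovery is listed here as a shortcoming of the approach. One lightweight way it could be addressed is a retry wrapper around each batch command. This is only a sketch and not part of the original scripts: the `retry` function is hypothetical, and it retries immediately with no delay between attempts.

```shell
retry() {
  # Usage: retry MAX_ATTEMPTS command [args...]
  # Re-runs the command until it succeeds or MAX_ATTEMPTS is reached.
  max_attempts=$1
  shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
  done
}

# Example: "true" succeeds on the first attempt.
retry 3 true
```

A real deployment would probably also want a pause between attempts and a log entry for each failure.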

Alternatives

CloudFormation

Puppet / Chef

Multiple Cores / Shards

Hadoop

Success

Victory Lap

Instances / Time

Thank You

https://github.com/sstults/patent-indexing

@scottstults

#o19s