Indexing big data in the cloud

Scott Stults
Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java

Big Data

Big Data Wrangler

Address a Real Project
Be Agile

Make Small Mistaeks Fast
Succeed BIG

USPTO Goals

Prototype Search UX

Prove Solr:
Scales
Integrates
Excels

Scale?

Our Approach

KISS
YAGNI

(This space intentionally left blank)

Minimal Flair

Record Everything!

Some Numbers

Doc Count: 1.1 Million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75M
File Size: 300M
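
A quick sanity check on these figures, as a minimal Python sketch. It assumes that "75M" means roughly 75 MB per zip file, which the slide does not state; the function and constant names are illustrative and not taken from the project.

```python
# Figures copied from the slide; the MB unit is an assumption.
DOC_COUNT = 1_100_000
ZIP_FILES = 313
ZIP_SIZE_MB = 75  # assumed to be megabytes per zip file


def average_docs_per_zip(doc_count: int, zip_files: int) -> float:
    """Average number of documents per zip file."""
    return doc_count / zip_files


def total_zip_size_gb(zip_files: int, zip_size_mb: float) -> float:
    """Total size of all zip files in GB, using 1 GB = 1000 MB."""
    return zip_files * zip_size_mb / 1000


if __name__ == "__main__":
    print(f"{average_docs_per_zip(DOC_COUNT, ZIP_FILES):.0f} docs per zip on average")
    print(f"{total_zip_size_gb(ZIP_FILES, ZIP_SIZE_MB):.1f} GB of zip files in total")
```

Under that assumption the average is roughly 3,500 documents per zip file, in the same ballpark as the slide's rounded 4,000, and the zipped corpus is a little over 23 GB.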

Testing

Start some servers
Process a batch
Check the clock
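
The "process a batch, check the clock" step could be sketched as a small timing helper. This is an illustration only: `time_batch` and `fake_index` are hypothetical names, and the sleep stands in for whatever the real indexing step does.

```python
import time
from typing import Callable, Iterable, Tuple


def time_batch(process: Callable[[str], None], batch: Iterable[str]) -> Tuple[int, float]:
    """Run `process` on every item and return (item_count, elapsed_seconds)."""
    started = time.perf_counter()
    count = 0
    for item in batch:
        process(item)
        count += 1
    return count, time.perf_counter() - started


if __name__ == "__main__":
    # Stand-in for real indexing work; replace with the actual batch step.
    def fake_index(zip_name: str) -> None:
        time.sleep(0.01)

    count, elapsed = time_batch(fake_index, [f"batch-{i}.zip" for i in range(5)])
    print(f"{count} files in {elapsed:.2f}s")
```

Writing each run's count and elapsed time to a log fits the "Record Everything!" habit and makes runs with different node counts comparable.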

start_nodes

start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}

Gut Check

How fast can we do this?

What can we do in parallel?

Scaling

Raise our instance limit

xargs -P
GNU parallel
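
The deck does this fan-out with shell tools. As a rough analogue only, the same idea of capping the number of concurrent jobs can be sketched in Python, with `concurrent.futures` swapped in for `xargs -P`. `run_parallel` and its `max_procs` parameter are hypothetical names chosen to echo the `-P` flag.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_parallel(task: Callable[[str], str], items: Sequence[str], max_procs: int) -> List[str]:
    """Apply `task` to each item with at most `max_procs` running at once.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_procs) as pool:
        return list(pool.map(task, items))


if __name__ == "__main__":
    # Placeholder task; a real one might post one zip file's documents to Solr.
    print(run_parallel(lambda name: f"indexed {name}", ["a.zip", "b.zip", "c.zip"], max_procs=2))
```

Threads suit I/O-bound steps such as remote calls; CPU-bound work would call for processes instead.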

Shortcomings

SSH?
Error recovery

One Solr

Alternatives

CloudFormation
Puppet / Chef

Multiple Cores / Shards
Hadoop

Success

Victory Lap

Instances / Time

Thank You

https://github.com/sstults/patent-indexing

@scottstults
#o19s
