Geoffrey Hendrey @geoffhendrey Architecture for real-time ad-hoc query on distributed filesystems.

16
Geoffrey Hendrey @geoffhendrey Architecture for real- time ad-hoc query on distributed filesystems

Transcript of Geoffrey Hendrey @geoffhendrey Architecture for real-time ad-hoc query on distributed filesystems.

Geoffrey Hendrey@geoffhendrey

Architecture for real-time ad-hoc query on distributed filesystems

Motivation

• Big Data is more opaque than small data– Spreadsheets choke– BI tools can’t scale– Small samples often fail to replicate issues

• Engineers, data scientists, analysts need:– Faster “time to answer” on Big Data– Rapid “find, quantify, extract”

• Solve “I don’t know what I don’t know”• This is NOT about looking up items in a

product catalog (i.e. not a consumer search problem)

Scaling search with classic sharding

Classic “side system” approach• Definition of KLUDGE: “a system and

especially a computer system made up of poorly matched components” –Merriam-Webster

Hadoop SearchCluster

?????

Classic “search toolkit”

• Built around fulltext use case• Inverted Indexes optimized for on-the-fly

ranking of results– TF-IDF– Okapi BM-25

• Yet never able to fully realize google-style search capability

• Issues:– Phrase detection– Pseudo synonymy– Open loop architecture

Big data ad-hoc query

• Not typically a fulltext “document search” problem• Data is structured, mixed structured, and

denormalized– Log lines– Json records– CSV files– Hadoop native formats (SequenceFile)

• Ranking is explicit (ORDER BY), not relevance based

• Sometimes “needle in haystack” (support, debugging)

• Sometimes “haystack in haystack” (summary analytics, segmentation)

Dremel MPP query execution tree

Finer points of Dremel architecture• MapReduce friendly• In-Situ approach is DFS friendly• Excels at aggregation. Not so much for needle-in-

haystack.• Column storage format accelerates mapreduce

(less extraneous data pushed through)• But in some regards still a “side system”• Applications must explicitly store their data in a

columnar format• “massive” is both a benefit and a hazard

– Complex (operationally and WRT query execution)– Queries can execute quickly…on huge clusters

Crawled In-Situ Index Architecture

HDFSMapReduce

Data Crawl

In-situ Index

SimpleSearch

Application

Hadoop

Benefits to crawled In-Situ index• No changes to application data format

– CSV– JSON– SequenceFile

• Clear “separation of concerns” between data and index

• Indexes become “disposable”: easily built, easily thrown away

• There is no “side system” that needs to be maintained

• Use the mapreduce “hammer” to pound a nail

Architect for Elasticity

AWS S3

Elastic MapReduce

JetS3tEC2

M1.large

ApplicationCrawl

Index

HTTP

Interesting: you don’t actually need to have hadoop installed…

Declarative Crawl Indexing

HDFSMapReduce

Data Crawl

In-situ Index

SimpleSearch

Application

Hadoop

{"filter”:"column[4]==\"athens\"" }

Parse.json

• Indexer reads declarative instructions from in-situ file• “pull” vs. traditional “push” indexing approach

Thin index

• Index size is small because data is a holistic part of the system

• data does not need to be “put into” the search system and repicated in the index.

HDFSMapReduce

Data Crawl

In-situ Index

Data

Index

Lazy data loading

HDFSMapReduce

Data Crawl

ExecutionRuntime

Data

IndexLRU

IndexCache

Lazy Pull

Lazy Pull

Column Oriented Approach