Download slides - NoSQL Roadshow

39
16.05.2013 Full-text Search with NoSQL Technologies NoSQL Search Roadshow 2013, Berlin Kai Spichale

Transcript of Download slides - NoSQL Roadshow

Page 1: Download slides - NoSQL Roadshow

16.05.2013

Full-text Search with NoSQL Technologies

NoSQL Search Roadshow 2013, Berlin

Kai Spichale

Page 2: Download slides - NoSQL Roadshow

About me

► Kai Spichale

► Software Engineer at adesso AG

► NoSQL, Full-text searching, Spring, Java EE

16.05.2013 1

► adesso is among Germany‘s top IT service providers

► Consulting and software development focus

► More than 1,000 members of staff

► Some of the most important customers are Allianz, Hannover Rück, Westdeutsche Lotterie, Zurich Versicherung, DEVK, and DAK

Page 3: Download slides - NoSQL Roadshow

Motivation

NoSQL

► Exponential data growth

► Semi-structured data

► More connections

► 80 percent of business-relevant information is in unstructured form

Search

► Shift in data access:

> More full-text search

> Higher user expectations

► Keyword search and link directories become impractical

16.05.2013 2

Page 4: Download slides - NoSQL Roadshow

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 3

Page 5: Download slides - NoSQL Roadshow

Full-text search

► Techniques for searching documents in collections

► grep-like naive approach:

> Serial scanning is slow

> No negation

> No distinction between phrase and keyword search

► Build inverted index

> Term Document

> Contains references to documents for each token

16.05.2013 4

Page 6: Download slides - NoSQL Roadshow

Apache Lucene

► Java lib for full-text searches

► De facto standard for open source software

► Attributes:

> Application-agnostic

> Scalable, high performance

► Features:

> Ranked searching

> Multiple query types, faceting

> Sorting

> Multi-Index searching

16.05.2013 5

Page 7: Download slides - NoSQL Roadshow

Text Analysis

16.05.2013 6

Extraction

Parsing

Character

Filter

Tokenizer

Token Filter

Documents

de.GermanAnalyzer:

StandardTokenizer > StandardFilter

> LowerCaseFilter > StopFilter > GermanStemFilter

Inverteted

Index

Page 8: Download slides - NoSQL Roadshow

Text Analysis

16.05.2013 7

ID Term Document

1 come 2

2 dog 1

3 eat 1

4 exception 3

5 first 2

5 food 1

6 own 1

7 prove 3

8 rule 3

9 serve 2

10 your 1

Eat your

own dog

food.

First

come,

first

served.

The

exception

proves the

rule.

a

and

around

every

for

from

in

is

it

not

on

one

the

to

under

Stop word List

Page 9: Download slides - NoSQL Roadshow

Query types

Type Example

Term

(MUST, MUST_NOT, SHOULD)

+adesso –italy

Phrase „foo bar“

Wildcard fo*a?

Fuzzy fobar~

Range [A TO Z]

16.05.2013 8

Page 10: Download slides - NoSQL Roadshow

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 9

Page 11: Download slides - NoSQL Roadshow

NoSQL and Search

One size fits all approach

► Which NoSQL store satisfies our requirements best?

► Is full-text search supported?

5/16/2013 10

Data Structure Access Patterns

Volume Performance

Availability Updates

Consistency

Page 12: Download slides - NoSQL Roadshow

NoSQL and Search

5/16/2013 11

Let‘s take a closer look on:

► MongoDB

► Neo4j

► Apache Cassandra

► Apache Hadoop

Page 13: Download slides - NoSQL Roadshow

Document-oriented Databases

{ "_id" : ObjectId(„42"),

"firstname" : "John",

"lastname" : "Lennon",

"address" : { "city" : "Liverpool",

"street" : "251 Menlove Avenue“ }

}

5/16/2013 12

► Designed for storing and retrieving documents

► Semi-structured content such as BSON documents

Page 14: Download slides - NoSQL Roadshow

MongoDB

► Supports ad-hoc CRUD operations

db.things.find({firstname:"John"})

► Server-side execution of JavaScript

► Aggregations, MapReduce

► Simple keyword search with multikey indexes:

> Index array content as separate entries

16.05.2013 13

{ article : “some long text",

_keywords : [ “some" , “long" , “text“]

}

Page 15: Download slides - NoSQL Roadshow

MongoDB

► Version 2.4 supports text indexes

► Language-specific stemming based on Snowball

► Still a beta feature

16.05.2013 14

db.foo.runCommand(“text“, {search: “adesso –italy”, language: “english”})

Page 16: Download slides - NoSQL Roadshow

MongoDB

► Mongo Connector integrates MongoDB with another system (backup MongoDB cluster, Solr, elasticsearch,)

► System architecture with separate search engine possible

16.05.2013 15

MongoDB SolrMongo

Connector

1 2 3 4 5

update synccreatedocument index search

Page 17: Download slides - NoSQL Roadshow

Choosing the Right Approach

MongoDB MongoDB

+ Search Engine

Search Engine

No result set merging

Complex queries with

aggregations

Simple text search

(but experimental text

index)

Full-text search with

faceting

Complex queries with

aggregations

Result set merging

Increased complexity

(ops, dev)

No result set merging

Full-text search with

faceting

Backup?

Aggregations?

16.05.2013 16

Page 18: Download slides - NoSQL Roadshow

Graph Databases

► Stored data is represented as graph structures

> Nodes

> Edges (Relationships)

> Properties

► Universal datamodel

► Traversing

► Example: Neo4j

5/16/2013 17

id=1

name=“John“

id=2

name=“George“

id=3

name=“Paul“

friend friend

Page 19: Download slides - NoSQL Roadshow

Neo4j

► Traversing

> Visiting nodes by following relationships

> Breadth- and depth-first traversing

> Gremlin, Cypher

Result = George

5/16/2013 18

START john=node:peoplesearch(name=‘John’)

MATCH john<-[:friend]->afriend RETURN afriend

Page 20: Download slides - NoSQL Roadshow

Neo4j

► Database itself is a natural index consisting of its edges and nodes

> Example: „name“, „city“

► Auto indexing keeps track of property changes

16.05.2013 19

personRepository.findByPropertyValue("name", "John");

Page 21: Download slides - NoSQL Roadshow

Neo4j

► The default separate index engine used is Apache Lucene

16.05.2013 20

Index<PropertyContainer> index = template.getIndex("peoplesearch");

index.query("name", "Jo*");

@NodeEntity class Person {

@Indexed(indexName="peoplesearch", indexType=IndexType.FULLTEXT) private String name;

..

}

Page 22: Download slides - NoSQL Roadshow

Wide Column Store

► Google BigTable: „a sparse, distributed multi-dimensional sorted map“

► Data is organized in rows, column families, and columns

► Ideal for sharding (horizontal partitioning)

16.05.2013 21

jlennon

pmccart

gharris

name

„Lennon“

name

„McCartney“

name

„Harrison“

state

„UK“

address

„Liverpool ..“

address

„Liverpool ..“

different columns per row

unique

row keys

Page 23: Download slides - NoSQL Roadshow

Apache Cassandra

► BigTable clone

► Distributed Hash Table (Amazon Dynamo)

► Eventual consistency (configurable levels)

► Cassandra Query Language (CQL) = SQL dialect without joins

► Hadoop integration

5/16/2013 22

SELECT name FROM user WHERE firstname=„John“;

Page 24: Download slides - NoSQL Roadshow

Apache Cassandra

► Solandra = Solr using Cassandra as backend

► DataStax Enterprise Search

> One local Solr instance per Cassandra node

> Integration is based on secondary index API

> CQL supports Solr Queries

> Cassandra’s ring information is used to

construct Solr distributed search queries

16.05.2013 23

SELECT title FROM solr WHERE solr_query=‘name:jo*';

Page 25: Download slides - NoSQL Roadshow

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 24

Page 26: Download slides - NoSQL Roadshow

Apache Hadoop

► Hadoop:

> Framework for distributed processing of large data sets in computer clusters

> Distributed filesystem + MapReduce implementation

► Scalable and reliable platform of a comprehensive data analysis ecosystem

16.05.2013 25

Page 27: Download slides - NoSQL Roadshow

Hadoop MapReduce

5/16/2013 26

Persistent Data

Map Map Map Map

Transient Data

Persistent Data

Reduce Reduce Reduce

► Map Phase:

> Records are processed by map function

► Shuffle Phase:

> Distributed sort and grouping

► Reduce Phase:

> Intermediate results are processed by reduce function

Page 28: Download slides - NoSQL Roadshow

Hadoop MapReduce

► Data is processed by mappers and reducers

5/16/2013 27

map(k, v) -> [(K1,V1), (K2,V2), ... ]

reduce(Kn, [Vi, Vj, …]) ->

(Km, R)

Data

Mapper

Shuffle Reducer Result

Page 29: Download slides - NoSQL Roadshow

What kind of problems does MapReduce solve?

► Problems processed without reducer

> Searching

> File converting

> Sorting

> Map-side join

► Problem processed with reducer

> Grouping and aggregation

> Reduce-side join

► More complex problems:

> Solved by combinations of multiple MapReduce jobs

5/16/2013 28

Page 30: Download slides - NoSQL Roadshow

Hadoop MapReduce: Searching

► Search document including „A“

5/16/2013 29

1

2 1

Documents

1: A,B,C

3

4

5

Mapper emits only documents that fit

the searching criteria

4

5

2: D,E

3: B,E

4: A,D

5: A,C,E

Result = 1, 4, 5

Page 31: Download slides - NoSQL Roadshow

Hadoop MapReduce: Indexing

16.05.2013 30

HDFS

MapReduce

Job

Lucene Lucene

Index

► HDFS:

> Stores raw data

► Mapper:

> Extracts text (creates e.g. SolrInputDocument)

> Calls Lucene for indexing (calls e.g. StreamingUpdateSolrServer)

Page 32: Download slides - NoSQL Roadshow

Hadoop MapReduce: Indexing

16.05.2013 31

1

2

Daten

1: text

3

4

5

2: text

3: text

4: text

5: text

Mapper

@Override public void map( LongWritable key, Text val, OutputCollector<NullWritable, NullWritable> output, Reporter reporter) throws IOException { st = new StringTokenizer(val.toString()); lineCounter = 0; while (st.hasMoreTokens()) { doc= new SolrInputDocument(); doc.addField("id", fileName + key.toString() + lineCounter++); doc.addField("txt", st.nextToken()); try { server.add(doc); } catch (Exception exp) { … } }}

Page 33: Download slides - NoSQL Roadshow

Apache Tika

16.05.2013 32

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

► Extracts metadata and structured text content

> HTML, MS Office documents, PDF, etc.

► Stream parser can process large files

Page 34: Download slides - NoSQL Roadshow

Apache Solr / elasticsearch

5/16/2013 33

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

ela

sticse

arc

h

► Lucene is only a libary, not a standalone search engine

► Complete search engines:

> Solr

> ElasticSearch

Page 35: Download slides - NoSQL Roadshow

Apache Flume

5/16/2013 34

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

Flume

Web Server,Applikations

ela

sticse

arc

h

► Distributed service for collecting, aggregating and moving large amounts of data (e.g. log data)

► Streaming techniques

► Fault tolerant

Page 36: Download slides - NoSQL Roadshow

Alternatives

5/16/2013 35

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

Flume

Web Server,Apps, DBs

Crawler DistCp Sqoop

ela

sticse

arc

h

► Nutch Crawler creates one entry in CrawlDB per URL

► Hadoop DistCp copies data within and between hadoop systems

► Apache Sqoop transfers bulk data between Hadoop and RDMBS

Page 37: Download slides - NoSQL Roadshow

Apache Hadoop

5/16/2013 36

Web Content,Intranet

Loading

Tools

Hadoop

Search Analysis Export Visualization

► Fundamental mismatch:

> MapReduce for batch processing

> Lucene for interactive searching

► MapReduce for indexing large datasets

► Basis for (offline) BigData solutions

Page 38: Download slides - NoSQL Roadshow

Summary

► More semi-structured data

► Increasing relevance of full-text searching

► Combination of NoSQL and Lucene:

> MongoDB: integration via MongoDB Connector

> Neo4j: native Lucene integration

> Cassandra: Datastax‘s Solr integration

> Hadoop: indexing large datasets with MapReduce

► Alternative: search engine as document-oriented database

5/16/2013 37

Page 39: Download slides - NoSQL Roadshow

Thank you for your attention!

16.05.2013 38