Wanna search? Piece of cake!

A fast, scalable, and easy-to-set-up search engine for your data.

by Alexey Kursov
http://www.linkedin.com/in/kursov

WTF?

ElasticSearch is a
● distributed
● RESTful
● free/open source search server
● based on Apache Lucene.

It is developed by Shay Banon (@kimchy) and released under the terms of the Apache License. ElasticSearch is written in Java.

http://elasticsearch.org/
http://elasticsearch.com/

Lucene?

Apache Lucene is a
● free/open source information retrieval software library
● originally written in Java
● supported by the Apache Software Foundation
● released under the Apache Software License

While suitable for any application that requires full-text indexing and searching, Lucene is widely recognized for its utility in implementing Internet search engines and local, single-site search.

http://lucene.apache.org/core/

Basic Concepts

Indexing.

ElasticSearch achieves fast search responses because, instead of scanning the text directly, it searches an index.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages).

ElasticSearch uses Apache Lucene to create and manage this inverted index.

Inverted index

In computer science, an inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents. The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database.

Simple example. Given the texts:

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

we have the following inverted file index (where the integers in the braces refer to the indexes (or keys) of the texts T[0], T[1], etc.):

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
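The banana example can be reproduced in a few lines of Python. This is only a minimal sketch of the idea, not how Lucene implements it: tokenization is a plain whitespace split, and the helper names `build_inverted_index` and `search_all` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map every word to the set of text indexes in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def search_all(index, query):
    """Return the indexes of the texts that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Intersect the posting sets: a text must contain all query words.
        result &= index.get(word, set())
    return result


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)

for word in sorted(index):
    print(f'"{word}": {sorted(index[word])}')

print(search_all(index, "what is it"))  # {0, 1}
```

Note where the cost lands: every new text has to be split and merged into the index up front, but a query then only intersects a few small sets instead of scanning all the texts.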

Basic Concepts

Data representation.

In ElasticSearch, a Document is the unit of search and index. An index consists of one or more Documents, and a Document consists of one or more Fields (in database terminology, a Document corresponds to a table row, and a Field corresponds to a table column).

The schema declares:
- what fields there are
- which field should be used as the unique/primary key
- which fields are required
- how to index and search each field
- etc.

An index may store documents of different "mapping types". You can associate multiple mapping definitions for each mapping type. A mapping type is a way of separating the documents in an index into logical groups.
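To make the Document / Field / mapping type vocabulary concrete, here is a rough sketch that writes one document and one mapping as plain Python dictionaries. The type name `talk`, the field names, and the helper `unmapped_fields` are hypothetical. The mapping layout (`properties`, `"type": "string"`, `"index": "not_analyzed"`) follows the style of the ElasticSearch releases from around the time of these slides, so treat it as an assumption and check it against the version you run.

```python
import json

# Hypothetical Document: one "row" whose keys are the Fields.
document = {
    "title": "Wanna search? Piece of cake!",
    "author": "Alexey Kursov",
    "slides": 16,
}

# Hypothetical mapping for a mapping type named "talk".
# Assumption: old-style field types ("string") as used by early releases.
mapping = {
    "talk": {
        "properties": {
            # Analyzed full-text field.
            "title": {"type": "string"},
            # Stored as a single term, so only exact matches hit.
            "author": {"type": "string", "index": "not_analyzed"},
            "slides": {"type": "integer"},
        }
    }
}


def unmapped_fields(doc, mapping, type_name):
    """Return the document fields that the mapping does not declare."""
    declared = mapping[type_name]["properties"]
    return sorted(set(doc) - set(declared))


print(json.dumps(mapping, indent=2))
print(unmapped_fields(document, mapping, "talk"))
```

The mapping answers the "how to index and search each field" part of the schema, one field at a time.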

Competitors?

http://lucene.apache.org/solr/

http://sphinxsearch.com/

What's the same?

ElasticSearch vs. Solr

Implementation of Lucene query, facet, and index functionality:

Very similar, though each side has its own differences and nuances (there is a lot of information about this on the Internet; see, for example, this series of articles: http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/)

What's the difference?

ElasticSearch vs. Solr

ElasticSearch main advantages (IMHO):

1. Low barrier to entry. ElasticSearch is a more "intuitive, accessible" system (significantly less configuration, thanks to dynamic schema building over HTTP and sensible defaults)
2. The JSON-based API is cleaner and easier to use
3. Replication and sharding are much simpler to configure
4. Complex (nested) documents
5. Multiple document types per schema
6. Joins (parent/child relationships)
7. Online schema changes
8. Self-contained cluster
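As a rough illustration of the JSON-over-HTTP point, the sketch below only builds the requests for indexing a document and for a full-text search; nothing is sent until `send` is called. The host and port, the index and type names, and the helper functions are assumptions for the example. The URL layout with a type segment follows the API style of the releases from around the time of these slides.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def index_request(index, doc_type, doc_id, document):
    """Build a PUT request that stores `document` under index/type/id."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a POST request that runs a full-text `match` query on one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical index "talks"; inspect the request instead of sending it.
req = search_request("talks", "title", "piece of cake")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a node running, `send(index_request("talks", "talk", "1", {"title": "Wanna search?"}))` followed by `send(req)` would exercise both calls.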

What's the difference?

ElasticSearch vs. Solr

Solr main advantages (IMHO):

1. Solr has a bigger, more mature user, developer, and contributor community
2. Solr is more mature and maybe more stable
3. Solr has more response formats (XML, CSV, JSON)
4. Better third-party product integration
5. Pivot facets
6. More customizable

Who wins?

ElasticSearch vs. Solr

We all do!

ES Clients and "river" plugins

There are clients for many languages and platforms (per the official site): Java, .NET, Perl, Python, Ruby, PHP, JavaScript, Scala, Clojure, Go, Erlang, EventMachine, OCaml, Smalltalk

There are "river" (data import) plugins for:

JDBC, CouchDB, Wikipedia, Twitter, RabbitMQ, RSS, MongoDB, Open Archives Initiative (OAI), St9, Sofa, Amazon SQS, LDAP, Dropbox, ActiveMQ, Solr, CSV, JMS

Who uses it?

How to connect from my code?

NEST (the guys from stackoverflow.com and I think it is the best .NET client for ElasticSearch)

NEST aims to be a .NET client with a very concise API (http://github.com/Mpdreamz/NEST).

Its main goal is to provide a solid strongly typed Elasticsearch client. It also has string/dynamic overloads for more dynamic use cases.

Why NEST?

● Fluent. Looks like:

ElasticClient.Search<Foo>(s => s.From(0).Size(10).SortAscending(f => f.Name).Query(...

● JSON serializer/deserializer: Newtonsoft Json.NET with all its advantages
● Strongly typed
● Useful attributes for configuration
● Actively developed and improved
● Open source
● Clear and beautiful source code
● Available on NuGet
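For readers who do not use .NET, the fluent style can be sketched in Python. This is not NEST and not any real client: `SearchBuilder` and its methods are hypothetical, and the sketch only shows how a chained call can be collected into a JSON search body with `from`, `size`, `sort`, and `query` keys.

```python
import json


class SearchBuilder:
    """Hypothetical fluent builder; each method returns self so calls chain."""

    def __init__(self):
        self._body = {}

    def from_(self, offset):
        # Trailing underscore because `from` is a reserved word in Python.
        self._body["from"] = offset
        return self

    def size(self, count):
        self._body["size"] = count
        return self

    def sort_ascending(self, field):
        self._body.setdefault("sort", []).append({field: "asc"})
        return self

    def query(self, query):
        self._body["query"] = query
        return self

    def to_json(self):
        """Serialize the collected search body; keys are sorted for stable output."""
        return json.dumps(self._body, sort_keys=True)


body = (
    SearchBuilder()
    .from_(0)
    .size(10)
    .sort_ascending("name")
    .query({"match_all": {}})
    .to_json()
)
print(body)
```

A strongly typed client such as NEST goes further: the lambda `f => f.Name` lets the compiler check the field name, which a string like `"name"` cannot do.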

You can find other clients here: http://www.elasticsearch.org/guide/clients/

Practice