Wanna search? Piece of cake!

Post on 14-Jan-2015

Fast, scalable and easy to setup search engine for your data.

by Alexey Kursov (http://www.linkedin.com/in/kursov)

ElasticSearch is a
● distributed
● RESTful
● free/open source search server
● based on Apache Lucene.

It is developed by Shay Banon (@kimchy) and is released under the terms of the Apache License. ElasticSearch is developed in Java.

http://elasticsearch.org/
http://elasticsearch.com/

WTF?

Apache Lucene is
● a free/open source information retrieval software library
● originally created in Java
● supported by the Apache Software Foundation
● released under the Apache Software License

While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching.

http://lucene.apache.org/core/

Lucene?

Indexing.

ElasticSearch is able to achieve fast search responses because, instead of searching the text directly, it searches an index.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).

ElasticSearch uses Apache Lucene to create and manage this inverted index.

Basic Concepts

In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. Simple example:

Given the texts:

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

we have the following inverted file index (where the integers in the set notation brackets refer to the indexes (or keys) of the text symbols, T[0], T[1] etc.):

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Inverted index
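The word->documents mapping from the example can be reproduced in a few lines of Python. This is only a minimal sketch: the function names `build_inverted_index` and `search` are made up for illustration, tokenization is a plain whitespace split, and Lucene's real index is far more elaborate (analyzers, positions, scoring).

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map every word to the set of ids of the texts that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        # Naive tokenization: lower-case and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the texts that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the first word's ids, then intersect with the others.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)

for word in sorted(index):
    print(word, sorted(index[word]))

print(sorted(search(index, "what is")))
```

Building the index costs extra work whenever a text is added, but a query then only touches the entries for its own words instead of scanning every text.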

Basic Concepts

Data representation.

In ElasticSearch, a Document is the unit of search and index. An index consists of one or more Documents, and a Document consists of one or more Fields (in database terminology, a Document corresponds to a table row, and a Field corresponds to a table column).

Schema declares:
- what fields there are
- which field should be used as the unique/primary key
- which fields are required
- how to index and search each field
- etc.

An index may store documents of different "mapping types". You can associate multiple mapping definitions for each mapping type. A mapping type is a way of separating the documents in an index into logical groups.
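To make the Document/Field/mapping vocabulary concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: the "article" mapping type, the field names, and the helper `undeclared_fields` are invented, and the mapping is written in the style of the ElasticSearch 1.x releases that were current at the time of this talk (newer versions use different field types).

```python
import json

# Hypothetical mapping for an invented "article" mapping type.
# It declares which fields exist and how each one is indexed.
mapping = {
    "article": {
        "properties": {
            "id": {"type": "long"},
            "title": {"type": "string"},  # analyzed for full-text search
            "tag": {"type": "string", "index": "not_analyzed"},  # exact match
        }
    }
}

# A Document is a set of named Fields (compare: one row of a table).
document = {"id": 1, "title": "Wanna search? Piece of cake!", "tag": "search"}


def undeclared_fields(doc, type_mapping):
    """Return the document fields that the mapping does not declare."""
    return sorted(set(doc) - set(type_mapping["properties"]))


print(json.dumps(document, indent=2))
print(undeclared_fields(document, mapping["article"]))
```

The helper only illustrates the relationship between a document and its mapping; ElasticSearch itself can also add mappings for unseen fields dynamically.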

Competitors?

http://lucene.apache.org/solr/

http://sphinxsearch.com/

What's the same?

VS

Lucene query, facet, and index functionality:

Very similar, although each side has its own differences and nuances. There is a lot of information about this on the internet; see, for example, this series of articles: http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/

What's the difference?

VS

ElasticSearch main advantages (IMHO):

1. Low barrier to entry. ElasticSearch is a more "intuitive, accessible" system (significantly less configuration, thanks to a dynamic schema that is built over HTTP and sensible defaults)
2. The JSON-based API is cleaner and easier to use
3. The replication and sharding capabilities are much simpler to configure
4. Complex documents (nested)
5. Multiple document types per schema
6. Joins (parent/child relationships)
7. Online schema changes
8. Self-contained cluster

What's the difference?

VS

Solr main advantages (IMHO):

1. Solr has a bigger, more mature user, dev, and contributor community
2. Solr is more mature and maybe more stable
3. Solr has more response formats (XML, CSV, JSON)
4. Better 3rd-party product integration
5. Pivot Facets
6. More customizable

Who wins?

VS

We all do!

ES Clients and "river" plugins

There are clients for many languages and platforms (from the official site): Java, .NET, Perl, Python, Ruby, PHP, JavaScript, Scala, Clojure, Go, Erlang, EventMachine, OCaml, Smalltalk

There are "river" (data import) plugins for:

JDBC, CouchDB, Wikipedia, Twitter, RabbitMQ, RSS, MongoDB, Open Archives Initiative (OAI), St9, Sofa, Amazon SQS, LDAP, Dropbox, ActiveMQ, Solr, CSV, JMS

Who uses it?

How to connect from my code?

NEST (the guys from stackoverflow.com and I think it is the best .NET client for ElasticSearch)

NEST aims to be a .NET client with a very concise API (http://github.com/Mpdreamz/NEST).

Its main goal is to provide a solid strongly typed Elasticsearch client. It also has string/dynamic overloads for more dynamic use cases.

Why NEST?

● Fluent. Looks like:

ElasticClient.Search<Foo>(s => s.From(0).Size(10).SortAscending(f => f.Name).Query(...

● JSON serializer/deserializer: Newtonsoft Json.NET with all its advantages
● Strongly typed
● Useful attributes for configuration
● Actively developed and improved
● Open source
● Clean, beautiful source code
● Available on NuGet

You can find other clients here: http://www.elasticsearch.org/guide/clients/
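Whichever client you pick, it ends up talking to ElasticSearch's RESTful JSON API. As a rough Python sketch of what a paged, sorted search like the fluent NEST call under "Why NEST?" might translate to, the code below builds (but does not send) such a request. The index name `foos`, the sort field `name`, and the `match_all` query are placeholders chosen for illustration, since the original query is elided; `build_search_request` is a hypothetical helper, and `localhost:9200` is simply the default local address.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, start=0, size=10, sort_field="name"):
    """Build (without sending) an HTTP search request for the given index."""
    body = {
        "from": start,  # offset of the first hit to return
        "size": size,   # number of hits per page
        "sort": [{sort_field: {"order": "asc"}}],
        "query": {"match_all": {}},  # placeholder: the real query is elided
    }
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("http://localhost:9200", "foos")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node; a typed client such as NEST hides this plumbing behind its fluent API.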

Practice