OpenSearchLab and the Lucene Ecosystem

OpenSearchLab and Lucene. Grant Ingersoll, Chief Scientist @LucidWorks; Member, Committer at Apache Software Foundation; Co-Founder, Apache Mahout


Keynote slides from http://opensearchlab.otago.ac.nz/FullProceedings.pdf

Transcript of OpenSearchLab and the Lucene Ecosystem

Page 1: OpenSearchLab and the Lucene Ecosystem

OpenSearchLab and Lucene

Grant Ingersoll
Chief Scientist @LucidWorks
Member, Committer at Apache Software Foundation
Co-Founder, Apache Mahout

Page 2: OpenSearchLab and the Lucene Ecosystem

Hats

I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects.

I don’t officially represent the ASF or even Lucene/Solr/Mahout.

Page 3: OpenSearchLab and the Lucene Ecosystem

Topics

• Openness

• What are some OpenSearchLab (OSL) needs?

• The Lucene Ecosystem

• Lucene for Research?

• A Sample Architecture

Page 4: OpenSearchLab and the Lucene Ecosystem

Putting the Open in OpenSearchLab

• Open Development >> Open Source

• Open community

• Open corpora

• Open evaluations

• Open Research
  – w/o being onerous

http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater

Page 5: OpenSearchLab and the Lucene Ecosystem

OSL Needs?

Community
• Openness Model
• Contributions:
  – Who?
  – Where?
  – How?
• Ownership/Legal:
  – Code
  – Contributions
  – Infrastructure
• Privacy
• …

Code
• Architecture
  – Flexible
  – Scalable
• Experiment Mgmt
• Content Acquisition
• Analysis
• Indexing
• Querying
• Downstream Tools
  – Faceting, highlighting, auto-suggest, spellchecking, etc.
• Records Mgmt
• Testing
• …

Infrastructure
• Hardware
  – Cloud or hosted?
• Network/Bandwidth
• Production/Staging/Dev
• $$$$
• Release Management
• Devops
• …

Page 6: OpenSearchLab and the Lucene Ecosystem

What’s this have to do with Lucene?

Page 7: OpenSearchLab and the Lucene Ecosystem

Code

Committers

Contributors

ASF

Users

“An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.”

– Wikipedia

Page 8: OpenSearchLab and the Lucene Ecosystem

The ASF and ASL

• ASF == Apache Software Foundation
  – Volunteer-based, but many are paid to work on open source by their employer
  – Community Over Code
    • Consensus-driven development
  – Meritocracy
    • “Those who do, make the decisions”
  – 100+ Top Level Projects
  – Infrastructure to support projects
  – “The Apache Way”
• ASL == Apache Software License (v2)

ASL ≠ ASF

Page 9: OpenSearchLab and the Lucene Ecosystem

Lucene Community

• In a nutshell: Large, Active Community
• 30+ committers, many, many more contributors
• (Tens of?) Thousands of Practitioners
• Thousands of production instances
  – Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, …
  – “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” -- EarlyBird, Busch et al.

Page 10: OpenSearchLab and the Lucene Ecosystem

The Code Ecosystem

Lucene Core

Solr

Hadoop

Mahout

OpenNLP

Nutch

Tika

Page 11: OpenSearchLab and the Lucene Ecosystem

Lucene Core

• Flagship Java library for building search applications
  – Indexing, Searching, Language Analysis
• Powers apps large and small the world over
• More in Apache Lucene 4 talk later
• Fast, small footprint
• Lots of useful related modules
  – Highlighting, Joins, Spatial, etc.
• http://lucene.apache.org/core
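To make "Indexing, Searching, Language Analysis" concrete, here is a minimal Python sketch of the idea behind an inverted index. It is not Lucene's API or data structure; the names `analyze` and `TinyIndex` are hypothetical and the analysis step is deliberately naive (lowercasing and splitting on non-alphanumerics).

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy language analysis: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        """Indexing: analyze the text and record which terms occur in the document."""
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Searching: return sorted ids of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add(1, "Lucene is a Java library for search")
index.add(2, "Solr is a search server built on Lucene")
index.add(3, "Hadoop stores and processes large data sets")
print(index.search("lucene search"))
```

The same analyzer is applied at index time and at query time, which is the property that makes term lookups line up; a real engine adds scoring, compression and on-disk segments on top of this.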

Page 12: OpenSearchLab and the Lucene Ecosystem

Solr

• Search server built using Lucene and HTTP
• Faceting, highlighting, most Lucene features, easy admin
• Highly Extensible
• Scalable (query volume and index size)
• Lucene Best Practices
• http://lucene.apache.org/solr
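Because Solr is driven over HTTP, a query with faceting and highlighting is just a URL. The sketch below only builds such a URL and sends nothing over the network. The host, the collection name `papers` and the field names are assumptions for illustration; `q`, `rows`, `wt`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def solr_select_url(base_url, collection, query, facet_fields=(), highlight_field=None, rows=10):
    """Build a Solr /select URL; base_url and collection are supplied by the caller."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_field:
        params.append(("hl", "true"))
        params.append(("hl.fl", highlight_field))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical local instance, collection and fields
url = solr_select_url(
    "http://localhost:8983/solr",
    "papers",
    "title:lucene",
    facet_fields=["year", "venue"],
    highlight_field="abstract",
)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Keeping URL construction in one small function makes it easy to log the exact request behind each experiment run.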

Page 13: OpenSearchLab and the Lucene Ecosystem

Hadoop

• Originally built for Nutch to solve large scale crawling problems
• Distributed File System and Computation Model
  – HDFS and MapReduce, YARN coming
• Common Use Cases: storage, log analysis, ETL
• http://hadoop.apache.org
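The MapReduce computation model can be sketched in a few lines of single-process Python. This is an illustration of the map, shuffle and reduce phases only, not Hadoop's API; the function names and the sample log lines are made up for the example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce in-process over a dict of doc_id -> text."""
    mapped = (pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical log snippets standing in for files on a distributed file system
logs = {"a.log": "GET /index GET /search", "b.log": "GET /search POST /update"}
counts = run_job(logs)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network; the per-key contract is the same.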

Page 14: OpenSearchLab and the Lucene Ecosystem

Nutch

• Web-scale crawler and search built on Lucene/Solr and Hadoop
• Link analysis (aka PageRank)
• Plugin framework
• Parsers for common document formats (PDF, Word, HTML, etc.)
• http://nutch.apache.org
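For readers unfamiliar with link analysis, the following is a textbook power-iteration PageRank in Python. It is not Nutch's implementation; the damping factor, iteration count and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of page -> list of outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the teleportation share first
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by every other page
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

At crawl scale the same iteration is typically expressed as repeated distributed jobs over the link graph rather than an in-memory loop.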

Page 15: OpenSearchLab and the Lucene Ecosystem

Mahout

• Scalable machine learning
  – Utilize Hadoop where appropriate
• Primary Focus: “The 3 C’s”
  – Clustering, classification, collaborative filtering
• Others
  – Frequent pattern mining, topic extraction, statistically interesting phrases
• http://mahout.apache.org

Page 16: OpenSearchLab and the Lucene Ecosystem

Tika

• Toolkit for detecting and extracting content from MIME types
• Support for many common file formats
  – Office, PDF, HTML, etc.
• Intuitive API (think SAX parser)
• Wraps best of breed open source extractors
• Plug in your own
• http://tika.apache.org
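"Think SAX parser" means the extractor pushes events (element start, element end, character data) to a handler that you supply. The sketch below shows that callback style with Python's standard `xml.sax` module on a hand-written XHTML string. It does not call Tika; the `TextCollector` class and the sample document are hypothetical.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects title and body text from a stream of SAX events."""

    def __init__(self):
        super().__init__()
        self._stack = []        # names of the currently open elements
        self._title_parts = []
        self._body_parts = []

    def startElement(self, name, attrs):
        self._stack.append(name)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        # Route character data by the enclosing element
        if "title" in self._stack:
            self._title_parts.append(content)
        elif "body" in self._stack:
            self._body_parts.append(content)

    @property
    def title(self):
        return "".join(self._title_parts).strip()

    @property
    def body(self):
        # Normalize runs of whitespace to single spaces
        return " ".join("".join(self._body_parts).split())


# Stand-in for the XHTML event stream an extractor would emit for a document
xhtml = b"<html><head><title>Sample document</title></head><body><p>Search, </p><p>NLP and more.</p></body></html>"
handler = TextCollector()
xml.sax.parseString(xhtml, handler)
print(handler.title, "|", handler.body)
```

The appeal of this style for large corpora is that the handler never needs the whole document in memory; it can stream text straight into an indexer.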

Page 17: OpenSearchLab and the Lucene Ecosystem

OpenNLP

• Supports common NLP tasks
  – NER, POS tagging, Chunking, Parsing, CoRef resolution
• MaxEnt and Perceptron based
  – Working to make the machine learning pluggable
• Some Multilingual support
• New life at the ASF
• Related: cTAKES, Stanbol

Page 18: OpenSearchLab and the Lucene Ecosystem

Other Useful Tools

• Apache ZooKeeper – Distrib. Coordination
• Apache Pig – Hadoop scripting w/o Java
• Apache HBase/Accumulo/Cassandra – BigTable/Dynamo
• Avro and Protobufs – Serialization frameworks
• Netty: Server framework – easy to add protocols and to scale
• Stanbol – Semantic Content Management using Solr, OpenNLP, others
• UIMA – Unstructured Info Management

Page 19: OpenSearchLab and the Lucene Ecosystem

LUCENE CAN HAS RESEARCH?

• Dispelling a few misconceptions:
  – No such thing as Lucene OOTB
  – Lucene ≠ Solr
• Researchers are welcome!
  – Large audience and many domains
  – http://wiki.apache.org/lucene-java/HowToContribute
  – Battle-tested code
  – Speed v. Quality tradeoffs

http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg

Page 20: OpenSearchLab and the Lucene Ecosystem

Research/Contribution Areas

• Work with the community to do evaluations
• Scoring
  – BM25, LM, IM, DFR, others already implemented
  – Easy to add your own
• Codecs
  – Extensible compression/storage
  – Many already implemented approaches and more coming
  – SimpleText FTW!
• Others:
  – Faceting, auto-suggest, spell-checking, highlighting, expansion and more
  – Different domains: machine generated data, mobile,
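As an example of the kind of scoring function listed above, here is a small, self-contained BM25 in Python. It is a sketch of one common BM25 variant, not Lucene's implementation; the parameter defaults `k1=1.2` and `b=0.75`, the function name and the toy documents are assumptions, and the document list is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with one common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical pre-tokenized documents
docs = [
    "lucene search library".split(),
    "lucene lucene scoring research with lucene".split(),
    "hadoop distributed storage".split(),
]
scores = bm25_scores(["lucene", "scoring"], docs)
print(scores)
```

Swapping in a different ranking model for an experiment amounts to replacing the body of the inner loop, which is the sense in which a pluggable scoring layer makes research comparisons cheap.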

Page 21: OpenSearchLab and the Lucene Ecosystem

Abstract OSL Architecture


Page 22: OpenSearchLab and the Lucene Ecosystem

Lucene Ecosystem Implementation

Page 23: OpenSearchLab and the Lucene Ecosystem

Takeaways

• Open Development >> Open Source >> Shared Source
  – Corollary: You never know where good ideas are coming from
• ASF is a proven model for collaboration
• Lucene ecosystem: extensive, production ready
• Lucene 4 is viable for IR algorithms and data structure research
• OSL (IMO) needs a services-based, pluggable architecture
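One way to picture a pluggable architecture is a registry of named stages that can be swapped without touching the rest of the pipeline. The Python sketch below is only an illustration of that idea under assumed names (`Pipeline`, `acquire`, `analyze`, `count_terms`); it is not the architecture shown on the slides.

```python
from typing import Callable, Dict, List

# A stage takes a record (a dict) and returns a new, enriched record
Stage = Callable[[dict], dict]


class Pipeline:
    """Hypothetical registry of named, replaceable processing stages."""

    def __init__(self):
        self._registry: Dict[str, Stage] = {}

    def register(self, name: str, stage: Stage) -> None:
        """Register a stage, or replace an existing one of the same name."""
        self._registry[name] = stage

    def run(self, stage_names: List[str], record: dict) -> dict:
        """Apply the named stages in order."""
        for name in stage_names:
            record = self._registry[name](record)
        return record


def acquire(record):
    """Content acquisition stand-in: clean up the raw input."""
    return {**record, "text": record["raw"].strip()}


def analyze(record):
    """Analysis stand-in: lowercase whitespace tokenization."""
    return {**record, "tokens": record["text"].lower().split()}


def count_terms(record):
    """Downstream tool stand-in: count the tokens."""
    return {**record, "n_terms": len(record["tokens"])}


pipeline = Pipeline()
pipeline.register("acquire", acquire)
pipeline.register("analyze", analyze)
pipeline.register("count", count_terms)
result = pipeline.run(["acquire", "analyze", "count"], {"raw": "  Open Search Lab  "})

# An experiment swaps one plugin and leaves the other stages untouched
pipeline.register("analyze", lambda r: {**r, "tokens": list(r["text"].replace(" ", "").lower())})
swapped = pipeline.run(["acquire", "analyze", "count"], {"raw": "  Open Search Lab  "})
print(result["n_terms"], swapped["n_terms"])
```

In a services-based setting each stage would sit behind a network interface instead of a function call, but the contract (named stage in, enriched record out) stays the same.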

Page 24: OpenSearchLab and the Lucene Ecosystem

Resources

• Getting Started
  – {Lucene|Mahout|Hadoop} In Action
  – Taming Text
• @gsingers
• http://www.lucidworks.com