Apache Solr Module Presentation. Iccha Sethi and Serdar Aslan, Team 1, Virginia Tech. Information Storage and Retrieval (CS 5604). Instructor: Dr. Edward Fox. 10/11/2010

Slide 1

Apache Solr Module Presentation
Iccha Sethi, Serdar Aslan
Team 1
Virginia Tech
Information Storage and Retrieval, CS 5604
Instructor: Dr. Edward Fox

10/11/2010

Slide 2

Outline
- History
- What's Lucene
- What's Solr
- Getting Started with Solr (indexing, updating, deleting)
- Querying Data
- Other features of Solr
- IR Concepts and Solr
- Light demo of Solr
- Questions

Slide 3

History
- Search for a replacement search platform
  - commercial: high license fees
  - open-source: no full solutions
- CNET grants code to Apache; Solr enters the Incubator on 17 Jan 2006
- Solr is a Lucene sub-project

Slide 4

What is Lucene?
- Solr uses the Lucene search library and extends it.
- Open-source, high-performance text search engine library.
- Lucene is not a server, and it is not a web crawler either.
- Uses scoring algorithms based on Information Retrieval principles.
- Uses a rich set of text analyzers and a query syntax with a parser.

Slide 5

Lucene's index (conceptual)
[Figure: an Index contains Documents; each Document contains Fields; each Field is a Name/Value pair.]
Figure 1: Lucene index (Kataria S., Khabsa M., Document Indexing and Scoring algorithm, 2010)

Slide 6

What is Solr
- Solr is an open-source enterprise search platform.
- Used by iTunes, CNET, Zappos, and Netflix, as well as intranet sites.
- Written in Java.
- XML/HTTP interface.
- Schema to define types and fields.
- Web administration interface.
[Figure: diagram labeled DB, Solr, Web, and Data.]
Figure 2: Common Solr Usage

Slide 7

Major Features of Solr
- Powerful full-text search
- Hit highlighting
- Faceted search
- Dynamic clustering
- Database integration

Speaker notes: Explain each of these in a sentence. Check them again.

Slide 8

Architecture of Solr
[Figure: component diagram. Labels: HTTP Request Servlet, Update Servlet, Admin Interface, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, Update Handler, XML Update Interface, XML Response Writer, Solr Core, Config, Schema, Analysis, Caching, Concurrency, Replication, Lucene.]
Figure 3: Architecture of Solr (Seeley Y., Apache Solr, 2006)

Slide 9

Solr Documents
- Solr accepts well-formed XML documents

CNN Breaking News Obama wins 2008-11-06T23:59:59.999Z
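The XML markup of the example document above did not survive in this transcript. As a minimal sketch of what a Solr XML update message looks like, the following Python builds an add/doc message with the standard library. The field names (id, cat, title, date) and the way the four values are split across them are assumptions made for illustration; the real names would come from the schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Return a Solr <add> XML message containing a single <doc>.

    `fields` is a list of (field_name, value) pairs.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes special characters on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; only the values appear on the slide.
example_fields = [
    ("id", "CNN"),
    ("cat", "Breaking News"),
    ("title", "Obama wins"),
    ("date", "2008-11-06T23:59:59.999Z"),
]

message = build_add_message(example_fields)
print(message)
```

Building the message with an XML library rather than string concatenation keeps it well formed even when a value contains characters such as < or &.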

Slide 10

Getting Started with Solr
How to run Solr on the IBM cloud system:
- Log in to the system using PuTTY and the generated private key
- Go to team1 -> apache-solr -> example
- Start the Solr server
- Load http://localhost:8983/solr/admin/ in your web browser

Speaker notes: Need to add pictures. Might need another page.

Slide 11

Indexing Data
The Solr server is up and running. To index data:
- Open a new terminal
- Go to team1/apache-solr/example/example-docs/
- Run "java -jar post.jar" on some of the XML files in that directory
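As a rough sketch of what the post.jar step amounts to: each XML file is sent to Solr's update handler in an HTTP POST, followed by a commit so the new documents become searchable. The URL below is assumed to be the example server's default update address, and make_update_request and post_files are hypothetical helper names, not part of Solr. Nothing is sent when the block is loaded; post_files only talks to a server when it is called.

```python
import urllib.request

# Assumption: update address of the example server started on port 8983.
UPDATE_URL = "http://localhost:8983/solr/update"


def make_update_request(xml_body, url=UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_files(paths, send=urllib.request.urlopen):
    """POST each XML file, then a final <commit/> message.

    `send` is injectable so the function can be exercised without a server.
    """
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            send(make_update_request(handle.read()))
    send(make_update_request("<commit/>"))
```

With a running server, a call such as post_files(["some-file.xml"]) would roughly correspond to running post.jar on that file; the file name here is only a placeholder.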

Slide 12

Indexing Data (cont'd)
To index all data:
- Run java -jar post.jar *.xml
[Screenshot callout: indexed all sample files in the example directory]

Slide 13

Solr Admin page
- Open http://localhost:8983/solr/admin in your web browser

Speaker notes: Explain each tab.

Slide 14

Updating Data
- The user can edit an existing XML file to change the data
- Run the java -jar post.jar command on the edited file to apply the change

Slide 15

Deleting Data
A delete operation can be done by:
- Posting a delete command that specifies the value of a document's unique key field:
  java -Ddata=args -Dcommit=no -jar post.jar "<delete><id>SP2514N</id></delete>"
- Posting a delete command with a query that matches multiple documents:
  java -Ddata=args -jar post.jar "<delete><query>name:DDR</query></delete>"
Don't forget to update (commit) the data afterwards: java -jar post.jar
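The two kinds of delete message can be sketched in Python as follows. This is an illustration only: delete_by_id and delete_by_query are hypothetical helper names, and the resulting strings would still have to be posted to Solr and followed by a commit before the deletion becomes visible.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Delete message for one document, identified by its unique key value."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Delete message for every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# Message that makes pending adds and deletes visible to searches.
COMMIT = "<commit/>"

print(delete_by_id("SP2514N"))
print(delete_by_query("name:DDR"))
print(COMMIT)
```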

Speaker notes: Emphasize that for a delete operation to take effect YOU HAVE TO UPDATE the collection with the java -jar post.jar command.

Slide 16

Querying Data
- Searches are done with the query string in the q parameter. Example query: q=video
- A number of request parameters can be passed to control what information is returned. Example: the "fl" parameter controls which stored fields are returned.
- Example query: q=video&fl=name,id,score (score returns the estimated relevancy score)

Slide 17

Querying Data (cont'd)
Example query: q=video

[Screenshot callouts: the query; the number of documents found in the collection; the different fields of a retrieved document]

Speaker notes: Explain the results, e.g. numFound and where the query appears. Point out where all the fields of the document are shown.

Slide 18

Querying Data (cont'd)
Example query: q=name:video

Slide 19

Querying Data (cont'd)
Example query: q=video&fl=name,id,score

Slide 20

Querying Data (cont'd)
Example query: q=video&fl=*,score (returns all stored fields, as well as the estimated relevancy score)
[Screenshot callout: estimated relevancy score]

Slide 21

Querying Data (cont'd)
Example query: q=video&sort=price desc&fl=name,id,price
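The example queries so far can be assembled programmatically. The sketch below builds the request URLs with Python's standard library; the select address is an assumption (the slides show only the query strings), and build_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: select address of the example server; not shown on the slides.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(q, fl=None, sort=None, base=SELECT_URL):
    """Return a query URL for the q, fl and sort request parameters."""
    params = [("q", q)]
    if fl:
        params.append(("fl", ",".join(fl)))  # comma-separated stored fields
    if sort:
        params.append(("sort", sort))  # e.g. "price desc"
    return base + "?" + urlencode(params)


print(build_query_url("video"))
print(build_query_url("name:video"))
print(build_query_url("video", fl=["name", "id", "score"]))
print(build_query_url("video", fl=["name", "id", "price"], sort="price desc"))
```

urlencode percent-encodes characters such as the colon, the comma and the space, so the printed URLs look slightly different from the query strings on the slides while carrying the same parameter values.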

Slide 22

Querying Data (cont'd)
Example query: q=video&wt=json
[Screenshot callout: wt can also be python, php, ruby, or xml]

Speaker notes: Talk about the other formats, such as ruby.

Slide 23

Highlighting
Example query: ...&q=video card&fl=name,id&hl=true&hl.fl=name,features

[Screenshot callout: highlighted fields are listed at the bottom of the page]

Speaker notes: Explain what highlighting is.

Slide 24

Faceted Search
- It is a dynamic clustering of search results into categories
- Allows users to refine their search results
- Generates counts for various properties or categories
- Also called faceted browsing or faceted navigation
- The benefits:
  - Superior feedback
  - No surprises or dead ends
  - No selection hierarchy is imposed

Slide 25

Faceted Search
Example: CNET website

Slide 26

Faceted Search
Example query: ...&q=*:*&facet=true&facet.field=cat
[Screenshot callouts: generated counts; *:* refers to all documents]

Slide 27

Faceted Search
Example query: ...&q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
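One detail of the facet examples is that facet.query appears more than once in the same request. A minimal sketch of building such a query string in Python, using a list of pairs so that repeated parameter names are preserved; facet_params is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode


def facet_params(q, fields=(), queries=()):
    """Return an encoded query string with faceting switched on."""
    params = [("q", q), ("facet", "true")]
    params += [("facet.field", name) for name in fields]
    params += [("facet.query", fq) for fq in queries]  # one pair per facet query
    return urlencode(params)


# Counts per category over all documents.
by_category = facet_params("*:*", fields=["cat"])

# Counts for two price ranges among documents matching "ipod".
by_price = facet_params("ipod", queries=["price:[0 TO 100]", "price:[100 TO *]"])

print(by_category)
print(by_price)
```

A dictionary would silently keep only one facet.query entry, which is why the parameters are held as a list of pairs here.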

[Screenshot callout: generated counts]

Slide 28

Search Relevancy
[Figure: analysis chains applied to a document and to a query.
Document analysis: "PowerShot SD 500" -> WhitespaceTokenizer -> WordDelimiterFilter (catenateWords=1) -> LowercaseFilter -> power, shot, powershot, sd, 500
Query analysis: "power-shot sd500" -> WhitespaceTokenizer -> WordDelimiterFilter (catenateWords=0) -> LowercaseFilter -> power, shot, sd, 500
Result: a match!]
Figure 4: Search Relevancy (Seeley Y., Apache Solr, 2006)

Slide 29

What we've Covered
- Basic information about Solr
- Structure of Solr
- How to run a Solr instance
- Adding, deleting, and updating documents
- Making changes to the index
- Making a query and running it
- Using the Solr admin interface

Slide 30

Other features of Solr
- Distributed search
- Numeric field statistics
- Search result clustering
- Function queries
- Boosting
- More Like This
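Returning to the analysis chains of Figure 4 (Search Relevancy): the sketch below imitates them in plain Python to show why the document and the query end up matching. It is a simplified stand-in, not Lucene's implementation. The function names are hypothetical, and the splitting rule (non-alphanumeric characters, case changes, letter/digit boundaries) only approximates what WordDelimiterFilter does.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only, in the spirit of WhitespaceTokenizer."""
    return text.split()


def word_delimiter(tokens, catenate_words=False):
    """Split tokens into sub-words; optionally also emit the joined word parts."""
    out = []
    for token in tokens:
        # Sub-words: capitalized or lowercase runs, all-caps runs, digit runs.
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        if catenate_words and len(words) > 1:
            out.append("".join(words))  # e.g. Power + Shot -> PowerShot
    return out


def analyze(text, catenate_words):
    """Whitespace split, word-delimiter split, then lowercase."""
    tokens = word_delimiter(whitespace_tokenize(text), catenate_words)
    return [token.lower() for token in tokens]


doc_terms = analyze("PowerShot SD 500", catenate_words=True)
query_terms = analyze("power-shot sd500", catenate_words=False)

# The query matches when every query term was indexed for the document.
is_match = all(term in doc_terms for term in query_terms)
print(doc_terms, query_terms, is_match)
```

Catenating word parts at index time but not at query time means the index also holds the joined form powershot, so a query typed as a single word can still find the document.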

Slide 31

Relation with IR Concepts
- Tokenization
- Scoring: tf-idf (Lucene class Similarity)
- Boosting documents and queries
- Wildcard queries (te?t, test*, te*t)
- Clustering (result clustering via Carrot2)
- Lucene's conjunctive search algorithm uses skip pointers
- Lucene Practical Scoring (the formula was shown as an image on the slide), with factors:
  - coord(q,d) is a score factor based on how many of the query terms are found in the specified document
  - queryNorm(q) is a normalizing factor used to make scores between queries comparable
  - norm(t,d) encapsulates a few (indexing-time) boost and length factors:
    - Document boost
    - Field boost
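The scoring formula itself did not survive in this transcript. For reference, the practical scoring function as given in the documentation of Lucene's Similarity class from this period can be written as below; boost(t) stands for the query-time boost of term t, and the tf and idf definitions shown are those of the default similarity, included here as background rather than taken from the slide.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Bigl( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t)\cdot \mathrm{norm}(t,d) \Bigr)
\]
\[
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
\]
```

The sum runs over the query terms, so a term that is rare in the collection (high idf) and frequent in the document (high tf) contributes the most to the score.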

Slide 32

Relation with IR Concepts
[Figures not preserved]
Figure 5: Chapter 7, Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)
Figure 6: Chapter 1, Lucene In Action (Otis Gospodnetic and Erik Hatcher)

Slide 33

Video
file:///C:/Users/Sethi/Documents/Camtasia%20Studio/Apache-solr-team1/Apache-solr-team1.html

Slide 34

Questions
- Any questions?
- Are you ready for exercises?
