Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor:...
Slide 1
Apache Solr Module Presentation
Iccha Sethi
Serdar Aslan
Team 1
Virginia Tech
Information Storage and Retrieval CS 5604
Instructor: Dr. Edward Fox
10/11/2010

Slide 2
Outline
- History
- What's Lucene
- What's Solr
- Getting Started with Solr (indexing, updating, deleting)
- Querying Data
- Other features of Solr
- IR Concepts and Solr
- Light demo of Solr
- Questions

Slide 3
History
- Search for a replacement search platform
  - commercial: high license fees
  - open-source: no full solutions
- CNET grants code to Apache; Solr enters the Incubator on 17 Jan 2006
- Solr is a Lucene sub-project

Slide 4
What is Lucene?
- Solr uses the Lucene search library and extends it.
- Lucene is an open source, high-performance text search engine library.
- Lucene is neither a server nor a web crawler.
- Uses scoring algorithms based on Information Retrieval principles.
- Provides a rich set of text analyzers and a query syntax with a parser.

Slide 5
Lucene's index (conceptual)
[Figure 1: Lucene index. An Index contains Documents, each Document contains Fields, and each Field is a Name/Value pair. (Kataria S., Khabsa M., Document Indexing and Scoring algorithm, 2010)]

Slide 6
What is Solr
- Solr is an open source enterprise search platform.
- Used by iTunes, CNET, Zappos and Netflix, as well as intranet sites.
- Written in Java.
- XML/HTTP interface.
- Schema to define types and fields.
- Web administration interface.
[Figure 2: Common Solr Usage. Diagram labels: DB, Solr, Web, Data]

Slide 7
Major Features of Solr
- Powerful full-text search
- Hit highlighting
- Faceted search
- Dynamic clustering
- Database integration
Notes: Explain each of these in a sentence. Check them again.

Slide 8
Architecture of Solr
[Figure 3: Architecture of Solr (Seeley Y., Apache Solr, 2006). Diagram labels: Solr Core, Lucene, Admin Interface, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, Update Handler, Caching, XML Update Interface, Config, Analysis, HTTP Request Servlet, Concurrency, Update Servlet, XML Response Writer, Replication, Schema]

Slide 9
Solr Documents
Solr accepts well-formed XML documents:
CNN Breaking News Obama wins 2008-11-06T23:59:59.999Z
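
As a rough illustration of what such a document looks like, the sketch below builds a Solr `<add><doc>` update message from the values on the slide. The field names (`id`, `source`, `category`, `title`, `timestamp`) and the `doc-1` identifier are assumptions made for the example; the real names come from the schema in use.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr <add><doc>...</doc></add> update message as a string."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Field names and the id value are hypothetical; only the other values
# appear on the slide.
xml_message = build_add_xml([
    ("id", "doc-1"),
    ("source", "CNN"),
    ("category", "Breaking News"),
    ("title", "Obama wins"),
    ("timestamp", "2008-11-06T23:59:59.999Z"),
])
print(xml_message)
```

Each `<field>` element carries one name/value pair, which mirrors the Document/Field structure of the Lucene index.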

Slide 10
Getting Started with Solr
How to run Solr on the IBM cloud system:
- Log in to the system using PuTTY and the generated private key
- Go to team1->apache-solr->example
- Start the Solr server
- Load http://localhost:8983/solr/admin/ in your web browser
Notes: Need to add pictures. Might need another page.

Slide 11
Indexing Data
- The Solr server is up and running.
- To index data:
  - Open a new terminal
  - Follow the path team1/apache-solr/example/example-docs/
  - Run "java -jar post.jar" on some of the XML files in that directory

Slide 12
Indexing Data (cont'd)
- To index all data, run: java -jar post.jar *.xml
- Result: all sample files in the example directory are indexed

Slide 13
Solr Admin page
- Open http://localhost:8983/solr/admin in your web browser
Notes: Explain each tab.

Slide 14
Updating Data
- The user can edit an existing XML file to change the data
- Run the java -jar post.jar command on the edited file to re-index it
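
The indexing and updating steps above amount to an HTTP POST of an XML file to Solr's update handler, followed by a commit. A minimal sketch of that idea in Python is given here; it assumes the update handler lives at `http://localhost:8983/solr/update` (the conventional path for the example server), and the function names are made up for illustration. It is not how post.jar itself is implemented.

```python
import urllib.request

# Assumed location of the update handler on the example server.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def make_update_request(xml_body, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_file_and_commit(path):
    """Send one XML file to Solr, then send <commit/> to make it searchable."""
    with open(path, encoding="utf-8") as handle:
        add_request = make_update_request(handle.read())
    commit_request = make_update_request("<commit/>")
    for request in (add_request, commit_request):
        with urllib.request.urlopen(request) as response:
            response.read()
```

Calling `post_file_and_commit` with the path of an edited XML file would play the role of re-running post.jar on it, provided a Solr server is listening at the assumed URL.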

Slide 15
Deleting Data
A delete operation can be done by:
- Posting a delete command that specifies the value of a document's unique key field:
  java -Ddata=args -Dcommit=no -jar post.jar "<delete><id>SP2514N</id></delete>"
- Posting a delete command with a query that matches multiple documents:
  java -Ddata=args -jar post.jar "<delete><query>name:DDR</query></delete>"
Don't forget to update (commit) the data afterwards: java -jar post.jar
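
The two delete styles correspond to two small XML messages in Solr's update format. The sketch below only builds those messages; the helper names are hypothetical, and sending them to the server (plus the commit) is left out.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Delete message for a single document, identified by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Delete message for every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("SP2514N"))
print(delete_by_query("name:DDR"))
```

Neither message changes what searches return until a commit is sent, which is why the slide insists on the final java -jar post.jar step.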
Notes: Emphasize that for a delete operation to take effect YOU HAVE TO UPDATE the collection with the java -jar post.jar command.

Slide 16
Querying Data
- Searches are done with the query string in the q parameter. Example query: q=video
- A number of request parameters can be passed to control what information is returned. Example: the "fl" parameter controls which stored fields are returned.
- Example query: q=video&fl=name,id,score (also returns the estimated relevancy score)

Slide 17
Querying Data (cont'd)
Example query: q=video
Callouts: number of documents found in the collection; different fields from the retrieved document; query
Notes: Explain the results, e.g. numFound, where the query is, and which places show all the fields of the document.

Slide 18
Querying Data (cont'd)
Example query: q=name:video

Slide 19
Querying Data (cont'd)
Example query: q=video&fl=name,id,score

Slide 20
Querying Data (cont'd)
Example query: q=video&fl=*,score (returns all stored fields, as well as the estimated relevancy score)
Callout: estimated relevancy score

Slide 21
Querying Data (cont'd)
Example query: q=video&sort=price desc&fl=name,id,price
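
The example queries above are ordinary URL query strings, so they can be assembled programmatically. The sketch below assumes the search handler is reachable at `http://localhost:8983/solr/select` (host and port from the slides, conventional `/select` path); `build_query_url` is a name invented for this example.

```python
from urllib.parse import urlencode

# Assumed search handler location on the example server.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(q, **params):
    """Return a select URL for query q plus optional parameters such as fl or sort."""
    query = {"q": q}
    query.update(params)
    # urlencode percent-encodes commas and turns spaces into '+'.
    return SELECT_URL + "?" + urlencode(query)


print(build_query_url("video"))
print(build_query_url("video", fl="name,id,score"))
print(build_query_url("video", sort="price desc", fl="name,id,price"))
```

The printed URLs look different from the slide text only because reserved characters are percent-encoded; Solr decodes them back to the same parameter values.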

Slide 22
Querying Data (cont'd)
Example query: q=video&wt=json
- The response format can also be python, php, ruby or xml
Notes: Talk about the other formats, e.g. ruby.

Slide 23
Highlighting
Example query: ...&q=video card&fl=name,id&hl=true&hl.fl=name,features
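
To show how a client might consume a `wt=json` response with highlighting turned on, here is a small sketch. The sample response is invented for illustration (the document id, name and snippet are not real output); only its overall layout, with `response.numFound`, `response.docs` and a separate `highlighting` section keyed by document id, follows Solr's JSON response format.

```python
import json

# Made-up response body; real output depends on the indexed data.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{"id": "doc-1", "name": "Example video card"}]
  },
  "highlighting": {
    "doc-1": {"name": ["Example <em>video</em> <em>card</em>"]}
  }
}
"""


def summarize(raw_json):
    """Return (numFound, [(id, name, highlighted name snippets), ...])."""
    data = json.loads(raw_json)
    highlights = data.get("highlighting", {})
    rows = []
    for doc in data["response"]["docs"]:
        snippets = highlights.get(doc["id"], {}).get("name", [])
        rows.append((doc["id"], doc["name"], snippets))
    return data["response"]["numFound"], rows


num_found, rows = summarize(SAMPLE_RESPONSE)
print(num_found, rows)
```

Because the highlighted snippets sit in their own section after the document list, the client has to join them back to the documents by id, as `summarize` does.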
Callout: highlighted fields are listed at the bottom of the page
Notes: Explain what highlighting is.

Slide 24
Faceted Search
- It is a dynamic clustering of search results into categories
- Allows users to refine their search results
- Generates counts for various properties or categories
- Also called faceted browsing or faceted navigation
- The benefits:
  - Superior feedback
  - No surprises or dead ends
  - No selection hierarchy is imposed

Slide 25
Faceted Search
Example: CNET website

Slide 26
Faceted Search
Example query: ...&q=*:*&facet=true&facet.field=cat
Callouts: generated counts; q=*:* refers to all documents

Slide 27
Faceted Search
Example query: ...&q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
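
Facet requests repeat the same parameter name (`facet.query` appears twice above), so a plain dictionary is not enough to build them. The sketch below uses a list of pairs instead; the `/select` URL is assumed as the conventional search handler path, and `build_facet_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed search handler location on the example server.
SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_url(q, facet_fields=(), facet_queries=()):
    """Return a select URL with faceting enabled.

    facet_fields produce one count per distinct field value;
    facet_queries produce one count per arbitrary query.
    """
    params = [("q", q), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    params += [("facet.query", query) for query in facet_queries]
    return SELECT_URL + "?" + urlencode(params)


by_category = build_facet_url("*:*", facet_fields=["cat"])
by_price = build_facet_url(
    "ipod", facet_queries=["price:[0 TO 100]", "price:[100 TO *]"]
)
print(by_category)
print(by_price)
```

Each `facet.query` yields its own count in the response, which is how the two price ranges on the slide each get a number.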
Callout: generated counts

Slide 28
Search Relevancy
[Figure 4: Search Relevancy (Seeley Y., Apache Solr, 2006)]
Document analysis: "PowerShot SD 500" -> WhitespaceTokenizer -> PowerShot, SD, 500 -> WordDelimiterFilter (catenateWords=1) -> Power, Shot, PowerShot, SD, 500 -> LowercaseFilter -> power, shot, powershot, sd, 500
Query analysis: "power-shot sd500" -> WhitespaceTokenizer -> power-shot, sd500 -> WordDelimiterFilter (catenateWords=0) -> power, shot, sd, 500 -> LowercaseFilter -> power, shot, sd, 500
Result: A match!

Slide 29
What we've covered
- Basic information about Solr
- Structure of Solr
- How to run a Solr instance
- Adding, deleting and updating documents
- Making changes to the index
- Making a query and running it
- Using the Solr admin interface

Slide 30
Other features of Solr
- Distributed search
- Numeric field statistics
- Search result clustering
- Function queries
- Boosting
- More Like This
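
The analysis chains in Figure 4 (Search Relevancy) can be imitated with a few lines of Python to see why the query matches the document. This is a deliberately simplified toy, not Solr's WordDelimiterFilter: it splits only on case changes, letter/digit boundaries and punctuation, and its catenation joins all letter-only parts of a token.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only, like the first stage in the figure."""
    return text.split()


def word_delimiter(tokens, catenate_words):
    """Toy word-delimiter stage: split each token into sub-words and numbers."""
    out = []
    for token in tokens:
        # Capitalized or lowercase word, run of capitals, or run of digits.
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", token)
        out.extend(parts)
        words = [part for part in parts if part.isalpha()]
        if catenate_words and len(words) > 1:
            # Also emit the joined form, e.g. Power + Shot -> PowerShot.
            out.append("".join(words))
    return out


def lowercase(tokens):
    return [token.lower() for token in tokens]


def analyze(text, catenate_words):
    return lowercase(word_delimiter(whitespace_tokenize(text), catenate_words))


doc_terms = analyze("PowerShot SD 500", catenate_words=True)
query_terms = analyze("power-shot sd500", catenate_words=False)
is_match = set(query_terms) <= set(doc_terms)
print(doc_terms, query_terms, is_match)
```

Both inputs end up as the same lowercase sub-words, so every query term is found among the document terms even though the original strings differ in case, spacing and punctuation.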

Slide 31
Relation with IR Concepts
- Tokenization
- Scoring: tf-idf (Lucene class Similarity)
- Lucene Practical Scoring
- Boosting documents and queries
- Wildcard queries (te?t, test*, te*t)
- Clustering (result clustering via Carrot2)
- Lucene's conjunctive search algorithm uses skip pointers
Factors in Lucene Practical Scoring:
- coord(q,d) is a score factor based on how many of the query terms are found in the specified document
- queryNorm(q) is a normalizing factor used to make scores between queries comparable
- norm(t,d) encapsulates a few (indexing time) boost and length factors:
  - Document boost
  - Field boost
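
For reference, the practical scoring function documented for Lucene's Similarity class in the releases of that period combines these factors as sketched below. The terms tf, idf and t.getBoost() are not defined on the slide; they are taken from the Lucene documentation and should be read as an assumption about which formula the slide showed.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
t.\mathrm{getBoost}()\cdot \mathrm{norm}(t,d) \Big)

% Default choices in Lucene's DefaultSimilarity:
\mathrm{tf}(t \in d) = \sqrt{\mathrm{frequency}}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)
```

The sum runs over the query terms, so a document gains score for every matching term, while coord and queryNorm scale the total rather than any single term.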

Slide 32
Relation with IR Concepts
[Figure 5: Chapter 7, Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)]
[Figure 6: Chapter 1, Lucene in Action (Otis Gospodnetic and Erik Hatcher)]

Slide 33
Video
file:///C:/Users/Sethi/Documents/Camtasia%20Studio/Apache-solr-team1/Apache-solr-team1.html

Slide 34
Questions
- Any questions?
- Are you ready for exercises?