Search
Dr Ian Boston
University of Cambridge
Image © University of Cambridge 2006
6 December 2006, 10:30, INTL 6

Search: Problem Area
• Stovepipe Applications
  – All wanted search
• Can't search each tool
• Unified Search of all content
  – 1 Text box + a button
  – Just like Google
    • To start with
    • Slightly less content

Possible Solutions
Image © University of Cambridge 2006

Public/Private Search Engine
– Register your site with Google
  • What about the content/permissions?
  • Non-starter, content missing.
– Google Scholar
  • E.g. DSpace
– Google Researcher? Google Learner?
  • Sakai is not Open Access
  • Why would they?

Private Search Application
– Intranet solution
  • Install Apache Nutch
    › Add AuthZ code
  • Buy a Google Appliance
    › Configure to do some AuthZ
    › ~£40K, 0.5M pages
– Rendered content is only a view
  • Misses properties
  • Approximates linkage
    › Doesn’t know about Sakai
– Nutch Prototype in 1.5.1

Entity Search
– Write a search engine!
  • Full time job.
– Reuse Lucene
  • Scalability
    › Most have < 5M active documents
    › Nutch benchmarked
      » 5 boxes, 2TB == 100M+ docs
      » http://wiki.apache.org/nutch/HardwareRequirements
  • Plumb in Lucene
    › Connect to Sakai Entity Bus
    › Connect to Entity Producers at the object level.
  • Learn from Nutch
    › Index Storage and Management
    › Scalability, Reliability
– MUST Cluster OOTB

Search Tool
Image © University of Cambridge 2006
• Permissions
  – Owning Entity checks permission on each result
• Rendering, Highlighting
  – Matching terms highlighted
• RSS Feed of search results
• OpenSearch (FF2.0, IE7) and Sherlock/Mycroft (FF1.5) integration
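
The per-result permission check and term highlighting listed for the Search Tool might look roughly like the sketch below. It is an illustration only, not Sakai's code: `SearchHit`, `filter_permitted`, `highlight`, and the `can_read` callback (standing in for the owning entity) are all hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SearchHit:
    reference: str  # hypothetical entity reference, e.g. "/wiki/site1/home"
    text: str       # digested content used to render the result


def filter_permitted(hits: Iterable[SearchHit], user: str,
                     can_read: Callable[[str, str], bool]) -> List[SearchHit]:
    """Keep only hits whose owning entity grants `user` read access."""
    return [hit for hit in hits if can_read(user, hit.reference)]


def highlight(text: str, terms: Iterable[str]) -> str:
    """Wrap whole-word, case-insensitive matches of `terms` in <b> tags."""
    words = [re.escape(t) for t in terms if t]
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", text)
```

Filtering after the search, rather than baking permissions into the index, keeps the decision with the entity that owns the content, at the cost of one check per returned result.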

Admin Tool
• Monitor Indexing progress
• Monitor Segments
• Request Worksite Index Rebuilds
• Request Complete Index Rebuilds
  – Expensive!

Tag Tool
• Search for a term
• Discover other terms
  – Size indicates relevance within result set
• Needs some windowing on the word vectors
  – High frequency words not significant
  – Short words not significant
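
The windowing idea for the Tag Tool can be sketched as a filter over term counts in the result set. This is a minimal sketch under stated assumptions: the thresholds `min_length`, `max_doc_fraction`, and `top_n` are invented for illustration, and each document is reduced to a plain list of terms rather than a real Lucene term vector.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tag_weights(result_docs: Iterable[List[str]], min_length: int = 4,
                max_doc_fraction: float = 0.8, top_n: int = 20) -> Dict[str, float]:
    """Score terms in a result set, dropping short and very common terms.

    A term's weight is its total count divided by the largest surviving
    count, giving a value in (0, 1] that a UI could map to a font size.
    """
    docs = [list(doc) for doc in result_docs]
    if not docs:
        return {}
    total = Counter()     # occurrences across the whole result set
    doc_freq = Counter()  # number of documents containing each term
    for doc in docs:
        total.update(doc)
        doc_freq.update(set(doc))
    kept = {
        term: count for term, count in total.items()
        if len(term) >= min_length                          # short words dropped
        and doc_freq[term] / len(docs) <= max_doc_fraction  # very common words dropped
    }
    if not kept:
        return {}
    top = sorted(kept.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    largest = top[0][1]
    return {term: count / largest for term, count in top}
```

A term that appears in nearly every result says little about the result set, so the document-frequency cut-off removes it before sizes are computed.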

Search API
• Simple API, one method: Search()
• Results paged at lowest level
• Access to secondary Indexes
  – “+Tool:wiki +Site:<siteid> +cowslips +bluebell”
• Content terms use Porter Stemmer and Stop words
  – Stop words “and”, “the”, “a” ignored
  – Stemmer: “looks” == “look”, “try” == “trying”
• May be some i18n issues
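
To make the query syntax and term handling concrete, here is a small sketch. In Lucene query syntax a leading `+` marks a required clause and `field:term` restricts a term to a field; everything else here is an assumption for illustration. In particular `naive_stem` is a crude suffix stripper that stands in for the Porter stemmer and is not an implementation of it, and `build_query` and `analyse` are hypothetical helper names.

```python
from typing import Iterable, List

STOP_WORDS = {"and", "the", "a"}  # the three examples given on the slide


def build_query(terms: Iterable[str], tool: str = "", site_id: str = "") -> str:
    """Build a query string in which every clause is required ('+' prefix)."""
    clauses: List[str] = []
    if tool:
        clauses.append(f"+Tool:{tool}")
    if site_id:
        clauses.append(f"+Site:{site_id}")
    clauses.extend(f"+{term}" for term in terms)
    return " ".join(clauses)


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for Porter, not the real algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyse(text: str) -> List[str]:
    """Lower-case, drop stop words, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

Because the same analysis is applied to indexed content and to query terms, “looks” and “look” reduce to the same index term and therefore match each other.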

Internal Architecture
Image © Wikipedia Commons 2006

Architecture
[Diagram. Components shown: Wiki Service, Content Service, Message Service; Sakai Entity Bus; Search Service (Lucene) containing Event Listener, Index Queue, Index Builder, Entity Content Producer, Local Segment Store, Clustered Index Store, Shared Segment Store, Index Builder, Search Service; Search API; Search Tool, Tag Tool, RWiki Search, Resources Tool, OSP Tools, Chat Tool, Email Tool, Announcements, Wiki Tool.]

Indexer
– Indexing Queue
  • Events arrive on the Bus
  • Added to the Queue transactionally
– Indexing
  • Index workers run concurrently (2 per Sakai node)
  • Take Events from the queue
  • Open an Abstract Lucene segment
  • Distributed lock manager
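
The queue-and-workers arrangement can be sketched with threads on a single machine. This is only a toy model of the idea: an in-memory `queue.Queue` stands in for the index queue, a `threading.Lock` stands in for the distributed lock manager, and `run_indexers` is a hypothetical name.

```python
import queue
import threading
from typing import Dict, List

NUM_WORKERS = 2  # the slide mentions two index workers per Sakai node


def run_indexers(events: List[str]) -> Dict[str, List[str]]:
    """Drain `events` with concurrent workers; return what each segment indexed."""
    pending: "queue.Queue[str]" = queue.Queue()
    for event in events:
        pending.put(event)

    segments: Dict[str, List[str]] = {}
    segment_lock = threading.Lock()  # stand-in for a distributed lock manager

    def worker(name: str) -> None:
        segment: List[str] = []      # each worker writes to its own segment
        while True:
            try:
                event = pending.get_nowait()
            except queue.Empty:
                break
            segment.append(event)    # "index" the event
            pending.task_done()
        with segment_lock:           # publish the finished segment under the lock
            segments[name] = segment

    threads = [threading.Thread(target=worker, args=(f"segment-{i}",))
               for i in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return segments
```

Giving each worker its own segment means workers never write to the same index files, so the lock is needed only when a finished segment is made visible to others.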

Content
– Entity Content Producer
  • Digests a Token Stream
    › On Content
    › Using Stemmer and Stop Words
  • Provides index terms
    › Site ID
    › User info
    › Properties
    › Tool
    › Custom
  • RDF Structure
    › Requires a triple store
    › Sesame in Contrib
    › Mulgara/Kowari needs work.

Cluster Index Storage
– Not Distributed
  • Mirrored for Central Deposit
  • Not as scalable as Nutch with Google MapReduce
  • BUT no setup required
– Local Segments
  • Opened by IndexReaders, IndexWriters, IndexSearchers
  • High performance Seek
– Shared Segments
  • Central deposit of search segments
  • Synchronized with local copies
– Periodic Merging
  • Reduce open files
  • Eliminate deleted items
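
Why periodic merging both reduces the number of segments and eliminates deleted items can be shown with a toy model. This is not how Lucene stores or merges segments internally; the `Segment` type (a dictionary with `None` as a deletion marker) and `merge_segments` are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Optional

# A segment maps a document id to its indexed text; None marks a deletion.
Segment = Dict[str, Optional[str]]


def merge_segments(segments: Iterable[Segment]) -> Segment:
    """Merge segments oldest-first into one, dropping deleted documents."""
    merged: Segment = {}
    for segment in segments:              # later segments override earlier ones
        for doc_id, text in segment.items():
            if text is None:
                merged.pop(doc_id, None)  # deletion marker removes the document
            else:
                merged[doc_id] = text
    return merged
```

After a merge there is one segment to open instead of many, and documents that were only marked as deleted no longer take up space.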

Production Deployment
Image © University of Cardiff 2006

Sites
• In production
  – Cambridge
    • 73K documents, 6GB index, content in index.
    • Rebuild time = 45 minutes
  – Cape Town
    • 93K documents, 200MB index, content not in index.
    • Rebuild time = ?
  – Others?
• Considering
  – Michigan
    • 1.7M documents
    • Rebuild time… weeks?
    • Should not put the content in the index
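
The figures for Cambridge and Cape Town give a rough sense of what storing content in the index costs per document. The arithmetic below is a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB), assumes index size scales linearly with document count, and assumes Michigan's documents resemble those of the other two sites, none of which the slides confirm.

```python
def bytes_per_document(index_bytes: float, documents: int) -> float:
    """Average index size per document."""
    return index_bytes / documents


MB = 1_000_000
GB = 1_000 * MB

cambridge = bytes_per_document(6 * GB, 73_000)    # content stored in the index
cape_town = bytes_per_document(200 * MB, 93_000)  # content not stored in the index

michigan_docs = 1_700_000
michigan_with_content = cambridge * michigan_docs / GB     # projected size in GB
michigan_without_content = cape_town * michigan_docs / GB  # projected size in GB

print(f"Cambridge: {cambridge / 1000:.1f} KB per document")
print(f"Cape Town: {cape_town / 1000:.1f} KB per document")
print(f"Michigan, content in index: {michigan_with_content:.1f} GB")
print(f"Michigan, content not in index: {michigan_without_content:.1f} GB")
```

Under these assumptions the per-document cost is roughly 82 KB with content against roughly 2 KB without, which projects to about 140 GB against under 4 GB for 1.7M documents.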

Deployment Issues
• Indexing Times
  – Acceptable for smaller sites, a few hours
  – Pain at larger sites
    • Rolling per worksite index build
    • Dedicated indexing cluster (not serving pages)
• Storage strategies
  – First Attempts - Cambridge - Cape Town
    • Cape Town identified many problems - Thank you!
    • MySQL - Don’t put segments in DB! - Extremely slow tables.
  – Node Layout
    • All nodes are indexers
  – Content in the Index or Out of the index
    • No content in index now
    • Results re-digested on search

Roadmap
Image from: http://marlin.sourceforge.net (a Gnome2 media editor)
Image © Marlin Project 2006

New Features
• Tagged Search Discovery
  – Based on word vectors
  – In trunk
  – Needs a lens - focus on distribution segment
• RDF Faceted Discovery
  – Merged word vectors and triples
  – Needs per worksite ontology tools
  – Needs triple store
    • Should be a Sakai wide store.
      › Kowari - issues with community
      › Mulgara

Roadmap
• Parallel Indexing
  – Implemented, needs heavy testing
  – Learn from Nutch
  – Multiple active indexes
  – Big sites in production
  – Better merge algorithm
• Other tools using search
  – Use indexes for PK search
  – Issues over Queue delays
• Text Mining - Sydney - Rafael Calvo

Questions
Image © University of Cambridge 2006