Best practices in museum search
"Best Practices" in Museum Search
.. in my (researched) opinion
Nate Solas
#mcn2011 #[email protected]@walkerart.org
http://bit.ly/mcn2011search
Search is hard...
... shouldn't we just leave this to Google?
"Leave it to Google" IS a best practice!
For them, it's a solved problem. They have absolutely solved searching for content on websites, especially a finite domain like a museum website.
http://www.powerhousemuseum.com/search/index.php?cx=018242116655519399236%3A4srvv8yns7w&q=blue&sa=&cof=FORID%3A11&siteurl=www.powerhousemuseum.com%2Fvisit%2F
http://www.tate.org.uk/search/default.jsp?q=blue
http://www.brooklynmuseum.org/
http://www.amnh.org/
http://si.edu/ (GSA)
<title>We can do more to help</title>
<article>
Mark the content! Google indexes ALL the words, so all of our nav, advertising, footer... If we don't indicate what's the "content", it's all fair game (sort of. They're actually smarter than that.)
<sidebar>Meta tags (OG), RDFa, valid HTML5 markup, etc.</sidebar>
</article>
Internal search: yes
• We (should) know the most about our content, so we know:
  o how to suggest things
  o how to interpret queries in context (run the search)
  o how to present things to make sense
It's no longer just a 'web page'!
• We (should) have the content as discrete pieces of metadata: title, date, body, author, etc.
  o We can therefore index just the content, none of the other chrome on the page.
  o Facets: we can use this metadata to drill down.
Phases of search:
... let's just look at three parts:
the query, results, & dead ends
Search box, top right. Done. (Powerhouse Museum has it bottom left, but they're in Australia so this makes sense. ;)
• If there's text in the box ("search"), clear it when they click in!
• Autocomplete / suggest isn't really common (yet), but seems very useful where it shows up.
  o Three strategies I see:
    – Suggest page ("live search taxonomy") (http://www.imamuseum.org/)
    – Suggest tag/title (http://www.vam.ac.uk/)
    – Suggest phrase from full corpus (http://beta.walkerart.org/ (beta))
The Query
Suggest / Autocomplete
Full text autocomplete is sort of the holy grail, IMO, but we can't be as smart as Google.
IMA does "live search" (auto-suggest) instead of autocomplete, very useful but it doesn't help me spell Lichtenstein.
The real point is to eliminate dead ends.
Suggest / Autocomplete
Results
Questions your result page should answer immediately:
1. What are these things?
– Why did they match (and why in that order?)
– Was I understood / can I try again easily?
Finally:
• What's next?
  o try some results or
  o narrow (refine) search or
  o broaden search
WHAT are these things?
Mixed results ("All")
MOMA gets it:
• http://www.moma.org/search?query=blue
  o Full breadcrumb, excerpt, title, media if they have it.
This is confusing at first:
• http://www.metmuseum.org/search-results?ft=blue
Separate results
V&A splits into sections
• http://www.vam.ac.uk/contentapi/search/?q=blue&search-submit=Go
  o .. but some of the "articles" aren't articles.
MFA sections and staggers
• http://www.mfa.org/search/mfa/blue
Careful. This sort of assumes people know what they're looking for.
...um
Why did they match (in this order)?
• Highlight the match, if possible
• Sort by relevance
  o (But see section on "boosting"...)
• If you're splitting up content, it's hard to explain.
  o ...best result could be at the bottom of the page
... so ... don't. Let user do this.
Was I understood / Can I try again?
MFA site: http://www.mfa.org/search/mfa/blue
• Without the URL hint, can you even tell what was searched for?
  o And what if you want to add a single word? (WAC site is guilty of this. Blame the designer. ;-)
A few "not like this" examples:
• "blue phase"
  o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go
  o http://www.imamuseum.org/search/ima/%22blue%20phase%22
(People are going to use quotes!)
Was I really understood?
We know what you want: "Hours"
• http://www.britishmuseum.org/search_results.aspx?searchText=hours
• http://www.moma.org/search?query=hours&page=1
• http://beta.walkerart.org/search/?q=hours
"We have a special “live search” taxonomy for explicitly boosting content pages we know people are searching for. E.g. “jobs” on our employment page; “love” is our Love sculpture, not the hundreds of other works, “wedding” is for facility rentals, not our hundred wedding dresses in the collection."
-- Charlie Moad, IMA
Do me a favor:
• http://beta.walkerart.org/search/?q=articel
• http://www.vam.ac.uk/contentapi/search/?suggest=article&q=articel
  o (again, a bit confusing but right)
Narrow results with facets
Awesome:
• si.edu collections
  o http://collections.si.edu/search/results.jsp?q=blue
Good:
• IMA
  o http://www.imamuseum.org/search/ima/blue
• WAC (I'm biased)
  o http://beta.walkerart.org/magazine/type/articles/genre/film
Less awesome:
• British Museum
  o http://www.britishmuseum.org/search_results.aspx?searchText=blue&searchPrevious=blue&itemsPerPage=10
Broaden results
• Similar searches / More Like This
  o http://beta.walkerart.org/search/?q=absent+landlord
  o http://www.powerhousemuseum.com/collection/database/search_tags.php?tag=blue
  o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go Sort of weird, though.
• More Like This
  o We're trying it on detail pages: http://beta.walkerart.org/calendar/2011/merce-cunningham-dance-company
Dead ends / spell check
"Did you mean?"
• http://beta.walkerart.org/search/?q=absent+landlord
• http://www.vam.ac.uk/contentapi/search/?q=blu&search-submit=Go
This is really just spellcheck. But it's apparently really hard, since nobody's doing it.
Final thoughts
Can we just spider our own pages like Google?
• Sure. Lots of tools to do this, and it looks like that's how MOMA does it.
  o However... http://www.moma.org/search?query=%22ad+reinhardt%22+%22sum+of+days%22&page=1
  o http://www.moma.org/search?query=blu&page=1 (look at the mp4!)
Boosting
• what kind of boosting makes sense?
  o weight towards recent content: push down past events, maybe
  o "we know what you want": look at logs to see what people are searching for
So... "best practices"
• Unified search across all content
  o full-text search with stemming, phrases, etc.
• Coherent, user-centric divisions of content for faceting
• Prevent dead ends
  o show #s for facets
  o autocomplete query
• Help the user
  o "Did you mean?"
Or just give it to them, don't ask
Let's build that!
"Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable."
-- http://en.wikipedia.org/wiki/Solr
There's a tool for you...
http://wiki.apache.org/solr/IntegratingSolr
• ColdFusion - ColdFusion 9 now includes Apache Solr
• Django - Haystack
• Drupal - A Drupal module that integrates Apache Solr in Drupal.
• eZ Find - eZ Find, a solid solr integration to the open source CMS eZ Publish
• Forrest/Cocoon - SolrForrest
• Foswiki - A Foswiki plugin that integrates Apache Solr in Foswiki.
• Plone - collective.solr
• SVN - reposearch
• TYPO3
• Various Library Catalog Applications - Solr4Lib
• Woltlab Community Framework - A WCF package working with the burning board, the blog and all other WCF components.
• WordPress - solr-for-wordpress A WordPress plugin that replaces the default WordPress search with Solr.
• ZooKeeperIntegration
• OpenCms - opencms-solr
Hurry, hurry!
1. introducing Solr
2. build fulltext search & introduce dismax
3. facets
4. build autocomplete
5. did you mean?
Installation, fast test
user:~solr$ ls
solr-nightly.zip
user:~solr$ unzip -q solr-nightly.zip
user:~solr$ cd solr-nightly/example/
user:~/solr/example$ java -jar start.jar
That's it! You can actually do local development against that sort of setup and it works fine.
Installation, f'realz (Ubuntu)
apt-get install build-essential jetty \
    libjetty-extra openjdk-6-jdk
cp dist/apache-solr-3.4.0.war \
    /usr/share/jetty/webapps/solr.war
cp -r example/solr /usr/share/jetty/
edit /usr/share/jetty/solr/conf/schema.xml and solrconfig.xml
edit /etc/default/jetty: turn off no-start, make it bind to all ips, and set the java opts:
JAVA_OPTIONS="-Dsolr.solr.home=/usr/share/jetty/solr -Dsolr.data.dir=/usr/share/jetty/solr/data $JAVA_OPTIONS"
/etc/init.d/jetty start
For today:
http://172.16.0.67/
Explore the fieldtypes: core0
Get the sample text on your clipboard. In core0, click Admin, then Analysis
Field Names:
• id (string)
• text_ws
• text_general
• text_en
• phonetic
• text_general_rev
• alphaonlysort
core1: fulltext search engine
Click search on core1. Try it out. (dataset is Walker Art Center events)
Click "edit" on core1. Discuss.
core1: dismax query parser
DisMax is an abbreviation of Disjunction Max, and is a popular query mode with Solr.
Disjunction refers to the fact that your search is executed across multiple fields, e.g. title, body and keywords, with different relevance weights
Max means that if your word "foo" matches both title and body, the max score of these two (probably title match) is added to the score, not the sum of the two as a simple OR query would do. This gives more control over your ranking.
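To make the "max, not sum" point concrete, here is a toy sketch in Python. It is not Solr code: the per-field scores are made-up numbers, and the `tie` argument is modeled on the usual description of disjunction-max scoring (best field score, plus a `tie` fraction of the other field scores).

```python
def sum_score(field_scores):
    """Plain OR-style scoring: every matching field adds to the total."""
    return sum(field_scores.values())


def dismax_score(field_scores, tie=0.0):
    """Disjunction-max scoring: keep the best field score and add only a
    `tie` fraction of the remaining field scores (tie=0.0 is pure max)."""
    if not field_scores:
        return 0.0
    ordered = sorted(field_scores.values(), reverse=True)
    return ordered[0] + tie * sum(ordered[1:])


# Hypothetical scores for the word "foo" matching two fields of one document.
scores = {"title": 3.0, "body": 1.0}

print("sum:", sum_score(scores))
print("dismax:", dismax_score(scores))
print("dismax with tie=0.5:", dismax_score(scores, tie=0.5))
```

With pure max, a document that mentions the word in many weak fields cannot outrank one with a single strong title match, which is the ranking control the slide is describing.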
core1: dismax in practice
The DisMaxQParserPlugin is designed to process simple user entered phrases (without heavy syntax) and search for the individual words across several fields using different weighting (boosts) based on the significance of each field.
In English: it does a really good job helping you figure out what the user meant to look for.
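A rough sketch of what a dismax request might look like when built by hand. `defType` and `qf` are standard Solr request parameters; the host, port, core path, field names (`title`, `body`) and boost values are assumptions for illustration, not the workshop's actual setup.

```python
from urllib.parse import urlencode


def dismax_url(base_url, user_query, field_boosts):
    """Build a select URL that asks Solr to parse `user_query` with dismax,
    searching each field in `field_boosts` with its own weight."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {"defType": "dismax", "q": user_query, "qf": qf}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and fields: title matches count 5x more than body.
url = dismax_url(
    "http://localhost:8983/solr/core1/select",
    'chuck "close"',
    {"title": 10, "body": 2},
)
print(url)
```

The user's text goes into `q` untouched, quotes and all; the weighting lives entirely in `qf`, so editors can tune boosts without changing how people type queries.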
Try some quotes
chuck close
vs.
chuck "close"
Debug: what's going on?
core2: facets
99% chance your Solr library will abstract this for you, but it's good to know what's under the hood.
... we won't do it today, but you can facet by queries, not just field names.
So you can do things like this in one call:
• Give me all events matching the query
• Show how many by type (like we're doing)
• Show how many are happening today
• Show how many are happening "this weekend"
• ... etc.
• http://beta.walkerart.org/calendar/type/free-events
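The "one call" idea above could be sketched like this. `facet`, `facet.field` and `facet.query` are standard Solr parameters; the field names `type` and `start_date` are hypothetical, and the second range is a simple "next 7 days" stand-in, since a true "this weekend" range would need real dates computed by the caller.

```python
from urllib.parse import urlencode


def facet_params(user_query):
    """Return request parameters asking for results plus several counts:
    one per event type, and one per date-range facet query."""
    return [
        ("q", user_query),
        ("facet", "true"),
        # Count matches per value of a (hypothetical) "type" field.
        ("facet.field", "type"),
        # Count matches happening today (hypothetical "start_date" field).
        ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+1DAY]"),
        # Count matches in the next 7 days.
        ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+7DAYS]"),
    ]


query_string = urlencode(facet_params("dance"))
print("/select?" + query_string)
```

A list of pairs is used instead of a dict because `facet.query` has to appear more than once in the same request.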
core3: Autocomplete (a)
Read this later: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
This is a very popular and decent solution. It really only works the way he suggests, though, by seeding with popular queries (since it starts at character 0). If you have this data, go for it, but our top queries actually aren't very interesting: "jobs", "staff", "hours", etc.
We want something that can complete any phrase that occurs in our corpus (a), ideally in the middle of the phrase (b).
Key technologies
ShingleFilterFactory
Make tokens out of phrases.
TermsComponent
"return terms and document frequency of those terms"
Post-processing for stopwords
Index them in phrases, but remove from suggestions in certain scenarios
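The slides do not say which scenarios trigger stopword removal, so the following is only one plausible reading, sketched in Python: keep stopwords inside a suggested phrase, but drop suggestions that begin or end with one. The stopword list and the function name are made up for the example.

```python
# Small illustrative stopword list (an assumption, not the real stopwords.txt).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def clean_suggestions(suggestions, stopwords=STOPWORDS):
    """Drop suggested phrases that begin or end with a stopword, keeping
    phrases where stopwords only appear in the middle."""
    kept = []
    for phrase in suggestions:
        words = phrase.split()
        if not words:
            continue
        if words[0] in stopwords or words[-1] in stopwords:
            continue
        kept.append(phrase)
    return kept


raw = ["art of", "art of the book", "of the book", "the book", "book"]
print(clean_suggestions(raw))
```

This runs on the terms returned by the search server, after the query, which is why stopwords can stay in the index and still be kept out of awkward suggestions like "art of".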
ShingleFilterFactory
<fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />-->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
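For a feel of what shingling produces, here is a small Python imitation. It is not the Lucene implementation; it assumes the filter emits single words plus every run of 2 up to `max_size` adjacent words, which is how its defaults are usually described.

```python
def shingles(tokens, max_size=5, min_size=2, output_unigrams=True):
    """Return word n-grams ("shingles") for a token list: for each start
    position, the single token (optional) and every run of `min_size`
    to `max_size` adjacent tokens that fits."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break
            out.append(" ".join(tokens[start:end]))
    return out


tokens = "merce cunningham dance company".split()
for term in shingles(tokens, max_size=3):
    print(term)
```

Every printed line becomes an indexed term, so a prefix lookup on "cunningham d" can return "cunningham dance" as a complete phrase. The cost is index size, which grows quickly with `max_size`.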
TermsComponent
<!-- in solrconfig.xml -->
<arr name="last-components">
  <str>terms</str>
</arr>
# strict "starts with"
/select?terms=true&terms.fl=auto_text&terms.prefix=term
OR
# attempt at "infix" (sloooow on big corpus)
/select?terms=true&terms.fl=auto_text&terms.regex=(^|.* +)term.*
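The difference between the two requests can be tried out locally. This Python sketch applies the same two rules to an in-memory list of made-up terms; it assumes the regex has to match the whole indexed term, and it says nothing about Solr's actual performance.

```python
import re


def prefix_matches(terms, typed):
    """Strict "starts with": the indexed term begins with the typed text."""
    return [t for t in terms if t.startswith(typed)]


def infix_matches(terms, typed):
    """"Infix": the typed text starts the term or starts any later word in
    it, using the same pattern shape as the slide, (^|.* +)term.*"""
    pattern = re.compile(r"(^|.* +)" + re.escape(typed) + r".*")
    return [t for t in terms if pattern.fullmatch(t)]


# Hypothetical indexed shingles.
terms = ["chuck close", "close up", "disclose", "closing night"]

print(prefix_matches(terms, "clos"))
print(infix_matches(terms, "clos"))
```

Prefix lookups can jump straight to the right spot in a sorted term list, while the regex has to be tested against every term, which is a likely reason the infix version is slow on a big corpus.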
core4: Autocomplete (b)
Infix. Big challenges, decent hacks.
Smaller shingles.
Fewer words (only title & subtitle).
Still... kinda slow in our beta site. Probably have to move to prefix. :(
core5: spellcheck
Similar to the setup for autocomplete
Just remember to call a url with spellcheck.build=true to get things started.
For better results, use spellcheck.q and escape spaces. This makes it a phrase instead of spellchecking individual words and correcting them to dead ends.
select?q=chuc+closee&spellcheck.q=chuc\+closee
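A small sketch of building that request in Python. `spellcheck`, `spellcheck.q` and `spellcheck.build` are standard Solr parameters; the endpoint URL is an assumption. Note that `urlencode` writes the backslash as `%5C`, which is the same request as the raw `\+` form shown above.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url, user_query, build=False):
    """Build a select URL that searches for `user_query` and also asks the
    spellchecker to treat the whole query as one phrase (spaces escaped)."""
    params = [
        ("q", user_query),
        ("spellcheck", "true"),
        # Escape spaces so the phrase is checked as a unit, not word by word.
        ("spellcheck.q", user_query.replace(" ", "\\ ")),
    ]
    if build:
        # Only needed once (or after reindexing) to build the dictionary.
        params.append(("spellcheck.build", "true"))
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint for the workshop's spellcheck core.
url = spellcheck_url("http://localhost:8983/solr/core5/select", "chuc closee")
print(url)
```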
Search is hard...
Our content team (and I know the MET too with their new site) constantly struggle to understand why certain results come up over others. They always ask us to make tweaks which inevitably hurt other results. It’s a constant battle for perfection and I have to do a lot of educating.
· Retail results come up over artworks because they actually write good descriptions! We even set our boost on retail to 0.5.
· Why does “after van Gogh” show up before the real “van Gogh”?
· Why does last year’s event show up before this year’s?
While there are answers to all these, it’s inevitably a slippery slope. My final answer is to usually use the live search taxonomy. It is in place to tell the search engine what users are looking for specific to your institution. People just need to understand that it is a content task just as much as creating a page.
-- Charlie Moad, IMA
If we're bored
ASCII / UTF8
http://beta.walkerart.org/search/?q=jerome+bel
http://beta.walkerart.org/search/?q=J%C3%A9r%C3%B4me+Bel
<!-- remove diacritics BEFORE stemming to match cases without diacritics -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
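To see why both URLs can return the same results, here is a rough Python approximation of diacritic folding. It only strips combining accents via Unicode decomposition; the real ASCIIFoldingFilterFactory covers more characters than this (letters such as "ø" or "ß" are not handled here), so treat it as an illustration of the idea only.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate diacritic folding: decompose characters, then drop the
    combining marks so "é" becomes "e" and "ô" becomes "o"."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


indexed = fold_to_ascii("Jérôme Bel").lower()
typed = fold_to_ascii("jerome bel").lower()
print(indexed, "|", typed, "|", indexed == typed)
```

Because the same folding runs at index time and at query time, the accented and unaccented spellings end up as the same term.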
boost in general, elevate.xml
bq=(instances:{20110927 TO *})^1000 OR (display_type:Walker\ Shop)^20 OR (display_type:Events)^1
http://wiki.apache.org/solr/QueryElevationComponent - "sponsored search"
Index non-data resources (pdf, docs, etc.): Apache Tika