VIVO 1.8 SearchIndexer .

Page 1: VIVO 1.8 SearchIndexer .

VIVO 1.8 SearchIndexer

http://gist.github.com/j2blake/388cbc50efb611481698

Page 2: VIVO 1.8 SearchIndexer .

Introduction

Efficient

Configurable

Visible

Maintainable

Page 3: VIVO 1.8 SearchIndexer .

Outline

Theory

Flow

Modularity

Configuration

Performance

Testing

Questions

Task Force

Page 4: VIVO 1.8 SearchIndexer .

Theory

Adjust automatically to changes in the model.

User experience does not require that updates are synchronous.

When a triple is changed, several search records may require updating.

Several altered triples may affect the same search record.

Each search record can be updated independently.

Page 5: VIVO 1.8 SearchIndexer .

Theory

Respond to the Search Indexer admin page: request a rebuild.

Respond to search indexing service API: specify a list of URIs to be re-indexed.

Page 6: VIVO 1.8 SearchIndexer .

Flow

Respond to changes in the ABox:
From the list of altered triples, build a list of URIs that require re-indexing.

Respond to a request for a rebuild:
Build a list of all URIs in the model and re-index.

Respond to the search indexing service:
Accept a list of URIs for re-indexing.

Page 7: VIVO 1.8 SearchIndexer .

Flow – asynchronous

[Diagram. Labels: Change to ABox model; Search Indexer Admin page; Search Indexing Service; Update Statements Task; Rebuild Index Task; Update URIs Task; Task Queue; Work Unit (three shown); Thread Pool.]

Page 8: VIVO 1.8 SearchIndexer .

Flow – update URIs

Accept a list of URIs to be indexed.

Determine the eligible URIs from the list:
Must be an existing individual with at least one VClass (type).
Must pass the list of SearchIndexExcluders.

Remove the ineligible URIs from the index.

Build a search document for each eligible URI:
Execute the list of DocumentModifiers.
Add the completed document to the index.
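The update-URIs steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the VIVO implementation: the class name, the map standing in for the ABox, the map standing in for the search index, and the use of Predicate and BiConsumer in place of SearchIndexExcluder and DocumentModifier are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

/** Hypothetical sketch of the "update URIs" flow; not the actual VIVO classes. */
class UpdateUrisSketch {
    // Stand-in for the ABox: individual URI -> URIs of its types (VClasses).
    final Map<String, Set<String>> typesByUri = new HashMap<>();
    // Stand-in for the search index: URI -> document (field name -> value).
    final Map<String, Map<String, String>> index = new HashMap<>();
    // Stand-ins for SearchIndexExcluders: return true to exclude a URI.
    final List<Predicate<String>> excluders = new ArrayList<>();
    // Stand-ins for DocumentModifiers: add fields to the document for a URI.
    final List<BiConsumer<String, Map<String, String>>> modifiers = new ArrayList<>();

    /** Eligible = existing individual with at least one type, and not excluded. */
    boolean isEligible(String uri) {
        Set<String> types = typesByUri.get(uri);
        if (types == null || types.isEmpty()) {
            return false;
        }
        for (Predicate<String> excluder : excluders) {
            if (excluder.test(uri)) {
                return false;
            }
        }
        return true;
    }

    void updateUris(Collection<String> uris) {
        for (String uri : uris) {
            if (!isEligible(uri)) {
                index.remove(uri); // ineligible URIs are removed from the index
                continue;
            }
            Map<String, String> doc = new HashMap<>();
            doc.put("uri", uri);
            for (BiConsumer<String, Map<String, String>> modifier : modifiers) {
                modifier.accept(uri, doc); // each modifier contributes fields
            }
            index.put(uri, doc); // add the completed document
        }
    }

    public static void main(String[] args) {
        UpdateUrisSketch s = new UpdateUrisSketch();
        s.typesByUri.put("http://example.org/person1",
                Collections.singleton("http://example.org/Person"));
        s.typesByUri.put("http://example.org/hidden",
                Collections.singleton("http://example.org/Person"));
        s.excluders.add(uri -> uri.endsWith("/hidden"));
        s.modifiers.add((uri, doc) -> doc.put("name", "label for " + uri));
        s.updateUris(Arrays.asList("http://example.org/person1",
                "http://example.org/hidden", "http://example.org/missing"));
        System.out.println(s.index.keySet());
    }
}
```

Only person1 ends up in the index here: the second URI is rejected by the excluder and the third does not exist in the stand-in ABox.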

Page 9: VIVO 1.8 SearchIndexer .

Flow – update URIs

Page 10: VIVO 1.8 SearchIndexer .

Flow – update URIs

Page 11: VIVO 1.8 SearchIndexer .

Flow – rebuild index

Get a list of all individuals in the ABox.

Update these URIs (use the existing logic to update URIs).

Remove obsolete records from the index:
Anything that was indexed prior to the rebuild.
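One way to picture the rebuild steps above is to stamp each record when it is indexed and, after re-indexing everything, drop whatever still carries a pre-rebuild stamp. The sketch below assumes a simple logical clock and a map in place of the search index; both are hypothetical and say nothing about how VIVO actually marks its records.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "rebuild, then remove obsolete records". */
class RebuildSketch {
    // Stand-in for the index: URI -> logical time at which it was last indexed.
    final Map<String, Long> indexedAt = new HashMap<>();
    private long clock = 0;

    /** Stand-in for the existing "update URIs" logic. */
    void updateUris(Collection<String> uris) {
        for (String uri : uris) {
            indexedAt.put(uri, ++clock);
        }
    }

    void rebuild(Collection<String> allIndividuals) {
        long rebuildStart = clock; // every stamp <= this predates the rebuild
        updateUris(allIndividuals);
        // Anything not re-stamped during the rebuild is obsolete.
        indexedAt.values().removeIf(stamp -> stamp <= rebuildStart);
    }

    public static void main(String[] args) {
        RebuildSketch r = new RebuildSketch();
        r.updateUris(Arrays.asList("a", "b")); // earlier indexing
        r.rebuild(Arrays.asList("b", "c"));    // "a" is no longer in the ABox
        System.out.println(r.indexedAt.keySet());
    }
}
```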

Page 12: VIVO 1.8 SearchIndexer .

Flow – rebuild index

Page 13: VIVO 1.8 SearchIndexer .

Flow – rebuild index

Page 14: VIVO 1.8 SearchIndexer .

Flow – change to ABox

Accumulate changes into a batch:
Delimited by a quiescent interval.
Optional delimiting calls to pause() and unpause().

Produce a set of URIs to be updated:
Each statement is examined by the list of IndexingUriFinders.

Update these URIs (use the existing logic to update URIs).
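The batching described above might look something like the following sketch. It covers only the pause()/unpause() delimiters: the quiescent-interval timer is left out, so an unpaused change closes its batch immediately. The Triple class, the Function-based finders, and the scheduledBatches list are hypothetical stand-ins, not VIVO types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of batching ABox changes; not the VIVO classes. */
class ChangeBatcherSketch {
    /** Stand-in for a changed triple (subject, predicate, object). */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Stand-ins for IndexingUriFinders: statement -> URIs needing re-indexing.
    private final List<Function<Triple, List<String>>> uriFinders;
    private final List<Triple> pending = new ArrayList<>();
    private boolean paused = false;
    // Each entry is one batch of URIs handed on to the "update URIs" logic.
    final List<Set<String>> scheduledBatches = new ArrayList<>();

    ChangeBatcherSketch(List<Function<Triple, List<String>>> uriFinders) {
        this.uriFinders = uriFinders;
    }

    synchronized void pause() { paused = true; }

    synchronized void unpause() { paused = false; closeBatch(); }

    synchronized void statementChanged(Triple t) {
        pending.add(t);
        // A fuller version would wait for a quiescent interval here;
        // this sketch closes the batch at once unless paused.
        if (!paused) closeBatch();
    }

    private void closeBatch() {
        if (pending.isEmpty()) return;
        Set<String> uris = new LinkedHashSet<>(); // a set drops duplicate URIs
        for (Triple t : pending) {
            for (Function<Triple, List<String>> finder : uriFinders) {
                uris.addAll(finder.apply(t));
            }
        }
        pending.clear();
        scheduledBatches.add(uris);
    }

    public static void main(String[] args) {
        List<Function<Triple, List<String>>> finders = new ArrayList<>();
        finders.add(t -> Collections.singletonList(t.s)); // subject is affected
        finders.add(t -> Collections.singletonList(t.o)); // so is the object
        ChangeBatcherSketch b = new ChangeBatcherSketch(finders);
        b.pause();
        b.statementChanged(new Triple("a", "p", "b"));
        b.statementChanged(new Triple("a", "q", "c"));
        b.unpause();
        System.out.println(b.scheduledBatches);
    }
}
```

Because the two changes arrive between pause() and unpause(), they produce a single batch in which the shared subject appears once.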

Page 15: VIVO 1.8 SearchIndexer .

Flow – change to ABox

Page 16: VIVO 1.8 SearchIndexer .

Flow – change to ABox

Page 17: VIVO 1.8 SearchIndexer .

Modularity

Application.java

SearchIndexer.java

ConfigurationBeanLoader.java

[vivo-home]/config/applicationSetup.n3
Read at runtime.

Page 18: VIVO 1.8 SearchIndexer .

Application.java

public interface Application {
    ServletContext getServletContext();
    VitroHomeDirectory getHomeDirectory();

    SearchEngine getSearchEngine();
    SearchIndexer getSearchIndexer();

    ImageProcessor getImageProcessor();
    FileStorage getFileStorage();

    ContentTripleSource getContentTripleSource();
    ConfigurationTripleSource getConfigurationTripleSource();

    TBoxReasonerModule getTBoxReasonerModule();

    void shutdown();
}

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/c25ef7c0638f46473e935fec86545c689ffd4fc9/Application.java

Page 19: VIVO 1.8 SearchIndexer .

SearchIndexer.java

public interface SearchIndexer extends Application.Module {
    void startup(Application app, ComponentStartupStatus ss);
    void shutdown(Application app);

    void pause();
    void unpause();

    void addListener(Listener listener);
    void removeListener(Listener listener);

    void scheduleUpdatesForStatements(List<Statement> changes);
    void scheduleUpdatesForUris(Collection<String> uris);
    void rebuildIndex();

    SearchIndexerStatus getStatus();
}

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/a1ffb6330dec0010cdff0ce283178a57b1fb1048/SearchIndexer.java

Page 20: VIVO 1.8 SearchIndexer .

ConfigurationBeanLoader

public ConfigurationBeanLoader(Model model);

/**
 * Load the instance with this URI,
 * if it is assignable to this class.
 */
public <T> T loadInstance(String uri, Class<T> resultClass)
        throws ConfigurationBeanLoaderException;

/**
 * Find all of the resources with the specified class,
 * and instantiate them.
 */
public <T> Set<T> loadAll(Class<T> resultClass)
        throws ConfigurationBeanLoaderException;

Page 21: VIVO 1.8 SearchIndexer .

ApplicationImpl.java

@Override
public SearchIndexer getSearchIndexer() {
    return searchIndexer;
}

@Property(uri = "http://vitro.mannlib.cornell.edu/ns/vitro/ApplicationSetup#hasSearchIndexer")
public void setSearchIndexer(SearchIndexer si) {
    searchIndexer = si;
}

@Validation
public void validate() throws Exception {
    if (searchIndexer == null) {
        throw new IllegalStateException(
                "Configuration did not include a SearchIndexer.");
    }
}

Page 22: VIVO 1.8 SearchIndexer .

applicationSetup.n3

@prefix : <http://vitro.mannlib.cornell.edu/ns/vitro/ApplicationSetup#> .

:application
    a   <java:edu.cornell.mannlib.vitro.webapp.application.ApplicationImpl> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.Application> ;
    :hasSearchEngine              :instrumentedSearchEngineWrapper ;
    :hasSearchIndexer             :basicSearchIndexer ;
    :hasImageProcessor            :jaiImageProcessor ;
    :hasFileStorage               :ptiFileStorage ;
    :hasContentTripleSource       :sdbContentTripleSource ;
    :hasConfigurationTripleSource :tdbConfigurationTripleSource ;
    :hasTBoxReasonerModule        :jfactTBoxReasonerModule .

# ...

:basicSearchIndexer
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.SearchIndexerImpl> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchIndexer.SearchIndexer> ;
    :threadPoolSize "10" .

https://gist.githubusercontent.com/j2blake/388cbc50efb611481698/raw/7b009909ee8f812366cde14f3c1253ded514f85e/applicationSetup.n3

Page 23: VIVO 1.8 SearchIndexer .

Configuration

UriFinders:
What URIs are affected by an altered triple?

Excluders:
What URIs should not have documents in the index?

DocumentModifiers:
What data belongs in the search document?

Page 25: VIVO 1.8 SearchIndexer .

Configuration – examples

:searchExcluder_typeExcluder
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.exclusions.ExcludeBasedOnType> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.exclusions.SearchIndexExcluder> ;
    :excludes
        "http://www.w3.org/2002/07/owl#AnnotationProperty" ,
        "http://www.w3.org/2002/07/owl#DatatypeProperty" ,
        "http://www.w3.org/2002/07/owl#ObjectProperty" .

:uriFinder_forDataProperties
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.indexing.IndexingUriFinder> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.AdditionalURIsForDataProperties> .

:documentModifier_NameFieldBooster
    a   <java:edu.cornell.mannlib.vitro.webapp.searchindex.document.FieldBooster> ,
        <java:edu.cornell.mannlib.vitro.webapp.searchindex.document.DocumentModifier> ;
    :hasTargetField "nameRaw" ;
    :hasTargetField "nameLowercase" ;
    :hasTargetField "nameUnstemmed" ;
    :hasTargetField "nameStemmed" ;
    :hasBoost "1.2"^^xsd:float .

Page 26: VIVO 1.8 SearchIndexer .

Performance

Simple efficiency:
Don’t ask for more information than we need.
Don’t discard information we have obtained.

Multi-threading:
Finding URIs for a collection of altered statements.
Need to remove duplicate URIs.
Building the search records for a collection of URIs.
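As an illustration of the multi-threading point above, the sketch below runs a URI finder over a collection of statements on a fixed-size thread pool and collects the results in a concurrent set, which removes duplicates as a side effect. Statements are plain strings here, and the finder and the pool size parameter are hypothetical; the actual SearchIndexerImpl may organize its work units differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: find URIs for altered statements in parallel. */
class ParallelUriFinderSketch {
    static Set<String> findUris(List<String> statements,
            Function<String, List<String>> finder, int threadPoolSize) {
        // A concurrent set is safe to fill from many threads and drops duplicates.
        Set<String> uris = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (String stmt : statements) {
                // One work unit per statement.
                futures.add(pool.submit(() -> { uris.addAll(finder.apply(stmt)); }));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // wait for the work unit and surface its failure, if any
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return uris;
    }

    public static void main(String[] args) {
        // Each "statement" is just "subject object"; the finder returns both ends.
        List<String> changes = Arrays.asList("a b", "b c", "c a");
        Set<String> uris = findUris(changes, s -> Arrays.asList(s.split(" ")), 4);
        System.out.println(uris.size() + " distinct URIs");
    }
}
```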

Page 27: VIVO 1.8 SearchIndexer .

Performance – memory

Keeping the memory footprint low:

A list of URIs:
One URI is likely to be ~100 bytes.
100,000 URIs is likely to be ~10 megabytes.

A list of Individuals:
One Individual may be ~100 kilobytes.
50,000 Individuals may be ~5 gigabytes.
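The estimates above are straightforward multiplication, shown here with decimal units (1 kilobyte = 1,000 bytes). The per-item sizes are the rough figures from the slide, not measurements, and the class and method names are only for illustration.

```java
/** Back-of-the-envelope check of the footprint estimates. */
class FootprintEstimate {
    static long totalBytes(long bytesPerItem, long itemCount) {
        return bytesPerItem * itemCount;
    }

    public static void main(String[] args) {
        long uriList = totalBytes(100L, 100_000L);         // ~100 bytes per URI
        long individuals = totalBytes(100_000L, 50_000L);  // ~100 kilobytes per Individual
        System.out.println(uriList / 1_000_000L + " MB for the list of URIs");
        System.out.println(individuals / 1_000_000_000L + " GB for the list of Individuals");
    }
}
```

The gap of several hundred times between the two totals is the reason to hold URIs rather than Individuals in memory.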

Page 28: VIVO 1.8 SearchIndexer .

Performance – timing

Continuous improvement: timings from the developer panel.

Page 30: VIVO 1.8 SearchIndexer .

Testing

Add to the battery of Selenium tests.

For example:
Create a person, and a book written by that person.
Search for the person’s name: both the person and the book should be returned.
Search for the title of the book: both the person and the book should be returned.

A continuing effort:
So far, not very successful.

Page 31: VIVO 1.8 SearchIndexer .

Questions

Other configurable document modifier classes?

Other configurable excluder classes?

What about configurable URI finder classes?

Page 32: VIVO 1.8 SearchIndexer .

Questions

Boost – multiplicative would make it order-independent. Do we care?

Private data – the SPARQL-based DocumentModifiers do not use a filtered connection. Is that a problem?

At what point does it make sense to do a full rebuild instead of responding to model changes?
Finding URIs can be more expensive than building search documents.
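On the boost question above, a small sketch shows why multiplicative boosts are order-independent while a boost that overwrites the current value is not. The operators here are illustrative only; the slides do not say how the existing modifiers combine their boosts, so the overwriting case is an assumption made for contrast.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: how the order of boost modifiers affects the result. */
class BoostOrderSketch {
    static double apply(double initial, List<DoubleUnaryOperator> modifiers) {
        double boost = initial;
        for (DoubleUnaryOperator m : modifiers) {
            boost = m.applyAsDouble(boost);
        }
        return boost;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator times2 = b -> b * 2.0;   // multiplicative
        DoubleUnaryOperator times1_5 = b -> b * 1.5; // multiplicative
        DoubleUnaryOperator setTo4 = b -> 4.0;       // overwrites (assumed for contrast)

        // Multiplication commutes, so both orders give the same boost.
        System.out.println(apply(1.0, Arrays.asList(times2, times1_5))
                == apply(1.0, Arrays.asList(times1_5, times2)));
        // Mixing in an overwrite makes the result depend on the order.
        System.out.println(apply(1.0, Arrays.asList(setTo4, times2))
                == apply(1.0, Arrays.asList(times2, setTo4)));
    }
}
```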

Page 33: VIVO 1.8 SearchIndexer .

Questions

What should be the default configuration?

Page 34: VIVO 1.8 SearchIndexer .

Task Force

Don Ellsborg and I have talked about creating a group from the community.

Short time-frame, specific deliverable:
The default configuration for the SearchIndexer, post-release 1.8.

Page 35: VIVO 1.8 SearchIndexer .

Questions?