Improving the Solr Update Chain
![Page 1: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/1.jpg)
Improving the Solr Update Chain
Jan Høydahl
![Page 2: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/2.jpg)
What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion
![Page 3: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/3.jpg)
![Page 4: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/4.jpg)
![Page 5: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/5.jpg)
Jan Høydahl
1995: Developer telecom
1998: Java developer
2000: Search - FAST
2006: Lucene
2007: Cominvent
2011: Lucene committer
> 100 projects
![Page 6: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/6.jpg)
Cominvent AS
Consulting & support
Lucene/Solr
FAST
www.solrtraining.com
![Page 7: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/7.jpg)
Why document processing?
Analysis is field oriented
Filters only see the “local” field
![Page 8: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/8.jpg)
Why document processing?
But what if you want to:
Add or remove fields?
Make decisions based on other fields?
We need a way to modify the Document
![Page 9: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/9.jpg)
Why document processing?
[Figure, built up over four slides: Doc1 with fields name, postcode, cv_pdf_url; the query “programmer near Barcelona”; added fields cv_text and latlong; Client; 3rd party pipeline]
![Page 13: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/13.jpg)
Solr’s Update Chain
![Page 14: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/14.jpg)
The Update Chain
![Page 15: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/15.jpg)
The Update Chain
[Figure, built up over four slides: a Doc with fields name, postcode, cv_pdf_url passes through PostcodeToLatLongProcessor (adds latlong), UrlFetcherProcessor (adds cv_pdf_bin) and TikaExtractingProcessor (adds cv_text)]
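The build-up above can be sketched in plain Java. This is only an illustration of the pattern, assuming a document modelled as a `Map` and a hypothetical `DocProcessor` interface. It does not use Solr's classes (there the base class is `UpdateRequestProcessor`), and the postcode lookup, fetch and extraction steps are placeholders with made-up values.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an update processor: it may modify the whole document.
interface DocProcessor {
    void process(Map<String, Object> doc);
}

final class CvChainSketch {
    // Made-up lookup; a real processor would consult a geocoding table or service.
    static final DocProcessor POSTCODE_TO_LATLONG = doc -> {
        if ("08001".equals(doc.get("postcode"))) {
            doc.put("latlong", "41.38,2.17");
        }
    };

    // Placeholder for fetching the PDF behind cv_pdf_url; no network access here.
    static final DocProcessor URL_FETCHER = doc -> {
        Object url = doc.get("cv_pdf_url");
        if (url != null) {
            doc.put("cv_pdf_bin", ("bytes-of:" + url).getBytes(StandardCharsets.UTF_8));
        }
    };

    // Placeholder for text extraction (the slide uses Tika for this step).
    static final DocProcessor TEXT_EXTRACTOR = doc -> {
        if (doc.containsKey("cv_pdf_bin")) {
            doc.put("cv_text", "extracted text would go here");
        }
    };

    // Runs the processors in order; each one sees fields added by the previous ones.
    static Map<String, Object> run(List<DocProcessor> chain, Map<String, Object> doc) {
        for (DocProcessor p : chain) {
            p.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        run(List.of(POSTCODE_TO_LATLONG, URL_FETCHER, TEXT_EXTRACTOR), doc);
        System.out.println(doc.keySet());
    }
}
```

The order matters: the extraction step only has something to work on because the fetch step ran before it.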
![Page 19: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/19.jpg)
![Page 20: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/20.jpg)
![Page 21: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/21.jpg)
How it’s wired
Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain
Chain definition in solrconfig.xml:
![Page 22: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/22.jpg)
Other examples
Language Identification
![Page 23: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/23.jpg)
Other examples
Entity extraction
The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.
[Figure annotations on the text above: Company, Location, Date]
![Page 24: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/24.jpg)
![Page 25: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/25.jpg)
Writing your own processor
![Page 26: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/26.jpg)
![Page 27: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/27.jpg)
![Page 28: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/28.jpg)
![Page 29: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/29.jpg)
Writing your own processor
•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
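Three of these tips (parameterized, prefixed parameter names, testable methods) can be shown in a small sketch. It is written in plain Java without Solr's classes, so that it runs on its own. The processor name, the `truncate.` prefix and both parameters are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch, not Solr's API: a generic, parameterized processor whose
// parameter names share a prefix so they cannot clash with other processors.
final class TruncateFieldSketch {
    static final String PREFIX = "truncate.";   // hypothetical parameter prefix

    private final String field;
    private final int maxLength;

    // Behaviour is driven by configuration, so one class serves many use cases.
    TruncateFieldSketch(Map<String, String> params) {
        this.field = params.getOrDefault(PREFIX + "field", "title");
        this.maxLength = Integer.parseInt(params.getOrDefault(PREFIX + "maxLength", "100"));
    }

    // Pure helper with no framework dependencies, so it can be unit tested directly.
    static String truncate(String value, int maxLength) {
        return value.length() <= maxLength ? value : value.substring(0, maxLength);
    }

    // Document-level step: reads the configured field and writes it back truncated.
    void process(Map<String, Object> doc) {
        Object value = doc.get(field);
        if (value instanceof String) {
            doc.put(field, truncate((String) value, maxLength));
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PREFIX + "field", "title");
        params.put(PREFIX + "maxLength", "5");
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Improving the Solr Update Chain");
        new TruncateFieldSketch(params).process(doc);
        System.out.println(doc.get("title"));
    }
}
```

Keeping the actual logic in a static helper means it can be tested without starting a Solr core.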
![Page 30: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/30.jpg)
Web crawl with Language Detection @ Oslo University
![Page 31: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/31.jpg)
Solr @ Oslo University
![Page 32: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/32.jpg)
<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
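As a rough illustration of the regexp replace step in this chain, the sketch below applies the same pattern and replacement to the two content fields. The behaviour is an assumption inferred from the parameter names (`fl`, `pattern`, `replacement`). The class is a plain-Java stand-in, not the `no.uio` processor itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Stand-in sketch: collapses runs of whitespace (including non-breaking
// spaces, U+00A0) into a single space in the listed fields.
final class WhitespaceNormalizeSketch {
    // Same pattern and replacement as in the configuration.
    static final Pattern PATTERN = Pattern.compile("[\\s\\u00A0]+");
    static final String REPLACEMENT = " ";

    static void apply(Map<String, Object> doc, List<String> fields) {
        for (String field : fields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                doc.put(field, PATTERN.matcher((String) value).replaceAll(REPLACEMENT));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("content_no", "Hei \u00A0\n  verden");
        apply(doc, List.of("content_no", "content_en"));
        System.out.println(doc.get("content_no"));
    }
}
```

The non-breaking space is listed explicitly because Java's `\s` does not match U+00A0 by default.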
![Page 39: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/39.jpg)
Donations back to Apache
SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)
Many thanks for the donations!
![Page 40: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/40.jpg)
![Page 41: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/41.jpg)
![Page 42: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/42.jpg)
Room for improvement?
![Page 43: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/43.jpg)
Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub chains
Does not scale very well
Lacks native scripting language support
![Page 44: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/44.jpg)
Improvements
Pain:
Potentially expensive initialization
StaticRankProcessor: read & parse 50,000 lines

Proposed cure:
Keep persistent state object in factory:
private final Map<Object,Object> sharedObjCache
new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
Processor uses sharedObjCache for state
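A minimal sketch of this cure, under stated assumptions: the factory is long-lived, a new processor is created per request, and the expensive parse is simulated by a small hard-coded map. The class names, the `"ranks"` cache key and the load counter are hypothetical. Solr's real factory and processor base classes are not used here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Long-lived factory that owns the shared state.
final class StaticRankFactorySketch {
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

    // Called once per request; cheap because the parsed data is reused.
    StaticRankProcessorSketch getInstance() {
        return new StaticRankProcessorSketch(sharedObjCache);
    }
}

// Short-lived processor that keeps its state in the factory's cache.
final class StaticRankProcessorSketch {
    static final AtomicInteger LOADS = new AtomicInteger(); // counts expensive loads

    private final Map<String, Integer> ranks;

    @SuppressWarnings("unchecked")
    StaticRankProcessorSketch(Map<Object, Object> sharedObjCache) {
        // Only the first processor triggers the load; later ones find the entry.
        this.ranks = (Map<String, Integer>)
                sharedObjCache.computeIfAbsent("ranks", key -> loadRanks());
    }

    // Stand-in for reading and parsing the large rank file.
    private static Map<String, Integer> loadRanks() {
        LOADS.incrementAndGet();
        Map<String, Integer> parsed = new HashMap<>();
        parsed.put("http://example.com/", 10);
        return parsed;
    }

    void process(Map<String, Object> doc) {
        Object url = doc.get("canonicalurl");
        Integer rank = (url == null) ? null : ranks.get(url);
        if (rank != null) {
            doc.put("docrank", rank);
        }
    }

    public static void main(String[] args) {
        StaticRankFactorySketch factory = new StaticRankFactorySketch();
        factory.getInstance();
        factory.getInstance();
        System.out.println("expensive loads: " + LOADS.get());
    }
}
```

`ConcurrentHashMap.computeIfAbsent` is used so that concurrent requests do not parse the file twice.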
![Page 45: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/45.jpg)
![Page 46: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/46.jpg)
Improvements
Pain:
Multiple chains often need identical processors
UiO’s two chains share 80% -> copy/paste

Proposed cure:
Allow sharing of named instances
Define: <processor name="langid" class="..">
Refer: <processor ref="langid" />
See SOLR-2823
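The idea of named, shared instances can be sketched as a small registry. This is not how SOLR-2823 is implemented. It only shows the define-once, refer-many pattern, with hypothetical class and method names and a processor modelled as a `Consumer` over a `Map` document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Stand-in sketch: processors are defined once under a name and chains
// refer to them by that name instead of repeating their configuration.
final class NamedProcessorRegistrySketch {
    private final Map<String, Consumer<Map<String, Object>>> named = new HashMap<>();

    // Corresponds to <processor name="..." class="..">.
    void define(String name, Consumer<Map<String, Object>> processor) {
        named.put(name, processor);
    }

    // Corresponds to a chain made of <processor ref="..." /> entries.
    List<Consumer<Map<String, Object>>> chain(String... refs) {
        List<Consumer<Map<String, Object>>> chain = new ArrayList<>();
        for (String ref : refs) {
            Consumer<Map<String, Object>> processor = named.get(ref);
            if (processor == null) {
                throw new IllegalArgumentException("Unknown processor ref: " + ref);
            }
            chain.add(processor);
        }
        return chain;
    }

    public static void main(String[] args) {
        NamedProcessorRegistrySketch registry = new NamedProcessorRegistrySketch();
        registry.define("langid", doc -> doc.put("language", "no"));
        registry.define("log", doc -> System.out.println("indexed " + doc));
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Hei");
        for (Consumer<Map<String, Object>> p : registry.chain("langid", "log")) {
            p.accept(doc);
        }
    }
}
```

Two chains built from the same registry share the same `langid` instance, so its configuration lives in one place.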
![Page 47: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/47.jpg)
![Page 48: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/48.jpg)
Improvements
Pain:
Chains are linear only
Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
New scriptable Update Chain - alternative to XML
Script chain logic in solr/conf/updateproc.groovy
Full flexibility:
chain myChain {
  if(doc.getFieldValue("type").equals("pdf")) process(tikaproc)
}
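The same conditional step can be expressed in plain Java as a guard around a processor. This is a sketch of the branching idea only, not of the Groovy DSL proposed in SOLR-2841. The `when` helper and the stand-in `tikaproc` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Stand-in sketch: wraps a processor so it only runs when a condition on the
// document holds, which a purely linear chain cannot express.
final class ConditionalChainSketch {
    static Consumer<Map<String, Object>> when(
            Predicate<Map<String, Object>> condition,
            Consumer<Map<String, Object>> processor) {
        return doc -> {
            if (condition.test(doc)) {
                processor.accept(doc);
            }
        };
    }

    public static void main(String[] args) {
        // Placeholder for the Tika step from the slide.
        Consumer<Map<String, Object>> tikaproc = doc -> doc.put("cv_text", "extracted text");
        Consumer<Map<String, Object>> step = when(doc -> "pdf".equals(doc.get("type")), tikaproc);

        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");
        step.accept(pdf);
        step.accept(html);
        System.out.println(pdf.containsKey("cv_text") + " " + html.containsKey("cv_text"));
    }
}
```

Writing the comparison as `"pdf".equals(...)` avoids a null pointer when a document has no `type` field.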
![Page 49: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/49.jpg)
![Page 50: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/50.jpg)
Improvements
Pain:
Single threaded
Heavy processing not efficient

Proposed cure:
Local: Use multi-threaded update requests
SolrCloud: Dedicated nodes, role=“processor” ?
Wrap an external pipeline in UpdateProcessor
Example: OpenPipelineUpdateProcessor ?
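The "local" cure can be illustrated with a thread pool. In this sketch each task stands in for one update request, so several documents run through the chain at the same time. It assumes the chain keeps no shared mutable state. The class and method names are hypothetical and no Solr client is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Stand-in sketch: runs a stateless chain over many documents in parallel.
final class ParallelUpdateSketch {
    static void processAll(List<Map<String, Object>> docs,
                           Consumer<Map<String, Object>> chain,
                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                // Each document is touched by exactly one task.
                pending.add(pool.submit(() -> chain.accept(doc)));
            }
            for (Future<?> task : pending) {
                try {
                    task.get(); // wait for completion and surface failures
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", i);
            docs.add(doc);
        }
        processAll(docs, doc -> doc.put("processed", Boolean.TRUE), 4);
        System.out.println(docs.size() + " documents processed");
    }
}
```

This only helps when the heavy work is per document. A processor that serializes on a shared resource would still be the bottleneck.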
![Page 51: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/51.jpg)
![Page 52: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/52.jpg)
Improvements
Pain:
Not really a “problem” :-)
Nice to write processors in Python, Groovy, JS...

Proposed cure:
Now: Finish SOLR-1725: Script based Processor
Later: Make scripts first-class processors
<processor script="myScript.py" />
or
<processor ref="myScript" />
![Page 53: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/53.jpg)
One last thing...
![Page 54: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/54.jpg)
New standalone framework?
•The UpdateChain is Solr specific
•Interest for a pure pipeline framework
•Search engine independent
•Scalable
•Rich pool of processors
•Several existing candidates
•Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
![Page 55: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/55.jpg)
Summary
![Page 56: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/56.jpg)
Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!
![Page 57: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/57.jpg)
Questions?
Jan Høydahl, Cominvent [email protected]
![Page 58: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/58.jpg)
Extra
![Page 59: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/59.jpg)
Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache Commons Pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•FindWise and TwigKit also have some technology
![Page 60: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/60.jpg)
Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr.
The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers.
![Page 61: Improving the Solr Update Chain](https://reader034.fdocuments.in/reader034/viewer/2022052208/558cdf51d8b42a3b768b4591/html5/thumbnails/61.jpg)
Scaling with external pipeline
Here is a more advanced, distributed case, where one Solr node is dedicated to processing and the entry-point Solr node only dispatches the requests.