Relevance redefined
Lukas Koster
Library of the University of Amsterdam
@lukask
IGeLU 2014 - Oxford
Main discovery tool feedback issues
Content
Not enough
Too many
Wrong types
No ‘full text’
Relevance
Not #1
Too many
Known item!?
WTF?
Main issues reported as feedback/survey results on discovery tools are about content and relevance.
Funny thing is that the use of facets for refining results is somehow not very popular.
Usual responses to feedback issues
Change the front end! Tabs - Facets - Filters - Font - Positions
More/less content! More of the same same same
Improve relevance ranking algorithms! Very shhhophisticated - Very shhhecret
Usual responses to feedback issues: front end, content, relevance (ranking).
Front end UI changes are just about cosmetics and perception: more tabs with specific data sources, element positioning, etc.
More or less content can have various effects: either more or fewer relevant results, but usually still the same traditional content types.
Algorithm: only influenced by libraries and customers for a small part. Most of it is in the software, which is confidential and not transparent, for competitive reasons.
Before
Example: before. University of Amsterdam Primo originally offered a Google-like experience: one box (apart from advanced search), all sources, one blended results list.
After
Example: after. University of Amsterdam Primo now: three tabs: All, Local catalogue, Primo Central
Same old
Same old UX tricks
Same old Content types
Same old View on relevance
Basically these changes are not actual changes at all: it's all cosmetic.
UX/UI changes affect perception, not actual relevance.
Content: usually still the same resource types and search indexes.
Relevance is still viewed from a system and result set perspective.
iNTERLiNKED
[Word art: the words SEARCH, RANK, CONTENT, CONTEXT and RELEVANCE displayed interlinked]
But every aspect is dependent on all others: search, rank, content, context, relevance. No search without context. Search is executed in a specific limited content index. Ranking is performed on the results within this limited index. Relevance is completely defined by a user's context.
Relevance
Context
+
Content
Objectively, we can say that relevance is determined by context and content.
Relevance=Relative:Subjective:Contextual
Person Context
Role
Task
Goal
Need
Workflow
System Algorithm
Content
Index
Query
Collection
Configuration
Clash between personal context and system/collection. Personal context, defined by a person's specific needs in a specific role for a specific task/goal at a specific time, culminates in a specific query, which consists of a limited number of words in character string format.
The system doesn't know the personal context; it only has the indexed content, made up of specific collections, indexed in a certain way with specific system configurations, and the string-based query to run through that structure.
Relevance
Recall: the fraction of relevant instances that are retrieved =
retrieved relevant instances / total relevant instances
Precision: the fraction of retrieved instances that are relevant =
retrieved relevant instances / total retrieved instances
Basic concepts used for determining the relevance of result sets: Recall and Precision. These cannot be used to determine the actual relevance of specific results! That depends on context and can only be determined by the user.
Total: 1000 items
Relevant: 300
Retrieved: 180
Retrieved relevant: 120
Retrieved irrelevant: 60
Irrelevant: 700
Recall: 120/300 = 0.4
Precision: 120/180 ≈ 0.67
Relevance
Recall and Precision example: searched index: 1000 items; relevant for the query: 300; retrieved items: 180; retrieved relevant items: 120; retrieved irrelevant items: 60.
Recall = 120/300 = ⅖ (0.4). Precision = 120/180 = ⅔ (≈0.67).
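The recall and precision figures above are simple ratios; a minimal sketch (the function names are mine, not from any discovery tool):

```python
def recall(retrieved_relevant, total_relevant):
    """Fraction of all relevant items that were actually retrieved."""
    return retrieved_relevant / total_relevant

def precision(retrieved_relevant, total_retrieved):
    """Fraction of retrieved items that are actually relevant."""
    return retrieved_relevant / total_retrieved

# Numbers from the example: 300 relevant items in the index,
# 180 items retrieved, 120 of the retrieved items relevant.
print(recall(120, 300))               # 0.4
print(round(precision(120, 180), 2))  # 0.67
```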
Relevance ranking is NOT Relevance
Relevance = Finding appropriate items
Recall, Precision
Relevance ranking = Determining most relevant within retrieved set
Term Frequency, Inverse Document Frequency, Proximity, Value Score
Retrieved set may not contain any relevant items at all, but can still be ordered according to relevance.
Relevance is NOT relevance ranking! Relevance is finding/retrieving appropriate items, using the words in the query and, if available, context information. Recall and Precision are used to measure the degree of relevance of a result set.
Relevance ranking is determining the most relevant items within a result set, based on the query terms and the content of the retrieved items, using a number of standard measures: TF, IDF, Proximity. Value score is a specific Primo algorithm that looks at the number of words, the type of words, etc.
Also possible: local boosting. This method does not take into account any content relevance, but just uses brute force to promote items from specific (local) data sources.
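Primo's actual ranking algorithm is confidential, but the TF and IDF measures named above are textbook material. A minimal sketch of TF-IDF scoring over a toy corpus (all documents and terms here are illustrative, not Primo's implementation):

```python
import math

def tf_idf_score(query_terms, doc, corpus):
    """Textbook TF-IDF: term frequency in the document, weighted
    by how rare the term is across the whole corpus."""
    score = 0.0
    n_docs = len(corpus)
    for term in query_terms:
        tf = doc.count(term) / len(doc)             # term frequency
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log(n_docs / df) if df else 0.0  # rarer terms weigh more
        score += tf * idf
    return score

# Tiny corpus of tokenised "documents"
corpus = [
    ["library", "discovery", "relevance"],
    ["library", "catalogue", "search"],
    ["relevance", "ranking", "algorithm"],
]
query = ["relevance", "ranking"]
ranked = sorted(corpus, key=lambda d: tf_idf_score(query, d, corpus),
                reverse=True)
print(ranked[0])  # ['relevance', 'ranking', 'algorithm']
```

Note that, as the slides point out, this only orders whatever the retrieved set happens to contain; it says nothing about whether any of those items are relevant to the user.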
Primo Central search and ranking enhancement - July 8, 2014
As part of our continuing efforts to enhance search and ranking performance in Primo, we changed the way Primo finds matches for keyword searches within indexed full text. As part of this approach Primo lowers the ranking of, or excludes, items of low relevance from the result set that were previously included. You may find as part of this change that the number of results for some searches is reduced, although result lists have become more meaningful.
Official Ex Libris announcement, July 8, 2014, combined with improvements to known item search/incomplete query terms in Primo 4.7. Something changed!? This announcement implies a mix-up between retrieving relevant results and relevance ranking: some results are actually excluded.
Only for full text resources. Only in Primo Central. It is not clear whether this is independent of software version/SP.
It is unclear to libraries and customers what is modified in relevance/search/ranking, and how: an example of the non-transparent nature of discovery tools' relevance algorithms. There were a number of complaints about this on the Primo mailing list.
The System Perspective
Objectivizing a subjective experience
Let's look at the traditional system perspective on relevance. It's trying to make a subjective process into an objective one.
Recall issues
Discovery tool index limits recall scope in advance
Relevance is calculated on:
available
selected
indexed
(scholarly) content
By vendors
By libraries
Everything
System
First let's have a look at some recall issues in discovery tools.
Recall is limited in advance, because only a limited set of items of certain content types is available for searching. A lot of relevant content is not considered at all. This is decided by vendors, publishers and libraries.
In Primo Central: by Ex Libris agreements with publishers and metadata vendors; libraries decide what is enabled, subscribed, or free for searching.
In Primo local: libraries decide which (parts of) collections are indexed.
Recall issues
NOT indexed:
Not accessible
Not subscribed
Not enabled
Unusual resource types
Connections
Not digital
Not indexed, thus not searched:
Content not accessible to index vendors or libraries. Unusual resource types: theatre performances, television interviews, research project information, historical events. Non-digital, physical/tangible content. Connections: influenced by, collaborating with, temporal, genre.
May not fit in bibliographical/proprietary format (MARC, DC, PNX)
Recall issues
Indexed, but NOT found:
By author name (string based)
By subject (string based, languages)
Related but unlinked items (chapter in book)
Content that IS indexed, but can't be found:
Author names: only strings are indexed, with textual variations of names, pseudonyms, etc. Only items with the explicit author search term are found.
Subjects: strings, individually indexed 'as is' from data sources, in multiple languages. Only items with the explicit subject search term are found.
Related items: a chapter may be indexed with a textual reference to the book it is part of, but the book (relevant for delivery) is not retrieved, nor is a link to it.
Author
Author name example. Charlotte Brontë used the pseudonym/pen name Currer Bell (male) for Jane Eyre (left screenshot: Wikipedia). In this case there are no links between the two names, so the very relevant Charlotte Brontë material is not retrieved (right screenshot: University of Amsterdam Primo).
Subject
Subject example. The topic/discipline 'philosophy' (English) does not find items with the Dutch 'filosofie' (which also appears to be the Czech term).
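Both the Brontë/Currer Bell and the philosophy/filosofie problems stem from matching raw strings. A sketch of the identifier-based alternative, where search terms are first resolved to concept identifiers and the index is keyed on those identifiers (all identifiers and titles below are made up for illustration):

```python
# Hypothetical authority file: every name or subject string maps
# to a concept identifier. (IDs are invented for this sketch.)
AUTHORITY = {
    "Charlotte Brontë": "person:bronte",
    "Currer Bell": "person:bronte",      # pen name, same person
    "philosophy": "subject:philosophy",  # English label
    "filosofie": "subject:philosophy",   # Dutch label
}

# Items indexed by concept identifier instead of by string
INDEX = {
    "person:bronte": ["Jane Eyre", "Villette"],
    "subject:philosophy": ["Ethica", "Critique of Pure Reason"],
}

def search(term):
    """Resolve a string to its concept ID, then search on the ID."""
    concept = AUTHORITY.get(term)
    return INDEX.get(concept, [])

# The pen name now retrieves the Brontë items, and the Dutch
# subject term finds the same items as the English one.
print(search("Currer Bell"))  # ['Jane Eyre', 'Villette']
print(search("filosofie"))    # ['Ethica', 'Critique of Pure Reason']
```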
Chapter
Connections example. A chapter written by a UvA researcher is in the local institutional repository, harvested into local Primo. The book is in the Aleph catalogue, also harvested into local Primo. But the book is not retrieved as an item, so delivery options cannot be presented directly.
Precision issues
Discovery tool limits precision by ambiguous indexing
Next: some precision issues. These problems are caused by using strings instead of identifiers/concepts.
Precision issues
Indexed and/but erroneously found
By author name (string based)
By subject (string based, languages)
Query too broad
Indexed irrelevant items that are retrieved erroneously:
Author: common names result in items by all authors with that name. Subject: similar terms with different/ambiguous meanings give noise ('VOC'). A broad query (few terms) gives too much noise.
Author
Example of author names.
J. (Johanna, Jan, Joop, etc.) de Vries is a very common Dutch name. The results consist of items by all the different authors with that name.
Subject
Example of subjects.
The ambiguous/multilingual topic 'VOC': physics (Volatile Organic Compounds), music (vocals), history (Verenigde Oostindische Compagnie, the Dutch East India Company).
Too broad
Example of too broad search terms.
Way too many results with a very common search term.
Recall and precision issues
Content of index
Quality of search index units
Lack of connections (isolated string items)
Algorithms for retrieving and ranking not transparent
Summary of recall and precision issues in discovery tools:
Content of the index: resource types, connections, data. Search index units (individual search index fields): strings, isolated items.
Cause: a system perspective with legacy data. There is no way to determine whether all relevant items have been retrieved.
Research cycle
http://commons.wikimedia.org/wiki/File:Research_cycle.png
Intermezzo: a closer look at context: workflow and use case. Example: the research cycle (Cameron Neylon). Many different versions of this cycle exist. The important point: the nature of someone's information need differs depending on the stage, broad at first, then focused along several dimensions.
Context example - theatre research
Play
Author
Text
Productions
Use case: a theatre play researcher. A theatre play is written by an author and is represented as a text, but most importantly it is performed (or not) for an audience.
Context example - theatre research
Play
Author
Text
Productions
Background
Connections
Influences
Period
For the author there is biographical information; important aspects are background, connections with others (artists, funders, relatives, etc.), influences, and the period in which they live and work. Libraries/discovery tools may have some biographical information.
Context example - theatre research
Play
Author
Text
Productions
Background
Connections
Influences
Period
Versions
Translations
Editions
Text: there may be several versions, translations, editions, etc.; FRBR can be used to model this. This belongs to the traditional library domain.
Context example - theatre research
Play
Author
Text
Productions
Background
Connections
Influences
Period
Versions
Translations
Editions
Performances
Reception
TheatresProducers
ActorsVisitor stats
Directors
Props
Posters
Recordings
PhotosCostumes
Productions and performances are a whole different world: people involved in a number of different roles, different productions, various actual performances, physical props, costumes, audio and video recordings, etc. Reception, both of the play as such and of the various productions, always related to the period.
What's in a discovery tool? It could be anything, but only inside individual texts/items, not as separately retrievable items, and certainly not as connections/cross-references/related information. Authors: in authority files or individual biographical databases. Texts/editions: treated separately as individual items. Productions: maybe, depending on the types of indexed (local) databases. Reception: possibly individual reviews.
Relevance - New perspective
Instead of
SYSTEM: Collections, Indexed content, Query
Context - Workflow - Goals - Environment of
USER
It is time to switch perspectives, from collection based System algorithms to context based User needs.
Is this technically possible, feasible?
Extend Content?
Know Context?
Important questions: is it even possible and feasible to extend content without limits, and to interpret personal context? Can commercial vendors and publishers benefit?
Relevance Redefined and Primo
What is already possible?
Content
Additional content types
Additional indexed fields
Third nodes (not merged)
External links (not searchable, link out only)
Context
Discipline (for ranking, not searching)
Algorithm improvements (for current items)
Let's look at this from the Primo perspective.
What is already possible in current version of Primo?
Content: other resource types can be added, both in Primo Central (by Ex Libris) and in Primo local (by individual libraries). Indexed fields: extend the PNX search section with extra entries (for authors and subjects, for instance; this needs normalization rules) plus locally defined fields. Third nodes: external data sources via API, like distributed federated search (EBSCO, WorldCat), but the results are unclear and can't be merged very well.
Context: users can enter their discipline and degree, but this is only used for ranking, not for retrieving.
Relevance Redefined and Primo
What is missing?
Content
Internal links
Integrated Primo Central/Local
External links
External indexes
Normalised/multilingual authors/subjects
Context
Context
What is still missing/not possible in Primo?
Internal links: chapter-book(s), article-journal(s), article-datasets, qualitative relationships, etc.
Primo Central-Primo local: two separate indexes to be searched, no deduplication, etc.
External links, for instance to related content in external databases not indexed in Primo: theatre performances, research information, etc.
External indexes: non-Primo data sources searchable (maybe with Third Nodes, but not merged).
Normalised/multilingual indexes: there is no use of identifiers instead of string indexes.
Relevance Redefined and Primo
Options
Content
Universal record format: RDF!
Identifier based authorities: VIAF! MACS? (DBpedia?)
Global metadata index!
Transparent algorithms!
Context
What would Google do?
What options can we distinguish for future Primo development?
Record format not proprietary but universal: RDF. RDF also requires identifiers plus relations (triples).
Existing authorities: VIAF, LCSH, MACS, etc. (RDF/linked data).
Global metadata index: not silos for separate discovery layers, but an open, global, unified format. It could be decentralized and distributed, managed by multiple parties.
Transparent algorithms: to make it clear how relevance is computed.
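The RDF option can be illustrated without any library machinery: RDF data is just (subject, predicate, object) statements over identifiers, so relations like pen names become explicit and queryable. A toy sketch (the URIs, prefixes and predicate names are illustrative, not real authority data):

```python
# Toy triple store: each statement is (subject, predicate, object),
# with identifiers for entities instead of isolated strings.
triples = {
    ("ex:bronte",   "rdfs:label", "Charlotte Brontë"),
    ("ex:bronte",   "ex:penName", "Currer Bell"),
    ("ex:janeEyre", "dc:creator", "ex:bronte"),
    ("ex:janeEyre", "rdfs:label", "Jane Eyre"),
}

def works_by(name):
    """String -> identifier -> linked works: any known name or
    pen name resolves to the same person identifier."""
    person_ids = {s for s, p, o in triples
                  if p in ("rdfs:label", "ex:penName") and o == name}
    work_ids = {s for s, p, o in triples
                if p == "dc:creator" and o in person_ids}
    return {o for s, p, o in triples
            if s in work_ids and p == "rdfs:label"}

# Searching on the pen name retrieves the same works
# as searching on the canonical name.
print(works_by("Currer Bell"))       # {'Jane Eyre'}
print(works_by("Charlotte Brontë"))  # {'Jane Eyre'}
```

Because the data is a graph of identifiers rather than records of strings, the same structure supports multilingual subject labels, chapter-book links and external connections, which is exactly what the string-based indexes above cannot do.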
New features announced by Ex Libris on earlier occasions: URIs in the Primo PNX Links section; Knowledge Graph type additional info (Wikipedia, …). Announced during the conference: Primo/third generation discovery, with related information and serendipity, using identifiers, external sources and linked data.
Context: Google: next slide.
A word about Google vs Primo
Google knows
IP addresses
Account
Searches
Clicks
Location
Primo makes an educated guess
Discipline?
Query type
The difference between Google and library discovery: Google knows a lot about the user and can target search results using the user's history, location, email, etc. Library discovery tools do not have that knowledge; they have to guess.
VIAF
Example of identifier based person authority files.
VIAF consolidates names from a large number of authoritative sources. It also has related names.
MACS
Multilingual ACcess to Subjects
Since 1997
Manual linking between strings
New future?
The European Library...
http://www.nb.admin.ch/nb_professionnel/projektarbeit/00729/00733/index.html?lang=en
Example of multilingual subjects.
MACS: since 1997, manually maintained, with input from four national libraries. Used in The European Library.
Discussed at the IFLA 2014 Linked Data for Libraries satellite meeting in Paris. There are plans to extend and adjust MACS for future, automated, linked data concepts. This would be a very important development.
The European Library uses MACS. Multilingual AND disambiguated
WikiPedia/DBpedia
SLUB Dresden's local Primo add-on SLUB-Semantics uses multilingual and disambiguated topics from Wikipedia/DBpedia.
But, wait a minute...
RDF?
Identifiers?
Global index?
Transparency?
What are we talking about here? What would be the consequences of applying these suggestions?
Ŧ ᶙ©Ѥ
**** the system
Open independent transparent web based connected data infrastructure
Linked Open Data
Should libraries, vendors invest in data infrastructure instead of systems?
Discovery layers should be separated (decoupled) from proprietary systems, closed data stores and indexes. The main focus should be a global data infrastructure, which can be accomplished with RDF/LOD, with tools and services built on top of that global infrastructure.
This is exactly what Linked Open Data is all about.
Main issue here: would this be commercially beneficial for current discovery layer vendors?
And should libraries focus on data infrastructure instead of systems?
Ŧ ᶙ©Ѥ
**** the system
Open independent transparent web based connected data infrastructure
Linked Open Data
Should libraries, vendors invest in data infrastructure instead of systems?
No, if you look closely, it doesn't say what your mind thinks ;-)
NISO Open Discovery Initiative
“Transparency in discovery” 2014
(http://www.niso.org/workrooms/odi/)
“... facilitate increased transparency in the content coverage of index-based discovery services … Full transparency will enable libraries to objectively evaluate discovery services …”
NISO Open Discovery Initiative report 2014 objectives. "Transparency in discovery": sounds promising.
NISO Open Discovery Initiative
In scope:
Quantity of content
Form of content
Do not favor or disfavor items from any given content source or material type
Specific metadata fields indexed
Whether controlled vocabularies or ontologies are included
NISO ODI topics declared “in scope”Most of these topics confirm suggestions made in this presentation.
NISO Open Discovery Initiative
Out of scope:
“Relevancy ranking” (may fall within the realm of proprietary technologies used competitively to differentiate commercial offerings)
APIs exposed by discovery service (initially, reluctantly)
However, these NISO ODI topics are declared “out of scope”:
Relevance ranking
APIs (system-independent access to data, more or less)
These are exactly the things that are most important for transparency in discovery.
NISO Open Discovery Initiative
Nothing about:
Content linking/identifiers
Normalised/multilingual authority files
Relevancy ranking
System independent data infrastructure
NISO ODI ignores the very issues that would improve relevance in discovery.
NISO Open Discovery Initiative
Stakeholders/Working group members:
Content providers
Discovery service providers
Libraries
Who’s missing?
The most important stakeholders are missing from the NISO ODI committees: the end users.
Relevance redefined
Context
User needs
User input
User feedback
Content
Open connected data infrastructure
Systems (Primo) Services
Algorithms Transparency
SOA - Service Oriented Architecture + Context
Conclusion/recommendation: instead of closed systems with limited content, a transition to a new three-component environment is required:
- content: an open, global data infrastructure
- context: user needs, input, feedback
- services/systems that access the content and context layers in transparent ways
SOA: Service Oriented Architecture + Context.
How this can be achieved is still to be investigated; however, SOA is already widely implemented elsewhere. Linked Open Data is technically possible; we only need the will to cooperate. Context is the hardest part to realize, but it is not impossible.