The Dynamic Potential of Semantic Enrichment
Allen Press Emerging Trends in Scholarly Publishing™ Seminar, 14 April 2011
Pam Harley, VP, Product & Market Development
Semedica™, a division of Silverchair
(434) 296-6333 x372
or, Everything You Always Wanted to Know About Semantic Enrichment
OK, not everything. Not even most things. Just some things you probably should be aware of.
Why me?
Me: 20+ years in STM publishing, many hats worn
print, digital
books, journals, news, continuing education…
editorial, production, product development
Silverchair: 10+ years working with STM publishers to build products and features from semantically tagged content
2
Here’s the plan
WHAT is semantic enrichment
WHY you should care (benefits)
HOW to get started
(with a few side trips to make sure we’re all on the same page re: lingo)
3
First…
DON’T do what I’m about to do
Don’t start by exploring technology
(Hint: Start with user stories)
4
What’s a user story?
A user story captures what the user wants to achieve: who wants the functionality, and why it allows that user to achieve something useful
5
Creating user stories
Focus your tagging strategy on user stories: how people want to use your content
What tasks are they trying to do when they use your product?
What answers are they looking for?
At what point in their workflow is your product used?
Almost all information sites have multiple user stories. Know them for your products.
Remember that your organization is also a key user of your product
6
WHAT is semantic…
enrichment
tagging
markup
indexing
fingerprinting
classification
categorization
?
7
Semantics are about meaning
The meaning of content is currently written for human understanding, not computers
Semantics adds a layer of meaning to your content, so that computers can make sense of it and build connections to it
Semantic metadata answers the most important question of all for content producers and users:
What is this content about?
captured in a way that computers can process
8
“Atomizing” information
A semantic approach requires you to go beyond documents and think of your content as data
Semantic markup allows knowledge in your publications to be acted on as distinct bits of data
For example:
1 practice guideline = 1 document OR 1 practice guideline = 312 distinct pieces of data
9
Taxonomy is the semantic foundation
Taxonomy is the framework for the semantic layer and semantic tagging
It allows…
Normalization
Consistency in tagging
Concept grouping and hierarchical relationships
Integrations/interoperability (internal and external)
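As a rough illustration of concept grouping and hierarchical relationships, here is a minimal sketch of a taxonomy held in memory. The `Taxonomy` class and every concept name in it are assumptions made up for this example, not part of any real terminology or product.

```python
from collections import defaultdict


class Taxonomy:
    """Toy taxonomy: concepts linked by broader/narrower relationships."""

    def __init__(self):
        self.children = defaultdict(set)  # broader concept -> narrower concepts

    def add(self, concept, broader=None):
        self.children[concept]  # register the concept even if it has no children
        if broader is not None:
            self.children[broader].add(concept)

    def narrower(self, concept):
        """Return the concept plus everything beneath it in the hierarchy."""
        found = {concept}
        stack = [concept]
        while stack:
            for child in self.children[stack.pop()]:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Hypothetical concepts, for illustration only.
taxonomy = Taxonomy()
taxonomy.add("infections")
taxonomy.add("bacterial infections", broader="infections")
taxonomy.add("clostridium difficile infection", broader="bacterial infections")
taxonomy.add("viral infections", broader="infections")

# A search on a broad concept can be expanded to every narrower concept.
print(sorted(taxonomy.narrower("bacterial infections")))
```

Content tagged with a narrow concept can then be grouped under any of its broader concepts without retagging.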
10
Equivalent relationships are critical
Synonyms, abbreviations, jargon, misspellings, and codes are a critical component
Necessary to normalize the natural and constantly evolving variations in the language that authors use to describe concepts and searchers use to find them
Vastly improve performance of autotagging systems
Precise strings are easier to match programmatically, and a thesaurus magnifies the number of strings available to match to a given concept
11
Normalization
Authors use different terminology to represent the same topics
Examples:
Synonyms (newborn = neonate)
Abbreviations (GHB = gamma hydroxybutyrate)
Shorthand (c diff = clostridium difficile)
Searches for these language variations produce different results
A semantic layer controlled by a taxonomy/thesaurus normalizes these variations
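The normalization idea can be sketched in a few lines. This is only an illustration: the `THESAURUS` dictionary, the `normalize` function, and the choice of preferred labels are assumptions for the example; the three term pairs are the ones from this slide.

```python
import re

# Hypothetical thesaurus: preferred concept label -> equivalent strings.
# A real thesaurus would be far larger and centrally managed.
THESAURUS = {
    "neonate": ["newborn", "neonate"],
    "gamma hydroxybutyrate": ["GHB", "gamma hydroxybutyrate"],
    "clostridium difficile": ["c diff", "clostridium difficile"],
}


def _key(text):
    """Lowercase and collapse punctuation/whitespace so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Invert the thesaurus: every variant string points at its preferred concept.
LOOKUP = {_key(variant): concept
          for concept, variants in THESAURUS.items()
          for variant in variants}


def normalize(term):
    """Return the preferred concept for a term, or None if it is unknown."""
    return LOOKUP.get(_key(term))


for query in ["Newborn", "ghb", "C. diff", "aspirin"]:
    print(query, "->", normalize(query))
```

Because author text and search queries both pass through the same lookup, a search for "newborn" and a chapter about "neonates" meet at one concept.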
12
Normalization in action at McGraw-Hill’s AccessEmergency Medicine
13
Consistency in tagging
14
Dynamic concept grouping and hierarchical relationships
15
Hooks for integrations/ interoperability
16
Where does a taxonomy come from?
Your content collection
Inputs from your users (e.g., author keywords, search logs)
Subject matter expert consultation
Industry standard terminologies
Source for concepts, equivalents, guidance on hierarchy
17
The importance of industry standard terminologies
Your taxonomy must be able to interact with standards of your domain to forge meaningful external integrations
Many terminologies are in use in different scientific domains (UMLS, ACS, ACM, AIP, IEEE, OSA, EPA, NASA, USGS…). Investigate what’s available.
Great case example for domain-level taxonomy:
For medical content, the UMLS Metathesaurus maps together 100+ constituent health care vocabularies (MeSH, SNOMED, ICD, RxNorm…) to support health care interoperability
18
Don’t reinvent the wheel!
If there’s a taxonomy available that’s a good fit, use it
BUT make sure you have a mechanism for adapting it to meet the needs of
your content
your users
the pace of change/new concepts in your field
[Note to STM publishers in cutting-edge areas: You can’t wait for the standards to catch up to your research output—you’ll need to be able to add concepts at the time of publication]
19
Ongoing taxonomy management
Taxonomies must be continually enhanced as your domain evolves, your content set grows, and your user needs and expectations change
Make sure it is easy to update your taxonomy and make it available to your systems (tagging, web applications), ideally in real time
Taxonomies should always be considered a work in progress!
20
Application of taxonomy to content—semantic tagging
Semantic tagging is the insertion of semantic information at the level of XML elements
Example: <root-term termID="47521">t cells, regulatory</root-term>
Tagging can be embedded directly in XML, provided as separate reference files, or placed in database tables that reference elements
If the content itself cannot hold embedded tags (e.g., images, videos, PDFs), tagging can be placed in header files
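A short sketch of reading an embedded tag with Python's standard `xml.etree.ElementTree` module, and of expressing the same tag as a separate reference record. The `<p>` wrapper, the surrounding sentence, and the `p-001` element identifier are invented for illustration; only the `<root-term>` element comes from the example above.

```python
import xml.etree.ElementTree as ET

# A hypothetical paragraph containing the embedded tag from the example.
xml_fragment = (
    '<p>Key concept: '
    '<root-term termID="47521">t cells, regulatory</root-term>.</p>'
)

paragraph = ET.fromstring(xml_fragment)

# Collect (termID, label) pairs for every embedded semantic tag.
tags = [(el.get("termID"), el.text) for el in paragraph.iter("root-term")]
print(tags)

# The same information as a separate reference record that points at
# the element instead of living inside it ("p-001" is a made-up ID).
standoff = [{"element": "p-001", "termID": term_id} for term_id, _ in tags]
print(standoff)
```

The separate-record form is the one that also works for images, videos, and PDFs, since nothing has to be written into the content itself.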
21
Who/what tags?
Automated tagging: software analyzes content and adds tags based on concept matching, patterns, grammar
Pros: Highly scalable, good at finding trends in large bodies of content. Sometimes the only option for very large data sets
Cons: False positives, missed concepts
Manual tagging: humans with appropriate expertise (sometimes called Subject Matter Experts, or SMEs) read the content and apply tags
Pros: Precise, ideal when clinical judgment is required
Cons: Cost-prohibitive for large volumes of content, hard to scale, inconsistent (humans make subjective choices!)
Hybrid: automated process followed by manual review/modification
For high-value, specialized sites (such as clinical decision support that requires “one best answer” results) this extra human touch can be necessary
Some content types aren’t accessible to automated systems (multimedia)
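To make the automated and hybrid options concrete, here is a minimal sketch of string-matching autotagging with a manual review queue. It is an assumption-laden toy, not how any particular tagging product works: the `LEXICON` entries, the concept ID 90001, and the `NEEDS_REVIEW` rule are all invented (only ID 47521 appears in the earlier example).

```python
import re

# Hypothetical lexicon: matchable string -> concept ID.
LEXICON = {
    "regulatory t cells": "47521",
    "tregs": "47521",
    "cold": "90001",  # ambiguous: common cold vs. low temperature
}

# Strings a human should confirm before the tag is accepted (an assumed
# routing rule for a hybrid workflow, chosen only for this example).
NEEDS_REVIEW = {"cold"}


def autotag(text):
    """Find lexicon strings in text; return accepted tags and a review queue."""
    accepted, review = [], []
    for phrase, concept_id in LEXICON.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hit = (match.start(), match.group(0), concept_id)
            (review if phrase in NEEDS_REVIEW else accepted).append(hit)
    return sorted(accepted), sorted(review)


text = "Tregs were stored in the cold room; regulatory T cells were counted."
accepted, review = autotag(text)
print("accepted:", accepted)
print("review:  ", review)
```

The "cold room" match is exactly the kind of false positive listed under Cons; routing it to a reviewer is the hybrid step.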
22
Tagging for different uses
[Figure: mock article with Disease, Diagnosis, Treatment, and References sections, a table, and a figure, annotated with collection tags, section summary tags, and inline entity tags]
<Collections> What “buckets” does this content object belong in?
Assignment of content into topical collections for major site navigation or product definition
Used for: topic collections; microsites; virtual journals…
<Section Summaries> What is this section/article/chapter about?
Most significant topics discussed at the article/chapter/section (wrapper) level
Used for: answers to clinical questions; review; skills assessment…
<Entities> What is this thing?
Important concepts at the paragraph/list/table/figure (granular) level
Used for: complex search queries; concept overlap analysis; specific entity types like drugs, genes, clinical trials, manufacturers…
23
WHY
should you care
(What are the benefits?)
24
Failure of the status quo
Information scarcity is no longer the issue. Attention scarcity is the problem.
The publisher’s role in information curation and filtering has never been more important. However, the tools for doing both are changing.
“Information is a source of learning. But unless it is organized, processed, and available to the right people in a format for decision making, it is a burden, not a benefit.”– William Pollard, Physicist
25
Search accuracy, precision
Faster, more accurate and reliable answers to questions enhance user productivity and thus improve your application’s usability and user satisfaction ratings.
The accuracy threshold for STM information is very high! Users increasingly will not tolerate ambiguous results.
Time-strapped users are struggling with information overload—fewer, better answers are often preferred.
Tagging allows exposure of hard-to-find media like images, videos.
26
“Which did you mean?” at McGraw-Hill’s AccessMedicine
27
28
Pathways to related content
Related search terms
Links to related content within and across resources
Dynamically generated as new content is added
Goal: Increase serendipitous discovery, site stickiness, and usage metrics like number of page views and time on site
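One simple way to generate related-content links from tags is to rank items by tag overlap. The sketch below uses Jaccard similarity as an assumed scoring choice; the item IDs, tags, and function names are all hypothetical.

```python
# Hypothetical content items with their semantic tags (concept labels).
TAGGED_CONTENT = {
    "article-1": {"sepsis", "neonate", "antibiotics"},
    "chapter-7": {"sepsis", "antibiotics", "intensive care"},
    "video-3": {"neonate", "resuscitation"},
    "article-9": {"skin disorders"},
}


def jaccard(a, b):
    """Overlap between two tag sets: shared tags divided by all tags."""
    return len(a & b) / len(a | b) if a | b else 0.0


def related(item_id, limit=3):
    """Rank other items by tag overlap with the given item, dropping zero overlap."""
    source = TAGGED_CONTENT[item_id]
    scored = [(jaccard(source, tags), other)
              for other, tags in TAGGED_CONTENT.items() if other != item_id]
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return [other for _, other in ranked[:limit]]


print(related("article-1"))
```

Because the ranking is computed from whatever is in the tagged collection at the time of the request, newly added content shows up in the links without manual curation.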
29
30
31
Contextual integrations
Internally—across titles and content types (journals, books, videos, images, e-learning…)
Externally—with partners and external data sets
Increasingly important to integrate content into customer workflows, bringing content to users in context as they do their daily work:
clinicians at the point of care
students as they prepare for exams
32
New products
Content recycling: Create new products from content you already have
Image collections
Mashup and micro products that serve specialized audiences and fit specific workflows
Topically constructed objects like virtual journals, knowledge environments, coursepacks, learning objects
You can cost-effectively create niche products not possible before
33
AIP/APS virtual journals
34
Search engine optimization
Granular topic exposure leads to better ranking in major search engines
Next wave of discovery tools (intelligent agents, virtual research assistants) will give greater weight to content they can understand
Tags can also be exposed to help create auto-extracts for content that doesn’t have abstracts (like book chapters)
35
36
Semantic users
As users search and navigate semantic content, you can attach the tags on that content to them
A semantic profile for a user can be created from his/her site activity
What topics are they interested in? How are their interests evolving?
Use this information to create personalized information services
Excellent method for encouraging anonymous institutional users to register/log in
Use topical affinities between users to create communities of practice—groups of people who share a passion for something they do and learn how to do it better through social interaction
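A semantic profile can be sketched as a running count of the tags on the content each user views. Everything here is a hypothetical illustration: the activity log, user IDs, and function names are assumptions, and a production system would also need to handle consent and privacy.

```python
from collections import Counter

# Hypothetical activity log: (user, tags on the content that user viewed).
ACTIVITY = [
    ("user-a", ["sepsis", "neonate"]),
    ("user-a", ["sepsis", "antibiotics"]),
    ("user-b", ["sepsis", "intensive care"]),
    ("user-c", ["skin disorders"]),
]


def build_profiles(activity):
    """Attach the tags of viewed content to each user as weighted interests."""
    profiles = {}
    for user, tags in activity:
        profiles.setdefault(user, Counter()).update(tags)
    return profiles


def shared_interests(profiles, user, other):
    """Topics two users have in common: a crude signal for suggesting a community."""
    return set(profiles[user]) & set(profiles[other])


profiles = build_profiles(ACTIVITY)
print(profiles["user-a"].most_common(1))
print(shared_interests(profiles, "user-a", "user-b"))
```

Comparing profiles built over different time windows would give a rough view of how a user's interests are evolving.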
37
Contextual advertising
Match article and ad semantic tags to precisely target ads based on topic
OR, block ads from appearing next to articles on related topics
OR (even better): Alternative advertising method
Advertising can be targeted to the user profile, not just the article
Avoid targeting editorially sensitive pages but still deliver ads that match that user’s interests on neutral pages or alerts
For classified/job ad targeting, user interests can be matched up with demographics like location
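The targeting and blocking options can be sketched as one small function. This is an assumed design for illustration only: the ad IDs, tags, and the `block_on_page_match` flag are hypothetical, and here the blocking mode falls back to profile-matched ads on the same page view.

```python
# Hypothetical ads, each carrying the semantic tags it targets.
ADS = {
    "ad-statin": {"hyperlipidemia", "statins"},
    "ad-inhaler": {"asthma"},
}


def pick_ads(page_tags, user_tags, block_on_page_match=False):
    """Choose ads for a page view.

    Default: show ads whose tags overlap the page's tags.
    With block_on_page_match=True (an editorially sensitive page), skip ads
    that overlap the page and fall back to ads matching the user's profile.
    """
    chosen = []
    for ad, ad_tags in sorted(ADS.items()):
        on_topic = bool(ad_tags & page_tags)
        if block_on_page_match:
            if not on_topic and ad_tags & user_tags:
                chosen.append(ad)
        elif on_topic:
            chosen.append(ad)
    return chosen


statin_review = {"statins", "clinical trials"}
reader_interests = {"asthma", "statins"}

print(pick_ads(statin_review, reader_interests))
print(pick_ads(statin_review, reader_interests, block_on_page_match=True))
```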
38
What about mobile?
Reduction in number of clicks!
Precision in search
Quick links to what most users need
Targeted navigation that leads to the most important content (answers to clinical questions)
39
HOW
to get started
40
Questions for you and your application/hosting providers
What are your user stories/use cases?
What are the business benefits/ROI for your organization?
What content do you need to tag, how is that content delivered, and can those delivery systems/platforms use taxonomy and tagging in a way that supports your user needs?
What’s your plan for keeping your taxonomy up to date?
Can your “living” taxonomy be integrated into your applications? In real time as you make updates?
41
Questions for semantic tech providers
Does the technology support your user stories/use cases?
Does it offer/integrate with a constantly evolving taxonomy?
Does it meet the accuracy threshold for your users and your content?
Can it tag at the depth—both granular and summary level—necessary? Figures and tables? Top-level collections?
42
The semantic user story
I am specifically identifying ________
because ________ is very important
to my ________ users
when they are ________.
43
The semantic user story
I am specifically identifying concise disease
treatment content because immediate access to
treatment options is very important to my
emergency physician users when they are seeing
20 patients an hour.
44
McGraw-Hill: metadata targeted to deliver fast, concise treatment info to ER docs
45
The semantic user story
I am specifically identifying skin disorder images
on all body locations and all types of skin because
visual diagnosis is very important to my family
physician users.
46
Derm101: images displayed in diagnosis search results
47
What are your user stories?
Problems/needs to solve for your users
Delivering top quality care under serious time constraints
Explosion of new research to keep up with and integrate into practice
Need to pass a licensing exam
Problems/needs to solve for your organization
Creating new products that grow and diversify revenue
Creating more value from advertising
Gaining insight into users
48
Thank you!
Pam Harley
VP, Product & Market Development
Semedica™, a division of Silverchair
(434) 296-6333 x372
www.silverchair.com
www.semedica.com
“Organizing is what you do before you do something, so that when you do it, it is not all mixed up.”
–A. A. Milne
49