Automating & Evaluating Metadata Generation
Elizabeth D. Liddy
Center for Natural Language Processing
School of Information Studies
Syracuse University
Outline
• Semantic Web
• Metadata
• 3 Metadata R & D Projects
Semantic Web
• Links digital information in such a way as to make the information easily processable by computers globally
• Enables publishing data in a re-purposable form
• Built on syntax which uses URIs and RDF to represent and exchange data on the web
– Maps directly & unambiguously to a model
– Generic parsers are available (see the sketch below)
• However, requisite processing is still largely manual
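To make the "generic parsers are available" point concrete, here is a minimal sketch using Python's rdflib library (not part of the original slides; the lesson URI and triples are invented for illustration). Because all RDF data reduces to subject-predicate-object triples, one parser can read any vocabulary:

```python
# A minimal sketch, assuming the rdflib library: one generic parser reads
# RDF in any vocabulary, since everything reduces to triples.
# (The example.org URI below is invented for illustration.)
from rdflib import Graph

turtle_data = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/lesson/42>
    dc:title "Stream Channel Erosion Activity" ;
    dc:language "en" .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")
for subject, predicate, obj in g:   # iterate the parsed triples
    print(subject, predicate, obj)
```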
Metadata
• Structured data about resources
• Supports a wide range of operations:
– Management of information resources
– Resource discovery
• Enables communication and co-operation amongst:
– Software developers
– Publishers
– Recording & television industry
– Digital libraries
– Providers of geographical & satellite-based information
– Peer-to-peer community
Metadata (cont’d)
• Value-added information which enables information objects to be:
– Identified
– Represented
– Managed
– Accessed
• Standards within industries enable interoperability between repositories & users
• However, metadata is still largely produced manually
Educational Metadata Schema Elements
GEM Metadata Elements
• Audience
• Cataloging
• Duration
• Essential Resources
• Pedagogy
• Grade
• Standards
• Quality
Dublin Core Metadata Elements
• Contributor
• Coverage
• Creator
• Date
• Description
• Format
• Identifier
• Language
• Publisher
• Relation
• Rights
• Source
• Subject
• Title
• Type
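As a sketch of what a machine-readable record built from these elements can look like, here is a hedged example using rdflib's built-in Dublin Core namespace. The resource URI is invented, and the field values echo a lesson plan shown later in this deck; this is illustrative, not the projects' actual tooling:

```python
# A hedged sketch: expressing a Dublin Core record as RDF triples with
# rdflib. The example.org URI is invented; the values come from a sample
# lesson plan that appears later in this deck.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

record = Graph()
lesson = URIRef("http://example.org/lesson/simultaneous-equations")
record.add((lesson, DC.title, Literal("Simultaneous Equations Using Elimination")))
record.add((lesson, DC.creator, Literal("Leslie Howe")))
record.add((lesson, DC.subject, Literal("Mathematics / Algebra")))
record.add((lesson, DC.language, Literal("en")))
print(record.serialize(format="turtle"))
```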
Semantic Web vs. MetaData?
• But both…
– Seek the same goals
– Use standards & crosswalks between schema
– Look for comprehensive, well-understood, well-used sets of terms for describing content of information resources
– Enable mutual sharing, accessing, and reuse of information resources
NSDL MetaData Projects
• Breaking the MetaData Generation Bottleneck
– CNLP
– University of Washington
• StandardConnection
– University of Washington
– CNLP
• MetaTest
– CNLP
– Center for Human Computer Interaction, Cornell University
Breaking the MetaData Generation Bottleneck
• Goal: Demonstrate feasibility of automatically generating high-quality metadata for digital libraries through Natural Language Processing
• Data: Full-text resources from clearinghouses which provide teaching resources to teachers, students, administrators and parents
• Metadata Schema: Dublin Core + Gateway for Educational Materials (GEM) Schema
Method: Information Extraction
• Natural Language Processing
– Technology which enables a system to accomplish human-like understanding of document contents
– Extracts both explicit and implicit meaning
• Sublanguage Analysis
– Utilizes domain- and genre-specific regularities vs. full-fledged linguistic analysis
• Discourse Model Development
– Extractions specialized for communication goals of document type and activities under discussion
Information Extraction
Types of Features recognized & utilized:
• Non-linguistic
– Length of document
– HTML and XML tags
• Linguistic
– Root forms of words
– Part-of-speech tags
– Phrases (Noun, Verb, Proper Noun, Numeric Concept)
– Categories (Proper Name & Numeric Concept)
– Concepts (sense-disambiguated words / phrases)
– Semantic Relations
– Discourse-Level Components
Sample Lesson Plan: Stream Channel Erosion Activity

Student/Teacher Background: Rivers and streams form the channels in which they flow. A river channel is formed by the quantity of water and debris that is carried by the water in it. The water carves and maintains the conduit containing it. Thus, the channel is self-adjusting. If the volume of water or amount of debris is changed, the channel adjusts to the new set of conditions. …

Student Objectives: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam. …
NLP Processing of Lesson Plan

Input: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.

Morphological Analysis: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.

Lexical Analysis: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.
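The lexical-analysis step above can be reproduced in spirit with off-the-shelf tools. Here is a minimal sketch using the NLTK library (an assumption for illustration; CNLP used its own pipeline), printing Penn Treebank tags in the slide's word|TAG notation:

```python
# A minimal sketch, assuming the NLTK library (CNLP's own tagger is not
# public): tokenize and part-of-speech tag in the slide's word|TAG notation.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

sentence = ("The student will discuss stream sedimentation that occurred "
            "in the Grand Canyon as a result of the controlled release "
            "from Glen Canyon Dam.")

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(" ".join(f"{word}|{tag}" for word, tag in tagged))
# e.g. The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN ...
```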
NLP Processing of Lesson Plan (cont'd)

Syntactic Analysis - Phrase Identification:
The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN </CN> that|WDT occurred|VBD in|IN the|DT <PN> Grand|NP Canyon|NP </PN> as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN </CN> from|IN <PN> Glen|NP Canyon|NP Dam|NP </PN> .|.

Semantic Analysis Phase 1 - Proper Name Interpretation:
The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN </CN> that|WDT occurred|VBD in|IN the|DT <PN cat=geography/location> Grand|NP Canyon|NP </PN> as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN </CN> from|IN <PN cat=geography/structure> Glen|NP Canyon|NP Dam|NP </PN> .|.
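The <CN>/<PN> bracketing above can be approximated with a rule-based chunker. The sketch below uses NLTK's RegexpParser with an invented grammar; note that NLTK's tagger emits NNP where the slide's older tagset shows NP:

```python
# A hedged sketch of phrase identification using NLTK's rule-based chunker.
# The grammar is an illustrative approximation of the slide's <CN>/<PN>
# bracketing, not CNLP's actual system.
import nltk

grammar = r"""
  PN: {<NNP>+}            # proper-noun phrase, e.g. "Glen Canyon Dam"
  CN: {<JJ>*<NN|NNS>+}    # common-noun phrase, e.g. "controlled release"
"""
chunker = nltk.RegexpParser(grammar)

sentence = ("The student will discuss stream sedimentation that occurred "
            "in the Grand Canyon as a result of the controlled release "
            "from Glen Canyon Dam.")
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(chunker.parse(tagged))   # a Tree with PN and CN subtrees
```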
NLP Processing of Lesson Plan (cont'd)

Semantic Analysis Phase 2 - Event & Role Extraction:
Teaching event: discuss
  actor: student
  topic: stream sedimentation
Event: stream sedimentation
  location: Grand Canyon
  cause: controlled release
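Sublanguage analysis works because lesson plans state objectives in highly regular forms. As a deliberately narrow illustration (not CNLP's actual method), a single pattern over "The student will …" sentences already recovers the event, actor, and topic:

```python
# A deliberately narrow sketch of sublanguage-driven extraction (illustrative
# only; CNLP's system is far richer): lesson-plan objectives reliably follow
# "The student will <verb> <topic> ...", so one rule recovers event and roles.
import re

objective = ("The student will discuss stream sedimentation that occurred "
             "in the Grand Canyon as a result of the controlled release "
             "from Glen Canyon Dam.")

pattern = re.compile(r"The student will (\w+) (.+?)(?= that | in |\.)")
match = pattern.search(objective)
if match:
    print({"teaching event": match.group(1),   # 'discuss'
           "actor": "student",
           "topic": match.group(2)})           # 'stream sedimentation'
```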
MetaExtract

[Figure: MetaExtract system architecture. An HTML document passes through an HTML converter and a preprocessor (TF/IDF, potential keyword data). A metadata retrieval module harvests the catalogued elements (Cataloger, Catalog Date, Rights, Publisher, Format, Language, Resource Type), while the eQuery extraction module generates the remaining elements (Creator, Grade/Level, Duration, Date, Pedagogy, Audience, Standard, Keywords, Title, Description, Essential Resources, Relation). An output gathering program assembles the final HTML document with metadata.]
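Some elements in the figure (Title, Format, Language, Description) are directly harvestable from HTML structure before any deeper NLP runs. Here is a minimal sketch of that retrieval-module idea, assuming the BeautifulSoup library and an invented sample document:

```python
# A minimal sketch, assuming the BeautifulSoup library: harvest the cheaply
# available metadata elements straight from HTML structure. The sample
# document is invented for illustration.
from bs4 import BeautifulSoup

html = """<html lang="en"><head>
  <title>Grand Canyon: Flood! - Stream Channel Erosion Activity</title>
  <meta name="description" content="A hands-on erosion lesson.">
</head><body>...</body></html>"""

soup = BeautifulSoup(html, "html.parser")
metadata = {
    "Title": soup.title.string if soup.title else None,
    "Language": soup.html.get("lang"),
    "Format": "text/HTML",
    "Description": (soup.find("meta", attrs={"name": "description"}) or {}).get("content"),
}
print(metadata)
```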
Automatically Generated Metadata

Title: Grand Canyon: Flood! - Stream Channel Erosion Activity
Grade Levels: 6, 7, 8
GEM Subjects: Science--Geology; Mathematics--Geometry; Mathematics--Measurement
Keywords:
– Named Entities: Colorado River (river), Grand Canyon (geography / location), Glen Canyon Dam (geography / structures)
– Subject Keywords: channels, conduit, controlled_release, dam, flow_volume, hold, reservoir, rivers, sediment, streams
– Material Keywords: clayboard, cookie_sheet, cup, paper_towel, pencil, roasting_pan, sand, water
Automatically Generated Metadata (cont'd)

Pedagogy: Collaborative learning; Hands-on learning
Tool For: Teachers
Resource Type: Lesson Plan
Format: text/HTML
Placed Online: 1998-09-02
Name: PBS Online
Role: onlineProvider
Homepage: http://www.pbs.org
Metadata Evaluation Experiment
• Blind test of automatic vs. manually generated metadata
• Subjects:
– Teachers
– Education Students
– Professors of Education
• Web-based experiment
– Subjects provided with educational resources and metadata records
– 2 conditions tested
Metadata Evaluation Experiment
Blind Test of Automatic vs. Manual Metadata
Expectation Condition – Subjects reviewed:
1st - metadata record
2nd - lesson plan
and then judged whether the metadata provided an accurate preview of the lesson plan on a 1 to 5 scale

Satisfaction Condition – Subjects reviewed:
1st - lesson plan
2nd - metadata record
and then judged the accuracy and coverage of the metadata on a 1 to 5 scale, with 5 being high
Qualitative Experimental Results

                                     Expec   Satis   Comb
# Manual Metadata Records              153     571    724
# Automatic Metadata Records           139     532    671
Manual Metadata Average Score         4.03    3.81   3.85
Automatic Metadata Average Score      3.76    3.55   3.59
Difference                            0.27    0.26   0.26
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest
StandardConnection
• Goal: Determine feasibility & quality of automatically mapping teaching standards to learning resources
– Example standard: “Solve linear equations and inequalities algebraically and non-linear equations using graphing, symbol-manipulating or spreadsheet technology.”
• Data:
– Educational Resources: Lesson Plans, Activities, Assessment Units, etc.
– Teaching Standards: Achieve/McREL Compendix
Cross-mapping through the Compendix Meta-language

[Figure: The lesson “Simultaneous Equations Using Elimination” maps to Compendix standard URI M8.4.11ABCJ, which cross-maps to the corresponding Washington, Arkansas, Alaska, Michigan, California, New York, Florida, and Texas state standards.]
StandardConnection Components

[Figure: Educational resources (lesson plans, activities, assessment units, etc.) are linked through the Compendix (e.g., Mathematics 6.2.1 C: “Adds, subtracts, multiplies, & divides whole numbers and decimals”) to the corresponding state standards.]
Lesson Plan: “Simultaneous Equations Using Elimination”
Submitted by: Leslie Howe
Email: teachhowe2@hotmail.com
School/University/Affiliation: Farragut High School, Knoxville, TN
Grade Level: 9, 10, 11, 12, Higher education, Vocational education, Adult/Continuing education
Subject(s): Mathematics / Algebra
Duration: 30 minutes
Description: The Elimination method is an effective method for solving a system of two unknowns. This lesson provides students with immediate feedback using a computer program or online applet.
Goals: The student will be able to solve a system of two equations when there are two unknowns.
Materials: Online computer applet / program http://www.usit.com/howe2/eqations/index.htm Similar downloadable C++ application available at the same site.
Procedure: A system of two unknowns can be solved by multiplying each equation by the constant that will make the coefficient of one of the variables become the LCM (least common multiple) of the initial coefficients. Students may use the scroll bars on the indicated applet to multiply the equations by constants until the GCF is located. When the "add" button is activated after the correct constants are chosen one of the variables will be eliminated. The process can be repeated for the second variable. The student may enter the solution of the system by using scroll bars. When the "check" button is pressed the answer is evaluated and the student is given immediate feedback. (The same procedure can be done using the downloadable C++ application.) After 5-10 correct responses the student should make the transition to paper and solve the equations without using the applet. The student can still use the applet to check the answer. The applet will generate problems in a random fashion. All solutions are integers.
Assessment: The lesson itself provides alternative assessment. The correct responses are recorded.
Lesson Plan with Standard Attached: “Simultaneous Equations Using Elimination”

Submitted by: Leslie Howe
Email: teachhowe2@hotmail.com
School/University/Affiliation: Farragut High School, Knoxville, TN
Grade Level: 9, 10, 11, 12, Higher education, Vocational education, Adult/Continuing education
Subject(s): Mathematics / Algebra
Duration: 30 minutes

Standard: McREL 8.4.11 Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities

(Description, Goals, Materials, Procedure, and Assessment are unchanged from the lesson plan above.)
Automatic Assigning of Standards as a Retrieval Process

Step 1 - Index of terms from Standards:
The document collection is the set of Compendix standards. Each standard is processed and indexed; the Index of Standards is assembled from the subject heading, secondary subject, actual standard, and vocabulary.
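A hedged sketch of this indexing step, assuming scikit-learn for the TF/IDF index (the field names and record layout are invented; the standard text is McREL 8.4.11 from the slides above):

```python
# A hedged sketch of Step 1, assuming scikit-learn: each standard becomes
# one indexed "document" assembled from its fields. Field names and the
# vocabulary string are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

standards = [{
    "id": "8.4.11",
    "subject_heading": "Mathematics",
    "secondary_subject": "Algebra",
    "standard": ("Uses a variety of methods (e.g., with graphs, algebraic "
                 "methods, and matrices) to solve systems of equations and "
                 "inequalities"),
    "vocabulary": "equation inequality matrix graph system solve",
}]

docs = [" ".join([s["subject_heading"], s["secondary_subject"],
                  s["standard"], s["vocabulary"]]) for s in standards]

vectorizer = TfidfVectorizer(stop_words="english")
standards_index = vectorizer.fit_transform(docs)   # one TF/IDF row per standard
```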
Step 2 - Lesson Plan as Query:
A new lesson plan is NLP-processed, including part-of-speech tagging and bracketing of phrases & proper names (e.g., Simultaneous|JJ Equations|NNS Using|VBG Elimination|NN). Filtering then eliminates some sections or gives others greater weight (e.g., citations are removed), keeping the relevant parts of the lesson plan. TF/IDF relative-frequency weights are computed for words, phrases, proper names, etc., and the query is the top 30 weighted terms (e.g., equation, eliminate, solve).
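Continuing the sketch from Step 1, the query construction might look like this (the lesson text is abbreviated sample data; the top-30 selection follows the description above):

```python
# Continuing the Step 1 sketch: weight the (filtered) lesson-plan text
# against the standards vocabulary and keep the top 30 terms as the query.
import numpy as np

lesson_text = ("The Elimination method is an effective method for solving "
               "a system of two unknowns ...")   # NLP-processed, filtered text

weights = vectorizer.transform([lesson_text]).toarray()[0]
feature_names = vectorizer.get_feature_names_out()
order = np.argsort(weights)[::-1][:30]           # highest-weighted terms first
query_terms = [feature_names[i] for i in order if weights[i] > 0]
print(query_terms)   # e.g. ['equations', 'solve', 'system', ...]
```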
Step 3 - Assignment of Standard to Lesson Plan:
The lesson-plan query is run against the index of terms from the standards, and the best-matching standard is assigned to the lesson plan.
Teaching Standard Assignment as Retrieval Task Experiment
• Exploratory test run
– 3,326 standards (documents)
– 2,239 lesson plans (queries)
– TF/IDF term weighting scheme
– Top 30 weighted terms from each lesson plan as a query vector
• Manual evaluation
– Focusing on understanding of issues & solutions
Information Retrieval Experiments
• Baseline Results
– 68 queries (lesson plans) evaluated
– 24 (35%) queries - appropriate standard was ranked first
– 28 (41%) queries - predominant standard was in top 5
– Room for improvement, but promising (see the scoring sketch below)
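For concreteness, the rank-1 and top-5 rates above can be computed from per-query ranks like so (the ranks list here is invented, not the study's data):

```python
# A small sketch of the scoring: given the rank at which each lesson plan's
# appropriate standard was retrieved (invented ranks, not the study's data),
# compute the rank-1 and top-5 rates reported above.
ranks = [1, 4, 12, 1, 2, 9, 1, 3]   # hypothetical ranks for 8 queries

ranked_first = sum(r == 1 for r in ranks) / len(ranks)
in_top_five = sum(r <= 5 for r in ranks) / len(ranks)
print(f"ranked first: {ranked_first:.0%}, in top 5: {in_top_five:.0%}")
```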
Future Research
• Improve current retrieval performance
– Matching algorithm, document expansion, etc.
• Apply a classification approach to the StandardConnection project
• Compare the information retrieval approach with the classification approach
• Improve browsing access for teachers & administrators
Browsing Access to Learning Resources

[Figure: Automatic assignment of standards to lesson plans. A lesson plan with standards attached is linked, via Standard 8.4.11, into a browsable map of standards (e.g., Strand Maps). Neighboring standards shown: 8.3.6 (Solves simple inequalities and non-linear equations with rational number solutions, using concrete and informal methods); 8.4.11 (Uses a variety of methods, e.g., with graphs, algebraic methods, and matrices, to solve systems of equations and inequalities); 8.4.12 (Understands formal notation, e.g., sigma notation, factorial representation, and various applications, e.g., compound interest, of sequences and series).]
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest
Life-Cycle Evaluation of Metadata

1. Initial generation
– Methods: Manual, Automatic
– Costs: Time, Human Resources, Technology
2. Accessing DL resources
– Users' interactions: Browsing, Searching
– Relative contribution of each metadata element
3. Search Effectiveness
– Precision
– Recall
GOAL: Measure Quality & Usefulness of Metadata

[Figure: Evaluation framework linking metadata generation (METHODS: Manual, Semi-Automatic, Automatic; COSTS: Time, Human Resources, Technology) to system and user evaluation of the resulting metadata: understanding, browsing & searching, and precision & recall.]
Evaluation Methodology
• Automatically metatag a Digital Library collection that has already been manually meta-tagged.
• Solicit range of appropriate Digital Library users.
• For each metadata element:
1. Users qualitatively evaluate it in light of the digital resource.
2. Conduct a standard IR experiment.
3. Observe subjects while searching & browsing.
• Monitor with eye-tracking & think-aloud protocols
Information Retrieval Experiment
• Users ask queries of system
• System retrieves documents using either:
– Manually assigned metadata
– Automatically generated metadata
• System ranks documents by its estimate of relevance
• Users review retrieved documents & judge relevance
• Compute precision & recall (see the sketch after this list)
• Compare results according to:
– Method of assignment
– The Metadata element which enabled retrieval
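The precision & recall computation itself is standard; here is a minimal sketch with invented document IDs:

```python
# A minimal sketch of the precision & recall computation, with invented
# document IDs: precision = hits / retrieved, recall = hits / relevant.
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"doc1", "doc3", "doc4", "doc9"}   # what the system returned
relevant = {"doc1", "doc2", "doc4"}            # what users judged relevant
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)
```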
User Studies: Methods & Questions
1. Observations of Users Seeking DL Resources
– How do users search & browse the digital library?
– Do search attempts utilize the available metadata?
– Which metadata elements are most important to users?
– Which are used consistently for the best results?
User Studies: Methods & Questions (cont’d)
2. Eye-tracking with Think-aloud Protocols
– Which metadata elements do users spend the most time viewing?
– What are users thinking about when seeking digital library resources?
– Show correlation between what users are looking at and what they are thinking.
– Use eye-tracking to measure the number & duration of fixations, scan paths, dilation, etc.

3. Individual Subject Data
– How does expertise / role influence seeking resources from digital libraries?
Sample Lesson Plans
Eye Scan Path For Bug Club Document
Eye Scan Path For Sigmund Freud Document
What, When, Where, and How Long
[Figure: eye-tracking output showing, for each word fixated, the fixation number and fixation duration.]
In Summary: Metadata Research Goals
1. Improve access via automatic metadata generation:
• Provide richer, more complete, and consistent metadata.
• Increase the number of resources available electronically.
• Increase the speed with which they are added.
2. Add appropriate teaching standards to each resource.
3. Provide empirical results on quality, utility, and cost of automatic vs. manual metadata generation.
4. Show evidence as to which metadata elements are needed.
5. Inform HCI design with a better understanding of users’ behaviors when browsing and searching Digital Libraries.
6. Employ automatic metadata generation to build the Semantic Web.