Quality Taxonomies

Post on 10-Jan-2016

Quality Taxonomies

Jim Nisbet
Senior Vice President of Technology

Semio Corporation

Knowledge Technologies 2001
March 5th, 2001

Ontology / Taxonomy

Root Ontology

Taxonomy Generation

Static Discovery

Dynamic Discovery

What is Quality ?

“Best value for the money”

According to this definition, you are entitled to high performance from a costly product; likewise, a low-cost product or service can be expected to deliver less. For example, a loosely prepared demo delivery is both predictable and acceptable, since its quality is low conformance at low cost.

What is Quality ?

“Good Quality is Nominal Conformance”

Taxonomy Quality is defined as Taxonomy Conformance to:
• Valid requirements;
• Explicitly documented development standards; and,
• Implicit characteristics that are expected of all professionally developed taxonomies, such as the desire for good maintainability.

Standards

ISO 2788-1986
• International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, 1986. (ISO 2788-1986(E)). (Available in the U.S. from American National Standards Institute)

ISO 5964-1985
• International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, 1985. (ISO 5964-1985(E)). (Available in the U.S. from American National Standards Institute)

ANSI/NISO Z39.19-1993
• National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. 69p. (ANSI/NISO Z39.19-1993)

SEMIO Quality Plan v1 2000

ISO/IEC 13250 Topic Maps

RDF
• Please refer to RDF at http://www.w3.org/RDF and XML at http://www.w3.org/XML

Project Plan

1. Kick-off
2. Requirements Review
3. Lexicon Review
4. Taxonomy Review
5. Tags Review
6. Final Review

1. Kick-off

Objectives
• Purpose
• Scope
• Scale
• Users
• Conditions of receipt

Roles
• Supplier
• Customer
– Admin
– KE
– Experts
– Users

Planning

Training and Transfer

2. Requirements Review

• Sources
• Lexicon
• Ontology
• Install

Sources

• Dispersion (Multiplicity, Size, Homogeneity)
• Refresh
• Access

Features                    Internet, News, E-Mail   Reports, Patents   E-Trade, Logs
Informative content         -                        +                  +
Number of topics covered    +                        +                  -
Structured information      -                        +                  +
Size of records             -                        +                  -
Number of records           +                        -                  +

Typical Patterns

Disparity
• Adjust sources
• Adjust crawl strategy
• Isolate communities / taxonomies

Lexicon

• Vocabularies, etc.
• Substitutions: Acronyms, Synonyms, etc.
• Preferred Keywords: Brand Names, etc.
• Banned Keywords

Typical Patterns

Lack of requirements
• Use librarian resources
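The lexicon requirements above (substitutions, banned keywords) can be illustrated with a small sketch. The function name, the sample substitutions, and the banned phrase are all hypothetical; the talk does not show how Semio's tooling applies them.

```python
def normalize_terms(terms, substitutions, banned):
    """Apply substitutions, drop banned keywords, and remove duplicates.

    All names and sample data here are illustrative assumptions,
    not part of the original presentation.
    """
    result = []
    for term in terms:
        key = term.lower()
        # Replace acronyms / synonyms with the preferred form, if any.
        key = substitutions.get(key, key)
        if key in banned:
            continue
        if key not in result:
            result.append(key)
    return result


substitutions = {"irs": "internal revenue service", "levy": "tax"}
banned = {"click here"}
terms = ["IRS", "Levy", "tax", "Click here", "loan"]

normalized = normalize_terms(terms, substitutions, banned)
print(normalized)
```

Applying substitutions before the banned-keyword check is a design choice in this sketch: it lets a banned list be written once against preferred forms.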

Ontology

• Thesaurus?
• Is the information domain analysis complete, consistent, and accurate?
• Is the partitioning of the problem complete?

Typical Patterns

Directory versus Taxonomy
• Isolate “directory” branches

Thesaurus versus Taxonomy
• Put an ontology on top of the thesaurus
• Check ASAP that the thesaurus generics match the extracted lexicon

Very high-level design requirements for top categories
• Plan to work bottom-up

See also Taxonomy (functions, combinations, etc.)

Install

Implementation / Integration:
• Are external and internal interfaces properly defined?
• Are all requirements traceable to the system level?
• Has prototyping been conducted for the user/customer?
• Is performance achievable within the constraints imposed by other system elements?
• Are requirements consistent with schedule, resources, and budget?

Typical Patterns
• Scale
• Security
• Missing Documents

3. Lexicon Review

Coverage
• Extracted words / Words
• (Extracted Index / Index)

Sources benchmarking
• Coverage
• Extraction quality
• Topic distribution

Structure
• Most Frequent Phrases
• Most Productive Generics

Substitutions

Exceptions

Typical Patterns

Low level of frequency / quality for the most meaningful content
• Increase size of value corpus
• Filter and re-import lexicon
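The coverage measure "Extracted words / Words" reads as a simple ratio. A minimal sketch follows, assuming coverage means the fraction of the corpus vocabulary that the extraction step actually captured; the function name and sample word sets are hypothetical.

```python
def coverage(extracted, reference):
    """Fraction of reference items that also appear in the extracted set."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(extracted)) / len(reference)


# Hypothetical sample data: all words in the corpus vs. words in the lexicon.
corpus_words = {"tax", "loan", "liability", "asset", "refund"}
extracted_words = {"tax", "loan", "liability", "interest"}

word_coverage = coverage(extracted_words, corpus_words)
print(f"lexicon coverage: {word_coverage:.2f}")
```

The same function could be applied to index entries for the "Extracted Index / Index" variant.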

4. Taxonomy Review

Taxonomy Operation
• Correctness
• Reliability
• Usability
• Integrity
• Efficiency

Taxonomy Revision
• Maintainability
• Flexibility
• Testability

Taxonomy Transition
• Portability
• Reusability
• Interoperability

[Figure: tree of taxonomic ranks from Level 0 to Level 4, with nodes labeled UB, lf, g, s, and v.]

UB = unique beginner, lf = life-form, g = generic, s = specific, v = varietal

Rank               Example
Unique Beginner    Tax
Life Form          Liability
Generic            Loan
Specific           Term loan
Varietal           Short-term loan

Folk Taxonomies Design

The Berlin and Kay model: Taxonomy = Nomenclature + Terminology

Correctness

• Accuracy
• Completeness
• Consistency

Accuracy

• Precision
• Recall
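Precision and recall for a single category can be computed as in the sketch below. It assumes a hand-built reference set of relevant documents is available for comparison; the function name and document identifiers are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Precision and recall of the documents assigned to one category.

    assigned: documents the taxonomy placed in the category.
    relevant: documents a reviewer says belong there (assumed reference set).
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers.
assigned = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc5"}

precision, recall = precision_recall(assigned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```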

Completeness

• Taxonomy
• Maps
• Lexicon
• Collection

Concentration Works Against Quality

Lexicon

Document Collection

Maps

Taxonomy

Tagging

• Tagging Coverage
• Ontology Coverage
• Hook Coverage
• Map Coverage
• Lexical Coverage
• Collection Coverage

Consistency: Typical Patterns

• Objectivization
• Hyperonymy
• Speciation
• Necessity

Objectivization

Employment
– Firing
– Hiring
– Salaries

• Avoid functional categories
• Don’t mix functions / objects
• Exhaust scripts
• Match idiomatic phrases

Genericity

Parts
– Air Conditioning
– Belts and Hoses
– Body
– Brake System
– Chassis
– Engine
– Exhaust System
– Fuel System
– Glass
– Ignition

• Avoid meronymy
• Don’t mix meronymy / hyperonymy
• Exhaust prototypes

Speciation

Person > Unwelcome person > Unpleasant person > Selfish person > Opportunist > Backscratcher (WordNet)

• Avoid “strings” of categories
• Avoid (non-idioms) properties for categories
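A "string" of categories can be detected mechanically as a run of nodes that each have exactly one child. The sketch below assumes the slide's example is a single chain from Person down to Backscratcher and that the taxonomy is stored as a parent-to-children dictionary; the function name and the `min_length` threshold are hypothetical.

```python
def single_child_chains(tree, root, min_length=3):
    """Return maximal runs of nodes linked by single-child edges."""
    chains = []

    def walk(node, chain):
        children = tree.get(node, [])
        if len(children) == 1:
            # Still inside a string: extend the current chain.
            walk(children[0], chain + [children[0]])
            return
        # The chain ends at a leaf or at a node that branches.
        if len(chain) >= min_length:
            chains.append(chain)
        for child in children:
            walk(child, [child])

    walk(root, [root])
    return chains


# Assumed reading of the slide's WordNet example as one chain.
tree = {
    "Person": ["Unwelcome person"],
    "Unwelcome person": ["Unpleasant person"],
    "Unpleasant person": ["Selfish person"],
    "Selfish person": ["Opportunist"],
    "Opportunist": ["Backscratcher"],
}

chains = single_child_chains(tree, "Person")
for chain in chains:
    print(" > ".join(chain))
```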

Necessity

Tax
– Individuals
  – Assets
  – Liability
– Corporations
  – Assets
  – Liability

[Figure: tree diagram with nodes labeled B, C, D, E, F, G, H, I, K.]

Tax

Individuals Corporations

Assets Liability

Individuals Corporations

Avoid non-productive categories

Avoid combinations of categories
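One way to spot combinations of categories is to look for parents that repeat the same set of children, as Individuals and Corporations both do with Assets and Liability in the example. This is a heuristic sketch only; the function name and the dictionary representation are assumptions, not the talk's method.

```python
from collections import defaultdict


def repeated_child_sets(tree):
    """Group parents that share an identical, non-empty set of child names."""
    groups = defaultdict(list)
    for parent, children in tree.items():
        if children:
            groups[frozenset(children)].append(parent)
    return {
        tuple(sorted(children)): sorted(parents)
        for children, parents in groups.items()
        if len(parents) > 1
    }


tree = {
    "Tax": ["Individuals", "Corporations"],
    "Individuals": ["Assets", "Liability"],
    "Corporations": ["Assets", "Liability"],
}

combos = repeated_child_sets(tree)
print(combos)
```

A repeated child set suggests two independent facets multiplied together, which is the situation the slide warns against.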

Nomenclature (Design Structure) Quality Index

• Depth
• Width
• Balance

[Figure: tree of taxonomic ranks from Level 0 to Level 4, with nodes labeled UB, lf, g, s, and v.]

UB = unique beginner, lf = life-form, g = generic, s = specific, v = varietal
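Depth and width can be read directly off a tree; the talk does not define balance, so the sketch below assumes one possible definition (shallowest leaf depth divided by deepest leaf depth). The function names are hypothetical, and the "Assets" branch is added to the Tax example purely for illustration.

```python
def level_widths(tree, root):
    """Number of categories at each level, with the root at level 0."""
    widths, level = [], [root]
    while level:
        widths.append(len(level))
        level = [child for node in level for child in tree.get(node, [])]
    return widths


def leaf_depths(tree, node, depth=0):
    """Depth of every leaf below node."""
    children = tree.get(node, [])
    if not children:
        return [depth]
    return [d for child in children for d in leaf_depths(tree, child, depth + 1)]


# "Assets" is a hypothetical extra branch, not from the slides.
tree = {
    "Tax": ["Liability", "Assets"],
    "Liability": ["Loan"],
    "Loan": ["Term loan"],
    "Term loan": ["Short-term loan"],
}

widths = level_widths(tree, "Tax")
depths = leaf_depths(tree, "Tax")
depth = max(depths)
width = max(widths)
# Assumed definition: 1.0 means every leaf sits at the same level.
balance = min(depths) / max(depths) if max(depths) else 1.0
print(depth, width, balance)
```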

Complexity Index

Cyclomatic complexity increases with the number of cross-references within the taxonomy, giving an indication of its complexity and of the difficulty of testing it.

Taxonomy Complexity Index combines:
• autonomy
• closure
• similarity
• typicality
• commonality
• redundancy
• stability
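The effect of cross-references on cyclomatic complexity can be shown with the standard graph formula V(G) = E - N + 2 for a connected graph. A pure tree has E = N - 1, so V(G) = 1, and each cross-reference adds one. The function name and sample categories below are hypothetical, and the sketch covers only the cyclomatic part, not the seven-factor Taxonomy Complexity Index.

```python
def cyclomatic_complexity(parent_child_links, cross_references):
    """V(G) = E - N + 2, assuming the categories form one connected graph."""
    edges = list(parent_child_links) + list(cross_references)
    nodes = {name for edge in edges for name in edge}
    return len(edges) - len(nodes) + 2


links = [
    ("Tax", "Individuals"),
    ("Tax", "Corporations"),
    ("Individuals", "Liability"),
    ("Corporations", "Assets"),
]
# Hypothetical cross-reference between two existing categories.
cross = [("Corporations", "Liability")]

tree_only = cyclomatic_complexity(links, [])
with_cross_reference = cyclomatic_complexity(links, cross)
print(tree_only, with_cross_reference)
```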

Maturity index

IEEE Std 982.1-1988 defines a software maturity index; applied to a taxonomy, it provides an indication of the stability of the taxonomy.

Maturity Index combines:
• number of modules in the current ontology / taxonomy.
• number of modules in the current ontology / taxonomy that have been changed.
• number of modules added to the current ontology / taxonomy.
• number of modules deleted from the previous version of the ontology / taxonomy.
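The four counts above combine in the software maturity index of IEEE Std 982.1-1988 as (M_T - (F_a + F_c + F_d)) / M_T, where M_T is the number of modules in the current version and F_a, F_c, F_d are the numbers added, changed, and deleted. The sketch below assumes the talk applies that formula unchanged to taxonomy modules; the function name and sample counts are hypothetical.

```python
def maturity_index(current, changed, added, deleted):
    """Maturity index: (M_T - (F_a + F_c + F_d)) / M_T.

    Values near 1.0 indicate a stable taxonomy; lower values indicate churn.
    """
    if current <= 0:
        raise ValueError("current module count must be positive")
    return (current - (added + changed + deleted)) / current


# Hypothetical counts for one release of a taxonomy.
smi = maturity_index(current=200, changed=10, added=6, deleted=4)
print(f"maturity index: {smi:.2f}")
```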

5. Tags Review

• Document coverage
• Concepts coverage

<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag>
      <tagname>Liability</tagname>
      <weight>1.289</weight>
    </tag>
    <tag>
      <tagname>Federal Funds</tagname>
      <weight>0.746</weight>
    </tag>
  </document>
</tagset>
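A tagset in this format can be read with Python's standard XML parser to check document coverage. The sketch assumes document coverage means the fraction of collection documents that received at least one tag; the function names and the untagged example URL are hypothetical.

```python
import xml.etree.ElementTree as ET

TAGSET = """<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag><tagname>Liability</tagname><weight>1.289</weight></tag>
    <tag><tagname>Federal Funds</tagname><weight>0.746</weight></tag>
  </document>
</tagset>"""


def read_tags(xml_text):
    """Map each document URL to its list of (tag name, weight) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for document in root.findall("document"):
        url = document.findtext("docurl")
        result[url] = [
            (tag.findtext("tagname"), float(tag.findtext("weight")))
            for tag in document.findall("tag")
        ]
    return result


def document_coverage(tags_by_document, all_urls):
    """Fraction of documents in the collection with at least one tag."""
    if not all_urls:
        return 0.0
    tagged = sum(1 for url in all_urls if tags_by_document.get(url))
    return tagged / len(all_urls)


tags_by_document = read_tags(TAGSET)
# The second URL is a hypothetical untagged document.
collection = ["http://www.TaxSource.com", "http://example.com/untagged"]
doc_coverage = document_coverage(tags_by_document, collection)
print(tags_by_document, doc_coverage)
```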

6. Final Review

• Receipt
• Maintenance

Jim Nisbet
niz@semio.com