Terminology in Statistical Information Integration Tasks: What’s the Problem?
Transcript of Terminology in Statistical Information Integration Tasks: What's the Problem?
Open Forum 2003 on Metadata Registries
Thursday, January 23, 2003, 2:00-2:45 pm
Sheila O. Denn
Introduction
This work was undertaken as part of an NSF grant (EIA 0131824) to study the integration of data and interfaces, working toward a Statistical Knowledge Network.
This talk focuses on results from the first phase of a metadata user study to determine what kinds of problems users have with terminology and metadata on government statistical web sites.
[Diagram: four agencies, each with its own backend data behind a firewall and its own intermediary (reports, tables, "planned" DB queries); the end user is a generally passive reader with little interaction who must do all integration.]
Current Situation: each agency has its own backend data and provides its own intermediary. End user has little opportunity for interaction or active manipulation. Burden of finding information and integrating it across agencies (and occasionally within one agency) is on the user.
[Diagram: each agency's backend data sits behind its firewall and feeds a shared public intermediary; a statistical ontology, plus domain ontologies from domain expert and end user communities, support the intermediary; user interfaces connect many end users to it.]
Goal: In the SKN, each agency has its own backend data, which feeds into a common public intermediary (PI) outside of the firewall: variable/concept level, XML-based, a single point of access to information from all agencies. User interfaces link to the PI under user control. End users interact with data from an information/concept perspective, not an agency perspective.
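As a purely illustrative sketch of what variable-level, XML-based access might look like: the record layout, element names, and sample content below are invented for this talk, not the actual PI schema.

```python
# Minimal sketch of reading a hypothetical variable-level record from the
# public intermediary (PI). The XML layout and element names are invented
# for illustration; the talk does not specify the actual PI schema.
import xml.etree.ElementTree as ET

SAMPLE_RECORD = """
<variable id="cpi-u">
  <label>Consumer Price Index, All Urban Consumers</label>
  <agency>BLS</agency>
  <concept>price index</concept>
  <definition level="general">An index combines numbers measuring
    different things into a single number.</definition>
</variable>
"""

record = ET.fromstring(SAMPLE_RECORD)
print(record.get("id"), "-", record.findtext("label"))
print("Agency:", record.findtext("agency"))
print("Definition:", " ".join(record.findtext("definition").split()))
```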
What kinds of problems does terminology cause for users?
[Diagram: user terms vs. agency terms in three problem types. Miss: the user's term has no linked agency term (Term_user -> ??), or the user meets an agency term with no counterpart (?? <- Term_agency). Collision: Term_user and Term_agency conflict, including clusters of similar user terms. Categorization: the user's category groupings (Term_user category) do not line up with the agency's (Term_agency category).]
What kinds of problems does terminology cause for users?
Misses
- There is no agency term or concept that is linked to a term or concept that the user is interested in, or
- The user encounters a term on the system with which she is unfamiliar or about which she has only a vague understanding.
- Examples: seasonal adjustment; consumption vs. production; farm profits vs. market value of agricultural products
What kinds of problems does terminology cause for users?
Collisions
- A user has an understanding of a concept that is different from the way the concept is expressed by the agency.
- The same term is used differently by different agencies, making integration of data difficult.
- Can also apply to clusters of terms where it is not clear what the distinction between them is.
- Examples: labor, labor force, labor supply, workforce, labor force participation rate, labor market; full-time employment; sector

Categorization
- When category groupings do not make sense to the user.
- Example: soybeans
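To make the miss and collision cases concrete, here is a minimal sketch; the glossaries, terms, and definitions are invented stand-ins, not real agency vocabularies.

```python
# Toy illustration of "miss" and "collision": the glossaries below are
# invented stand-ins for agency vocabularies. A "categorization" problem
# would involve comparing category hierarchies and is omitted here.
AGENCY_GLOSSARIES = {
    "Agency A": {"labor force": "persons working or actively seeking work"},
    "Agency B": {"labor force": "civilian noninstitutional population 16+ "
                                "that is employed or unemployed"},
}

def classify(user_term: str) -> str:
    definitions = {
        agency: glossary[user_term]
        for agency, glossary in AGENCY_GLOSSARIES.items()
        if user_term in glossary
    }
    if not definitions:
        return "miss: no agency term is linked to the user's term"
    if len(set(definitions.values())) > 1:
        return "collision: agencies use the same term differently"
    return "match: the agencies agree on the term"

print(classify("workforce"))    # miss
print(classify("labor force"))  # collision
```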
Data Collection
In previous work:
- Transaction logs
- User queries
- Interviews
In the first phase of the current study: interviews with agency and non-agency domain experts.
These sources of evidence yielded categories of terms that can cause difficulty.
Categories of Terms
- Statistical terms
- Date/currency/time
- Geography
- Domain terms
- User terms
Implications for Vocabulary Support Tools
Goals:
- Provide a basic level of statistical literacy
- Not intended to be a highly technical or comprehensive resource
- Include terms users frequently encounter while browsing statistical agency sites

Sources of Evidence:
- Terminology used on frequently visited pages
- Anecdotal evidence from agency and non-agency consultants
- Metadata user study
- Web crawl of agency sites
Implications for Vocabulary Support Tools
- We need to explore how we can use metadata to map between the user terms and the agency terms, and between terms as used by different agencies (a minimal mapping sketch follows this list).
- Users are not likely to browse the glossary as a distinct activity, so they need "just-in-time" vocabulary support.
- Vocabulary support should allow users to remain in context, not lose sight of the task they are working on.
- Context specificity: explanations should be provided at varying levels of specificity:
  - General (context-free or "universal")
  - Agency- or context-specific (term as used by a particular agency or within a particular domain)
  - Table- or statistic-specific (term as it relates to a particular row, column, or statistic)
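One possible shape for such a mapping, sketched with invented concepts and terms (this is not the project's metadata design): a crosswalk that links a user term to an agency's preferred term through a shared concept.

```python
# Hypothetical crosswalk linking user vocabulary to agency vocabulary through
# a shared concept; every concept and term here is invented for illustration.
from typing import Optional

USER_TO_CONCEPT = {
    "jobs": "employment",
    "workforce": "labor force",
    "paycheck": "earnings",
}

CONCEPT_TO_AGENCY_TERM = {
    ("labor force", "Agency A"): "Civilian Labor Force",
    ("labor force", "Agency B"): "Labor Force Status",
    ("earnings", "Agency A"): "Average Hourly Earnings",
}

def agency_term(user_term: str, agency: str) -> Optional[str]:
    """Map a user term to the given agency's preferred term, if any."""
    concept = USER_TO_CONCEPT.get(user_term, user_term)
    return CONCEPT_TO_AGENCY_TERM.get((concept, agency))

print(agency_term("workforce", "Agency A"))  # Civilian Labor Force
print(agency_term("workforce", "Agency B"))  # Labor Force Status
```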
Implications for Vocabulary Support Tools
- Provide explanations of a term or concept that are as relevant to the user's current context as possible.
- The most specific explanations available should be offered at the time a user first invokes help.
- If there are no explanations appropriate for a specific statistic, row, or column, offer an explanation one level up in generality (see the sketch after this list).
- Pathways from specific to general will be based on a statistical ontology currently under development.
- The ontology will also be used to provide patterns (templates) for definitions at each level of specificity.
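A minimal sketch of that specific-to-general fallback; the level names, their ordering, and the stored explanations are assumptions, since the actual pathways will come from the statistical ontology.

```python
# Sketch of "offer the most specific explanation available, else fall back one
# level up in generality". The level ordering and stored explanations are
# assumptions; real pathways would come from the statistical ontology.
LEVELS = ["statistic", "table", "agency", "general"]  # specific -> general

EXPLANATIONS = {
    ("median income", "agency"): "As used by this agency: income of the "
                                 "household at the 50th percentile ...",
    ("median income", "general"): "The value that splits a set of incomes "
                                  "into equal upper and lower halves.",
}

def explain(term: str, level: str) -> str:
    """Return the explanation at the requested level, or the nearest more
    general one."""
    for lvl in LEVELS[LEVELS.index(level):]:
        if (term, lvl) in EXPLANATIONS:
            return EXPLANATIONS[(term, lvl)]
    return f"No explanation available for '{term}'."

# Help invoked on a specific statistic falls back to the agency-level text,
# since no statistic- or table-level explanation exists.
print(explain("median income", "statistic"))
```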
Vocabulary Support Tool Examples
The tools we are working on will provide a basic level of explanation of statistical terms.
Tools may include:
- Definitions
- Examples
- Brief tutorials
- Demonstrations
- Interactive simulations
- Pointers to related terms/concepts
- Pointers to more complete (or more technical) explanations
Index

An index combines numbers measuring different things into a single number. The single number represents all the different measures in a compact, easy-to-use form. Values for an index can be compared to each other, for example, over time.

[Diagram: a "combiner" takes five measures (10.1, 103, 24.759, 6, 42) and produces a single value, index = 12.3. A chart applies the combiner in Jan, Apr, Jul, and Oct, giving index values 12.3, 13.1, 13.9, and 14.3: "The index has increased this year."]
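A toy version of the combiner, with invented weights (chosen only so the January measures from the diagram combine to 12.3); real indexes use carefully designed formulas and weights.

```python
# Toy "combiner": a weighted average that collapses several measures into one
# index value. The weights are invented, picked only so that the January
# measures from the diagram combine to 12.3; the April measures are made up.
def combine(measures, weights):
    assert len(measures) == len(weights)
    return sum(m * w for m, w in zip(measures, weights)) / sum(weights)

weights = [5, 0.03, 1, 3, 0.5]
january = [10.1, 103, 24.759, 6, 42]   # the measures shown in the diagram
april = [10.8, 109, 26.2, 6.4, 44.5]   # hypothetical later measures

print(round(combine(january, weights), 1))  # 12.3
print(round(combine(april, weights), 1))    # 13.1
```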
Consumer Price Index (CPI)
The Consumer Price Index (CPI) represents changes in prices of all goods and services purchased for consumption by urban households. It combines prices into a single number that can be compared over time.

Items are classified into 8 major groups:
- Food and Beverages
- Housing
- Apparel
- Transportation
- Medical Care
- Recreation
- Education and Communication
- Other

[Diagram: a "CPI combiner" takes prices from the eight groups (telephone appears as an example item under education & communication) and produces a single Consumer Price Index. A chart applies the combiner for each year 1997-2001, showing the CPI rising within the 160-180 range: "The Consumer Price Index has increased since 1995."]
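Since an index is useful mainly for comparison, here is a small sketch of the kind of over-time comparison the chart supports; the yearly values are approximate readings consistent with the chart, not official figures.

```python
# Comparing index values over time. The yearly numbers below are approximate
# readings consistent with the chart above, not official CPI figures.
cpi = {1997: 160.5, 1998: 163.0, 1999: 166.6, 2000: 172.2, 2001: 177.1}

def pct_change(series, start, end):
    """Percent change in an index between two periods."""
    return 100 * (series[end] - series[start]) / series[start]

print(f"CPI change, 1997-2001: {pct_change(cpi, 1997, 2001):.1f}%")  # 10.3%
```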
Antiknock Index, also known as Octane Rating
A number used to indicate gasoline’s antiknock performance in motor vehicle engines. The two recognized laboratory engine test methods for determining the antiknock rating, i.e., octane rating, of gasolines are the Research method and the Motor method. In the United States, to provide a single number as guidance to the consumer, the antiknock index (R+M)/2, which is the average of the Research and Motor octane numbers, was developed.
http://www.eia.doe.gov/glossary/glossary_a.htm
[Diagram: an "antiknock combiner" averages the Research method and Motor method octane numbers, (R + M)/2, to give the antiknock index posted at the pump. Grade bands: Regular 85-88, Midrange 88-90, Premium 90 or above.]
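The (R + M)/2 formula and the grade bands from the diagram, as a small sketch; how values at the band boundaries or below 85 are handled is my assumption.

```python
# Antiknock index per the definition above: the average of the Research (R)
# and Motor (M) octane numbers. Grade cutoffs follow the diagram; treatment
# of values at band boundaries or below 85 is assumed for illustration.
def antiknock_index(research: float, motor: float) -> float:
    return (research + motor) / 2

def grade(index: float) -> str:
    if index >= 90:
        return "Premium"
    if index >= 88:
        return "Midrange"
    if index >= 85:
        return "Regular"
    return "below Regular"

aki = antiknock_index(research=91, motor=83)  # example octane numbers
print(aki, grade(aki))  # 87.0 Regular
```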
Evaluation
What do we need to evaluate?
- Technical accuracy
- Usability of interface
- "Effectiveness"
  - Is it attractive enough to entice people to use it?
  - Is it helpful? Is it informative?
  - Does it help the user complete the task?
How do we measure these things? What other kinds of vocabulary support issues do we need to address?
Other Issues
- Implementation
- Ongoing maintenance/responsibility
Project Teams
Metadata User Study Team: Carol Hert, Stephanie Haas, Jenny Fry, Lydia Harris, Sheila Denn
Vocabulary Support Team: Stephanie Haas, Ron Brown, Cristina Pattuelli, Jesse Wilbur
GovStat PIs: Gary Marchionini (UNC-CH), Stephanie Haas (UNC-CH), Carol Hert (Syracuse), Catherine Plaisant (UMd), Ben Shneiderman (UMd)
For More Information
Sheila O. Denn
School of Information and Library Science
University of North Carolina at Chapel Hill
http://ils.unc.edu/govstat/