John J. Kovarik, NSA/CSS Senior Language Technology Authority 1 Building the Federal Multilingual...


Page 1: Building the Federal Multilingual Infrastructure in Unicode

Foreign Language Dictionary Tools

John J. Kovarik, NSA/CSS Senior Language Technology Authority

Page 2: Project Goals

Unite federal foreign language analysts in communities of interest by language to increase the speed and accuracy of multilingual work

Outgrowth of NSA legacy individual foreign language dictionary tools

Share Next Generation tool suite across the federal government in 90 languages

Page 3: Foreign Language Work, 1970s

Manual tools
– Hardcopy dictionaries (2-10 per person)
– 3x5 card files for specialized vocabulary
– Pen and paper only

Work environment
– Career analysts, revered as subject matter experts, rule the workplace.
– College graduates hired right out of school, some with military experience, enter the job.

Page 4: Foreign Language Challenge I

The classic sparse data problem

Never enough vocabulary

Never enough grammar training

Never enough cultural knowledge

Page 5: Foreign Language Challenge II

Why it’s a sparse data problem

Communication is usually spontaneous, between 2 or more people who share a great deal of special knowledge

Ultimate goals often not explicit

Ambiguity reigns for outsiders

No simple rules for filling in the blanks

Page 6: An example

女人 去 打敲 竹鋼的 密醫 來 解決 她的 問題 。

All characters glossed (4 min/chr, 17 chrs), meaning obscure: “Female people go hit knock bamboo curtain’s secret doctor come untie decide her ask issue.”

All phrases verified (longest string match: 9), clearer: “A woman goes and knocks on the bamboo curtain’s secret doctor to come resolve her problem.” …but still uncertain

Check for neologism: go to FBIS recent translations and look to clarify the meaning of the new term “knock bamboo curtain”.

“Knock on the bamboo curtain for a secret doctor” = “seek out an illegal quack”

“A woman (must) go seek out an illegal quack to resolve her problem.”
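
The "longest string match" step in this example can be sketched as a greedy forward segmentation. This is a minimal illustration only: the function name, the `max_len` parameter, and the toy lexicon (built from the phrase boundaries shown on the slide) are assumptions, not the tool's actual implementation or data.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon has it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical toy lexicon taken from the phrase boundaries on the slide.
LEXICON = {"女人", "去", "打敲", "竹鋼的", "密醫", "來", "解決", "她的", "問題"}
SENTENCE = "女人去打敲竹鋼的密醫來解決她的問題"

segments = longest_match_segment(SENTENCE, LEXICON)
print(segments)
print(len(SENTENCE), "characters ->", len(segments), "phrases")
```

With this toy lexicon the 17 characters collapse into 9 phrases, which is why phrase-level lookup is so much faster than glossing character by character.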

Page 7: People say, “What’s the big deal with just an on-line dictionary?”

“I never/seldom use a dictionary!”
– Native speaker syndrome
– Vast majority of people must use a dictionary in a second/third language

“Hardcopy dictionaries are better.”
– Can’t do wild-card searches by hand
– Not engineered for 10 sec. avg. response
– Humans tire; machines do not.

Page 8: 1991: First Generation Dictionary DB Tool

200,000 entries from 3x5 cards collected over 20 years

Wild card searchable

Cross referenced 4 ways in accordance with user requirements

Displayed in native script

Can cut and paste queries/responses
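
To illustrate what "wild card searchable" means in practice, here is a small sketch using Python's standard `fnmatch` module over a handful of sample entries. The entries, glosses, and the `wildcard_search` helper are hypothetical; the slides do not describe how the 1991 tool implemented its search.

```python
from fnmatch import fnmatchcase

# Hypothetical sample entries: headword -> English gloss.
ENTRIES = {
    "密醫": "unlicensed doctor",
    "醫生": "doctor",
    "醫院": "hospital",
    "問題": "problem; question",
}


def wildcard_search(pattern, entries):
    """Return entries whose headword matches a shell-style pattern.

    '*' matches any run of characters and '?' matches exactly one.
    """
    return {hw: gloss for hw, gloss in entries.items() if fnmatchcase(hw, pattern)}


print(wildcard_search("*醫", ENTRIES))  # headwords ending in 醫
print(wildcard_search("醫?", ENTRIES))  # 醫 followed by exactly one character
```

A linear scan like this is fine for a demonstration; a collection of 200,000 entries would normally sit behind an index.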

Page 9: Reactions to 1st Generation Tool

Younger analysts used it; liked it; made great suggestions to improve it

Senior analysts usually would not use it

Page 10: 1995: 2nd Generation Dictionary DB Tool

Responses faster on queries with leading wild card

GUI customized per user input

Candidate entry system established

Usership doubled!

Senior analysts start to use it
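
The slides do not say how leading-wildcard queries were made faster. One common technique, shown here purely as an assumption, is to keep a second sorted index of reversed headwords, so that a suffix query such as `*醫` becomes a prefix lookup that binary search can answer. The class name and sample headwords are hypothetical.

```python
from bisect import bisect_left


class SuffixIndex:
    """Sorted index of reversed headwords (illustrative sketch)."""

    def __init__(self, headwords):
        # Reversing each headword turns "ends with X" into "starts with X".
        self._reversed = sorted(word[::-1] for word in headwords)

    def ending_with(self, suffix):
        key = suffix[::-1]
        # Binary search to the first reversed headword >= the reversed suffix.
        start = bisect_left(self._reversed, key)
        matches = []
        for rev in self._reversed[start:]:
            if not rev.startswith(key):
                break  # sorted order guarantees no further matches
            matches.append(rev[::-1])
        return matches


# Hypothetical headwords for demonstration.
index = SuffixIndex(["密醫", "中醫", "醫生", "問題", "話題"])
print(index.ending_with("醫"))
print(index.ending_with("題"))
```

The lookup touches only the matching slice of the index instead of scanning every entry, which is the kind of change that would make leading-wildcard responses noticeably quicker.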

Page 11: 1998: 3rd Generation Dictionary DB Tool

Database re-encoded in UTF-8

Simultaneous simplified and traditional Chinese display enabled

Average 1,000-3,000 candidate entries approved annually ’98-’02

Usership again doubled!
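
Re-encoding a database in UTF-8 amounts to decoding each stored value from its legacy encoding and encoding it again as UTF-8. The sketch below assumes Big5 as the legacy encoding only for illustration; the slides do not name the encoding the earlier generations used, and `reencode_to_utf8` is a hypothetical helper.

```python
def reencode_to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Convert bytes in a legacy encoding to UTF-8 bytes."""
    # decode() raises UnicodeDecodeError if the bytes are not valid
    # in the assumed legacy encoding, which surfaces bad records early.
    text = raw.decode(legacy_encoding)
    return text.encode("utf-8")


# Assumption: the legacy record is Big5-encoded traditional Chinese.
legacy_bytes = "密醫".encode("big5")
utf8_bytes = reencode_to_utf8(legacy_bytes, "big5")

print(len(legacy_bytes), "bytes in Big5 ->", len(utf8_bytes), "bytes in UTF-8")
print(utf8_bytes.decode("utf-8"))
```

Once everything is stored as Unicode, simplified and traditional forms can live side by side in the same record, which a single legacy code page generally cannot do.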

Page 12: Today: Wordscape, the Next Generation Dictionary Tool

Retains all Chinese capabilities

Expands to all language collections

Neologism newswire research tools

Over 90 languages represented in one Unicode DB unified under one XML schema and one suite of tools

Under LASER ACTD funding, extending all across the federal government!
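
As a rough picture of what "one XML schema" for many languages could look like, the sketch below parses two entries that share the same structure. Every element and attribute name here (`entry`, `headword`, `script`, `gloss`) is invented for illustration; the actual Wordscape schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry format: the same elements serve every language.
SAMPLE = """\
<dictionary>
  <entry lang="zh">
    <headword script="Hant">密醫</headword>
    <headword script="Hans">密医</headword>
    <gloss lang="en">unlicensed doctor</gloss>
  </entry>
  <entry lang="hi">
    <headword script="Deva">डॉक्टर</headword>
    <gloss lang="en">doctor</gloss>
  </entry>
</dictionary>
"""


def load_entries(xml_text):
    """Parse dictionary XML into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        entries.append({
            "lang": entry.get("lang"),
            # Map script code -> headword, so variants sit side by side.
            "headwords": {h.get("script"): h.text for h in entry.findall("headword")},
            "gloss": entry.findtext("gloss"),
        })
    return entries


entries = load_entries(SAMPLE)
for e in entries:
    print(e["lang"], e["headwords"], "=", e["gloss"])
```

Because every language uses the same record shape, one set of search and display tools can serve all of them.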

Page 13: Technology and Standards

New technology being used
– Benefits of scale from use of UTF-8, XML

Standards adopted, leading change
– Participating in ISO standards group Technical Committee 37 on terminology and language resources (developing standardized formats for foreign language lexical resources and data exchange)

Page 14: When do Unicode standards fail?

When Unicode standards are not standard!

3rd World languages less commonly taught in the United States

Hindi (many different script rendering implementations)

Mongolian (no standardized spelling, many newswire web sites employ non-standard fonts)
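
One narrow slice of this problem can be handled in software: the same Devanagari letter may arrive as different but canonically equivalent code point sequences, and Unicode normalization makes them comparable. The sketch below shows only that case, with a hypothetical `same_text` helper. It does not address non-standard fonts or unstandardized spelling, which need custom conversion tables or human review.

```python
import unicodedata

# Two encodings of the same visible Devanagari letter "qa".
precomposed = "\u0958"       # DEVANAGARI LETTER QA (single code point)
decomposed = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # raw comparison of code points
print(same_text(precomposed, decomposed))  # comparison after normalization
```

Normalizing both the stored headwords and the incoming query in the same way keeps a dictionary lookup from missing an entry just because two sources encoded the same letter differently.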

Page 15: Language Knowledge Services Team/Resources

John L. George, Program Manager, (301) 688-9133

Over 20 computer scientists/techs

Currently deploying Beta version

Learning from testing with earlier version instantiations at FBI and NSA

On JWICS now, SIPRnet/NIPRnet next

Page 16: Contact Information

John J. Kovarik, Senior Language Technology Authority

NSA Representative to LASER ACTD

National Security Agency
9800 Savage Road, Suite 6486 S2
Phone: (301) 688-7198

[email protected]