John J. Kovarik, NSA/CSS Senior Language Technology Authority
Building the Federal Multilingual Infrastructure in Unicode
Foreign Language Dictionary Tools

Project Goals
Unite federal foreign language analysts in communities of interest by language to increase the speed and accuracy of multilingual work
Outgrowth of NSA legacy individual foreign language dictionary tools
Share Next Generation tool suite across the federal government in 90 languages

Foreign Language Work, 1970s
Manual tools
– Hardcopy dictionaries (2-10 per person)
– 3x5 card files for specialized vocabulary
– Pen and paper only
Work environment
– Career analysts, revered as subject matter experts, rule the workplace.
– College graduates hired right out of school, some with military experience, enter the job.

Foreign Language Challenge I
The classic sparse data problem
Never enough vocabulary
Never enough grammar training
Never enough cultural knowledge

Foreign Language Challenge II
Why it’s a sparse data problem
Communication is usually spontaneous, between two or more people who share a great deal of special knowledge
Ultimate goals often not explicit
Ambiguity reigns for outsiders
No simple rules for filling in the blanks

An example
女人 去 打敲 竹鋼的 密醫 來 解決 她的 問題 。
All glossed (4 min/chr, 17 chrs), meaning obscure: “Female people go hit knock bamboo curtain’s secret doctor come untie decide her ask issue.”
All phrases verified (longest string match: 9), clearer: “A woman goes and knocks on the bamboo curtain’s secret doctor to come resolve her problem.” ...but still uncertain
Check for neologism—go to FBIS recent translations, look to clarify meaning of new term “knock bamboo curtain”.
“Knock on the bamboo curtain for a secret doctor” = “seek out an illegal quack”
“A woman (must) go seek out an illegal quack to resolve her problem.”
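
The jump from the character-by-character gloss to the phrase-level reading can be sketched as a greedy longest-match lookup against a dictionary. This is only an illustration, not the tool's actual algorithm: the function name, the tiny lexicon, and the pairing of each English gloss with a character are assumptions made for the example, and the default `max_len` of 9 merely echoes the figure on the slide, whose exact meaning the slide does not spell out.

```python
def longest_match_segment(text, lexicon, max_len=9):
    """Greedy forward segmentation: at each position, take the longest
    substring (up to max_len characters) that is a key in the lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                tokens.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            # No entry covers this character: keep it, unglossed.
            tokens.append((text[i], None))
            i += 1
    return tokens


# Hypothetical mini-lexicon, for illustration only.
LEXICON = {
    "女": "female", "人": "people", "女人": "woman",
    "解": "untie", "決": "decide", "解決": "resolve",
    "問": "ask", "題": "issue", "問題": "problem",
}

if __name__ == "__main__":
    # Character-by-character gloss (max_len=1) versus phrase-level match.
    print(longest_match_segment("女人解決問題", LEXICON, max_len=1))
    print(longest_match_segment("女人解決問題", LEXICON))
```

With `max_len=1` the lookup reproduces the obscure word-for-word gloss; with longer matches allowed, multi-character entries win and the reading becomes clearer. A neologism that is missing from the lexicon still falls back to its parts, which is why the newswire check remains necessary.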

People say, “What’s the big deal with just an on-line dictionary?”
“I never/seldom use a dictionary!”
– Native speaker syndrome
– Vast majority of people must use a dictionary in a second/third language
“Hardcopy dictionaries are better.”
– Can’t do wild-card searches by hand
– Not engineered for 10 sec. avg. response
– Humans tire; machines do not.

1991: First Generation Dictionary DB Tool
200,000 entries from 3x5 cards collected over 20 years
Wild card searchable
Cross referenced 4 ways in accordance with user requirements
Displayed in native script
Can cut and paste queries/responses
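
A wild-card search over headwords might look like the sketch below. It uses Python's standard `fnmatch` patterns (`*` for any run of characters, `?` for exactly one) as a stand-in; the slide does not describe the tool's query syntax, and the entry list and function name here are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(entries, pattern):
    """Return (headword, gloss) pairs whose headword matches the pattern."""
    return [(hw, gloss) for hw, gloss in entries if fnmatchcase(hw, pattern)]


# Hypothetical sample entries, for illustration only.
SAMPLE_ENTRIES = [
    ("問題", "problem"),
    ("難題", "difficult problem"),
    ("解決", "resolve"),
]

if __name__ == "__main__":
    print(wildcard_search(SAMPLE_ENTRIES, "*題"))  # headwords ending in 題
    print(wildcard_search(SAMPLE_ENTRIES, "解?"))  # 解 plus one character
```

This linear scan checks every entry on each query, which is the kind of work that cannot be done by hand against a paper dictionary.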

Reactions to 1st Generation Tool
Younger analysts used it; liked it; made great suggestions to improve it
Senior analysts usually would not use it

1995: 2nd Generation Dictionary DB Tool
Responses faster on queries with leading wild card
GUI customized per user input
Candidate entry system established
Usership doubled!
Senior analysts start to use it
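
The slide does not say how leading-wild-card queries were made faster. One common technique, shown here purely as an assumption, is to keep a second sorted index of reversed headwords so that a query like `*題` becomes a prefix lookup on the reversed key. All names below are hypothetical.

```python
from bisect import bisect_left


def build_reversed_index(headwords):
    """Sorted list of reversed headwords; a suffix query becomes a prefix scan."""
    return sorted(word[::-1] for word in headwords)


def suffix_search(reversed_index, suffix):
    """Find all headwords ending in `suffix` using binary search on the index."""
    key = suffix[::-1]
    start = bisect_left(reversed_index, key)
    hits = []
    for reversed_word in reversed_index[start:]:
        if not reversed_word.startswith(key):
            break  # sorted order: no later item can share the prefix
        hits.append(reversed_word[::-1])
    return hits


if __name__ == "__main__":
    index = build_reversed_index(["問題", "難題", "解決", "話題"])
    print(suffix_search(index, "題"))
```

The binary search skips straight to the first candidate instead of scanning every entry, at the cost of storing one extra index.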

1998: 3rd Generation Dictionary DB Tool
Database re-encoded in UTF8
Simultaneous simplified and traditional Chinese display enabled
Average 1,000-3,000 candidate entries approved annually ’98-’02
Usership again doubled!
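
Re-encoding a record to UTF-8 amounts to decoding the legacy bytes into Unicode text and encoding that text again. The sketch below assumes Big5 as the legacy encoding only for the sake of the example; the slide does not name the encoding the database used before 1998, and the function name is hypothetical.

```python
def reencode_to_utf8(raw: bytes, legacy_encoding: str = "big5") -> bytes:
    """Decode bytes from an assumed legacy encoding and re-encode as UTF-8.

    Raises UnicodeDecodeError if the input is not valid in that encoding,
    which is preferable to silently storing corrupted text.
    """
    text = raw.decode(legacy_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy_bytes = "解決問題".encode("big5")  # stand-in for a stored record
    utf8_bytes = reencode_to_utf8(legacy_bytes)
    print(len(legacy_bytes), len(utf8_bytes), utf8_bytes.decode("utf-8"))
```

Once every record is Unicode text, traditional and simplified forms can live in the same database, since both character sets are covered by the one encoding.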

Today: Wordscape
The Next Generation Dictionary Tool
Retains all Chinese capabilities
Expands to all language collections
Neologism newswire research tools
Over 90 languages represented in one Unicode DB unified under one XML schema and one suite of tools
Under LASER ACTD funding, extending all across the federal government!
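
To illustrate what "one XML schema" for many languages could mean in practice, the sketch below stores entries from two languages in the same element structure and serializes them as UTF-8. The element and attribute names (`lexicon`, `entry`, `lang`, `headword`, `gloss`) are invented for this example and are not the Wordscape schema, which the slides do not show.

```python
import xml.etree.ElementTree as ET


def make_entry(lang, headword, gloss):
    """Build one dictionary entry using a hypothetical, language-neutral layout."""
    entry = ET.Element("entry", {"lang": lang})
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "gloss").text = gloss
    return entry


def to_utf8_xml(entries):
    """Wrap entries in a single root element and serialize to UTF-8 bytes."""
    root = ET.Element("lexicon")
    root.extend(entries)
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    data = to_utf8_xml([
        make_entry("zh", "問題", "problem"),
        make_entry("hi", "समस्या", "problem"),
    ])
    for entry in ET.fromstring(data):
        print(entry.get("lang"), entry.find("headword").text, entry.find("gloss").text)
```

Because every language shares the same structure and encoding, one set of search and display tools can serve all collections.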

Technology and Standards
New technology being used
– Benefits of scale from use of UTF8, XML
Standards adopted: leading change
– Participating in ISO standards group Technical Committee 37 on terminology and language resources (developing standardized formats for foreign language lexical resources and data exchange)

When do Unicode standards fail?
When Unicode standards are not standard!
3rd World languages less commonly taught in the United States
Hindi (many different script rendering implementations)
Mongolian (no standardized spelling, many newswire web sites employ non-standard fonts)

Language Knowledge Services Team/Resources
John L. George, Program Manager, (301) 688-9133
Over 20 computer scientists/techs
Currently deploying Beta version
Learning from testing with earlier version instantiations at FBI and NSA
On JWICS now, SIPRnet/NIPRnet next

Contact Information
John J. Kovarik, Senior Language Technology Authority
NSA Representative to LASER ACTD
National Security Agency
9800 Savage Road, Suite 6486 S2
Phone: (301) 688-7198