Text Mining and Knowledge Management
Junichi Tsujii
GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
Computer Science, University of Tokyo
Increase in Medline
[Figure: number of Medline records by year, 1964 to 2002. One vertical axis shows yearly increments (0 to 600,000); the other shows accumulation (0 to 14,000,000).]
TEXT MINING for Bio-Medicine in Japan
1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Association
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation
Resource Building for TM in BM: GENIA Project (1998 - )
    GENIA Corpus (Annotated Text)
Information Exploitation System: Kototoi Project (2000 - )
    Adaptable POS Tagger (Bio-Tagger), NER adapted for BM
    Parser based on HPSG (Enju), ML for Text Processing
TEXT Mining = DATA Mining + BOW ?
BOW : “Bag of Words” Model
The model does not work because
(1) Language is a complex system
(2) Language is inherently associated with knowledge
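The limitation of the bag-of-words model can be illustrated with a minimal sketch. The tokenizer and the two sentences below are assumptions made for illustration (the second sentence is a deliberately reversed, hypothetical claim); the point is only that two statements with opposite meanings can share exactly the same bag.

```python
from collections import Counter
import re

def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence and count its word tokens, discarding word order."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

# Two illustrative sentences with opposite directions of activation.
s1 = "PHO2 activates PHO5"
s2 = "PHO5 activates PHO2"  # hypothetical reversed statement

# The bags are identical even though the sentences say different things.
same_bag = bag_of_words(s1) == bag_of_words(s2)
print(same_bag)
```

Because the model keeps only token counts, who activates whom is lost, which is one reason structure-aware NLP is needed on top of data mining.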
Mining + NLP + Knowledge Management
TM products on the market with fancy visualization facilities and trend analysis tools
Ontology-based KMS
Natural Language Processing
Information Exploitation
A huge amount of raw data
Unstructured Information (Text)
Semi-structured Information (XML+Text)
Structured Information (Databases)
Effective management of knowledge and information is the key
Overview of GENIA System

Information Extraction Module
• Identify & classify terms
• Identify events

Retrieval Module
• Request enhancement
• Spawn request
• Classify documents

Interface Module
• GUI
• HTML conversion
• System integration

Concept Module

Corpus Module
• Markup generation / compilation
• Annotated corpus construction

Database Module
• DB design / access / management
• DB construction
• BK design / construction / compilation

User
• IR Request
• Abstract
• Full Paper

[Diagram labels: Raw (OCR) Text, Structure, Annotated Corpus, Document, Named-Entity, Event, Database, Ontology, Markup language, Data model, Background Knowledge, MEDLINE, Security]
Non-Trivial Mappings
Language Domain: linguistic expressions
Knowledge Domain: concepts and relationships among them, motivated independently of language
1. Size of Knowledge
2. Context Dependency
3. Evolving Nature of Science
4. Hypothetical Nature of Ontology
5. Inconsistency
Terminology, NLP, Paraphrasing
Terms and Concepts
Term: address
Concepts: address-as-a-speech, address-as-a-mail-address, address-as-a-street-address
A term is introduced, without explicit understanding of what it means, in order for one to make statements about it.
"The Semantic Web", Tim Berners-Lee et al., Scientific American (2001)
Language Domain Concept Domain
A cluster of realizations of terms
1.000  NF kappa B                                128
0.500  Transcription Factor NF kappa B             0
0.429  NF-kappa B                                912
0.286  NF kB, Transcription Factor                 0
0.286  NF kB                                       0
0.286  Immunoglobulin Enhancer-Binding Protein     0
0.286  Immunoglobulin Enhancer Binding Protein     0
0.286  Enhancer-Binding Protein, Immunoglobulin    0
0.286  kappa B Enhancer Binding Protein            0
0.286  Transcription Factor NF-kB                  0
0.286  Transcription Factor NF kB                  0
0.286  Factor NF-kB, Transcription                 0
0.286  nuclear factor kappa beta                   2
0.286  NF kappaB                                   1
0.273  NF kappa B chain                            0
0.273  NF kappa B subunit                          0
0.214  Transcription Factor NF-kappa B             0
0.214  NF-kB, Transcription Factor                 0
0.214  NF-kB                                      67
0.200  Neurofibromatosis Type kappa B              0
Automatically Generated Variants
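As a rough illustration of how surface realizations of one term can be grouped, the sketch below clusters spelling variants by a normalized key. This is not the variant generator used to produce the list above; the `normalize` rule (lowercase, remove hyphens and whitespace) is an assumption chosen for illustration, and it handles only spelling variation, not abbreviations or synonyms.

```python
import re
from collections import defaultdict

def normalize(term: str) -> str:
    """Collapse case, hyphens, and whitespace so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", term.lower())

def cluster_variants(terms):
    """Group surface forms by their normalized key, preserving input order."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalize(term)].append(term)
    return dict(clusters)

# A few surface forms taken from the variant list.
variants = ["NF kappa B", "NF-kappa B", "NF kappaB", "NF-kB", "NF kB"]
clusters = cluster_variants(variants)
for key, members in clusters.items():
    print(key, members)
```

The spelled-out forms and the abbreviated forms end up in two separate clusters; linking those, or linking synonyms such as "Immunoglobulin Enhancer-Binding Protein", would need a terminology resource rather than string normalization.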
Non-trivial Mapping
Language Domain / Knowledge Domain
Motivated independently of language
Spelling variants, synonyms, acronyms
Same relations with different structures
Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
[A] protein activates [B] (Pathway extraction)
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
[sentence] > ([arg1_activate] > [protein])
Retrieval using Region Algebra
Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
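Example: The protein is activated by it
POS tags: DT NN VBZ VBN IN PRP
[Figure: HPSG parse tree of the example sentence (lexical categories dt, np, vp, vp, pp, np; phrasal nodes np, pp, vp, vp, s) with predicate-argument links labeled arg1, arg2, and mod.]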
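The query [sentence] > ([arg1_activate] > [protein]) can be read as "sentences that contain an arg1-of-activate region that in turn contains a protein region". The following is a minimal sketch of that containment operator, assuming regions are (start, end) character offsets with an exclusive end. The function name `containing` and all offsets are hypothetical; this is not the retrieval engine's actual implementation.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive

def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Return the regions of `outer` that contain at least one region of `inner`."""
    return [
        (o_start, o_end)
        for (o_start, o_end) in outer
        if any(o_start <= i_start and i_end <= o_end for (i_start, i_end) in inner)
    ]

# Hypothetical annotations over a two-sentence text.
sentences = [(0, 40), (41, 90)]
arg1_activate = [(0, 17)]        # span filling arg1 of an "activate" predicate
proteins = [(4, 17), (60, 70)]   # spans tagged as protein

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)  # [(0, 40)]
```

The second sentence mentions a protein but not as the first argument of "activate", so it is not returned. Because the query is stated over predicate-argument regions rather than word order, active and passive realizations of the same relation can match the same pattern.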
and in its absence, deficient 60 S ribosomes are assembled which are inactive in protein synthesis resulting in cell lethality.
Mutations that completely abolish recognition of 26 S rRNA, however, block the formation of 60S particles, demonstrating that binding of L25 to this rRNA is an essential step in the assembly of the large ribosomal subunit.
Depletion of Saccharomyces cerevisiae ribosomal protein L16 causes decrease in 60S ribosomal subunits and formation of half-mer polyribosomes.
Without L3, apparent synthesis of several 60 S subunit proteins diminished, and 60S subunit did not assemble. A similar phenomenon occurred when, in a second strain, synthesis of ribosomal protein L29 was prevented.
Term: Ribosomal large subunit assembly and maintenance
Language Domain Concept Domain
Process of Ribosomal subunit assembly
A cluster of realizations of terms
Information and Knowledge Exploitation System
as an integrated management system of raw data, semi-structured data, text, and structured databases
+ Mining Tools (Task-Specific Software)
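Text Archive with Feature Objects
Managing texts, data representation and their semantics
[Figure: a feature object bound to a region of text. Its extent records the Text ID (textid: wsj 02), the start position of the region (startp: 10), and the end position of the region (endp: 30); dc records the annotator (creator: ninomi); content holds the annotation itself (labels shown: Event, binded, agent, Pr:). A second content example reads 核開発問題 ("nuclear development issue"). Texts are kept in the Text DB and annotations in the DB of Feature Objects.]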
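A feature object of this kind is essentially a stand-off annotation: it points into a stored text rather than modifying it. The sketch below models that idea with Python dataclasses. The attribute names `textid`, `startp`, `endp`, and `creator` follow the figure labels; the class names, the character-offset convention (end exclusive), and the sample text are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(frozen=True)
class Extent:
    textid: str   # ID of the text in the text database
    startp: int   # start position of the region (character offset, assumed)
    endp: int     # end position of the region (exclusive, assumed)

@dataclass
class FeatureObject:
    extent: Extent
    creator: str                                            # annotator
    content: Dict[str, Any] = field(default_factory=dict)   # the annotation itself

    def covered_text(self, text_db: Dict[str, str]) -> str:
        """Look up the region of text this object is bound to."""
        text = text_db[self.extent.textid]
        return text[self.extent.startp:self.extent.endp]

# Hypothetical text database and one annotation on it.
text_db = {"wsj02": "Ubiquitin is bound with the target protein."}
obj = FeatureObject(Extent("wsj02", 0, 9), creator="ninomi",
                    content={"type": "protein"})
print(obj.covered_text(text_db))  # Ubiquitin
```

Keeping the text and the feature objects in separate stores lets several annotators, or several tools, attach independent layers of information to the same region.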
Data Base Module
Copy and Unification
Specialization by unification
[Figure: feature objects before and after unification. Labels in the source objects: content Ubiquitin; Event, binded, agent, content Pr. Labels in the unified object: extent (textid: wsj 02, startp: 10, endp: 30), dc (creator: ninomi), content (event, type: bind, agent: ubiquitin, interactin protein:). The three layers shown are Text ("Ubiquitin E is bound with"), Data representation, and Semantics.]
Adding more augmented information induced by inference, type restriction, unification
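Specialization by unification can be sketched with nested dictionaries standing in for feature structures. This is a simplified illustration, not the system's actual mechanism: atomic values unify only when equal (a real typed feature structure system would consult a type hierarchy), inputs are copied rather than modified, and the names `unify` and `UnificationFailure` are hypothetical.

```python
import copy

class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""

def unify(a, b):
    """Return a new structure containing the information of both `a` and `b`."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = copy.deepcopy(a)          # copy, so the inputs stay untouched
        for key, value in b.items():
            if key in result:
                result[key] = unify(result[key], value)
            else:
                result[key] = copy.deepcopy(value)
        return result
    if a == b:                             # atomic values must agree exactly
        return copy.deepcopy(a)
    raise UnificationFailure(f"cannot unify {a!r} with {b!r}")

# A general event object and more specific information about the same region.
general = {
    "extent": {"textid": "wsj02", "startp": 10, "endp": 30},
    "content": {"type": "bind"},
}
specific = {"content": {"type": "bind", "agent": "ubiquitin"}}

specialized = unify(general, specific)
print(specialized["content"])
```

The result keeps the extent of the general object and gains the agent from the specific one, while a conflicting value (for example, two different `type` values) raises `UnificationFailure` instead of silently overwriting information.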