Improving Knowledge Discovery By Combining Text …aleks/sdm04w/ben-dov.pdfImproving Knowledge...

Improving Knowledge Discovery

By Combining Text-Mining (TDM)

And Link-Analysis Techniques

Presentation By Moty Ben-Dov

Improving Knowledge Discovery By Combining Text-Mining And Link-Analysis Techniques

Dr. Wendy Wu Middlesex University London

Dr. Ronen FeldmanBar Ilan University Israel

Dr. Paul A. Cairns University College London

The presentation framework

The reasons for combining text-mining and link-analysis

Two links extraction approaches

The experiments and the results

Discussion and Conclusions

The organization’s knowledge

The Knowledgewe havewe know and we use

The knowledge

we don’t haveand we know that

The knowledge we have but we don’t

know that and don’t use it

”Knowledge can be created by drawing inference from what is already known”

Ron Davies 1989

Record1

Record 2

B C A C ?

Arrowsmith project (Swanson & Smalheiser)

Epilepsy & Magnesium-Deficiency are related

Migraine & Epilepsy are related

Migraine Magnesium-Deficiency

What is Text-Mining (TDM)?

“TDM is the process of extracting interesting patterns from very large unstructured content database for the purposes of discovering knowledge.

TDM applies the same analytical functions used to do Data-Mining and also applies natural language (NL) and information retrieval (IR) techniques.”

Dörre, Gerstl and Seiffert 1999

What is Link Analysis?

“Link-analysis is the process of building up networks of interconnected objects in order to explore pattern and trends. Link-analysis is based on a branch of mathematics called "graph theory" ”.

Barry and Linoff, 1997

WEB SITES/HTML

NEWSFEEDS

INTERNALDOCUMENTS

OTHER“RAW” DATA

WEB SITES/HTML

NEWSFEEDS

INTERNALDOCUMENTS

OTHER“RAW” DATA

U n s t r u c t u r e d C o n t e n t

Text-Mining Process

Features

Link-Analysis process

New Knowledge Discovery

Improving Knowledge Discovery Process

Actual usage of combining text mining and link-analysis for discovering new anti-terror knowledge

The mission was to find a connection between “Bin Laden” and “John Paul II”.

We ran a Text-Mining tool over a document database and extracted all persons names in the documents (features).

We tried to find connection between the two by using just the features. No connection was found.

The next step was to used a Link-Analysis tools to find indirect connections.

For the Link-Analysis process we used the Co-occurrences links at the sentence level.

Documents supporting the connection between Osama bin Laden and Ramzi Yousef

Documents supporting the relationship between

Ramzi Yousef and the Pope

Co-occurrence links - Two features co-occur within a sentence ifthey both appear in the same sentence the co-occurrence links were created by a simple method of seeking the existence of the relevant features within the same sentence

Semantics links - The semantic links were created by using noun phrase ,verb identification ,linguistic and semantic constraints.

The experiments Framework

The target - to compare between the two links extraction methods Co-occurrence and Semantic.

The Data – 9133 web pages about “terror” from the CNN, CBS, BBC and Yahoo.

The Tools – we used the ClearForest® suite for data collection, Text mining (features extraction), link extraction and results visualization.

The experiment Framework (cont.)

We searched information about meetings between persons by using the links each approach had created.

We checked if we can find answers for the following questions:

Q1: How many meetings did Arial Sharon and Colin Powel have?

Q2: How many meetings did Yasser Arafat and Colin Powel have?

Q3: How many meetings did Yasser Arafat and Anthony Zinni have?For each question we counted the meetings each approach had found

and then calculate and compare the precision and recall of each approach.

The experiment

For the first question Q1 (How many meetings did Arial Sharon and Colin Powel have?) we did the following steps :.

2. The co-occurrence links –We chose the co-occurrence at the sentence level and we got 71 sentences (links). We found that only 20 links are actually about a meeting.

1. Preliminary stage – finding the actual number of documents that mention meetings between Ariel Sharon and Colin Powel. We did a query for all the documents in which Sharon and Powel appear and found 183. we read the documents and found 22 documents that mention a meeting between them.

The experiment (cont.)

The experiment (cont.)3. The Semantic links – The links were created by using noun phrase and

verb identification and linguistic and semantic constraints.

We used the ClearForest® tools SEDP (Semantic Extraction Discovery Process) for creating the semantic links.

The SEDP found Person_Meeting events (links).

The experiment (cont.)An example of Person_Meeting event (link):

“Powell met earlier with Israeli Prime Minister Ariel Sharon to discuss how Israel might end its military operation in Palestinian cities."

Person_Meeting event (link): <Person_Meeting><Person>Powell<\Person><Meeting> met <\Meeting> <Person> Ariel Sharon <\Person>

<\Person_Meeting>

The SEDP Process

The experiment (cont.)3. The Semantic links – The links were created by using noun phrase and

verb identification and linguistic and semantic constraints.

We used the ClearForest tools SEDP (Semantic Extraction Discovery Process) for creating the semantic links.

The SEDP found Person_Meeting events (links).

We found that 9 documents have the semantically Person_Meeting links. After reading the 9 documents, 8 of them were about meetings between Sharon and Powel.

The experiment (cont.)4. The Precision – was calculated as the number of correct links (i.e. links

that report an actual meeting) divided by the Total number of links found. (8/9)

5. The Recall - was calculated as the number of correct links that were founddivided by the total number of correct links on the on the whole database(founded at the Preliminary stage ). (8/22)

6. We did the same process for Q2 and Q3.

The resultsQ3Q2Q1

Semanticlinks

Co-occurrence links

Semanticlinks

Co-occurrence links

Semanticlinks

Co-occurrence links

58514820Correct links

1111151522 22

Total Correctlinks

59694971Total links

100%88.89%83.33%14.89%88.89%28.17%Precision

45.45%72.73%33.33%93.33%36.36%90.91%Recall

The Anaphora problem

"He will be holding his first talks with Israeli Prime Minister Ariel Sharon and new Palestinian Prime Minister Mahmoud Abbas, known informally as Abu Mazen, since the road map was published."

The anaphora "He" refers to Colin Powel but the co-occurrence process couldn’t find it

If we need very focused information then the best results will be obtained by using the semantic links.

When we look for greater coverage of information we should used the co-occurrence links.

The Knowledge Discovery Process

preprocessing Link Extraction Detailed Analysis

Time consumingquickCo-occurrence links

quickTime consuming Semantic links

Future plans

Co-occurrenceLinks

Statistical (HMM)Links

SemanticLinks

Hybrid links

Links Fusion

Thank You

Questions ?

Improving Knowledge Discovery By Combining Text …aleks/sdm04w/ben-dov.pdfImproving Knowledge...

Documents

Transcript of Improving Knowledge Discovery By Combining Text …aleks/sdm04w/ben-dov.pdfImproving Knowledge...

Combining text mining and coordinate-based meta-analysis...Text mining and CBMA Combining text and coordinate directly Construct a functional parcellation of the brain based on combined

Combining QSAR Modeling and Text-Mining Techniques to Link ...

Paper SIB-097 Combining Text and Graphics with ODS …analytics.ncsu.edu/sesug/2008/SIB-097.pdf · Combining Text and Graphics with ODS LAYOUT and ... The example illustrates use

Combining heterogeneous text-technological resources for ... · Text Technological Modelling of Information CoGETI Workshop, 24.11.2006 Combining heterogeneous text-technological

Combining Metabolite-Based Pharmacophores with Bayesian Machine Learning Models for Mycobacterium tuberculosis Drug Discovery

1 Contents Introduction Knowledge discovery from text & links Knowledge discovery from usage data Important open issues.

Semantically-Guided Clustering of Text Documents via Frequent Subgraphs Discovery · 2020. 9. 27. · Semantically-Guided Clustering of Text Documents via Frequent Subgraphs Discovery

Combining Text Mining and Microarray Analysis · Combining Text Mining and Microarray Analysis ... Textual information Public Biomedical Databases Text Mining Disease Specific Genes

Combining 2 Powerful Technologies to Enable Further Discovery in Bacterial Studies.

Combining imaging and pathway profiling: an alternative approach to cancer drug discoverycsmres.co.uk/.../Carragher_2012_Drug-Discovery-Today.pdf · 2013. 8. 19. · Drug Discovery

Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News

On Combining Multiple Segmentations in Scene Text Recognition

Knowledge Discovery of Service Satisfaction Based on Text ... · Knowledge Discovery of Service Satisfaction Based ... classified as data mining and text ... of information retrieval

Paper SIB-097 Combining Text and Graphics with ODS LAYOUT and ODS

COMBINING PHENOTYPE AND GENOTYPE FOR DISCOVERY AND ...

Discovery of multiple hidden allosteric sites by combining Markov ...

Text Classification Combining Clustering and Hierar chical Appr …€¦ · 2.1 Text Classification Text classification is the process of matching a document with the best possible

Text Mining for Creative Cross-Domain Knowledge Discovery

Combining text and visual features for biomedical information ......Combining Text and Visual Features for Biomedical Information Retrieval Dina Demner-Fushman, M.D., Ph.D. Sameer

Text Me, Maybe? Discovery of Electronic Communications Under …_maybe_discovery_of_e.pdf · 2019. 6. 6. · TEXT ME, MAYBE? DISCOVERY OF ELECTRONIC COMMUNICATIONS UNDER THE STORED