Post on 24-May-2018
Improving Knowledge Discovery
By Combining Text-Mining (TDM)
And Link-Analysis Techniques
Presentation By Moty Ben-Dov
Improving Knowledge Discovery By Combining Text-Mining And Link-Analysis Techniques
Dr. Wendy Wu Middlesex University London
Dr. Ronen FeldmanBar Ilan University Israel
Dr. Paul A. Cairns University College London
The presentation framework
The reasons for combining text-mining and link-analysis
Two links extraction approaches
The experiments and the results
Discussion and Conclusions
The presentation framework
The reasons for combining text-mining and link-analysis
Two links extraction approaches
The experiments and the results
Discussion and Conclusions
The reasons for combining text-mining and link-analysis
The organization’s knowledge
The Knowledgewe havewe know and we use
The knowledge
we don’t haveand we know that
The knowledge we have but we don’t
know that and don’t use it
The reasons for combining text-mining and link-analysis
”Knowledge can be created by drawing inference from what is already known”
Ron Davies 1989
Record1
A B
Record 2
B C A C ?
The reasons for combining text-mining and link-analysis
Arrowsmith project (Swanson & Smalheiser)
Epilepsy & Magnesium-Deficiency are related
Migraine & Epilepsy are related
Migraine Magnesium-Deficiency
The reasons for combining text-mining and link-analysis
What is Text-Mining (TDM)?
“TDM is the process of extracting interesting patterns from very large unstructured content database for the purposes of discovering knowledge.
TDM applies the same analytical functions used to do Data-Mining and also applies natural language (NL) and information retrieval (IR) techniques.”
Dörre, Gerstl and Seiffert 1999
The reasons for combining text-mining and link-analysis
What is Link Analysis?
“Link-analysis is the process of building up networks of interconnected objects in order to explore pattern and trends. Link-analysis is based on a branch of mathematics called "graph theory" ”.
Barry and Linoff, 1997
The reasons for combining text-mining and link-analysis
WEB SITES/HTML
NEWSFEEDS
INTERNALDOCUMENTS
OTHER“RAW” DATA
WEB SITES/HTML
NEWSFEEDS
INTERNALDOCUMENTS
OTHER“RAW” DATA
U n s t r u c t u r e d C o n t e n t
Text-Mining Process
Features
Link-Analysis process
New Knowledge Discovery
Improving Knowledge Discovery Process
The reasons for combining text-mining and link-analysis
Actual usage of combining text mining and link-analysis for discovering new anti-terror knowledge
The mission was to find a connection between “Bin Laden” and “John Paul II”.
We ran a Text-Mining tool over a document database and extracted all persons names in the documents (features).
We tried to find connection between the two by using just the features. No connection was found.
The next step was to used a Link-Analysis tools to find indirect connections.
For the Link-Analysis process we used the Co-occurrences links at the sentence level.
The reasons for combining text-mining and link-analysis
The reasons for combining text-mining and link-analysis
Documents supporting the connection between Osama bin Laden and Ramzi Yousef
The reasons for combining text-mining and link-analysis
The reasons for combining text-mining and link-analysis
Documents supporting the relationship between
Ramzi Yousef and the Pope
The reasons for combining text-mining and link-analysis
The presentation framework
The reasons for combining text-mining and link-analysis
Two links extraction approaches
The experiments and the results
Discussion and Conclusions
Two links extraction approaches
Co-occurrence links - Two features co-occur within a sentence ifthey both appear in the same sentence the co-occurrence links were created by a simple method of seeking the existence of the relevant features within the same sentence
Semantics links - The semantic links were created by using noun phrase ,verb identification ,linguistic and semantic constraints.
The presentation framework
The reasons for combining text-mining and link-analysis
Two links extraction approaches
The experiments and the results
Discussion and Conclusions
The experiments and the results
The experiments Framework
The target - to compare between the two links extraction methods Co-occurrence and Semantic.
The Data – 9133 web pages about “terror” from the CNN, CBS, BBC and Yahoo.
The Tools – we used the ClearForest® suite for data collection, Text mining (features extraction), link extraction and results visualization.
The experiments and the results
The experiment Framework (cont.)
We searched information about meetings between persons by using the links each approach had created.
We checked if we can find answers for the following questions:
Q1: How many meetings did Arial Sharon and Colin Powel have?
Q2: How many meetings did Yasser Arafat and Colin Powel have?
Q3: How many meetings did Yasser Arafat and Anthony Zinni have?For each question we counted the meetings each approach had found
and then calculate and compare the precision and recall of each approach.
The experiments and the results
The experiment
For the first question Q1 (How many meetings did Arial Sharon and Colin Powel have?) we did the following steps :.
2. The co-occurrence links –We chose the co-occurrence at the sentence level and we got 71 sentences (links). We found that only 20 links are actually about a meeting.
1. Preliminary stage – finding the actual number of documents that mention meetings between Ariel Sharon and Colin Powel. We did a query for all the documents in which Sharon and Powel appear and found 183. we read the documents and found 22 documents that mention a meeting between them.
The experiments and the results
The experiment (cont.)
The experiments and the results
The experiment (cont.)
The experiments and the results
The experiment (cont.)3. The Semantic links – The links were created by using noun phrase and
verb identification and linguistic and semantic constraints.
We used the ClearForest® tools SEDP (Semantic Extraction Discovery Process) for creating the semantic links.
The SEDP found Person_Meeting events (links).
The experiments and the results
The experiment (cont.)An example of Person_Meeting event (link):
“Powell met earlier with Israeli Prime Minister Ariel Sharon to discuss how Israel might end its military operation in Palestinian cities."
Person_Meeting event (link): <Person_Meeting><Person>Powell<\Person><Meeting> met <\Meeting> <Person> Ariel Sharon <\Person>
<\Person_Meeting>
The SEDP Process
The experiments and the results
The experiment (cont.)3. The Semantic links – The links were created by using noun phrase and
verb identification and linguistic and semantic constraints.
We used the ClearForest tools SEDP (Semantic Extraction Discovery Process) for creating the semantic links.
The SEDP found Person_Meeting events (links).
We found that 9 documents have the semantically Person_Meeting links. After reading the 9 documents, 8 of them were about meetings between Sharon and Powel.
The experiments and the results
The experiment (cont.)4. The Precision – was calculated as the number of correct links (i.e. links
that report an actual meeting) divided by the Total number of links found. (8/9)
5. The Recall - was calculated as the number of correct links that were founddivided by the total number of correct links on the on the whole database(founded at the Preliminary stage ). (8/22)
6. We did the same process for Q2 and Q3.
The experiments and the results
The resultsQ3Q2Q1
Semanticlinks
Co-occurrence links
Semanticlinks
Co-occurrence links
Semanticlinks
Co-occurrence links
58514820Correct links
1111151522 22
Total Correctlinks
59694971Total links
100%88.89%83.33%14.89%88.89%28.17%Precision
45.45%72.73%33.33%93.33%36.36%90.91%Recall
The experiments and the results
The Anaphora problem
"He will be holding his first talks with Israeli Prime Minister Ariel Sharon and new Palestinian Prime Minister Mahmoud Abbas, known informally as Abu Mazen, since the road map was published."
The anaphora "He" refers to Colin Powel but the co-occurrence process couldn’t find it
The presentation framework
The reasons for combining text-mining and link-analysis
Two links extraction approaches
The experiments and the results
Discussion and Conclusions
Discussion and Conclusions
If we need very focused information then the best results will be obtained by using the semantic links.
When we look for greater coverage of information we should used the co-occurrence links.
Discussion and Conclusions
The Knowledge Discovery Process
preprocessing Link Extraction Detailed Analysis
Time consumingquickCo-occurrence links
quickTime consuming Semantic links
Discussion and Conclusions
Future plans
Co-occurrenceLinks
Statistical (HMM)Links
SemanticLinks
Hybrid links
Links Fusion
Thank You
Questions ?