What is Text Mining? - Welcome to the NCRM EPrints Repository

What is Text Mining? Sophia Ananiadou National Centre for Text Mining www.nactem.ac.uk University of Manchester

Transcript of What is Text Mining? - Welcome to the NCRM EPrints Repository

Page 1: What is Text Mining? - Welcome to the NCRM EPrints Repository

What is Text Mining?

Sophia Ananiadou

National Centre for Text Mining

www.nactem.ac.uk

University of Manchester

Page 2: What is Text Mining? - Welcome to the NCRM EPrints Repository

Outline

Aims of text mining

Text Mining steps

Text Mining uses

Applications

Page 3: What is Text Mining? - Welcome to the NCRM EPrints Repository

Aims

Extract and discover knowledge hidden in text automatically

Aid domain experts by automatically:

identifying concepts

extracting facts/relations

discovering implicit links

generating hypotheses

Page 4: What is Text Mining? - Welcome to the NCRM EPrints Repository

Why do we need text mining? Too much information!

Information overload

Growth of searches

[Figure: Medline searches over time. x-axis: month/year (Jan-97 to Oct-05); y-axis: searches (millions), 0 to 90]

Information overlook

Many heterogeneous resources

Text (blogs, papers, surveys, news..)

Databases

Web

Ontologies

Page 5: What is Text Mining? - Welcome to the NCRM EPrints Repository

The role of text in information management

We are inundated with huge amounts of data

Unstructured information (text)

Different text types, genres, domains..

Semi-structured information (XML + text)

Structured information (databases)

We need to make sense of data

We need to manage information and knowledge effectively

Page 6: What is Text Mining? - Welcome to the NCRM EPrints Repository

UK National Centre for Text Mining (NaCTeM)

The 1st national text mining centre in the world

www.nactem.ac.uk

Remit: Provision of text mining services to support UK research

Funded by: the JISC, BBSRC, EPSRC

Phase I (2005-2008): biology

Phase II (2008-2011): bio-medicine, social sciences, humanities

Page 7: What is Text Mining? - Welcome to the NCRM EPrints Repository

Why is there a need in the UK for a national centre for text mining?

Some researchers knew they wanted TM

TM key component of e-Science

Involve more researchers (from all domains) in doing e-science and e-research

TM seen as key technology for researchers

And one applicable in every domain (broad interest/support from major funding bodies)

Page 8: What is Text Mining? - Welcome to the NCRM EPrints Repository

Embedding Text Mining within e-Science in the UK

e-Science

[…] enables new research and increases productivity through shared e-Infrastructure, the development of computational and logical models and new ways to discover and use the growing range of distributed and interoperable resources. It supports multidisciplinary and collaborative working and a culture that adopts the emerging methods.

M. Atkinson (2007), Beyond e-Science

Page 9: What is Text Mining? - Welcome to the NCRM EPrints Repository

What the users want to do with their data (minimally)

Easier access to data

Sharing data with their peers

Annotating data with metadata

Managing data across locations

Integrating data within workflows, Web Services

Aids for semantic metadata creation; enriching data with related metadata, e.g. experimental results

TEXT MINING RIDES TO THE RESCUE

Page 10: What is Text Mining? - Welcome to the NCRM EPrints Repository

From Text to Knowledge: tackling the data deluge through text mining

[Diagram labels: Unstructured text (implicit knowledge); Structured content (explicit knowledge); Information Retrieval; Advanced Information Retrieval; Information extraction; Semantic metadata; Knowledge Discovery]

Page 11: What is Text Mining? - Welcome to the NCRM EPrints Repository

Text Mining Steps (1/3)

Information Retrieval (Google)

Finding, within a large document collection, a subset of documents relevant to a user’s query

Query term “blood” or Boolean query “blood pressure”

Too many documents are retrieved, prohibitively large for humans to read

Many retrieved documents are irrelevant to our query

Many relevant documents are missing
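The last two problems correspond to what is usually called low precision (irrelevant documents retrieved) and low recall (relevant documents missed). The following is a minimal sketch, assuming a toy four-document collection and hand-assigned relevance judgements, both invented here for illustration, of a Boolean AND query over an inverted index. The function names are hypothetical.

```python
from collections import defaultdict

# Toy collection; texts and relevance judgements are illustrative assumptions.
DOCS = {
    1: "high blood pressure increases the risk of stroke",
    2: "blood samples were stored under pressure in sealed tubes",
    3: "hypertension is a major risk factor for stroke",
    4: "the survey measured opinions about local news",
}
RELEVANT = {1, 3}  # documents a human judged relevant to "blood pressure"


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, query):
    """Return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


index = build_index(DOCS)
retrieved = boolean_and(index, "blood pressure")
precision, recall = precision_recall(retrieved, RELEVANT)
print(sorted(retrieved), precision, recall)
```

In this toy run, document 2 matches both words but is off-topic, while document 3 is on-topic but uses "hypertension" instead, so it is never retrieved. Keyword matching alone cannot fix either case.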

Page 12: What is Text Mining? - Welcome to the NCRM EPrints Repository

Text Mining Steps (2/3)

Information Extraction, nuggets of text

Identify information nuggets from text, fill existing templates, create structured information, populate text databases

Slot          Information
Date          7/10/96 (today)
Location      San Salvador
Victim        injured policeman
Victim        attacked guards
Perpetrator   urban guerrillas

San Salvador, 7/10/96

It has been officially reported that a policeman was wounded today when urban guerrillas attacked the guards at a power substation located downtown San Salvador.

Page 13: What is Text Mining? - Welcome to the NCRM EPrints Repository

Information Extraction

recognise specific relations and events (typically expressed by verbs – attacked)

domain restricted (newswire, biology…)

more complex NLP techniques

more than ‘bag of words’ approach

deep parsing

Output: filled templates of entities and facts

IE extracts only what we are looking for, i.e. what has been defined by patterns
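To make the last point concrete, here is a minimal sketch of pattern-based template filling applied to the San Salvador example. The regular expressions, slot names and the `fill_template` helper are assumptions written for this illustration. They are far simpler than the deep parsing mentioned above and do not represent any particular IE system.

```python
import re

# Hand-written patterns: illustrative assumptions, one per template slot.
PATTERNS = {
    "Victim (injured)": re.compile(r"\ban? (?P<value>\w+) was wounded"),
    "Perpetrator": re.compile(r"when (?P<value>[\w ]+?) attacked"),
    "Victim (attacked)": re.compile(r"attacked the (?P<value>\w+)"),
    "Location": re.compile(
        r"located (?:downtown )?(?P<value>[A-Z]\w+(?: [A-Z]\w+)*)"
    ),
}


def fill_template(text):
    """Fill each slot whose pattern matches; unmatched slots stay absent."""
    template = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            template[slot] = match.group("value")
    return template


SOURCE = (
    "It has been officially reported that a policeman was wounded today "
    "when urban guerrillas attacked the guards at a power substation "
    "located downtown San Salvador."
)

template = fill_template(SOURCE)
for slot, value in template.items():
    print(f"{slot}: {value}")

# A paraphrase that none of the patterns anticipate yields an empty template.
unmatched = fill_template("Rebels ambushed a patrol near the capital.")
print(unmatched)
```

The second call returns nothing even though the sentence describes a similar event: the extractor finds only what its patterns define.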

Page 14: What is Text Mining? - Welcome to the NCRM EPrints Repository

Text Mining Steps (3/3)

Data Mining: finds associations among pieces of information extracted from many different texts, implicit links

Integration with databases, ontologies
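One simple way to picture an "implicit link" is a pair of entities that never appear in the same document but share an intermediate entity. The sketch below assumes that entities have already been extracted per document. The data is a toy set loosely modelled on Swanson's well-known fish oil / Raynaud's disease example, and the co-occurrence approach is an illustrative choice, not necessarily the method used by the systems described in these slides.

```python
from collections import defaultdict
from itertools import combinations

# Entities assumed to come from an earlier extraction step (toy data).
EXTRACTED = {
    "doc1": {"fish oil", "blood viscosity"},
    "doc2": {"blood viscosity", "Raynaud's disease"},
    "doc3": {"fish oil", "platelet aggregation"},
    "doc4": {"platelet aggregation", "Raynaud's disease"},
}


def cooccurrences(extracted):
    """Map each entity to the set of entities it shares a document with."""
    neighbours = defaultdict(set)
    for entities in extracted.values():
        for a, b in combinations(sorted(entities), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def implicit_links(extracted):
    """Return {(a, c): bridging entities} for pairs never seen together."""
    neighbours = cooccurrences(extracted)
    links = {}
    for a, c in combinations(sorted(neighbours), 2):
        if c in neighbours[a]:
            continue  # explicit link: the pair already co-occurs somewhere
        bridges = neighbours[a] & neighbours[c]
        if bridges:
            links[(a, c)] = bridges
    return links


links = implicit_links(EXTRACTED)
for (a, c), bridges in links.items():
    print(f"{a} <-> {c} via {sorted(bridges)}")
```

No single document mentions both fish oil and Raynaud's disease, yet the pair surfaces because both co-occur with the same intermediate terms. Such a link is only a candidate hypothesis for a domain expert to assess.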

Page 15: What is Text Mining? - Welcome to the NCRM EPrints Repository

Integrating Text with DBs and Ontologies

[Diagram labels: Ontological resources; Text; Databases; Supporting semantics; Adding new knowledge; Semantic interpretation of data]

Page 16: What is Text Mining? - Welcome to the NCRM EPrints Repository

Uses (1/3)

Business applications

Business intelligence (market analysis), competitors, identify risks, make predictions

Customer views and opinions from blogs (opinion mining, trends analysis)

Find nuggets of relevant information immediately, systematically

Remove tedious process of finding information

Page 17: What is Text Mining? - Welcome to the NCRM EPrints Repository

Uses (2/3)

Media

Semantic searching needed for new models of information access, linking and extraction

Personalisation of searching

Document classification and clustering based on personalised queries

Social networking + text mining

Topic clusters of news

Frame analysis of news

Page 18: What is Text Mining? - Welcome to the NCRM EPrints Repository

Uses (3/3)

Text Mining enriches text with semantic annotations

For authors: tools for semantic annotation

Intelligent information management

For publishers: enrichment of digital libraries

For scientists: intelligent searching, linking and integration of text, databases

Page 19: What is Text Mining? - Welcome to the NCRM EPrints Repository

Applications: extracting terms

Page 20: What is Text Mining? - Welcome to the NCRM EPrints Repository

Terms derived from text used as index terms

Page 21: What is Text Mining? - Welcome to the NCRM EPrints Repository

Problems: term variation & ambiguity

Acronyms

ER: estrogen receptor, emergency room, endoplasmic reticulum

Spelling

Tumour / tumor

Oestrogen / estrogen

NF-kB, NF-KB, NF-kappa B, nuclear factor kappa B

Gene terms may also be common English words

BAD: human gene encoding BCL-2 family of proteins

bad news, bad prediction
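The variation and ambiguity problems above can be sketched in a few lines. This is a rough illustration under stated assumptions: the variant table, the rewrite rules in `normalise`, and the all-caps heuristic in `looks_like_gene_symbol` are all invented for this example and are far cruder than a real terminology resource.

```python
import re

# Small illustrative tables; not taken from any real terminology resource.
SPELLING_VARIANTS = {"tumour": "tumor", "oestrogen": "estrogen"}
ACRONYM_SENSES = {
    "ER": ["estrogen receptor", "emergency room", "endoplasmic reticulum"],
}


def normalise(term):
    """Collapse spelling and typographic variants to one canonical key."""
    key = term.lower()
    key = key.replace("nuclear factor", "nf")
    key = key.replace("kappa", "k")
    key = re.sub(r"[\s\-]+", "", key)  # drop spaces and hyphens
    return SPELLING_VARIANTS.get(key, key)


def looks_like_gene_symbol(token):
    """Crude heuristic (an assumption): all-caps tokens may be gene symbols."""
    return token.isupper()


variants = ["NF-kB", "NF-KB", "NF-kappa B", "nuclear factor kappa B"]
keys = {normalise(v) for v in variants}
print(keys)                                   # variation: one shared key
print(normalise("Tumour"), normalise("Oestrogen"))
print(ACRONYM_SENSES["ER"])                   # ambiguity: one form, many senses
print(looks_like_gene_symbol("BAD"), looks_like_gene_symbol("bad"))
```

Normalisation handles variation (many forms, one concept) but does nothing for ambiguity (one form, many concepts). Choosing the right sense of "ER", or deciding whether a capitalised "BAD" in a headline is the gene, needs the surrounding context.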

Page 22: What is Text Mining? - Welcome to the NCRM EPrints Repository

The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …

Semantic searching based on named-entity annotation

Entity types (defined by Ontologies)

Genes/protein names

Enzymes, substances, etc.

Names (people, organisations…)

Annotation labels shown on the example sentence: DNA, virus, cell_type
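A minimal sketch of how named-entity annotation enables searching by entity type rather than by keyword. The lexicon below is an assumption: the slide lists only the labels, so the assignment of each label to a text span is a guess made for illustration. Exact dictionary lookup stands in for a real named-entity recogniser.

```python
# Assumed span-to-type mapping; the slide gives the labels but not the spans.
LEXICON = {
    "peri-kappa B site": "DNA",
    "human immunodeficiency virus type 2": "virus",
    "monocytes": "cell_type",
}

DOCS = {
    "d1": ("The peri-kappa B site mediates human immunodeficiency virus "
           "type 2 enhancer activation in monocytes"),
    "d2": "Blood pressure was recorded daily.",
}


def annotate(text, lexicon):
    """Return (start, end, surface form, entity type) tuples sorted by start."""
    annotations = []
    for term, entity_type in lexicon.items():
        start = text.find(term)
        while start != -1:
            annotations.append((start, start + len(term), term, entity_type))
            start = text.find(term, start + len(term))
    return sorted(annotations)


def search_by_type(docs, lexicon, entity_type):
    """Return ids of documents with at least one entity of the given type."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(ann[3] == entity_type for ann in annotate(text, lexicon))
    ]


for start, end, surface, entity_type in annotate(DOCS["d1"], LEXICON):
    print(f"{start:3d}-{end:3d}  {entity_type:10s} {surface}")

print(search_by_type(DOCS, LEXICON, "virus"))
```

Once the annotations exist, a query such as "documents mentioning any virus" needs no list of virus names from the user.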

Page 23: What is Text Mining? - Welcome to the NCRM EPrints Repository

KLEIO architecture

Page 24: What is Text Mining? - Welcome to the NCRM EPrints Repository
Page 25: What is Text Mining? - Welcome to the NCRM EPrints Repository

Fewer documents with more precise query

Page 26: What is Text Mining? - Welcome to the NCRM EPrints Repository

Extracting associations between entities

Click!

Page 27: What is Text Mining? - Welcome to the NCRM EPrints Repository

… Alzheimer's disease and schizophrenia. Interestingly, nicotine and similar compounds have been shown to enhance memory function and increase the expression of nAChRs and therefore, could have a therapeutic role in the aforementioned diseases.

Page 28: What is Text Mining? - Welcome to the NCRM EPrints Repository

Text mining for social sciences

ASSERT project (NaCTeM-NCeSS-EPPI)

Assisting the process of systematic reviewing in social sciences

Engaging the user community: EPPI (Evidence for Policy and Practice Information and Co-ordinating Centre)

Document classification

Information extraction

Summarisation

Page 29: What is Text Mining? - Welcome to the NCRM EPrints Repository

The process of Systematic Reviewing

Searching: extensive searches to locate as much relevant research as possible according to a query.

Screening: narrows the search results to only the documents relevant to a specific review.

Highlights key evidence and results that may impact on the policy.

Synthesizing: correlates evidence from a plethora of resources and summarises the results.

The process is mainly manual

Page 30: What is Text Mining? - Welcome to the NCRM EPrints Repository

Overview of ASSERT

[Diagram labels: Search; Screen; Synthesize; Query Expansion; Term Extraction; Document Sectioning; Sentence Extraction; Document Clustering; Document Classification; Multi-Document Summarisation; Document Collections]

Page 31: What is Text Mining? - Welcome to the NCRM EPrints Repository

BBC Pilot project

Analyse, structure and visualise BBC news online, according to a user’s query using advanced text mining techniques

Concept discovery and retrieval interface allows a user to enter a query across the document collection and automatically calculates a list of concepts specific to the query, ranked by perceived importance.
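The slide does not say how importance is calculated. As a rough sketch only, one could rank terms by how concentrated they are in the documents matching the query. The headlines, the stopword list and the scoring formula below are all assumptions made for illustration, not the pilot project's actual method.

```python
from collections import Counter

# Invented headlines and a tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "in", "on", "and", "to", "for", "after", "as"}
NEWS = [
    "flood warnings issued as river levels rise in the north",
    "flood defences fail after heavy rain",
    "heavy rain and flood damage close schools",
    "election campaign enters final week",
    "schools budget debated in election campaign",
]


def tokens(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def query_concepts(docs, query, top_n=3):
    """Rank non-query terms found in matching documents.

    Assumed score = (share of matching documents containing the term)
                  * (share of the term's documents that match the query).
    """
    query_terms = set(tokens(query))
    matching = [d for d in docs if query_terms <= set(tokens(d))]
    df_all = Counter(t for d in docs for t in set(tokens(d)))
    df_hit = Counter(t for d in matching for t in set(tokens(d)))
    scores = {
        t: (df_hit[t] / len(matching)) * (df_hit[t] / df_all[t])
        for t in df_hit
        if t not in query_terms
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


concepts = query_concepts(NEWS, "flood")
print(concepts)
```

Terms such as "schools", which also appear in unrelated stories, score lower than terms found only in the flood stories. That is the sense in which the ranked concepts are specific to the query.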

Page 32: What is Text Mining? - Welcome to the NCRM EPrints Repository

Finding news using text mining

Page 33: What is Text Mining? - Welcome to the NCRM EPrints Repository

Clustering the documents

Page 34: What is Text Mining? - Welcome to the NCRM EPrints Repository

Visualising the results

Page 35: What is Text Mining? - Welcome to the NCRM EPrints Repository

11th April 2008 ASSERT Project

Benefits to Users

Provision of a focused search with goal based results

Allows expansion beyond known keywords for a more complete search

Visualization of a result set creates an overview of the research in the domain

Saves time and effort

Page 36: What is Text Mining? - Welcome to the NCRM EPrints Repository

What do our users/clients use our services for?

Creation of controlled vocabularies, extraction of interactions, creation of models and networks, database curation (BOOTStrep)

Bibliographic searching, automatic classification and semantic extraction in support of subject searching (ASSERT, INTUTE)

Ontology building

Media frame analysis (ASSIST)

Semantic extraction to support indexing and linking across repositories (INTUTE)

Extracting bio-processes, gene-disease mining (Pfizer)

Maintaining and constructing pathways (REFINE)

Classification of on-line news feeds, document classification

Topic detection (BBC)

Page 37: What is Text Mining? - Welcome to the NCRM EPrints Repository

Text Mining for Knowledge Discovery

Semantic enrichment relies on NLP based TM technologies

Applications based on semantically annotated texts enable knowledge discovery

Linking text with domain knowledge

Integrate other knowledge sources, ontologies, terminologies

Integration of distributed TM software (workflows)

Page 38: What is Text Mining? - Welcome to the NCRM EPrints Repository
