Leiden University. The university to discover.
From Data Mining to Knowledge Fitting
Joost N. Kok, Leiden Institute of Advanced Computer Science
Sunday, April 9, 2023
Information Ladder
- Data
- Information
- Knowledge
- Understanding
- Insight
- Wisdom
Data Mining definitions
- Secondary analysis of data
- Induction of understandable useful models and patterns from data
- Algorithms for large quantities of data
- Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
  - useful
  - novel, surprising
  - comprehensible
  - valid (accurate)
Data Mining
- Data Mining = Data Search using a Knowledge Bias
Data Mining
- Data Mining is somewhat comparable to statistics (and is often based on it), but takes it further:
  - statistics aims more at validating given hypotheses
  - in data mining, often millions of potential patterns are generated and tested, in the hope of finding some that are potentially useful
Typical Data Mining Results
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events
Different types of problems
- “Data mining” problems / tasks often fall in one of the following categories:
  - Classification
  - Regression
  - Clustering
  - Discovering associations
  - Probabilistic modelling
From “Querying” to “Mining”
- Are there any occurrences of GAAT in this string?
- How many occurrences of AAT are there in this string?
- Which substrings of length 4 occur at least 2 times?
- Which substrings (of any length) occur significantly more often in the white string than in the black string?
- Standard database technology solves such questions
- Data mining technology can sometimes solve such questions (computations may be (too) heavy)
- Science fiction: Why is the virus to the left resistant to my drug, and the one to the right not?
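The first three questions on this slide are plain string queries and can be sketched in a few lines of Python. The sequence `dna` and the helper names below are made up for illustration; they are not taken from the talk. The fourth question would additionally need a second string and a significance test, which this sketch leaves out.

```python
from collections import Counter


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, i))


def frequent_substrings(text, length, min_count):
    """Return the substrings of the given length occurring at least min_count times."""
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    return {s: c for s, c in counts.items() if c >= min_count}


# Made-up example sequence (an assumption, for illustration only).
dna = "GAATTCGAATAAT"

has_gaat = "GAAT" in dna                        # any occurrence of GAAT?
n_aat = count_occurrences(dna, "AAT")           # how many occurrences of AAT?
repeated_4mers = frequent_substrings(dna, 4, 2)  # 4-mers seen at least twice
```

Counting every substring of every length in this way grows quickly with the string length, which is one way to read the remark that the computations may be (too) heavy.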
Scientific Data Lifecycle
Scientific Data Lifecycle
[Figure: scientific data lifecycle diagram. Labels: Databases, Ontologies, Integration, Disambiguation, Data, Knowledge, Discovery tools, KDD, Data mining, Statistics, Knowledge Fitting]
Building Blocks
Link Integration
[Figure: four sources]
Federated Database
[Figure: four sources connected to a distributor]
Data Warehousing
[Figure: four sources feeding a warehouse]
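The difference between the federated and the warehouse building block can be sketched as follows. This is only a toy illustration under stated assumptions: the three `sources`, their records and both function names are hypothetical, and real systems add schemas, query translation and refresh policies on top.

```python
# Hypothetical sources: each maps a gene ID to a partial record (made-up data).
sources = [
    {"g1": {"name": "alpha"}},
    {"g2": {"name": "beta"}},
    {"g1": {"pathway": "p53"}},
]


def federated_lookup(key):
    """Federated style: forward the query to every source at query time."""
    merged = {}
    for source in sources:
        merged.update(source.get(key, {}))
    return merged


def build_warehouse():
    """Warehouse style: copy all records into one central store up front."""
    warehouse = {}
    for source in sources:
        for key, record in source.items():
            warehouse.setdefault(key, {}).update(record)
    return warehouse


warehouse = build_warehouse()
```

In this sketch the federated lookup always reflects the current state of the sources, while the warehouse answers from its own copy and has to be rebuilt when a source changes.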
Scripting Languages
- A scripting language is a programming language that controls software applications.
- Examples: Python, Perl
- Standards for uniform access to databases
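One example of such a standard is Python's DB-API (PEP 249), which database drivers implement so that scripts connect, execute and fetch in the same way. The slides do not say which standard was meant, so this is an assumption. A minimal sketch with the standard-library `sqlite3` module; the table and its rows are made up:

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up table of gene scores (illustration only).
cur.execute("CREATE TABLE genes (id TEXT, score REAL)")
cur.executemany("INSERT INTO genes VALUES (?, ?)",
                [("g1", 2.5), ("g2", 0.3), ("g3", 1.7)])
conn.commit()

# The connect / cursor / execute / fetchall pattern is the uniform part.
cur.execute("SELECT id FROM genes WHERE score > ? ORDER BY score DESC", (1.0,))
top_genes = [row[0] for row in cur.fetchall()]
conn.close()
```

Other DB-API drivers follow the same pattern, although details such as the placeholder style (`?` here) can differ per driver.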
Ontologies
- Ontology is about the description of things and their relationships.
- Ontologies are taxonomies that define concepts and relationships among them.
- The subclass (is-a) relationship is the predominant relationship in ontologies
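A minimal sketch of how is-a links form a taxonomy that can be queried transitively. The concepts in `IS_A` are hypothetical, and each concept has a single parent here for simplicity; real ontologies such as GO allow several parents per term.

```python
# Hypothetical mini-taxonomy: child -> parent ("is-a") links.
IS_A = {
    "kinase": "enzyme",
    "enzyme": "protein",
    "protein": "molecule",
    "receptor": "protein",
}


def ancestors(concept):
    """Follow is-a links upward and return all superclasses, nearest first."""
    result = []
    while concept in IS_A:
        concept = IS_A[concept]
        result.append(concept)
    return result


def is_a(concept, candidate):
    """True if concept equals candidate or is a (transitive) subclass of it."""
    return concept == candidate or candidate in ancestors(concept)
```

For example, `is_a("kinase", "protein")` holds through the intermediate concept `enzyme`, while `is_a("receptor", "enzyme")` does not.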
OWL = Web Ontology Language
Building Blocks
Service Orientation
- SOA = Service-Oriented Architecture
- SOA: Distributed Software Architecture that allows for building applications by composing individual components
Visualisation
- Intelligent Data Analysis
  - Intelligent = Methods
  - Intelligent = Human Interaction
- First step:
  - visualisation of the data
DNA Visualisation
- Long patterns over small alphabets are hard to find …
- ababababababababababababababababababababababa . . .
  - (ab)
- abbbababaaababbabbbababaaababbabbbababaaababb . . .
  - (abbbababaaababb)
- abaaaababbbbabaaaababbbbabaaaababbbbabaaaabab . . .
  - (abaaaa · babbbb)
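For exact repeats like the three examples above, the repeating unit can be recovered by brute force. The function name is hypothetical and the search is quadratic in the string length, so this is a sketch for short strings and not a method for whole chromosomes or approximate repeats.

```python
def repeating_unit(text):
    """Return the shortest prefix whose repetition reproduces text
    (the last copy may be cut off)."""
    n = len(text)
    for period in range(1, n + 1):
        # A period fits if every character equals the one `period` places earlier.
        if all(text[i] == text[i - period] for i in range(period, n)):
            return text[:period]
    return text  # only reached for the empty string


unit_1 = repeating_unit("ab" * 22 + "a")
unit_2 = repeating_unit("abbbababaaababb" * 3)
unit_3 = repeating_unit("abaaaababbbb" * 3 + "abaaaabab")
```

The three inputs are the strings shown on the slide, written as repetitions so that the units are easy to check by eye.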
DNA Visualisation
- Associate each nucleotide with a dimension
- Four nucleotides => four dimensions
- Build a structure in four dimensions
- Project to three dimensions
DNA Visualisation
- We expect to see the following things in the projection:
  - A non-predictable walk for information-rich parts of the DNA
  - A true random walk for random parts
  - Lines (or approximate lines) for repeating parts of the DNA
  - Large identical substrings in the DNA can easily be detected
DNA Visualisation
- Select four three-dimensional vectors.
- The vectors should be of comparable length
- The four vectors should add up to 0
- Every subset of three vectors should be independent.
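One set of vectors that satisfies these three conditions is the corners of a regular tetrahedron. The sketch below assumes that choice and an arbitrary assignment of nucleotides to vectors (the slides specify neither), checks the conditions, and builds the walk by adding one vector per nucleotide.

```python
from itertools import combinations

# Assumed choice: corners of a regular tetrahedron (all of equal length).
# Which nucleotide gets which vector is arbitrary in this sketch.
VECTORS = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def determinant(u, v, w):
    """Determinant of the 3x3 matrix with rows u, v, w."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def dna_walk(sequence):
    """Return the 3-D points visited when stepping one vector per nucleotide."""
    x = y = z = 0
    points = [(x, y, z)]
    for nucleotide in sequence:
        dx, dy, dz = VECTORS[nucleotide]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


# Condition checks: the vectors add up to 0 and every three are independent.
vectors_sum = tuple(sum(v[i] for v in VECTORS.values()) for i in range(3))
independent = all(determinant(*triple) != 0
                  for triple in combinations(VECTORS.values(), 3))
```

With these vectors a repeat such as `ACACAC` moves in a straight line, which matches the expectation that repeating parts of the DNA show up as lines in the projection.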
DNA Visualisation
The first 160,000 nucleotides of the human Y-chromosome
The first 160,000 nucleotides of the human Y-chromosome
Nucleotides 40,000–100,000 of human chromosome 1
DNA Visualisation
- Simple, large and extremely large (approximate) repeats can easily be detected
- Demo: http://www.liacs.nl/home/jlaros/projects/dnavis/
Data Mining
Subgroup Discovery
- How to find comprehensible subgroups in large amounts of data?
- As an example: subtypes in complex diseases
- Different types of input
[Figure: labels 1, 2, 3; Class A, Class B]
Classification versus Subgroup Discovery
[Figure: scatter of “+” examples; labels 1, 2, 3; Class A, Class B]
Classification vs Subgroup Discovery
- Classification
  - predictive induction
  - constructing sets of classification rules
  - aimed at learning a model for classification or prediction
  - rules are dependent
- Subgroup Discovery
  - descriptive induction
  - constructing individual subgroup-describing rules
  - aimed at finding interesting patterns in target class examples
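Because subgroup discovery judges each rule on its own, it needs a per-rule quality measure. The slides do not say which measure was used; one common choice in the subgroup discovery literature is weighted relative accuracy (WRAcc), sketched here as an assumption on made-up examples with a hypothetical rule "age > 60".

```python
def wracc(examples, condition, target):
    """WRAcc = p(cond) * (p(target | cond) - p(target))."""
    n = len(examples)
    covered = [e for e in examples if condition(e)]
    if not covered:
        return 0.0
    p_cond = len(covered) / n
    p_target = sum(1 for e in examples if e["cls"] == target) / n
    p_target_given_cond = sum(1 for e in covered if e["cls"] == target) / len(covered)
    return p_cond * (p_target_given_cond - p_target)


# Made-up examples (illustration only).
examples = [
    {"age": 70, "cls": "A"}, {"age": 65, "cls": "A"},
    {"age": 72, "cls": "A"}, {"age": 30, "cls": "A"},
    {"age": 68, "cls": "B"}, {"age": 25, "cls": "B"},
    {"age": 35, "cls": "B"}, {"age": 40, "cls": "B"},
]

# Score the single rule "age > 60 -> class A".
score = wracc(examples, lambda e: e["age"] > 60, "A")
```

A positive score means the rule covers a group in which the target class is more frequent than in the data as a whole; the score says nothing about how the uncovered examples should be classified.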
Towards Knowledge Fitting
- Trends:
  - A lot of valuable data is no longer being shared, for various reasons: privacy issues, data that is difficult and expensive to collect, etc.
  - The amount of publicly available knowledge increases daily.
  - Patterns and models need to be complemented with knowledge that convinces the user.
Knowledge Fitting =
Knowledge Mining using a Data Bias
Data Mining =
Data Search using a Knowledge Bias
SUBGROUP MINING SCENARIO
Subgroup Mining Scenario
- Prepare the data
- Model the subgroups
- Characterize and compare the subgroups
- Evaluate the subgroups
- Package available in R
Group Modeling
- Model-based cluster analysis.
- The data is modeled by a mixture of Gaussians.
- Many models, many BIC scores.
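The idea of fitting several mixture models and comparing their BIC scores can be sketched in plain Python. This is not the R package mentioned in the talk: it is a one-dimensional toy with a simple EM loop, made-up data, and the lower-is-better form of BIC (number of parameters times ln n, minus twice the log-likelihood). All names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture_1d(data, k, iterations=50, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM; return the log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: split the sorted data into k chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, var_floor)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in xs)


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion, lower-is-better convention."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Made-up data with two well-separated groups (illustration only).
data = [0.0, 0.1, -0.1, 0.2, -0.2, 10.0, 10.1, 9.9, 10.2, 9.8]

# A 1-D mixture with k components has 3k - 1 free parameters.
bic_scores = {k: bic(fit_mixture_1d(data, k), 3 * k - 1, len(data)) for k in (1, 2)}
best_k = min(bic_scores, key=bic_scores.get)
```

On this toy data the two-component model gets the better score, because the gain in likelihood outweighs the penalty for its extra parameters.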
Group Characteristics
Subgroup Evaluation
- We report statistical results and generalization estimates in tables
GENOMICS SCENARIO
Gene Expression Data
- Genomics: the study of genes and their function
- MicroArray Data
- a very large number of attributes (genes) relative to the number of examples (observations)
- typical values: 7000-16000 attributes, 50-150 examples
Gene Expression Data
Patient #  Tumor Type  Gene #1  Gene #2  Gene #3  …  Gene #10,000
1          A           5.00     1.33     3.45     …  4.22
2          A           0.98     0.87     1.04     …  ?
3          B           0.33     1.40     0.42     …  0.24
…          …           …        …        …        …  …
100        B           0.89     0.90     1.00     …  0.66
- few cases (#1, #2, …, #100)
- many features
Ranking of differentially expressed genes
The genes are ordered in a ranked list, according to their differential expression between the classes.
The challenge is to extract meaning from this list, to describe subgroups.
Conjunctions of ontology terms are used as a vocabulary for describing sets of genes.
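How such a ranked list can be built is sketched below. The slides do not state how differential expression is scored, so the absolute difference between class means is an assumption here; the values are the Gene #1 to Gene #3 entries for patients 1, 2, 3 and 100 from the gene expression table.

```python
# Expression values for patients 1 and 2 (type A) and 3 and 100 (type B),
# taken from the example table; the scoring rule itself is an assumption.
expression = {
    "Gene #1": {"A": [5.00, 0.98], "B": [0.33, 0.89]},
    "Gene #2": {"A": [1.33, 0.87], "B": [1.40, 0.90]},
    "Gene #3": {"A": [3.45, 1.04], "B": [0.42, 1.00]},
}


def mean(values):
    return sum(values) / len(values)


def rank_genes(data):
    """Order genes by the absolute difference of their class means, largest first."""
    scores = {gene: mean(v["A"]) - mean(v["B"]) for gene, v in data.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = rank_genes(expression)
```

With so few patients per class a real analysis would use a proper test statistic; the point of the sketch is only that the output is an ordered list of (gene, score) pairs.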
Subgroup Discovery
Discovery of gene subgroups which
- are “higher” in the ranked list
- can be compactly summarized using
  - knowledge (GO, ENTREZ, KEGG)
  - interactions between genes
Enrichment Score
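The transcript does not preserve how the enrichment score was defined. A common form is a running-sum statistic over the ranked list: walk down the list, step up at every gene in the set and down at every other gene, and report the largest deviation from zero. The sketch below implements that unweighted variant as an assumption; gene IDs and the function name are made up.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in ranked_genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for hit in hits:
        # Step up for a gene in the set, down for any other gene.
        running += 1.0 / n_hits if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]  # made-up IDs, best first
top_heavy = enrichment_score(ranked_genes, {"g1", "g2"})
spread = enrichment_score(ranked_genes, {"g1", "g6"})
```

A set concentrated at the top of the list (`top_heavy`) scores higher than a set spread over the whole list (`spread`).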
Descriptions
- FANTOM = Frequent pAtterN Tree-based Ontology Miner
- FANTOM is a knowledge fitting tool that uncovers “interesting” descriptions of gene sets
- Interesting: high Gene Set Enrichment Score
- Search for patterns is exhaustive
Inputs
- FANTOM takes as inputs:
  - A ranked list of genes (default ID is from ENTREZ), together with a score.
  - Ontologies (default are GO and KEGG)
  - Mappings (to map ENTREZ or another ID to the ontologies)
  - Interaction data (if available)
  - Cutoffs
    - minimum GSES
    - minimum number of gene participants in a rule
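To make the role of the mappings and the gene-count cutoff concrete, here is a brute-force sketch that enumerates conjunctions of ontology terms and keeps those describing enough genes. It is not FANTOM's actual algorithm, it ignores the ranking and the minimum GSES cutoff, and the gene IDs and term names are hypothetical.

```python
from itertools import combinations

# Made-up mapping: gene ID -> set of ontology terms (names are hypothetical).
annotations = {
    "g1": {"GO:apoptosis", "KEGG:p53"},
    "g2": {"GO:apoptosis", "KEGG:p53"},
    "g3": {"GO:apoptosis"},
    "g4": {"GO:transport"},
}


def describe_gene_sets(annotations, max_terms=2, min_genes=2):
    """Enumerate every conjunction of up to max_terms terms and keep those
    that cover at least min_genes genes."""
    all_terms = sorted(set().union(*annotations.values()))
    rules = {}
    for size in range(1, max_terms + 1):
        for terms in combinations(all_terms, size):
            # A gene matches when it carries every term of the conjunction.
            genes = {g for g, t in annotations.items() if set(terms) <= t}
            if len(genes) >= min_genes:
                rules[terms] = genes
    return rules


rules = describe_gene_sets(annotations)
```

Each surviving key is a candidate description (a conjunction of terms) and its value is the gene set it describes; a second filter on the enrichment score of that gene set would then implement the other cutoff.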
Typical Statistics
- Experiment comparing two different mouse hearts:
  - Generated rule options: 200k-2m
  - Actual rules: 10-40k
  - Rules after pruning: 5-500
  - Runtime: 5 minutes - 4 hours
Knowledge Fitting =
Knowledge Mining using a Data Bias
Data Mining =
Data Search using a Knowledge Bias
Intelligent Bridges
Movies
Cyttron
- The Cyttron consortium aims at developing a "super microscope", imaging the living cell with atomic resolution.
- Images gathered with X-ray diffraction, electron microscopy, and other sources will be combined through advanced software solutions.
- www.cyttron.org
LIACS
- The Computer Science Institute of Leiden University
- Leiden Institute of Advanced Computer Science
- www.liacs.nl
Research Clusters
- Algorithms
- Foundations of Software Technology
- Computer Systems
- Imagery and Media
- Technology and Innovation Management
Acknowledgements
- Jeroen Laros (LIACS)
- Jeroen de Bruin (LIACS)
- Fabrice Colas (LIACS)
- Nada Lavrac (JSI)
- Igor Trajkovski (JSI)
- Jan Bot (TU Delft)
- Ingrid Meulebelt (LUMC)
- Eline Slagboom (LUMC)
- Peter-Bram ’t Hoen (LUMC)
- Tineke van Veen (LUMC)
- Stephanie van Roden (LUMC)
Algorithms Cluster @ LIACS