(Talk in Powerpoint Format)
AI Machine learning
Neural networks
Deductive databases
• Detecting regularities in data (bird flu cases)
• Detecting rare occurrences, rare events
• Finding “causal” relationships
Opportunities
Collecting vast amounts of data has become possible.
Ex1: Astronomy: petabytes of information are collected (Laboratory for Cosmological Data Mining, LCDM)
1 petabyte (PB) = 2^50 bytes = 1,125,899,906,842,624 bytes.
1 petabyte = 1,024 terabytes
1 terabyte (TB) = 1,024 gigabytes
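The unit arithmetic above can be checked in a few lines. This is only a sketch using the binary (power-of-two) definitions given on the slide; the constant names are chosen here for illustration.

```python
# Binary (power-of-two) storage units, as defined on the slide.
GB = 2**30  # gigabyte
TB = 2**40  # terabyte
PB = 2**50  # petabyte

print(f"1 PB = {PB:,} bytes")
print(f"1 PB = {PB // TB:,} TB")
print(f"1 TB = {TB // GB:,} GB")
```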
=> The armchair astronomer
Ex2: Biology: huge sequences of nucleotides have been collected. (The human genome contains more than 3.2 billion base pairs and more than 30 000 genes).
http://www.genomesonline.org
Very little of that has been interpreted yet.
Ex: Physics, Geography, weather data, …
Business, …
• numerical
• discrete
• continuous
• categorical
• raw data
• cleaned data
• complete records
• incomplete records (missing data)
• formatted data
• unformatted data
Tasks
• Fit data to model
– Descriptive
– Predictive
• Finding the “best” model ???
– Beware of model overfitting!
• Interpreting results
• Evaluating models (ex: lift charts)
=> Usually a lot of going back and forth between model(s) and data
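As one illustration of model evaluation, the quantity plotted in a lift chart can be sketched as follows: rank the cases by model score, take the top fraction, and compare its response rate with the overall response rate. The function name `lift_at` and the scores and labels below are made up for illustration; they do not come from the talk.

```python
def lift_at(scores, labels, fraction):
    """Lift of the top `fraction` of cases, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical model scores and true outcomes (1 = positive case).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(f"top {fraction:.0%}: lift = {lift_at(scores, labels, fraction):.2f}")
```

A lift above 1 for a small top fraction means the model concentrates positive cases near the top of its ranking; at 100% of the cases the lift is 1 by construction.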
Another complementary tack: Interactive visual data exploration
• Remarkable properties of the human visual system. (ex: analysis of a pseudo random number generator)
• Various visual representation schemes
– Simultaneous viewing
– (fast) sequential viewing
• Animating data (dynamic queries)
Other possibilities: converting data to sounds, etc.
Two broad approaches to Learning
• Supervised learning
ex: we want to discover a model that helps classify stars based on their emission spectra.
In the “training set”, the correct classification of each star is known.
The resulting model is used to predict the class of a new star (not in the training set).
• Unsupervised learning
ex: we want to group a set of stars into a small number of sufficiently homogeneous sub-groups.
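The contrast between the two approaches can be sketched on toy data. Everything below is an assumption made for illustration: the single numeric feature, the class names "hot" and "cool", the nearest-centroid classifier, and the simple one-dimensional k-means grouping are stand-ins, not the methods used in the talk.

```python
from statistics import mean

# Hypothetical one-dimensional feature (e.g. a colour index) for a few stars.
training = [(0.1, "hot"), (0.2, "hot"), (0.3, "hot"),
            (1.4, "cool"), (1.5, "cool"), (1.6, "cool")]

# Supervised: the labels are known, so learn one centroid per class.
def train_centroids(examples):
    by_class = {}
    for value, label in examples:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}

def classify(centroids, value):
    """Predict the class whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Unsupervised: no labels; group the values around k centres (requires k >= 2).
def k_means_1d(values, k, iterations=20):
    ordered = sorted(values)
    # Spread the initial centres evenly across the sorted values.
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(centres[c] - v))
            groups[nearest].append(v)
        # Keep the old centre if a group ends up empty.
        centres = [mean(g) if g else centres[i] for i, g in enumerate(groups)]
    return groups

centroids = train_centroids(training)
print(classify(centroids, 0.25))                      # class of a new star
print(k_means_1d([v for v, _ in training], k=2))      # groups found without labels
```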
Many techniques; a fast-evolving field
• Statistical
– Descriptive stats, graphics, …
– Regression analysis
– Principal components analysis
– Time series analysis
– Cluster analysis (use of a distance measure)
– Naïve Bayes classifiers
• Artificial intelligence
– Rule induction (Machine Learning)
– Various inference techniques (various logics, deductive databases, …)
– Pattern matching (speech recognition)
– Neural networks (many approaches)
– Genetic algorithms
– Bayesian networks (probably the best approach to model complex causal structures)
• Information retrieval
– Many specialized models (vector model, …)
– Concepts of Precision and Recall
• Many ad hoc techniques
– Co-occurrence analysis
– MK generality analysis
– Association analysis
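Precision and recall, mentioned under information retrieval, can be sketched as follows. Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved. The function name and document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 documents actually relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```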
One famous technique
Ross Quinlan’s ID3 algorithm
The weather data
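| Object | Outlook | Temperature | Humidity | Windy | Class |
|---|---|---|---|---|---|
| 1 | sunny | hot | high | FALSE | N |
| 2 | sunny | hot | high | TRUE | N |
| 3 | overcast | hot | high | FALSE | P |
| 4 | rain | mild | high | FALSE | P |
| 5 | rain | cool | normal | FALSE | P |
| 6 | rain | cool | normal | TRUE | N |
| 7 | overcast | cool | normal | TRUE | P |
| 8 | sunny | mild | high | FALSE | N |
| 9 | sunny | cool | normal | FALSE | P |
| 10 | rain | mild | normal | FALSE | P |
| 11 | sunny | mild | normal | TRUE | P |
| 12 | overcast | mild | high | TRUE | P |
| 13 | overcast | hot | normal | FALSE | P |
| 14 | rain | mild | high | TRUE | N |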
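ID3 builds a decision tree by choosing, at each node, the attribute with the highest information gain (the reduction in class entropy after splitting on that attribute). The sketch below computes that first choice for the weather data above. It covers only the attribute-selection step, not the full recursive tree construction, and the function and variable names are chosen here for illustration.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# The 14 rows of the weather data; the last field is the class (P or N).
ROWS = [
    ("sunny", "hot", "high", False, "N"),
    ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"),
    ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"),
    ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"),
    ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"),
    ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"),
    ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"),
    ("rain", "mild", "high", True, "N"),
]

def entropy(rows):
    """Entropy (in bits) of the class distribution of `rows`."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(n / total * log2(n / total) for n in counts.values())

def information_gain(rows, index):
    """Entropy reduction obtained by splitting `rows` on the attribute at `index`."""
    total = len(rows)
    remainder = 0.0
    for value in {row[index] for row in rows}:
        subset = [row for row in rows if row[index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("split first on:", best)
```

On this data the gain is largest for outlook, so the root of the tree tests outlook; the same selection is then repeated on each branch.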
From decision trees to rules
• Reading rules from a tree
– Unambiguous
– Rule order does not matter
– Alternative rules for the same conclusion are ORed
– But the resulting rules can be overly complex
Rules can be much more compact than trees
• Ex: if x=1 and y=1 then class=a
if z=1 and w=1 then class=a
otherwise class=b
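Written as code, the three rules stay this short. This is a direct transcription of the example, with a function name chosen here for illustration.

```python
def classify(x, y, z, w):
    """Apply the three rules; the two rules for class a are ORed."""
    if x == 1 and y == 1:
        return "a"
    if z == 1 and w == 1:
        return "a"
    return "b"  # default rule: otherwise class = b

print(classify(1, 1, 0, 0))  # first rule fires
print(classify(0, 1, 1, 1))  # second rule fires
print(classify(1, 0, 0, 1))  # no rule fires, default class
```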
From rules to decision trees
• Rule disjunction results in overly complex trees.
• Ex: write as a tree (Fig. 3.2)
– If a and b then x
– If c and d then x
(replicated sub-tree problem)
• Ex: tree and rules of equivalent complexity
• Ex: tree much more complex than rules
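The replicated sub-tree problem can be made concrete with a small sketch. The tree below encodes the two rules "if a and b then x" and "if c and d then x" for boolean attributes. The nested-tuple representation, the helper names, and the default label "not x" are assumptions made for illustration.

```python
from itertools import product

def cd_subtree():
    """Sub-tree testing c, then d. It is needed wherever the (a and b) test fails."""
    return ("c", {True: ("d", {True: "x", False: "not x"}), False: "not x"})

# A node is (attribute, {value: child}); a leaf is a class label string.
tree = ("a", {
    True: ("b", {True: "x", False: cd_subtree()}),  # first copy of the c/d sub-tree
    False: cd_subtree(),                            # second copy
})

def classify(node, case):
    """Walk down the tree until a leaf (a string) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[case[attribute]]
    return node

def count_tests(node):
    """Number of internal (test) nodes in the tree."""
    if isinstance(node, str):
        return 0
    _, branches = node
    return 1 + sum(count_tests(child) for child in branches.values())

print("test nodes in the tree:", count_tests(tree))
for a, b, c, d in product([False, True], repeat=4):
    case = {"a": a, "b": b, "c": c, "d": d}
    expected = "x" if (a and b) or (c and d) else "not x"
    assert classify(tree, case) == expected
```

The two rules need four attribute tests in total, while the equivalent tree needs six because the c/d sub-tree appears twice.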
To learn from examples, the examples must be rich enough
• Ex: sister-of relation (fig 2-1)
• Denormalization (fig 2-3)
Importance of data preparation
Attributes
• An attribute may be irrelevant in a given context (ex: number of wheels for a ship in a database of transportation vehicles) => Create value “irrelevant”
Software tools
• Many commercial software packages
– CART (http://www.salford-systems.com/landing.php)
– SPSS modules
– WEKA (free) (http://www.cs.waikato.ac.nz/~ml/weka/)
– For a larger list: http://www.kdnuggets.com/software/suites.html
• Much field-specific software
– In the context of GRID computing
• Demonstrating WEKA
Ad hoc methods
• Co-occurrence analysis
• MK generality analysis
Term Co-occurrence Analysis
The following approach measures the strength of association between a term i and a term j over the set of documents:
$$e(i,j)^2 = \frac{C_{ij}^2}{C_i \, C_j}$$
where:
• $C_i$ is the number of documents indexed by term i
• $C_j$ is the number of documents indexed by term j
• $C_{ij}$ is the number of documents indexed by both terms i and j
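The measure can be computed directly from the index terms of each document. The sketch below assumes each document is represented as a set of index terms; the function name and the tiny collection are hypothetical.

```python
def association_strength(doc_terms, i, j):
    """Squared association e(i,j)^2 = Cij^2 / (Ci * Cj) between terms i and j."""
    c_i = sum(1 for terms in doc_terms if i in terms)
    c_j = sum(1 for terms in doc_terms if j in terms)
    c_ij = sum(1 for terms in doc_terms if i in terms and j in terms)
    if c_i == 0 or c_j == 0:
        return 0.0  # a term that indexes no document has no association
    return c_ij**2 / (c_i * c_j)

# Hypothetical collection: each document is the set of terms that index it.
docs = [
    {"data", "mining"},
    {"data", "mining", "weka"},
    {"data", "statistics"},
    {"neural", "networks"},
]

print(association_strength(docs, "data", "mining"))   # terms that often co-occur
print(association_strength(docs, "data", "neural"))   # terms that never co-occur
```

The value ranges from 0 (the terms never index the same document) to 1 (the terms always appear together).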
Interactive Data Visualization
• Fish eye views
• Hyperbolic trees
• Linear Visual data sequences
• Dynamic queries
Tree Maps
• Financial Data: http://www.smartmoney.com/marketmap/
Conclusion
• Current state of the art (Graphical Models – Markov networks)
• Still an art
• Ethical issues
Bayesian Networks
• Objective: estimate the probability that a given sample belongs to a class
Probability(x ∈ Class | attribute values)
• Bayesian network:
– One node for each attribute
– Nodes connected in a directed acyclic graph
– Conditional independence
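The simplest network of this kind is naive Bayes, in which the class node is the only parent of every attribute node, so the attributes are assumed conditionally independent given the class. The sketch below estimates Probability(Class | attribute values) for the weather data of the ID3 example under that assumption. It uses raw frequency counts without smoothing and is meant as an illustration, not as the method shown in the talk.

```python
from collections import Counter

# The 14 rows of the weather data: outlook, temperature, humidity, windy, class.
ROWS = [
    ("sunny", "hot", "high", False, "N"),
    ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"),
    ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"),
    ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"),
    ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"),
    ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"),
    ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"),
    ("rain", "mild", "high", True, "N"),
]

def posterior(rows, observed):
    """Estimate P(class | observed attribute values) with naive Bayes."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for cls, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for index, value in enumerate(observed):
            matches = sum(1 for row in rows
                          if row[-1] == cls and row[index] == value)
            score *= matches / count  # P(attribute = value | class)
        scores[cls] = score
    # Normalise; assumes at least one class has a non-zero score.
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

# A new day: sunny, cool, high humidity, windy.
result = posterior(ROWS, ("sunny", "cool", "high", True))
for cls, probability in result.items():
    print(f"P({cls} | attributes) = {probability:.3f}")
```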
Learning a Bayesian network from data
• Function for evaluating a given network based on the data
• Function for searching through the space of possible networks
• K2 and TAN algorithms
Bayesian Networks
Graphical Models = Markov models
undirected edges