1. Data collected in large databases

Relational databases: Many variables and cases.

Mostly noisy data: Missing Values, Zeros, Outliers.

Observations are no longer independent random samples.

Data mining objective: To extract valuable information.

To identify nuggets: small clusters of observations in these data that contain potentially valuable information.

What counts as valuable is generally reflected by a large response value or by a specific category of a qualitative response.

The main challenge of data mining is sifting through a large volume of data that is noisy, badly behaved, full of missing values, or simply irrelevant.

Statistical Data mining

2. How large is large?

By number of cases:

Small: N < 30 (no CLT)

Moderate: 30 < N < 500 (CLT)

Moderately large: 500 < N < 50000 (tolerable N^2)

Large: 50000+: no N^2 computations.
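The slides do not say which N^2 computation they have in mind. As an illustration only, the sketch below assumes a dense N x N matrix of 8-byte floats (for example, all pairwise distances between cases) and computes how much memory it would need at each size class. The function name `pairwise_matrix_gb` is hypothetical.

```python
def pairwise_matrix_gb(n_cases, bytes_per_value=8):
    """Memory for a dense n x n matrix, in gigabytes (10**9 bytes)."""
    return n_cases * n_cases * bytes_per_value / 1e9

# Size classes taken from the list above; 1,000,000 is an extra illustrative value.
for n in (500, 50_000, 1_000_000):
    print(n, pairwise_matrix_gb(n))
```

At N = 50000 the matrix alone needs about 20 GB, which is one way to read the advice to avoid N^2 computations beyond that point.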

By the number of variables:

Small: One variable.

Moderate: Fewer than 500 variables (matrix inversion is feasible).

Large: More than 500 variables.

By database size:

Large: Does not fit in memory.

3. Data mining Software

Fast computations.

Economic use of memory.

Flexible (and user-friendly) graphical interface.

Software:

Clementine from SPSS

Enterprise Miner from SAS Institute

Diva, Spotfire, C5, S-PLUS, R, and others

Statistical Methods:

Advanced Data Visualization.

Data Reduction: variable and case subsetting, sampling.

Dimension Reduction, Principal Components, Covariance.

Cluster analysis (Segmentation): k-means, hierarchical.

Classification (pattern recognition): trees, neural nets.

Regression: Linear and Nonlinear.

Improved Methodology and Software.

Improved packaging

Data is from regular businesses.

Objective: Better business decisions.

What is new?

To develop new methodology that truly answers data mining objectives.

Challenge: data that is noisy, badly behaved, and has many missing values.

Methods: Tree Methods, Cluster Analysis, Regression Analysis.

5. Paradigm for data mining: Selection of interesting subsets

Statistical criteria are evaluated on the subset only.

Add a penalty for subset size (to avoid subsets that are too small).
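The slides give neither the criterion nor the form of the penalty. As one possible reading, the sketch below scores a subset by its mean response minus a hypothetical penalty `lam / sqrt(n)`, and searches intervals of a single variable for the best score. The names `penalized_score`, `best_interval`, `lam`, and `step`, and the toy data, are assumptions made for illustration.

```python
import numpy as np

def penalized_score(y_subset, lam=1.0):
    """Criterion evaluated on the subset only: mean response minus a size penalty.

    The penalty lam / sqrt(n) is a hypothetical choice; it grows as the subset shrinks.
    """
    n = len(y_subset)
    if n == 0:
        return -np.inf
    return float(np.mean(y_subset)) - lam / np.sqrt(n)

def best_interval(x, y, lam=1.0, step=10):
    """Search intervals a <= x <= b for the highest penalized score."""
    cuts = np.unique(x)[::step]          # coarse grid of candidate end points
    best = (-np.inf, None, None)
    for i, a in enumerate(cuts[:-1]):
        for b in cuts[i + 1:]:
            inside = (x >= a) & (x <= b)
            score = penalized_score(y[inside], lam)
            if score > best[0]:
                best = (score, float(a), float(b))
    return best

# Toy data: the response is 1 only for cases with 80 <= x <= 120.
x = np.arange(201, dtype=float)
y = ((x >= 80) & (x <= 120)).astype(float)

score, a, b = best_interval(x, y, lam=1.0)
print(a, b, round(score, 3))
```

With `lam=1.0` the search recovers the full interval from 80 to 120. With `lam=0.0` every all-ones interval ties at a score of 1, and the search stops at the first one it meets (80 to 90), which shows why some penalty on size is needed.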

[Figure: Paradigm for data mining, selection of interesting subsets. A cases-by-variables data matrix with regions labeled Good Data and Bad Data.]

6. Role of visualization

Data visualization methods are attractive tools for analyzing such datasets for several reasons:

Data visualization methods show many features (expected and unexpected) of a dataset at once and, as such, are well equipped to pick up subtle structures of interest and anomalies as well as clear patterns.

They allow (in fact, encourage) flexible interaction with the data.

They can be more readily understood by non-statisticians (although their properties may not be).

Good user-friendly graphics software is becoming more readily available.

7. Data visualization methods

Large datasets create visualization challenges.

Scatterplots: Large numbers of points may hide the underlying structure.

Apply data binning and use an image graph.

Sometimes it is enough to graph a subset selected at random.
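Both remedies for crowded scatterplots can be sketched in a few lines. This is a minimal example assuming NumPy and synthetic data; the grid size (50 x 50) and the subset size (2,000) are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Synthetic data standing in for two variables from a large dataset.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.8, size=n)

# Data binning: count the points falling in each cell of a 50 x 50 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=50)

# Alternative: keep a simple random subset of the cases (no repeats).
idx = rng.choice(n, size=2_000, replace=False)
x_sub, y_sub = x[idx], y[idx]
```

The `counts` array is what an image plot displays, for instance by passing `counts.T` (often on a log scale) to an image-drawing routine, while `x_sub` and `y_sub` can go into an ordinary scatterplot.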

Many variables at once. There are many ingenious tools for this.

Scatterplot matrix:

- all variables

- all descriptor variables with color coding according to one response

- all response variables with color coding according to one descriptor

Plot selected 2D views to highlight some feature of the data:

- principal components analysis (spread)

- projection pursuit (clustering)

Look at all 2D views of the data via a dynamic display [rotating 3D display, grand tour]

Conditional plots

Multiple windows with brush and link
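Of the tools listed above, the principal components view is the simplest to compute: it is the 2D projection that shows the most spread. The sketch below is one standard way to obtain it, using a singular value decomposition of the centered data; the function name `pca_2d` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the two directions of largest variance."""
    Xc = X - X.mean(axis=0)                       # center each variable
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                        # coordinates in the first two components
    explained = s[:2] ** 2 / np.sum(s ** 2)       # share of total variance in each
    return scores, explained

rng = np.random.default_rng(0)
# Hypothetical data: 500 cases, 6 variables driven mostly by two hidden factors.
factors = rng.normal(size=(500, 2)) * np.array([5.0, 2.0])
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + rng.normal(scale=0.1, size=(500, 6))

scores, explained = pca_2d(X)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the 2D view; `explained` indicates how much of the total spread that view retains.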

8. [Figure: Scatter Plot and Image Plot panels.]

12. Feature recognition methods

Variable and case selection

Cluster analysis (unsupervised pattern recognition)

- partitioning methods (e.g., k-means, k-medoids)

- hierarchical methods (e.g., agglomerative nesting)

- fuzzy analysis

Classification (supervised pattern recognition, discriminant analysis)

- trees (e.g., CART, C5, Firm, Tree)

- model-based methods (e.g., logistic regression, sliced inverse regression)

- artificial neural networks

Role of robust methods / diagnostics

13. Trees

Classification & Regression Trees

Fit a tree model to data.

Recursive Partitioning Algorithm.

At each node we perform a split: we choose a variable X and a value t that minimize a criterion.

The split: L = {X < t}; R = {X ≥ t}
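The split search at a single node can be sketched as follows. The criterion used here, the total within-child sum of squared errors, is an assumption (a common choice for regression trees), and the search covers one variable only; a full recursive partitioning algorithm would repeat it over all variables and then recurse into L and R. The function name `best_split` is hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return (t, sse) for the split L = {x < t}, R = {x >= t} with the smallest total SSE."""
    best_t, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values of x.
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x < t], y[x >= t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_t, best_sse = float(t), float(sse)
    return best_t, best_sse

# Toy data with a jump in the response between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.1, 0.0, 0.2, 2.9, 3.1, 3.0])
t, sse = best_split(x, y)
print(t, round(sse, 4))
```

On this toy data the chosen threshold falls at the jump, t = 3.5.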

[Figure: a function f(X,Y) that is constant on rectangles, shown next to its tree form, with splits Y < 4, X < 3, and Y < 2.]

14. For regression trees, two criteria functions are:

For classification trees: criteria functions
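The criteria functions themselves are not reproduced here. As an illustration only, two impurity measures widely used for classification trees are the Gini index and the entropy; whether the slides used these is an assumption. The sketch below evaluates a split by the size-weighted impurity of its two children. The names `gini`, `entropy`, and `split_impurity` are chosen for this example.

```python
import numpy as np

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy in bits of the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_impurity(x, labels, t, impurity=gini):
    """Size-weighted impurity of the children of the split {x < t}, {x >= t}."""
    left, right = labels[x < t], labels[x >= t]
    n = len(labels)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Toy data: the class changes between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

print(gini(labels), entropy(labels))
print(split_impurity(x, labels, 3.5), split_impurity(x, labels, 2.5))
```

A split that separates the classes perfectly has weighted impurity 0; the tree-growing step picks the variable and threshold with the lowest value.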

15. Classification Tree

1) root 1000 249.300 0.4730
2) x [...] 0.931225 160 31.900 0.2750 *

ICLASS (columns: Interval, % Success): Interval -0.026 [...] 0.5

12791440.0 0.550634) KNEE96 [...] 1.36514 365685.0 0.9898 *
9) RBEDS>2.77141 159739.0 1.9480
18) ADM [...] 4.87542 112563.8 2.3880 *
5) HIP96>2.01527 10592992.0 1.5690
10) FEMUR96 [...] 2.28992 6682002.0 1.7840 *
3) HIP95>2.52265 18857672.0 2.9690
6) KNEE95 [...] 2.96704 7923022.0 3.6250
14) SIR [...] 9.85983 4261530.0 3.9790 *

20. Interval-focused regression (IREG)

For the j-th descriptor variable x_j, an interesting subset { a [...]

CASE STUDY: Data mining in Biopharmaceutical Research

Rapid technology advances (in database technology as well as in the biological sciences) have led to an explosive growth in the number of massive high-dimensional datasets appearing in pharmaceutical settings.

Some examples are high throughput screening data, DNA microchip and other genomics data, and consolidated clinical data.

Analysis of these data is spurred on by the anticipation that they contain nuggets of information valuable to drug discovery and development.

23. The structure activity database (SADB) consisted of several measurements of in vivo and in vitro activity on 876 compounds, together with several physicochemical properties of these compounds. Each concentration of each compound was recorded as 1 (response) or 0 (no response) for each assay, each route, and each species (there were also a large number of missing values).

In vivo measurements of biological activity:

- Anti Convulsant Assay (AC): positive effect

- Horizontal Screen Assay (HS): adverse effect

- rats and mice

- ip and po

24. In vitro results:

- IC50

- G-shift

- H-coefficient

Physicochemical properties:

- Molecular weight

- Molecular volume

- CMR

- MlogP

- Energy

- Total dipole moment

- Rotational bonds

- Atom types

- Connectivities between molecules

25. Objective

To determine whether the physicochemical properties, together with the in vitro activity measurements, are predictive of biological activity (and selectivity and bioavailability).

Given the data {(Y_ti, x_ji), i = 1, ..., N, t = 1, ..., q, j = 1, ..., p}, this involves studying the relationship between {Y_ti} and {x_ji}.

Initial considerations:

- Simplification: Z = I(Y [...] 4)

- Re-expression: log(IC50)

- Check for clear outliers in individual variables (remedy: winsorize)

27. [Figure: Results of Interval CART Analysis. Panels: LOG(IC50), G-shift, Hill, MLOGP, MOLWT, ENERGY, MOLVOL, ROTBONDS; vertical axis: PROB.]

28. [Figure: ICLASS Tree, with interval nodes on LOGIC50 (-2.63 to 1.66, 78/262) and on MOLVOL (244 to 276; 407 to 467); other surviving node counts: 20/38, (24/34).]

30. Case Study: Pima Indians Diabetes

768 Pima Indian females, 21+ years old

268 tested positive for diabetes

Variables:

PRG: Number of times pregnant

PLASMA: Plasma glucose concentration (2 hours into an oral glucose tolerance test)

BP: Diastolic blood pressure

THICK: Triceps skin fold thickness

INSULIN: 2-hour serum insulin

BODY: Body mass index (weight in kg / (height in m)^2)

PEDIGREE: Diabetes pedigree function

AGE: In years

RESPONSE: 1: Diabetes, 0: Not

32. Classification Tree

1) root 768 174.500 0.34900
2) PLASMA [...] 127.5 283 67.020 0.61480
6) BODY [...] 29.95 207 41.300 0.72460
14) PLASMA [...] 157.5 92 10.430 0.86960 *

34. [Figure: First Split of ICLASS CART Tree.]

35. [Figure: ICLASS Tree, with interval nodes on PLASMA (167 to 189; 154 to 164; 123 to 152) and PEDIGREE (0.328 to 1.154); surviving node counts: (54/61), (14/18), (28/42), (14/18), (34/35), (5/8), (28/42), (3/7), (78/437), (90/201), (0/2).]
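The root line of the Pima classification tree (768 cases, 174.500, 0.34900) can be checked against the case counts given for this study. The columns are not labeled in the listing, so this sketch assumes the second number is the node's sum of squared deviations of the 0/1 response from its mean, and the third is the mean response.

```python
n_cases = 768        # Pima Indian females in the study
n_positive = 268     # tested positive for diabetes

p = n_positive / n_cases              # mean of the 0/1 response
sum_sq = n_cases * p * (1.0 - p)      # sum of squared deviations of a 0/1 response from its mean

print(round(p, 5), round(sum_sq, 3))
```

Both values agree with the root line to the precision printed there, which supports reading the listing as node size, deviance, and mean response.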
