Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a...

34
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15 th 2006 Data Mining and Materials Informatics: a primer Krishna Rajan Department of Materials Science and Engineering NSF Intl. Materials Institute Combinatorial Sciences & Materials Informatics Collaboratory Iowa State University

Transcript of Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a...

Page 1: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Data Mining and Materials Informatics: a primer

Krishna Rajan

Department of Materials Science and EngineeringNSF Intl. Materials Institute Combinatorial Sciences &

Materials Informatics CollaboratoryIowa State University

Page 2: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

What is data? primary, secondary , derivative sources of data

What do we mean by mining data? data correlations –

dimensional analysis approachdata correlations –

data mining

What can we learn from data mining?

Classification data mining databaseshierarchy of datatrends in data

Predictionpredicting structure-property relationshipscomputational informatics vs. computational materials sciencedata mining for building databases

OUTLINE

Page 3: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

INFORMATICS

• Establish new correlations

• Identify outliers

• Enlarge database / virtual libraries

• Evaluate databases

• Establish predictions

What is it ? Why ?

Searching for patterns of behavior among multivariate data sets

Can pattern recognition lead to predictions?

Krishna Rajan

Page 4: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

COMPLEXITY OF BIOSYSTEMS

Page 5: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Data + Correlations + Theory = Knowledge-base

•• SimulationsSimulations•• Combinatorially derived datasetsCombinatorially derived datasets•• Digital librariesDigital libraries•• Spectral and imaging librariesSpectral and imaging libraries

•• Machine learning Machine learning •• Data compressionData compression•• Pattern recognitionPattern recognition

•• Atomistic based calculationsAtomistic based calculations•• Continuum based theoriesContinuum based theories

ULTRA LARGE SCALE INFORMATION SPACE

Informatics-BasedDesign

INFORMATICS STRATEGY

Machine learning toolkit:• Classification• Prediction• Hybrid tools

Page 6: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Ideker

and Lauffenburger:(2003)

INFORMATICS-BASED DESIGN STRATEGIES

BIOLOGICAL ANALOGUE CRYSTAL STRUCTURE ANALOGUE

Nye (1950)

Page 7: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Statistical Tools

Combinatorial & Spectral Libraries

Latent Variables /Partial Least Squares

(PLS)

Principal Component Analysis(PCA)

Support Vector Machine(SVM)

Materials DatabasesExperimental dataComputational deriveddata sets

Simulations

Association Mining(AM)

Classification

Prediction

Input & Output

Page 8: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Requirements

Global minima•

Categorical data•

Missing / variable data•

Skewed distributions•

Large data sets•

Scalable

Pitfalls

Local minima•

Categorical data difficult•

Convergence difficult for large data sets

Few outliers can lead to poor performance

MACHINE LEARNING AND MATERIALS DISCOVERY

Page 9: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

DATA CURATION

KNOWLEDGE DISCOVERY

Database administration & management

• thermodynamic, crystallographic & propertydata bases

• combinatorial experimental data

Oracle, Unix, Supercomputing, SQL

• taxonomy and ontology of materialsscience data

• data sharing , networking / cyber infrastructure

JAVA, HTML, Python

• object oriented programming language• visualization of high dimensional data

Data mining algorithms

• Clustering analysis • Quantitative Structure-Activity Relationships

(QSAR) for materials design

DATA REPRESENTATION

DATA STORAGE

Page 10: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Accelerated insertion of materials into engineering systems

Rapid multiscale design and optimization of materials properties

Establishment of new structure –property correlations among large, heterogeneous and distributed data sets

Discovery of new chemistries and compounds

Formulation and / or refinement of new theories for materials behavior

Rapid identification of critical data and theoretical needs for future problems

WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS ?

Page 11: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

OriginalData

TransformedData

Patterns

KnowledgeData

Warehousing

Feature Extraction

Data Mining & Visualization

Interpretation

SOFT MODELING vs. HARD MODELING

Data bases

DimensionalAnalysis and/or Theory

Constitutive Equations

Hard

modelingHard

modeling

Soft modelingSoft modeling

Krishna Rajan

Page 12: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

22x22 (484 cells –

2500 data points):

Silicon nitride descriptors

DATA MAP

Page 13: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

DIMENSIONALITY REDUCTION

Page 14: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

EIGENVALUE DECOMPOSITION

eigenvalueeigenvector

If V is nonsingular, this become the eigenvalue decomposition

Eigenvalues

on the diagonal of this diagonal matrix

Eigenvectors

forming the columns of this matrix

Sp pλ=

SP P= Λ

1S P P−= ΛLoading Matrix=Eigenvector Matrix

Eigenvalue Matrix

Page 15: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

PREDICTIONSPLS

‘Upgradeable’ linear predictive toolQuadratic and interaction terms :flexibility

Calculate latent factors from original predictor variablesT=XW where

T-

latent factors, X –

predictors, W -

weights

Use latent factors to predict responseY=TQ+E where Y -

responses, Q –

loadings, E-

noise

•• W maximizes covariance between Y and T

Factors

UT

Responses

Page 16: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

INFORMATICS AND THE PERIODIC TABLE

Page 17: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

DATA WAREHOUSING

Page 18: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

0 20 400 5000 100000 2000 40001 2 30 5 100 2 40 10 202 4 6

0

20

40

0

5000

10000

0

2000

4000

1

2

3

0

5

10

0

2

4

0

10

20

2

4

6

1 2 3 4 5 6 7

8

8

7

6

5

4

3

2

1

1. # of atoms/unit cell 2. Valence electron # 3. Electronegativity 4. 1st

Ionization potential

5. Atomic radius 6. Melting point 7. Boiling point 8. Density

BIVARIATE MAPS

Page 19: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

-2 -1 0 1 2-6 -4 -2 0 2 4 6-6 -4 -2 0 2 4 6

-2

-1

0

1

2

-6

-4

-2

0

2

4

6

-6

-4

-2

0

2

4

6

PC1 PC2 PC3

PC

3

P

C2

P

C1

SEEKING PATTERNS

Page 20: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

-4 -3 -2 -1 0 1 2-3

-2

-1

0

1

2

3

4

5

Scores on PC 2 (22.61%)

Sco

res

on P

C 3

(12.

02%

)

1

2 3 4

5

6 7 8

9 10

11

12 13

14

15

16 17 18

19 20

21

22

23

24

25 26

27

28

29 30

31 32

33 34 35 36 37 38

39 40 41 42 43 44

45

46 47 48

49

50

51

52 53

54

55

56 57

58

Na, Mg, Al

K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, and GaSr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn

Cs, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi

Each cluster represent each row in the periodic table

MENDELEEV SEQUENCING

Page 21: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Location of property in loading plot indicates influence of property on PC

Atomic weight has no influence on PC1, high influence on PC2 and 3

Clustered properties indicate relationships

Electrical and thermal conductivity•

Melting point and density•

Molar volume and atomic radius•

Melting and boiling points, heats of fusion and vaporization, and valence number

Pauling electronegativity and first ionization potential

AtomicWeight

PseudopotentialRadius

DensityPauling

electronegativity

FirstIonizationPotential

ThermalConductivity

ElectricalConductivity

Specific Heat

Martynov-Batsanov

electronegativity

MolarVolume

Covalentradius

Metallicradius

Heat ofFusion

Heat ofVaporization

ValenceElectronnumber Melting

Point

BoilingPoint

DEVELOPING PHYSICAL LAWS

Page 22: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

RPI Electrical Conductivity (0.00000001*ohm*m)1e-12 1e-10 1e-8 1e-6 1e-4 0.01

Ther

mal

Con

duct

ivity

(W/(m

*K))

1

10

100

32. Germanium (Ge)

6. Carbon, diamond (C)

52. Tellurium (Te)

83. Bismuth (Bi)

29. Copper (Cu)

34. Selenium (Se)

DEVELOPING PHYSICAL LAWS

20

30

40

50

60

70

500 1000 1500 2000 2500

Act

ivat

ion

Ener

gy, Q

[kca

l/mol

]

Tm

[K]

Pb

Ag

Au

Cu

NiCo

Fe Pt

18Rg

Linear regression

Bokshtein & Zhukhovitskii

Al

Activation energy for self-diffusion versus melting point

From Glicksman

Page 23: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

9 descriptors from NIST data-

Lattice parameters (a, b, c, α, β, γ)-

c/a, b/c-

V/abc

Hexagonal

MonoclinicTetragonal(P42

/mmc)

-100

-50

0

50

-50

0

50-40

-30

-20

-10

0

10

6

1 2 9 22 8 10 23 3 15 21 16 4 25 26 31 27 34 32 28 14 12 13

PC 1 (61.71%)

33 30 5 7 24 20 11

Scores Plot

18

29

19 17

PC 2 (29.17%)

PC

3 (

5.85

%)

1 2 9 22

8 10

23 3

15

21

16 4 25 26

31 34

27

32

28

14 12 13

Tetragonal(I-42d)

Orthorhombic

Cubic

PCA Score plot

(34 binary, ternary, quaternary compounds)

PCA of 34 binary, ternary, quaternary compounds: Score plotPCA of 34 binary, ternary, quaternary compounds: Score plot

Page 24: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

1 2 3 4 5 6789 10 1112 13 1415 16 1718 19202122 23 2425 262728 29 3031 32 3334

Cluster analysis of 34 binary, ternary, quaternary compoundsCluster analysis of 34 binary, ternary, quaternary compounds

Cubic(Fm-3m)

Tetragonal(I-42d)

Cubic(P41

32,P-43m,Pm-3,Pm-3,Fd-3ms)

Orthorhombic

Cubic(Im-3,Pm-3)

Tetragonal(I4/mcm)

Hexagonal (P63

/mmc,P6/mmm, P-6m2)/ Trigonal (P-3m1)

Tetragonal(P42/mmc)

Hexagonal (P-62m,P-6)Monoclinic

Hexagonal(P63

/mmc)

α, β, γ

=

90° α, β, γ ≠ 90°

Page 25: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Atomic packing of single element

Li Al Cu ScMg

BCC FCC HCP FCC HCP

sp electrons/atom† 1 3 2 11 2

Ex.) Engel’s model using “# of sp electrons/atom”†

BCC < 1.5HCP 1.7 –

2.1 FCC 2.5-3

RULE EXTRACTION

Page 26: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

PATTERN DETECTION

Page 27: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

-6-4

-20

24

6

-6

-4

-2

0

2

4

6

-6-4

-20

24

6

PC2(22.43

%)

PC3(

19.4

7%)

PC1(49.89%)

classification

Graduciz, Gaune-EscardUniv.Novi

Sad, SerbiaCNRS, Marseilles

covalent ionic

prediction

DATA MINING DATABASES

Page 28: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

COMPUTATIONAL ISSUES

Data can come across length and time scales

Focus on properties of signal / macroscopic behavior rather than noise/ error. Assume complexity !!!

Utilize data dimensionality reduction techniques

Analyze variation and correlation in data

Covariance

Establish correlations acrossdiverse data sets ( ie. length & time scales)

Identify outliers: explore cause

Develop predictive models-

Target requirements of missing data

-

Quantitatively assess data diversity

Model relationships in data to seek heuristic relationships:

Advanced statistical learning tools can deal with:-

skewed data-

missing data-

differentiate between local and global minima

-

ultra large scale datasets-

variable uncertainty

•Singular value decomposition•Support vector machines•Association mining•Fuzzy clustering

Establish multivariate database:

Seek DIVERSITY

in datasets

Krishna Rajan

Page 29: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

CRYSTAL CHEMISTRY DESIGN

1.

Assess influence of latentvariables ( i.e. electronic structure parameters) on properties of known data

2.

Establish heuristic relationships on database of all input variables instead of phenomenological relationships in bivariate manner

3.

Use statistical learning to predict new materials behavioron new multivariate input data

4. Inverse problem approach to formulate quantitative structure-

property relationships

Ching et.al J. Amer. Ceram.Soc. 85 75-80 (2002)

Krishna Rajan

Page 30: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Virtual library:via informatics

Refractory metals

Suh and Rajan (2005, 2006)

“Real” library:via first principles

calculations

Krishna Rajan

Page 31: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

LINKING DISPARATE LENGTH SCALESVisualizing Associations

Page 32: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

hcp

fcc

SYNTHESIZING REMOTE DATA SETS

Page 33: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

• Search and retrieve

• Refining descriptors

• Developing predictions

• Filling in missing data

• Structure databases

• Diffraction spectra databases

• Hyper spectral imaging databases

Mapping / Visualization of databases:

• Correlations• Trend analysis

STRATEGY FOR MATERIALS DESIGN

Page 34: Data Mining and Materials Informatics: a · PDF fileData Mining and Materials Informatics: a primer ... Rapid identification of critical data and theoretical needs for ... DATA WAREHOUSING

Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006

Science was originally empirical, like Leonardo, making wonderful drawings of

nature. Next came the theorists whotried to write down the equations that

explained the observed behaviors, like Kepler or Einstein. Then when we got to

complex enough systems like the clustering of a million galaxies, there came the

computer simulations, the computational branch of science. Now we are getting into the data exploration part of science,

which

is kind of a little bit of them all

””....

Dr. Alex Szalay: Virtual Observatory Project

May 20th 2003

The Facts of the Matter: Finding, understanding, and using information about our

physical world (2000)

CONCLUSIONS

Information science based design of Information science based design of materialsmaterials…………next stage of the scientific next stage of the scientific

discovery processdiscovery process

Division of Materials Research:Division of Materials Research:Cyber infrastructure and cyber discovery in Cyber infrastructure and cyber discovery in

materials science (2006)materials science (2006)