Tools in SPIDAL: Scalable Parallel Interoperable Data Analytics Library

Large Scale Data Analytics Workshop
August 9, 2013
Geoffrey Fox

gcf@indiana.edu http://www.infomall.org/

School of Informatics and Computing
Indiana University Bloomington

Overview

• Higher level (but still low level tools) Capabilities
  – Biology Distance Calculations
  – Clustering
  – Dimension Reduction
  – Plotting – Point and Heat maps
• Mode
  – Batch versus Online (Interpolative)
• Even Lower Level Capabilities
  – Eigenvalue Solvers
  – Conjugate Gradient Equation solvers
  – Robust $\chi^2$ Solver (Levenberg-Marquardt)
• Implementations http://beaver.ads.iu.edu:9000/
  – MPI and/or Iterative MapReduce run up to 1000 cores with MPI/Threading mix
  – Parallel Threading
  – Not much GPU
  – ~75,000 lines of Java or C#
• Techniques
  – Expectation Maximization (Steepest Descent)
  – Deterministic Annealing to avoid local minima
  – Second Order Newton solvers
• PS I have always worked on Optimization problems – first paper 1966

Multi-Dimensional Scaling

• Map general set of n points to d (typically 3) dimensions
  – Map genomic/proteomic sequences to ~20D as alternative to Multiple Sequence Alignment
• Original Data is vectors: Generative Topographic Map GTM and Deterministic Annealing + GTM
• No vectors – only a distance metric $d(X_i, X_j)$: Classic MDS with SMACOF and $\chi^2$ versions: Minimize Stress
  $\sigma(X) = \sum_{1 \le i < j \le n} \mathrm{weight}(i, j)\,[\delta(i, j) - d(X_i, X_j)]^2$
  – $\delta(i, j)$ are input dissimilarities
• SMACOF has deterministic annealing version
  – Recently added SMACOF for $\mathrm{weight}(i, j) \ne 1$; until this needed slower $\chi^2$ method for general problem
• Note $O(n^2)$ Objective function makes Newton’s method effective because conjugate gradient makes $O(n^3)$ linear equation solvers act as constant $\times$ $O(n^2)$ solver
• Interpolative (online) versions
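The stress function and a SMACOF-style update named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SPIDAL implementation: the update shown is the standard Guttman transform and assumes unit weights, and the function names and the toy data set are hypothetical.

```python
import math
import random

def pairwise_distances(X):
    """Euclidean distance matrix for a list of points."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

def stress(X, delta, weight=None):
    """Sum over i<j of weight(i,j) * (delta(i,j) - d(X_i, X_j))^2."""
    n = len(X)
    D = pairwise_distances(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weight is None else weight[i][j]
            total += w * (delta[i][j] - D[i][j]) ** 2
    return total

def smacof_step(X, delta):
    """One Guttman-transform update for unit weights: X_new = B(X) X / n."""
    n, dim = len(X), len(X[0])
    D = pairwise_distances(X)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and D[i][j] > 0.0:
                B[i][j] = -delta[i][j] / D[i][j]
        B[i][i] = -sum(B[i])  # diagonal makes each row sum to zero
    return [[sum(B[i][k] * X[k][c] for k in range(n)) / n for c in range(dim)]
            for i in range(n)]

# Toy problem (assumed data): dissimilarities taken from known 2D points.
random.seed(0)
truth = [[random.random(), random.random()] for _ in range(12)]
delta = pairwise_distances(truth)
X = [[random.random(), random.random()] for _ in range(12)]
history = [stress(X, delta)]
for _ in range(100):
    X = smacof_step(X, delta)
    history.append(stress(X, delta))
```

Each update costs $O(n^2)$ work, matching the cost of evaluating the stress itself, and the recorded `history` should never increase because the update minimizes a majorizing function of the stress.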

Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters

CoGNWSqrt(4D)

Clustering

• Vector Datasets $O(N)$: Kmeans plus deterministic annealing
  – Improved Elkan's algorithm (Qiu)
  – Continuous Clustering (solve for cluster density at a center)
  – Trimmed (finite size) clusters
• Non-metric Spaces with only dissimilarities defined $O(N^2)$ with deterministic annealing
• Deterministic Annealing (vector or non-vector) always improves results but is not always needed
  – Note: DA does not need specification of # clusters
  – Comes from early neural networks days (Hopfield, Durbin)
• Hierarchical algorithms (roughly built in with DA)
• Related algorithms Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), LDA (Latent Dirichlet Allocation) also with Deterministic annealing

~125 Clusters from Fungi sequence set

Non-metric space; sequences of length ~500; Smith-Waterman; a month on 768 cores

https://portal.futuregrid.org

Deterministic Annealing

• Minimum evolving as temperature decreases
• Movement at fixed temperature going to false minima if not initialized “correctly”

Solve Linear Equations for each temperature

Nonlinear effects mitigated by initializing with solution at previous higher temperature

F({y}, T)

Configuration {y}


Basic Deterministic Annealing

• H is objective function to be minimized as a function of parameters $\phi$
• Gibbs Distribution at Temperature T
  $P(\phi) = \exp(-H(\phi)/T) \,/ \int d\phi\, \exp(-H(\phi)/T)$
• Or $P(\phi) = \exp(-H(\phi)/T + F/T)$
• Minimize Free Energy combining Objective Function and Entropy
  $F = \langle H - T\,S(P) \rangle = \int d\phi\, \{P(\phi)\,H(\phi) + T\,P(\phi) \ln P(\phi)\}$
• Simulated annealing corresponds to doing these integrals by Monte Carlo
• Deterministic annealing corresponds to doing integrals analytically (by mean field approximation) and is much faster
• In each case temperature is lowered slowly – say by a factor 0.95 to 0.9999 at each iteration
• Start with one cluster, others emerge automatically as T decreases
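The relation between the Gibbs distribution and the free energy can be checked numerically. The sketch below is only an illustration under a simplifying assumption: the integral over parameters is replaced by a sum over a small, hypothetical set of configurations with made-up objective values. It verifies that $\langle H \rangle - T S$ evaluated at the Gibbs distribution equals $-T \ln Z$, where $Z$ is the normalizing sum.

```python
import math

def gibbs(H, T):
    """Gibbs probabilities exp(-H_s/T) / Z over a finite list of states."""
    h0 = min(H)  # shift energies so the largest exponent is zero
    w = [math.exp(-(h - h0) / T) for h in H]
    z = sum(w)
    return [v / z for v in w]

def free_energy(H, T):
    """F = <H> - T*S(P), with S = -sum P ln P, at the Gibbs distribution."""
    P = gibbs(H, T)
    mean_h = sum(p * h for p, h in zip(P, H))
    entropy = -sum(p * math.log(p) for p in P if p > 0.0)
    return mean_h - T * entropy

def minus_t_log_z(H, T):
    """The same free energy written as -T ln Z."""
    h0 = min(H)
    return h0 - T * math.log(sum(math.exp(-(h - h0) / T) for h in H))

# Hypothetical objective values for three configurations.
H = [0.0, 1.0, 3.0]
```

At very high temperature `gibbs(H, T)` is close to uniform, and at very low temperature it concentrates on the minimum of H, which is the behaviour annealing exploits as T is lowered.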


• Start at T = “∞” with 1 Cluster

• Decrease T, Clusters emerge at instabilities
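A toy version of this annealing schedule is sketched below for 1D points with a squared-distance cost. It is a simplified illustration, not the SPIDAL algorithm: the number of centers is fixed in advance (here two) and they start on top of each other at the data mean, so they behave as one cluster until the temperature drops far enough for them to separate. The function name, parameter defaults, and data are all assumptions.

```python
import math

def da_cluster(points, n_centers=2, t_start=200.0, t_stop=0.05,
               cooling=0.9, sweeps=20):
    """Deterministic-annealing clustering of 1D points (toy sketch)."""
    mean = sum(points) / len(points)
    centers = [mean] * n_centers  # all centers coincide: effectively one cluster
    T = t_start
    while T > t_stop:
        # Tiny symmetric nudge so coincident centers can split at an instability.
        centers = [c + 1e-3 * (k - (n_centers - 1) / 2)
                   for k, c in enumerate(centers)]
        for _ in range(sweeps):
            num = [0.0] * n_centers
            den = [0.0] * n_centers
            for x in points:
                d2 = [(x - c) ** 2 for c in centers]
                m = min(d2)  # shift exponents to avoid underflow at low T
                w = [math.exp(-(d - m) / T) for d in d2]
                z = sum(w)
                for k in range(n_centers):
                    p = w[k] / z  # Gibbs probability of center k given x
                    num[k] += p * x
                    den[k] += p
            # Mean-field update: each center moves to its weighted mean.
            centers = [num[k] / den[k] if den[k] > 0.0 else centers[k]
                       for k in range(n_centers)]
        T *= cooling  # lower the temperature slowly
    return sorted(centers)

# Assumed data: two well-separated groups around 0 and 10.
points = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
centers = da_cluster(points)
```

At the starting temperature the nudged centers are pulled back to the common mean on every sweep; only once T falls below a threshold set by the spread of the data does the nudge grow and the two centers move apart toward the two groups.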

Proteomics LC-MS 2D DA Clustering, T = 25000, 60 Clusters

[Figure: Number of Clusters versus Temperature, Start to End]

Proteomics 2D DA Clustering, T = 0.1: small sample of ~30,000 Clusters with Count >= 2

Orange sponge points: Outliers not in cluster

Yellow triangles: Centers

Clusters v. Regions

• In Lymphocytes clusters are distinct
• In Pathology, clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary

Pathology 54D

Lymphocytes 4D

Generalizing large Scale $\chi^2$

• $\chi^2 = \sum_{i=1}^{N} [t(i, x(k)) - e(i)]^2$
• “Just” an objective function to be minimized as a function of x(k)
• Parallelism over i and k
• If you can calculate $\partial t(i)/\partial x(k)$ (automatic compilers available), then can either do
  – “Steepest descent” $x(k) = x_0(k) - \mathrm{const} \sum_{i=1}^{N} [t(i, x) - e(i)]\, \partial t(i)/\partial x(k)$, guaranteed to decrease $\chi^2$ for small enough const but likely to get local minima
  – “Newton’s method” solving second order Taylor expansion always fails unless “regularized” but always succeeds with good solver (maybe too slow though)
• Levenberg-Marquardt regularizes $\sum_{i=1}^{N} \partial t(i)/\partial x(k)\; \partial t(i)/\partial x(l)$ by adding multiple Q of unit matrix
  – Solve equations by conjugate gradient ~ N (# parameters)$^2$
  – “Manxcat” (based on 1976 physics expt code) chooses Q etc.
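The Levenberg-Marquardt step with a conjugate gradient solve can be sketched as follows. This is a minimal illustration and not the Manxcat code: the exponential-decay model, the sample positions, and the rule for adjusting Q (halve after an accepted step, multiply by four after a rejected one) are all assumptions chosen for the example.

```python
import math

def model(x, s):
    """Hypothetical model t(i, x): a * exp(-b * s_i)."""
    return x[0] * math.exp(-x[1] * s)

def jacobian_row(x, s):
    """Analytic derivatives dt(i)/dx(k) of the model above."""
    e = math.exp(-x[1] * s)
    return [e, -x[0] * s * e]

def chi2(x, ss, es):
    return sum((model(x, s) - e) ** 2 for s, e in zip(ss, es))

def conjugate_gradient(A, b, iters=50):
    """Solve A z = b for a symmetric positive-definite matrix A."""
    n = len(b)
    z, r = [0.0] * n, b[:]
    p = r[:]
    rr = rr0 = sum(v * v for v in r)
    for _ in range(iters):
        if rr <= 1e-24 * rr0:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(p[i] * Ap[i] for i in range(n))
        z = [z[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rr_new = sum(v * v for v in r)
        p = [r[i] + (rr_new / rr) * p[i] for i in range(n)]
        rr = rr_new
    return z

def levenberg_marquardt(x, ss, es, Q=1.0, steps=100):
    n = len(x)
    for _ in range(steps):
        rows = [jacobian_row(x, s) for s in ss]
        res = [model(x, s) - e for s, e in zip(ss, es)]
        # sum_i dt(i)/dx(k) dt(i)/dx(l), plus Q times the unit matrix.
        A = [[sum(r[k] * r[l] for r in rows) + (Q if k == l else 0.0)
              for l in range(n)] for k in range(n)]
        g = [sum(r[k] * ri for r, ri in zip(rows, res)) for k in range(n)]
        dx = conjugate_gradient(A, [-v for v in g])
        trial = [x[k] + dx[k] for k in range(n)]
        if chi2(trial, ss, es) < chi2(x, ss, es):
            x, Q = trial, Q * 0.5  # accepted: move toward the Newton step
        else:
            Q *= 4.0               # rejected: move toward steepest descent
    return x

# Assumed noise-free data generated from parameters (2.0, 0.7).
ss = [0.5 * i for i in range(9)]
es = [model([2.0, 0.7], s) for s in ss]
fit = levenberg_marquardt([1.0, 0.3], ss, es)
```

A large Q makes the step a short move along the negative gradient, while a small Q recovers the second-order step, so adjusting Q interpolates between the two methods listed above.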

Overview

• Higher level (but still low level tools) Capabilities
  – Biology Distance Calculations
  – Clustering
  – Dimension Reduction
  – Plotting – Point and Heat maps
• Mode
  – Batch versus Online (Interpolative)
• Even Lower Level Capabilities
  – Eigenvalue Solvers
  – Conjugate Gradient Equation solvers
  – Robust $\chi^2$ Solver (Levenberg-Marquardt)
• Implementations http://beaver.ads.iu.edu:9000/
  – MPI and/or Iterative MapReduce run up to 1000 cores with MPI/Threading mix
  – Parallel Threading
  – Not much GPU
  – ~75,000 lines of Java or C#
• Techniques
  – Expectation Maximization (Steepest Descent)
  – Deterministic Annealing to avoid local minima
  – Second Order Newton solvers
• PS I have always solved Optimization problems – first paper 1966

Future

• Scale to lots more cores and larger problems (HPC, Azure …)

• Add GPUs or MIC chip
• Let's look at other problems
• Bundle as easy to use services

• Choose best batch set for large batch + online run
  – $O(N^2) \to O(N)$

• Exploit cluster as well as point parallelism (similar to better hierarchical method)
