MLconf NYC Ted Willke

Date posted: 17-Oct-2014
Category: Technology


Transcript of MLconf NYC Ted Willke

Page 1: MLconf NYC Ted Willke

CONTEXT

SEMANTICS!

Page 2: MLconf NYC Ted Willke

Danny : isBrotherOf : Nezih
food cart : uses : bicycles
Frank : isFriendsWith : Mohit
Frank : isFriendsWith : Ted
Frank : likes : bicycles
Frank : likes : food carts
Ivy : isFriendsWith : Kushal
Ivy : isFriendsWith : Ted
Ivy : likes : bicycles
Ivy : likes : food carts
Kushal : isFriendsWith : Mohit
Kushal : isFriendsWith : Nezih
Nezih : isFriendsWith : Ted
Ted : likes : bicycles

Page 3: MLconf NYC Ted Willke

This model... ... infers this interest.

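[Figure: graph with person nodes Ted, Kushal, Mohit, Danny, Ivy, Frank, and Nezih, joined by "friends" and "brothers" edges. "likes" edges connect people to the Bicycles and FoodCart nodes, and a "uses" edge connects FoodCart to Bicycles. One edge is labeled "Likes?" to mark the interest being inferred.]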
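
The kind of inference this slide points at can be sketched in a few lines of Python. This is a minimal illustration and not the model used in the talk: it assumes friendship is symmetric and scores an item for a person by counting how many of that person's friends like it. The function name `suggest_interests` is hypothetical.

```python
from collections import Counter

# Triples from the previous slide, as (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def suggest_interests(person, triples):
    """Rank items the person does not yet like by how many friends like them."""
    friends = set()
    likes = {}
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            # Assumption: friendship is symmetric.
            if subject == person:
                friends.add(obj)
            elif obj == person:
                friends.add(subject)
        elif predicate == "likes":
            likes.setdefault(subject, set()).add(obj)
    already_liked = likes.get(person, set())
    votes = Counter(
        item
        for friend in friends
        for item in likes.get(friend, set())
        if item not in already_liked
    )
    return votes.most_common()


print(suggest_interests("Ted", TRIPLES))
```

Under these assumptions, Ted's friends Frank and Ivy both like food carts, so "food carts" is the only suggestion for Ted, with two votes.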

Page 4: MLconf NYC Ted Willke

Virtuous cycle of data

CLOUD

Richer data to analyze

CLIENTS

Richer data from devices

Richer user experiences

INTELLIGENT SYSTEMS

Page 5: MLconf NYC Ted Willke

SEMANTIC INFORMATION IS FUEL FOR THE CYCLE

Page 6: MLconf NYC Ted Willke

1985 1995 2005 2015

enterprise

NoSQL

Docs+

Semantics

RDF

WIDESPREAD MACHINE LEARNING ON THIS

Page 7: MLconf NYC Ted Willke

IMAGINE THE POSSIBILITIES

Page 8: MLconf NYC Ted Willke

Graph centrality

Program Importance (Centrality): High to Low

Graph of channel viewing behavior

Current popular surfing patterns

SH002463130000 EP005544723744

Changes in surfing behavior may predict customer churn.
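
As a rough illustration of graph centrality on viewing behavior, the sketch below runs PageRank by power iteration over a handful of channel-change events. The slide does not say which centrality measure was used, so PageRank is an assumption here, and the channel names and events are invented for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical channel-change events: (channel left, channel tuned to).
SURFING = [
    ("sports", "news"),
    ("movies", "news"),
    ("kids", "news"),
    ("news", "sports"),
]

ranks = pagerank(SURFING)
for channel, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{channel}: {score:.3f}")
```

Recomputing such scores over successive time windows is one way to surface the changes in surfing behavior the slide mentions.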

Page 9: MLconf NYC Ted Willke

Preference and Similarity Recommendations

User

Movie

1.7MM Nodes, 23.9MM Edges

similar cast

prefers

similar topic

userId: A0A22A5

title: The Godfather
genre: Crime drama
cast: [M. Brando, Al Pacino]

title: Scarface
genre: Crime drama
cast: [Al Pacino, M. Pfeiffer]

title: The Departed
genre: Crime drama
cast: [L. DiCaprio, M. Damon]

weight=11.8

weight=0.67

weight=0.03

weight=14.98

Min-cost path search
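
A min-cost path search of this kind can be sketched with Dijkstra's algorithm. The node names come from the slide, but the transcript does not show which edge carries which weight, so the assignment of weights to edges below is hypothetical, as is the assumption that a lower weight means a stronger connection.

```python
import heapq


def min_cost_path(graph, start, goal):
    """Dijkstra's algorithm; graph maps node -> {neighbor: non-negative weight}."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical edge weights (the slide lists 11.8, 0.67, 0.03, and 14.98
# without a recoverable mapping to edges).
GRAPH = {
    "user A0A22A5": {"The Godfather": 0.67},               # prefers
    "The Godfather": {"Scarface": 0.03, "The Departed": 14.98},
    "Scarface": {"The Departed": 11.8},                     # similar topic
}

cost, path = min_cost_path(GRAPH, "user A0A22A5", "The Departed")
print(round(cost, 2), path)
```

With these assumed weights, the cheapest route from the user to The Departed goes through The Godfather and Scarface rather than along the direct 14.98 edge.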

Page 10: MLconf NYC Ted Willke

URL Ground-Truth Data

IP/Domain Reputations

420MM Records

74.5MM Nodes, 185MM Edges

URL

Domain

IP Address

Calculation of priors

LBP Messaging

Loopy Belief Propagation on the (semantic) web

84.231.82.93

86.39.155.137

forum.vsichko.com

hermansonskok.se

euskzzbz.nonetheups.com

keesenbep.spaces.live.com

Page 11: MLconf NYC Ted Willke

Loopy Belief Propagation on the (semantic) web
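
The two steps named on the previous slide, calculating priors and passing LBP messages, can be sketched as follows. This is a minimal sum-product implementation under stated assumptions, not the production system from the talk: labels are binary (index 0 benign, index 1 malicious), every edge uses the same potential that favors equal labels on neighbors, and the node names and prior values are invented, using reserved example names and addresses.

```python
def loopy_bp(priors, edges, same=0.9, iterations=20):
    """Sum-product loopy belief propagation with binary labels.

    priors: {node: [p_benign, p_malicious]}
    edges:  undirected (node, node) pairs
    same:   edge potential for neighbors sharing a label (assumed value)
    """
    neighbors = {node: set() for node in priors}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    potential = [[same, 1.0 - same], [1.0 - same, same]]
    # One message per directed edge, initialized to uniform.
    messages = {(a, b): [0.5, 0.5] for a in neighbors for b in neighbors[a]}

    for _ in range(iterations):
        updated = {}
        for src, dst in messages:
            outgoing = []
            for dst_label in (0, 1):
                total = 0.0
                for src_label in (0, 1):
                    incoming = 1.0
                    for other in neighbors[src]:
                        if other != dst:
                            incoming *= messages[(other, src)][src_label]
                    total += (priors[src][src_label]
                              * potential[src_label][dst_label]
                              * incoming)
                outgoing.append(total)
            norm = sum(outgoing)
            updated[(src, dst)] = [value / norm for value in outgoing]
        messages = updated

    beliefs = {}
    for node in priors:
        belief = list(priors[node])
        for other in neighbors[node]:
            for label in (0, 1):
                belief[label] *= messages[(other, node)][label]
        norm = sum(belief)
        beliefs[node] = [value / norm for value in belief]
    return beliefs


# Hypothetical priors: one URL with ground truth, two unknown nodes.
PRIORS = {
    "http://example.test/page": [0.05, 0.95],
    "example.test": [0.5, 0.5],
    "192.0.2.1": [0.5, 0.5],
}
EDGES = [("http://example.test/page", "example.test"), ("example.test", "192.0.2.1")]

beliefs = loopy_bp(PRIORS, EDGES)
for node, (benign, malicious) in beliefs.items():
    print(f"{node}: P(malicious) = {malicious:.3f}")
```

On this small chain the result is exact: suspicion attached to the URL propagates to its domain and, more weakly, to the IP address. On graphs with cycles the same updates give an approximation, which is what makes the method "loopy".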

Page 12: MLconf NYC Ted Willke

A yoga ball graph.

Really!?!

Page 13: MLconf NYC Ted Willke

You may actually need this

• When the problem is an information network

• When a graph is a natural way of expressing the algorithm

• When you want to study specific relationships

• When you want faster machine learning or solvers on sparse data

shortest path

central influence

sub networks

triangle count
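
Of the four graph measures labeled here, triangle count is the simplest to show in code. The sketch below counts each triangle once by intersecting neighbor sets. The edge list is a made-up example, and node labels are assumed to be mutually comparable (strings here).

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (node, node) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    total = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # Only accept a third node w with v < w, so every triangle
                # is counted exactly once, as the ordered triple u < v < w.
                total += sum(1 for w in neighbors[u] & neighbors[v] if v < w)
    return total


# Hypothetical edge list with two triangles: a-b-c and a-c-d.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")]
print(count_triangles(EDGES))
```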

Page 14: MLconf NYC Ted Willke

But there are challenges.

Handling all that data.

Finding people good at both handling all that data and data analysis.

Putting exploratory work into production fast enough to keep up with the competition.

Page 15: MLconf NYC Ted Willke

Congratulations! You are a data scientist!

Page 16: MLconf NYC Ted Willke

It’s a demanding job

Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize

Skills shortage at intersection of systems engineering and data analysis

Painful data ingestion and preparation

Workflows that are not designed with loopbacks in mind

Few tools for analyzing semantics at scale

Composing pipeline is DIY

Page 17: MLconf NYC Ted Willke

Decomposing the “data scientist”

Source: 2013 Report from Accenture Institute for High Performance

Page 18: MLconf NYC Ted Willke

IMAGINE A PLATFORM FOR DATA SCIENTISTS

DOCS + SEMANTICS + MACHINE LEARNING

Page 19: MLconf NYC Ted Willke

Ease-of-use: Making big data familiar

Python

R

Dataflow GUI

...

Datacenter / Cloud

Network

Client

BIG DATA

API

Connect
Manage
Secure
Analyze (distributed and parallel)

Manage
Secure
Connect
Analyze (local)
Query

Big Data Java/Scala/C++ Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage

Machine Learning & Statistics
Data Wrangling

Analyst Skills

The Other Skills

Page 20: MLconf NYC Ted Willke

Delivering it

FILESYSTEMS AND NOSQL STORAGE

HW PLATFORM

APACHE HADOOP
APACHE SPARK

DATA WRANGLING
MACHINE LEARNING AND STATISTICS

Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators

BIG DATA API

DATA SCIENCE SERVER (Query and Scripting)

Intel Analytics Toolkit

A UNIFIED DOCUMENT + SEMANTIC STORE

The Ask

Page 21: MLconf NYC Ted Willke

Bringing a full spectrum of possibilities

Approach (row grouping): Graph

Algorithm | Category | Applications/Use Cases
Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
Label Propagation | Structured Prediction | Personalized recommendations
Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
Connected Components | Graph Analytics | Network manipulation, image analysis
Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
Structure Attribute | Clustering | Network analysis, consumer seg
K-Truss | Clustering | Social network analysis
KNN* | Clustering | Recommenders
Logistic Regression* | Classification | Fraud detection
Random Forest* | Classification | Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
Association Rule Mining | Data Mining | Market basket analysis, recommenders
Frequent Pattern Mining* | Data Mining | Pattern Recognition

Page 22: MLconf NYC Ted Willke

Article Tagging Problem

• Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords

• Process is resource-intensive: can we automate it?

• Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary

Page 23: MLconf NYC Ted Willke

[Figure: Article Count by Hierarchy Level]

Page 24: MLconf NYC Ted Willke

Demo: Graph Analytics For Medical Journal Analysis

Pipeline stages: INGEST & CLEAN, ENGINEER FEATURES, STRUCTURE GRAPH, QUERY & ANALYZE, LEARN, VISUALIZE

PARSE AND EXTRACT WORDS
• Medline™ XML
• MeSH Ontology XML
• Create list of unique words
• Stemming and lemmatization

CREATE ARTICLE/WORD LIST
• Index word list
• Transform articles into list of article/word pairs

BUILD GRAPH
• Extract vertices
• Assign id columns to vertex property
• Assign year and count edge properties

QUERY/VISUALIZE DATA
• Gremlin query for each visual
• Python web server and other libraries

DETECT CLUSTERS USING LDA
• Select optimization parameters
• Invoke LDA
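
The middle steps of this pipeline (unique word list, word index, article/word pairs with a count edge property) can be sketched in plain Python. This is an illustration under simplifying assumptions, not the Intel Analytics Toolkit API: tokenization is a lowercase regular expression, stemming and lemmatization are omitted, the year property is left out, and the article ids and texts are invented.

```python
import re
from collections import Counter


def article_word_edges(articles):
    """Build a word index and (article id, word id, count) edges.

    articles: {article id: text}
    """
    counts = {
        article_id: Counter(re.findall(r"[a-z]+", text.lower()))
        for article_id, text in articles.items()
    }
    # Index the list of unique words.
    vocabulary = sorted({word for counter in counts.values() for word in counter})
    word_id = {word: index for index, word in enumerate(vocabulary)}
    # One edge per article/word pair, carrying the count as an edge property.
    edges = [
        (article_id, word_id[word], count)
        for article_id, counter in counts.items()
        for word, count in sorted(counter.items())
    ]
    return word_id, edges


# Hypothetical articles standing in for parsed Medline records.
ARTICLES = {
    "pmid-1": "Sleep stages and REM sleep",
    "pmid-2": "Circadian rhythm in rats",
}

word_id, edges = article_word_edges(ARTICLES)
for edge in edges:
    print(edge)
```

The resulting edge list is a bipartite article/word graph, the shape of input that a graph-based LDA implementation would typically consume.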

Page 25: MLconf NYC Ted Willke

The Playbook?

PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA

Parse
Prepare graph data
Basic analysis
Run LDA

INSIGHTFUL RESULT

This never happens!

Page 26: MLconf NYC Ted Willke

The Real Playbook

PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA

Parse

Correct mistake

Prepare graph data

Correct schema mistake

Correct aggregation mistake

Data validation

Correct dataset mistake

Guess LDA settings

Tune and re-run

Detect bias in dataset

Page 27: MLconf NYC Ted Willke

WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS

Page 28: MLconf NYC Ted Willke

Build Frame

Page 29: MLconf NYC Ted Willke

Build Graph

Page 30: MLconf NYC Ted Willke

Query Vertices

Page 31: MLconf NYC Ted Willke

LDA with 3 Topics

Page 32: MLconf NYC Ted Willke

LDA with 5 Topics

Page 33: MLconf NYC Ted Willke

LDA with 7 Topics

Page 34: MLconf NYC Ted Willke

Query Vertices Again – Now with ML Properties

Page 35: MLconf NYC Ted Willke

Following Analysis

Top MeSH terms that predict which category an article will be assigned (bar chart, horizontal axis from 0 to 0.08):

Wakefulness
Sleep
Animals
Electroencephalography
Circadian Rhythm
Arousal
Sleep Stages
REM
Mental Recall
Attention
Rats
Child
Evoked Potentials
Aged
Schizophrenia
Ocular
Conditioning
Infant
Psychophysics
Dreams

Page 36: MLconf NYC Ted Willke

Reimagining 2014

New partnerships in big data

Contributions to the open source community

The Intel Analytics Toolkit – COMING SOON

SEMANTICS + MACHINE LEARNING

TOGETHER AT LAST!

Page 37: MLconf NYC Ted Willke

INTERESTED IN THE INTEL ANALYTICS TOOLKIT? [email protected]

Page 38: MLconf NYC Ted Willke
Page 39: MLconf NYC Ted Willke

Legal Disclaimers

All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM).  Functionality, performance or other benefits will vary depending on hardware and software configurations.  Software applications may not be compatible with all operating systems.  Consult your PC manufacturer.  For more information, visit http://www.intel.com/go/virtualization

No computer system can provide absolute security under all conditions.  Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE).  Intel TXT also requires the system to contain a TPM v1.s.  For more information, visit http://www.intel.com/technology/security

Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2013, Intel Corporation. All rights reserved.