Close Encounters with Data Science

Close Encounters with Data Science Oct 28, 2015 Geoff Yuen, Ph.D. VP Emerging Technology, PCCW [email protected]

Transcript of Close Encounters with Data Science

Page 1: Close Encounters with Data Science


Page 2: Close Encounters with Data Science

What’s new about data?

• Data = values of qualitative or quantitative variables, belonging to a set of items (usually a population)

• Data = often unstructured (without a pre-established data model), usually raw files in different formats

Examples: chat, genome (DNA base pairs), picture

Page 3: Close Encounters with Data Science

Lots of Data ≠ Insight

Page 4: Close Encounters with Data Science

Data itself is not useful; we need insights!

It’s easy to get lost in your data

Jiawei Han, Abel Bliss Professor, Department of Computer Science, UIUC; “Pattern Discovery in Data Mining”, Coursera online course with 75,000 students, 2/2015

Page 5: Close Encounters with Data Science

[Figure labels: 2G Network, 3G Network, 2013 4G Network; 900 MHz, 1800 MHz, 2100 MHz]

The O2 mobile network has hundreds of cells to measure the trends in footfall across the country (Telefonica UK)

Network Data

Page 6: Close Encounters with Data Science

• Easier to use

• Further protecting anonymity

• Extrapolated to represent local population

Footfall is rendered into 200 x 200 metre grid squares

200 x 200 Grid
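
As a rough illustration of rendering footfall into grid squares, the sketch below bins position estimates into 200 x 200 metre cells and counts them. The coordinate convention (metres east and north of an arbitrary origin), the function names and the sample positions are assumptions made for this example; the slides do not describe how the operator actually does it.

```python
from collections import Counter

CELL_SIZE_M = 200  # grid square edge from the slide: 200 x 200 metres


def grid_cell(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map a position in metres to the (column, row) index of its grid square."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def footfall_per_cell(positions, cell_size=CELL_SIZE_M):
    """Count how many position estimates fall into each grid square."""
    return Counter(grid_cell(x, y, cell_size) for x, y in positions)


# Hypothetical, made-up position estimates (metres east, metres north).
sample_positions = [(10, 20), (150, 199), (200, 0), (450, 610), (399, 799)]
counts = footfall_per_cell(sample_positions)

for cell, n in sorted(counts.items()):
    print(cell, n)
```

Reporting only per-cell counts, rather than individual positions, is one way a grid can help with the anonymity point above.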

Page 7: Close Encounters with Data Science

Drilling into footfall demographics

Page 8: Close Encounters with Data Science

“US has killed Osama Bin Laden”

• average of 3,000 tweets per second

• 27,900,000 tweets in 2.5 hours

• peak of 12,384,000 tweets in one hour

Viral Social Data : From 1 to 14.9 million tweets in 5 minutes (1st May 2011)
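
A quick arithmetic cross-check of the quoted tweet volumes, using only the numbers on this slide (variable names are made up for the example). It converts the 2.5-hour total and the peak-hour total into per-second rates, which land close to the quoted average of about 3,000 tweets per second.

```python
tweets_total = 27_900_000      # tweets in 2.5 hours (from the slide)
window_hours = 2.5
peak_hour_tweets = 12_384_000  # tweets in the peak hour (from the slide)

avg_per_second = tweets_total / (window_hours * 3600)
peak_per_second = peak_hour_tweets / 3600

print(f"average over 2.5 hours: {avg_per_second:.0f} tweets/s")
print(f"during the peak hour:   {peak_per_second:.0f} tweets/s")
```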

Page 10: Close Encounters with Data Science

The data is the second most important thing

Jeff Leek, Assistant Professor of Biostatistics, Data Science Program, Johns Hopkins University:

Focus on the problem first …

Page 11: Close Encounters with Data Science
Page 12: Close Encounters with Data Science

Facebook “Likes” Predicting Personality

Facebook can predict personality from annotated data better than humans can

… except for a spouse

http://www.pnas.org/content/112/4/1036.full.pdf

Page 13: Close Encounters with Data Science

What’s New About Analytics

• Golden Age of Analytics (1995-): statistical machine learning has contributed many algorithms far more powerful than simple regression (list modified from Giovanni Seni, A9):

• 1983 CART (Tree)
• 1996 Lasso
• 1996 Bagging
• 1997 AdaBoost
• 2001 Random Forest
• 2003 Learning Ensembles
• 2004 Regularization & Boosted Lasso
• 2005-2013 Deep Belief / Deep Learning

Many ways to predict and classify structured and unstructured data now!

Page 14: Close Encounters with Data Science

1. Kinect Posture Detection

Page 15: Close Encounters with Data Science

Kinect detection of body segments

Page 16: Close Encounters with Data Science

Goal: Estimate Pose from Depth Image

Page 17: Close Encounters with Data Science

A single input depth image is segmented into a dense probabilistic body part labeling, with the parts defined to be spatially localized near skeletal joints of interest

From depth images to joint positions in 3D

Page 18: Close Encounters with Data Science

Challenges

Page 19: Close Encounters with Data Science

• 3 trees, each of depth 20, were trained from 1 million images

• Get 3D models for 15 bodies with a variety of weights, heights, etc.

• Synthesize mocap data for all 15 body types

• Capture and sample 500K mocap frames of people kicking, driving, dancing, etc.

Get Lots of Training Data into ‘3 trees’

Page 20: Close Encounters with Data Science

Kinect's reliable detection of body segments is based on the successful application of a famous analytic algorithm (random forest)
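
To make the random forest idea concrete, here is a toy sketch: many small trees, each trained on a bootstrap sample with a randomly chosen feature, vote on the label. It is not the Kinect pipeline. The trees here are depth-1 "stumps" rather than the depth-20 trees mentioned earlier, and the two-feature rows and the "hand"/"head" labels are invented for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return (feature, threshold, left_label, right_label)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Bagging plus random feature choice: one stump per bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random feature
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps in the forest."""
    votes = Counter(left if row[feature] <= threshold else right
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]


# Made-up training rows: two numeric features per example.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["hand"] * 4 + ["head"] * 4
forest = train_forest(rows, labels)

print(predict(forest, (0, 0)), predict(forest, (10, 10)))
```

Individual stumps are weak and can be thrown off by an unlucky bootstrap sample; the vote across many of them is what makes the ensemble reliable.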

Page 21: Close Encounters with Data Science

Opportunities

What application areas can benefit?
• Rehabilitation, motion training (martial arts, tennis, dry land training), elderly fall detection

With an aging population, fall detection and related services can be a major opportunity
• Australia: 30% of adults over 65 experience at least one fall per year; this group is predicted to increase from 14% to 23% (8.1 million) by 2050, costing $1.4 billion by 2051.
• China: population of 1,405 million vs 24 million for Australia, a factor of 58 bigger!

Recommend for HK: elderly fall detection and motion training

Page 22: Close Encounters with Data Science

Flyby Science is hard!

Page 23: Close Encounters with Data Science

Flyby Science (typical)

Page 24: Close Encounters with Data Science

Status Quo: Respond in days

Onboard analysis: Respond in minutes

Page 25: Close Encounters with Data Science

NASA JPL: better flyby surface feature recognition by random forests

Page 26: Close Encounters with Data Science

2. Deep learning

Page 27: Close Encounters with Data Science

By 2017, 10% of computers will be learning rather than processing (Gartner 2013)

[Figure labels: Structured Data vs. Unstructured Data; Regression; Linear or Logistic; Problem specific; Learning structure in data; non-Linear (polynomial); Knowledge specific]

Big Data finally found its analytic partner: deep learning

Page 28: Close Encounters with Data Science
Page 29: Close Encounters with Data Science
Page 30: Close Encounters with Data Science

CIFAR-10 results (units: accuracy %)

Rank | Result (%) | Method | Venue
1 | 94 | Lessons learned from manually classifying CIFAR-10 | unpublished 2011
2 | 91.78 | Deeply-Supervised Nets | arXiv 2014
3 | 91.2 | Network In Network | ICLR 2014
4 | 90.68 | Regularization of Neural Networks using DropConnect | ICML 2013
5 | 90.65 | Maxout Networks | ICML 2013
6 | 90.61 | Improving Deep Neural Networks with Probabilistic Maxout Units | ICLR 2014
7 | 90.5 | Practical Bayesian Optimization of Machine Learning Algorithms | NIPS 2012
8 | 89 | ImageNet Classification with Deep Convolutional Neural Networks | NIPS 2012
9 | 88.79 | Multi-Column Deep Neural Networks for Image Classification | CVPR 2012
10 | 84.87 | Stochastic Pooling for Regularization of Deep Convolutional Neural Networks | arXiv 2013

• The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.

• There are 50000 training images and 10000 test images.

• Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
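
The dataset figures above can be sanity-checked with a few lines of arithmetic. The sketch assumes three colour channels per pixel (RGB) and uses the balanced classes to derive a chance-level accuracy, which gives some context for the accuracy column in the results table.

```python
n_images = 60_000
n_classes = 10
n_train, n_test = 50_000, 10_000

per_class = n_images // n_classes    # images per class
values_per_image = 32 * 32 * 3       # assumption: 3 colour channels (RGB)
chance_accuracy = 100 / n_classes    # % correct when guessing among balanced classes

print(per_class, values_per_image, chance_accuracy)
```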

Page 31: Close Encounters with Data Science

Error Back Propagation

Parallel Error Correction

Page 32: Close Encounters with Data Science

Train this layer first

Learning Layer by Layer

Page 36: Close Encounters with Data Science

Train this layer first

then this layer

then this layer

then this layer

finally this layer

The new way to train multi-layer NNs…

Page 37: Close Encounters with Data Science

Each of the middle layers is trained to be an auto-encoder

... basically, it is forced to learn good features of what comes from the previous layer

The new way to train multi-layer NNs…
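
The layer-by-layer idea can be sketched in a few dozen lines. Each layer is trained on its own as an auto-encoder (encode, decode, minimise reconstruction error), and its codes then become the training data for the next layer. This is a minimal pure-Python illustration under simplifying assumptions: linear units, plain stochastic gradient descent, made-up data, and hypothetical layer widths. It covers only the unsupervised pre-training step, not a full deep network.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def reconstruction_error(data, enc, dec):
    """Mean squared reconstruction error of decode(encode(x)) over the data."""
    total = 0.0
    for x in data:
        rebuilt = matvec(dec, matvec(enc, x))
        total += sum((r - xi) ** 2 for r, xi in zip(rebuilt, x))
    return total / len(data)


def train_autoencoder(data, n_hidden, epochs=400, lr=0.02, seed=0):
    """Train one linear auto-encoder layer with stochastic gradient descent."""
    rng = random.Random(seed)
    n_in = len(data[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in data:
            hidden = matvec(enc, x)
            error = [r - xi for r, xi in zip(matvec(dec, hidden), x)]
            # Error propagated back to the hidden units (before dec is updated).
            back = [sum(dec[i][j] * error[i] for i in range(n_in))
                    for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    dec[i][j] -= lr * error[i] * hidden[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    enc[j][k] -= lr * back[j] * x[k]
    return enc, dec


# Made-up inputs with only two underlying degrees of freedom: (a, a, a, b, b, b).
data = [[a, a, a, b, b, b]
        for a, b in [(1, 0), (0, 1), (1, 1), (0.5, 0), (0, 0.5), (0.5, 1)]]

layer_sizes = [3, 2]  # hypothetical hidden-layer widths
layer_input = data
errors = []           # (error before training, error after training) per layer
for depth, n_hidden in enumerate(layer_sizes):
    enc0, dec0 = train_autoencoder(layer_input, n_hidden, epochs=0, seed=depth)
    enc, dec = train_autoencoder(layer_input, n_hidden, seed=depth)
    errors.append((reconstruction_error(layer_input, enc0, dec0),
                   reconstruction_error(layer_input, enc, dec)))
    # The codes of this layer become the training data for the next layer.
    layer_input = [matvec(enc, x) for x in layer_input]

for depth, (before, after) in enumerate(errors, start=1):
    print(f"layer {depth}: reconstruction error {before:.4f} -> {after:.4f}")
```

In practice the units would be non-linear and the pre-trained stack would typically be fine-tuned afterwards with labelled data; the sketch leaves both out to keep the layer-wise loop visible.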

Page 38: Close Encounters with Data Science

Deep Learning for unstructured data

• The previous paradigm for feature detection and prediction from data was based on modelling and optimization. Researchers around the world have now shown “deep learning” surpassing that performance on many problems.

“Tech 2015: Deep Learning And Machine Intelligence Will Eat The World” Forbes 12/2014

• Deep learning scales well with big data to learn a “layering of knowledge” in hidden layers, without the handcrafted feature detectors that past machine learning methods required. Convergence time proof for RBM.

• Demonstrated impressive improvements in diverse areas: speech recognition, object recognition in images, targeted advertising, fraud detection, personalization
• Speech recognition: Microsoft, Google & Apple competing mobile “digital assistants” (Google Now vs Siri vs Cortana, 9/2014). Digital assistants will drive mCommerce & 50% of US digital purchases in 2017 (Gartner)
• Object recognition: Facebook mining user images for intentions (NYT)
• Real-time translation: Skype
• World Cup / NBA prediction 2014 (MS)
• Others: Baidu, IBM, Yahoo, Tencent, Netflix, Adobe, NEC, Toyota
• Telco-centric vendors: Wise-athena, Dataspark, Zettics

Page 39: Close Encounters with Data Science

Deep learning has created breakthroughs in object and speech recognition.

But also watch other areas : sports prediction, natural language processing, churn prediction, targeted advertising, customer segmentation

Page 40: Close Encounters with Data Science

2014 Survey of Deep Learning Vendor Claims

Application | Previous Accuracy | Data used to train model | Latest Accuracy | Company
Speech Recognition | 75% | 680 speakers, 10 sentences each | 94% (2013) | Google, IBM, Skype, MS
Object recognition | 70% | 1.2 mil images | 95% (2015) | Baidu, Google, Facebook
Target Advertising | <1% (Banner Ads) | 220K users | 22% | NDA
Personalization | na | 220K users | 27% | NDA
Churn Prediction (Telco) | 69% (SAS) | 300 mil CDRs, 1.8 mil users | 82% | NDA
Dealer Fraud Detection (Telco) | <40% (reactive) | 700 mil CDRs, 1.2 mil users | 80% (predictive) | NDA

• Other big companies in related efforts : Baidu, IBM, Yahoo, Tibco, Tencent, Netflix, Adobe, NEC, Toyota

Page 41: Close Encounters with Data Science

Speech Recognition : the race is on

Page 42: Close Encounters with Data Science

Contextual Mobile Targeting

Contextual & unstructured data, used with machine learning technology, also improves advertising accuracy: +219%

Page 43: Close Encounters with Data Science

Customer visibility: Accuracy and Algorithm speed

Manual test of the algorithm

• Several cameras can observe the same area

• Aggregated signals with a proper threshold will perfectly match the ground truth

Algorithm speed

• Calibration: manual

• Runtime: 60 msec/frame

[Figure: aggregated signal vs. ground truth (is a person in front of ATM 7), per minute over minutes 1-59, vertical axis 0 to 1.2]
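
A minimal sketch of the "aggregate, then threshold" idea for several cameras watching the same area. The aggregation rule (a plain average), the 0.5 threshold and all the per-minute scores are assumptions made up for this example; the slides do not say how the signals are combined.

```python
def aggregate(camera_signals):
    """Average the per-minute detection scores of cameras watching the same area."""
    n_cameras = len(camera_signals)
    return [sum(scores) / n_cameras for scores in zip(*camera_signals)]


def detect_presence(camera_signals, threshold=0.5):
    """Flag a minute as 'person present' when the aggregated score reaches the threshold."""
    return [score >= threshold for score in aggregate(camera_signals)]


def accuracy(predicted, ground_truth):
    """Fraction of minutes where the prediction matches the manual ground truth."""
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


# Made-up per-minute scores from three hypothetical cameras, plus manual ground truth.
cam_a = [0.1, 0.9, 0.8, 0.2, 0.0]
cam_b = [0.0, 0.7, 0.9, 0.4, 0.1]
cam_c = [0.2, 0.8, 0.4, 0.3, 0.0]
truth = [False, True, True, False, False]

predicted = detect_presence([cam_a, cam_b, cam_c], threshold=0.5)
print(predicted, accuracy(predicted, truth))

# 60 msec per frame corresponds to roughly 16.7 frames per second.
frames_per_second = 1000 / 60
```

In this made-up data, camera C alone would miss the third minute (score 0.4), while the aggregated signal still crosses the threshold.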

Page 44: Close Encounters with Data Science

Utilization

Average daily utilization is 10%. Highest values (20%) are on weekends, Saturdays mostly, except Chinese New Year. Lowest utilization is on the 11th of March (1%).

Recorded coverage

There is recording in 30%-90% of the hours and 10%-70% of total time. This highly correlates with daily utilization, but the weekly cycle is more obvious.

[Figure: Daily utilization. Two charts by day from 2015-02-15 to 2015-03-23, with vertical axes 0%-25% and 0%-100%; annotations: Sat (Saturdays), CNY, School holiday, Data error]

Page 45: Close Encounters with Data Science

Customer demography: Accuracy and Algorithm speed

Manual test of the algorithm

• Average age/gender accuracy of algorithms with 48x48 face size = 92%

• Our current algorithm at the desk with face size of 40x40 = 72%

• Accuracy will be improved up to 85%, using tilted face + body corpus

Algorithm speed

• Calibration: one-time

• Runtime: irrelevant

Page 46: Close Encounters with Data Science

Opportunities

What application areas can benefit?
• Internet: Baidu targeted advertising, Facebook sentiments from face photos
• Commercial: fraud detection, churn prediction, food detection, weapons detection
• Others: disability assistance, object recognition for the blind, speech recognition for the deaf, cancer tissue recognition

Specific application example
• Bank customer recognition

Recommend for HK: biggest market impact may be in health image processing and online education

Page 47: Close Encounters with Data Science

3. Networks

Page 48: Close Encounters with Data Science

How did Google beat previous search engines?

Aside from searched content, it also uses URL data patterns (links)*

An additional data type can make a huge difference!

* Eric Schmidt, “How Google Works”; also see http://www.economist.com/node/3171440

Page 49: Close Encounters with Data Science

Genetic Basis of Diseases

Page 50: Close Encounters with Data Science

Asthma : known to have multiple variant gene sequences

“ Simple Regression ” “ Multivariate Sparse Lasso Regression ”

Page 51: Close Encounters with Data Science

A novel statistical method allows joint network analysis of correlated phenotypes

Eric Xing (2014)

Advantages

• Greater power to detect weak associations

• Fewer false positives

• Joint association to multiple correlated phenotypes

Page 52: Close Encounters with Data Science

Asthma Trait Network

Page 53: Close Encounters with Data Science


FB data only

Asian Telco data versus Facebook - 1

Analysing family relations with graphs

Page 54: Close Encounters with Data Science

Asian Telco data versus Facebook - 2

Telco data only

Campaign Targeting using URL + Social

Data Types | Response rate
Normal / Control | 0.20%
With Social | 0.49%
Social + URL | 2.30%

Romantic Partner Relationship Prediction

Data Types | Accuracy
SMS No. | 25%
SMS No. + CDR graphs | 75%
SMS No. + CDR & Location graphs | 85%
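
The campaign response rates quoted on this slide are easier to compare as a lift over the control group. The short calculation below uses only those three percentages; the variable names are invented for the example.

```python
# Response rates in percent, as quoted for campaign targeting.
response_rate = {
    "Normal / Control": 0.20,
    "With Social": 0.49,
    "Social + URL": 2.30,
}

baseline = response_rate["Normal / Control"]
lift = {name: rate / baseline for name, rate in response_rate.items()}

for name, factor in lift.items():
    print(f"{name}: {factor:.2f}x the control response rate")
```

Social data alone gives roughly a 2.5x lift, while adding URL data brings it to about 11.5x.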

Page 55: Close Encounters with Data Science

Combined Social Networking Graph

Telco Data and Facebook combined!

Data sources (Telco + FB data): CDR, Facebook, Location, URL, Survey, Registration, Loyalty

Results:
1. Improved demographic prediction: Age (45% -> 63%), Gender (45% -> 70%)
2. Inferring romantic partner from SMS/CDR
3. Inferred family relationships, colleagues & communities

Page 56: Close Encounters with Data Science

• Wave 1 churners in red
• Wave 2 churners in pink
• Own customers in yellow
• Competitor customers in green
• Very active customers in blue

Finding: Wave 1 (red) churners are contagious (followed by pinks) when local community members are less embedded in the network

Viral churn in service providers: prioritize key opinion leaders before they leave!

Page 57: Close Encounters with Data Science

Capturing network properties can improve prediction

Page 58: Close Encounters with Data Science

• Finding a friend of a friend in a social network requires one join operation in a relational database (RDBMS), so six degrees of separation require six joins. A graph DB can solve this with six simple traversals, which is fast and scalable to millions

Real-life benchmark: a MySQL DB with 1M users, each with 50 friends

Database | Depth (levels of friends of friends) | Execution time (seconds) | Result count
MySQL | 2 | 0.016 | ~2,500
MySQL | 3 | 30.267 | ~125,000
MySQL | 4 | 1,543.505 | ~600,000
MySQL | 5 | Not finished (days) | N/A
Neo4J (graph DB) | 2 | 0.01 | ~2,500
Neo4J (graph DB) | 3 | 0.168 | ~110,000
Neo4J (graph DB) | 4 | 1.359 | ~600,000
Neo4J (graph DB) | 5 | 2.132 | ~800,000

• RDBMS join performance suffers beyond 2 levels because of the huge Cartesian product that results from each join operation.
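
The difference between joins and traversals can be sketched with a breadth-first search over adjacency lists, which visits each person at most once no matter how many paths lead to them. This is a generic illustration, not Neo4j's or MySQL's implementation; the toy graph and names are made up. The last lines show how the number of rows a naive self-join would produce grows as 50 to the power of the depth when every user has 50 friends.

```python
from collections import deque


def friends_within(graph, start, max_depth):
    """Return everyone reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in seen:  # each person is visited only once
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical toy friendship graph (adjacency lists).
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

for depth in (1, 2, 3):
    print(depth, sorted(friends_within(graph, "ann", depth)))

# Rows produced by naive repeated self-joins with 50 friends per user.
join_rows = [50 ** depth for depth in (2, 3, 4, 5)]
print(join_rows)
```

The first two values, 2,500 and 125,000, line up with the result counts reported for MySQL at depths 2 and 3. Beyond that the number of join rows far exceeds the 1M users in the database, so most of the work goes into paths that lead to people already found.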

How to learn network properties ?

Page 59: Close Encounters with Data Science


Page 60: Close Encounters with Data Science

Opportunities

What application areas can benefit ?

• marketing: recommendation, churn and loyalty

• health: family social disease inheritance, personalized medicine, health education and engagement

• education: socially assisted

Recommend for HK: digital marketing, education and health

Page 61: Close Encounters with Data Science

Conclusion

Advancing technologies for deriving insights from growing types and amounts of data point to many new opportunities ahead

Questions ? Email [email protected]

Special Thanks to : Mr. William Mak