Big Data at Ancestry.com

DNA Learning from Data: Who Do You Think You Are? Scott Sorensen and Leonid Zhukov


Presentation at Big Data Summit, April 2013, SF

Transcript of Big Data at Ancestry.com

Page 1: Big Data at Ancestry.com

DNA Learning from Data:

Who Do You Think You Are?

Scott Sorensen and Leonid Zhukov

Page 2: Big Data at Ancestry.com

Ancestry.com Mission

Page 3: Big Data at Ancestry.com

Discoveries

It’s the “aha” moment of a discovery that drives our business!

Page 4: Big Data at Ancestry.com

World’s largest online family history resource

Historical Content

• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to the 16th century

Page 5: Big Data at Ancestry.com

World’s largest online family history resource

User Contributed Content

• 45 million family trees
• More than 4 billion profiles
• 200 million stories and photos

Page 6: Big Data at Ancestry.com

DNA Data

• Over 120,000 DNA samples
• 700,000 SNPs for each sample
• 2,000,000 4th cousin matches

Spit in a tube, pay $99, learn your past
Derrick Harris - GigaOm

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)  
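To make the caption concrete, here is a minimal sketch of spotting a single-base difference between two sequences. The two fragments are made up for illustration and are assumed to be already aligned and of equal length; real genotyping data is not processed this way.

```python
from typing import List, Tuple


def snp_positions(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (index, base_a, base_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical fragments that differ at one base (a C/T polymorphism).
molecule_1 = "AAGCCTA"
molecule_2 = "AAGCTTA"
print(snp_positions(molecule_1, molecule_2))  # prints [(4, 'C', 'T')]
```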

Page 7: Big Data at Ancestry.com

User Behavior Data

• 40 million searches / day
• 10 million people added to trees / day
• 5 million Hints accepted / day
• 3.5 million Records attached / day

[Charts: time axes run from 1/12 to 12/12]

Page 8: Big Data at Ancestry.com

Real-time data feed

Page 9: Big Data at Ancestry.com

Technology

Machine Learning

Page 10: Big Data at Ancestry.com

Person and record search

• Search query

Page 11: Big Data at Ancestry.com

Hint suggestions system

• Hints - suggestions to attach a record

Page 12: Big Data at Ancestry.com

Record linkage

• Record linkage – finding and matching records in multiple data sets with non-unique identifiers

• Goal: bring together information about the same person

• Some non-unique identifiers:
  – Names: first name, last name (John Smith – 300,000 records)
  – Dates: date of birth, date of death
  – Places: place of birth, residence, place of death
  – Extra: family members, life events

• Records are often incomplete

• Records contain mistakes

• Exact and fuzzy match
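The difference between exact and fuzzy matching on non-unique identifiers can be sketched as follows. This is an illustration only: the field names, the 0.8 threshold and the use of `difflib` string similarity are assumptions, not the method used at Ancestry.com.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Sequence

Record = Dict[str, Optional[str]]


def exact_match(a: Record, b: Record, fields: Sequence[str]) -> bool:
    """True only if every field is present in both records and identical."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)


def fuzzy_match(a: Record, b: Record, fields: Sequence[str],
                threshold: float = 0.8) -> bool:
    """True if every field present in both records is similar enough.

    Fields missing from either record are skipped, because records are
    often incomplete; at least one field must be comparable.
    """
    compared = 0
    for f in fields:
        x, y = a.get(f), b.get(f)
        if x is None or y is None:
            continue
        compared += 1
        if SequenceMatcher(None, x.lower(), y.lower()).ratio() < threshold:
            return False
    return compared > 0


FIELDS = ("first", "last", "birth_year", "birth_place")
tree_person = {"first": "John", "last": "Smith",
               "birth_year": "1870", "birth_place": "Boston, MA"}
# A record with a misspelled first name and no birth place.
record = {"first": "Jon", "last": "Smith",
          "birth_year": "1870", "birth_place": None}

print(exact_match(tree_person, record, FIELDS))  # False
print(fuzzy_match(tree_person, record, FIELDS))  # True
```

The exact rule rejects the pair because of one misspelling and one missing field, while the fuzzy rule tolerates both.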


Page 13: Big Data at Ancestry.com

Life events in collections

• Life events
  – Birth: 2.59 bln
  – Marriage: 114 mln
  – Census: 2.74 bln
  – Death: 467 mln

• Total: 5.91 bln events

Page 14: Big Data at Ancestry.com

Candidate set funnel: exact match

John Smith: 300,000

John Smith, 1870: 2,200

John Smith, 1870, Boston, MA: 10

Search: high precision

Page 15: Big Data at Ancestry.com

Candidate set funnel: fuzzy match

John Smith: 380,000

John Smith, 1870: 97,000

John Smith, 1870, Boston, MA: 1,400

Exploration: high recall
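The two funnels can be sketched as a chain of filters whose candidate set shrinks at each stage. The five records and the fuzzy rules below (name variants, a two-year date window, a place prefix) are invented for illustration; they only mimic the shape of the slides, not the real counts.

```python
records = [
    {"name": "John Smith", "year": 1870, "place": "Boston, MA"},
    {"name": "John Smith", "year": 1871, "place": "Boston, MA"},
    {"name": "Jon Smith", "year": 1870, "place": "Boston"},
    {"name": "John Smith", "year": 1902, "place": "Salem, MA"},
    {"name": "Mary Jones", "year": 1870, "place": "Boston, MA"},
]


def funnel(records, stages):
    """Apply predicates in order; return the candidate-set size after each."""
    sizes = []
    candidates = records
    for keep in stages:
        candidates = [r for r in candidates if keep(r)]
        sizes.append(len(candidates))
    return sizes


exact_stages = [
    lambda r: r["name"] == "John Smith",
    lambda r: r["year"] == 1870,
    lambda r: r["place"] == "Boston, MA",
]
fuzzy_stages = [
    lambda r: r["name"] in {"John Smith", "Jon Smith"},  # name variants
    lambda r: abs(r["year"] - 1870) <= 2,                # extended dates
    lambda r: r["place"].startswith("Boston"),           # partial place
]

print(funnel(records, exact_stages))  # prints [3, 1, 1]
print(funnel(records, fuzzy_stages))  # prints [4, 3, 3]
```

The exact funnel returns few, highly precise candidates; the fuzzy funnel keeps more of them, trading precision for recall.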

Page 16: Big Data at Ancestry.com

Results set

[Diagram. Labels: Names edit distance, Extended dates, Missing fields, Short names, Initials, Exact match]
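Two of the fuzzy categories above, name edit distance and initials, can be sketched with standard techniques. The slides do not say which distance is used; Levenshtein distance is assumed here, and `initial_matches` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def initial_matches(short: str, full: str) -> bool:
    """True if `short` is a one-letter initial (optionally dotted) of `full`."""
    s = short.rstrip(".")
    return len(s) == 1 and full[:1].lower() == s.lower()


print(edit_distance("Smith", "Smyth"))  # prints 1
print(edit_distance("Jon", "John"))     # prints 1
print(initial_matches("J.", "John"))    # prints True
```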

Page 17: Big Data at Ancestry.com

Hints suggestion system

• User feedback loop:
  – Accept suggestion
  – Reject suggestion

Page 18: Big Data at Ancestry.com

A place for machine learning

• Supervised machine learning

• Learn similarity measure (how to combine identifiers)

• Training & testing sets:
  – User accepts, rejects

• Features (> 500):
  – First and last name, date of birth (DOB), place of birth (POB), date of death (DOD), place of death (POD)
  – Parents, children, siblings, spouses
  – Fuzzy matches

• Similar to “learning to rank” problem

[Diagram. Labels: ML suggest, Candidate k-set, Person, Record, ?]
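A minimal sketch of how accepted and rejected hints could become labeled training rows for a supervised model. The six features and the field names are assumptions chosen for illustration; the deck mentions more than 500 features without listing them in full.

```python
from difflib import SequenceMatcher


def pair_features(person, record):
    """Turn one (person, record) pair into a small numeric feature vector."""
    def sim(field):
        x, y = person.get(field), record.get(field)
        if not x or not y:
            return 0.0
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()

    def same(field):
        x, y = person.get(field), record.get(field)
        return 1.0 if x is not None and x == y else 0.0

    def missing(field):
        return 1.0 if record.get(field) is None else 0.0

    return [sim("first"), sim("last"), same("dob"), same("pob"),
            missing("dob"), missing("pob")]


def training_rows(feedback):
    """feedback: iterable of (person, record, accepted) from the hint loop."""
    return [(pair_features(p, r), 1 if accepted else 0)
            for p, r, accepted in feedback]


person = {"first": "John", "last": "Smith", "dob": "1870", "pob": "Boston, MA"}
record = {"first": "John", "last": "Smith", "dob": "1870", "pob": None}
print(training_rows([(person, record, True)]))
```

Rows of this shape (feature vector plus a 0/1 label) are what a classifier or a learning-to-rank model would be trained on.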

Page 19: Big Data at Ancestry.com

Similarity measure learning

[Diagram. Labels: Training, Scoring, Hadoop, Hive, Ancestry collections, Member trees, Feature generation, Person ID, Record ID, Label, ML Random forest, Model, Index, Top-k records candidate set, Ranked list]
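The scoring side of the diagram (candidate set, feature generation, model, ranked list) can be sketched as below. `toy_featurize` and `toy_model` are hypothetical stand-ins with made-up weights; in the deck the model is a random forest trained on Hadoop.

```python
import heapq


def rank_candidates(person, candidates, featurize, model, k=3):
    """Score each candidate record and return the top k as (score, record_id)."""
    scored = ((model(featurize(person, rec)), rec["id"]) for rec in candidates)
    return heapq.nlargest(k, scored)


def toy_featurize(person, rec):
    return [float(person["last"] == rec["last"]),
            float(person["birth_year"] == rec["birth_year"])]


def toy_model(features):
    # Stand-in for a trained model: fixed, invented weights.
    return 0.6 * features[0] + 0.4 * features[1]


person = {"last": "Smith", "birth_year": 1870}
candidates = [
    {"id": "r1", "last": "Smith", "birth_year": 1870},
    {"id": "r2", "last": "Smith", "birth_year": 1880},
    {"id": "r3", "last": "Smyth", "birth_year": 1870},
    {"id": "r4", "last": "Jones", "birth_year": 1900},
]
ranked = rank_candidates(person, candidates, toy_featurize, toy_model)
print([record_id for _, record_id in ranked])  # prints ['r1', 'r2', 'r3']
```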

Page 20: Big Data at Ancestry.com

Large scale machine learning

[Diagram. Labels: Random forest (R), shown four times; Model; Hadoop streaming; Hadoop HDFS]
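The diagram suggests several random forests trained in parallel in R through Hadoop streaming and combined into one model. How the combination is done is not stated, so the sketch below assumes a simple majority vote and, to stay short, uses single-split decision stumps in place of forests. All data is invented.

```python
from collections import Counter


def train_stump(rows):
    """Pick the rule `x[f] >= t -> 1` with the fewest errors on `rows`."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            errors = sum((xi[f] >= t) != bool(y) for xi, y in rows)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return lambda x: int(x[f] >= t)


def combine(models):
    """Merge independently trained models by majority vote."""
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict


# Each partition stands for the slice of training data one worker would see.
partitions = [
    [([0.9, 0.1], 1), ([0.2, 0.8], 0), ([0.8, 0.3], 1)],
    [([0.7, 0.5], 1), ([0.1, 0.4], 0), ([0.3, 0.9], 0)],
    [([0.95, 0.2], 1), ([0.4, 0.6], 0), ([0.85, 0.7], 1)],
]
model = combine([train_stump(p) for p in partitions])
print(model([0.82, 0.5]), model([0.3, 0.9]))  # prints 1 0
```

Because each partition is trained independently, the work parallelizes naturally; only the small trained models need to be gathered at the end.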

Page 21: Big Data at Ancestry.com

Data

Big Data – Big Picture

Page 22: Big Data at Ancestry.com

Family tree

• User-generated family trees:

– 45 mln family trees

– 4.9 bln profiles

Page 23: Big Data at Ancestry.com

Family tree as a graph (DAG)

2020 nodes, 572 marriage edges, 2910 family edges
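One way to hold such a graph in memory is sketched below, with the two edge types kept separate. The class and method names are hypothetical; treating family edges as directed parent-to-child links and marriage edges as undirected pairs is an assumption about the slide.

```python
class FamilyGraph:
    """People as nodes; directed parent->child edges plus marriage edges."""

    def __init__(self):
        self.nodes = set()
        self.family_edges = set()    # (parent, child), directed
        self.marriage_edges = set()  # frozenset({a, b}), undirected

    def add_parent_child(self, parent, child):
        self.nodes.update((parent, child))
        self.family_edges.add((parent, child))

    def add_marriage(self, a, b):
        self.nodes.update((a, b))
        self.marriage_edges.add(frozenset((a, b)))

    def children(self, person):
        return sorted(c for p, c in self.family_edges if p == person)


g = FamilyGraph()
g.add_marriage("Ann", "Bob")
g.add_parent_child("Ann", "Cid")
g.add_parent_child("Bob", "Cid")
g.add_parent_child("Cid", "Dee")
print(len(g.nodes), len(g.marriage_edges), len(g.family_edges))  # prints 4 1 3
```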

 

Page 24: Big Data at Ancestry.com

Family trees

Page 25: Big Data at Ancestry.com

Family trees statistics

“Power law” distribution, 44 mln trees
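A power-law exponent can be estimated from a list of sizes with the standard maximum-likelihood formula for continuous data, alpha = 1 + n / sum(ln(x_i / x_min)). The slide does not say how the distribution was fitted, and tree sizes are integers, so this is only an approximate sketch run on synthetic data.

```python
import math
import random


def power_law_alpha(sizes, x_min=1.0):
    """Continuous maximum-likelihood estimate of the power-law exponent."""
    tail = [x for x in sizes if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic sizes drawn from a power law with exponent 2.5 (inverse transform).
rng = random.Random(0)
true_alpha = 2.5
sizes = [(1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
         for _ in range(20000)]
print(round(power_law_alpha(sizes), 2))
```

On a log-log plot such a distribution appears as a roughly straight line: most trees are small and a few are very large.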

Page 26: Big Data at Ancestry.com

History from family trees

500 nodes, 700 edges

55 generations

[Diagram axis: time]
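A generation count like the one on this slide can be read as the longest chain of parent-to-child links in the tree. A minimal sketch, assuming the parent-child edges form an acyclic graph; the function name and sample names are hypothetical.

```python
from functools import lru_cache


def generation_count(parent_child_edges):
    """Number of people on the longest ancestor-to-descendant chain."""
    children = {}
    for parent, child in parent_child_edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])

    @lru_cache(maxsize=None)
    def depth(person):
        # One generation for this person plus the deepest line below them.
        return 1 + max((depth(c) for c in children[person]), default=0)

    return max((depth(p) for p in children), default=0)


edges = [("Ann", "Bob"), ("Bob", "Cid"), ("Ann", "Dee"), ("Eve", "Cid")]
print(generation_count(edges))  # prints 3
```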

Page 27: Big Data at Ancestry.com

Historical immigration to the US

• Immigration is the movement of people into a country or region to which they are not native in order to settle there

• Immigrants are defined here as people who were born outside the US and died in the US

• Based on family tree profiles:
  – Birth/death dates range 1500-1990
  – Select only complete profiles with first and last name (FLN), POB, DOB, POD, DOD
  – Perform de-duplication, remove same ancestors from different family trees
  – Select only those with POB != US, POD == US

• 15 mln profiles (0.3% of 4.9 bln profiles)
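The selection steps above can be sketched as a single filter. The field names, the country-level place values and the de-duplication key are assumptions made for illustration; the slides do not describe how duplicates across trees are detected.

```python
REQUIRED = ("name", "pob_country", "dob", "pod_country", "dod")


def select_immigrants(profiles):
    """Keep complete, in-range, de-duplicated profiles born abroad, died in US."""
    seen = set()
    result = []
    for p in profiles:
        if any(not p.get(f) for f in REQUIRED):
            continue  # keep only complete profiles
        if not (1500 <= p["dob"] <= 1990 and 1500 <= p["dod"] <= 1990):
            continue  # outside the studied date range
        if p["pob_country"] == "US" or p["pod_country"] != "US":
            continue  # require POB != US and POD == US
        key = (p["name"].lower(), p["dob"], p["pob_country"], p["dod"])
        if key in seen:
            continue  # same ancestor appearing in another tree
        seen.add(key)
        result.append(p)
    return result


profiles = [
    {"name": "Anna Berg", "pob_country": "SE", "dob": 1850, "pod_country": "US", "dod": 1920},
    {"name": "anna berg", "pob_country": "SE", "dob": 1850, "pod_country": "US", "dod": 1920},
    {"name": "John Smith", "pob_country": "US", "dob": 1870, "pod_country": "US", "dod": 1940},
    {"name": "Luigi Rossi", "pob_country": "IT", "dob": 1880, "pod_country": "US", "dod": None},
    {"name": "Marie Roux", "pob_country": "FR", "dob": 1860, "pod_country": "FR", "dod": 1930},
]
print(len(select_immigrants(profiles)))  # prints 1
print(f"{15e6 / 4.9e9:.1%}")             # prints 0.3%
```

The last line checks the share quoted on the slide: 15 mln out of 4.9 bln profiles is about 0.3%.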


Page 28: Big Data at Ancestry.com

Immigration to the USA 1500-1990

Page 29: Big Data at Ancestry.com


Page 30: Big Data at Ancestry.com

Immigration map

Page 31: Big Data at Ancestry.com

Ports of arrival (1800-1980)

Page 32: Big Data at Ancestry.com

Data Science

• Ancestry is building a data science team

• We work on product data and BI

• We are hiring

• Special thanks to Mercator Group for infographics