Post on 06-Apr-2017
Pa#ernMiningandMachineLearningforDemographicSequences
DmitryI.Ignatov1,EkaterinaMitrofanova2,AnnaMuratova1,andDanilGizdatulin1
1Computer Science Faculty & 2Institute of Demography
National Research University Higher School of Economics, Moscow, Russia
KESW 2015, Moscow
Outline
KESW 2015, Moscow
• ProblemStatement• DemographicData• MethodsandResults
Ø DecisionTrees§ FirstlifecourseeventpredicGon§ NextlifecourseeventpredicGon§ Rulesfor‘gender’aLribute
Ø FrequentPaLerns§ Frequentclosedsequences§ Emergingsequencesformenandwomen
• ConclusionandFutureWork
2
• Analysisofdemographicdatabymeansofmachinelearninganddatamining[Blockeel et al, 2001; Billari et al., 2006]• DemographersquesGons:
§ Whatarethedifferencesbetweendemographic behaviour of menandwomen?
§ Whatarethetypical(frequent)eventsequencesthatappearinlifecoursetrajectories?
§ WhatarethefirstandthenextstarGngeventsaparGcularpersonmayhave?
§ Andmanymore…
• SimpleDM&MLtoolspreferablywithGUI:§ Orange http://orange.biolab.si/§ SPMFhLp://www.philippe-fournier-viger.com/spmf/§ Ad-hocscriptsinPython
ProblemStatement
3KESW 2015, Moscow
Thesurvey«Parentsandchildren,menandwomeninfamilyandsociety»www.socpol.ru/gender/RIDMIZ.shtml• 4857peopleincluding1545men3312women• 11generaGonsaresplitin5yearsintervalsfrom1930Gll1984
DemographicData
4KESW 2015, Moscow
Comparisonofclassifiersforthefirsteventpredic?on
5KESW 2015, Moscow
WhyDecisionTrees?
5KESW 2015, Moscow
• Notablack-boxapproach• Simpleif-thenrulebasedrepresentaGon• However…
Thefirsteventpredic?onbyDecisionTrees
ThetreeisbuiltinOrangeThefirsteventmainlydependsoneducaGontype
6KESW 2015, Moscow
FeatureEncodingSchemes
General attributes: • Gender • Generation • Type of education • Locality • Religious views • Religious activity
Events encoding: • Binary encoding (0 - absence, 1 - presence) • Time-base encoding (age in months) • Pairs of events ordered by precedence relations (<, >, =, n/a)
7KESW 2015, Moscow
The anonymised datasets for each experiment are freely available in CSV files: http://bit.ly/KESW2015seqdem
Influenceoffeatureencodingonthenexteventpredic?on
Encodingtype
Classifica?onAccuracy
Imbalanceddata Balanceddata
Binary(BE) 0.8498 0.8780(*)
Time-based(TE) 0.3516 0.3591
Pairsofevents(PE) 0.7076 0.7013
BE+TE 0.7293(~) 0.7459
BE+PE 0.8407 0.8438
TE+PE 0.5465 0.4959
BE+TE+PE 0.7295(~) 0.7503
(*)meansthebestresult,(~)meansalmostequivalentresults
8KESW 2015, Moscow
Theconfusionmatrixforthenexteventpredic?on
8KESW 2015, Moscow
– binary encoding for balanced dataset – “br” means “break up” and “div” means “divorce” events
Examplesofrulesforthenexteventpredic?on
9KESW 2015, Moscow
Thenexteventpredic?onbasedongenerala#ributes
A decision tree diagram built only for general attributes. The next event is influenced by the type of education.
10KESW 2015, Moscow
Classifica?onaccuracyofdifferentencodingschemesforgenderpredic?on
(∼) means very close results, and (*) means the best result in the column.
11KESW 2015, Moscow
Men:Women:
Examplesofrulesforpredic?onoftargeta#ribute«gender»
Antecedent Confidence
Firstjobaper19.9years,marriagein20.6-22.4,educaGonbefore20.7,breakupaper27.6,divorcebefore30.5
65.9%
Firstjobaper19.9,marriagein20.6-22.4,break-upbefore27.6 61.1%
Firstjobbefore17.2,marriagein20.6-22.4,break-upbefore27.6 61.3%
Firstjobaper21,marriageaper29.5 70.2%
Antecedent Confidence
Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorceaper30.5
71.9%
Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorcebefore30.5
70.9%
Firstjobin17.2-19.9,marriagein20.6-22.4,break-upbefore27.6 62.8%
Firstjobin17.7-21,marriageaper29.5 62.8%
12KESW 2015, Moscow
• Asequenceisanorderedsetofelements(events)• ThesupportofsequencerindatabaseisthenumberofsequencesinDthatcontainr
• Therela?vesupportofristhefracGonofsequencesthatcontainr• AsequencerisfrequentinadatabaseDif,whereminsupisaminimalsupportthreshold• Afrequentclosedsequenceis a sequence such that there is no any its
supersequence with the same support
SequenceMining
13KESW 2015, Moscow
[Zaki & Meira, 2014; Agrawal & Srikant, 1995]
sup(r) = #{si ∈ D | r is subsequence of si}
• AnemergingsequenceisasequencesuchthatitssupportgrowthsdrasGcallyinthetransiGonfromonedatabasetoanotherone
• ThegrowthrateofasequencesinthetransiGonfromdatabasesD1toD2: (1) • Thecontribu?onofasequencetoaparGcularclass:,(2)whereeisasubsequenceofsTheideaisbasedon[Mill,1843;Finn,1983;Dong&Li,1999]
EmergingSequences
14KESW 2015, Moscow
• An SPMFoutputforfrequentclosedsequencesmining
Frequentclosedsequences
15KESW 2015, Moscow
• ImplementedinPython2.7Input:twodatasetsformenandwomenwithageindicaGonfordemographicevents1. TransformaGonofeventstosequences(80:20test-to-trainingraGo)2. PassingthetestsettoSPMFandfindingfrequentsequences.3. Findingemergingsequences(classificaGonrules)andtheircontribuGonsto
classes.4. DefiningtheclassofarulebyitscontribuGonandthencontribuGon
normalisaGon.5. AccuracycalculaGonforthetestset.Output:fileswithclassificaGonrulesformenandwomen
• minsup = 0.005; for each class we use 3312 sequences after oversampling.
• The best classification accuracy (0.936) has been reached at minimal growth rate 1.0, with 577 rules for men and 1164 for women, and 3 non-covered objects.
EmergingSequenceMining
16KESW 2015, Moscow
• The list of emerging sequences for men(with their class contribution):• Thelistofemergingsequencesforwomen:
Emergingsequences
17KESW 2015, Moscow
• We have shown that decision trees and sequence mining could become the tools of choice for demographers.
• Machine learning and data mining tools can help in finding regularities
and dependencies that are hidden in voluminous demographic datasets. However, these methods need to be properly tuned and adapted to the domain needs.
• In the near future we are planning:
• to implement emerging prefix-string mining and learning to deal with sequences without gaps
• to use different rule-based techniques that are able to cope with unbalanced multi-class data (Cerf et al., 2013)
• to apply Pattern Structures to demographic sequence mining (Kuznetsov et al., 2013).
• andmanymoreideasandtricks…
ConclusionandFutureWork
18KESW 2015, Moscow
Thankyou.Ques?ons?
KESW 2015, Moscow