Data Modelling in SAS
How SAS is Used for Research and Teaching to Enable Students to Become More Marketable
Iveta Stankovičová
Comenius University, Faculty of Management
Bratislava, Slovakia
[email protected]@fm.uniba.sk

Data
The current age is characterized by an information explosion.
Data are generated:
– for research purposes (historically, for data analysis) – experimental data
– as operational data (today, in business) – opportunistic data (Huber 1977)

Data
              Experimental Data      Opportunistic Data
Purpose       Research               Operational
Value         Scientific             Commercial
Generation    Actively controlled    Passively observed
Size          Small                  Massive
State         Static                 Dynamic
Hygiene       Clean                  Dirty

Data → Information
It is necessary to obtain information from massive amounts of operational data for managers' decision making (business decision support).
It is necessary to explore and model relationships in data: predictive modelling (the fundamental task).
Data Modelling = Data Mining (ca. 1963)

Data Mining - Definition
The process of selecting, exploring and modelling large volumes of data in order to detect previously unknown information patterns for an advantage in a competitive environment
Multidisciplinary lineage
Uses statistical methods and further methods bordering on artificial intelligence

Data Mining – SAS definition
Advanced methods for exploring and modelling relationships in large amounts of data
Characteristics:
1. data – massive, operational, opportunistic
2. users and sponsors – non-researchers, business oriented
3. methodology – multidisciplinary, computer-based

Data Mining – Analytical tools
Statistics
Artificial intelligence (AI)
Knowledge discovery in databases (KDD)
Machine learning
Pattern recognition methodology
Neurocomputing

Data Mining – Steps, Cycle
1. Identifying the business problem
2. Transforming data into actionable results
3. Acting on the achieved results
4. Measuring the results
[Figure: cycle diagram connecting steps 1 to 4]

Data Mining - Activities
Classification
Affinity grouping or association rules
Clustering, segmentation
Estimation
Prediction
Description and visualization

Data Mining - People
Domain experts
Data experts
Analytical experts

Data Mining - Processes
1. Model making on historical data:
   1. training
   2. test
   3. validation
2. Applying the model to new data: prediction
[Figure: Data Mining System diagram with the labels Algorithm, Training, Test, Model, Eval, Score Model, Prediction and Results]
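The partition of historical data into training, test and validation parts named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 60/20/20 proportions, the fixed random seed and the function name `split_historical_data` are not taken from the presentation, and SAS tools perform this step in their own way.

```python
import random


def split_historical_data(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle the records and cut them into training, test and validation parts.

    The fractions and the seed are illustrative assumptions, not values
    from the presentation.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle of a copy
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return training, test, validation


# Stand-in for historical records: 396 record identifiers
historical = list(range(396))
training, test, validation = split_historical_data(historical)
print(len(training), len(test), len(validation))
```

Every record lands in exactly one part, so the model is fitted on one portion of the historical data and checked on records it has not seen.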

Data Mining – Practice
1. Goal definition
2. Selection of data sources
3. Preparation of data for modelling
4. Selection and transformation of variables
5. Processing and evaluation of the model
6. Model verification
7. Implementation and model maintenance

Data Mining – SAS solution
SEMMA methodology:
1. Sample – identify input data sets, sample from a large data set (training, test and validation data sets)
2. Explore – explore the data set statistically and graphically
3. Modify – prepare the data for analysis (data manipulation and transformation)
4. Model – fit a predictive model
5. Assess – compare competing models
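To make the Assess step concrete, the Python sketch below compares two competing models by their misclassification rate on validation data. Everything in it is hypothetical: the validation rows, the two threshold "models" and the choice of misclassification rate as the criterion are assumptions made for illustration, not part of the SEMMA definition or of any SAS product.

```python
def misclassification_rate(model, rows):
    """Share of rows whose predicted label differs from the observed label."""
    errors = sum(1 for feature, label in rows if model(feature) != label)
    return errors / len(rows)


# Hypothetical validation rows: (income class, credit assigned 0/1)
validation_rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 0), (7, 1), (8, 1)]

# Two hypothetical competing models: simple thresholds on the income class
competing_models = {
    "threshold_4": lambda income: 1 if income >= 4 else 0,
    "threshold_6": lambda income: 1 if income >= 6 else 0,
}

rates = {
    name: misclassification_rate(model, validation_rows)
    for name, model in competing_models.items()
}
best_model = min(rates, key=rates.get)  # lowest error rate on validation data
print(rates, best_model)
```

Assessing on validation data rather than on the training data keeps the comparison from favouring a model that merely memorised its training sample.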

Data Mining - Methods
Statistical methods - linear and logistic regression, multidimensional methods, time series analysis ...
Non-statistical methods - neural networks, genetic algorithms ...
Mixed methods - classification and regression trees ...

SAS System at Comenius University Bratislava (CU)
November 1999 – license contract signed between CU Bratislava and SAS Institute GmbH for the provision of 50 licences of SAS System
November 2001 - Enterprise Guide added to the licence contract

SAS System at Faculty of Management Bratislava (FM)
Faculty of Management - 25 licenses
SAS education began (V 6.12) - summer term of academic year 1999/2000
Currently – SAS V8.2 and Enterprise Guide V2.0

Subjects of Statistics
3 compulsory subjects:
Introduction to Statistics (1st year, summer term – 4 hours/week)
Statistics on PC (2nd year, winter term – 2 hours/week)
Statistical Methods (2nd year, summer term - 4 hours/week)
2 elective subjects:
Quantitative methods (in SAS System) (3rd year, summer term - 2 hours/week)
Time series analysis (in SAS System) (3rd year, summer term - 2 hours/week)

Subjects contents
Contents of compulsory subjects:
– mathematical statistics methods included in the basic modules (SAS/BASE, SAS/STAT, SAS/ETS)
Contents of elective subjects:
– logistic regression, principal components analysis (PCA), cluster analysis, factor analysis, discriminant analysis (SAS/STAT, SAS/EG)
– time series analysis – ARIMA models (SAS/EG)

Example – Logistic model
Sample of 396 applicants for credit
Independent variables X_i (categorical):
Age (classes) = vek, 8 values
Gender = pohl (0=male, 1=female), 2 values
Income (classes) = plat, 8 values
Number of dependants = vyz_os, 4 values
Job duration (classes) = trv_zam, 6 values
Dependent variable Y (binary):
Credit, 2 values: 1 = assigned, 0 = non-assigned

Logistic regression model
Conditional probability $P(Y=1 \mid X)$, denoted $p$:
$p = \frac{1}{1 + e^{-(\alpha + \beta' X)}}$
Odds, $p/(1-p)$:
$\frac{p}{1-p} = e^{\alpha + \beta' X}$
Logarithm of the odds = logit – a linear transformation:
$\operatorname{logit}(p) = \log\frac{p}{1-p} = \alpha + \beta' X$
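The three formulas on this slide are inverses of one another, which a short Python sketch can confirm numerically. The values of alpha, beta and X below are arbitrary illustrative numbers, not estimates from the presentation, and the function names are chosen here for readability.

```python
import math


def probability(alpha, beta, x):
    """p = 1 / (1 + exp(-(alpha + beta'x))) for coefficient and predictor lists."""
    linear = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


def odds(p):
    """Odds p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Logarithm of the odds."""
    return math.log(odds(p))


# Arbitrary illustrative values (assumption): alpha + beta'x = -1 + 1 + 1 = 1
alpha, beta, x = -1.0, [0.5, 0.25], [2.0, 4.0]
p = probability(alpha, beta, x)
print(p, odds(p), logit(p))
```

Applying `logit` to the computed probability returns the linear predictor alpha + beta'X, which is why the logit is described as a linear transformation.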

Significance of Variables
Analysis of Effects Not in the Model
Effect     DF   Score Chi-Square   Pr > ChiSq
pohl        1             0.8121       0.3675
vek         1            48.0791       <.0001
vyz_os      1            39.6707       <.0001
plat        1            41.4197       <.0001
trv_zam     1            33.9234       <.0001

Estimates of the model's parameters
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -6.3538           0.7073           80.6885       <.0001
vek          1     0.3916           0.0871           20.2308       <.0001
vyz_os       1     0.8109           0.1539           27.7692       <.0001
plat         1     0.7182           0.1264           32.2918       <.0001
trv_zam      1     0.6000           0.1155           27.0075       <.0001
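As a consistency check on this table, the Wald chi-square of a single parameter is the squared ratio of the estimate to its standard error. The Python sketch below recomputes it from the rounded values printed on the slide, so small differences in the second decimal place are expected. The function name `wald_chi_square` is illustrative and is not a SAS identifier.

```python
# (estimate, standard error, Wald chi-square as reported on the slide)
reported = {
    "Intercept": (-6.3538, 0.7073, 80.6885),
    "vek": (0.3916, 0.0871, 20.2308),
    "vyz_os": (0.8109, 0.1539, 27.7692),
    "plat": (0.7182, 0.1264, 32.2918),
    "trv_zam": (0.6000, 0.1155, 27.0075),
}


def wald_chi_square(estimate, std_error):
    """Wald statistic with 1 degree of freedom: (estimate / standard error)^2."""
    return (estimate / std_error) ** 2


differences = {}
for name, (estimate, std_error, slide_value) in reported.items():
    computed = wald_chi_square(estimate, std_error)
    differences[name] = abs(computed - slide_value)
    print(f"{name:10s} computed={computed:8.4f} slide={slide_value:8.4f}")
```

The recomputed statistics agree with the slide to within rounding of the printed estimates and standard errors.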

Odds Ratio Estimates
Effect     Point Estimate   95% Wald Confidence Limits
vek                 1.479        1.247        1.755
vyz_os              2.250        1.664        3.042
plat                2.051        1.601        2.627
trv_zam             1.822        1.453        2.285
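The odds ratios follow from the maximum likelihood estimates: the point estimate is the exponential of the coefficient, and the 95% Wald limits are the exponentials of the coefficient minus and plus z times its standard error. The Python sketch below reproduces the slide values from the estimates and standard errors in the parameter table. It assumes the usual normal quantile z ≈ 1.959964, and the function name is hypothetical.

```python
import math

Z_95 = 1.959964  # assumed standard normal quantile for a two-sided 95% interval


def odds_ratio_with_limits(estimate, std_error, z=Z_95):
    """Return (point estimate, lower limit, upper limit) of the odds ratio."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


# (estimate, standard error) from the parameter table; slide odds ratio values
parameters = {
    "vek": (0.3916, 0.0871),
    "vyz_os": (0.8109, 0.1539),
    "plat": (0.7182, 0.1264),
    "trv_zam": (0.6000, 0.1155),
}
slide_values = {
    "vek": (1.479, 1.247, 1.755),
    "vyz_os": (2.250, 1.664, 3.042),
    "plat": (2.051, 1.601, 2.627),
    "trv_zam": (1.822, 1.453, 2.285),
}

computed = {name: odds_ratio_with_limits(*values) for name, values in parameters.items()}
for name, triple in computed.items():
    print(name, [round(value, 3) for value in triple])
```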

Logistic model - final
Logit function:
$\log\frac{p}{1-p} = -6.35 + 0.39 \cdot \text{vek} + 0.81 \cdot \text{vyz\_os} + 0.72 \cdot \text{plat} + 0.6 \cdot \text{trv\_zam}$
Probability function:
$p = \frac{1}{1 + e^{-(-6.35 + 0.39 \cdot \text{vek} + 0.81 \cdot \text{vyz\_os} + 0.72 \cdot \text{plat} + 0.6 \cdot \text{trv\_zam})}}$
Odds function:
$\frac{p}{1-p} = e^{-6.35 + 0.39 \cdot \text{vek} + 0.81 \cdot \text{vyz\_os} + 0.72 \cdot \text{plat} + 0.6 \cdot \text{trv\_zam}}$
Interpretation - example:
The odds of a client being assigned credit increase approximately 2 times with each higher income class, because $e^{0.72} = 2.05$, where 0.72 is the parameter of the income variable (plat) in the logistic function.
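The fitted model can be applied to a single applicant as in the Python sketch below. The applicant and the coding of the classes as consecutive integers are assumptions made for illustration; the presentation does not state how the classes are coded. The last lines check the interpretation given above: moving one income class higher multiplies the odds by e^0.72.

```python
import math

INTERCEPT = -6.35
COEFFICIENTS = {"vek": 0.39, "vyz_os": 0.81, "plat": 0.72, "trv_zam": 0.6}


def credit_probability(applicant):
    """Probability of credit assignment from the final logistic model."""
    logit_value = INTERCEPT + sum(
        COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-logit_value))


def odds(p):
    return p / (1.0 - p)


# Hypothetical applicant; integer class codes are an assumption
applicant = {"vek": 4, "vyz_os": 2, "plat": 5, "trv_zam": 3}
higher_income = dict(applicant, plat=applicant["plat"] + 1)

p_base = credit_probability(applicant)
p_higher = credit_probability(higher_income)
odds_multiplier = odds(p_higher) / odds(p_base)  # equals exp(0.72)
print(round(p_base, 3), round(p_higher, 3), round(odds_multiplier, 2))
```

The multiplier applies to the odds, not to the probability itself, so the change in probability depends on where the applicant sits on the S-curve.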

Measures of association
Association of Predicted Probabilities and Observed Responses
Percent Concordant    82.0    Somers' D   0.647
Percent Discordant    17.3    Gamma       0.652
Percent Tied           0.7    Tau-a       0.316
Pairs                38180    c           0.824
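The four rank correlation indexes in this table can be recovered from the concordant, discordant and tied percentages, the number of pairs, and the sample size of 396 applicants. The Python sketch below uses the standard definitions of these indexes. Because the percentages on the slide are rounded to one decimal place, the results match the reported values only to about three decimals.

```python
percent_concordant = 82.0
percent_discordant = 17.3
percent_tied = 0.7
pairs = 38180          # pairs of observations with different responses
n_observations = 396   # sample size stated in the example

concordant = percent_concordant / 100.0
discordant = percent_discordant / 100.0
tied = percent_tied / 100.0

# Somers' D: difference of concordant and discordant shares among all pairs
somers_d = concordant - discordant
# Gamma: the same difference, ignoring tied pairs
gamma = (concordant - discordant) / (concordant + discordant)
# c: concordant share plus half of the tied share
c_index = concordant + 0.5 * tied
# Tau-a: difference in pair counts relative to all N(N-1)/2 pairs of observations
tau_a = (concordant - discordant) * pairs / (0.5 * n_observations * (n_observations - 1))

print(round(somers_d, 3), round(gamma, 3), round(tau_a, 3), round(c_index, 4))
```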

Logistic S-curve
x-axis = income classes
y-axis = probability of credit assignment
[Figure: logistic S-curve of the probability of credit assignment (0 to 1) against income classes (0 to 8)]

SAS System – offered in Menu
Overview of modules and applications of SAS System V8.2 for creating statistical analyses in menu mode (knowledge of SAS code is not required):
SAS/ASSIST software
SAS/INSIGHT software
SAS Analyst
SAS/Enterprise Guide

Activities
Outputs from SAS education:
Projects – output from each subject
Student Research Activity Competition – 3rd year, ca. 15 works per year
Thesis works
– information system (module AF)
– data analysis (modules BASE, STAT, QC, ...)
– Scorecard (Enterprise Guide, Enterprise Miner)
SAS Forum conference - participation of teachers and students

Plans
Extension of plans for using SAS in the following subjects:
Multidimensional Methods of Analysis
Time Series Analysis
Marketing Research
Data Mining
Financial Analysis
Quality Control
Operational Management