Data Modelling in SAS

How SAS is Used for Research and Teaching to Enable Students to Become More Marketable

Iveta Stankovičová
Comenius University, Faculty of Management
Bratislava, Slovakia
[email protected]

Data

The current age is characterized by an information explosion.

Data are generated:
- for research purposes (historically, for data analysis): experimental data
- as operational data (today, in business): opportunistic data (Huber 1977)

Data

             Experimental data      Opportunistic data
Purpose      Research               Operational
Value        Scientific             Commercial
Generation   Actively controlled    Passively observed
Size         Small                  Massive
Hygiene      Clean                  Dirty
State        Static                 Dynamic

From Data to Information

- It is necessary to obtain information from massive amounts of operational data to support managers' decision making (business decision support).
- It is necessary to explore and model relationships in data: predictive modelling (the fundamental task).

Data Modelling = Data Mining (ca. 1963)

Data Mining - Definition

- The process of selecting, exploring and modelling large volumes of data in order to detect previously unknown information patterns that give an advantage in a competitive environment
- Multidisciplinary lineage
- Uses statistical methods and further methods bordering on artificial intelligence

Data Mining - SAS definition

Advanced methods for exploring and modelling relationships in large amounts of data.

Characteristics:
1. data: massive, operational, opportunistic
2. users and sponsors: non-researchers, business oriented
3. methodology: multidisciplinary, via computer

Data Mining - Analytical tools

- Statistics
- Artificial intelligence (AI)
- Knowledge discovery in databases (KDD)
- Machine learning
- Pattern recognition methodology
- Neurocomputing

Data Mining - Steps, Cycle

1. Identifying the business problem
2. Transforming data into actionable results
3. Acting according to the achieved results
4. Measuring the results

[Diagram: the four steps arranged in a cycle, 1 → 2 → 3 → 4 → 1]

Data Mining - Activities

- Classification
- Affinity grouping or association rules
- Clustering, segmentation
- Estimation
- Prediction
- Description and visualization

Data Mining - People

- Domain experts
- Data experts
- Analytical experts

Data Mining - Processes

1. Model making on historical data:
   1. training
   2. test
   3. validation
2. Applying the model to new data: prediction

[Diagram: Data Mining System, with blocks labelled Algorithm, Training, Test, Model, Eval, Score Model, Prediction and Results]
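The partitioning of historical data into training, validation and test sets can be sketched as follows. This is a minimal illustration in Python rather than SAS; the function name `split_historical_data`, the 60/20/20 proportions and the fixed seed are assumptions made for the example and do not come from the slides.

```python
import random

def split_historical_data(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test partitions.

    The proportions are illustrative assumptions; whatever is left after the
    training and validation shares goes to the test partition.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation_set, test

# Hypothetical data: 396 record identifiers standing in for historical cases.
training, validation_set, test = split_historical_data(range(396))
print(len(training), len(validation_set), len(test))
```

The model is fitted on the training partition only; the other two partitions are held back so that the model can be tuned and judged on cases it has not seen before it is applied to new data.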

Data Mining - Practice

1. Goal definition
2. Selection of data sources
3. Preparation of data for modelling
4. Selection and transformation of variables
5. Processing and evaluation of the model
6. Model verification
7. Implementation and model maintenance

Data Mining - SAS solution

SEMMA methodology:
1. Sample: identify input data sets, sample from a large data set (training, test and validation data sets)
2. Explore: explore the data set statistically and graphically
3. Modify: prepare the data for analysis (data manipulation and transformation)
4. Model: fit a predictive model
5. Assess: compare competing models

Data Mining - Methods

- Statistical methods: linear and logistic regression, multidimensional methods, time series analysis ...
- Non-statistical methods: neural networks, genetic algorithms ...
- Mixed methods: classification and regression trees ...

SAS System at Comenius University Bratislava (CU)

- November 1999: a licence contract was signed between CU Bratislava and SAS Institute GmbH, providing 50 licences of the SAS System
- November 2001: Enterprise Guide was added to the licence contract

SAS System at Faculty of Management Bratislava (FM)

- Faculty of Management: 25 licences
- SAS education began (V 6.12) in the summer term of the academic year 1999/2000
- Currently: SAS V8.2 and Enterprise Guide V2.0

Subjects of Statistics

3 compulsory subjects:
- Introduction to Statistics (1st year, summer term, 4 hours/week)
- Statistics on PC (2nd year, winter term, 2 hours/week)
- Statistical Methods (2nd year, summer term, 4 hours/week)

2 elective subjects:
- Quantitative Methods (in SAS System) (3rd year, summer term, 2 hours/week)
- Time Series Analysis (in SAS System) (3rd year, summer term, 2 hours/week)

Subjects contents

Contents of compulsory subjects:
- methods of mathematical statistics included in the basic modules (SAS/BASE, SAS/STAT, SAS/ETS)

Contents of elective subjects:
- logistic regression, principal components analysis (PCA), cluster analysis, factor analysis, discriminant analysis (SAS/STAT, SAS/EG)
- time series analysis: ARIMA models (SAS/EG)

Example - Logistic model

Sample of 396 applicants for credit.

Independent variables X_i (categorical):

Variable                  Name      Values
Age (classes)             vek       8
Gender                    pohl      2 (0 = male, 1 = female)
Income (classes)          plat      8
Number of dependants      vyz_os    4
Job duration (classes)    trv_zam   6

Dependent variable Y (binary):

Credit                              2 (1 = assigned, 0 = non-assigned)

Logistic regression model

Conditional probability: $p = P(Y = 1 \mid X)$

$$p = \frac{1}{1 + e^{-(\alpha + \beta' X)}}$$

Odds: $p / (1 - p)$

$$\frac{p}{1 - p} = e^{\alpha + \beta' X}$$

Logarithm of the odds = logit (a linear transformation):

$$\operatorname{logit}(p) = \log\frac{p}{1 - p} = \alpha + \beta' X$$
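The three forms of the model (probability, odds, logit) are the same relationship written in different ways. The short Python sketch below shows this numerically; the function names and the value 0.8 chosen for the linear predictor α + β'X are illustrative assumptions.

```python
import math

def logistic(eta):
    """Probability p = 1 / (1 + e^(-eta)), where eta = alpha + beta'X."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds p / (1 - p) of the event."""
    return p / (1.0 - p)

def logit(p):
    """Logarithm of the odds; the inverse of the logistic function."""
    return math.log(odds(p))

eta = 0.8  # hypothetical value of alpha + beta'X
p = logistic(eta)
print(p)         # probability between 0 and 1
print(odds(p))   # equals e^eta up to rounding error
print(logit(p))  # recovers eta up to rounding error
```

Because the logit is linear in X, the coefficients can be estimated and interpreted much like those of a linear regression, while the predicted probability always stays between 0 and 1.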

Significance of Variables

Analysis of Effects Not in the Model

Effect     DF   Score Chi-Square   Pr > ChiSq
pohl       1     0.8121            0.3675
vek        1    48.0791            <.0001
vyz_os     1    39.6707            <.0001
plat       1    41.4197            <.0001
trv_zam    1    33.9234            <.0001
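The p-values in the "Pr > ChiSq" column follow from the score statistics, since each effect has 1 degree of freedom. The Python sketch below is an illustration, not the SAS computation itself: it relies on the fact that a chi-square variable with 1 DF is the square of a standard normal variable, and the function name `chi2_pvalue_1df` is an assumption.

```python
import math

def chi2_pvalue_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 DF, P(X > x) = erfc(sqrt(x / 2)), because X is the square of a
    standard normal variable.
    """
    return math.erfc(math.sqrt(statistic / 2.0))

# Score chi-square statistics as reported on the slide.
score_chi_square = {
    "pohl": 0.8121,
    "vek": 48.0791,
    "vyz_os": 39.6707,
    "plat": 41.4197,
    "trv_zam": 33.9234,
}

for effect, statistic in score_chi_square.items():
    print(effect, round(chi2_pvalue_1df(statistic), 4))
```

Only gender (pohl) has a p-value far above the usual 0.05 threshold, which is consistent with its absence from the parameter estimates of the fitted model.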

Estimates of model's parameters

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -6.3538    0.7073           80.6885           <.0001
vek         1     0.3916    0.0871           20.2308           <.0001
vyz_os      1     0.8109    0.1539           27.7692           <.0001
plat        1     0.7182    0.1264           32.2918           <.0001
trv_zam     1     0.6000    0.1155           27.0075           <.0001
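The Wald chi-square column can be reproduced from the estimate and standard error columns as (estimate / standard error)². The Python sketch below checks this against the values reported on the slide. Small differences are expected because the printed estimates and standard errors are rounded to four decimals; the tolerance of 0.05 is an assumption chosen to absorb that rounding.

```python
def wald_chi_square(estimate, std_error):
    """Wald statistic for testing that a single parameter equals zero."""
    return (estimate / std_error) ** 2

# parameter: (estimate, standard error, reported Wald chi-square), from the slide
reported = {
    "Intercept": (-6.3538, 0.7073, 80.6885),
    "vek": (0.3916, 0.0871, 20.2308),
    "vyz_os": (0.8109, 0.1539, 27.7692),
    "plat": (0.7182, 0.1264, 32.2918),
    "trv_zam": (0.6000, 0.1155, 27.0075),
}

for name, (estimate, std_error, wald_reported) in reported.items():
    wald = wald_chi_square(estimate, std_error)
    # Differences come from rounding in the printed table.
    print(name, round(wald, 2), wald_reported, abs(wald - wald_reported) < 0.05)
```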

Odds Ratio Estimates

Effect     Point Estimate   95% Wald Confidence Limits
vek        1.479            1.247    1.755
vyz_os     2.250            1.664    3.042
plat       2.051            1.601    2.627
trv_zam    1.822            1.453    2.285
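Each odds ratio is the exponential of the corresponding parameter estimate, and the Wald confidence limits are the exponentials of estimate ± z × standard error. The Python sketch below recomputes the table under that assumption, using the estimates and standard errors from the maximum likelihood table; the function name `odds_ratio_with_ci` is illustrative.

```python
import math
from statistics import NormalDist

def odds_ratio_with_ci(estimate, std_error, level=0.95):
    """Return (odds ratio, lower limit, upper limit) for a Wald interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )

# effect: (estimate, standard error), from the maximum likelihood estimates
estimates = {
    "vek": (0.3916, 0.0871),
    "vyz_os": (0.8109, 0.1539),
    "plat": (0.7182, 0.1264),
    "trv_zam": (0.6000, 0.1155),
}

for effect, (estimate, std_error) in estimates.items():
    point, lower, upper = odds_ratio_with_ci(estimate, std_error)
    print(effect, round(point, 3), round(lower, 3), round(upper, 3))
```

None of the intervals contains 1, which agrees with the small p-values in the parameter table.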

Logistic model - final

Logit function:

$$\log\frac{p}{1 - p} = -6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam}$$

Probability function:

$$p = \frac{1}{1 + e^{-(-6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam})}}$$

Odds function:

$$\frac{p}{1 - p} = e^{-6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam}}$$

Interpretation (example):
The odds that a client is assigned the credit increase approximately 2 times with each higher income class, because $e^{0.72} = 2.05$, where 0.72 is the parameter of the income variable (plat) in the logistic function.
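The final model can be applied to a single applicant as in the Python sketch below. The coefficients are the rounded values from the slide. The example applicant and the assumption that each variable enters the model as an integer class index are hypothetical, since the slides do not show how the classes are coded.

```python
import math

INTERCEPT = -6.35
COEFFICIENTS = {"vek": 0.39, "vyz_os": 0.81, "plat": 0.72, "trv_zam": 0.60}

def credit_probability(applicant):
    """Probability that the credit is assigned, from the fitted logit function."""
    eta = INTERCEPT + sum(COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-eta))

def credit_odds(applicant):
    """Odds p / (1 - p) that the credit is assigned."""
    p = credit_probability(applicant)
    return p / (1.0 - p)

# Hypothetical applicant; class indices are an assumption about the coding.
applicant = {"vek": 4, "vyz_os": 1, "plat": 3, "trv_zam": 3}
one_income_class_higher = dict(applicant, plat=applicant["plat"] + 1)

print(credit_probability(applicant))
# Ratio of odds after moving up one income class; equals e^0.72, about 2.05.
print(credit_odds(one_income_class_higher) / credit_odds(applicant))
```

The ratio of odds does not depend on the values of the other variables, which is why the interpretation can be stated for "each higher income class" without further qualification.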

Measures of association

Association of Predicted Probabilities and Observed Responses

Percent Concordant   82.0     Somers' D   0.647
Percent Discordant   17.3     Gamma       0.652
Percent Tied          0.7     Tau-a       0.316
Pairs               38180     c           0.824
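The four rank correlation indexes can be derived from the concordant and discordant percentages, the number of pairs and the sample size (396 applicants). The Python sketch below uses the usual definitions of these measures for a binary response. It is an illustration under that assumption, and the function name `association_measures` is hypothetical.

```python
def association_measures(pct_concordant, pct_discordant, pairs, n_obs):
    """Somers' D, Gamma, Tau-a and c from concordance percentages."""
    concordant = pct_concordant / 100.0 * pairs
    discordant = pct_discordant / 100.0 * pairs
    tied = pairs - concordant - discordant
    return {
        "Somers' D": (concordant - discordant) / pairs,
        "Gamma": (concordant - discordant) / (concordant + discordant),
        "Tau-a": (concordant - discordant) / (0.5 * n_obs * (n_obs - 1)),
        "c": (concordant + 0.5 * tied) / pairs,
    }

# Inputs as reported on the slide; 396 is the sample size from the example.
measures = association_measures(82.0, 17.3, pairs=38180, n_obs=396)
for name, value in measures.items():
    print(name, round(value, 3))
```

The values agree with the slide up to the rounding of the printed percentages. The statistic c is the share of pairs (one applicant with credit assigned, one without) in which the model gives the higher predicted probability to the applicant with the credit, counting ties as one half.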

Logistic S-curve

[Figure: logistic S-curve. x-axis: income classes (0 to 8); y-axis: probability of credit assignment (0 to 1)]

SAS System - Menu Mode

Overview of modules and applications of SAS System V8.2 for creating statistical analyses in menu mode (knowledge of SAS code is not required):

- SAS/ASSIST software
- SAS/INSIGHT software
- SAS Analyst
- SAS Enterprise Guide

Activities

Outputs from SAS education:
- Projects: an output from each subject
- Student Research Activity Competition: 3rd year, ca. 15 works per year
- Theses:
  - information system (module AF)
  - data analysis (modules BASE, STAT, QC, ...)
  - Scorecard (Enterprise Guide, Enterprise Miner)
- SAS Forum conference: participation of teachers and students

Plans

Extension of SAS use to the following subjects:
- Multidimensional Methods of Analysis
- Time Series Analysis
- Marketing Research
- Data Mining
- Financial Analysis
- Quality Control
- Operational Management

Thanks for your attention!