Data preparation, training and validation using SystemML by Faraz Makari Manshadi
Data Preparation and Descriptive Statistics in SystemML

Outline

• Data pre-processing and transformation
• Training/Testing/Cross Validation
• Descriptive statistics
  I. Univariate statistics
  II. Bivariate statistics
  III. Stratified statistics

Input Data Format

Input data:
• Rows: data points (aka records)
• Columns: features (aka variables, attributes)

Feature types:
• Scale (aka continuous), e.g., 'Height', 'Weight', 'Salary', 'Temperature'
• Categorical (aka discrete)
  • Nominal: no natural ranking, e.g., 'Gender', 'Region', 'Hair color'
  • Ordinal: natural ranking, e.g., 'Level of Satisfaction'

Example: the house dataset

Data Pre-Processing

Tabular input data needs to be transformed into a matrix using the transform() built-in function.

Categorical features need special treatment:
• Recoding: mapping distinct categories into consecutive numbers starting from 1
• Dummy coding (aka one-hot encoding, one-of-K encoding)

Example: recoding

  Zipcode    Zipcode (recoded)
  96334      1
  95123      2
  95141      3
  96334      1

Example: dummy coding

  direction    dir_east  dir_west  dir_north  dir_south
  east         1         0         0          0
  west         0         1         0          0
  north        0         0         1          0
  south        0         0         0          1
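
The two encodings in the example can be sketched in a few lines of Python. This is only an illustration of the idea, not SystemML's implementation: the helper names `recode` and `dummy_code` are hypothetical, and assigning IDs in order of first appearance is an assumption that happens to match the tables above.

```python
def recode(values):
    """Map distinct categories to consecutive integers starting from 1.

    Assumption: IDs are assigned in order of first appearance.
    """
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values], mapping


def dummy_code(codes, num_categories):
    """Turn recoded values (1..num_categories) into one-hot rows."""
    return [[1 if c == k else 0 for k in range(1, num_categories + 1)]
            for c in codes]


# Recoding example from the slide
zip_codes, zip_map = recode(["96334", "95123", "95141", "96334"])

# Dummy coding example from the slide (recode first, then one-hot encode)
dir_codes, dir_map = recode(["east", "west", "north", "south"])
dir_dummy = dummy_code(dir_codes, len(dir_map))
```

Dummy coding is applied to the recoded IDs, which is why a column usually has to be recoded before it can be dummy coded.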

transform() Built-in Function

The transform() built-in function supports:
• Omitting missing values
• Missing value imputation by global_mean (scale features), global_mode (categorical features), or constant (scale/categorical features)
• Binning (equi-width)
• Scaling (scale features): mean-subtraction, z-score
• Recoding
• Dummy coding
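
As a rough illustration of three of the listed operations, the sketch below imputes missing values by the global mean, assigns equi-width bins, and computes z-scores. The function names are hypothetical, and details such as using the sample standard deviation for the z-score or placing the maximum in the last bin are assumptions, not documented SystemML behavior.

```python
import statistics


def impute_global_mean(column):
    """Replace missing entries (None) by the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def equi_width_bin(column, num_bins):
    """Assign 1-based bin IDs using bins of equal width over [min, max].

    Assumes the column is not constant (otherwise the width would be 0).
    """
    lo, hi = min(column), max(column)
    width = (hi - lo) / num_bins
    # The maximum would land in bin num_bins + 1, so clamp it to the last bin.
    return [min(int((v - lo) / width) + 1, num_bins) for v in column]


def z_score(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


heights = [150, None, 170, 190]
filled = impute_global_mean(heights)
bins = equi_width_bin(filled, 3)
z = z_score(filled)
```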

Transform Specification

• Transformations operate on individual columns
• All required transformations are specified in a JSON file
• The property na.strings in the mtd file specifies which strings denote missing values

Example: data.csv.mtd

    {
      "data_type": "frame",
      "format": "csv",
      "sep": ",",
      "header": true,
      "na.strings": [ "NA", "" ]
    }

Example: data.spec.json

    {
      "ids": true,
      "omit": [ 1, 4, 5, 6, 7, 8, 9 ],
      "impute": [
        { "id": 2, "method": "constant", "value": "south" },
        { "id": 3, "method": "global_mean" }
      ],
      "recode": [ 1, 2, 4, 5, 6, 7 ],
      "bin": [ { "id": 8, "method": "equi-width", "numbins": 3 } ],
      "dummycode": [ 2, 5, 6, 7, 8, 3 ]
    }

Combinations of transformations can be applied to the same column: in this spec, column 2 is imputed, recoded, and dummy coded.

Signature of transform()

• Invocation 1:

    output = transform (target = input, spec = specification, transformPath = "/path/to/metadata");

• Resulting metadata: number of distinct values in categorical columns, list of distinct values with their recoded IDs, number of bins, bin width, etc.
• An existing transformation can be applied to new data using the metadata generated in an earlier invocation
• Invocation 2:

    output = transform (target = input, transformPath = "/path/to/new_metadata", applyTransformPath = "/path/to/metadata");

Training/Testing

• Pre-processing training and testing data sets
• Splitting data points and labels: splitXY.dml and splitXY-dummy.dml (hands-on)
• Sampling data points: sample.dml (hands-on)
• Cross validation: cv-linreg.dml (hands-on)

Pre-Processing Training and Testing Data

Training phase:

    Train = read ("/user/ml/trainset.csv");
    Spec = read ("/user/ml/tf.spec.json", data_type = "scalar", value_type = "String");
    # Transform the training data and store the transformation metadata
    trainD = transform (target = Train, transformSpec = Spec, transformPath = "/user/ml/train_tf_metadata");
    # Build a predictive model using trainD
    ...

Testing phase:

    Test = read ("/user/ml/testset.csv");
    # Reuse the metadata generated during the training phase
    testD = transform (target = Test, transformPath = "/user/ml/test_tf_metadata", applyTransformPath = "/user/ml/train_tf_metadata");
    # Test the model using testD
    ...
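
The reason the testing phase reuses the training metadata can be illustrated with a small Python sketch for the recoding case: the category-to-ID map is built once on the training column and then applied unchanged to the test column, so the same category always gets the same ID. The names `fit_recode` and `apply_recode` are hypothetical, and mapping unseen categories to None is an assumption made for illustration only.

```python
def fit_recode(train_column):
    """Build the recode metadata (category -> ID) from training data."""
    metadata = {}
    for v in train_column:
        metadata.setdefault(v, len(metadata) + 1)
    return metadata


def apply_recode(column, metadata):
    """Apply previously generated metadata to new data.

    Assumption: categories not seen during training map to None.
    """
    return [metadata.get(v) for v in column]


train_meta = fit_recode(["east", "west", "north", "east"])
test_codes = apply_recode(["north", "east", "south"], train_meta)
```

Here "south" never appeared in the training column, so it has no ID in the metadata.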

Cross Validation

K-fold cross validation:
1. Shuffle the data points
2. Divide the data points into $k$ folds of (roughly) the same size
3. For $i = 1, \ldots, k$:
   • Train each model on all the data points that do not belong to fold $i$
   • Test each model on all the data points in fold $i$ and compute the test error
4. Select the model with the minimum average test error over all $k$ folds
5. (Train the winning model on all the data points)

[Figure: example with $k = 5$, showing which fold is used for testing and which folds are used for training]
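
The procedure above can be sketched in plain Python as follows. This is a minimal illustration, not the cv-linreg.dml script: the helper names, the use of mean squared error as the test error, and the two toy models are all assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Steps 1-2: shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(models, xs, ys, k=5, seed=0):
    """Step 3: return the average test MSE over k folds for each model.

    models maps a name to a (fit, predict) pair of functions.
    """
    folds = k_fold_indices(len(xs), k, seed)
    avg_error = {}
    for name, (fit, predict) in models.items():
        errors = []
        for i in range(k):
            test = folds[i]
            train = [j for f in range(k) if f != i for j in folds[f]]
            params = fit([xs[j] for j in train], [ys[j] for j in train])
            mse = sum((ys[j] - predict(params, xs[j])) ** 2 for j in test) / len(test)
            errors.append(mse)
        avg_error[name] = sum(errors) / k
    return avg_error


# Two toy models: a constant (mean) predictor and a line through the origin
models = {
    "mean": (lambda x, y: sum(y) / len(y), lambda p, x: p),
    "slope": (lambda x, y: sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x),
              lambda p, x: p * x),
}
xs = list(range(1, 21))
ys = [2 * x for x in xs]
avg_error = cross_validate(models, xs, ys, k=5)
best = min(avg_error, key=avg_error.get)  # step 4: minimum average test error
```

Step 5 would then refit the winning model on all data points.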

Univariate Statistics

  Row  Name of statistic            Scale  Categorical
  1    Minimum                      +
  2    Maximum                      +
  3    Range                        +
  4    Mean                         +
  5    Variance                     +
  6    Standard deviation           +
  7    Standard error of mean       +
  8    Coefficient of variation     +
  9    Skewness                     +
  10   Kurtosis                     +
  11   Standard error of skewness   +
  12   Standard error of kurtosis   +
  13   Median                       +
  14   Interquartile mean           +
  15   Number of categories                +
  16   Mode                                +
  17   Number of modes                     +

The statistics fall into four groups: central tendency measures, dispersion measures, shape measures, and categorical measures.
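
A subset of the statistics in the table can be computed with the Python standard library, as sketched below. The function names are hypothetical; using the sample variance (dividing by n - 1) is an assumption; and skewness, kurtosis, their standard errors, and the interquartile mean are left out because their exact definitions are not given on the slide.

```python
import math
import statistics
from collections import Counter


def scale_stats(values):
    """Selected univariate statistics for a scale feature."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)  # sample variance, divides by n - 1
    std_dev = math.sqrt(variance)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "std_error_of_mean": std_dev / math.sqrt(n),
        "coef_of_variation": std_dev / mean,
        "median": statistics.median(values),
    }


def categorical_stats(values):
    """Univariate statistics for a categorical feature."""
    counts = Counter(values)
    top = max(counts.values())
    modes = [c for c, f in counts.items() if f == top]
    return {
        "num_categories": len(counts),
        "mode": modes[0],  # first of the most frequent categories
        "num_modes": len(modes),
    }


s = scale_stats([2, 4, 4, 4, 5, 5, 7, 9])
c = categorical_stats(["a", "b", "a", "c", "b"])
```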

Bivariate Statistics

Quantitative association between pairs of features:

I. Scale-vs-scale statistics
   • Pearson's correlation coefficient
II. Nominal-vs-nominal statistics
   • Pearson's $\chi^2$
   • Cramér's $V$
III. Nominal-vs-scale statistics
   • Eta statistic
   • $F$ statistic
IV. Ordinal-vs-ordinal statistics
   • Spearman's rank correlation coefficient
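
Scale-vs-Scale Statistics

Pearson's correlation coefficient
• A measure of linear dependence between scale features:

  $$\rho = \frac{\mathrm{Cov}(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}, \qquad \rho \in [-1, +1]$$

• $\rho^2$ measures the accuracy of $x_2 \sim x_1$:

  $$1 - \rho^2 = \frac{\sum_{i=1}^{n} (x_{i,2} - \hat{x}_{i,2})^2}{\sum_{i=1}^{n} (x_{i,2} - \bar{x}_2)^2}$$

  The numerator is the Residual Sum of Squares (RSS) and the denominator is the Total Sum of Squares (TSS); $\hat{x}_{i,2}$ is the value predicted from $x_{i,1}$ by linear regression.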
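
The relation between the correlation coefficient and the RSS/TSS ratio can be checked numerically. The sketch below is an illustration with hypothetical function names and made-up data: it computes Pearson's correlation directly and, separately, fits a least-squares line of the second feature on the first to obtain RSS/TSS.

```python
import math


def pearson(x1, x2):
    """Pearson's correlation: covariance divided by both standard deviations."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in x1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in x2))
    return cov / (s1 * s2)


def rss_over_tss(x1, x2):
    """Fit x2 ~ x1 by least squares and return RSS / TSS."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    slope = (sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
             / sum((a - m1) ** 2 for a in x1))
    intercept = m2 - slope * m1
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x1, x2))
    tss = sum((b - m2) ** 2 for b in x2)
    return rss / tss


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
rho = pearson(x1, x2)
ratio = rss_over_tss(x1, x2)  # agrees with 1 - rho**2 up to rounding
```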

Nominal-vs-Nominal Statistics

Pearson's $\chi^2$
• A measure of how much the frequencies of value pairs of two categorical features deviate from statistical independence:

  $$\chi^2 = \sum_{a,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}$$

  where $x_1$ has $k_1$ distinct categories, $x_2$ has $k_2$ distinct categories, $O_{a,b} = \#(a, b)$ are the observed frequencies, and $E_{a,b} = \frac{\#a \cdot \#b}{n}$ are the expected frequencies for all pairs $(a, b)$.

• Under the independence assumption, Pearson's $\chi^2$ is distributed approximately as $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom
• $P$-value:

  $$P = \Pr\left[\rho \ge \text{Pearson's } \chi^2 \;\middle|\; \rho \sim \chi^2(d) \text{ distribution}\right]$$

• $P \to 0$ (rapidly) as the features' dependence increases; sensitive to $n$
• Only measures the presence of dependence, not the strength of dependence
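
The statistic and its degrees of freedom can be computed directly from the observed and expected frequencies, as in the following sketch. The function name and the toy data are hypothetical; the P-value is omitted because it needs the CDF of the chi-squared distribution, which the Python standard library does not provide.

```python
from collections import Counter


def pearson_chi_squared(x1, x2):
    """Return (chi-squared statistic, degrees of freedom) for two categorical features."""
    n = len(x1)
    observed = Counter(zip(x1, x2))  # O[a, b]; missing pairs count as 0
    count_a, count_b = Counter(x1), Counter(x2)
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n  # E[a, b] = #a * #b / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    d = (len(count_a) - 1) * (len(count_b) - 1)
    return chi2, d


gender = ["m"] * 4 + ["f"] * 4
dependent = ["y"] * 4 + ["n"] * 4          # fully determined by gender
independent = ["y", "y", "n", "n"] * 2     # same distribution in both groups

chi2_dep, d = pearson_chi_squared(gender, dependent)
chi2_ind, _ = pearson_chi_squared(gender, independent)
```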
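
Nominal-vs-Nominal Statistics

Cramér's $V$
• A measure of the strength of association between two categorical features:

  $$V = \sqrt{\frac{\text{Pearson's } \chi^2}{\chi^2_{\max}}}, \qquad \chi^2_{\max} = n \cdot \min\{k_1 - 1,\; k_2 - 1\}$$

• Under the independence assumption, $V$ is distributed approximately as $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom
• $P$-value:

  $$P = \Pr\left[\rho \ge \text{Cramér's } V \;\middle|\; \rho \sim \chi^2(d) \text{ distribution}\right]$$

• $P \to 1$ (slowly) as the features' dependence increases; sensitive to $n$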
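
A minimal sketch of Cramér's V, using the standard definition as the square root of Pearson's chi-squared divided by its maximum possible value. The function name and data are hypothetical, and the sketch assumes each feature has at least two distinct categories (otherwise the maximum is zero).

```python
import math
from collections import Counter


def cramers_v(x1, x2):
    """Cramér's V = sqrt(chi2 / chi2_max), a value between 0 and 1."""
    n = len(x1)
    observed = Counter(zip(x1, x2))
    count_a, count_b = Counter(x1), Counter(x2)
    chi2 = sum(
        (observed[(a, b)] - count_a[a] * count_b[b] / n) ** 2
        / (count_a[a] * count_b[b] / n)
        for a in count_a for b in count_b
    )
    chi2_max = n * min(len(count_a) - 1, len(count_b) - 1)
    return math.sqrt(chi2 / chi2_max)


gender = ["m"] * 4 + ["f"] * 4
v_dependent = cramers_v(gender, ["y"] * 4 + ["n"] * 4)
v_independent = cramers_v(gender, ["y", "y", "n", "n"] * 2)
```

Unlike the chi-squared statistic itself, the value is normalized by the sample size, which is what makes it usable as a measure of strength.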
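
Nominal-vs-Scale Statistics

Eta statistic
• A measure of the strength of association between a categorical feature and a scale feature
• $\eta^2$ measures the accuracy of $y \sim x$, similar to the $R^2$ statistic of linear regression:

  $$\eta^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

  The numerator is the RSS and the denominator is the TSS. Here $x$ is categorical, $y$ is scale, and $\hat{y}[x]$ is the average of $y_i$ among all records with $x_i = x$.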
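
The formula amounts to predicting each record's value by the mean of its category. A short sketch with a hypothetical function name and toy data:

```python
from collections import defaultdict


def eta_squared(x, y):
    """eta^2 = 1 - RSS / TSS, where each y is predicted by its category mean."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    group_mean = {c: sum(v) / len(v) for c, v in groups.items()}
    y_bar = sum(y) / len(y)
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


# Category means are 2 and 6, overall mean is 4: RSS = 4, TSS = 20
eta2 = eta_squared(["a", "a", "b", "b"], [1, 3, 5, 7])
```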
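
Nominal-vs-Scale Statistics

$F$ statistic
• A measure of the strength of association between a categorical feature and a scale feature
• Assumptions ($x$ categorical, $y$ scale):
  • $y \sim \mathrm{Normal}(\mu, \sigma^2)$, with the same variance for all $x$
  • $x$ has a small value domain with large frequency counts; $x_i$ is non-random
  • All records are iid
• Under the independence assumption, $F$ is distributed approximately as $F(k - 1, n - k)$

  $$F = \frac{\sum_{x} \mathrm{freq}(x)\,(\hat{y}[x] - \bar{y})^2 \,/\, (k - 1)}{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2 \,/\, (n - k)} = \frac{\eta^2\,(n - k)}{(1 - \eta^2)(k - 1)}$$

  The numerator is the Explained Sum of Squares (ESS) divided by its degrees of freedom, $k - 1$; the denominator is the RSS divided by its degrees of freedom, $n - k$.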
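
Both forms of the statistic can be computed and compared numerically. The sketch below uses a hypothetical function name and toy data; the two values agree because TSS = ESS + RSS when each record is predicted by its category mean.

```python
from collections import defaultdict


def f_statistic(x, y):
    """Return F computed from ESS and RSS, and F computed from eta^2."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n, k = len(y), len(groups)
    y_bar = sum(y) / n
    group_mean = {c: sum(v) / len(v) for c, v in groups.items()}

    ess = sum(len(v) * (group_mean[c] - y_bar) ** 2 for c, v in groups.items())
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)

    f_direct = (ess / (k - 1)) / (rss / (n - k))
    eta2 = 1 - rss / tss
    f_from_eta = eta2 * (n - k) / ((1 - eta2) * (k - 1))
    return f_direct, f_from_eta


# ESS = 16, RSS = 4, k = 2, n = 4
f_direct, f_from_eta = f_statistic(["a", "a", "b", "b"], [1, 3, 5, 7])
```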
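
Ordinal-vs-Ordinal Statistics

Spearman's rank correlation coefficient
• A measure of the strength of association between two ordinal features
• Pearson's correlation coefficient applied to features with values replaced by their ranks:

  $$\rho = \frac{\mathrm{Cov}(r_1, r_2)}{\sigma_{r_1}\,\sigma_{r_2}}, \qquad \rho \in [-1, +1]$$

Example (values $x$ and their ranks $r$; tied values share the average of their ranks):

  x     r
  8     4.5
  3     2
  11    6
  8     4.5
  5     3
  2     1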
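
The ranking step in the example, where the two tied values of 8 share rank 4.5, can be sketched as follows. The function names are hypothetical; the rank correlation is then Pearson's correlation applied to the two rank vectors.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def spearman(x1, x2):
    """Pearson's correlation of the rank-transformed features."""
    return pearson(average_ranks(x1), average_ranks(x2))


ranks = average_ranks([8, 3, 11, 8, 5, 2])   # the example from the slide
rho = spearman([1, 2, 3, 4], [1, 8, 27, 64])  # monotone but non-linear relation
```

Because only ranks enter the computation, any strictly increasing relation between the two features yields a coefficient of 1.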

Stratified Statistics

Bivariate statistics measure association between pairs of features in the presence of a confounding categorical feature.

Why stratification?

  Month                  Oct         Nov         Dec         Oct-Dec
  Customers (millions)   0.6   1.4   1.4   0.6   3.0   1.0   5.0   3.0
  Promotions (0 or 1)    0     1     0     1     0     1     0     1
  Avg sales per 1000     0.4   0.5   0.9   1.0   2.5   2.6   1.8   1.3

A trend in each group is reversed and amplified if the groups are combined.
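
The combined Oct-Dec column can be reproduced from the monthly figures as a customer-weighted average, which makes the reversal explicit. The sketch below only restates the numbers from the table; the variable and function names are hypothetical.

```python
# (month, promotion) -> (customers in millions, average sales per 1000)
data = {
    ("Oct", 0): (0.6, 0.4), ("Oct", 1): (1.4, 0.5),
    ("Nov", 0): (1.4, 0.9), ("Nov", 1): (0.6, 1.0),
    ("Dec", 0): (3.0, 2.5), ("Dec", 1): (1.0, 2.6),
}


def combined_average(promotion):
    """Customer-weighted average sales over all months for one promotion value."""
    rows = [v for (_, p), v in data.items() if p == promotion]
    customers = sum(c for c, _ in rows)
    return sum(c * s for c, s in rows) / customers


without_promo = combined_average(0)  # roughly 1.8, as in the Oct-Dec column
with_promo = combined_average(1)     # roughly 1.3, as in the Oct-Dec column

# Within every month, promotions go with slightly higher sales
higher_each_month = all(data[(m, 1)][1] > data[(m, 0)][1] for m in ("Oct", "Nov", "Dec"))
```

Most customers without promotions fall in December, when sales are high for everyone, so the month acts as the confounding feature.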

Stratified Statistics

Measures of association: correlation, slope, $P$-values, etc.

Assumptions:
• Values of the confounding feature $s$ group the records into strata; within each stratum, all bivariate pairs are assumed free of confounding
• For each bivariate pair $(x, y)$, $y$ must be numerical and distributed normally given $x$
• A linear regression model for $y$ ($i$: stratum id):

  $$y_{i,j} = \alpha_i + \beta\, x_{i,j} + \varepsilon_{i,j}, \qquad \varepsilon_{i,j} \sim \mathrm{Normal}(0, \sigma^2)$$

• $\sigma^2$ is the same across all strata

Computed statistics:
• $\bar{x}_i$, $\hat{\sigma}_{x_i}$, $\bar{y}_i$, $\hat{\sigma}_{y_i}$
• For $x \sim$ strata, $y \sim$ strata, $y \sim x$ with NO strata, and $y \sim x$ AND strata
• $R^2$, slopes, std. error of slopes, $P$-values
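
For the regression model with one intercept per stratum and a shared slope, the least-squares slope can be obtained by centering x and y within each stratum. The sketch below illustrates this under that model only; the function name and data are hypothetical, and it does not compute the standard errors or P-values listed above.

```python
from collections import defaultdict


def stratified_fit(strata, x, y):
    """Least-squares fit of y[i][j] = alpha[i] + beta * x[i][j].

    Returns (alpha per stratum, common slope beta).
    """
    groups = defaultdict(list)
    for s, xi, yi in zip(strata, x, y):
        groups[s].append((xi, yi))

    sxy = sxx = 0.0
    means = {}
    for s, pts in groups.items():
        x_bar = sum(p[0] for p in pts) / len(pts)
        y_bar = sum(p[1] for p in pts) / len(pts)
        means[s] = (x_bar, y_bar)
        # Deviations are taken from the stratum means, not the global means
        sxy += sum((xi - x_bar) * (yi - y_bar) for xi, yi in pts)
        sxx += sum((xi - x_bar) ** 2 for xi, _ in pts)

    beta = sxy / sxx
    alpha = {s: y_bar - beta * x_bar for s, (x_bar, y_bar) in means.items()}
    return alpha, beta


strata = ["A", "A", "B", "B"]
x = [0, 1, 0, 1]
y = [1, 3, 10, 12]   # slope 2 in both strata, intercepts 1 and 10
alpha, beta = stratified_fit(strata, x, y)
```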