Data preparation, training and validation using SystemML by Faraz Makari Manshadi

Data Preparation and Descriptive Statistics in SystemML

Outline

• Data pre-processing and transformation
• Training/Testing/Cross Validation
• Descriptive statistics
   I. Univariate statistics
   II. Bivariate statistics
   III. Stratified statistics

Input Data Format

Input data
§ Rows: data points (aka records)
§ Columns: features (aka variables, attributes)

Feature types:
§ Scale (aka continuous), e.g., ‘Height’, ‘Weight’, ‘Salary’, ‘Temperature’
§ Categorical (aka discrete)
   § Nominal – no natural ranking, e.g., ‘Gender’, ‘Region’, ‘Hair color’
   § Ordinal – natural ranking, e.g., ‘Level of Satisfaction’

Example: The house dataset

Data Pre-Processing

Tabular input data needs to be transformed into a matrix – transform() built-in function

Categorical features need special treatment:
§ Recoding: mapping distinct categories into consecutive numbers starting from 1
§ Dummy coding (aka one-hot encoding, one-of-K encoding)

Example: recoding

Zipcode    Zipcode (recoded)
96334      1
95123      2
95141      3
96334      1

Example: dummy coding

direction    dir_east  dir_west  dir_north  dir_south
east         1         0         0          0
west         0         1         0          0
north        0         0         1          0
south        0         0         0          1
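The two encodings can be illustrated with a minimal Python sketch (Python rather than DML, and not how SystemML implements them). The helper names `recode` and `dummy_code` are hypothetical, and assigning IDs in order of first appearance is an assumption made only for this illustration.

```python
def recode(values):
    """Map distinct categories to consecutive integers starting from 1.

    Assumption: IDs are assigned in order of first appearance.
    """
    codes = {}
    recoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
        recoded.append(codes[v])
    return recoded, codes


def dummy_code(recoded, num_categories):
    """Turn recoded IDs (1..K) into one-hot rows of length K."""
    return [[1 if code == k else 0 for k in range(1, num_categories + 1)]
            for code in recoded]


# The Zipcode column from the recoding example.
zip_ids, zip_map = recode([96334, 95123, 95141, 96334])

# The direction column from the dummy coding example.
dir_ids, dir_map = recode(["east", "west", "north", "south"])
dir_onehot = dummy_code(dir_ids, len(dir_map))
```

Dummy coding is applied to the recoded IDs here, so a column has to be recoded before it is dummy coded in this sketch.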

transform() Built-in Function

The transform() built-in function supports:
§ Omitting missing values
§ Missing value imputation by global_mean (scale features), global_mode (categorical features), or constant (scale/categorical features)
§ Binning (equi-width)
§ Scaling (scale features): mean-subtraction, z-score
§ Recoding
§ Dummy coding
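As a rough illustration of three of these operations (mean imputation, equi-width binning, and z-score scaling), here is a small Python sketch. The function names are hypothetical, missing values are represented as `None`, and the z-score uses the sample standard deviation; all of these are assumptions for the example, not statements about transform() internals.

```python
import statistics


def impute_global_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def bin_equi_width(column, num_bins):
    """Assign each value a bin ID in 1..num_bins using bins of equal width.

    Assumes the column has at least two distinct values (width > 0).
    """
    lo, hi = min(column), max(column)
    width = (hi - lo) / num_bins
    # The maximum lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width) + 1, num_bins) for v in column]


def z_score(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


prices = [100.0, None, 250.0, 400.0]   # made-up scale feature
filled = impute_global_mean(prices)
bins = bin_equi_width(filled, 3)
scaled = z_score(filled)
```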

Transform Specification

§ Transformations operate on individual columns
§ All required transformations are specified in a JSON file
§ Property na.strings in the mtd file specifies missing values

Example: data.csv.mtd

    {
      "data_type": "frame",
      "format": "csv",
      "sep": ",",
      "header": true,
      "na.strings": [ "NA", "" ]
    }

Example: data.spec.json

    {
      "ids": true,
      "omit": [ 1, 4, 5, 6, 7, 8, 9 ],
      "impute": [
        { "id": 2, "method": "constant", "value": "south" },
        { "id": 3, "method": "global_mean" }
      ],
      "recode": [ 1, 2, 4, 5, 6, 7 ],
      "bin": [ { "id": 8, "method": "equi-width", "numbins": 3 } ],
      "dummycode": [ 2, 5, 6, 7, 8, 3 ]
    }

Combinations of Transformations

Signature of transform()

§ Invocation 1:

    output = transform (target = input, transformSpec = specification, transformPath = "/path/to/metadata");

§ Resulting metadata: number of distinct values in categorical columns, list of distinct values with their recoded IDs, number of bins, bin width, etc.

§ An existing transformation can be applied to new data using the metadata generated in an earlier invocation

§ Invocation 2:

    output = transform (target = input, transformPath = "/path/to/new_metadata", applyTransformPath = "/path/to/metadata");


Training/Testing

§ Pre-processing training and testing data sets
§ Splitting data points and labels – splitXY.dml and splitXY-dummy.dml (hands-on)
§ Sampling data points – sample.dml (hands-on)
§ Cross Validation – cv-linreg.dml (hands-on)

Pre-Processing Training and Testing Data

Training phase

    Train = read ("/user/ml/trainset.csv");
    Spec = read ("/user/ml/tf.spec.json", data_type = "scalar", value_type = "String");
    trainD = transform (target = Train,
                        transformSpec = Spec,
                        transformPath = "/user/ml/train_tf_metadata");

    # Build a predictive model using trainD...

Testing phase

    Test = read ("/user/ml/testset.csv");
    testD = transform (target = Test,
                       transformPath = "/user/ml/test_tf_metadata",
                       applyTransformPath = "/user/ml/train_tf_metadata");

    # Test the model using testD...
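The reason the test data reuses the training metadata is that both sets must share one encoding. The idea can be sketched in Python as follows; the helper names are hypothetical, and mapping categories unseen during training to `None` is an assumed policy chosen for the illustration, not a description of what transform() does.

```python
def fit_recode_map(train_column):
    """Learn category -> ID metadata from the training column only."""
    mapping = {}
    for value in train_column:
        # setdefault leaves an existing ID untouched for repeated categories.
        mapping.setdefault(value, len(mapping) + 1)
    return mapping


def apply_recode_map(column, mapping, unknown=None):
    """Recode data with metadata learned earlier.

    Assumption: categories not seen during training map to `unknown`.
    """
    return [mapping.get(value, unknown) for value in column]


train_dir = ["east", "west", "north", "east"]   # made-up training column
test_dir = ["north", "east", "south"]           # made-up test column

metadata = fit_recode_map(train_dir)
train_ids = apply_recode_map(train_dir, metadata)
test_ids = apply_recode_map(test_dir, metadata)
```

Because `metadata` comes from the training column alone, "north" receives the same ID in both sets, which is what keeps a model trained on `train_ids` applicable to `test_ids`.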

Cross Validation

K-fold Cross Validation:
1. Shuffle the data points
2. Divide the data points into k folds of (roughly) the same size
3. For i = 1, …, k:
   • Train each model on all the data points that do not belong to fold i
   • Test each model on all the data points in fold i and compute the test error
4. Select the model with the minimum average test error over all k folds
5. (Train the winning model on all the data points)

Example: k = 5 [figure: folds labeled Testing and Training]
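The k-fold procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not the cv-linreg.dml script: the two candidate models (`fit_mean`, `fit_line`), the mean squared error as test error, and the synthetic data are all made up for the example.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(models, xs, ys, k=5, seed=0):
    """models maps a name to fit(xs, ys), which returns a predict(x) function."""
    folds = k_fold_indices(len(xs), k, seed)
    avg_error = {}
    for name, fit in models.items():
        errors = []
        for fold in folds:
            held_out = set(fold)
            train = [j for j in range(len(xs)) if j not in held_out]
            predict = fit([xs[j] for j in train], [ys[j] for j in train])
            # Test error on the held-out fold: mean squared error.
            errors.append(statistics.fmean(
                [(ys[j] - predict(xs[j])) ** 2 for j in fold]))
        avg_error[name] = statistics.fmean(errors)
    best = min(avg_error, key=avg_error.get)
    return best, avg_error


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple least-squares line through the training points."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]          # synthetic, exactly linear data
best, avg_error = cross_validate({"mean": fit_mean, "line": fit_line}, xs, ys, k=5)
```

Step 5 of the procedure would then call the winning fit function once more on all of `xs` and `ys`.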


Univariate Statistics

Row  Name of Statistic            Scale  Category
1    Minimum                      +
2    Maximum                      +
3    Range                        +
4    Mean                         +
5    Variance                     +
6    Standard deviation           +
7    Standard error of mean       +
8    Coefficient of variation     +
9    Skewness                     +
10   Kurtosis                     +
11   Standard error of skewness   +
12   Standard error of kurtosis   +
13   Median                       +
14   Interquartile mean           +
15   Number of categories                +
16   Mode                                +
17   Number of modes                     +

The statistics fall into four groups: central tendency measures, dispersion measures, shape measures, and categorical measures.
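A few of these statistics can be computed with the Python standard library, as in the sketch below. It uses the sample variance (divisor n − 1), which is an assumption; skewness, kurtosis, and their standard errors are left out because their exact conventions are not given here. The function names and data are made up.

```python
import math
import statistics
from collections import Counter


def scale_stats(values):
    """Selected univariate statistics for a scale feature."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.variance(values)      # sample variance, divisor n - 1
    sd = math.sqrt(var)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "variance": var,
        "std_dev": sd,
        "std_error_of_mean": sd / math.sqrt(n),
        "coeff_of_variation": sd / mean,
        "median": statistics.median(values),
    }


def categorical_stats(values):
    """Univariate statistics for a categorical feature."""
    counts = Counter(values)
    top = max(counts.values())
    modes = [c for c, f in counts.items() if f == top]
    return {"num_categories": len(counts), "mode": modes[0], "num_modes": len(modes)}


s = scale_stats([2.0, 4.0, 4.0, 6.0, 9.0])
c = categorical_stats(["east", "west", "east", "north", "west"])
```

When several categories tie for the highest frequency, this sketch reports the first one encountered as the mode and counts the ties in `num_modes`.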

Bivariate Statistics

Quantitative association between pairs of features

I. Scale-vs-Scale statistics
   § Pearson’s correlation coefficient
II. Nominal-vs-Nominal statistics
   § Pearson’s $\chi^2$
   § Cramér's $V$
III. Nominal-vs-Scale statistics
   § Eta statistic
   § $F$ statistic
IV. Ordinal-vs-Ordinal statistics
   § Spearman’s rank correlation coefficient

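Scale-vs-Scale Statistics

Pearson’s correlation coefficient
§ A measure of linear dependence between scale features

$$\rho = \frac{\mathrm{Cov}(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}, \qquad \rho \in [-1, +1]$$

§ $\rho^2$ measures accuracy of $x_2 \sim x_1$

$$1 - \rho^2 = \frac{\sum_{i=1}^{n} (x_{i,2} - \hat{x}_{i,2})^2}{\sum_{i=1}^{n} (x_{i,2} - \bar{x}_2)^2}$$

where the numerator is the Residual Sum of Squares (RSS) and the denominator is the Total Sum of Squares (TSS).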
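The relation between the correlation coefficient and RSS/TSS can be checked numerically. The Python sketch below computes Pearson's coefficient directly and, separately, the RSS/TSS ratio of a least-squares line of the second feature on the first; the data and function names are made up for illustration.

```python
import math


def pearson(x1, x2):
    """Pearson's correlation coefficient: Cov(x1, x2) / (sigma_x1 * sigma_x2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in x1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in x2))
    return cov / (s1 * s2)      # the 1/n factors cancel


def rss_over_tss(x1, x2):
    """RSS / TSS for the least-squares line that predicts x2 from x1."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    slope = (sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
             / sum((a - m1) ** 2 for a in x1))
    fitted = [m2 + slope * (a - m1) for a in x1]
    rss = sum((b - f) ** 2 for b, f in zip(x2, fitted))
    tss = sum((b - m2) ** 2 for b in x2)
    return rss / tss


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
rho = pearson(x1, x2)
ratio = rss_over_tss(x1, x2)    # agrees with 1 - rho**2 up to rounding
```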

Nominal-vs-Nominal Statistics

Pearson’s $\chi^2$
§ A measure of how much the frequencies of value pairs of two categorical features deviate from statistical independence

$$\chi^2 = \sum_{a,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}$$

where
   $x_1$ with $k_1$ distinct categories
   $x_2$ with $k_2$ distinct categories
   $O_{a,b} = \#(a, b)$: observed frequencies
   $E_{a,b} = \frac{\#a \cdot \#b}{n}$: expected frequencies for all pairs $(a, b)$

§ Under the independence assumption, Pearson’s $\chi^2$ is distributed approximately $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom

§ $P$-value:

$$P = \Pr\left[\rho \ge \text{Pearson's } \chi^2 \mid \rho \sim \chi^2(d) \text{ distribution}\right]$$

§ $P \to 0$ (rapidly) as the features’ dependence increases; sensitive to $n$
§ Only measures the presence of dependence, not the strength of dependence
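A small Python sketch of the $\chi^2$ computation from observed and expected frequencies follows. The function name and the toy columns are hypothetical; the $P$-value is not computed because it needs the $\chi^2$ distribution function, which the standard library does not provide.

```python
from collections import Counter


def pearson_chi2(x1, x2):
    """Return (chi2, d) for two categorical columns of equal length."""
    n = len(x1)
    observed = Counter(zip(x1, x2))        # O[a, b] = #(a, b)
    count_a, count_b = Counter(x1), Counter(x2)
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n     # E[a, b] = #a * #b / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    d = (len(count_a) - 1) * (len(count_b) - 1)        # degrees of freedom
    return chi2, d


region = ["N", "N", "S", "S"]
# Second feature fully determined by the first: strong deviation from independence.
chi2_dep, d_dep = pearson_chi2(region, ["r", "r", "b", "b"])
# Every pair equally frequent: no deviation from independence.
chi2_ind, d_ind = pearson_chi2(region, ["r", "b", "r", "b"])
```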

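Nominal-vs-Nominal Statistics

Cramér's $V$
§ A measure for the strength of association between two categorical features

$$V = \sqrt{\frac{\text{Pearson's } \chi^2}{\chi^2_{\max}}}, \qquad \chi^2_{\max} = n \cdot \min\{k_1 - 1,\, k_2 - 1\}$$

§ Under the independence assumption, $V$ is distributed approximately $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom

§ $P$-value:

$$P = \Pr\left[\rho \ge \text{Cramér's } V \mid \rho \sim \chi^2(d) \text{ distribution}\right]$$

§ $P \to 1$ (slowly) as the features’ dependence increases; sensitive to $n$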
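Cramér's $V$ rescales Pearson's $\chi^2$ by its largest possible value, which a short Python sketch makes concrete. It uses the usual square-root form of $V$; the function name and toy columns are made up, and both columns are assumed to have at least two categories.

```python
import math
from collections import Counter


def cramers_v(x1, x2):
    """Cramér's V = sqrt(chi2 / (n * min(k1 - 1, k2 - 1)))."""
    n = len(x1)
    observed = Counter(zip(x1, x2))
    count_a, count_b = Counter(x1), Counter(x2)
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    chi2_max = n * min(len(count_a) - 1, len(count_b) - 1)
    return math.sqrt(chi2 / chi2_max)


region = ["N", "N", "S", "S"]
v_dependent = cramers_v(region, ["r", "r", "b", "b"])     # perfect association
v_independent = cramers_v(region, ["r", "b", "r", "b"])   # no association
```

Unlike the raw $\chi^2$ value, the result stays in $[0, 1]$ regardless of the number of records.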

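Nominal-vs-Scale Statistics

Eta statistic
§ A measure for the strength of association between a categorical feature and a scale feature

$$\eta^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where the numerator is RSS, the denominator is TSS, and
   $x$ categorical
   $y$ scale
   $\hat{y}[x]$: average of $y_i$ among all records with $x_i = x$

§ $\eta^2$ measures accuracy of $y \sim x$, similar to the $R^2$ statistic of linear regression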
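The eta statistic can be computed by predicting each record with its category mean, as in this Python sketch. The function name and the four-record data set are hypothetical.

```python
from collections import defaultdict


def eta_squared(x, y):
    """eta^2 = 1 - RSS / TSS, where each y_i is predicted by its category mean."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    group_mean = {c: sum(v) / len(v) for c, v in groups.items()}
    y_bar = sum(y) / len(y)
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


x = ["a", "a", "b", "b"]     # categorical feature
y = [1.0, 3.0, 5.0, 7.0]     # scale feature
eta2 = eta_squared(x, y)
```

Here the category means are 2 and 6 and the overall mean is 4, so RSS = 4, TSS = 20, and $\eta^2$ = 0.8.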

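Nominal-vs-Scale Statistics

$F$ statistic
§ A measure for the strength of association between a categorical feature and a scale feature
§ Assumptions ($x$ categorical, $y$ scale):
   § $y \sim \mathrm{Normal}(\mu, \sigma^2)$, same variance for all $x$
   § $x$ has a small value domain with large frequency counts, $x_i$ non-random
   § All records are iid
§ Under the independence assumption, $F$ is distributed approximately $F(k - 1, n - k)$

$$F = \frac{\sum_{x} \mathrm{freq}(x)\,(\hat{y}[x] - \bar{y})^2 / (k - 1)}{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2 / (n - k)} = \frac{\eta^2 (n - k)}{(1 - \eta^2)(k - 1)}$$

where the numerator sum is the Explained Sum of Squares (ESS), the denominator sum is RSS, and $k - 1$ and $n - k$ are the respective degrees of freedom.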
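Both forms of the $F$ statistic can be compared numerically. The Python sketch below computes $F$ once from ESS and RSS and once from $\eta^2$; the function name and data are made up, and the sketch relies on TSS = ESS + RSS, which holds when each record is predicted by its category mean.

```python
from collections import defaultdict


def f_statistic(x, y):
    """Return F computed from ESS/RSS and F computed from eta^2."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n, k = len(y), len(groups)
    y_bar = sum(y) / n
    group_mean = {c: sum(v) / len(v) for c, v in groups.items()}
    # ESS: squared distance of each category mean from the overall mean,
    # weighted by the category frequency.
    ess = sum(len(v) * (group_mean[c] - y_bar) ** 2 for c, v in groups.items())
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    f_direct = (ess / (k - 1)) / (rss / (n - k))
    eta2 = 1 - rss / (ess + rss)           # TSS = ESS + RSS for category means
    f_from_eta = eta2 * (n - k) / ((1 - eta2) * (k - 1))
    return f_direct, f_from_eta


x = ["a", "a", "b", "b"]
y = [1.0, 3.0, 5.0, 7.0]
f_direct, f_from_eta = f_statistic(x, y)
```

With these four records, ESS = 16 and RSS = 4 with $k = 2$ and $n = 4$, so both expressions give $F = 8$ up to rounding.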

Ordinal-vs-Ordinal Statistics

Spearman’s rank correlation coefficient
§ A measure for the strength of association between two ordinal features
§ Pearson’s correlation coefficient applied to features with values replaced by their ranks

$$\rho = \frac{\mathrm{Cov}(r_1, r_2)}{\sigma_{r_1}\,\sigma_{r_2}}, \qquad \rho \in [-1, +1]$$

Example: values $x$ and their ranks $r$ (tied values share the average rank)

x     r
8     4.5
3     2
11    6
8     4.5
5     3
2     1
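The rank replacement with averaged ties can be sketched in Python as follows. `average_ranks` reproduces the ranks of the example values 8, 3, 11, 8, 5, 2; the second feature `y` is made up so that a coefficient can be computed, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # mean of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / math.sqrt(sum((p - ma) ** 2 for p in a)
                           * sum((q - mb) ** 2 for q in b))


def spearman(x1, x2):
    """Pearson's coefficient applied to the ranks of both features."""
    return pearson(average_ranks(x1), average_ranks(x2))


x = [8, 3, 11, 8, 5, 2]
r = average_ranks(x)
y = [7, 1, 9, 6, 4, 2]       # hypothetical second ordinal feature
rho = spearman(x, y)
```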

Stratified Statistics

Bivariate statistics that measure association between pairs of features in the presence of a confounding categorical feature

Why stratification?

Month                   Oct          Nov          Dec          Oct-Dec
Customers (millions)    0.6   1.4    1.4   0.6    3.0   1.0    5.0   3.0
Promotions (0 or 1)     0     1      0     1      0     1      0     1
Avg sales per 1000      0.4   0.5    0.9   1.0    2.5   2.6    1.8   1.3

A trend in each group is reversed and amplified if the groups are combined
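The reversal can be verified from the table itself. The Python sketch below recomputes the combined Oct-Dec averages as customer-weighted means of the monthly figures; the variable and function names are made up, and the numbers are the ones in the table.

```python
# Customers (millions) and average sales per 1000, keyed by month and promotion flag.
customers = {"Oct": {0: 0.6, 1: 1.4}, "Nov": {0: 1.4, 1: 0.6}, "Dec": {0: 3.0, 1: 1.0}}
avg_sales = {"Oct": {0: 0.4, 1: 0.5}, "Nov": {0: 0.9, 1: 1.0}, "Dec": {0: 2.5, 1: 2.6}}


def pooled_average(promo):
    """Customer-weighted average sales over all months for one promotion value."""
    total_customers = sum(customers[m][promo] for m in customers)
    total_sales = sum(customers[m][promo] * avg_sales[m][promo] for m in customers)
    return total_sales / total_customers


# Effect of the promotion within each month (each stratum).
within = {m: avg_sales[m][1] - avg_sales[m][0] for m in customers}

# Effect of the promotion when the months are combined.
pooled_effect = pooled_average(1) - pooled_average(0)
```

Within every month the promotion raises average sales by about 0.1, while the combined figures (1.3 with promotion against 1.8 without) suggest a drop of 0.5, because most promoted customers fall in the low-sales months.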

Stratified Statistics

Measure of associations: correlation, slope, $P$-values, etc.

Assumptions:
• Values of the confounding feature $s$ group the records into strata; within each stratum all bivariate pairs are assumed free of confounding
• For each bivariate pair $(x, y)$, $y$ must be numerical and $y$ distributed normally given $x$
• A linear regression model for $y$ ($i$: stratum id):

$$y_{i,j} = \alpha_i + \beta x_{i,j} + \varepsilon_{i,j}, \qquad \varepsilon_{i,j} \sim \mathrm{Normal}(0, \sigma^2)$$

• $\sigma^2$ same across all strata

Computed statistics:
• $\bar{x}_i$, $\hat{\sigma}_{x_i}$, $\bar{y}_i$, $\hat{\sigma}_{y_i}$
• For $x \sim$ strata, $y \sim$ strata, $y \sim x$ NO strata, and $y \sim x$ AND strata
• $R^2$, slopes, std. error of slopes, $P$-values
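One way to fit a model with a shared slope and one intercept per stratum is to center $x$ and $y$ within each stratum before estimating the slope. The Python sketch below does this by least squares and compares the result with a slope fitted while ignoring the strata. It is an illustration under that assumption, not the SystemML stratified statistics script; all names and data are made up.

```python
from collections import defaultdict


def stratified_fit(strata, x, y):
    """Least-squares fit of y = alpha_i + beta * x with one intercept per stratum."""
    groups = defaultdict(list)
    for s, xi, yi in zip(strata, x, y):
        groups[s].append((xi, yi))
    means = {s: (sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts))
             for s, pts in groups.items()}
    # Center within each stratum, then estimate the common slope.
    sxy = sum((xi - means[s][0]) * (yi - means[s][1])
              for s, xi, yi in zip(strata, x, y))
    sxx = sum((xi - means[s][0]) ** 2 for s, xi in zip(strata, x))
    beta = sxy / sxx
    alpha = {s: my - beta * mx for s, (mx, my) in means.items()}
    return alpha, beta


def pooled_slope(x, y):
    """Slope of y ~ x when the strata are ignored."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


strata = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [12.0, 14.0, 16.0, 2.0, 4.0, 6.0]   # y = 10 + 2x in "a", y = -10 + 2x in "b"

alpha, beta = stratified_fit(strata, x, y)
no_strata = pooled_slope(x, y)
```

In this toy data the stratified slope is 2 in both strata, while the slope fitted without strata is negative, which is the kind of confounding that stratification is meant to remove.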
