Chapter 2: Getting to Know Your Data - Computer Science · 2017-01-29 · Chapter 2: Getting to...

Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types

- Measuring Data Similarity and Dissimilarity

Types of Data Sets

- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
  - Molecular structures
- Ordered
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image, and multimedia
  - Spatial data: maps
  - Image data
  - Video data

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk

Data Objects

- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Also called samples, examples, instances, data points, objects, tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.

Attributes

- Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
  - E.g., customer_ID, name, address
- Attribute types:
  - Non-numeric: symbolic, non-quantitative
    - Nominal (categorical)
    - Binary
    - Ordinal
  - Numeric: quantitative, a measurable quantity, integer or real
    - Interval-scaled
    - Ratio-scaled

Non-numeric Attribute Types

- Nominal (categorical): categories, states, or "names of things"
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - marital status, occupation, ID numbers, zip codes
- Binary
  - Nominal attribute with only 2 states
  - Symmetric binary: both outcomes equally important
    - e.g., gender
  - Asymmetric binary: outcomes not equally important
    - e.g., medical test (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the difference between successive values is unknown.
  - Size = {small, medium, large}, letter grades, army rankings

Numeric Attribute Types

- Interval-scaled: the difference between two values is meaningful
  - Measured on a scale of equal-sized units
    - E.g., temperature in °C or °F, pH, calendar dates
  - No true zero
    - Neither 0 °C nor 0 °F indicates no heat
    - Without a true zero, we cannot speak of one temperature value as being a multiple of another: we cannot say 10 °C is twice as warm as 5 °C
- Ratio-scaled: has all the properties of an interval variable, and also has a clear definition of zero (meaning none)
  - E.g., temperature in kelvin (0 K does mean no heat), length, weight, counts, monetary quantities
  - We can speak of a value as being a multiple (or ratio) of another
    - 10 K is twice as high as 5 K

Discrete vs. Continuous Attributes

- Another way to categorize data types
- Discrete
  - Has only a countable (finite or countably infinite) set of values
    - E.g., zip codes, the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous
  - Has real numbers (which are uncountable) as attribute values
    - E.g., temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types

- Measuring Data Similarity and Dissimilarity

Similarity and Dissimilarity

- Similarity
  - Numerical measure of how alike two data objects are
  - Value is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies

Data Matrix and Dissimilarity Matrix

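- Data matrix
  - n x p, object-by-variable
  - n data points with p dimensions
  - Two modes: stores both objects and attributes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity matrix
  - n x n, object-by-object
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode, as it only stores dissimilarity values

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$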
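
To make the relationship between the two structures concrete, here is a minimal Python sketch that turns an n x p data matrix into a lower-triangular n x n dissimilarity matrix. The function name `dissimilarity_matrix` and the sample data are illustrative assumptions, and Euclidean distance is used only as a stand-in for whichever dissimilarity measure fits the attribute types.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def dissimilarity_matrix(data, d=dist):
    """Return the lower triangle of the n x n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i-1), followed by the diagonal entry 0.
    """
    return [[d(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(len(data))]

# Hypothetical data matrix: 3 objects (rows) with 2 numeric attributes (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = dissimilarity_matrix(data)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0 for the measures discussed here.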

Nominal Attributes

- Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
- Method 1: Simple matching
  - m: # of matches, p: total # of variables

$$
d(i, j) = \frac{p - m}{p}
$$

- Method 2: Use a large number of binary attributes
  - creating a new binary attribute for each nominal state
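
Both methods can be sketched in a few lines of Python. The function names and the sample objects below are hypothetical; the first function applies d(i, j) = (p - m) / p directly, and the second shows the binary encoding used by Method 2.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects with p nominal attributes."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching attributes
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal attributes.
a = ("red", "A", "round")
b = ("red", "B", "round")
print(simple_matching_distance(a, b))  # one mismatch out of three attributes
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```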

Binary Attributes

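- Contingency table for binary data (q, r, s, and t count the attributes with each combination of values; p = q + r + s + t):

                       Object j
                       1        0        sum
  Object i     1       q        r        q + r
               0       s        t        s + t
               sum     q + s    r + t    p

- Distance for symmetric binary variables:

$$
d(i, j) = \frac{r + s}{q + r + s + t}
$$

- Similarity?
- Distance for asymmetric binary variables:

$$
d(i, j) = \frac{r + s}{q + r + s}
$$

- Jaccard coefficient (similarity measure for asymmetric binary variables):

$$
sim_{Jaccard}(i, j) = \frac{q}{q + r + s}
$$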

Binary Attributes: example

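- All attributes are asymmetric binary
- Let the values Y and P be 1, and the value N be 0

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   Y       N       P        N        N        N
Mary   Y       N       P        N        P        N
Jim    Y       P       N        N        N        N

$$
\begin{aligned}
d(\mathrm{jack}, \mathrm{mary}) &= \frac{0 + 1}{2 + 0 + 1} = 0.33 \\
d(\mathrm{jack}, \mathrm{jim}) &= \frac{1 + 1}{1 + 1 + 1} = 0.67 \\
d(\mathrm{jim}, \mathrm{mary}) &= \frac{1 + 2}{1 + 1 + 2} = 0.75
\end{aligned}
$$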
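
The three distances can be checked with a short Python sketch. It assumes the usual asymmetric binary distance d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 only for i, and s those that are 1 only for j; the function and variable names are illustrative.

```python
ENCODE = {"Y": 1, "P": 1, "N": 0}

# Rows of the table: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {"jack": "YNPNNN", "mary": "YNPNPN", "jim": "YPNNNN"}
vectors = {name: [ENCODE[c] for c in row] for name, row in patients.items()}

def asymmetric_binary_distance(x, y):
    """(r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (r + s) / (q + r + s)  # undefined if both objects are all zeros

for i, j in (("jack", "mary"), ("jack", "jim"), ("jim", "mary")):
    print(i, j, round(asymmetric_binary_distance(vectors[i], vectors[j]), 2))
```

Jack and Mary come out closest (0.33), and Jim and Mary farthest apart (0.75).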

Numeric Attributes: Standardization

- Z-score:

$$
z = \frac{x - \mu}{\sigma}
$$

  - x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  - the distance between the raw score and the population mean in units of the standard deviation
  - negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

  where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

  - standardized measure (z-score):

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

  - Using the mean absolute deviation is more robust than using the standard deviation
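
As a rough illustration, the two standardizations can be compared on a small sample. The function names and the data are assumptions made for this sketch; `pstdev` is the population standard deviation from Python's standard library.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize with the population mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def mad_scores(values):
    """Standardize with the mean m_f and mean absolute deviation s_f."""
    m_f = mean(values)
    s_f = mean([abs(x - m_f) for x in values])
    return [(x - m_f) / s_f for x in values]

# Hypothetical values of one numeric attribute f.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_scores(values))
print(mad_scores(values))
```

Because the deviations are not squared, an outlier inflates s_f less than it inflates σ, so its standardized value stays larger in magnitude and it remains easier to spot.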

Numeric Attributes: Minkowski Distance

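- Minkowski distance: a popular distance measure

$$
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
$$

  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
- A distance that satisfies these properties is a metric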
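
A minimal Python sketch of the Minkowski distance of order h, together with a spot check of the three metric properties on a few sample points. The function name, the points, and the chosen orders are illustrative assumptions; checking three points does not prove the properties, it only demonstrates them.

```python
def minkowski(i, j, h=2):
    """Minkowski distance of order h (h >= 1) between two p-dimensional objects."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

# Hypothetical 2-dimensional data objects.
i, j, k = (1, 2), (3, 5), (2, 0)

for h in (1, 2, 3):
    assert minkowski(i, i, h) == 0 and minkowski(i, j, h) > 0        # positive definiteness
    assert minkowski(i, j, h) == minkowski(j, i, h)                  # symmetry
    assert minkowski(i, j, h) <= minkowski(i, k, h) + minkowski(k, j, h)  # triangle inequality
    print(h, minkowski(i, j, h))
```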

Special Cases of Minkowski Distance

- h = 1: Manhattan (city block, L1 norm) distance
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$

- h = 2: Euclidean (L2 norm) distance

$$
d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
$$

- h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  - This is the maximum difference between any component (attribute) of the vectors

Minkowski Distance: Example

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

[Figure: scatter plot of the points x1, x2, x3, and x4]

Manhattan (L1)

L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)

L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum (L∞)

L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
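
The three dissimilarity matrices can be reproduced with a short Python sketch. The helper names are illustrative assumptions; the supremum distance is handled as a special case because the general formula cannot be evaluated at h = ∞.

```python
from math import inf

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(i, j, h):
    """Minkowski distance of order h; h = inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(i, j)]
    if h == inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

def lower_triangle(h):
    """Lower-triangular dissimilarity matrix, rounded to two decimals."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], h), 2) for b in names[:r + 1]]
            for r, a in enumerate(names)]

for label, h in (("L1", 1), ("L2", 2), ("Linf", inf)):
    print(label)
    for row in lower_triangle(h):
        print(row)
```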

Ordinal Variables

- Can be treated like interval-scaled
  - replace x_{if} by its rank r_{if} ∈ {1, …, M_f}
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

  - compute the dissimilarity using methods for interval-scaled variables, e.g., Euclidean distance

Ordinal Variables: Example

- Consider the data in the following table:

Student   Test
1         Excellent
2         Fair
3         Good
4         Excellent

- Here, the attribute Test has three states: fair, good, and excellent, so M_f = 3
- For step 1, the four attribute values are assigned the ranks 3, 1, 2, and 3, respectively.
- Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0
- For step 3, using Euclidean distance on the normalized ranks, a dissimilarity matrix is obtained
- Therefore, students 1 and 2 are most dissimilar, as are students 2 and 4
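
The three steps can be traced in a short Python sketch. The variable names are illustrative; with a single attribute, the Euclidean distance reduces to the absolute difference of the normalized ranks.

```python
ORDER = ["fair", "good", "excellent"]               # states from lowest to highest
tests = ["excellent", "fair", "good", "excellent"]  # students 1 to 4

# Step 1: replace each value by its rank in {1, ..., M_f}.
ranks = [ORDER.index(t) + 1 for t in tests]

# Step 2: map the ranks onto [0, 1] with z = (r - 1) / (M_f - 1).
M_f = len(ORDER)
z = [(r - 1) / (M_f - 1) for r in ranks]

# Step 3: lower-triangular dissimilarity matrix on the normalized ranks.
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]

print(ranks)
print(z)
for row in matrix:
    print(row)
```

The largest entries, both 1.0, are d(2, 1) and d(4, 2), which matches the conclusion above.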

Attributes of Mixed Types

- A database may contain multiple attribute types
- Use a weighted formula to combine their effects:

$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

  - f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
  - f is numeric: use the normalized distance
  - f is ordinal
    - Compute ranks r_{if} and

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

    - Treat z_{if} as numeric
- The indicator δ_{ij}^{(f)} is generally set to 1, but
  - if f is asymmetric binary and x_{if} = x_{jf} = 0, set the indicator to 0
    - recall that we removed t from consideration in "Distance for asymmetric binary variables"
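
A hedged Python sketch of the weighted formula follows. The `spec` format, the function name, and the sample objects are all hypothetical. The sketch also assumes that "normalized distance" for a numeric attribute means |x_if - x_jf| divided by the attribute's range, which is one common choice and is not stated above.

```python
def mixed_dissimilarity(x, y, spec):
    """Combine per-attribute dissimilarities d_ij^(f) weighted by indicators delta_ij^(f).

    spec[f] is one of ("nominal",), ("asymmetric",), ("numeric", lo, hi),
    or ("ordinal", M_f) with ordinal values given as ranks 1..M_f.
    """
    num = den = 0.0
    for a, b, (kind, *args) in zip(x, y, spec):
        delta = 1.0
        if kind == "nominal":                 # also covers symmetric binary
            d = 0.0 if a == b else 1.0
        elif kind == "asymmetric":
            if a == 0 and b == 0:
                delta = 0.0                   # negative match: drop from both sums
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            lo, hi = args
            d = abs(a - b) / (hi - lo)        # assumed range normalization
        elif kind == "ordinal":
            (m_f,) = args
            d = abs(a - b) / (m_f - 1)        # equals |z_if - z_jf|
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        num += delta * d
        den += delta
    return num / den  # undefined when every indicator is 0

# Hypothetical objects: (color, test result, score in [0, 100], grade rank of 3).
spec = [("nominal",), ("asymmetric",), ("numeric", 0.0, 100.0), ("ordinal", 3)]
x = ("red", 0, 20.0, 3)
y = ("blue", 0, 70.0, 1)
print(mixed_dissimilarity(x, y, spec))  # (1 + 0.5 + 1) / 3
```

The asymmetric attribute is excluded here because both objects have value 0, so only three attributes contribute to the average.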

Cosine Similarity

- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
- Other vector objects: gene features in micro-arrays, …
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}
$$

  where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity

- cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
- Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
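
The same calculation as a minimal Python sketch; the function name is an illustrative choice, and the vectors are the two term-frequency vectors from the example.

```python
from math import sqrt

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # undefined if either vector is all zeros

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```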