Chapter 2: Getting to Know Your Data - Computer Science · 2017-01-29 · Chapter 2: Getting to...
1
Chapter 2: Getting to Know Your Data
- Data Objects and Attribute Types
- Measuring Data Similarity and Dissimilarity
2
Types of Data Sets
- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents, term-frequency vector
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
  - Molecular structures
- Ordered
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image, and multimedia
  - Spatial data: maps
  - Image data
  - Video data

Transaction data example:
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
3
Data Objects
- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Also called samples, examples, instances, data points, objects, tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.
4
Attributes
- Attribute (dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  - E.g., customer_ID, name, address
- Attribute types:
  - Non-numeric: symbolic, non-quantitative
    - Nominal (categorical)
    - Binary
    - Ordinal
  - Numeric: quantitative, a measurable quantity, integer or real
    - Interval-scaled
    - Ratio-scaled
5
Non-numeric Attribute Types
- Nominal (categorical): categories, states, or "names of things"
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - marital status, occupation, ID numbers, zip codes
- Binary
  - Nominal attribute with only 2 states
  - Symmetric binary: both outcomes equally important
    - e.g., gender
  - Asymmetric binary: outcomes not equally important
    - e.g., medical test (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the difference between successive values is unknown.
  - Size = {small, medium, large}, letter grades, army rankings
6
Numeric Attribute Types
- Interval-scaled: the difference between two values is meaningful
  - Measured on a scale of equal-sized units
  - E.g., temperature in °C or °F, pH, calendar dates
  - No true zero
    - neither 0 °C nor 0 °F indicates no heat
    - without a true zero, we cannot talk of one temperature value as being a multiple of another. We cannot say 10 °C is twice as warm as 5 °C.
- Ratio-scaled: has all the properties of an interval variable, and also has a clear definition of zero (that means none)
  - e.g., temperature in kelvin (0 K does mean no heat), length, weight, counts, monetary quantities
  - We can speak of a value as being a multiple (or ratio) of another
    - 10 K is twice as high as 5 K
7
Discrete vs. Continuous Attributes
- Another way to categorize data types
- Discrete
  - Has only a countable (finite or countably infinite) set of values
  - E.g., zip codes, the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Note: Binary attributes are a special case of discrete attributes
- Continuous
  - Has real numbers (which are uncountable) as attribute values
  - E.g., temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables
9
Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - Value is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
10
Data Matrix and Dissimilarity Matrix
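- Data matrix
  - n x p, object-by-variable
  - n data points with p dimensions
  - Two modes: stores both objects and attributes

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

- Dissimilarity matrix
  - n x n, object-by-object
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode, as it only stores dissimilarity values

\[
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]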
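The step from a data matrix to a dissimilarity matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of Euclidean distance, and the three sample objects are hypothetical and do not come from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data, dist):
    """Return the lower triangle of the n x n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i). The upper triangle is omitted
    because d(i, j) = d(j, i) for a symmetric measure.
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]

# Hypothetical 3 x 2 data matrix: 3 objects (rows), 2 attributes (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data, euclidean)
for row in D:
    print(row)
```

The data matrix is two-mode (rows are objects, columns are attributes), while the result is one-mode: both its rows and its columns index objects.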
11
Nominal Attributes
- Can take 2 or more states, e.g., red, yellow, blue, green (generalization of binary attributes)
- Method 1: Simple matching
  - m: # of matches, p: total # of variables
  \[ d(i, j) = \frac{p - m}{p} \]
- Method 2: Use a large number of binary attributes
  - creating a new binary attribute for each nominal state
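Both methods are small enough to sketch directly. The following Python is a minimal illustration, assuming each object is a tuple of nominal values; the function names and the sample objects are hypothetical.

```python
def simple_matching_dissimilarity(x, y):
    """Method 1: d(i, j) = (p - m) / p, with m matches out of p attributes."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one new binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal attributes.
a = ("red", "single", "engineer")
b = ("red", "married", "engineer")
print(simple_matching_dissimilarity(a, b))  # 2 of 3 attributes match

colors = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colors))
```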
12
Binary Attributes
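- Contingency table for binary data (q, r, s, and t count the attributes with each combination of values; p = q + r + s + t):

                      Object j
                      1        0        sum
  Object i    1       q        r        q + r
              0       s        t        s + t
              sum     q + s    r + t    p

- Distance for symmetric binary variables:
  \[ d(i, j) = \frac{r + s}{q + r + s + t} \]
- Similarity?
- Distance for asymmetric binary variables:
  \[ d(i, j) = \frac{r + s}{q + r + s} \]
- Jaccard coefficient (similarity measure for asymmetric binary variables):
  \[ sim_{Jaccard}(i, j) = \frac{q}{q + r + s} \]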
13
Binary Attributes: example
- All attributes are asymmetric binary
- Let the values Y and P be 1, and the value N be 0

Name    Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack    Y       N       P        N        N        N
Mary    Y       N       P        N        P        N
Jim     Y       P       N        N        N        N
\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
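The three values can be checked with a short Python sketch. It assumes the usual definition of the asymmetric binary distance, (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 only for the first, and s those that are 1 only for the second. The function names and the string encoding of each patient are hypothetical.

```python
def encode(row):
    """Map Y and P to 1 and N to 0, as in the example."""
    return [1 if c in "YP" else 0 for c in row]

def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    if q + r + s == 0:
        # Assumption: two all-zero objects are treated as identical.
        return 0.0
    return (r + s) / (q + r + s)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
jack = encode("YNPNNN")
mary = encode("YNPNPN")
jim = encode("YPNNNN")

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```

Jim and Mary come out as the least alike pair, and Jack and Mary as the most alike.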
14
Numeric Attributes: Standardization
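- Z-score:
  \[ z = \frac{x - \mu}{\sigma} \]
  - x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  - the distance between the raw score and the population mean in units of the standard deviation
  - negative when the raw score is below the mean, positive when above
- An alternative way: Calculate the mean absolute deviation
  \[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
  where
  \[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
  - standardized measure (z-score):
    \[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
  - Using the mean absolute deviation is more robust to outliers than using the standard deviation, because the deviations are not squared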
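The two standardizations can be compared side by side. The sketch below is illustrative only: the function names and the sample values are hypothetical, and the first function uses the population standard deviation (dividing by n), which is an assumption about which σ is meant.

```python
import math

def z_scores(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]

def mad_z_scores(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]

# Hypothetical values of one attribute f across eight objects.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_scores(values))      # mean 5, standard deviation 2
print(mad_z_scores(values))  # mean 5, mean absolute deviation 1.5
```

Here s_f (1.5) is smaller than σ (2.0) because the large deviation of the value 9 is not squared, so that value keeps a larger standardized score (about 2.67 instead of 2.0) and stays easier to spot as an outlier.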
15
Numeric Attributes: Minkowski Distance
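- Minkowski distance: a popular distance measure
  \[ d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h} \]
  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
- A distance that satisfies these properties is a metric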
16
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
  \[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors
- h = 2: Euclidean (L2 norm) distance
  \[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
- h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  - This is the maximum difference between any component (attribute) of the vectors
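The remark that the Hamming distance is an instance of the Manhattan distance can be checked on a pair of bit vectors. This is a small sketch; the two vectors and the function names are hypothetical.

```python
def manhattan(x, y):
    """L1 distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical binary vectors; they differ at positions 1, 2, and 5.
u = [1, 0, 1, 1, 0, 0, 1]
v = [1, 1, 0, 1, 0, 1, 1]
print(manhattan(u, v), hamming(u, v))  # 3 3
```

For 0/1 components each absolute difference is either 0 or 1, so summing them counts the differing bits.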
17
Minkowski Distance: Example

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0

[Figure: scatter plot of the four points x1, x2, x3, and x4]
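The three tables can be reproduced with a single function parameterized by the order h. This is a minimal sketch: the function names are hypothetical, and the supremum distance is handled as a separate branch for h = infinity rather than as a numerical limit.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

# The four points from the example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

def lower_triangle(h):
    """Lower-triangular dissimilarity matrix, rounded to two decimals."""
    return [[round(minkowski(points[a], points[b], h), 2)
             for b in names[:i + 1]]
            for i, a in enumerate(names)]

for label, h in [("L1", 1), ("L2", 2), ("Linf", math.inf)]:
    print(label)
    for name, row in zip(names, lower_triangle(h)):
        print(" ", name, row)
```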
18
Ordinal Variables
- Can be treated like interval-scaled
  - replace x_{if} by their rank
    \[ r_{if} \in \{1, \ldots, M_f\} \]
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
  - compute the dissimilarity using methods for interval-scaled variables, e.g., Euclidean distance
19
Ordinal Variables: Example
- Consider the data in the following table:

Student   Test
1         Excellent
2         Fair
3         Good
4         Excellent

- Here, the attribute Test has three states: fair, good, and excellent, so M_f = 3
- For step 1, the four attribute values are assigned the ranks 3, 1, 2, and 3, respectively.
- Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0
- For step 3, using Euclidean distance, the following dissimilarity matrix is obtained:
  \[
  \begin{bmatrix}
  0   &     &     &   \\
  1.0 & 0   &     &   \\
  0.5 & 0.5 & 0   &   \\
  0   & 1.0 & 0.5 & 0
  \end{bmatrix}
  \]
- Therefore, students 1 and 2 are most dissimilar, as are students 2 and 4
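The three steps of this example can be traced in Python. This is a sketch under stated assumptions: the function name is hypothetical, and because there is only one attribute, the Euclidean distance reduces to the absolute difference of the normalized ranks.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Steps 1 and 2: rank each value, then map rank r to (r - 1) / (M - 1)."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    M = len(ordered_states)
    return [(rank[v] - 1) / (M - 1) for v in values]

# Test results of students 1 to 4, and the states from lowest to highest.
test = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_unit_interval(test, ["fair", "good", "excellent"])
print(z)  # [1.0, 0.0, 0.5, 1.0]

# Step 3: with a single attribute, Euclidean distance is |z_i - z_j|.
D = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
for row in D:
    print(row)
```

The largest entries, d(2, 1) and d(4, 2), are both 1.0, which matches the conclusion that students 1 and 2, and students 2 and 4, are the most dissimilar pairs.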
20
Attributes of Mixed Types
- A database may contain multiple attribute types
  - use a weighted formula to combine their effects
    \[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
- f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
- f is numeric: use the normalized distance
- f is ordinal
  - Compute ranks r_{if} and
    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
  - Treat z_{if} as numeric
- The indicator \delta_{ij}^{(f)} is generally set to 1, but
  - If f is asymmetric binary and x_{if} = x_{jf} = 0, set the indicator to 0
    - recall we removed t from consideration for "Distance for asymmetric binary variables"
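The weighted formula can be sketched as one loop over the attributes. Several details here are assumptions and not taken from the slides: the function name, the attribute kind labels, the sample objects, and in particular the choice to normalize a numeric difference by the attribute's range (a common convention). Ordinal attributes are assumed to have been converted to z values in [0, 1] beforehand and are passed as numeric with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted combination d(i, j) = sum(delta * d) / sum(delta).

    kinds[f] is one of "nominal", "symmetric_binary", "asymmetric_binary",
    or "numeric"; ranges[f] is the (max - min) of attribute f and is only
    read for numeric attributes (assumed normalization).
    """
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        delta = 1.0
        if kind == "asymmetric_binary" and x[f] == 0 and y[f] == 0:
            delta = 0.0  # negative matches carry no information
        if kind == "numeric":
            d = abs(x[f] - y[f]) / ranges[f]
        else:
            d = 0.0 if x[f] == y[f] else 1.0
        num += delta * d
        den += delta
    # Assumption: if every indicator is 0, report no dissimilarity.
    return num / den if den else 0.0

# Hypothetical objects: color (nominal), test result (asymmetric binary),
# age (numeric, range 50), and an ordinal grade already mapped to [0, 1].
kinds = ["nominal", "asymmetric_binary", "numeric", "numeric"]
ranges = [None, None, 50.0, 1.0]
x = ("red", 0, 30.0, 1.0)
y = ("blue", 0, 40.0, 0.5)
print(mixed_dissimilarity(x, y, kinds, ranges))  # (1 + 0.2 + 0.5) / 3
```

The asymmetric binary attribute is 0 for both objects, so its indicator is 0 and it drops out of both the numerator and the denominator, leaving three attributes in the average.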
21
Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
- Other vector objects: gene features in micro-arrays, …
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d
22
Example: Cosine Similarity
- cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d
- Ex: Find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
  cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
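The same computation in Python, as a minimal sketch: the function name is hypothetical, and the function assumes neither vector is all zeros (otherwise the division fails).

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Term-frequency vectors of documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

A value close to 1 means the two documents use the shared terms in similar proportions, regardless of document length.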