Data Mining Memahami Data
Transcript of Data Mining Memahami Data
-
8/19/2019 Data Mining Memahami Data
Data Mining: Getting to Know and Understanding Data
-
Getting to know and understanding data
Data objects and attribute types
Descriptive statistics of data
Data visualization
Measuring data similarity and dissimilarity
-
Types of Data Sets
Record
  Relational records
  Data matrix, e.g., numerical matrix
  Document data: text documents, term-frequency vector
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
Ordered
  Video data: sequence of images
  Temporal data: time-series
Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
-
Data Objects
A data object represents an entity. Examples:
  Sales database: customers, items sold, sales
  Medical database: patients, treatments
  University database: students, professors, courses
Data objects are described by attributes.
Database rows -> data objects
Database columns -> attributes
-
Attributes
Attribute (also dimension, feature, variable): represents a characteristic or feature of a data object, e.g., ID_pelanggan (customer ID), name, address
Types:
  Nominal
  Ordinal
  Binary
  Numeric:
    Interval-scaled
    Ratio-scaled
-
Attribute Types
Nominal: categories, states, or "names of things"
  Hair color, status, postal code, student ID number (NRP), etc.
Binary
  Nominal attribute with only 2 states (0 and 1)
  Symmetric binary: both outcomes are equally important
    E.g., gender
  Asymmetric binary: the outcomes are not equally important
    E.g., medical test (positive or negative)
    Convention: assign 1 to the more important outcome (e.g., HIV positive)
Ordinal
  Values have a meaningful order (ranking), but the difference between successive values is not expressed as a numeric magnitude
  Size = {small, medium, large}, grades, ranks
-
Numeric Attributes
Quantity (integer or real-valued)
Interval
  Measured on a scale of equal-sized units
  Values have an order, e.g., calendar dates
  No true zero-point
Ratio
  Inherent zero-point
  Examples: length, body weight, etc.
  A value can be stated as a multiple of another value, e.g., road A is 2 times as long as road B
-
Discrete and Continuous Attributes
Discrete attribute
  Has a finite or countably infinite set of values
  E.g., postal codes, the words in a collection of documents
  Sometimes represented as integer variables
  Note: binary attributes are a special case of discrete attributes
Continuous attribute
  Has real numbers as values, e.g., temperature, height, weight
  Continuous attributes are typically represented as floating-point variables
-
Basic Statistical Descriptions of Data
Purpose: to understand the data: central tendency, variation, and spread
Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
-
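Measuring Central Tendency
Mean (algebraic measure) (sample vs. population):
  Note: n is the sample size and N is the population size.
  Mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  Trimmed mean: the mean computed after chopping off extreme values
Median:
  Estimated by interpolation (for grouped data):
  $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
  where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$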
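The measures on this slide can be tried out with a short Python sketch. The `salaries` sample, the function names, and the trimming convention (dropping the same fraction at each end) are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end (assumed convention)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_median(lower, n, cum_freq_below, freq_median, width):
    """Interpolated median for grouped data:
    L1 + (n/2 - (sum freq)_l) / freq_median * width."""
    return lower + (n / 2 - cum_freq_below) / freq_median * width


# Hypothetical sample with one extreme value (110).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # arithmetic mean
print(median(salaries))             # average of the two middle values
print(modes(salaries))              # two modes, so the sample is bimodal
print(trimmed_mean(salaries, 0.1))  # drops the smallest and largest value
```

The trimmed mean is lower than the plain mean here because the single extreme value no longer pulls the average upward.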
-
Measuring the Dispersion of Data
Quartiles, outliers, and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 − Q1
  Five-number summary: min, Q1, median, Q3, max
  Outlier: usually a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
  Variance (algebraic, scalable computation):
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  Standard deviation s (or σ) is the square root of the variance s² (or σ²)
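A minimal sketch of the five-number summary, the 1.5 × IQR outlier rule, and the one-pass form of the sample variance follows. The data and function names are hypothetical, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; other conventions give slightly different Q1 and Q3.

```python
from statistics import quantiles, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def sample_variance_one_pass(values):
    """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1), computed from two running sums."""
    n = len(values)
    s1 = sum(values)
    s2 = sum(v * v for v in values)
    return (s2 - s1 * s1 / n) / (n - 1)


# Hypothetical sample with one extreme value (110).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(iqr_outliers(salaries))
print(sample_variance_one_pass(salaries), variance(salaries))
```

The one-pass form only needs the running sums of x and x², which is why the slide calls the computation scalable; on data with very large values it can lose precision compared with the two-pass definition.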
-
Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
Boxplot
  Data is represented with a box
  The ends of the box are at the first and third quartiles, so the height of the box is the IQR
  The median is marked by a line within the box
  Outliers: plotted individually, outside the box
-
Properties of the Normal Distribution Curve
The normal curve (μ: mean, σ: standard deviation)
  From μ−σ to μ+σ: contains about 68% of the measurements
  From μ−2σ to μ+2σ: contains about 95% of the measurements
  From μ−3σ to μ+3σ: contains about 99.7% of the measurements
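The percentages above can be checked numerically. This sketch uses the standard identity P(|X − μ| ≤ kσ) = erf(k/√2) for a normal variable; the function name is an illustrative choice.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_coverage(k) * 100, 1))
```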
-
Histogram Analysis
Histogram: a graph that displays a tabulation of data frequencies
-
Histograms Often Tell More than Boxplots
Two histograms can have the same boxplot
  The same values for min, Q1, median, Q3, max
  But their data distributions are different
-
Scatter plot
Provides a look at bivariate data to see clusters, outliers, etc.
Each point shows the pair of coordinates (values) of one data object
-
Positively and Negatively Correlated Data
Top left: positive correlation
Top right: negative correlation
-
Uncorrelated Data
-
Data Visualization
Why data visualization?
  Gain insight into an information space by mapping data onto graphical primitives
  Provide a qualitative overview of large data sets
  Search for patterns, trends, structure, irregularities, and relationships among data
  Help find interesting regions and suitable parameters for further quantitative analysis
-
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data
Methods
  Scatterplot and scatterplot matrices
-
Scatterplot Matrices
Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k² − k)/2 distinct scatterplots]
Used by permission of M. Ward, Worcester Polytechnic Institute
-
Similarity and Dissimilarity
Similarity
  A numerical measure of how alike two data objects are
  The value is higher when the objects are more alike
  Often falls in the range [0, 1]
-
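Data Matrix and Dissimilarity Matrix
Data matrix
  n data points with p dimensions
  Two modes
  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix
  n data points, but only the distances between them are recorded
  A triangular matrix
  Single mode
  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$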
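As a rough sketch of how the two matrices relate, the following builds the lower-triangular dissimilarity matrix from a data matrix. Euclidean distance and the four 2-dimensional points are assumptions chosen for illustration.

```python
import math


def dissimilarity_matrix(data):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i), with d(i, i) = 0."""
    return [
        [math.dist(data[i], data[j]) for j in range(i + 1)]
        for i in range(len(data))
    ]


# Hypothetical data matrix: n = 4 objects, p = 2 attributes.
data_matrix = [(1, 2), (3, 5), (2, 0), (4, 5)]

for row in dissimilarity_matrix(data_matrix):
    print([round(d, 2) for d in row])
```

Only the lower triangle is stored because d(i, j) = d(j, i), which is what makes the matrix "single mode".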
-
Proximity Measure for Nominal Attributes
Can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
Method: simple matching
  m: number of matches, p: total number of variables
  $d(i, j) = \frac{p - m}{p}$
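Simple matching can be sketched in a few lines, where m is the number of attributes on which the two objects agree and p is the total number of attributes. The example objects and attribute values are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p


# Two hypothetical objects described by (color, shape, size).
print(nominal_dissimilarity(("red", "round", "small"), ("red", "square", "small")))
```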
-
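Proximity Measure for Binary Attributes
A contingency table for binary data:

                   Object j
                   1      0      sum
  Object i   1     q      r      q+r
             0     s      t      s+t
             sum   q+s    r+t    p

Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Note: the Jaccard coefficient is the same as "coherence".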
-
Dissimilarity between Binary Variables
Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
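The three distances in this example can be reproduced with a short sketch. It assumes the usual contingency counts: q (both 1), r (first 1, second 0), s (first 0, second 1), and t (both 0), with 0-0 matches ignored for asymmetric attributes. The vectors encode Fever, Cough, and Test-1 to Test-4 with Y and P as 1 and N as 0; Gender is left out because it is symmetric.

```python
def binary_counts(x, y):
    """Contingency counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t


def asymmetric_binary_dissimilarity(x, y):
    """d = (r + s) / (q + r + s); undefined when both vectors are all zeros."""
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)


def jaccard_similarity(x, y):
    """sim = q / (q + r + s), i.e. 1 - asymmetric dissimilarity."""
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```

Jack and Mary come out as the most similar pair, while Jim and Mary are the least similar.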
-
Standardizing Numeric Data
Z-score: $z = \frac{x - \mu}{\sigma}$
  x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  The distance between the raw score and the population mean, in units of the standard deviation
  Negative when the raw score is below the mean, "+" when above
An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
Using the mean absolute deviation is more robust than using the standard deviation
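Both standardizations can be compared with a minimal sketch. The function names and the sample column are illustrative, and the standard-deviation variant assumes the population form (`statistics.pstdev`).

```python
from statistics import pstdev


def standardize_mad(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    return [(x - m) / s for x in values]


def standardize_std(values):
    """Classic z-score using the population standard deviation."""
    m = sum(values) / len(values)
    sigma = pstdev(values)
    return [(x - m) / sigma for x in values]


# Hypothetical values of one numeric attribute f.
column = [2, 4, 6, 8]

print(standardize_mad(column))
print([round(z, 3) for z in standardize_std(column)])
```

Because deviations are not squared in the mean absolute deviation, a single extreme value inflates s_f less than it inflates σ, so the z-scores of outliers stay more visible.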
-
Example: Data Matrix and Dissimilarity Matrix
Dissimilarity Matrix (with Euclidean Distance)
Data Matrix
-
Distance on Numeric Data: Minkowski Distance
Minkowski distance: a popular distance measure
  $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  d(i, j) = d(j, i) (Symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
-
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
h = 2: (L2 norm) Euclidean distance
  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  This is the maximum difference between any component (attribute) of the vectors
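The three special cases can be covered by one function. This is a sketch with illustrative names; passing `math.inf` as the order selects the supremum distance, and the two points are hypothetical.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# Two hypothetical 2-dimensional objects.
x1, x2 = (1, 2), (3, 5)

print(minkowski(x1, x2, 1))                   # Manhattan
print(round(minkowski(x1, x2, 2), 3))         # Euclidean
print(minkowski(x1, x2, math.inf))            # supremum
```

For any pair of vectors the values are ordered: supremum ≤ Euclidean ≤ Manhattan.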
-
Example: Minkowski Distance
Dissimilarity Matrices
Manhattan (L1)
Euclidean (L2)
Supremum
-
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
  Replace $x_{if}$ by its rank
  Map the range of each variable onto [0, 1]
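The rank mapping can be sketched as follows, using z = (r − 1)/(M − 1), where r is the rank of a value and M is the number of ordered states. The function name and the size levels are illustrative assumptions.

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via z = (rank - 1) / (M - 1).

    `order` lists the M states from lowest to highest (M >= 2).
    """
    rank = {level: r for r, level in enumerate(order, start=1)}
    m = len(order)
    return [(rank[v] - 1) / (m - 1) for v in values]


sizes = ["small", "large", "medium"]
print(ordinal_to_interval(sizes, order=["small", "medium", "large"]))
```

Once mapped, the values can be compared with any distance for numeric data, such as the Manhattan or Euclidean distance.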
-
Attributes of Mixed Type
A database may contain all attribute types
  Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects:
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  f is binary or nominal:
    $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  f is numeric: use the normalized distance
  f is ordinal
    Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    Treat $z_{if}$ as interval-scaled
8/19/2019 Data Mining Memahami Data
35/38
3
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
Other vector objects: gene features in micro-arrays, …
Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d
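The cosine measure translates directly into code. The two term-frequency vectors below are made up for illustration and are not the vectors from the slides.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Hypothetical term-frequency vectors over the same four-word vocabulary.
doc1 = (2, 0, 1, 0)
doc2 = (1, 0, 2, 1)

print(round(cosine_similarity(doc1, doc2), 4))
```

The measure depends only on the angle between the vectors, so a document and a copy of it repeated twice have similarity 1.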
-
Example: Cosine Similarity
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d
Ex: Find the similarity between documents 1 and 2.
d1 = (…
-
KL Divergence: Comparing Two Probability Distributions
The Kullback-Leibler (KL) divergence: measures the difference between two probability distributions over the same variable x
  From information theory, closely related to relative entropy, information divergence, and information for discrimination
$D_{KL}(p(x) \| q(x))$: divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
  Discrete form: $D_{KL}(p(x) \| q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
The KL divergence measures the expected number of extra bits required to code samples from p(x) (the "true" distribution) when using a code based on q(x), which represents a theory, model, description, or approximation of p(x)
  Its continuous form: $D_{KL}(p(x) \| q(x)) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
The KL divergence is not a distance measure and not a metric: it is asymmetric and does not satisfy the triangle inequality
-
Compute the KL Divergence
Based on the formula, $D_{KL}(P, Q) \ge 0$, and $D_{KL}(P \| Q) = 0$ if and only if P = Q.
How about when p = 0 or q = 0? With smoothing by a small constant ε, e.g., Q′: (a: 5/9 − ε/3, b: 3/9 − ε/3, c: ε, d: 1/9 − ε/3), $D_{KL}(P' \| Q')$ can be computed easily.
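A sketch of the discrete KL divergence with this kind of smoothing follows. The distribution `P`, the value of ε, and the function names are assumptions for illustration; `Q` is the unsmoothed distribution implied by Q′ above. Base-2 logarithms are used so the result is in bits.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits, for dicts mapping symbol -> probability."""
    total = 0.0
    for symbol, p_x in p.items():
        if p_x == 0:
            continue                 # p * log(p) tends to 0 as p tends to 0
        q_x = q.get(symbol, 0.0)
        if q_x == 0:
            return math.inf          # an event possible under p is impossible under q
        total += p_x * math.log2(p_x / q_x)
    return total


def smooth(dist, symbols, eps=1e-3):
    """Give each missing symbol probability eps, taken evenly from the observed ones."""
    missing = [s for s in symbols if dist.get(s, 0) == 0]
    present = [s for s in symbols if dist.get(s, 0) > 0]
    share = eps * len(missing) / len(present)
    return {s: (dist[s] - share if s in present else eps) for s in symbols}


symbols = ["a", "b", "c", "d"]
P = {"a": 3 / 5, "b": 1 / 5, "c": 1 / 5}   # hypothetical distribution
Q = {"a": 5 / 9, "b": 3 / 9, "d": 1 / 9}   # unsmoothed form of Q'

print(kl_divergence(P, Q))                              # infinite without smoothing
print(kl_divergence(smooth(P, symbols), smooth(Q, symbols)))
print(kl_divergence(smooth(Q, symbols), smooth(P, symbols)))
```

The last two values differ, which illustrates the asymmetry of the divergence.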