Preparing the Data
![Page 1: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/1.jpg)
Preparing the Data
![Page 2: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/2.jpg)
What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  ▫ Examples: eye color, temperature, etc.
  ▫ An attribute is also known as a variable, field, or characteristic
• A collection of attributes describes an object
  ▫ An object is also known as a record, point, case, sample, or entity

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

Columns are attributes; rows are objects.
![Page 3: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/3.jpg)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  ▫ The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  ▫ Different attributes can be mapped to the same set of values
    Example: the attribute values for ID and age are both integers
    But the properties of the attribute values can differ: ID has no minimum or maximum bound, whereas age does
![Page 4: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/4.jpg)
Attribute Types
• There are different types of attributes:
  ▫ Nominal
    Examples: ID numbers, eye color, zip codes
  ▫ Ordinal
    Examples: rankings (e.g., the taste of potato chips on a scale of 1-10), grades, height in {tall, medium, short}
  ▫ Interval
    Examples: calendar dates, temperature in Celsius or Fahrenheit
  ▫ Ratio
    Examples: temperature in Kelvin, length, time, counts
![Page 5: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/5.jpg)
Properties of Attribute Values /1
• The type of an attribute depends on which of the following properties it possesses:
  ▫ Distinctness: = ≠
  ▫ Order: < >
  ▫ Addition: + -
  ▫ Multiplication: * /
• Each attribute type adds one property to the previous type:
  ▫ Nominal attribute: distinctness
  ▫ Ordinal attribute: distinctness & order
  ▫ Interval attribute: distinctness, order & addition
  ▫ Ratio attribute: all 4 properties
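
The cumulative hierarchy above can be written down as a small lookup. This is only a sketch: the names `PROPERTIES`, `ATTRIBUTE_TYPES`, `supported_properties`, and `supports` are illustrative and do not come from the slides.

```python
# Properties in the order the slide introduces them.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Attribute types ordered from least to most expressive.
ATTRIBUTE_TYPES = ["nominal", "ordinal", "interval", "ratio"]


def supported_properties(attribute_type):
    """Return the properties an attribute type supports (cumulative)."""
    level = ATTRIBUTE_TYPES.index(attribute_type)
    return PROPERTIES[: level + 1]


def supports(attribute_type, prop):
    """True if values of this attribute type support the given property."""
    return prop in supported_properties(attribute_type)
```

For example, `supports("interval", "multiplication")` is false, which is why a ratio of two Celsius temperatures is not meaningful.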
![Page 6: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/6.jpg)
Properties of Attribute Values /2

| Attribute Type | Description | Examples | Operations |
|----------------|-------------|----------|------------|
| Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test |
| Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests |
| Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests |
| Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation |
![Page 7: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/7.jpg)
Properties of Attribute Values /3

| Attribute Level | Transformation | Comments |
|-----------------|----------------|----------|
| Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference? |
| Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}. |
| Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). |
| Ratio | new_value = a * old_value | Length can be measured in meters or feet. |
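
A short sketch of the permissible transformations, using the temperature and length examples from the table. The helper names (`preserves_order`, `preserves_ratio`, `ORDINAL_RECODING`) are illustrative assumptions; the conversion constants are the standard ones.

```python
import math


def celsius_to_fahrenheit(c):
    # Interval-level transformation: new_value = a * old_value + b
    return 1.8 * c + 32.0


def meters_to_feet(m):
    # Ratio-level transformation: new_value = a * old_value
    return m / 0.3048


def preserves_order(values, f):
    """True if f maps increasing values to strictly increasing values."""
    ordered = sorted(set(values))
    mapped = [f(v) for v in ordered]
    return all(a < b for a, b in zip(mapped, mapped[1:]))


def preserves_ratio(x, y, f):
    """True if the ratio x / y is unchanged after applying f."""
    return math.isclose(f(x) / f(y), x / y)


# Ordinal recoding from the table: {1, 2, 3} -> {0.5, 1, 10}
ORDINAL_RECODING = {1: 0.5, 2: 1, 3: 10}
```

Here `preserves_ratio(20, 10, meters_to_feet)` holds, while `preserves_ratio(20, 10, celsius_to_fahrenheit)` does not, which reflects the difference between ratio and interval attributes.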
![Page 8: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/8.jpg)
Discrete and Continuous Attributes
• Discrete Attribute
  ▫ Has a finite or countably infinite set of values
  ▫ Examples: zip codes, the set of words in a collection of documents
  ▫ Often represented as integer variables
  ▫ Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  ▫ Has real numbers as attribute values
  ▫ Examples: temperature, height, or weight
  ▫ In practice, can only be measured and represented using a finite number of digits
  ▫ Typically represented as floating-point variables
![Page 9: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/9.jpg)
Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
• Examples:
  ▫ Words present in documents
  ▫ Items present in customer transactions

| | team | coach | play | ball | score | game | win | lost | timeout | season |
|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2 |
| Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0 |
| Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
![Page 10: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/10.jpg)
Types of data sets
• Record
  ▫ Data Matrix
  ▫ Document Data
  ▫ Transaction Data
• Graph
  ▫ World Wide Web
  ▫ Molecular Structures
• Ordered
  ▫ Spatial Data
  ▫ Temporal Data
  ▫ Sequential Data
  ▫ Genetic Sequence Data
![Page 11: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/11.jpg)
Important characteristics of structured data
• Dimensionality
• Sparsity
  ▫ Only presence counts
• Resolution
  ▫ Patterns depend on the scale
![Page 12: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/12.jpg)
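Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes.

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |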
![Page 13: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/13.jpg)
Data Matrix
• If the data objects all have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute.
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

| Projection of x load | Projection of y load | Distance | Load | Thickness |
|----------------------|----------------------|----------|------|-----------|
| 10.23 | 5.27 | 15.22 | 2.7 | 1.2 |
| 12.65 | 6.25 | 16.22 | 2.2 | 1.1 |
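
As a minimal sketch, the two objects from this slide can be stored as an m by n list of rows. The identifiers `ATTRIBUTES`, `DATA`, `shape`, and `column` are illustrative names, not part of the slides.

```python
# One column per attribute (n = 5), one row per object (m = 2).
ATTRIBUTES = [
    "projection_of_x_load",
    "projection_of_y_load",
    "distance",
    "load",
    "thickness",
]

DATA = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]


def shape(matrix):
    """Return (m, n): number of objects and number of attributes."""
    return len(matrix), len(matrix[0])


def column(matrix, name):
    """Return the values of one attribute across all objects."""
    j = ATTRIBUTES.index(name)
    return [row[j] for row in matrix]
```

Each row is then a point in a 5-dimensional space, and `column(DATA, "load")` reads one dimension across all objects.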
![Page 14: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/14.jpg)
Document Data
• Each document becomes a 'term' vector
  ▫ Each term is a component (attribute) of the vector
  ▫ The value of each component is the number of times the corresponding term occurs in the document

| | team | coach | play | ball | score | game | win | lost | timeout | season |
|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2 |
| Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0 |
| Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
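
A minimal sketch of turning a document into a term vector by counting occurrences. The vocabulary reuses the ten terms from the slide; the whitespace tokenization and the sample sentence in the usage note are simplifying assumptions.

```python
from collections import Counter

# The ten terms shown on the slide, used as vector components.
VOCABULARY = [
    "team", "coach", "play", "ball", "score",
    "game", "win", "lost", "timeout", "season",
]


def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def nonzero_terms(vector, vocabulary=VOCABULARY):
    """Sparse view: keep only the terms that actually occur."""
    return {t: c for t, c in zip(vocabulary, vector) if c != 0}
```

For a hypothetical text such as `"team play play game win"`, `term_vector` yields `[1, 0, 2, 0, 0, 1, 1, 0, 0, 0]`. The sparse view keeps only the non-zero entries, which is usually all that matters for asymmetric attributes.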
![Page 15: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/15.jpg)
Transaction Data
• A special type of record data, where
  ▫ Each record (transaction) involves a set of items
  ▫ Example: a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items

| TID | Items |
|-----|-------|
| 1 | Bread, Coke, Milk |
| 2 | Beer, Bread |
| 3 | Beer, Coke, Diaper, Milk |
| 4 | Beer, Bread, Diaper, Milk |
| 5 | Coke, Diaper, Milk |
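
One possible in-memory representation of the five transactions is a mapping from TID to a set of items. The function names below are illustrative assumptions.

```python
from collections import Counter

# Each transaction (keyed by TID) is an unordered set of items.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def item_counts(transactions):
    """Count in how many transactions each item appears."""
    counts = Counter()
    for items in transactions.values():
        counts.update(items)
    return counts


def transactions_containing(transactions, item):
    """Return the sorted TIDs of transactions that include the item."""
    return sorted(tid for tid, items in transactions.items() if item in items)
```

Using sets reflects that only the presence of an item in a transaction is recorded, not a quantity or an order.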
![Page 16: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/16.jpg)
Graph Data
• Examples: generic graph and HTML links

[Figure: a generic graph]

<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
![Page 17: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/17.jpg)
Chemical Data
• Benzene molecule: C6H6
![Page 18: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/18.jpg)
Ordered Data /1
• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
![Page 19: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/19.jpg)
Ordered Data /2
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
![Page 20: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/20.jpg)
Ordered Data /3
• Spatio-temporal data

[Figure: average monthly temperature of land and ocean]
![Page 21: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/21.jpg)
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  ▫ Noise and outliers
  ▫ Missing values
  ▫ Duplicate data
![Page 22: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/22.jpg)
Noise
• Refers to the modification of original values
• Example: distortion of a person's voice while speaking

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
![Page 23: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/23.jpg)
Outliers /1
• Outliers are data objects whose characteristics differ considerably from those of most other data objects in the data set.
![Page 24: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/24.jpg)
Outliers /2
• Example: a data set records age with 20 values:
  ▫ Age = {3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 31, 55, 20, -67, 37, 11, 55, 45, 37}
• The corresponding statistical parameters are:
  ▫ Mean = 39.6
  ▫ Standard deviation = 45.89
• If we choose a threshold for normally distributed data as Threshold = Mean ± 2 × Standard deviation, then all values outside the range [-52.2, 131.4] are potential outliers. Because age > 0, the range can be reduced to [0, 131.4]. Under this criterion there are three outliers: 156, 139, and -67.
• With high probability, all three are typing errors (values entered with an extra digit or a stray '-' sign).
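
The threshold rule can be checked directly from the listed ages. This is a sketch only: the function name `two_sigma_outliers` is illustrative, and the choice of `statistics.stdev` (the sample standard deviation) is an assumption about which estimator is meant.

```python
import statistics

AGES = [3, 56, 23, 39, 156, 52, 41, 22, 9, 28,
        139, 31, 55, 20, -67, 37, 11, 55, 45, 37]


def two_sigma_outliers(values):
    """Flag values outside mean ± 2 * standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    low, high = mean - 2 * std, mean + 2 * std
    outliers = [v for v in values if v < low or v > high]
    return mean, std, (low, high), outliers


mean, std, bounds, outliers = two_sigma_outliers(AGES)
```

The flagged values come out as 156, 139, and -67. A domain constraint such as age > 0 could be applied as an extra filter, but the negative value already falls outside the statistical range.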
![Page 25: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/25.jpg)
Missing Values
• Reasons for missing values:
  ▫ Information is not collected (e.g., people decline to give their age and weight)
  ▫ Attributes may not be applicable to all cases (e.g., income is not applicable to children)
• Handling missing values:
  ▫ Eliminate data objects
  ▫ Estimate the missing values during analysis
  ▫ Replace with all possible values (weighted by their probabilities)
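
Two of the strategies above, elimination and estimation, can be sketched as follows. The records are hypothetical, missing values are assumed to be encoded as `None`, and mean imputation is just one simple estimator chosen for illustration.

```python
import statistics

# Hypothetical records; None marks a missing value.
RECORDS = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 82.0},
    {"age": 40, "weight": None},
]


def drop_incomplete(records):
    """Eliminate data objects that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute_mean(records, attribute):
    """Estimate missing values of one attribute by the mean of known values."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    fill = statistics.mean(known)
    return [
        dict(r, **{attribute: fill}) if r[attribute] is None else dict(r)
        for r in records
    ]
```

Elimination is simple but discards two of the three records here. Imputation keeps every object at the cost of introducing estimated values.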
![Page 26: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/26.jpg)
Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another
  ▫ A major issue when merging data from heterogeneous sources
• Example: the same person with multiple email addresses
• Data cleaning
  ▫ The process of dealing with duplicate data issues
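
A rough sketch of merging records for the same person who appears with several email addresses. The record fields (`name`, `birth_date`, `email`) and the matching rule (normalized name plus birth date) are assumptions for illustration. Real deduplication usually needs fuzzier matching.

```python
def merge_duplicates(records):
    """Merge records that share a normalized name and birth date."""
    merged = {}
    for r in records:
        # Normalize case and whitespace so trivial variations still match.
        key = (" ".join(r["name"].lower().split()), r["birth_date"])
        entry = merged.setdefault(
            key,
            {"name": r["name"].strip(), "birth_date": r["birth_date"], "emails": []},
        )
        if r["email"] not in entry["emails"]:
            entry["emails"].append(r["email"])
    return list(merged.values())


# Hypothetical input combining two sources.
PEOPLE = [
    {"name": "Ana Putri", "birth_date": "1990-01-05", "email": "ana@example.com"},
    {"name": "ana  putri ", "birth_date": "1990-01-05", "email": "a.putri@example.org"},
    {"name": "Budi Santoso", "birth_date": "1985-07-12", "email": "budi@example.com"},
]
```

With this input, the two "Ana Putri" records collapse into one object that carries both email addresses.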
![Page 27: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/27.jpg)
Data Preprocessing: Why Is It Needed?
• Data in the real world tends to be dirty
  ▫ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  ▫ Noisy: containing errors or outliers
  ▫ Inconsistent: containing discrepancies in the formats of codes and names
• No quality data, no quality mining results
  ▫ Quality decisions must be based on quality data
  ▫ A data warehouse needs consistent integration of quality data
![Page 28: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/28.jpg)
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretisation
![Page 29: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/29.jpg)
Forms of Data Preprocessing
![Page 30: Preparing the Data](https://reader035.fdocuments.in/reader035/viewer/2022070503/568164d6550346895dd71747/html5/thumbnails/30.jpg)
Transforming Data
• Centering
  ▫ Subtract the attribute's mean from each data value
• Normalization
  ▫ Divide the centered values by the standard deviation
• Scaling
  ▫ Transform the data so that it falls within a specified range
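
The three transformations can be sketched for a single numeric attribute. The taxable incomes (in thousands) from the record data table serve as sample input. Min-max scaling to [0, 1] is an assumed choice of range, and the sample standard deviation is an assumed choice of estimator.

```python
import statistics

# Taxable income (in K) from the record data table.
INCOME = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def center(values):
    """Subtract the attribute mean from every value."""
    m = statistics.mean(values)
    return [v - m for v in values]


def normalize(values):
    """Divide the centered values by the sample standard deviation."""
    s = statistics.stdev(values)
    return [c / s for c in center(values)]


def scale(values, new_min=0.0, new_max=1.0):
    """Map values linearly onto [new_min, new_max] (min-max scaling).

    Assumes the values are not all equal; otherwise the range is zero.
    """
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]
```

After `center` the values sum to zero, after `normalize` they have unit standard deviation, and after `scale` the smallest income maps to 0 and the largest to 1.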