VSSML16 L5. Basic Data Transformations
![Page 1: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/1.jpg)
September 8-9, 2016
![Page 2: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/2.jpg)
BigML, Inc 2
Basic Transformations
Poul Petersen, CIO, BigML, Inc
Creating Machine Learning Ready Data
![Page 3: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/3.jpg)
BigML, Inc 3Machine Learning-Ready Data
Basic Transformations
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
![Page 4: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/4.jpg)
The Dream
CSV → Dataset → Model → Profit!
![Page 5: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/5.jpg)
The Reality
CRM, Web Accounts, Transactions → ML Ready?
Is all hope lost? How do you even start?
![Page 6: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/6.jpg)
Holistic Approach
• Define a clear idea of the goal.
• Understand what ML tasks will achieve the goal.
• Understand the data structure needed to perform those ML tasks.
• Find out what kind of data you have and make it ML-Ready:
  • Where is it, and how is it stored?
  • What are the features?
  • Can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: try it on a small scale.
• Accept that you might have to start over...
• But when it works, automate it!
![Page 7: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/7.jpg)
Holistic Approach
Define Goal & ML Task
![Page 8: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/8.jpg)
Understand ML Tasks
Goal → ML Task
• Will this customer default on a loan? → Classification
• How many customers will apply for a loan next month? → Regression
• Is the consumption of this product unusual? → Anomaly Detection
• Is the behavior of the customers similar? → Cluster Analysis
• Are these products purchased together? → Association Discovery
![Page 9: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/9.jpg)
Holistic Approach
Required Data Structure
![Page 10: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/10.jpg)
Classification (Categorical)
Training
Testing
Predicting
![Page 11: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/11.jpg)
Regression (Numeric)
Training
Testing
Predicting
![Page 12: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/12.jpg)
Anomaly Detection
![Page 13: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/13.jpg)
Cluster Analysis
![Page 14: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/14.jpg)
Association Discovery
![Page 15: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/15.jpg)
Holistic Approach
Make Your Data ML-Ready
![Page 16: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/16.jpg)
ML-Ready Data
Machine Learning algorithms consume instances of the question that you want to model.
Tabular Data (rows: Instances; columns: Fields, i.e. Features):
• Each row is one of the instances.
• Each column is a field that describes a property of the instance that is relevant to the question being modeled.
• Fields can be:
  • already present in your data,
  • derived from your data,
  • or generated using other fields.
!! Danger Ahead !!
![Page 17: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/17.jpg)
Cleansing
Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

Original data

| Name | Date | Duration (s) | Genre | Plays |
|---|---|---|---|---|
| Highway star | 1984-05-24 | - | Rock | 139 |
| Blues alive | 1990/03/01 | 281 | Blues | 239 |
| Lonely planet | 2002-11-19 | 5:32s | Techno | 42 |
| Dance, dance | 02/23/1983 | 312 | Disco | N/A |
| The wall | 1943-01-20 | 218 | Reagge | 83 |
| Offside down | 1965-02-19 | 4 minutes | Techno | 895 |
| The alchemist | 2001-11-21 | 418 | Bluesss | 178 |
| Bring me down | 18-10-98 | 328 | Classic | 21 |
| The scarecrow | 1994-10-12 | 269 | Rock | 734 |

Cleaned data

| Name | Date | Duration (s) | Genre | Plays |
|---|---|---|---|---|
| Highway star | 1984-05-24 | | Rock | 139 |
| Blues alive | 1990-03-01 | 281 | Blues | 239 |
| Lonely planet | 2002-11-19 | 332 | Techno | 42 |
| Dance, dance | 1983-02-23 | 312 | Disco | |
| The wall | 1943-01-20 | 218 | Reagge | 83 |
| Offside down | 1965-02-19 | 240 | Techno | 895 |
| The alchemist | 2001-11-21 | 418 | Blues | 178 |
| Bring me down | 1998-10-18 | 328 | Classic | 21 |
| The scarecrow | 1994-10-12 | 269 | Rock | 734 |
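The cleansing steps shown on this slide could be sketched in Python roughly as follows. This is a minimal illustration and not the tooling used in the talk: the function names, the list of date formats, the set of missing-value markers, and the genre correction table are all assumptions chosen to cover the sample rows.

```python
import re
from datetime import datetime

MISSING = {"", "-", "n/a"}                       # assumed markers for missing values
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
GENRE_FIXES = {"Bluesss": "Blues"}               # hypothetical hand-made correction table


def clean_date(raw):
    """Return the date as YYYY-MM-DD, trying each assumed input format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None                                  # unknown format: treat as missing


def clean_duration(raw):
    """Return the duration in seconds, or None when the value is missing."""
    text = raw.strip().lower()
    if text in MISSING:
        return None
    match = re.fullmatch(r"(\d+):(\d{2})s?", text)        # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", text)        # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return int(text)                                       # already in seconds


def clean_plays(raw):
    """Return the play count as an int, or None when the value is missing."""
    text = raw.strip()
    return None if text.lower() in MISSING else int(text)


def clean_row(row):
    """Apply all the cleaners to one record given as a dict of strings."""
    return {
        "Name": row["Name"].strip(),
        "Date": clean_date(row["Date"]),
        "Duration (s)": clean_duration(row["Duration (s)"]),
        "Genre": GENRE_FIXES.get(row["Genre"], row["Genre"]),
        "Plays": clean_plays(row["Plays"]),
    }
```

Missing values are represented here as `None`; when the data is written back to CSV they would become empty fields, matching the blanks in the cleaned table.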
![Page 18: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/18.jpg)
Denormalizing
Data is usually normalized in relational databases; ML-Ready datasets need the information denormalized in a single file/dataset.
[Diagram: the users, artists, tracks, and albums tables are joined into a single Instances × Features table (millions).]
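As an illustration of the join step, the sketch below builds a tiny normalized database in memory with Python's standard `sqlite3` module and flattens it into one row per track. The schema, column names, and sample rows are hypothetical (the slide only names the tables), and the users table is left out to keep the example short.

```python
import sqlite3


def build_example_db():
    """Create a small normalized database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, artist_name TEXT);
        CREATE TABLE albums  (album_id INTEGER PRIMARY KEY, album_title TEXT,
                              artist_id INTEGER REFERENCES artists(artist_id));
        CREATE TABLE tracks  (track_id INTEGER PRIMARY KEY, track_title TEXT,
                              album_id INTEGER REFERENCES albums(album_id));
        INSERT INTO artists VALUES (1, 'Artist A');
        INSERT INTO artists VALUES (2, 'Artist B');
        INSERT INTO albums  VALUES (10, 'Album X', 1);
        INSERT INTO albums  VALUES (11, 'Album Y', 2);
        INSERT INTO tracks  VALUES (100, 'Track 1', 10);
        INSERT INTO tracks  VALUES (101, 'Track 2', 10);
        INSERT INTO tracks  VALUES (102, 'Track 3', 11);
    """)
    return conn


def denormalize(conn):
    """Join the normalized tables into one flat row (instance) per track."""
    query = """
        SELECT t.track_title, al.album_title, ar.artist_name
        FROM tracks AS t
        JOIN albums  AS al ON al.album_id = t.album_id
        JOIN artists AS ar ON ar.artist_id = al.artist_id
        ORDER BY t.track_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in denormalize(build_example_db()):
        print(row)
```

Each returned tuple repeats the album and artist information for its track, which is exactly the redundancy that a normalized schema avoids and that an ML-Ready dataset needs.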
![Page 19: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/19.jpg)
Aggregating
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.

Original data (list of playbacks)

| Content | Genre | Duration | Play Time | User | Device |
|---|---|---|---|---|---|
| Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV |
| Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet |
| Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV |
| Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet |
| The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone |
| Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet |
| The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV |
| Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet |
| The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone |

Aggregated data (list of users)

| User | Num.Playbacks | Total Time | Pref.Device |
|---|---|---|---|
| User001 | 3 | 830 | Tablet |
| User002 | 1 | 218 | Smartphone |
| User003 | 3 | 1019 | TV |
| User005 | 2 | 521 | Tablet |

Counting playbacks per user from the command line (this assumes no field contains a comma, which a title such as "Dance, dance" would break):
tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
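The same aggregation, including total time and preferred device, might look like this in Python. It is only a sketch: the CSV text is typed in from the playback list above, the function name is made up, and "preferred device" is assumed to mean the device used most often by that user.

```python
import csv
import io
from collections import Counter, defaultdict

# Playback list from the slide, with the comma-containing title quoted.
PLAYBACKS_CSV = '''Content,Genre,Duration,Play Time,User,Device
Highway star,Rock,190,2015-05-12 16:29:33,User001,TV
Blues alive,Blues,281,2015-05-13 12:31:21,User005,Tablet
Lonely planet,Techno,332,2015-05-13 14:26:04,User003,TV
"Dance, dance",Disco,312,2015-05-13 18:12:45,User001,Tablet
The wall,Reagge,218,2015-05-14 09:02:55,User002,Smartphone
Offside down,Techno,240,2015-05-14 11:26:32,User005,Tablet
The alchemist,Blues,418,2015-05-14 21:44:15,User003,TV
Bring me down,Classic,328,2015-05-15 06:59:56,User001,Tablet
The scarecrow,Rock,269,2015-05-15 12:37:05,User003,Smartphone
'''


def aggregate_by_user(csv_text):
    """Turn one row per playback into one (count, total time, device) tuple per user."""
    stats = defaultdict(lambda: {"plays": 0, "total": 0, "devices": Counter()})
    for row in csv.DictReader(io.StringIO(csv_text)):
        user = stats[row["User"]]
        user["plays"] += 1
        user["total"] += int(row["Duration"])
        user["devices"][row["Device"]] += 1
    return {
        name: (s["plays"], s["total"], s["devices"].most_common(1)[0][0])
        for name, s in stats.items()
    }


if __name__ == "__main__":
    for name, summary in sorted(aggregate_by_user(PLAYBACKS_CSV).items()):
        print(name, *summary)
```

Using the `csv` module rather than splitting on commas keeps "Dance, dance" in a single field.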
![Page 20: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/20.jpg)
Pivoting
Different values of a feature are pivoted to new columns in the result dataset.

Original data

| Content | Genre | Duration | Play Time | User | Device |
|---|---|---|---|---|---|
| Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV |
| Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet |
| Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV |
| Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet |
| The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone |
| Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet |
| The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV |
| Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet |
| The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone |

Aggregated data with pivoted columns

| User | Num.Playbacks | Total Time | Pref.Device | NP_TV | NP_Tablet | NP_Smartphone | TT_TV | TT_Tablet | TT_Smartphone |
|---|---|---|---|---|---|---|---|---|---|
| User001 | 3 | 830 | Tablet | 1 | 2 | 0 | 190 | 640 | 0 |
| User002 | 1 | 218 | Smartphone | 0 | 0 | 1 | 0 | 0 | 218 |
| User003 | 3 | 1019 | TV | 2 | 0 | 1 | 750 | 0 | 269 |
| User005 | 2 | 521 | Tablet | 0 | 2 | 0 | 0 | 521 | 0 |
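A possible way to compute the pivoted columns is sketched below. The `NP_` and `TT_` prefixes follow the slide (number of playbacks and total time per device); the function name and the reduced tuple layout are assumptions made for brevity.

```python
from collections import defaultdict

# (user, duration in seconds, device) for each playback in the original data.
PLAYBACKS = [
    ("User001", 190, "TV"),
    ("User005", 281, "Tablet"),
    ("User003", 332, "TV"),
    ("User001", 312, "Tablet"),
    ("User002", 218, "Smartphone"),
    ("User005", 240, "Tablet"),
    ("User003", 418, "TV"),
    ("User001", 328, "Tablet"),
    ("User003", 269, "Smartphone"),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(playbacks, devices=DEVICES):
    """Create one NP_<device> and one TT_<device> column per device value."""
    def empty_row():
        # Every user starts with all pivoted columns at zero.
        return {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in devices}

    table = defaultdict(empty_row)
    for user, duration, device in playbacks:
        table[user][f"NP_{device}"] += 1          # number of playbacks on the device
        table[user][f"TT_{device}"] += duration   # total time on the device
    return dict(table)


if __name__ == "__main__":
    for user, columns in sorted(pivot_by_device(PLAYBACKS).items()):
        print(user, columns)
```

Starting every row with zeros matters: a user who never used a device still needs a value in that column so that all instances have the same fields.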
![Page 21: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/21.jpg)
Time Windows
Create new features using values over different periods of time.
[Diagram: Instances × Features tables at t=1, t=2, and t=3 along the Time axis; size labels: (millions), (thousands).]
![Page 22: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/22.jpg)
Updates
Need a current view of the data, but new data only comes in batches of changes.
[Diagram: Instances × Features tables arriving as batches on day 1, day 2, and day 3.]
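One simple way to maintain a current view from batches of changes is to replay them in order against a keyed table. The sketch below assumes that every change record carries an `id` key plus only the fields that changed, and that records are never deleted; the field names and sample values are invented for illustration.

```python
def apply_batches(batches, key="id"):
    """Replay batches of changes in order and return the latest state per record."""
    current = {}
    for batch in batches:
        for change in batch:
            # New ids create a record; known ids get their changed fields overwritten.
            record = current.setdefault(change[key], {})
            record.update(change)
    return current


if __name__ == "__main__":
    day_1 = [{"id": 1, "status": "Current", "late": 0},
             {"id": 2, "status": "Current", "late": 0}]
    day_2 = [{"id": 1, "status": "Late", "late": 1}]
    day_3 = [{"id": 3, "status": "Current", "late": 0},
             {"id": 2, "status": "Paid"}]
    for record in apply_batches([day_1, day_2, day_3]).values():
        print(record)
```

Because later batches overwrite earlier values, the order in which batches are applied has to match the order in which they were produced.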
![Page 23: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/23.jpg)
Structuring Output
After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,), but other "separators" like semicolon (;), colon (:), or pipe (|) can also be used.
• Each row must contain the same number of fields,
  • but they can be null.
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8.
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
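Python's standard `csv` module already applies the quoting rules listed on this slide, so a writer along the following lines is usually enough. The function names and the choice to write nulls as empty fields are assumptions of this sketch.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and rows to CSV text following the rules listed above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row must have the same number of fields")
        # Nulls (None) are written as empty fields.
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()


def save_csv(path, header, rows):
    """Write the CSV text to disk as UTF-8 (newline='' avoids newline translation)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_csv(header, rows))


if __name__ == "__main__":
    print(rows_to_csv(["Name", "Note", "Plays"],
                      [["Dance, dance", 'The "wall"', None]]))
```

With `QUOTE_MINIMAL`, only fields that contain the separator, a quote, or a line break are quoted, and embedded quotes are doubled automatically.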
![Page 24: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/24.jpg)
Holistic Approach
Feature Engineering
![Page 25: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/25.jpg)
Feature Engineering
• Flatline
  • Domain-Specific Language for data generation and filtering
  • Works with datasets -> datasets
  • Lots of built-in functions
  • Sliding windows
  • Date/Time parsing
• Flatline Editor (in UI)
• https://github.com/bigmlcom/flatline
![Page 26: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/26.jpg)
Feature Engineering
• Feature Engineering of Numeric features:
  • Discretization (percentiles, within percentiles, groups)
  • Replacement
  • Normalization
  • Exponentiation, Logarithms, Squares, etc.
  • Shock
• Feature Engineering of Text features:
  • Misspellings
  • Length
  • Number of subordinate sentences
  • Language
  • Levenshtein distance
• Stacking:
  • Compute a field using non-linear combinations of other fields
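A few of the numeric transformations listed above can be sketched with the Python standard library alone. This is an illustration, not Flatline: the function names are invented, normalization is assumed to mean z-scores with the population standard deviation, and discretization is assumed to mean equal-frequency (percentile) bins.

```python
import math
import statistics

# "Plays" values from the cleansing example (the missing value is left out).
PLAYS = [139, 239, 42, 83, 895, 178, 21, 734]


def normalize(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]


def log_transform(values):
    """Natural logarithm of each value; only valid for strictly positive values."""
    return [math.log(v) for v in values]


def discretize(values, bins=4):
    """Label each value with the index (0 .. bins-1) of its percentile bin."""
    cuts = statistics.quantiles(values, n=bins)   # bins - 1 cut points
    return [sum(v > cut for cut in cuts) for v in values]


if __name__ == "__main__":
    print(normalize(PLAYS))
    print(log_transform(PLAYS))
    print(discretize(PLAYS))
```

The logarithm is a common choice for skewed counts such as plays, where a handful of instances are orders of magnitude larger than the rest.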
![Page 27: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/27.jpg)
Holistic Approach
Test & Automate
![Page 28: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/28.jpg)
Test & Automate
• Test - Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don’t want to hand code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important
![Page 29: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/29.jpg)
Tools
• Command Line?
  • join, cut, awk, sed, sort, uniq
• Automation
  • Shell, Python, etc.
  • Talend
  • BigML: bindings, bigmler, API, WhizzML
• Relational DB
  • MySQL
• Non-Relational DB
  • MongoDB
![Page 30: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/30.jpg)
Prosper
[Diagram: loan lifecycle with the states Submit, Bids, Cancelled, Withdraw, Funded, Expired, Defaulted, Paid, Current, Late.]
• Q: Which new loans make it to funded? (Classification)
• Q: Which funded loans make it to paid? (Classification)
• Q: If funded, what will be the rate? (Regression)
![Page 31: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/31.jpg)
Prosper
Data Provided in XML updates!!
[Diagram: pipeline built from fetch.sh (daily "curl"), import.py (XML), export.sh, and bigml.sh, leading to Model, Predict, and Share in gallery. Fields shown: Status, LoanStatus, BorrowerRate.]
![Page 32: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/32.jpg)
Prosper
Tidbits and Lessons Learned….
• XML… yuck!
• MongoDB has CSV export and is record based so it is easy to handle changing data structure.
• Feature Engineering
  • There are 5 different classes of “bad” loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML… yuck!
![Page 33: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/33.jpg)
Diabetes
Fix Missing Values in a “Meaningful” Way
[Diagram: workflow with the steps Filter Zeros, Model insulin, Predict insulin, and Select insulin, connecting the Original Dataset, Clean Dataset, Amended Dataset, and Fixed Dataset.]
![Page 34: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/34.jpg)
Stock Prices
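Shock: Deviations from Trend

(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

Shock = (Current - (4-day moving avg)) / std dev

| date | volume | price |
|---|---|---|
| 1 | 34353 | 314 |
| 2 | 44455 | 315 |
| 3 | 22333 | 315 |
| 4 | 52322 | 321 |
| 5 | 28000 | 320 |
| 6 | 31254 | 319 |
| 7 | 56544 | 323 |
| 8 | 44331 | 324 |
| 9 | 81111 | 287 |
| 10 | 65422 | 294 |
| 11 | 59999 | 300 |
| 12 | 45556 | 302 |
| 13 | 19899 | 301 |
| 14 | 21453 | 302 |

Prices in the 4-day moving average window (up to 4 previous days), as shown for dates 2 to 7:
• date 2: 314
• date 3: 314 315
• date 4: 314 315 315
• date 5: 314 315 315 321
• date 6: 315 315 321 320
• date 7: 315 321 320 319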
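The shock feature can be reproduced outside Flatline with a few lines of Python. This sketch rests on several assumptions: the window covers up to the 4 previous rows (shorter near the start, as in the windows shown on the slide), the first row has no window and gets `None`, and the standard deviation is the population standard deviation of the whole price column. Flatline's own edge handling and standard deviation definition may differ.

```python
import statistics

# Price column from the slide, dates 1 to 14.
PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def shock(prices, window=4):
    """(current price - mean of up to `window` previous prices) / std dev of all prices."""
    spread = statistics.pstdev(prices)          # assumption: population std dev
    result = []
    for i, price in enumerate(prices):
        previous = prices[max(0, i - window):i]  # rows -window .. -1 relative to row i
        if not previous:
            result.append(None)                  # first row: nothing to compare against
            continue
        result.append((price - statistics.fmean(previous)) / spread)
    return result


if __name__ == "__main__":
    for date, value in enumerate(shock(PRICES), start=1):
        print(date, value)
```

On this data the largest negative shock falls on date 9, where the price drops to 287 after four days at 319 or above.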
![Page 35: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/35.jpg)
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example
![Page 36: VSSML16 L5. Basic Data Transformations](https://reader031.fdocuments.in/reader031/viewer/2022030214/5899eb3f1a28ab96418b651f/html5/thumbnails/36.jpg)