The Practice of Data Science - IBM
© 2016 IBM Corporation
Changing the Way We Do Analytics
▪ Look At All The Data
▪ Let Data Lead the Way
▪ Leverage Data as it is Captured
Basic Process
▪ Understand problem and domain
▪ Ingest data
▪ Explore and understand data
▪ Transform: clean
▪ Transform: shape
▪ Create and build model
▪ Evaluate
▪ Deliver and deploy model
▪ Communicate results
Phases: INPUT, ANALYSIS, OUTPUT
Work Task Example
Given that a person is arrested:
▪ Who gets released on bond?
▪ How fast?
Understand the Domain
▪ Analytics requires an understanding of the data and the judicial process
▪ Need to learn how a judge decides whether or not to allow bond
− SMEs indicate judicial bond decisions are based on
• “Threat to community” = qualitative assessment (current charges + past charges + timeline)
• Ties to community
The Hunt for Data -- The Rap Sheet
▪ Charges
− Criminal code: thousands of numbers
− Time/date of arrest
− Sentenced or not (sometimes) / released or not
▪ Personal information
− Dirty and incomplete
▪ Arresting organization
▪ Acquiring rap sheet data
− Access required all sorts of agreements
− Different jurisdictions have different content and form
− Task requirement: metadata mapping and integration
• Consistent crime codes (NCIC)
Explore and Understand the Data
▪ Analyze variables
− Values, max, min, number of variables, coverage or % missing data, distribution shapes, etc.
− Outliers
− Anomalous values
▪ Example variable: number of days from arrest to adjudication
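The variable checks listed above can be sketched in a few lines. This is a minimal illustration, not the presenters' tooling: the function name `profile_variable`, the 1.5 × IQR outlier rule, the "negative means anomalous" rule, and the sample values are all assumptions made up for the example.

```python
import statistics

def profile_variable(values):
    """Summarize one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles drive an assumed 1.5 * IQR outlier rule.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
        # Assumed domain rule: a negative number of days cannot be right.
        "anomalies": [v for v in present if v < 0],
    }

# Invented sample: days from arrest to adjudication, with gaps and bad values.
days_to_adjudication = [12, 30, 45, None, 28, 33, -4, 900, 41, None]
summary = profile_variable(days_to_adjudication)
print(summary)
```

In this sample the value 900 is flagged as an outlier by spread, while -4 sits inside the statistical range and is only caught by the domain rule, which is why the slide lists outliers and anomalous values separately.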
Data Transformation
▪ To clean or not to clean
− Strategic decision
− Decision criteria
▪ Identity normalization
− Alias challenge
− Alias challenge as it relates to data science
• Model creation
• Model score
Data Transformation: Enriching the data by adding context to data
Context: The cumulative history derived from data observations about entities
▪ Example: safety of firefighters
− Current environmental temperature
Or
− Current environmental temperature and temperature history for that person
Or
− Current environmental temperature, temperature history for that person, and how long it will take to exit the building
Data Transformation: Threat
NCIC Charge Code    NCIC Charge Category    NCIC Charge
101                 Sovereignty             Treason
105                 Sovereignty             Sedition
▪ There are several thousand codes
− Which codes are considered “threat”?
− How do codes compare in “threat”?
− How do you combine codes?
− How do you figure in the temporal aspect of crime?
Scoring: Threat to Community
▪ Two main components: scoring of individual charges and crime history
▪ Scoring of each charge is derived from two parameters
− A loss-of-memory parameter, which determines how fast the severity of the charge declines over time (this parameter might be zero)
− A lack-of-forgiveness parameter, which determines what proportion of the original severity level remains forever
▪ Scoring of crime history
− Scores of each charge/conviction are accumulated (the model determines how)
Flow:
1) For each crime, look up the scoring parameters and the time the crime was committed, and evaluate the individual crime scores
2) Submit all of those scores into the cumulative history scoring function
3) Result: threat to the community of the offender
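One way to read the two parameters is as an exponential decay toward a permanent floor. The slides do not give the functional form, so the formula below is an assumption, as are the names `ChargeParams`, `charge_score`, `threat_to_community`, the placeholder charge codes "A" and "B", every numeric value, and the use of a plain sum for accumulation (the slides only say the model determines how scores are combined).

```python
import math
from dataclasses import dataclass

@dataclass
class ChargeParams:
    severity: float           # original severity level of the charge
    memory_loss: float        # loss-of-memory rate per year (may be zero)
    forgiveness_floor: float  # lack-of-forgiveness: share of severity that never fades

# Hypothetical parameter table keyed by placeholder charge codes; values are invented.
PARAMS = {
    "A": ChargeParams(severity=10.0, memory_loss=0.5, forgiveness_floor=0.2),
    "B": ChargeParams(severity=4.0, memory_loss=0.0, forgiveness_floor=0.0),
}

def charge_score(p, years_ago):
    """Assumed form: severity * (floor + (1 - floor) * exp(-rate * t))."""
    decaying = (1.0 - p.forgiveness_floor) * math.exp(-p.memory_loss * years_ago)
    return p.severity * (p.forgiveness_floor + decaying)

def threat_to_community(history):
    """history: list of (charge_code, years_since_offense). Assumed: plain sum."""
    return sum(charge_score(PARAMS[code], years) for code, years in history)

history = [("A", 0.0), ("A", 6.0), ("B", 20.0)]
print(round(threat_to_community(history), 3))
```

With a zero loss-of-memory rate (code "B") the score never declines, and with a nonzero floor (code "A") an old charge settles at 20% of its original severity rather than vanishing.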
Data Transformation
▪ Target variable: time to release
− The time-to-release variable is obtained by subtracting the booking time stamp from the release time stamp.
▪ Counting
− Total number of a type of crime
− Total number of a specific threat-to-community grouping
▪ Distance variables
− Compare ZIP codes of booking location and arrestee’s home to determine if arrestee is “local” to booking locality
▪ Date stamp variables
− Did the booking date stamp occur on a weekend?
− Did the booking date stamp occur on a holiday, during a holiday, or just before a holiday?
▪ Time of day
− Early in shift? Late in shift?
▪ Net: about 1600 variables created
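A few of the derived variables above can be sketched with the standard library. This is an illustration under stated assumptions: the function name `booking_features`, the one-entry holiday calendar, the shift start hours, the 2-hour and 6-hour cutoffs, and treating "local" as an exact ZIP match are all hypothetical choices, not details from the slides.

```python
from datetime import datetime, date, timedelta

HOLIDAYS = {date(2016, 7, 4)}    # hypothetical holiday calendar
SHIFT_START_HOURS = (7, 15, 23)  # hypothetical 8-hour shifts

def booking_features(booked, released, booking_zip, home_zip):
    """Derive time-to-release plus date, distance, and shift indicators."""
    day = booked.date()
    # Whole hours since the most recent shift start, wrapping past midnight.
    hours_into_shift = min((booked.hour - s) % 24 for s in SHIFT_START_HOURS)
    return {
        "hours_to_release": (released - booked).total_seconds() / 3600.0,
        "is_weekend": booked.weekday() >= 5,  # Saturday=5, Sunday=6
        "is_holiday": day in HOLIDAYS,
        "is_day_before_holiday": day + timedelta(days=1) in HOLIDAYS,
        "is_local": booking_zip == home_zip,  # assumed: exact ZIP match
        "early_in_shift": hours_into_shift < 2,
        "late_in_shift": hours_into_shift >= 6,
    }

features = booking_features(
    booked=datetime(2016, 7, 3, 16, 30),
    released=datetime(2016, 7, 4, 4, 30),
    booking_zip="20001",
    home_zip="20001",
)
print(features)
```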
Modeling Process is Iterative
▪ Divide data set into 3 segments
▪ Predictive modeling algorithm: train model
▪ Evaluate and tweak model
▪ Score and assess model
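The three-segment split can be sketched as follows. The slides do not name the segments or their sizes, so the train/validation/test naming, the 60/20/20 proportions, and the function name `split_three_ways` are assumptions for illustration.

```python
import random

def split_three_ways(records, train=0.6, validate=0.2, seed=0):
    """Shuffle once, then cut into train, validation, and test segments."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],                      # used to train the model
        shuffled[n_train:n_train + n_validate],  # used to evaluate and tweak
        shuffled[n_train + n_validate:],         # held back to score and assess
    )

train_set, validation_set, test_set = split_three_ways(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the third segment untouched until the end means the final assessment is made on records that influenced neither training nor tweaking.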
Picking a Model
▪ Target variable characteristics (binary, continuous, etc.) typically dictate model selection
▪ Model selection
− Assessment via accuracy and error
− Different models can select different variables as predictors
Predictive Modeling Environment
Model – Decision Tree
Scoring - How good is the Model?
▪ Mission dictates model accuracy requirements
▪ Lots of different measurements of goodness
− Model confidence
− Two types of error
• Number of people who were predicted to be released AND were not
• Number of people who were not predicted to be released AND were
− A number of other scoring mechanisms
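The two error counts above can be tallied directly from predictions and outcomes. This is a minimal sketch: the function name `release_errors`, the dictionary keys, and the sample values are invented for illustration.

```python
def release_errors(predicted, actual):
    """Count both error types; each list holds True for 'released'."""
    pairs = list(zip(predicted, actual))
    predicted_released_but_held = sum(1 for p, a in pairs if p and not a)
    predicted_held_but_released = sum(1 for p, a in pairs if not p and a)
    correct = len(pairs) - predicted_released_but_held - predicted_held_but_released
    return {
        "predicted_released_but_held": predicted_released_but_held,
        "predicted_held_but_released": predicted_held_but_released,
        "accuracy": correct / len(pairs),
    }

# Invented sample of six arrestees.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
errors = release_errors(predicted, actual)
print(errors)
```

Reporting the two counts separately matters because the mission may tolerate one kind of error far more than the other, which a single accuracy figure hides.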
Disappointment
▪ Horrible accuracy and error
▪ Re-think assumptions
▪ Aha moment
Deployment
▪ Models (or rules) get deployed to the mission environment
− Can deploy more than one model
▪ Model should exploit new data as it arrives
▪ Predictive power of models must be monitored over time
− Develop thresholds that define the limits of allowable model variance; if the model exceeds those limits, it must be re-calibrated
− Need to establish a monitoring mechanism