Finish Section on Linked Data Begin data cleaning and pre ... · JSON-LD (example from json-ld.org)...
Today
• Finish Section on Linked Data
• Begin data cleaning and pre-processing topic
Graphs: Social networks
https://www.flickr.com/photos/marc_smith/5592302165
Protein-Protein Interactions
http://www.nature.com/nrg/journal/v5/n2/fig_tab/nrg1272_F2.html
The Internet Graph (https://en.wikipedia.org/wiki/Opte_Project)
Linked Data
• We need to connect data together (form links).
  – A key part of the Semantic Web
  – Also important for the Internet of Things (26 billion things by 2020, each continuously producing data)
• Principles of links from Tim Berners-Lee
  1. All kinds of conceptual things, they have names now that start with HTTP.
  2. If I take one of these HTTP names and I look it up, I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
  3. When I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.
Linked Data Examples
• DBPedia
  – ~5 million “things” from Wikipedia
  – Can be linked to external datasets such as CIA World Factbook, US Census Data
  – “Give me all cities in New Jersey with more than 10,000 people”
• Freebase
• FOAF (friend of a friend)
• Google Knowledge Graph
  – https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html
Standards for Linked Data
• Widely used standards (W3C Recommendations)
  – JSON-LD (JSON Linked Data)
  – RDF (Resource Description Framework)
JSON-LD (example from json-ld.org)
• Provides mechanisms for specifying unambiguous meaning in JSON data
• Provides extra keys with “@” sign
  – “@context” (used to define meanings of terms, map to identifiers)
  – “@type”
  – “@id”
• Use cases
  – Google Knowledge Graph
JSON-LD Example (from https://en.wikipedia.org/wiki/JSON-LD)
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
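To see what “@context” contributes, the sketch below uses only Python's standard json module to replace each short key in the example with the identifier that the context maps it to. It is a hand-rolled illustration of the idea, not a conforming JSON-LD processor, and the helper names term_to_iri and expand_terms are hypothetical.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
""")


def term_to_iri(context, term):
    """Look up a short term in the context; return it unchanged if unmapped."""
    mapping = context.get(term, term)
    # A mapping is either a plain identifier string or an object holding "@id".
    return mapping["@id"] if isinstance(mapping, dict) else mapping


def expand_terms(document):
    """Hypothetical helper: rewrite top-level keys and @type using @context."""
    context = document["@context"]
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue  # the context itself is not data
        if key == "@type":
            expanded[key] = term_to_iri(context, value)
        elif key.startswith("@"):
            expanded[key] = value  # keywords such as @id are kept as they are
        else:
            expanded[term_to_iri(context, key)] = value
    return expanded


expanded = expand_terms(doc)
for key, value in expanded.items():
    print(key, "->", value)
```

After expansion, "name" and "homepage" no longer depend on a local naming convention: each key is a full HTTP identifier that another dataset can refer to.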
Graphs – RDF (Resource Description Framework) [materials from w3.org]
Serialisation of RDF Example Graph
This graph can be serialised as XML (don’t worry about syntax!)
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>
RDF – Triple Store
• An alternative format for storing RDF type data – triple store

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:[email protected]> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
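Each line above is one (subject, predicate, object) triple. As a rough sketch, the Python below splits three of these lines into tuples with a regular expression. It assumes only the simple shapes seen here (identifiers in angle brackets, plain quoted literals) and is not a full parser for the format; the name parse_triples is hypothetical.

```python
import re

NTRIPLES = """\
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
"""

# Subject and predicate are <identifier>; the object is either an
# <identifier> or a "plain literal". The line ends with a full stop.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$')


def parse_triples(text):
    """Hypothetical helper: return a list of (subject, predicate, object)."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match is None:
            continue  # skip blank lines and shapes this sketch does not handle
        subject, predicate, iri_object, literal_object = match.groups()
        obj = iri_object if iri_object is not None else literal_object
        triples.append((subject, predicate, obj))
    return triples


triples = parse_triples(NTRIPLES)
for subject, predicate, obj in triples:
    # Show only the part of the predicate after '#', for readability.
    print(predicate.split("#")[-1], "=", obj)
```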
Freebase
• A large database that connects entities (facts, people, places, organizations …) together as a graph
  – www.freebase.com
  – Freebase is the basis of the Google Knowledge Graph that is used to improve search.
    • https://developers.google.com/knowledge-graph/
• Retrieving data from the Google Knowledge Graph
  – Example adapted from http://www.nolan-nichols.com/knowledge-graph-via-sparql.html
Other formats for Graphs: Matrix Representation
[Figure: directed graph on nodes A, B, C, D]

   A  B  C  D
A  0  0  1  0
B  0  0  0  0
C  0  1  0  0
D  0  1  0  0

The entry in row X, column Y is ‘1’ iff there is an edge from node X to node Y.

Or use a relational table

Source  Destination
A       C
C       B
D       B
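The two representations hold the same information. A minimal Python sketch, using the node order A, B, C, D from the slide, builds the matrix from the Source/Destination rows and then recovers the rows from the matrix:

```python
nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("C", "B"), ("D", "B")]  # (source, destination) rows

index = {name: i for i, name in enumerate(nodes)}

# Start with an all-zero matrix, then mark each edge source -> destination.
matrix = [[0] * len(nodes) for _ in nodes]
for source, destination in edges:
    matrix[index[source]][index[destination]] = 1

# Going the other way recovers the relational table from the matrix.
recovered = [
    (nodes[r], nodes[c])
    for r, row in enumerate(matrix)
    for c, cell in enumerate(row)
    if cell == 1
]

for name, row in zip(nodes, matrix):
    print(name, row)
```

The matrix always needs one cell per pair of nodes, while the table needs one row per edge, so the table tends to be the more compact choice when a graph has few edges.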
What you should know about data formats
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• Difference between a (standard) relational database and a NoSQL database
• What is a CSV, what is a spreadsheet, what is the difference?
• Be able to write regular expressions in Python format (operators .^$*+|[])
• Difference between HTML and XML and when to use each
• Motivation behind using XML and XML namespaces
• Be able to read and write data in XML (elements, attributes, namespaces)
• Be able to read and write data in JSON
• Difference between XML and JSON. Applications where each can be used.
• The purpose of using schemas for XML and JSON data.
• The motivation behind Linked Data and the purpose of using JSON-LD or RDF to represent it.
Further reading
• Further reading
  – Relational databases
    • Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
  – XML
    • http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
  – JSON and JSON-LD
    • http://json.org
    • http://crypt.codemancers.com/posts/2014-02-11-An-introduction-to-json-schema/
    • https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.Vtp_UMfB_Gw
  – RDF
    • https://www.w3.org/DesignIssues/LinkedData.html
    • http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
    • http://www.dlib.org/dlib/may98/miller/05miller.html
COMP20008 Elements of Data Processing Data Pre-Processing and Cleaning
Why is pre-processing needed?
Name Age Date of Birth
“Henry” 20.2 20 years ago
Katherine Forty-one 20/11/66
Michelle 37 5/20/79
Oscar@!! “5” 13th Feb. 2011
- 42 -
Mike___Moore 669 -
巴拉克奥巴马 52 1961年8月4日
Why is pre-processing needed?
• Measuring data quality
  – Accuracy
    • Correct or wrong, accurate or not
  – Completeness
    • Not recorded, unavailable
  – Consistency
    • E.g. discrepancies in representation
  – Timeliness
    • Updated in a timely way
  – Believability
    • Do I trust the data is correct?
  – Interpretability
    • How easily can I understand the data?
Major data preprocessing activities
Data mining concepts and techniques, Han et al 2012
Terminology
Height  Weight  Age  Gender
1.8     80      22   Male
1.53    82      23   Male
1.6     62      18   Female

• The 4 columns (height, weight, age, gender) are features or attributes
• The data items (3 rows) are called instances or objects
• Height, Weight and Age are continuous features
• Gender is a categorical or discrete feature
Data integration
• Bringing data from multiple sources together
  – Resolve conflicts
  – Detect duplicates
• Will cover in depth in weeks 8 and 9

[Figure: two data sources combined into one integrated data source]
Data reduction
• Decrease the number of features (columns) or instances (rows)
  – Sampling strategies
  – Remove irrelevant features and reduce noise
  – Easier to visualise, faster to analyse
• Will cover during section on visualisation (weeks 5 and 6), and feature analysis (weeks 9 and 10)
http://bigdataexaminer.com/data-science/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/
Data cleaning
• Incomplete (missing data)
• Noisy data
• Inconsistent data
• Intentionally disguised data
Data cleaning – The Process
• Many tools exist (Google Refine, Kettle, Talend, …)
  – Data scrubbing
  – Data discrepancy detection
  – Data auditing
  – ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be to understand some of the methods employed by some of these tools
Missing or incomplete data
• Lacking feature values
  – Name=“”
  – Age=null
• Types of missing data (Rubin 1976)
  – Missing completely at random: Data are missing independently of observed and unobserved data.
    • E.g. coin flipping to decide whether or not to answer an exam question.
  – Missing not completely at random
    • I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don’t respond?
    • I set an exam and ask a question in hard-to-understand language. What is the meaning of missing values for those who don’t answer the question?
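Before reasoning about why values are missing, it helps to count where they are missing. The Python sketch below does this for a small list of records. The records, and the choice to treat None, the empty string and "-" as missing, are assumptions made for illustration.

```python
# Hypothetical records; None plays the role of null, "" of an empty Name.
records = [
    {"name": "Daisy", "age": 10},
    {"name": "", "age": 15},
    {"name": "Harry", "age": None},
]

# Assumption: these are the markers that mean "missing" in this dataset.
MISSING_MARKERS = {None, "", "-"}


def is_missing(value):
    """Return True if the value is one of the agreed missing markers."""
    return value in MISSING_MARKERS


# Count the missing values per feature (column).
missing_counts = {}
for record in records:
    for field, value in record.items():
        if is_missing(value):
            missing_counts[field] = missing_counts.get(field, 0) + 1

print(missing_counts)
```

A count like this says nothing about whether the values are missing completely at random; that still has to be argued from how the data were collected.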
Example: USA Salary survey data
• Is Person B’s salary missing at random?
• Very difficult to determine reasons for missingness.
  – In practice report assumptions about missingness.

Name      Salary
Person C  $59k
Person D  $63k
Person H  $99k
Person E  $102k
Person G  $140k
Person F  $150k
Person A  $180k
Person B  -
Causes of missing data
• Why does it occur?
  – Malfunction of equipment (e.g. sensors)
  – Not recorded due to misunderstanding
  – May not be considered important at time of entry
  – Deliberate
• How to handle it?
  – We will look at a number of strategies
Extreme Missing data
• Movie Recommender systems

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

• Users and movies
• Each user only rates a few movies (say 1%)
• Netflix wants to predict the missing ratings for each user
Noisy data
• Truncated fields (exceeded 80 character limit)
• Text incorrectly split across cells (e.g. separator issues)
• Salary=“-5”
• Some causes
  – Imprecise instruments
  – Data entry issues
  – Data transmission issues
Inconsistent data
• Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/2002”
• Two students with the same student id
• Outliers
  – E.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999
    • No good if it is a list of ages of hospital patients
    • Might be OK, though, for a list of people’s number of contacts on LinkedIn
  – Can use automated techniques, but also need domain knowledge
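As one example of an automated technique, the sketch below flags values that lie more than 1.5 interquartile ranges outside the quartiles. This particular rule is an assumption chosen for illustration (the slides do not name a method), and the 1.5 multiplier is a convention rather than a law. It uses statistics.quantiles, available from Python 3.8.

```python
import statistics

values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

# Quartiles split the sorted data into four equally sized parts.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Values beyond these fences are flagged as candidate outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)
```

The rule only flags 999 as unusual. Whether it is an error (ages of hospital patients) or a legitimate value (LinkedIn contacts) is the part that needs domain knowledge.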
Disguised data
• Everyone’s birthday is January 1st?
• Email address is [email protected]
• Adriaans and Zantinge
  – “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
• How to handle
  – Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
Dealing with missing data
• What are the consequences of missing data?
  – May break application programs not expecting it
  – Less power for later analysis
  – May bias later analysis
• So, how to handle it?
Strategy 1: Delete all instances with a missing value
• Sometimes called case deletion
• Effects
  – Easy to analyse the new (complete) data
  – May produce bias in the analysis if the new sample size is small or structure exists in the missing data.
Case deletion
Before case deletion:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After case deletion:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
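Case deletion on the ratings table can be sketched in a few lines of Python. None stands in for the “-” entries; that encoding is an assumption of this sketch.

```python
movies = ["Star Wars", "Batman", "Jurassic World", "The Martian",
          "The Revenant", "Lego Movie", "Selma"]

# One row of ratings per person, in the same order as `movies`.
ratings = {
    "Mandy": [1, 2, 1, 3, 3, 2, 3],
    "James": [3, 2, None, None, None, 1, None],
    "John":  [None, None, 1, 2, None, None, None],
    "Jill":  [1, None, None, 3, 2, 1, None],
}

# Keep only the instances (rows) that have no missing value at all.
complete_cases = {
    person: row
    for person, row in ratings.items()
    if all(value is not None for value in row)
}

print(list(complete_cases))
```

Three of the four instances are discarded, which shows how quickly the sample can shrink when most rows have at least one gap.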
Strategy 2: Manually correct
• A human eyeballs the missing value and fills it in using their expert knowledge
https://en.wikipedia.org/wiki/Eye
Strategy 3: Imputation
• Impute a value (replace the missing value with a substitute one)
• After imputing all missing values, can use standard analysis techniques for complete datasets

Before imputation:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After imputation:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       2               2            1             1           1
John    3          2       1               2            2             1           1
Jill    1          1       1               3            2             1           1
Imputation: Fill in with zeros (or similar)
Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       0               0            0             1           0
John    0          0       1               2            0             0           0
Jill    1          0       0               3            2             1           0

• Simple
• Won’t break application programs
• Limited utility for analysis
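A short Python sketch of zero filling for James’s row, with None standing in for “-” (an assumption of the sketch). It also compares his average rating before and after filling, as one way to see why the result has limited utility for analysis.

```python
james = [3, 2, None, None, None, 1, None]

# Replace every missing rating with 0.
zero_filled = [0 if rating is None else rating for rating in james]

# Average over the ratings James actually gave ...
observed = [rating for rating in james if rating is not None]
mean_observed = sum(observed) / len(observed)

# ... versus the average once the zeros are counted as real ratings.
mean_zero_filled = sum(zero_filled) / len(zero_filled)

print(zero_filled)
print(mean_observed, mean_zero_filled)
```

The filled row no longer breaks code that expects numbers, but the zeros pull the average from 2 down to below 1, so any analysis that treats them as genuine ratings is distorted.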
Imputation: Fill in with mean value
• Popular method
  – Can be good for supervised classification
  – Apply separately to each attribute
Name Age
Daisy 10
Maisy 15
Harry 2
Jackie -
Jackie’s age is imputed to be (10+15+2)/3=9
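The same calculation as a minimal Python sketch, with None marking Jackie’s missing age (an encoding assumed for the sketch):

```python
ages = {"Daisy": 10, "Maisy": 15, "Harry": 2, "Jackie": None}

# The mean is computed from the observed values only.
observed = [age for age in ages.values() if age is not None]
mean_age = sum(observed) / len(observed)

# Every missing age is replaced by that mean.
imputed = {
    name: (mean_age if age is None else age)
    for name, age in ages.items()
}

print(imputed["Jackie"])
```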
Imputation: Fill in with mean value (cont.)
• Drawbacks
  – Reduces the variance of the feature
  – Incorrect view of the distribution of that attribute
  – Relationships to other features change
• Can also use median instead of mean (if distribution is skewed)
• Use mode (most frequent value) imputation for categorical features
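The variance drawback can be checked on the ages from the previous slide. The Python sketch below compares the sample variance before and after mean imputation and also computes the median and mode alternatives. The gender list with a missing entry is hypothetical, added only to illustrate the mode.

```python
import statistics

ages = [10, 15, 2, None]
observed = [age for age in ages if age is not None]

# Mean imputation: fill the gap with the mean of the observed ages.
mean_filled = [statistics.mean(observed) if age is None else age for age in ages]

# Sample variance before and after filling.
var_before = statistics.variance(observed)
var_after = statistics.variance(mean_filled)

# Median is an alternative fill value when the distribution is skewed.
median_age = statistics.median(observed)

# Mode imputation for a categorical feature (hypothetical data).
genders = ["Female", "Female", "Male", None]
mode_gender = statistics.mode([g for g in genders if g is not None])

print(var_before, var_after, median_age, mode_gender)
```

The imputed value sits exactly at the mean, so it adds nothing to the spread while increasing the count, which is why the variance shrinks.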
Fill in with category mean
• Take categories/clusters and compute the mean ….
Name    Age   Gender
Daisy   10    Female
Maisy   15    Female
Harry   2     Male
Jackie  -     Female
Jackie’s age is imputed to be (10+15)/2=12.5 (considering the category “Female”)
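A minimal Python sketch of category-mean imputation on this table, grouping by Gender and using None for the missing age (an encoding assumed for the sketch):

```python
people = [
    {"name": "Daisy", "age": 10, "gender": "Female"},
    {"name": "Maisy", "age": 15, "gender": "Female"},
    {"name": "Harry", "age": 2, "gender": "Male"},
    {"name": "Jackie", "age": None, "gender": "Female"},
]

# Collect the observed ages for each category.
ages_by_gender = {}
for person in people:
    if person["age"] is not None:
        ages_by_gender.setdefault(person["gender"], []).append(person["age"])

# Mean age per category.
category_mean = {
    gender: sum(ages) / len(ages) for gender, ages in ages_by_gender.items()
}

# Fill each missing age with the mean of that person's own category.
for person in people:
    if person["age"] is None:
        person["age"] = category_mean[person["gender"]]

print(people[3])
```

If a category had no observed ages at all, the lookup would fail, so a real pipeline would need a fallback such as the overall mean.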
Time series: Last value carried forward
Day    Kilometres Walked
Day 1  8.9
Day 2  8.2
Day 3  9.6
Day 4  -
Day 5  11.6
Day 6  12.0
Kilometres walked on Day 4 = ?
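Under last value carried forward, a missing entry takes the most recent observed value. A minimal Python sketch, with None for the missing day (an encoding assumed for the sketch):

```python
kilometres = [8.9, 8.2, 9.6, None, 11.6, 12.0]  # Day 1 to Day 6

filled = []
last_seen = None  # stays None until the first observed value appears
for value in kilometres:
    if value is not None:
        last_seen = value
    # A missing value takes whatever was observed most recently.
    filled.append(last_seen)

print(filled[3])
```

Day 4 is filled with Day 3’s value. If the very first day were missing there would be nothing to carry forward, and this sketch would leave it as None.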
Acknowledgements
– Data Mining: Concepts and Techniques. Han, Kamber and Pei. 3rd edition (chapter 3). Available through library as ebook.
– Data Analysis Using Regression and Multilevel/Hierarchical Models. Gelman and Hill (chapter 25), 2006.
Next Week
• Second workshop is available on LMS
  – Practice with JSON and XML and Web scraping
• Project will be released
• Continue data pre-processing and cleaning
  – Look at more complex techniques for value imputation (e.g. for the movie recommender system example)