Data 101: Fundamentals of Data in GIS
Data 101
Fundamentals of data in a GIS
Overview
Role of data
Data structures and schemas
Metadata
Linking data
Issues of confidentiality
Review
90 percent rule
90% Data Preparation
10% Mapping
90% of the cost, time and effort will be devoted to data preparation
90% Rule
Data Preparation
Collecting
Cleaning
Validating
Formatting
Linking with other data
Mapping
Map design
Categorization decisions
Production
GIS analysis is only as strong as the data used.
Strategies for strong data
Accuracy
Timeliness
Properly structured
Properly documented
Data accuracy
Data should accurately reflect reality
In GIS there are two types of accuracy to be concerned with:
Spatial accuracy
Items located correctly
Attribute accuracy
Attributes are correct and properly linked to geography
Spatial accuracy
Hotel Suryaa
Real Location
Spatial Accuracy and Scale
Hotel Suryaa
Attribute Accuracy
Is the data associated with the location accurate?
Is it linked to the right geographic entity?
Attribute Accuracy
Timeliness
Is the data for the time period of interest?
Boundaries change
New features created
Features change
Data Structure
Proper data structure is necessary in order to effectively use data
Software must know how to read the data, and query it.
The structure of the data is also known as data schema
Data Schema
For most programs, data will need to be stored in a row and column format
GIS programs expect well formed data in the following schema:
One record per geographic unit
Geographic units don’t repeat in records
Variables are stored in columns
No blank cells unless data is missing
Data Schema
Population                     China       India       United States  Indonesia
Total                          1339724852  1210193422  312417000      237556363
Percent of World’s Population  19.23%      17.37%      4.48%          3.41%
Population Density             140/km²     368/km²     32/km²         121/km²
Poor data schema
Columns are geographic units
Variables are rows
Blank cells
Duplicate district names
Proper Data Schema
One record per geographic unit
Columns are variables
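The poor layout from the population example can be turned into the proper one by transposing it. The following is a minimal sketch in plain Python, using the figures from that example; the variable names (`poor`, `records`) are illustrative only, and a real GIS workflow would more likely do this in a spreadsheet or a table library.

```python
# Poor schema: each row is a variable, each column is a geographic unit.
poor = [
    ["Population", "China", "India", "United States", "Indonesia"],
    ["Total", 1339724852, 1210193422, 312417000, 237556363],
    ["Percent of World's Population", 19.23, 17.37, 4.48, 3.41],
    ["Population Density (per km2)", 140, 368, 32, 121],
]

# Proper schema: one record per geographic unit, variables in columns.
columns = ["Country"] + [row[0] for row in poor[1:]]
units = zip(poor[0][1:], *(row[1:] for row in poor[1:]))
records = [dict(zip(columns, unit)) for unit in units]

for record in records:
    print(record)
```

Each printed record now describes exactly one country, which is the shape the schema rules above ask for.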
Metadata
Data about data
Provides information on:
Source of data
Who created it
When it was created
Coordinate system and datum
Usage and sharing restrictions
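One way to make sure these items are actually recorded is to keep a small metadata record next to the dataset and check it for gaps. This is only a sketch: the field names and all values below are hypothetical and do not follow any particular metadata standard.

```python
# Items the slide lists as part of metadata (field names are hypothetical).
REQUIRED_FIELDS = [
    "source", "creator", "created",
    "coordinate_system", "datum", "restrictions",
]

# Hypothetical metadata record for an illustrative dataset.
metadata = {
    "source": "Example household survey",
    "creator": "Example statistics office",
    "created": "2011-06-30",
    "coordinate_system": "Geographic (latitude/longitude)",
    "datum": "WGS 84",
    "restrictions": "",  # left blank on purpose to show the check
}

def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

print(missing_fields(metadata))  # ['restrictions']
```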
Metadata
Metadata is especially important with spatial data because of issues of:
Spatial accuracy
Coordinate systems and datums
Confidentiality
Timeliness
Metadata formats
International standard
ISO 19115
Mandatory elements
Schema for metadata
Countries may have their own national standards that are compatible with the ISO standard but provide extra elements
Metadata Example
Data Types
Text
Numeric
Coordinates
Programs assign variables to be a specific type which can affect the way the program handles data
Data Types
Text
Arithmetic cannot be conducted on values in text fields
Numeric
Arithmetic permitted
May require user to declare number of decimal places before entering data
This can be important when storing coordinates
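The effect of field type can be seen in a few lines of Python. This sketch reuses the North district figures and the coordinate pair that appear elsewhere in these slides; the figure of about 111 km per degree of latitude is a common approximation, not something stated in the slides.

```python
# Values stored as text: "+" joins the strings instead of adding them.
male_text, female_text = "14409", "9606"
print(male_text + female_text)            # 144099606

# The same values as numbers: arithmetic works as expected.
print(int(male_text) + int(female_text))  # 24015

# Decimal places matter for coordinates stored in decimal degrees.
lat = 28.67171
lat_2dp = round(lat, 2)                   # field limited to 2 decimal places

# One degree of latitude is roughly 111 km (approximation).
error_m = abs(lat - lat_2dp) * 111_000
print(round(error_m))                     # 190
```

Truncating this latitude to two decimal places moves the point by roughly 190 metres, which may or may not matter depending on the scale of the map.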
Linking data
Key field
The field that contains information common between tables
Tables are linked using the key field
Can’t link using key fields that are two different types
District Population Male Pop Female Pop
North 24015 14409 9606
West 31154 16202 14952
South 62442 29972 32470
District Area (sq km)
North 243
West 310
South 602
District is the key field
District Population Male Pop Female Pop Area (sq km)
North 24015 14409 9606 243
West 31154 16202 14952 310
South 62442 29972 32470 602
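The link shown in these tables can be sketched as a simple join on the key field. This is a minimal illustration in plain Python using the values from the tables; the helper name `link` is made up for this example and is not a function of any GIS package.

```python
population = [
    {"District": "North", "Population": 24015, "Male Pop": 14409, "Female Pop": 9606},
    {"District": "West", "Population": 31154, "Male Pop": 16202, "Female Pop": 14952},
    {"District": "South", "Population": 62442, "Male Pop": 29972, "Female Pop": 32470},
]
area = [
    {"District": "North", "Area (sq km)": 243},
    {"District": "West", "Area (sq km)": 310},
    {"District": "South", "Area (sq km)": 602},
]

def link(left, right, key):
    """Join two tables on a shared key field, keeping only matching records."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

linked = link(population, area, "District")
for row in linked:
    print(row)
```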
Linking data
Linking using text fields can be problematic
Variations in spelling
District Population Male Pop Female Pop
North Kinley 24015 14409 9606
West 31154 16202 14952
South 62442 29972 32470
District Area (sq km)
N. Kinley 243
West 310
South 602
The two tables have different spellings for the district North Kinley
District Population Male Pop Female Pop Area (sq km)
West 31154 16202 14952 310
South 62442 29972 32470 602
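Before linking on a text field, it can help to list the key values that appear in one table but not the other. A small sketch using the district names from the two tables:

```python
population_districts = ["North Kinley", "West", "South"]
area_districts = ["N. Kinley", "West", "South"]

# Key values with no exact match in the other table.
only_in_population = sorted(set(population_districts) - set(area_districts))
only_in_area = sorted(set(area_districts) - set(population_districts))

print(only_in_population)  # ['North Kinley']
print(only_in_area)        # ['N. Kinley']
```

Any name that shows up in either list will be dropped from the linked table unless the spelling is reconciled first.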
Linking data
Linking using numeric fields is often more reliable and less vulnerable to variations and other issues
Countries often use numeric codes for administrative units to get around problems with spelling variations
If standardized national codes exist, it is a good idea to include them in the data
The national bureau of statistics or census office often manages such codes
District Dist code Population Male Pop Female Pop
North Kinley 100 24015 14409 9606
West 200 31154 16202 14952
South 300 62442 29972 32470
District Dist code Area (sq km)
N. Kinley 100 243
West 200 310
South 300 602
Dist code is the key field
District Dist code Population Male Pop Female Pop Area (sq km)
North 100 24015 14409 9606 243
West 200 31154 16202 14952 310
South 300 62442 29972 32470 602
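Linking on the numeric code sidesteps the spelling problem. The sketch below uses the values from the tables, with the male and female columns left out for brevity; which table's district name to keep after the link is a choice made here for illustration.

```python
population = [
    {"District": "North Kinley", "Dist code": 100, "Population": 24015},
    {"District": "West", "Dist code": 200, "Population": 31154},
    {"District": "South", "Dist code": 300, "Population": 62442},
]
area = [
    {"District": "N. Kinley", "Dist code": 100, "Area (sq km)": 243},
    {"District": "West", "Dist code": 200, "Area (sq km)": 310},
    {"District": "South", "Dist code": 300, "Area (sq km)": 602},
]

# Index the area table by the numeric key, then look each district up by code.
area_by_code = {row["Dist code"]: row for row in area}

linked = []
for row in population:
    match = area_by_code.get(row["Dist code"])
    if match is not None:
        # Keep the name from the population table; spelling no longer matters.
        linked.append({**row, "Area (sq km)": match["Area (sq km)"]})

for row in linked:
    print(row)
```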
Advantage of numeric codes
Can manage hierarchy effectively
North District Code 100
District Province Code
North Coast 101
North Mountain 103
North Savanna 105
Savanna
Mountain
Coast
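The hierarchy in this example can be read straight from the codes. The sketch below assumes a convention in which the hundreds digit identifies the parent district; that rule is inferred from the three codes shown and may differ in a real national coding scheme.

```python
provinces = {"North Coast": 101, "North Mountain": 103, "North Savanna": 105}

def district_code(province_code):
    """Assumed convention: round down to the nearest hundred to get the parent."""
    return (province_code // 100) * 100

parents = {name: district_code(code) for name, code in provinces.items()}
print(parents)
# {'North Coast': 100, 'North Mountain': 100, 'North Savanna': 100}
```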
Linking data key points
Key fields must be of the same type
Text fields can be problematic due to spelling variations
Numeric fields are often a more reliable key field
Unique geography codes, if available in a country, are often the best option for making linkages
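The first key point is easy to trip over when a code is numeric in one table and text in the other. A short sketch using the district codes and areas from the earlier tables:

```python
area_by_code = {100: 243, 200: 310, 300: 602}  # key stored as a number
codes_as_text = ["100", "200", "300"]          # same codes stored as text

# Text keys never match numeric keys, so nothing links.
matches_text = [c for c in codes_as_text if c in area_by_code]

# Converting to the same type first makes every code match.
matches_numeric = [int(c) for c in codes_as_text if int(c) in area_by_code]

print(matches_text)     # []
print(matches_numeric)  # [100, 200, 300]
```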
Data and confidentiality issues
Important issue when working with spatial data
Discuss issues of confidentiality and spatial tools
Present strategies for protecting confidentiality
Confidentiality
Protecting identity of individuals
Requirement
Informed consent agreements
Ethical research
Overt disclosure
The act of explicitly making data available that breaches confidentiality commitments.
Deductive Disclosure
45 year old female
Has 5 children
Works for General Electric in Delhi
28.67171, 77.21211
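The narrowing effect in this example can be sketched by counting how many records still match as each fact is added. Every record below is invented for illustration; only the attribute values in `known` come from the slide.

```python
# Hypothetical survey records; all values are made up for this sketch.
records = [
    {"age": 45, "sex": "F", "children": 5, "employer": "General Electric"},
    {"age": 45, "sex": "F", "children": 2, "employer": "General Electric"},
    {"age": 45, "sex": "F", "children": 5, "employer": "Other"},
    {"age": 45, "sex": "F", "children": 5, "employer": "Other"},
    {"age": 30, "sex": "M", "children": 1, "employer": "Other"},
]

# Facts revealed one at a time, as in the slide.
known = [("age", 45), ("sex", "F"), ("children", 5),
         ("employer", "General Electric")]

def count_matches(rows, facts):
    """Count the rows consistent with every (field, value) pair in facts."""
    return sum(all(row[field] == value for field, value in facts) for row in rows)

counts = [count_matches(records, known[:n]) for n in range(1, len(known) + 1)]
print(counts)  # [4, 4, 3, 1]
```

Once the count reaches one, the individual is identified even though no name was ever released; adding a precise coordinate usually gets there in a single step.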
Spatial Data
Overt disclosure
Makes deductive disclosure easier
Geoprivacy
“[an] individual’s right to prevent disclosure of the location of one’s home, workplace, daily activities or trips.”
Protection of Geoprivacy and Accuracy of Spatial Information: How Effective Are Geographical Masks?
Kwan, Casas, Schmitz
Cartographica, Vol. 39, No. 2
Four Principles
Protection of Confidentiality
Social-Spatial Linkage
Data Sharing
Data Preservation
Confidentiality and spatially explicit data: Concerns and challenges
VanWey, Rindfuss, Gutmann, Entwisle, Balk
PNAS, vol. 102, no. 43
1. Protection of Confidentiality
Fundamental to ethical research
Information that might lead to physical, emotional, financial or other harm
Protection of information that discloses identity
2. Social-Spatial Linkage
All human activity takes place on earth
Understanding that adds context and perspective
Key to advancement of science
Essential for understanding the diffusion of behaviors
3. Data Sharing
Essential on both scientific and financial grounds
Provide access to data for other researchers
Condition of funders
4. Data Preservation
Data available in the future
How long should data be deemed “sensitive”?
When, if ever, can it be released?
Strategies
Random Perturbations
Random shifting of point locations
Pros: Easy (relatively) to do
Cons: Lose original location, introduces error
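A minimal sketch of random perturbation, assuming the points are already in a projected coordinate system measured in metres. The coordinates, the 500 m limit and the function name `perturb` are all hypothetical.

```python
import math
import random

def perturb(points, max_shift, rng):
    """Move each (x, y) point a random distance (up to max_shift) in a random direction."""
    shifted = []
    for x, y in points:
        angle = rng.uniform(0.0, 2.0 * math.pi)
        distance = rng.uniform(0.0, max_shift)
        shifted.append((x + distance * math.cos(angle),
                        y + distance * math.sin(angle)))
    return shifted

# Hypothetical point locations in metres.
points = [(1000.0, 2000.0), (1500.0, 2500.0), (3000.0, 1000.0)]
masked = perturb(points, max_shift=500.0, rng=random.Random(42))

for original, moved in zip(points, masked):
    print(original, "->", moved)
```

The size of `max_shift` is the trade-off named above: a larger value protects locations better but adds more error to any later analysis.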
Affine Transformation
Change scale
Rotate
Shift a set distance
Combination
Pros: Easy to do
Cons: Easy to undo, can impact some types of analysis
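The sketch below applies a scale, rotation and shift to a set of points and then reverses it, which illustrates why this mask is easy to undo for anyone who learns the parameters. Coordinates and parameter values are hypothetical.

```python
import math

def affine(points, scale, angle_deg, dx, dy):
    """Scale and rotate each point about the origin, then shift it by (dx, dy)."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(scale * (x * cos_a - y * sin_a) + dx,
             scale * (x * sin_a + y * cos_a) + dy) for x, y in points]

def undo_affine(points, scale, angle_deg, dx, dy):
    """Reverse affine(): remove the shift, undo the scale, rotate back."""
    a = math.radians(-angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    restored = []
    for x, y in points:
        u, v = (x - dx) / scale, (y - dy) / scale
        restored.append((u * cos_a - v * sin_a, u * sin_a + v * cos_a))
    return restored

points = [(1000.0, 2000.0), (1500.0, 2500.0), (3000.0, 1000.0)]  # hypothetical
params = dict(scale=1.5, angle_deg=30.0, dx=250.0, dy=-400.0)    # hypothetical

masked = affine(points, **params)
recovered = undo_affine(masked, **params)
```

Within floating-point rounding, `recovered` matches `points`, so the protection depends entirely on keeping the parameters secret.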
Aggregate
Point locations are aggregated to higher unit of analysis
Pros: Easy to do
Cons: Requires sufficient data points; finer data variations will be lost
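Aggregation can be sketched as counting points per district and withholding counts that are too small to release. The point records and the threshold of 3 are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical point records, each already tagged with its district.
point_districts = ["North", "North", "West", "South",
                   "South", "South", "North", "South"]

MIN_COUNT = 3  # assumed suppression threshold, not taken from the slides

counts = Counter(point_districts)

# Release a count only when enough points fall in the district.
released = {district: (n if n >= MIN_COUNT else None)
            for district, n in counts.items()}

print(released)  # {'North': 3, 'West': None, 'South': 4}
```

The suppressed value for West shows the first drawback listed above: with too few points, aggregation alone does not hide the individual.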
Despatialize
Remove Coordinate System
Use Euclidean space
Pros: Simple, keeps relative position and placement
Cons: Loses contextual data
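One simple way to sketch despatializing is to re-express the points relative to their own centroid, so that absolute position is dropped while distances between points are kept. This assumes projected coordinates in metres and hypothetical values; it is one possible reading of the strategy, not a prescribed method.

```python
import math

def despatialize(points):
    """Re-express points relative to their centroid in a plain Euclidean space."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]

points = [(1000.0, 2000.0), (1500.0, 2500.0), (3000.0, 1000.0)]  # hypothetical
local = despatialize(points)

# Distances between points are unchanged, so relative placement survives.
d_before = math.hypot(points[1][0] - points[0][0], points[1][1] - points[0][1])
d_after = math.hypot(local[1][0] - local[0][0], local[1][1] - local[0][1])
print(abs(d_before - d_after) < 1e-9)  # True
```

The centroid itself has to be discarded or kept private, since adding it back restores the original locations.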
Nothing
Do not collect or release data
Cold room or on-site analysis only
Pros: Maintains all of the original spatial data
Cons: Complicated, limits data sharing, limits social-spatial link
“Ignoring is unacceptable”
Can get lost in the excitement about GIS
Those who collect data must think about the confidentiality issues
Data users must also think about how their analysis may increase the risk of deductive disclosure.
Key points
Confidentiality issues arise when spatial context is included in data.
It’s important to protect confidentiality. People have an expectation that their identities are protected.
There are strategies that can preserve confidentiality, but there is no “one-size-fits-all” solution.