Processing Big Data - ULisboa
users.isr.tecnico.ulisboa.pt/~csoares/BigDataDocs/EDA.pdf
Processing Big Data
Exploratory Data Analysis
What it is
Any method of looking at data that does not include formal statistical modeling and inference
NYC OD: Emergency Response Incidents
Why it matters
Confirmatory statistical analyses are based on models.
$y = Ax + \mathcal{N}(0, \sigma^2)$
Structural component (signal): $Ax$
Random component (noise): $\mathcal{N}(0, \sigma^2)$
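A minimal sketch of the model above, assuming a scalar A, a noise standard deviation, and a fixed random seed (all illustrative values, not taken from the slides). It builds the structural and random components separately and adds them to get the observed data:

```python
import random

# Hypothetical values chosen only for illustration.
A = 2.0        # structural coefficient
SIGMA = 0.5    # standard deviation of the noise term

rng = random.Random(0)  # fixed seed so the run is repeatable

xs = [float(i) for i in range(10)]
signal = [A * x for x in xs]                  # structural component: Ax
noise = [rng.gauss(0.0, SIGMA) for _ in xs]   # random component: N(0, sigma^2)
ys = [s + n for s, n in zip(signal, noise)]   # observed data: y = Ax + noise

for x, s, n, y in zip(xs, signal, noise, ys):
    print(f"x={x:4.1f}  signal={s:5.2f}  noise={n:+.2f}  y={y:5.2f}")
```

In real data only `ys` is observed; EDA helps judge whether a split like this is plausible.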
Why it matters
Models are not perfect representations of the real world.
But some are close enough to be useful!
https://commons.wikimedia.org/w/index.php?curid=521370
Why it matters
What is close enough to reality?
Statistical inference always depends on model assumptions about the data.
Use EDA for:
• Detecting data noise
• Checking assumptions
• Selecting data models
• Determining relationships between the explanatory variables
• Determining relationships between explanatory and outcome variables
Techniques
Look at the raw data
• What values do the variables take?
• How often do the variables take those values?
Compute summary statistics
• Look at the top and bottom of your data.
• How much data is missing?
• How noisy is the data?
Visualize
• Show comparisons
• Show structure
• Show multivariate data
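The first two techniques can be sketched in a few lines. This is one possible approach, not a prescribed one: the sample rows and the column names `borough` and `complaint` are made up, and the function reads one row at a time so it also works on a file object.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for a real file; the columns are made up.
SAMPLE = """borough,complaint
BRONX,Noise
QUEENS,
BRONX,Heating
,Noise
"""

def profile(lines):
    """Count observed values and missing cells per column, row by row."""
    counts = {}          # column name -> Counter of observed values
    missing = Counter()  # column name -> number of empty cells
    for row in csv.DictReader(lines):
        for col, value in row.items():
            if col is None:
                continue  # extra cells beyond the header are ignored
            if value is None or value.strip() == "":
                missing[col] += 1
            else:
                counts.setdefault(col, Counter())[value] += 1
    return counts, missing

counts, missing = profile(io.StringIO(SAMPLE))
for col, counter in counts.items():
    print(col, counter.most_common(), "missing:", missing[col])
```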
311 Service Requests from 2010 to Present
Non-emergency City services and information
How to: Look at the raw data
Data sets are in general huge
Do not load a huge file into memory
Example: ~500M tweets/day × 140 B × 7 days ≈ 500 GB of RAM
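The arithmetic behind that estimate, followed by a sketch of inspecting a file without loading it. The helpers `head` and `count_lines` are hypothetical names, and an in-memory `StringIO` stands in for a real file object, which iterates line by line in the same way:

```python
import io
from itertools import islice

# Back-of-the-envelope size of the example above.
tweets_per_day = 500_000_000
bytes_per_tweet = 140
days = 7
total_bytes = tweets_per_day * bytes_per_tweet * days
print(f"{total_bytes / 1e9:.0f} GB")  # 490 GB

def head(lines, n=5):
    """Return the first n lines without reading the rest."""
    return [line.rstrip("\n") for line in islice(lines, n)]

def count_lines(lines):
    """Count lines while holding only one in memory at a time."""
    return sum(1 for _ in lines)

# Stand-in for an opened file; a real file object behaves the same way.
fake_file = io.StringIO("".join(f"row {i}\n" for i in range(1000)))
print(head(fake_file, 3))
fake_file.seek(0)
print(count_lines(fake_file))
```

Both helpers consume an iterator, so memory use stays flat regardless of file size.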
How to: Look at the raw data
Old-school shell tools for raw data inspection
See more at https://pixorblog.wordpress.com/2016/06/24/csv-files-and-bash/
Summary stats
Mean, mode, median
Variance
Range, max, min
Correlation
…
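A short sketch of the statistics listed above, using only Python's standard library. The two lists are made-up measurements, and `pearson` is a hand-written helper for the correlation coefficient rather than a library call:

```python
import statistics

# Made-up measurements, for illustration only.
xs = [1.0, 2.0, 2.0, 3.0, 4.0, 6.0]
ys = [2.1, 3.9, 4.2, 6.1, 8.0, 11.9]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

summary = {
    "mean": statistics.fmean(xs),
    "median": statistics.median(xs),
    "mode": statistics.mode(xs),
    "variance": statistics.variance(xs),  # sample variance (n - 1)
    "min": min(xs),
    "max": max(xs),
    "range": max(xs) - min(xs),
    "correlation": pearson(xs, ys),
}
for name, value in summary.items():
    print(f"{name:12s} {value:.3f}")
```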
Beware of summary stats
Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17–21. From Wikipedia.
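The point can be checked numerically. The sketch below uses the four datasets from Anscombe's paper and computes the same summary statistics for each. The `pearson` helper is hand-written for this example.

```python
import statistics

# Anscombe's quartet (Anscombe, 1973): four x/y datasets.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

summary = {}
for name, (x, y) in QUARTET.items():
    summary[name] = (
        statistics.fmean(x),     # mean of x
        statistics.fmean(y),     # mean of y
        statistics.variance(x),  # sample variance of x
        pearson(x, y),           # correlation of x and y
    )
    mx, my, vx, r = summary[name]
    print(f"{name:3s} mean_x={mx:.2f} mean_y={my:.2f} var_x={vx:.2f} r={r:.3f}")
```

The printed rows agree to about two decimal places, although the four scatter plots look very different.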
Beware of summary stats
https://www.autodeskresearch.com/publications/samestats
Data Visualization
Data points across some features
Features across all data points
Histograms
…
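As one way to look at a single feature across all data points, a histogram can be built and printed as text with the standard library alone. The values are synthetic, and the `histogram` helper and the bin width of 5 are arbitrary choices for this sketch:

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable
# Synthetic feature values, generated only to have something to plot.
values = [rng.gauss(50.0, 10.0) for _ in range(500)]

def histogram(data, bin_width):
    """Map each bin's lower edge to the number of points falling in it."""
    bins = Counter()
    for v in data:
        lower = (v // bin_width) * bin_width  # floor to the bin's lower edge
        bins[lower] += 1
    return dict(sorted(bins.items()))

hist = histogram(values, bin_width=5.0)
for lower, count in hist.items():
    # One '#' per two data points keeps the bars short.
    print(f"[{lower:5.1f}, {lower + 5.0:5.1f}) {'#' * (count // 2)}")
```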
Followup
Do you have the right question?
Refine your question
Do you have the right data?
Do you need other data? More data?
Collect data according to needs