Mercer Data Information Knowledge 3 Presentation originally from the University of Texas at Austin...

Mercer

Data Information Knowledge 3

Presentation originally from the University of Texas at Austin

Edits by Rick Mercer

Outline

• Review scope of big data
• Searching using Indexes
• Analyzing Data

Review: Big data is huge

• 50 petabytes of data = 25 trillion pages of text!
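
As a rough check of the figure above, the arithmetic can be sketched as follows. The 2,000 bytes per page of plain text and the decimal petabyte (10^15 bytes) are assumptions chosen for illustration, not values stated on the slide.

```python
# Rough size check. Both constants are assumptions for illustration only.
BYTES_PER_PETABYTE = 10**15   # decimal petabyte
BYTES_PER_PAGE = 2_000        # assumed size of one page of plain text


def pages_of_text(petabytes: int) -> int:
    """Return how many assumed-size pages fit in the given number of petabytes."""
    return petabytes * BYTES_PER_PETABYTE // BYTES_PER_PAGE


print(f"{pages_of_text(50):,} pages")  # 25,000,000,000,000 pages
```

With a different assumed page size the page count changes proportionally.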

Big data is increasing

[Figure: Big Data Created, by year, 2005 to 2013]

Big data is mostly unstructured

[Figure: Big Data Organization, comparing unstructured data with structured data]

Try to structure part of the web?

• Top level domain names attempted, but
  – .edu .com .org .biz .ca .co.uk .nz .mx
• Open Directory Project
  – Instead of applying a formula to search strings, this lists directories that you drill into
• Compare searches here and on Google for
  – machine learning software

Google Search Formula

• Google uses programs (spiders) to index and explore the Web:
  – visit webpages,
  – gather all of the links on each page visited, and
  – add them to their list of pages to visit in the future
• Google takes your words and examines its index for pages that have your words
  – Applies 200 questions to determine result list
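
The spider steps above (visit a page, gather its links, add them to the list of pages to visit) can be sketched as a breadth-first traversal. This is a minimal sketch and not Google's actual crawler: the `TOY_WEB` dictionary and the `crawl` function are hypothetical stand-ins, and no real pages are fetched.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}


def crawl(start: str, web: dict[str, list[str]]) -> list[str]:
    """Visit pages breadth-first, queueing every newly seen link."""
    to_visit = deque([start])   # pages to visit in the future
    seen = {start}              # pages already queued, to avoid repeats
    visited = []                # order in which pages were visited
    while to_visit:
        page = to_visit.popleft()
        visited.append(page)                # "visit" the page
        for link in web.get(page, []):      # gather the links on the page
            if link not in seen:
                seen.add(link)
                to_visit.append(link)       # add to the list of pages to visit
    return visited


print(crawl("a.example", TOY_WEB))
# ['a.example', 'b.example', 'c.example', 'd.example']
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other.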

What is an index?

• An index organizes pairings of conceptual topics and their locations

• Google creates an index to look things up, much like the index in a book

• Why do searches use indexes? Efficiency!
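
To make the book-index analogy concrete, here is a minimal sketch of an inverted index: each word is paired with the set of pages where it appears, so a search looks up words instead of rereading every page. The page texts, `build_index`, and `search` are hypothetical and far simpler than a real search engine.

```python
from collections import defaultdict

# Hypothetical pages, keyed by a made-up page id.
PAGES = {
    1: "big data is huge",
    2: "an index makes search fast",
    3: "search big data with an index",
}


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the ids of the pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(PAGES)
print(sorted(search(INDEX, "big data")))      # [1, 3]
print(sorted(search(INDEX, "index search")))  # [2, 3]
```

Building the index costs one pass over all pages, but each later search only touches the entries for the query words.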

Why index?

• An index provides an easy way to find pertinent information related to a topic

• Building indexes is difficult, but indexes make it possible to get results in 0.5 seconds
  – Google index is 100,000,000 gigabytes
  – over one million computing hours to build it

• Play first 2 minutes of this video

Analyzing Data

Making sense of our world with numbers

Old school – the Scientific Method

• Graph shows scientific method we are taught
  – Investigate phenomena to acquire new knowledge
• Procedures vary
• Statistical Hypothesis Testing will be shown later, as an application of the scientific method

Analyzing Statistics

Three kinds of statistical analysis are commonly used by scientists, mathematicians, politicians, and other professionals across the globe.

Descriptive Statistics

1. Descriptive analytics provides information about collected data via statistics such as mean, median, mode, range
  – These tend to 'describe' circumstances, but do not offer conjectures about unknowns
  – Example: the percentage of graduates employed within 6 months of graduating
  – Application: Google’s indexing of the web
  – Consider another site dealing with describing data (recorded search history on any topic):
    • http://www.google.com/trends/
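
The statistics named above can be computed with Python's standard `statistics` module. The list of months below is made-up data for illustration, not a real survey of graduates.

```python
import statistics

# Hypothetical data: months each graduate took to find a job.
months_to_job = [1, 2, 2, 3, 4, 5, 6, 9]

mean = statistics.mean(months_to_job)
median = statistics.median(months_to_job)
mode = statistics.mode(months_to_job)
value_range = max(months_to_job) - min(months_to_job)

# Percentage of graduates employed within 6 months.
within_six = sum(m <= 6 for m in months_to_job) / len(months_to_job) * 100

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"employed within 6 months: {within_six}%")
```

Every number here describes the data that was collected; none of them says anything about graduates who were not measured.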

Predictive Analytics

2. Predictive Analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends
  – Does not tell the future with certainty; predictions may be wrong
  – Example: Given that 90 of the 100 CS graduates were employed within 6 months in 2011, it is __ % likely that 108 of the 120 CS graduates in 2015 will be employed within 6 months
  – Upcoming Application: ranking pages based on a search query
  – Ron Burgundy 8-second clip
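
One way to put a number on a statement like the graduate example is a binomial model. This is only a sketch under a strong assumption that the slide does not make: each 2015 graduate is treated as independently employed with the 2011 rate (90 of 100) as the probability. The function name `binomial_pmf` is introduced here for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumption: the 2011 rate is the per-graduate probability in 2015.
p_employed = 90 / 100
n_graduates = 120

exactly_108 = binomial_pmf(108, n_graduates, p_employed)
at_least_108 = sum(
    binomial_pmf(k, n_graduates, p_employed)
    for k in range(108, n_graduates + 1)
)

print(f"P(exactly 108 of 120)  = {exactly_108:.1%}")
print(f"P(at least 108 of 120) = {at_least_108:.1%}")
```

The two printed values differ, which shows that the answer depends on how the question is read ("exactly 108" versus "at least 108") as well as on the model chosen.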

Prescriptive Analytics

3. Prescriptive Analytics is the practice of extracting information from existing data sets in order to determine patterns, predict future outcomes and trends, and recommend a course of action
  – Does not predict the future
  – Example: how likely is it that I will find a high-paying job if I choose to major in ‘computer science’ rather than ‘biology’
  – Upcoming Application: autocomplete makes recommendations based on previous rankings

How Useful are these three?

• The following grades each type of analysis on its utility (how useful is it?) and confidence (how likely is it to be true and/or valid?) in the context of decision making

Analysis Type   Utility Level   Confidence Level   Example
Descriptive     C               A+                 Obama got x% of the vote
Predictive      B+              B-                 x% chance of winning
Prescriptive    A               C                  How to run the campaign

An example of hypothesis testing

Car Talk Puzzler – “The Case of the Finicky Volare”

Not statistical hypothesis testing

• A man thinks he is having some peculiar car trouble:

“It doesn't like a certain kind of ice cream I buy.”

• He goes on to explain that he only has three flavors of ice cream he likes:
  – Vanilla
  – Chocolate
  – Three-bean tofu mint chipped-beef ice cream.

Instead, only establish an hypothesis

He says:

“When I go to buy chocolate I park in front of the ice cream parlor: I buy the chocolate, and my car starts right up. I buy the vanilla, and my car starts right up. However, if I buy the three-bean tofu mint chipped-beef, my car won't start.”

• What could be the issue with this car?

Deriving an hypothesis

“When you go in to buy chocolate, you go into the freezer case and there's chocolate in a container -- you take it, you pay for it, you get into your car and drive away. Same thing with vanilla, but nobody buys mint chip beef bean tofu, right? So, somebody must hand pack the ice cream into a special container”

• What could be the issue with this car?

Hypothesis Example

• The car is old.
• It takes longer to purchase hand packed ice cream than pre-packed ice cream.
• Ice cream is purchased more often in the summer when it is hot

Hypothesis: A proposed explanation for this phenomenon

The car overheats and ‘vapor locks’ in the extra time it takes to purchase three-bean tofu mint chipped-beef ice cream.

Statistical Hypothesis Testing

• To get an idea: statistical hypothesis testing involves more than car mechanics

• For a more realistic case, view 1.5 minutes of this example from Khan Academy
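
For a sense of what a statistical hypothesis test computes, here is a minimal sketch of a two-sided one-sample z-test. All four input numbers are made up for illustration and are not taken from the slides or the video; using the sample standard deviation in place of the population value is an approximation that is reasonable only for large samples.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def one_sample_z_test(sample_mean, null_mean, sample_sd, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    z = (sample_mean - null_mean) / (sample_sd / sqrt(n))
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value


# Hypothetical measurements: 100 treated subjects versus a known baseline mean.
z, p = one_sample_z_test(sample_mean=1.05, null_mean=1.2, sample_sd=0.5, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")

# 0.05 is a conventional significance level, chosen here as an assumption.
if p < 0.05:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Fail to reject the null hypothesis at the 5% level")
```

The p-value is the probability of seeing a sample mean at least this far from the baseline if the treatment had no effect.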

Exploratory data analysis

• Exploratory data analysis is another method of scientific inquiry

• Utilize big data statistics to postulate correlations (data sets that are linked together) that have not yet been hypothesized

• Attempts to discover patterns in order to establish correlative links
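
A common first measurement when looking for such links is the Pearson correlation coefficient, sketched below. The two weekly series are hypothetical, and `pearson_r` is a helper written for this example.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly counts: searches for a product and sales of it.
searches = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]

r = pearson_r(searches, sales)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 suggests a strong linear link, but it does not show that one series causes the other.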

An example of exploratory data analysis

• Data from millions of searches to predict what you are looking for

• Discovers patterns

• Correlations not guaranteed
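
The idea of predicting a search from earlier searches can be sketched by counting how often each completion follows a prefix. The search log below is invented for illustration, and `suggest` is a hypothetical helper, not how any real search engine implements autocomplete.

```python
from collections import Counter

# Hypothetical log of past searches.
SEARCH_LOG = [
    "how to change a tire",
    "how to change a password",
    "how to change a tire",
    "how to change a lightbulb",
    "how to change a tire",
    "how to change a password",
    "how to bake bread",
]


def suggest(prefix: str, log: list[str], top_n: int = 3) -> list[tuple[str, int]]:
    """Return the most frequent completions of the prefix with their counts."""
    completions = Counter(
        query[len(prefix):].strip()
        for query in log
        if query.startswith(prefix) and len(query) > len(prefix)
    )
    return completions.most_common(top_n)


print(suggest("how to change a ", SEARCH_LOG))
# [('tire', 3), ('password', 2), ('lightbulb', 1)]
```

The ranking reflects only what was searched most often in the log, so a suggestion is a guess about the pattern and not a guarantee.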

Analog autocomplete

“How to change a _______”

[Figure: chart with Frequency on the Y-axis and Words in blank on the X-axis]

Exploratory data analysis

Statistical Hypothesis Testing (example: injections into rats) compared with Exploratory Data Analysis (example: autocomplete)

• Confidence
  – Statistical Hypothesis Testing: since the test is constrained to a specific issue (and the variables are known), confidence can be really high
  – Exploratory Data Analysis: confidence is typically lower, because the data are messier and the connections among them are unknown
• Scope
  – Statistical Hypothesis Testing: works over a typically circumscribed problem
  – Exploratory Data Analysis: usually applied to much larger datasets with more unknowns
• Sample size
  – Statistical Hypothesis Testing: power increases with sample size, and it is sometimes possible to determine a minimum sample size to guarantee a desired confidence
  – Exploratory Data Analysis: power is dependent on sample size; because there is no established hypothesis, it is unknown how large a sample size must be to discover knowledge

Some big data sets and collections

• Google Public Data Explorer (130 datasets from Bureau of Labor Statistics, U.S. Census Bureau, etc.)
• data.gov - an online repository of datasets from U.S. Government
  – Many counties have searchable property databases, such as the Travis County Appraisal District
  – Many counties have searchable legal databases, such as the Travis County Clerk
  – Some data sets defy categorization, such as the Texas Death Row Executions data set
• Google's Ngram Data - data on Google's catalog of millions of books, including raw data sets
• Google Trends - detailed search history information, including CSV downloads
• NOAA National Climatic Data Center
• Knoema - "free to use public and open data platform for users with interests in statistics and data analysis, visual storytelling and making infographics"
• Geocommons - "all about open data analysis and maps"
• Stat Silk - "interactive maps of open data"
• Better World Flux - "a beautiful interactive visualization of information on what really matters in life"
• Gapminder - "unveiling the beauty of statistics for a better world view"