Mercer Data Information Knowledge 3 Presentation originally from the University of Texas at Austin...

Post on 16-Jan-2016


Mercer

Data Information Knowledge 3

Presentation originally from the University of Texas at Austin

Edits by Rick Mercer


Outline

• Review scope of big data
• Searching using Indexes
• Analyzing Data


Review: Big data is huge

• 50 petabytes of data = 25 trillion pages of text!
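The figure above can be sanity-checked with a little arithmetic. The sketch below assumes decimal petabytes (10^15 bytes) and roughly 2,000 bytes per page of plain text; neither assumption comes from the slide, and the name `pages_of_text` is purely illustrative.

```python
# Assumptions (not from the slide): decimal units, ~2,000 bytes per page of text.
BYTES_PER_PETABYTE = 10**15
ASSUMED_BYTES_PER_PAGE = 2_000

def pages_of_text(petabytes: int) -> int:
    """Estimate how many plain-text pages fit in the given number of petabytes."""
    return petabytes * BYTES_PER_PETABYTE // ASSUMED_BYTES_PER_PAGE

pages = pages_of_text(50)
print(f"{pages:,} pages")  # 25,000,000,000,000 pages, i.e. 25 trillion
```

Under a different page-size assumption the count changes proportionally, so the slide's number is best read as an order-of-magnitude estimate.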


Big data is increasing

[Chart: Big Data Created, by year, 2005 through 2013]


Big data is mostly unstructured

[Chart: Big Data Organization, comparing unstructured data with structured data]


Try to structure part of the web?

• Top level domain names attempted, but
  – .edu .com .org .biz .ca .co.uk .nz .mx
• Open Directory Project
  – Instead of applying a formula to search strings, this lists directories that you drill into
• Compare searches here and on Google for
  – machine learning software


Google Search Formula

• Google uses programs (spiders) to index and explore the Web:
  – visit webpages,
  – gather all of the links on each page visited, and
  – add them to their list of pages to visit in the future
• Google takes your words and examines its index for pages that have your words
  – Applies 200 questions to determine result list
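The visit, gather, and add-to-list loop described above can be sketched in a few lines. This is a minimal illustration, not Google's actual crawler: `TOY_WEB` is a hypothetical in-memory "web" standing in for real pages, and `crawl` is an invented name.

```python
from collections import deque

# Hypothetical miniature "web": each page name maps to the links found on it.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home", "team"],
    "news": ["home", "story"],
    "team": [],
    "story": ["news"],
}

def crawl(start: str, web: dict[str, list[str]]) -> list[str]:
    """Visit pages breadth-first, queueing each newly discovered link."""
    to_visit = deque([start])   # the list of pages to visit in the future
    seen = {start}              # pages already queued, so none is visited twice
    visited = []
    while to_visit:
        page = to_visit.popleft()
        visited.append(page)                # visit the page
        for link in web.get(page, []):      # gather all of the links on it
            if link not in seen:
                seen.add(link)
                to_visit.append(link)       # add new links to the list
    return visited

order = crawl("home", TOY_WEB)
print(order)  # ['home', 'about', 'news', 'team', 'story']
```

The `seen` set is the one design choice worth noting: without it, pages that link back to each other would be queued forever.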


What is an index?

• An index organizes pairings of conceptual topics and their locations

• Google creates an index to look things up, much like the index in a book

• Why do searches use indexes? Efficiency!
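To make the book-index analogy concrete, here is a minimal sketch of an inverted index: a mapping from each word to the pages that contain it. The pages in `PAGES` and the function names are hypothetical, and a real search index stores far more than raw words.

```python
from collections import defaultdict

# Hypothetical pages; a real search engine indexes far more than raw words.
PAGES = {
    "page1": "machine learning software for big data",
    "page2": "big data is mostly unstructured",
    "page3": "learning to index the web",
}

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return dict(index)

def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(PAGES)
print(search(index, "big data"))   # ['page1', 'page2']
print(search(index, "learning"))   # ['page1', 'page3']
```

The efficiency comes from doing the expensive pass over every page once, when the index is built; each search afterwards is a handful of dictionary lookups rather than a scan of all the text.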


Why index?

• An index provides an easy way to find pertinent information related to a topic

• Building indexes is difficult, but indexes make it possible to get results in 0.5 seconds
  – Google index is 100,000,000 gigabytes
  – over one million computing hours to build it

• Play first 2 minutes of this video


Analyzing Data

Making sense of our world with numbers


Old school – the Scientific Method

• Graph shows the scientific method we are taught
  – Investigate phenomena to acquire new knowledge
• Procedures vary
• Statistical Hypothesis Testing will be shown later, as an application of the scientific method


Analyzing Statistics

Three kinds of statistical analysis are commonly used by scientists, mathematicians, politicians, and other professionals across the globe.


Descriptive Statistics

1. Descriptive analytics provides information about collected data via statistics such as mean, median, mode, and range
  – These tend to 'describe' circumstances, but do not offer conjectures about unknowns
  – Example: the percentage of graduates employed within 6 months of graduating
  – Application: Google’s indexing of the web
  – Consider another site dealing with describing data (recorded search history on any topic):
    • http://www.google.com/trends/
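As a small illustration of the statistics named above, the sketch below summarizes a made-up sample with Python's standard `statistics` module. The data in `months_to_job` are hypothetical and exist only to show the calculations.

```python
import statistics

# Hypothetical sample: months each of nine graduates took to find a job.
months_to_job = [1, 2, 2, 3, 4, 4, 4, 6, 10]

mean = statistics.mean(months_to_job)
median = statistics.median(months_to_job)
mode = statistics.mode(months_to_job)
value_range = max(months_to_job) - min(months_to_job)

# Share of graduates employed within 6 months, as in the slide's example.
share_within_6 = sum(m <= 6 for m in months_to_job) / len(months_to_job)

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"employed within 6 months: {share_within_6:.0%}")  # 89%
```

Every number printed here describes the nine graduates in the sample and nothing else; saying anything about other graduates is the job of predictive analytics.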


Predictive Analytics

2. Predictive Analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends
  – Does not foretell the future with certainty; predictions may be wrong
  – Example: Given that 90 of the 100 CS graduates were employed within 6 months in 2011, it is __ % likely that 108 of the 120 CS graduates in 2015 will be employed within 6 months
  – Upcoming Application: ranking pages based on a search query
  – Ron Burgundy 8-second clip
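One way to attach a probability to the graduate example is a simple binomial model. The sketch below is only one possible reading of the exercise: it assumes each 2015 graduate is employed independently with the rate observed in 2011, an assumption the slide does not state, and the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Estimate the employment rate from the 2011 cohort on the slide.
p_hat = 90 / 100

# Assumption: each of the 120 graduates in 2015 is employed independently
# with that same probability. Real cohorts need not behave this way.
n_2015 = 120
p_exactly_108 = binomial_pmf(108, n_2015, p_hat)
p_at_least_108 = sum(
    binomial_pmf(k, n_2015, p_hat) for k in range(108, n_2015 + 1)
)

print(f"P(exactly 108 employed)  = {p_exactly_108:.3f}")
print(f"P(at least 108 employed) = {p_at_least_108:.3f}")
```

Under this model 108 is the single most likely count, yet the chance of hitting it exactly is still small, which is one reason predictions are usually stated as ranges rather than single values.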


Prescriptive Analytics

3. Prescriptive Analytics is the practice of extracting information from existing data sets in order to determine patterns, predict future outcomes and trends, and recommend a course of action
  – Does not predict the future with certainty
  – Example: how likely is it that I will find a high-paying job if I choose to major in ‘computer science’ rather than ‘biology’
  – Upcoming Application: autocomplete makes recommendations based on previous rankings
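The step from prediction to recommendation can be sketched as "estimate each option, then pick the best one". The survey counts below are invented purely for illustration and say nothing about real job markets; `SURVEY`, `estimated_rate`, and `recommend` are hypothetical names.

```python
# Hypothetical survey counts, invented for illustration only:
# major -> (graduates surveyed, graduates who found a high-paying job).
SURVEY = {
    "computer science": (200, 120),
    "biology": (200, 70),
}

def estimated_rate(surveyed: int, high_paying: int) -> float:
    """Predictive step: estimate the chance of a high-paying job."""
    return high_paying / surveyed

def recommend(survey: dict[str, tuple[int, int]]) -> str:
    """Prescriptive step: recommend the option with the best estimate."""
    return max(survey, key=lambda major: estimated_rate(*survey[major]))

for major, counts in SURVEY.items():
    print(f"{major}: {estimated_rate(*counts):.0%}")
print("Recommended major:", recommend(SURVEY))
```

The recommendation is only as good as the estimates behind it, and it ignores everything the data does not measure.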


How Useful are these three?

• The following grades each type of analysis on its utility (how useful is it?) and confidence (how likely is it to be true and/or valid?) in the context of decision making

Analysis Type   Utility Level   Confidence Level   Example
Descriptive     C               A+                 Obama got x% of the vote
Predictive      B+              B-                 x% chance of winning
Prescriptive    A               C                  How to run the campaign


An example of hypothesis testing

Car Talk Puzzler – “The Case of the Finicky Volare”


Not statistical hypothesis testing

• A man thinks he is having some peculiar car trouble:

“It doesn't like a certain kind of ice cream I buy.”

• He goes on to explain that he only has three flavors of ice cream he likes:
  – Vanilla
  – Chocolate
  – Three-bean tofu mint chipped-beef ice cream.


Instead, only establish an hypothesis

He says:

“When I go to buy chocolate I park in front of the ice cream parlor: I buy the chocolate, and my car starts right up. I buy the vanilla, and my car starts right up. However, if I buy the three-bean tofu mint chipped-beef, my car won't start.”

• What could be the issue with this car?


Deriving an hypothesis

“When you go in to buy chocolate, you go into the freezer case and there's chocolate in a container -- you take it, you pay for it, you get into your car and drive away. Same thing with vanilla, but nobody buys mint chip beef bean tofu, right? So, somebody must hand pack the ice cream into a special container”

• What could be the issue with this car?


Hypothesis Example

• The car is old.
• It takes longer to purchase hand packed ice cream than pre-packed ice cream.
• Ice cream is purchased more often in the summer when it is hot

Hypothesis: A proposed explanation for this phenomenon

The car overheats and ‘vapor locks’ in the extra time it takes to purchase three-bean tofu mint chipped-beef ice cream.


Statistical Hypothesis Testing

• Statistical hypothesis testing involves more than car mechanics
• For a more realistic example, view 1.5 minutes of this example from Khan Academy
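For a sense of what the statistical version looks like, here is a minimal sketch of a one-sample, two-sided z-test. All numbers are hypothetical (a drug-and-response-time experiment is assumed purely for illustration), the function name is invented, and using the sample standard deviation in a z-test assumes the sample is large.

```python
from math import erfc, sqrt

def two_sided_z_test(sample_mean, null_mean, sample_sd, n):
    """Return (z, p) for a one-sample z-test of sample_mean against null_mean."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # P(|Z| >= |z|) for a standard normal Z
    return z, p_value

# Hypothetical numbers: response times in seconds for 100 treated rats,
# compared with a known untreated mean of 1.2 seconds.
z, p_value = two_sided_z_test(sample_mean=1.05, null_mean=1.2, sample_sd=0.5, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z = -3.00, p = 0.0027
```

A p-value this small means that, if the treatment had no effect, a sample mean this far from 1.2 seconds would be rare, so the "no effect" hypothesis would typically be rejected.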


Exploratory data analysis

• Exploratory data analysis is another method of scientific inquiry

• Uses big data statistics to postulate correlations (data sets that are linked together) that have not yet been hypothesized

• Attempts to discover patterns in order to establish correlative links


An example of exploratory data analysis

• Uses data from millions of searches to predict what you are looking for
• Discovers patterns
• Correlations not guaranteed
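A frequency-based autocomplete can be sketched by counting past queries that share a prefix and suggesting the most common ones. The search log below is hypothetical and tiny, and this is not a description of how any real search engine ranks suggestions.

```python
from collections import Counter

# Hypothetical search log; a real system would draw on millions of queries.
SEARCH_LOG = [
    "how to change a tire",
    "how to change a tire",
    "how to change a tire",
    "how to change a password",
    "how to change a password",
    "how to change a lightbulb",
    "how to tie a tie",
]

def autocomplete(prefix: str, log: list[str], limit: int = 3) -> list[str]:
    """Suggest the most frequent past queries that start with the prefix."""
    counts = Counter(q for q in log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(limit)]

suggestions = autocomplete("how to change a ", SEARCH_LOG)
print(suggestions)
# ['how to change a tire', 'how to change a password', 'how to change a lightbulb']
```

No hypothesis about tires or passwords was stated in advance; the pattern comes out of the counts, which is what makes this exploratory.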


Analog autocomplete


“How to change a _______”

[Chart: y-axis: Frequency; x-axis: Words in blank]


Exploratory data analysis

Comparing Statistical Hypothesis Testing (example: injections into rats) with Exploratory Data Analysis (example: autocomplete)

• Confidence
  – Statistical Hypothesis Testing: since the test is constrained to a specific issue (and the variables are known), confidence can be really high
  – Exploratory Data Analysis: confidence is typically lower, because the data are messier and the connections among them are unknown
• Scope
  – Statistical Hypothesis Testing: works over a typically circumscribed problem
  – Exploratory Data Analysis: usually applied to much larger datasets with more unknowns
• Sample size
  – Statistical Hypothesis Testing: power increases with sample size, and it is sometimes possible to determine a minimum sample size that guarantees a desired confidence
  – Exploratory Data Analysis: power also depends on sample size, but because there is no established hypothesis, it is unknown how large a sample must be to discover knowledge


Some big data sets and collections

• Google Public Data Explorer (130 datasets from Bureau of Labor Statistics, U.S. Census Bureau, etc.)
• data.gov - an online repository of datasets from U.S. Government
  – Many counties have searchable property databases, such as the Travis County Appraisal District
  – Many counties have searchable legal databases, such as the Travis County Clerk
  – Some data sets defy categorization, such as the Texas Death Row Executions data set
• Google's Ngram Data - data on Google's catalog of millions of books, including raw data sets
• Google Trends - detailed search history information, including CSV downloads
• NOAA National Climatic Data Center
• Knoema - "free to use public and open data platform for users with interests in statistics and data analysis, visual storytelling and making infographics"
• Geocommons - "all about open data analysis and maps"
• Stat Silk - "interactive maps of open data"
• Better World Flux - "a beautiful interactive visualization of information on what really matters in life"
• Gapminder - "unveiling the beauty of statistics for a better world view"