
Data Science

So far, you have been exposed to small, clearly defined sets of data that fit easily into spreadsheets or databases (DBs). However, technological advances have produced an explosion of captured data. Mobile devices, Google searches, Instagram and Facebook posts, not to mention cameras, sensors, and internet-enabled devices like cars and appliances, all generate data that overwhelms the capabilities of a database management system (DBMS). By overwhelm we mean not only the sheer volume of data (1.47 billion people log into Facebook daily, and every ad they click or post they like is data) but also the rapidity with which the data changes and its unstructured nature. How do you quantify images posted on Instagram to see what is trending? Or accumulate the Google searches regarding a subject?

But you may be asking yourself, why would I want to look at Google searches? In this mass of data, “insights” can be found. For example, law enforcement could monitor Facebook posts about upcoming political events and, based on the intensity of opinions in proximity to the events, allocate police officers differently. Or if Google searches revealed an increase in “how to commit suicide” searches in a particular location, authorities could ensure that more personnel were assigned to suicide hotlines (in anticipation of an increase in calls), thereby ensuring that nobody is put on hold when they call and hopefully decreasing deaths.

These examples are relatively simple insights. The true “magic” of data science is using math, computer science, statistics, and data analysis against these massive data sets to generate data models, processes, and advanced systems that produce actionable conclusions or predictions that help improve the human condition. These improvements can range from making a business more profitable to preventing deaths, improving people’s health, or better protecting the environment.

Data science encompasses the acquisition, storage, and processing of data, including “Big Data”. Big Data is a phrase that describes massive data sets that are beyond the capability of a DBMS. Big Data can consist of three types of data.

Structured data – data that is clearly formatted and can be stored and processed by a DBMS.

Unstructured data – data that does not fit into a DB. Images, narratives, voice messages, Twitter posts, and web pages are all examples of things that contain data but are not compatible with a DBMS.

Semi-structured data – data with some format or hierarchy. For example, spreadsheets can contain data that is very structured and easily adaptable to a DBMS, but they can also contain data that is not. Semi-structured data can also be defined using XML or CSV (comma-separated values) files.
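To make the distinction concrete, here is a small Python sketch; the customer record, field names, and file contents are invented for illustration. It reads the same kind of information from a structured CSV row (fixed columns) and from a semi-structured XML fragment (tags give the data a hierarchy, and extra fields can appear).

import csv, io
import xml.etree.ElementTree as ET

csv_text = "name,city\nAlice Smith,Orlando\n"        # structured: fixed columns
xml_text = "<customer><name>Alice Smith</name><city>Orlando</city><note>prefers email</note></customer>"

row = next(csv.DictReader(io.StringIO(csv_text)))     # {'name': 'Alice Smith', 'city': 'Orlando'}
elem = ET.fromstring(xml_text)                        # hierarchy defined by the tags
record = {child.tag: child.text for child in elem}    # picks up the extra <note> field too
print(row, record)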

So, a data scientist’s job is to acquire the data, transform it into an analyzable form, store it, analyze it, and then present the insights to stakeholders. Wow, that sounds like a lot of work! Fortunately, there are automated tools to help with all these tasks.


We will break up the rest of this lesson into the three main data science tasks: Data Wrangling, Data Analytics, and Visualization.

Data Wrangling

There are lots of other terms to describe this activity. A few of them are:

Data Preprocessing
Data Preparation
Data Cleansing
Data Scrubbing
Data Munging
Data Transformation

The data scientist does not always have to accumulate the data. There are many publicly available sources of data. For example, www.data.gov has a large amount of economic, environmental, and industry data.

Here is one page available from www.data.gov that provides access to 2015 crime statistics. https://ucr.fbi.gov/crime-in-the-u.s https://ucr.fbi.gov/crime-in-the-u.s/2015/crime-in-the-u.s.-2015/resource-pages/downloads/download-printable-files


If you scroll down, you’ll notice that the data is provided in various formats such as Excel files and PDFs.

And the following is an example of one of the many Excel files available.


Even with so much data available, it can still be a considerable task to find the data you’re looking for. The example above doesn’t shed any light on what time of year or day most vehicles are stolen, or where they are stolen from – public places vs. private residences. Bottom line: finding the data can be a lot of work.

Then there is the problem of finding data that is not formatted or has integrity problems. Take the following narrative: “Looks like my V8 Chevy is running low on fuel. I just filled it up the day before.” There is clearly information here, but it is unstructured. The data scientist would fill in some of the blanks from the statement (like how full the gas tank was), clarify other data (what day was “the day before”), and provide structure by putting the data into a DB table as follows:
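As a hedged illustration of that structuring step, here is a minimal Python sketch. The table name, column names, and the filled-in values (the tank level and the resolved date) are assumptions for this example, not the course’s actual table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fuel_log (vehicle TEXT, engine TEXT, fuel_level TEXT, last_fill_date TEXT)")
# 'the day before' is resolved to an actual (assumed) date before loading
conn.execute("INSERT INTO fuel_log VALUES (?, ?, ?, ?)",
             ("Chevy", "V8", "low", "2019-01-14"))
for row in conn.execute("SELECT * FROM fuel_log"):
    print(row)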

Here is another example of unstructured data:

DALDFWSFOEWRBOSDCALAXORDJFKMCO

This looks like a nonsensical string of random letters. However, with a little structure the underlying information becomes very apparent:


DAL DFW SFO EWR BOS DCA LAX ORD JFK MCO
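A few lines of Python are enough to add that structure; this small sketch simply splits the string into three-letter airport codes.

raw = "DALDFWSFOEWRBOSDCALAXORDJFKMCO"
codes = [raw[i:i+3] for i in range(0, len(raw), 3)]   # slice into 3-character codes
print(codes)   # ['DAL', 'DFW', 'SFO', 'EWR', 'BOS', 'DCA', 'LAX', 'ORD', 'JFK', 'MCO']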

Other problems include missing data, incorrect data and poorly structured data. For example, here is a CSV (comma separated values) file with name and address information.

We then used an automated tool to convert that data into a DB table.

The missing table data is the result of the CSV file’s missing data, but its badly formatted data also resulted in the third person’s name and address being stored in the third row’s Customer field. These are all problems that the data scientist must resolve before analysis can be performed.
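Here is a hedged Python sketch of how such problems surface when the file is loaded. The CSV content below is hypothetical (the original file is not reproduced here), but it shows both a missing value and a row whose fields collapsed into the Customer column.

import io
import pandas as pd

csv_text = """Customer,Street,City
Ann Lee,12 Oak St,Jacksonville
Bob Ray,,Tampa
Cara Fox 9 Elm St Miami,,
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.isna().sum())          # count of missing values per column
print(df[df["City"].isna()])    # rows where everything landed in one field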

Even if the data is accurate and complete, it can still be poorly formatted (even if it is in a DB table). For example, here is a sample invoice:


The invoice contains a lot of data:

The selling company’s and purchaser’s name and address
The invoice number and date
The items being purchased, their quantities and prices
The subtotal, discount, and the invoice total

The data scientist can capture all that data in a DB table like this:

Though all the data is there, this table is poorly structured. One telltale sign of a poorly structured table is duplicate data. Notice that the selling company’s and purchaser’s name and address are repeated in each row. If there had been 25 items in the invoice, that data would have been duplicated 25 times. The subtotal, discount, and total data have the same duplication problem.

There is a procedure called data normalization that dictates how tables should be structured to reduce duplication and data integrity problems. In this case, normalization would have led the data scientist to create two tables. One table holds the one-time invoice data: the selling company’s and purchaser’s name and address, invoice number, date, subtotal, discount, and total. Each invoice would take up one row in this table, like this:

The other table would hold the invoice detail information regarding each item in the invoice. In this second table, each item would take up a single row.

Note that the duplicate data is eliminated and that we still have all the data.
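A minimal sketch of that normalization step, using invented invoice data and column names (not the invoice shown above), might look like this in Python:

import pandas as pd

flat = pd.DataFrame({
    "invoice_no": [101, 101, 101],
    "customer":   ["Acme Corp"] * 3,
    "item":       ["Widget", "Bolt", "Gear"],
    "qty":        [2, 10, 1],
    "price":      [9.50, 0.25, 30.00],
    "total":      [51.50] * 3,          # repeated on every row: a telltale sign of poor structure
})

header = flat[["invoice_no", "customer", "total"]].drop_duplicates()   # one row per invoice
detail = flat[["invoice_no", "item", "qty", "price"]]                  # one row per line item
print(header)
print(detail)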

Here is a link to an article https://www.trifacta.com/blog/six-core-data-wrangling-activities/ that details one company’s approach to data wrangling.

Wrangling tools


There are many automated tools that not only help you find inconsistencies in data but also allow users to easily correct them. Some of the more popular ones are:

OpenRefine
Trifacta Wrangler
Drake
TIBCO Clarity
Winpure

View the six-minute “Explore Data” video at http://openrefine.org/ to get a better feel for how the automated tools can speed up the wrangling process.

Data Warehouse

A data warehouse (DW) stores data from many disparate sources. For instance, data from many databases (DBs) may comprise a data warehouse. In addition, data warehouses organize data for query optimization (or OLAP – online analytical processing). Databases are used to support OLTP – online transaction processing. For instance, an inventory DB would hold information about all the products a company sells – price, color, size, etc. The data in a DB is organized for maximal transaction efficiency. When a sale occurs, the price of the sale item has to be retrieved quickly. To achieve this, data redundancy is reduced to a minimum to improve retrieval and update speed (i.e., the more data there is, the longer it takes to search, find, and update).

As mentioned earlier, the process to decrease redundancy is called normalization. However, this normalized data makes queries more complicated. The user has to have a good understanding of all the different tables and, to extract information from many tables, has to create much more complex queries. Data warehouses re-introduce redundancy to make the queries simpler.

In addition, data warehouses often contain historical data that OLTP systems don’t require. For example, a data warehouse might include historical price data. A sales application and DB only need the current price of an item, so historical price data would not be in the database. If a data scientist were trying to gain some insight into when and why prices change, the historic price data would be useful. In addition, the historic prices of the sale items’ components would be useful. For example, if the sale items were oil-based or made of copper, the historic prices for these materials might help yield insights into price changes. None of this information would be stored in a DB supporting OLTP.

There are some other differences between DWs and DBs. Usually DBs have to have 99.99% availability. If the price of an item is not available, a sale will be lost so the DB has to be available all the time. By their nature, DWs have to be unavailable to periodically load vast amounts of data from many sources. (This is commonly called ETL – extract, transform, and load.) In addition, DBs are used to support many users and many simple, fast transactions. DWs support fewer users running long, complex queries.
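As a rough illustration of the ETL idea, here is a minimal Python sketch; the file name, table name, and column names are assumptions for illustration, not a real warehouse schema. It extracts rows from a CSV export, transforms them (type conversion and simple cleanup), and loads them into a warehouse table.

import csv, sqlite3

def etl(csv_path="daily_sales.csv", db_path="warehouse.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales_history (sale_date TEXT, store TEXT, amount REAL)")
    with open(csv_path, newline="") as f:                      # extract
        for row in csv.DictReader(f):
            amount = float(row["amount"])                      # transform: normalize types
            conn.execute("INSERT INTO sales_history VALUES (?, ?, ?)",
                         (row["date"], row["store"].strip().upper(), amount))   # load
    conn.commit()
    conn.close()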

Hadoop is a very popular software platform used for DWs. Hadoop breaks Big Data into manageable data sets. Some major components of Hadoop:

HDFS (Hadoop Distributed File System) - distributed storage - splits data onto multiple systems


MapReduce - filters, sorts, and summarizes data (a toy sketch of the idea follows this list)
Hive - reads, writes, and manages large datasets residing in distributed storage
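Here is a toy sketch of the MapReduce idea written in plain Python rather than the Hadoop API: each record is mapped to key/value pairs, and the pairs are then reduced by key to produce a summary (here, counting airport codes from the earlier example).

from collections import defaultdict

records = ["DAL DFW SFO", "DFW SFO SFO", "DAL MCO"]

mapped = [(code, 1) for line in records for code in line.split()]   # map step: (key, 1) pairs
counts = defaultdict(int)
for code, n in mapped:                                               # reduce step: sum by key
    counts[code] += n
print(dict(counts))   # {'DAL': 2, 'DFW': 2, 'SFO': 3, 'MCO': 1}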

Here are a couple of short (6-8 minute) videos that explain Hadoop in more detail:
https://www.youtube.com/watch?v=4DgTLaFNQq0
https://www.youtube.com/watch?v=FHVuRxJpiwI

Data Analytics

This is where the magic happens - turning data into insights. Let’s further define what we mean by insights and then explore some of the techniques to uncover these insights. There are four types of insights:

Descriptive - what happened (generally historical); what data was correlated in the past. For example, Netflix tracks what subscribers are viewing and looks for correlations between customer types and the types of shows viewed in order to improve its recommendation engine.

Diagnostic - why did it happen. For example, suppose there is a correlation between users who view Disney shows and horror films. Digging deeper into the data reveals that when this occurs there are usually two users on the same account – one a Disney fan and the other a horror film fan. So, there is no real correlation between Disney shows and horror films for an individual viewer.

Predictive - predict future events or trends. This type of analysis involves complex model building and analysis to predict the likelihood of certain events. In the Netflix example, the historical associations between the shows viewers watch are used to improve the recommendation engine, helping it make better and better suggestions (a toy sketch of this idea follows below).

Prescriptive – optimize processes and systems based on the predictions. With better suggestions, more shows are being viewed, which means server usage is increasing. Based on the viewership predictions, Netflix needs to increase infrastructure (servers) and personnel to manage the added hardware.
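As a toy illustration of the recommendation idea (the show names and viewing histories are invented, and real recommendation engines are far more sophisticated), this Python sketch counts which shows are watched together and suggests the most co-viewed titles.

from collections import Counter
from itertools import combinations

histories = [["ShowA", "ShowB", "ShowC"], ["ShowA", "ShowC"], ["ShowB", "ShowC"]]
pairs = Counter()
for viewed in histories:
    for a, b in combinations(sorted(set(viewed)), 2):
        pairs[(a, b)] += 1                                   # co-viewing counts

def recommend(show):
    # rank the other shows by how often they were watched alongside `show`
    related = {other: n for (a, b), n in pairs.items() for other in (a, b)
               if show in (a, b) and other != show}
    return sorted(related, key=related.get, reverse=True)

print(recommend("ShowA"))   # shows most often watched alongside ShowA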

Here are several bizarre descriptive insights and the organizations that found them (Source: https://www.predictiveanalyticsworld.com/patimes/nine-bizarre-surprising-predictive-insights-data-science/8183/):

Walmart - Pop-Tarts before a hurricane. Pre-hurricane, Strawberry Pop-Tart sales increased about sevenfold.

Uber - Higher crime, more Uber rides. In San Francisco, the areas with the most prostitution, alcohol, theft, and burglary are most positively correlated with Uber trips.

A financial services startup company - Typing with proper capitalization indicates creditworthiness. Online loan applicants who complete the application form with the correct case are more dependable debtors. Those who complete the form with all lower-case letters are slightly less reliable payers; all capitals reveals even less reliability.


A human resources professional services firm (using employee data from Xerox and other companies) - Users of the Chrome and Firefox browsers make better employees. Among hourly employees engaged in front-line service and sales-based positions, those who use these two custom Web browsers perform better on employment assessment metrics and stay on longer.

Harvard University medical researchers - Men who skip breakfast get more coronary heart disease. American men 45 to 82 who skip breakfast showed a 27 percent higher risk of coronary heart disease over a 16-year period.

Shell - More engaged employees have fewer accidents. Among oil refinery workers, a one percentage-point increase in team employee engagement is associated with a 4 percent decrease in the number of safety incidents per employee.

Researchers at the University of Cambridge and Microsoft - Smart people like curly fries. Liking “Curly Fries” on Facebook is predictive of high intelligence.

University researchers - Female-named hurricanes are more deadly. Based on a study of the most damaging hurricanes in the United States during six recent decades, the ones with “relatively feminine” names killed an average of 42 people, almost three times the 15 killed by hurricanes with “relatively male” names.

Researchers examining Wikipedia behavior - Higher status, less polite. Editors on Wikipedia who exhibit politeness are more likely to be elected to “administrative” status that grants greater operational authority. However, once elected, an editor’s politeness decreases.

Let’s go through a hypothetical example using the Walmart finding above to show the difference between the various analytic processes. During the descriptive analytic phase, Walmart was able to verify the correlation between Pop-Tart sales and hurricanes using standard statistical methods (such as hypothesis tests and confidence intervals) and data mining (searching for patterns and trends).

During the diagnostic analytic phase, more complex techniques (drilling down into subsets of the data, finding further correlations within these subsets, etc.) are used to dive deeper into the data. Walmart dove deeper and gained a better understanding of the data. For example, though there was a seven-fold increase in sales, many stores ran out of Pop-Tarts, so Walmart determined the sales actually would have been higher had the inventory been available. There was also a correlation between the increase in sales and both the predicted strength of a hurricane and how close the hurricane was predicted to come to a store.

After thoroughly understanding the data and the correlations, Walmart was ready to move on to the predictive analysis phase. In this phase, statistical techniques such as regression analysis, decision trees, clustering (grouping data by similarities), and machine learning are used. Using these tools, a model (produced using a software application) would be developed to predict Pop-Tart sales in individual stores based on a predicted hurricane’s strength and path (from NOAA, the National Oceanic and Atmospheric Administration) and the store locations. In addition, it could incorporate the research about male/female hurricane names to further refine the store-by-store Pop-Tart predictions.
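A hedged sketch of what one piece of such a predictive model might look like, using invented numbers and the scikit-learn library (this is not Walmart’s actual model or data): a regression relating hurricane strength and distance-to-store to the jump in Pop-Tart sales.

from sklearn.linear_model import LinearRegression

# features: [storm category, distance from store in miles]; target: extra cases sold (all made up)
X = [[1, 120], [2, 80], [3, 60], [4, 30], [5, 10]]
y = [40, 90, 160, 300, 450]

model = LinearRegression().fit(X, y)
print(model.predict([[3, 25]]))   # forecast for a category-3 storm predicted to pass 25 miles away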


The prescriptive analysis phase would look at predictions from the model and determine how the company’s current processes would have to be changed to support the massive redistribution of product to the affected stores. For instance, would it be better to have a single, central (central to hurricane activity) warehouse with all the “comfort” items that people buy before hurricanes and ship them to the affected stores, or should excess product from nearby unaffected stores be reallocated to the predicted affected stores? Based on the choice, what would this require – new inventory space, trucks, employees (truck drivers, inventory attendants, purchasers)?

There are many data analytics software systems to help the data scientist. Here is a list of some:

Sisense
Looker
Zoho Reports
Domo
Qlik Sense
GoodData
Birst
IBM Analytics
IBM Cognos
IBM Watson
MATLAB
Google Analytics
Apache Hadoop
Apache Spark
Minitab
SAP Business Intelligence Platform
Stata
RapidMiner
Alteryx
Tableau

Many of the tools have data wrangling functions (like data anomaly detection), but their claim to fame is easy-to-use analytic tools. For instance, most provide the ability to filter, explore, and mine data by simply clicking and dragging – no programming required. Many support a “natural language dialogue” (users ask questions in their own words), mobile device interaction, and cloud-based systems. Other common functions include trend analysis, what-if analysis, and machine learning.

The one thing they all share is data visualization. This means presenting data in a non-numeric format, mostly charts and graphs. Visualization is used throughout the Data Wrangling and the Analytic phases to emphasize or even discover data relationships.

Visualization

When designing data visualizations, you want to make the graphic easy to understand, clear, and concise. Colors and fonts should be pleasing to the eye and support the message. For example, if you were showing profits and losses, you would want to choose black for the profits and red for the losses. Switching the colors would cause confusion and might even lead users of the graphic to draw the wrong conclusion. Similarly, choosing blue and purple would make the graphic harder to understand – which color is good, blue or purple?
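As a small illustration of that color convention, this Python/matplotlib sketch (with made-up profit numbers) draws profits in black and losses in red so the chart matches the reader’s expectations.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
profit = [12000, -3500, 8000, -1200]
colors = ["black" if p >= 0 else "red" for p in profit]   # black = profit, red = loss

plt.bar(months, profit, color=colors)
plt.axhline(0, linewidth=0.8)
plt.title("Monthly profit and loss")
plt.show()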

One of the classics of visualization is Charles Minard’s graphic about Napoleon’s campaign to conquer Russia in 1812 - 1813. (A larger and downloadable image is available at https://en.wikipedia.org/wiki/Charles_Joseph_Minard.) It was revolutionary for its time and hailed as the “best statistical graphic ever drawn”.


In two dimensions, the graphic represents six types of data: the number of Napoleon's troops (thickness of the yellow and black lines), distance, temperature, latitude and longitude, direction of travel (yellow towards Moscow, black away from), and the army’s location relative to specific dates.

However, is it a good chart? There certainly is a lot of information, but what data relationship(s) should we take away from the chart? The effect of temperature on the size of the army? Distance and army size? Latitude and size? Some might argue that a much simpler graphic like this:

would tell the story more effectively. (Graphic from http://excelcharts.com/minard-tufte-kosslyn-godin-napoleon/.)

A more modern graphic example is below:


(Graphic from https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples.)

This graphic brings out the size relationship between the various budget categories better than a list of budget dollar values. Notice that in each category the subcategory rectangles are all in the same color palette (i.e., all Income Security subcategories are shades of green, National Defense is all yellows, Veterans Benefits are oranges, etc.) and that the smaller the dollar value, the lighter the color shade.

Visualization is not limited to static charts and graphs. Look at the animation on this page (http://www.pewresearch.org/next-america/#Two-Dramas-in-Slow-Motion) which shows US population age distribution over 100 years. You would need a lot of charts to get the same message across as this one animation!

Most of the automated tools that we have mentioned generate visuals quickly through a data dashboard. A data dashboard is a software application that tracks, analyzes and displays Big Data in the form of tables, line charts, bar charts and gauges. A data dashboard provides a central location for organizations to see key data and relationships.

Below is an example of a dashboard (source https://datastudio.google.com/reporting/0B2-rNcnRS4x5UG50LTBMT0E4aXM/page/nQN).


Notice that there is a wide variety of data (revenue, product detail views, platform used to access the store) in various forms (line charts, tables, bar charts). Most dashboards are interactive and allow you to drill down into the data to get more details. In the above example, we clicked on the revenue line chart to look at the revenue for a particular date compared to the revenue one month earlier.

In addition, most dashboards are customizable. You can remove, change, and add data, change the format, colors, etc. In the example above, notice in the upper right you can specify the time frame of the data and in the line below you can select data by the type of device being used to access the store (mobile, PC, tablet), countries, and user types (new or returning user). You can go to https://datastudio.google.com/reporting/0B2-rNcnRS4x5UG50LTBMT0E4aXM/page/nQN and work with the dashboard in real time and make these types of changes.

There are many technologies used to develop these software tools. For example, many parts of Hadoop were created using the Java programming language. The Python language is frequently used for data wrangling, the R language is used for statistical programming and analysis, and JavaScript is used for data visualization. SAS provides analytics features and report-writing tools.