Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg


Transcript of Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Page 1: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

DATA SCIENCE
Colloquium (7)

MS(LIS) 2013-2015

Indian Statistical Institute
Documentation Research and Training Centre

Page 3: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

● Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products. It combines skill-sets ranging from computer science, to mathematics, to art. (www.kaggle.com)

Page 4: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

● Data science implies a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. (D.J. Patil)

● In simple words, data science is the process of extracting information and knowledge from huge volumes of data.

Page 6: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Evolution
• 1900 - Statistics
• 1960 - “Data Mining”
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??

Page 8: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

● Data is growing at a very high pace (exponentially).

● According to IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. About 75% of data is unstructured, coming from sources such as text, voice and video.

Page 9: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

● In 2012 it reached 2.8 zettabytes, and IDC forecasts that we will generate 40 zettabytes (ZB) by 2020, the equivalent of 5,200 GB of data for every man, woman and child on Earth.
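The per-person figure can be sanity-checked with simple arithmetic; the world-population value of roughly 7.7 billion is an assumption (the slide does not state the population IDC used):

```python
# Sanity-check IDC's projection: 40 ZB spread over Earth's population.
# Assumes a projected 2020 world population of ~7.7 billion (not on the slide).
ZB = 10**21          # bytes in a zettabyte (decimal definition)
GB = 10**9           # bytes in a gigabyte
total_bytes = 40 * ZB
population = 7.7 * 10**9

per_person_gb = total_bytes / population / GB
print(round(per_person_gb))  # 5195, i.e. roughly the 5,200 GB quoted
```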

● 90% of all the data in the world today has been created in the past few years.

Page 11: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

S.No.  Sub-Topic                                Speaker

1.     What is Data Science                     Sandip Das
2.     Data Scientist                           Anwesha Bhattacharya
3.     Applications of Data Science             Manasa Rath
4.     Workflow of Data Science                 Dibakar Sen
5.     Challenges in Workflow of Data Science   Jayanta Kr. Nayek
6.     Tools and Technology                     Tanmay & Manash
7.     Machine Learning in Data Science         Samhati Soor
8.     Conclusion                               Shiv Shakti Ghosh

Page 12: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
● http://bit.ly/1gyRYcM
● http://bit.ly/SdJ2OU
● http://bit.ly/RzrZ9k
● http://bit.ly/1pwlEY4
● http://bit.ly/1pwlUq6

Page 13: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

What is Data Science

Sandip Das

Page 14: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

DATA SCIENCE

DATA SCIENCE

Page 15: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data

What kind of data might you collect?

Page 16: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data
● How many lily pads?
● Measure the inches (size) of the lily pads
● How many small, medium or large lily pads?
● How many frogs?

Page 17: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

What is Data?
● It is something you want to know.
● A collection of facts.
● Facts and statistics collected together for reference or analysis.
● Data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.
● Data are undifferentiated observations of facts in terms of words, numbers, symbols, etc.

Page 18: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

What is Data?
Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Computer data may be processed by the computer's CPU and is stored in files and folders on the computer's hard disk.

Page 19: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Science
The systematic observation of natural events and conditions in order to discover facts about them and to formulate laws and principles based on these facts.

Science involves more than the gaining of knowledge. It is about gaining a deeper and often useful understanding of the world.

Page 20: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Science is the art of:
● Discovering what we don't know from data
● Obtaining predictive, actionable insight from data
● Creating data products that have business impact
● Building confidence in decisions that drive business value

Page 21: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Science
According to computer scientist Peter Naur:
“The science of dealing with data, once they have been established.”

Data science is the scientific study of the creation, validation and transformation of data to create meaning.

Data science is the study of the generalizable extraction of knowledge from data.

Page 22: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Multidisciplinary Approach

Page 23: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Domain Expertise

Domain expertise is proficiency, with special knowledge or skills, in a particular area or topic.

Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know.

Page 24: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Engineering
It is the data part of data science. It involves:
● Acquiring
● Ingesting
● Transforming
● Storing
● Retrieving data
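The steps above can be sketched as a toy pipeline; the CSV text, table name and column names below are invented for illustration (real pipelines pull from files, APIs, or streams):

```python
# A toy sketch of the data-engineering steps named above: acquire, ingest,
# transform, store, and retrieve. All data and names are made up.
import csv, io, sqlite3

raw = "name,visits\nalice,3\nbob,5\n"          # acquire: raw data arrives as text
rows = list(csv.DictReader(io.StringIO(raw)))  # ingest: parse into records
for r in rows:                                 # transform: fix the types
    r["visits"] = int(r["visits"])

db = sqlite3.connect(":memory:")               # store: load into a database
db.execute("CREATE TABLE usage (name TEXT, visits INTEGER)")
db.executemany("INSERT INTO usage VALUES (?, ?)",
               [(r["name"], r["visits"]) for r in rows])

total = db.execute("SELECT SUM(visits) FROM usage").fetchone()[0]  # retrieve
print(total)  # 8
```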

Page 25: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Scientific Method

It is the process for acquiring new knowledge by applying the principles of reasoning on empirical evidence derived from testing hypotheses through repeatable experiments.

Page 26: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Statistics & Mathematics

Statistics (along with mathematics) is the cerebral part of data science: collecting, organizing, analysing and interpreting data.

Page 27: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Advanced Computing

Advanced computing is the heavy lifting of data science. It consists of software design and programming.

Page 28: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Visualization
It is the pretty face of data science.

A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.

Page 29: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Hacker mindset

Hacking is modifying one's own computer system, including building, rebuilding, modifying and creating software, electronic hardware or peripherals, in order to make it better, make it faster, or give it added features.

Data science hacking involves inventing new models and exploring data.

Page 30: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
● http://bit.ly/1jZR0WA
● http://bit.ly/1pwmV1m
● http://bit.ly/1tkKyKG
● http://bit.ly/1ntd13L
● http://bit.ly/1wi9t5Z

Page 31: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Scientist

Anwesha Bhattacharya

(& I am not a data scientist)

Page 32: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Who is a data scientist?
● A practitioner of data science is called a data scientist. (~Wikipedia)
● Data scientists use technology and skills to increase awareness, clarity and direction for those working with data. (http://www.datascientists.net)

Page 34: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Why do we need data scientists?
● Firstly, there is more data than we can consume. We require a data scientist who can look at the data and say, “This is important. Check out this one.”
● They are the people who can understand and provide meaning to the piles and piles of data that are collected. “Big data” is the buzzword that represents those piles.
● They minimise the disruptions encountered while dealing with data.
● They present data with an awareness of the consequences of presenting that data.

Page 35: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Scientist aims

Page 36: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Types of Data Scientists
Data scientists can be broadly classified into two categories:
● Product-focused data scientists
● Business Intelligence style data scientists
There are roughly 4 to 5 groups in each category.

Page 37: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Product-focused Data Scientists
● Data Researcher: The professionals in this category come from the academic world and have in-depth backgrounds in statistics or the physical or social sciences. This type of data scientist often holds a PhD but is weakly skilled in machine learning, programming or business.
● Data Developer: These professionals tend to concentrate on the technical issues that come with handling data. They are strong in programming and machine learning but weak in business and statistics skills.
● Data Creative: These are the people who make something innovative out of mountains of data. They are strongly skilled in machine learning, Big Data, programming and the other skills needed to handle massive data.
● Data Businessperson: They represent the business side and are responsible for making vital business decisions through data analytics techniques. They are a blend of business and technical proficiency.

Page 38: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Business Intelligence based Data Scientists
● Quantitative, exploratory data scientists are inclined to have PhDs and use theory to comprehend behaviour. By combining theory and exploratory research, these data scientists improve products.
● Operational data scientists frequently work in finance, sales or operations teams in an organization. Their role is to analyse the performance, responses and behaviour of a process, to improve the organization's strategy and efficiency.
● Product data scientists fit into product management or engineering. Their job is to understand the way users make use of a product and to use that knowledge to fine-tune the product.
● Marketing data scientists focus on the user base, evaluate performance and work on improving efficiency, much like a standard marketing professional.
● Research data scientists create insights from a data set.

Page 39: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Profile of a Data Scientist
● They love data
● Have an investigative mindset
● Goal of work: finding patterns in data and data-driven products
● Are practitioners, not theorists
● Have “hands-on” skills
● Have domain expertise
● Team players
● Technically focused
● Versatile communication and collaboration skills
● Curiosity for exploring and experimenting with data
● Sceptical people, likely to ask a lot of questions around the viability of a given solution and whether it will really work

Page 40: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Required skills
● Data mining - The computational process of discovering patterns in large data sets; the analysis step of “Knowledge Discovery in Databases”.
● Programming - The act of instructing computers to perform tasks.
● Algorithms - Step-by-step procedures for the calculations used in the analysis of data.
● Statistics - The collection, organization, analysis, interpretation and presentation of data.
● NLP - Interactions between computers and human languages.
● Machine learning - The science of getting computers to act without being explicitly programmed.
● Distributed systems - Components located on networked computers that communicate and coordinate their actions by passing messages.
● Visualization - The creation and study of visual representations of data, communicating both abstract and concrete ideas.
● ...
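Two of the analytical functions that recur in this skill set, median and rank, can be sketched with Python's standard library; the sample scores are made up:

```python
# Median and a simple descending rank over a made-up list of scores.
from statistics import median

scores = [72, 91, 64, 88, 91, 55]

print(median(scores))  # 80.0: average of the two middle values, 72 and 88

# Descending rank; ties take the rank of their first occurrence.
ranked = sorted(scores, reverse=True)
rank = {v: ranked.index(v) + 1 for v in scores}
print(rank[88])  # 3: only the two 91s score higher
```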

Page 41: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

What Does a Data Scientist Do?
10 Things [most] Data Scientists Do:
1) Ask good questions. What is what? We don't know! We'd like to know.
2) Explore data and generate hypotheses. Run experiments.
3) Scoop, scrape and sample data.
4) Tame data.
5) Discover the unknowns.
6) Model data. Model algorithms.
7) Understand data relationships.
8) Tell the machine how to learn from data.
9) Create data products that deliver actionable insight.
10) Communicate the results using visualization and presentations.

Page 42: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

DIKUW

Data     Information    Knowledge    Understanding    Wisdom
Raw      What           How to       Why              When
Numbers  Description    Extract      Cause & Effect   Prediction
Letters  Context        Test         Proved           What's best
Symbols  Relationships  Instruction  Known Unknowns   Unknown Unknowns

Data Engineer -> Data Analyst -> Data Miner -> Data Scientist
PAST -> FUTURE

Page 43: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Scientist vs Data Analyst

Data Scientist:
● Familiarity with database systems, e.g. MySQL
● Better to be familiar with Java, Python
● Clear understanding of various analytical functions (median, rank, etc.) and how to use them on data sets
● Perfection in mathematics, statistics, correlation, data mining, etc.
● Deep statistical insights and machine learning

Data Analyst:
● Familiarity with data warehousing and business intelligence concepts
● In-depth exposure to SQL and analytics
● Strong understanding of Hadoop-based analytics
● Perfection regarding the tools and components of data architecture
● Proficiency in decision making

● Data analysis has generally been used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries.
● Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what's going on.

Page 44: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Challenges of a data scientist
● Red tape - “No access allowed”
● Unknown need - “What's the organization's goal?”
● Terminology - “What's a wonkulator?”
● Real-world data - messy, noisy, missing
● Analysis distrust - “...but I don't like that result”

Page 45: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
● Zhukov, Leonid. Data Scientists. Higher School of Economics, National Research University.
● http://bit.ly/1kduMvA
● http://bit.ly/1orF9DL
● http://bit.ly/1tMBBvQ
● http://bit.ly/1kJ9gU8
● http://bit.ly/TS9H5e
● http://bit.ly/1jZR0WA

Page 46: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

APPLICATIONS of DATA SCIENCE

by

Manasa Rath

Page 47: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Reaching to Data Science

Page 48: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

APPLICATIONS

Agriculture, pharmacy, energy, retail, tourism, real estate, import-export, finance, business, services

Page 49: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Applications in the Education sector
- A survey done by the Pearson group to improve learning software and course materials for better quality and efficacy in learning
- Tools used: Python, R, Google BigQuery

Page 50: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Science in the Healthcare Industry
- A group has been diagnosed with Type 2 diabetes, and some subset of this group has developed complications.
- We would like to know whether there is any pattern to the complications, and whether the probability of a complication can be predicted and therefore acted upon.

(Figure: healthcare use database snippet)

Page 51: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Extracting Interesting Patterns of Health Outcomes from Healthcare System Data

OBSERVATIONS: Is the pattern robust and predictive?

What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
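The question on this slide can be phrased as a small filter over patient records; the record structure and values below are invented for illustration, not the slide's actual database snippet:

```python
# Hypothetical patient records; the field names are assumptions.
patients = [
    {"age": 45, "medications": 7, "t2_diabetes": True, "complication": True},
    {"age": 52, "medications": 9, "t2_diabetes": True, "complication": False},
    {"age": 33, "medications": 4, "t2_diabetes": True, "complication": True},
    {"age": 61, "medications": 8, "t2_diabetes": True, "complication": True},
]

# "Incidence of complications of Type 2 diabetes for people over 37
#  who are on more than six medications"
cohort = [p for p in patients
          if p["t2_diabetes"] and p["age"] > 37 and p["medications"] > 6]
incidence = sum(p["complication"] for p in cohort) / len(cohort)
print(incidence)  # 2 of the 3 matching patients -> 0.666...
```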

Page 52: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Remarks
- When predictive accuracy becomes a primary objective, the computer tends to play a significant role in model building and decision making.
- This shows an integrated skill set spanning mathematics, statistics, AI, databases and optimization, along with a deep understanding of the craft of problem formulation needed to engineer effective solutions.

Page 53: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Applications in Social Networking sites

Page 56: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Key Points
- The ability to interpret unstructured data and integrate it with numbers further increases our ability to extract useful knowledge in real time and act on it.

Page 57: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
1. Dhar, Vasant. Data Science and Prediction. http://bit.ly/1tiRvMr

Page 58: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Workflow of Data Science

Dibakar Sen

Page 59: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Work flow of Data Science
● The workflow process consists of three major activities:
- Organising
- Packaging
- Delivering

Page 60: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Work flow Phases

(Diagram: the workflow cycle - Understanding of data, Preparation, Analysis, Reflection/Evaluation, Dissemination)

Page 61: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Understanding of Data
- Set objectives or goals
- Set data fields
- Define the data collection procedure

Page 62: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase

Page 63: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase
● Acquire data
The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, e.g.:
- Existing data sets can be used (e.g., U.S. Census data sets).
- Data can be automatically generated by computer software.
- Data can be manually entered into a spreadsheet or text file by a human, for example through a survey.

Page 64: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase
● Reform and clean data
- Before analysis begins, we need to verify that the data are accurate and that the variables are well named and properly labeled.
- We have to store the data in the desired format.
- Verify the sample and variables: Do the variables have the correct values? Are missing data coded appropriately? Are the data internally consistent? Is the sample size correct? etc.
- Programmers reformat and clean data either by writing scripts or by manually editing the data in, say, a spreadsheet.
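The verification steps above can be sketched as a small cleaning script; the column names, CSV content and the -999 missing-value code are invented for the example:

```python
# A minimal data-cleaning sketch covering the checks listed above:
# sample size, missing-data coding, and internal consistency.
import csv, io

raw = "age,height_cm\n34,170\n-999,165\n29,-999\n41,181\n"
rows = list(csv.DictReader(io.StringIO(raw)))

EXPECTED_N = 4
assert len(rows) == EXPECTED_N, "sample size is wrong"

cleaned = []
for r in rows:
    age = int(r["age"])
    height = int(r["height_cm"])
    # Recode the sentinel -999 as None so it is unambiguously "missing".
    cleaned.append({
        "age": None if age == -999 else age,
        "height_cm": None if height == -999 else height,
    })

# Internal consistency: any non-missing age should be plausible.
assert all(r["age"] is None or 0 < r["age"] < 120 for r in cleaned)
print(sum(r["age"] is None for r in cleaned))  # 1 record has a missing age
```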

Page 65: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Analysis Phase

Page 66: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Analysis Phase
● Data Analysis
- The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data.
- Different “scripting” languages such as Python, Perl, R, and MATLAB are used to analyse the data. However, data scientists also use compiled languages such as C, C++, and Fortran when appropriate.

Page 67: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

● In the analysis phase, the programmer engages in a repeated iteration cycle of editing scripts, executing to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.

Page 68: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Reflection/Evaluation Phase

Page 69: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Reflection / Evaluation Phase

While the analysis phase involves programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist might perform the following types of reflection:
- Take notes
- Hold meetings
- Make comparisons and explore alternatives

Page 70: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Dissemination Phase

Page 71: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Dissemination Phase

The final phase of data science is disseminating results: preparing reports in order to communicate findings to the appropriate audience. Results most commonly take the form of written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications.

● Beyond presenting results in written form, some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems.

Page 72: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
● http://bit.ly/1jZcx2I
● http://bit.ly/1jZeTyN
● http://bit.ly/1hbQuWx

Page 73: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Challenges in Workflow of Data Science

Jayanta Kr. Nayek

Page 74: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase: Acquire data
- Keeping track of provenance: where each piece of data comes from and whether it is still up to date.
- Data management: Programmers must assign names to the data files they create or download and then organize those files into directories. When they create or download new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences.
- Storage: Sometimes there is so much data that it cannot fit on a single hard drive, so it must be stored on remote servers.

Page 75: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase: Reformat and clean data
- A related problem is that raw data often contain semantic errors (errors in logic or arithmetic that must be detected at run time), missing entries, or inconsistent formatting, so they need to be “cleaned” prior to analysis.
- Data integration: Data integration involves combining data residing in different sources and providing users with a unified view of these data.
- Heterogeneous data: Data integration involves synchronizing huge quantities of variable, heterogeneous data resulting from internal legacy systems (old methods, technologies, computer systems, or application programs relating to a previous or outdated computer system) that vary in data format. Legacy systems may have been created around flat-file, network, or hierarchical databases.

Page 76: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase
● Data integration problems:
- Unanticipated costs: labor costs for initial planning, evaluation, programming and additional data acquisition; software and hardware purchases; unanticipated technology changes/advances; both the labor and the direct costs of data storage and maintenance.
- Lack of data management expertise: the support required to engage and convey to everyone in the agency the need for and benefits of data integration is unlikely to flow from leaders who lack awareness of or commitment to those benefits.

Page 77: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Preparation Phase: Data transmission
- It is the physical transfer of data over a point-to-point or point-to-multipoint communication channel.
- Cloud data storage has become popular with the development of cloud technologies.
- Network bandwidth capacity is the bottleneck in cloud and distributed systems, especially when the volume of communication is large.
- On the other side, cloud storage also leads to data security problems, such as the requirement for data integrity checking.

Page 78: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Analysis Phase
- Data inconsistency and incompleteness: A number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies.
- Scalability: The biggest and most important challenge is scalability when we deal with Big Data analysis. In the last few decades, researchers paid increasing attention to accelerating analysis algorithms to cope with growing volumes of data, as processors sped up following Moore's Law.
- Data curation: Data curation is aimed at data discovery and retrieval, data quality assurance, value addition, reuse and preservation over time. Existing database management tools are unable to process Big Data that grow so large and complex.

Page 79: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Analysis Phase
- Timeliness: In real-time Big Data applications, like navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and the Internet of Things, timeliness is the top priority. How can we guarantee the timeliness of a response when the volume of data to be processed is very large?
- File and metadata management: Repeatedly editing and executing scripts while iterating on experiments produces numerous output files, such as intermediate data, textual reports, tables, and graphical visualizations. This leads to data management problems due to the abundance of files and the fact that programmers often later forget their own ad-hoc naming conventions.
- Data security: Firstly, the size of Big Data is extremely large, challenging the protection approaches. Secondly, it also leads to a much heavier security workload.

Page 80: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Analysis Phase
- Absolute running times: Scripts might take a long time to terminate, either due to large amounts of data being processed or the algorithms being slow.
- Incremental running times: Scripts might take a long time to terminate after minor incremental code edits made while iterating on analyses, wasting time re-computing almost the same results as previous runs.
- Crashes from errors: Scripts might crash prematurely due to errors in either the code or inconsistencies in the data sets. Programmers often need to endure several rounds of debugging before their scripts can terminate with useful results.

Page 81: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Reflection Phase
● Take notes: Since notes are a form of data, the usual data management problems arise in note-taking, most notably how to organize notes and link them with the context in which they were originally written.
● Make comparisons and explore alternatives: Data scientists must organize, manage, and compare their output graphs to gain insights and ideas for what alternative hypotheses to explore.

Page 82: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Dissemination Phase
- Functionality: To convey information easily by exposing the knowledge hidden in complex and large-scale data sets, both aesthetic form and functionality are necessary. Current tools mostly perform poorly in functionality and response time.
- Scalability: It is particularly difficult to conduct data visualization (whose main objective is to represent knowledge more intuitively and effectively by using different graphs) because of the large size and high dimensionality of Big Data.

Page 83: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Dissemination Phase
● Difficulty distributing research code: Some data scientists want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems, but it is difficult to distribute research code in a form that other people can easily execute on their own computers.
● Difficulty reproducing results: It is difficult even to reproduce the results of one's own experiments a few months or years in the future, since one's own operating system and software inevitably get upgraded in some incompatible manner such that the original code no longer runs.

Page 84: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

References
● Chen, Philip C.L. and Zhang, Chun-Yang (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. Elsevier. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China.
● http://bit.ly/1jZcx2I
● http://1.usa.gov/SNspKm

Page 85: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

TECHNOLOGY and Tools for DATA SCIENCE

TANMAY MONDAL & MANASH KUMAR

Page 86: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

We need to:

● Organise data

● Analyse data

● Package and deliver data

Page 87: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Science Tools

Languages: Java, R, Python, ...
Databases/Data Warehouses: Apache Cassandra, Apache HBase, MongoDB, ...
Data Mining: RapidMiner/RapidAnalytics, Orange, Weka, ...
File Systems: Gluster, Hadoop Distributed File System, ...

Page 88: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Data Science Tools

Big Data Search: Lucene, Solr, ...
Data Aggregation and Transfer: Sqoop, Flume, ...
Miscellaneous Big Data Tools: Hadoop, Avro, ZooKeeper, ...

Page 90: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

What is Hadoop?
● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

(Diagram: nodes forming a Hadoop cluster)

Page 91: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Why Hadoop?

• Handles enormous data volumes.

• Cost-effective.

• Scalable.

• Fault tolerant.

Page 92: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Origin of Hadoop

• Google introduced two key technologies for handling Big Data: the Google File System (a distributed file system technology) in 2003 and MapReduce (a framework for a distributed compute model) in 2004.

• Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

• In February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop.

• First release of Apache Hadoop in September 2007.

Page 93: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

When should we go for Hadoop?

● Data is too huge
● Unstructured data
● Parallelism
● Processes are independent
● Need better scalability

Page 94: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

The Hadoop Ecosystem
● HDFS - Hadoop Distributed File System.
● MapReduce - A distributed framework for executing work in parallel.
● Hive - A data warehouse infrastructure built on top of Hadoop, providing data summarization, query, and analysis.
● Pig - A high-level platform for creating MapReduce programs used with Hadoop.
● HBase - A non-relational, distributed database system.
● ...

Page 95: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

The Major Components of Hadoop
● Hadoop uses its own distributed file system, HDFS, which makes data available to multiple computing nodes.
● Hadoop uses MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

Page 96: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

HDFS
● A hierarchical, UNIX-like file system for data storage
● Splits large files into blocks
● Stores file blocks across many nodes in a cluster
● Distributes and replicates blocks to different nodes
● Has a master-slave architecture
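The core idea of splitting a file into fixed-size blocks and replicating each block onto several nodes can be illustrated in ordinary Python; the tiny block size and node names are toy values, not HDFS defaults (real HDFS uses large blocks, e.g. 64 or 128 MB, and places replicas dynamically):

```python
# Toy illustration of HDFS-style block splitting and replication.
BLOCK_SIZE = 8                       # toy value; HDFS blocks are megabytes
NODES = ["node1", "node2", "node3"]  # invented node names
REPLICATION = 2

data = b"hello distributed file systems"  # 30 bytes

# Split the "file" into fixed-size blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Assign each block to REPLICATION distinct nodes, round-robin.
placement = {
    idx: [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
    for idx in range(len(blocks))
}

print(len(blocks))   # 4 blocks for a 30-byte file
print(placement[0])  # ['node1', 'node2']
```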

Page 97: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

HDFS Architecture

Page 98: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

HDFS ...

NameNode
● Runs on a single node as a master process
● Holds file metadata (which blocks are where)
● Directs client access to files in HDFS

SecondaryNameNode
● Maintains a copy of the NameNode metadata

DataNode
● Stores data in the local file system
● Periodically sends a report of all existing blocks to the NameNode

Page 99

WHAT IS MAP REDUCE?

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

Page 100
Page 101

Map Reduce Paradigm

A data-processing model with two key phases:

Map

Perform a map function on input key/value pairs to generate intermediate key/value pairs

Reduce

Perform a reduce function on intermediate key/value groups to generate output key/value pairs

Page 102

MapReduce Daemons

• JobTracker (Master)

- Manages MapReduce jobs

- Monitors job and task progress

- Assigns tasks to different nodes

• TaskTracker (Slave)

- Creates individual map and reduce tasks

- Reports task status to the JobTracker

- Runs on the same node as the DataNode service

Page 103
Page 104

Hadoop MapReduce Components

Map Phase

Input Format

Record Reader

Mapper

Combiner

Reduce Phase

Shuffle

Sort

Reducer

Output Format

Page 105


How does MapReduce work?

➢ The runtime partitions the input and provides it to different Map instances.

➢ Map: (key, value) → (key’, value’)

➢ The runtime collects the (key’, value’) pairs and distributes them to several Reduce functions, so that each Reduce function gets the pairs with the same key’.

➢ Map and Reduce are user-written functions in Java.

Page 106
Page 107

WORD COUNT IN MAP REDUCE
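The word-count diagram on this slide does not survive in the transcript, so here is a minimal, self-contained sketch of the same flow (map, shuffle/sort, reduce), written in Python for brevity rather than Hadoop's actual Java API; the input lines are made-up illustration text.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs,
# a shuffle/sort groups pairs by key, and reduce sums each group.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce: sum the counts for one word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort: collect and sort intermediate pairs so that
    # equal keys are adjacent, then reduce each group.
    intermediate = sorted(pair for line in lines for pair in mapper(line))
    return [reducer(key, (c for _, c in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

result = dict(map_reduce(["the quick brown fox", "the lazy dog the end"]))
# result["the"] -> 3, result["fox"] -> 1
```

In real Hadoop the shuffle/sort step is performed by the framework between the Map and Reduce phases; here it is simulated with `sorted` and `groupby`.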

Page 108

Validating the data extracted and loaded into the EDW (Enterprise Data Warehouse)

Once the MapReduce process completes and the output files are generated, the data is moved to an enterprise data warehouse or another transactional system, depending on the requirement.

Page 109

USERS OF HADOOP

Yahoo! - More than 100,000 CPUs in 40,000 computers running Hadoop. Produces data that was used in every Yahoo! Web search query.

Facebook - In 2010 Facebook claimed that they had the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012 they announced the data had grown to 100 PB. Each (commodity) node has 8 cores and 12 TB of storage.

Page 110

USERS OF HADOOP

Adobe - Uses Apache Hadoop and Apache HBase in several areas, from social services to structured data storage and processing for internal use. Currently has about 30 nodes running HDFS.

Ebay - A 532-node cluster (8 * 532 cores, 5.3 PB). Heavy usage of Java MapReduce, Apache Pig, Apache Hive and Apache HBase; used for search optimization and research.

Page 111

Twitter - Uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter.

GBIF (Global Biodiversity Information Facility) - A nonprofit organization that focuses on making scientific data on biodiversity available via the Internet. Runs 18 nodes with a mix of Apache Hadoop and Apache HBase.

Page 112

University of Glasgow

A 30-node cluster (Xeon Quad Core 2.4 GHz, 4 GB RAM, 1 TB storage per node), used to facilitate information retrieval research and experimentation, particularly for TREC.

Greece.com

Uses Apache Hadoop for analyzing data for millions of images, log analysis, and data mining.

Page 113

References

http://bit.ly/1km1e46

http://bit.ly/Rzuzfz

http://yhoo.it/1pheFVK

Big Data: Testing Approach to Overcome Quality Challenges, by Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja.

Page 114

Machine Learning

Samhati Soor

Page 115

What is it?

Learning is a process of knowledge acquisition with a specific purpose.

Machine learning is the study of how to use computers to simulate human learning activities.

[Diagram: Training Set → Learning Algorithm → hypothesis; Input → hypothesis → Predicted Output, with Feedback]

Page 116

Why Machine Learning is Possible?

Mass Storage - More data available.

Higher Performance of Computers - Larger memory for handling the data; greater computational power for calculation, and even online learning.

Machine Learning Basics: 1. General Introduction

Page 117

Basic Structure of the Machine Learning System

[Diagram: External environment → Corpus study → Knowledge Representation → Execution (the Machine Learning Model)]

Page 118

The Goal of Machine Learning is... to create a predictive model that is indistinguishable from a correct model.

[Figures: Without Logic / With Logic]

Page 119

Two Phases

Machine learning methods are broken into two phases:

1. Training
2. Application
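The training/application split described above can be illustrated with a deliberately tiny supervised learner. This 1-nearest-neighbour sketch and its toy dataset are illustrative inventions, not from the slides:

```python
# Minimal supervised learning in two phases:
# training = memorising labelled examples,
# application = predicting the label of a new input.
def predict(train, x):
    # Application phase: label x with the label of its nearest training point.
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Training phase: for 1-nearest-neighbour this is just storing
# the labelled (feature, label) examples.
train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

label = predict(train, 1.5)   # -> "small"
```

Most learning methods do more work in the training phase (fitting parameters), but the two-phase shape is the same.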

Page 120

Types of Machine Learning

Main types:

1. Supervised Learning

2. Unsupervised learning

3. Reinforcement learning

Other types:

1. Semi-supervised learning

2. Time-series forecasting

3. Anomaly detection

4. Active learning

Page 121

The Main Research Work on Machine Learning Field

Task-oriented research

Cognitive simulation

Theoretical analysis

Page 122

Data Science and Machine Learning

If we give the computer rules and/or algorithms to automatically search through our data to “learn” how to recognize patterns and make complex decisions (such as identifying spam emails), we are implementing machine learning.

In data science, data scientists use both statistical techniques and machine learning algorithms to identify patterns and structure in data.

Page 123


Role of Machine Learning in Data Science

https://doubleclix.wordpress.com/category/data-science/

Page 124

A Simple Implementation

Suppose our model consists of the probability of the coin landing heads (with a prior over θ), while the data consist of the results of N coin flips.

We observe some data.

Our goal is to determine the model from the data, i.e. to find the probability of the model given the data, p(model|data).

Page 125

Using conditional probability,
p(data|model) = p(data and model) / p(model) --(1)
p(model|data) = p(data and model) / p(data) --(2)

From (1) and (2) we get,
p(data|model) * p(model) = p(model|data) * p(data)

That implies: p(model|data) = (p(data|model) * p(model)) / p(data)

i.e. posterior = (likelihood * prior) / evidence

Page 126

The likelihood distribution describes the likelihood of the data given the model; it reflects our assumptions about how the data was generated.

The prior distribution describes our assumptions about model before observing the data.

The posterior distribution describes our knowledge of model, incorporating both the data and the prior.

The evidence is useful in model selection.
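The coin-flip example above can be computed numerically. The following is a minimal, self-contained Python sketch (not from the slides) using a grid approximation of p(model|data) = p(data|model) * p(model) / p(data); the counts of 7 heads and 3 tails and the uniform prior are made-up illustration choices.

```python
# Bayesian coin flip: posterior over theta (probability of heads)
# on a grid of candidate values, with a uniform prior.
def posterior_grid(heads, tails, grid_size=101):
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size             # uniform prior over theta
    likelihood = [t**heads * (1 - t)**tails for t in thetas]
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnorm)                            # p(data), normalising constant
    posterior = [u / evidence for u in unnorm]
    return thetas, posterior

thetas, post = posterior_grid(heads=7, tails=3)
mode = thetas[post.index(max(post))]
# with a uniform prior the posterior mode sits at the observed frequency, 0.7
```

A larger grid or a conjugate Beta prior would give the same qualitative picture; the point is that posterior = likelihood * prior / evidence is directly computable.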

Page 127

Working Method of a Predictive Modeler and a Data Scientist

A predictive modeler may use a machine learning approach to predict a value or the likelihood of an outcome, given a number of input variables.

A data scientist applies these same approaches on large data sets, writing code and using software adapted to work on big data.

Page 128

The available library of statistical and machine learning algorithms for evaluating and learning from big data is growing, but is not yet as comprehensive as the algorithms available for the non-distributed world.

The algorithms vary by product, so it is important to understand what is and is not available.

Moreover, not all algorithms familiar to the statistician and data miner are easily converted to the distributed computing environment.

The bottom line is that, while fitting models on big data has the potential benefit of greater predictive power, some of the costs are loss of flexibility in algorithm choices and/or extensive programming time.

Prospective

Page 129

References

Machine Learning and Data Mining, Lecture Notes, CSC 411/D11, Computer Science Department, University of Toronto, version: February 6, 2012.

The Discipline of Machine Learning, Tom M. Mitchell, July 2006, CMU-ML-06-108, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.

Statistical Machine Learning - Nic Schraudolph.

http://bit.ly/1oFt1ws

http://bit.ly/1oFtNty

Page 130

Conclusion

Shiv Shakti Ghosh

Page 131

Research Areas

Cloud computing

Databases and Database Management Systems

Natural language processing

Signal Processing

Computer vision

Page 132

Cloud computing

Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time.

It specifically refers to a server connected through a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN).

Page 133

Issues

Privacy - The increased use of cloud computing services such as Gmail and Google Docs has pressed the issue of privacy concerns. Greater use of cloud computing services gives providers access to a plethora of data, which carries the immense risk of data being disclosed either accidentally or deliberately.

Page 134

Contd..

Legal - Certain legal issues arise with cloud computing, including trademark infringement, security concerns and the sharing of proprietary data resources.

Vendor lock-in - Because cloud computing is still relatively new, standards are still being developed. Many cloud platforms and services are built on the specific standards, tools and protocols developed by a particular vendor for its particular cloud offering. This is a major challenge for interoperability.

Page 135

Research areas

● open interoperation across cloud solutions at the IaaS, PaaS and SaaS levels

● managing multi-tenancy at large scale and in heterogeneous environments

● dynamic and seamless elasticity from private clouds to public clouds for unusual and/or infrequent requirements

● data management in a cloud environment, taking the technical and legal constraints into consideration

Page 136

Databases & DBMS

A database is an organized collection of data. The data are typically organized in a way that supports processes requiring this information.

Database management systems (DBMSs) are specially designed software applications that interact with the user, other applications, and the database itself to capture and analyze data.

Page 137

Issues

Data definition – Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data.

Update – Inserting, modifying, and deleting data.

Retrieval – Obtaining information either for end-user queries and reports or for processing by applications.

Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.
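The first three DBMS concerns above (definition, update, retrieval) can be seen in a few lines using Python's built-in sqlite3 module, an embedded relational DBMS; the table and column names here are made-up illustration values.

```python
# Data definition, update, and retrieval against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition: create a new data structure.
cur.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")

# Update: insert and then modify data.
cur.execute("INSERT INTO books (title) VALUES (?)", ("Data Science",))
cur.execute("UPDATE books SET title = ? WHERE id = 1", ("Data Science 101",))

# Retrieval: obtain information for a query.
title = cur.execute("SELECT title FROM books WHERE id = 1").fetchone()[0]
conn.close()
```

Administration (users, security, concurrency, recovery) is where full client–server DBMSs go well beyond an embedded engine like SQLite.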

Page 138

Research areas

Research activity includes theory and the development of prototypes and models. Notable research topics include the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more.

Page 139

NLP

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.

Page 140

Human-level natural language processing is an AI problem that is equivalent to making computers as intelligent as people. NLP's future is therefore tied closely to the development of AI in general.

As natural language understanding improves, computers will be able to learn from the information online and apply what they learned in the real world.

In the future, humans may not need to code programs, but will dictate to a computer in a human natural language, and the computer will understand and act upon the instructions.

Page 141

Signal Processing

Signal processing is an area of systems engineering, electrical engineering and applied mathematics that deals with operations on, or analysis of, analog as well as digitized signals representing time-varying or spatially varying physical quantities. Signals of interest include sound, electromagnetic radiation, images, and sensor readings, for example biological measurements such as electrocardiograms, control system signals, telecommunication transmission signals, and many others.

Page 142

Computer vision

Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image.

Page 143

Data Science Higher Education Programmes in 2014

Institute / Organization | Course
Indiana University, Indiana, US * | Online Certificate in Data Science (January 2014)
University of California, Berkeley | Master of Information and Data Science program
Saint Peters University, US ** | Master of Science in Data Science program
Worcester Polytechnic Institute, Worcester, Massachusetts, US | Master of Science in Data Science program
University of Virginia, US *** | Master of Science in Data Science

* The program consists of 12 credits, including cloud computing, data management and data analysis.
** The program’s curriculum will include topics such as decision analysis and optimization, predictive modeling, data mining and visualization.
*** A professional program to prepare students for the use of data analysis in major industries such as health care, business, and science.

Page 144

Conferences on Data Science, 2014

International Conference on Data Science and Engineering (26-28 August 2014). Hosted by the School of Computer Science Studies, Cochin University of Science & Technology; co-sponsored by IEEE Kerala.

DataEDGE Conference: A new vision for data science (May 8-9, 2014, Berkeley, CA). Discussions will be on the way organizations are using data to address business and social issues, the challenges of working with data at scale, and the most pressing questions and debates facing data scientists today.

Page 145

O’REILLY Strata is organising three conferences:

New York (October 15-17, 2014). Discussions will be on complex issues and opportunities brought to business by big data, data science, and pervasive computing.

Barcelona, Spain (November 19-21, 2014). Discussions will be on big data analytics.

San Jose, CA (February 18-20, 2015).

Page 146

ASE(Academy of Science and Engineering) is organising three conferences:

Stanford University, CA, USA, (May 27 - May 31, 2014)

Tsinghua University, Beijing, China, (August 4-7, 2014)

Harvard University, Cambridge, MA, US (December 15-19, 2014).

IEEE International Conference on Big Data Science and Engineering (Tsinghua University, Beijing, China, 24-26 Sept. 2014).

The 2014 International Conference on Data Science and Advanced Analytics(October 30 - November 1, 2014, Shanghai, China).

Page 147

Journals of Data Science

Journal of Data Science - an international journal devoted to applications of statistical methods at large. The online version is free; the hard-copy version is 300 USD/year.

CODATA Data Science Journal - published by CODATA.

EPJ Data Science - a SpringerOpen journal.

International Journal of Data Science - Inderscience Publishers.

Page 148

References

http://bit.ly/1omFc3B
http://bit.ly/1jZbP5F
http://bit.ly/1mCBzqv
http://oreil.ly/1jZc4O0
http://bit.ly/1mnyJRe
http://bit.ly/1tMzzvx
http://bit.ly/1pwnZlN
http://bit.ly/1iq0y9a

Page 149
Page 150