Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg
![Page 1: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/1.jpg)
DATA SCIENCE
Colloquium (7)
MS(LIS) 2013-2015
Indian Statistical Institute
Documentation Research and Training Centre
![Page 2: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/2.jpg)
![Page 3: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/3.jpg)
● Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products. It combines skill-sets ranging from computer science, to mathematics, to art. (www.kaggle.com)
![Page 4: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/4.jpg)
● Data science implies a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. (D.J. Patil)
● In simple words, it is a process that extracts information and knowledge from large volumes of data.
![Page 5: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/5.jpg)
![Page 6: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/6.jpg)
Evolution
• 1900 - Statistics
• 1960 - “Data Mining”
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??
![Page 7: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/7.jpg)
![Page 8: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/8.jpg)
● Data is growing at a very high pace (exponentially).
● According to IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. About 75% of data is unstructured, coming from sources such as text, voice and video.
![Page 9: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/9.jpg)
● In 2012 the volume of data reached 2.8 zettabytes (ZB), and IDC forecasts that we will generate 40 ZB by 2020, which is the equivalent of 5,200 GB of data for every man, woman and child on Earth.
● 90% of all the data in the world today has been created in the past few years.
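The unit conversions behind these figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where 1 EB is 10^9 GB and 1 ZB is 10^12 GB; the function names are illustrative only, and the population figure is simply what the quoted numbers imply, not a value taken from the slides.

```python
# Decimal (SI) storage units expressed in gigabytes (assumption: SI, not binary units).
GB_PER_EB = 10**9    # 1 exabyte   = 10^18 bytes = 10^9 GB
GB_PER_ZB = 10**12   # 1 zettabyte = 10^21 bytes = 10^12 GB


def exabytes_to_gb(exabytes):
    """Convert exabytes to gigabytes."""
    return exabytes * GB_PER_EB


def implied_population(total_zb, gb_per_person):
    """Number of people implied by a total volume and a per-person share."""
    return total_zb * GB_PER_ZB / gb_per_person


daily_gb = exabytes_to_gb(2.5)            # the daily figure attributed to IBM
people = implied_population(40, 5200)     # the 2020 forecast attributed to IDC

print(f"{daily_gb:.2e} GB generated per day")
print(f"{people / 1e9:.2f} billion people implied by 40 ZB at 5,200 GB each")
```

The first result confirms that 2.5 exabytes equals 2.5 billion GB; the second shows that the per-person figure corresponds to a world population of a little under 8 billion.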
![Page 10: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/10.jpg)
![Page 11: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/11.jpg)
S.No. Sub-Topic Speaker
1. What is Data Science Sandip Das
2. Data Scientist Anwesha Bhattacharya
3. Applications of Data Science Manasa Rath
4. Workflow of Data Science Dibakar Sen
5. Challenges in Workflow of Data Science Jayanta Kr. Nayek
6. Tools and Technology Tanmay & Manash
7. Machine Learning in Data Science Samhati Soor
8. Conclusion Shiv Shakti Ghosh
![Page 12: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/12.jpg)
References
● http://bit.ly/1gyRYcM
● http://bit.ly/SdJ2OU
● http://bit.ly/RzrZ9k
● http://bit.ly/1pwlEY4
● http://bit.ly/1pwlUq6
![Page 13: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/13.jpg)
What is Data Science
Sandip Das
![Page 14: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/14.jpg)
DATA SCIENCE
![Page 15: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/15.jpg)
Data
What kind of data might you collect?
![Page 16: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/16.jpg)
Data
How many lily pads
The size of the lily pads in inches
How many small, medium or large lily pads
How many frogs
![Page 17: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/17.jpg)
What is Data?
It is something you want to know.
A collection of facts.
Facts and statistics collected together for reference or analysis.
Data is the plural form of datum; it can be seen as pieces of information and as a collection of object-units that are distinct from one another.
Data is the undifferentiated observation of facts in terms of words, numbers, symbols, etc.
![Page 18: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/18.jpg)
What is Data?
Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Computer data may be processed by the computer's CPU and is stored in files and folders on the computer's hard disk.
![Page 19: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/19.jpg)
Science
The systematic observation of natural events and conditions in order to discover facts about them and to formulate laws and principles based on these facts.
Science involves more than the gaining of knowledge. It is about gaining a deeper and often useful understanding of the world.
![Page 20: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/20.jpg)
The science is an art of:
Discovering what we don't know from data
Obtaining predictive, actionable insight from data
Creating data products that have business impact
Building confidence in decisions that drive business value
![Page 21: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/21.jpg)
Data science
According to computer scientist Peter Naur:
“The science of dealing with data, once they have been established”
Data Science is the scientific study of the creation, validation and transformation of data to create meaning.
Data science is the study of the generalizable extraction of knowledge from data.
![Page 22: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/22.jpg)
Multidisciplinary Approach
![Page 23: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/23.jpg)
Domain Expertise
Domain expertise is proficiency, with special knowledge or skills, in a particular area or topic.
Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know.
![Page 24: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/24.jpg)
Data Engineering
It is the data part of data science. It involves
Acquiring
Ingesting
Transforming
Storing
Retrieving data
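The five activities listed above can be sketched end to end with the Python standard library. This is a minimal illustration under stated assumptions, not a production pipeline: the pond data, the table name `ponds` and all function names are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API or sensor.
RAW_CSV = "pond,lily_pads,frogs\nnorth,12,3\nsouth,7,\n"


def acquire():
    """Acquire: obtain the raw data (here, an in-memory text stream)."""
    return io.StringIO(RAW_CSV)


def ingest(stream):
    """Ingest: parse the raw stream into rows of named fields."""
    return list(csv.DictReader(stream))


def transform(rows):
    """Transform: convert text fields to numbers and mark missing values as None."""
    cleaned = []
    for row in rows:
        frogs = int(row["frogs"]) if row["frogs"] else None
        cleaned.append((row["pond"], int(row["lily_pads"]), frogs))
    return cleaned


def store(records, conn):
    """Store: persist the cleaned records in a database table."""
    conn.execute("CREATE TABLE ponds (pond TEXT, lily_pads INTEGER, frogs INTEGER)")
    conn.executemany("INSERT INTO ponds VALUES (?, ?, ?)", records)
    conn.commit()


def retrieve(conn, min_pads):
    """Retrieve: query the stored data back out."""
    cursor = conn.execute(
        "SELECT pond, lily_pads, frogs FROM ponds WHERE lily_pads >= ? ORDER BY pond",
        (min_pads,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
store(transform(ingest(acquire())), conn)
result = retrieve(conn, 10)
print(result)
```

Keeping each activity in its own function makes it possible to swap, say, the in-memory source for a real file without touching the later steps.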
![Page 25: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/25.jpg)
Scientific Method
It is the process for acquiring new knowledge by applying the principles of reasoning on empirical evidence derived from testing hypotheses through repeatable experiments.
![Page 26: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/26.jpg)
Statistics & Mathematics
Statistics (along with mathematics) is the cerebral part of data science. It is used to collect, organize, analyse and interpret data.
![Page 27: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/27.jpg)
Advanced Computing
Advanced computing is the heavy lifting of data science. It consists of software design and programming languages.
![Page 28: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/28.jpg)
Visualization
It is the pretty face of data science.
A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.
![Page 29: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/29.jpg)
Hacker mindset
Hacking is modifying one's own computer system, including building, rebuilding, modifying and creating software, electronic hardware or peripherals, in order to make it better, make it faster, or give it added features.
Data science hacking involves inventing new models, exploring.
![Page 30: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/30.jpg)
References
● http://bit.ly/1jZR0WA
● http://bit.ly/1pwmV1m
● http://bit.ly/1tkKyKG
● http://bit.ly/1ntd13L
● http://bit.ly/1wi9t5Z
![Page 31: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/31.jpg)
Data Scientist
Anwesha Bhattacharya
(& I am not a data scientist)
![Page 32: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/32.jpg)
Who is a data scientist?
● A practitioner of data science is called a data scientist. (~Wikipedia)
● Data scientists use technology and skills to increase awareness, clarity and direction for those working with data. (http://www.datascientists.net)
![Page 33: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/33.jpg)
![Page 34: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/34.jpg)
Why do we need data scientists?
● Firstly, there is more data than we can consume. We require a data scientist who can look at the data and say, “This is important. Check out this one.”
● They are the people who can understand and provide meaning to the piles and piles of data that are collected. “Big data” is the buzzword that represents those piles.
● They minimise the disruptions that are encountered while dealing with data.
● They present data with an awareness of the consequences of presenting that data.
![Page 35: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/35.jpg)
Data Scientist aims
![Page 36: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/36.jpg)
Types of Data Scientists
Data scientists can be broadly classified into two categories:
Product-focused data scientists.
Business Intelligence style of data scientists.
There are roughly 4 to 5 groups in each category.
![Page 37: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/37.jpg)
Product-focused Data Scientists
Data Researcher
The professionals in this category come from the academic world and have in-depth backgrounds in statistics or the physical or social sciences. This type of data scientist often holds a PhD but is weakly skilled in machine learning, programming or business.
Data Developer
These guys tend to concentrate on the technical issues that come with handling data. They are strong in programming and machine learning but weak in business and statistics skills.
Data Creatives
These are the guys who make something innovative out of mountains of data. They are strongly skilled in machine learning, Big Data, programming and other skills needed to handle massive data.
Data Business People
They represent the business side and are responsible for making vital business decisions through data analytics techniques. They are a blend of business and technical proficiency.
![Page 38: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/38.jpg)
Business Intelligence based Data Scientists
● Quantitative, exploratory Data Scientists
Quantitative, exploratory data scientists are inclined to have PhDs and use theory to comprehend behaviour. By combining theory and exploratory research, these data scientists improve products.
● Operational Data Scientists
Operational data scientists frequently work in finance, sales or operations teams in an organization. Their role is to analyse the performance, responses and behavior of a process in order to improve the organization’s strategy and efficiency.
● Product Data Scientists
Product data scientists fit into product management or engineering. Their job is to understand the way users make use of a product and to use that knowledge to fine-tune the product.
● Marketing Data Scientists
Marketing data scientists focus on the user base, evaluate performance and work on improving efficiency, pretty much like the standard marketing guy.
● Research Data Scientists
Research data scientists create insights from a data set.
![Page 39: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/39.jpg)
Profile of Data Scientist
● They love data
● Have an investigative mindset
● Goal of work: finding patterns in data and data-driven products
● Are practitioners, not theorists
● Have “hands on” skills
● Have domain expertise
● Team players
● Technically focused
● Versatile communication and collaboration skills
● Curiosity for exploring and experimenting with data
● Sceptical people, likely to ask a lot of questions around the viability of a given solution and whether it will really work
![Page 40: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/40.jpg)
Required skills
● Data mining - The computational process of discovering patterns in large data sets; the analysis step of "Knowledge Discovery in Databases".
● Programming - The act of instructing computers to perform tasks.
● Algorithms - Step-by-step procedures for calculations used for the analysis of data.
● Statistics - The collection, organization, analysis, interpretation and presentation of data.
● NLP - Interactions between computers and human languages.
● Machine learning - The science of getting computers to act without being explicitly programmed.
● Distributed systems - Components located on networked computers communicate and coordinate their actions by passing messages.
● Visualization - The creation and study of the visual representation of data, to communicate both abstract and concrete ideas.
● .........
![Page 41: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/41.jpg)
What Does a Data Scientist Do?
10 Things [most] Data Scientists Do:
1) Ask Good Questions. What is what? We don’t know! We’d like to know.
2) Explore data & generate hypotheses. Run experiments
3) Scoop, Scrape & Sample Data
4) Tame Data
5) Discover the unknowns.
6) Model Data. Model Algorithms.
7) Understand Data Relationships
8) Tell the Machine How to Learn from Data
9) Create Data Products that Deliver Actionable Insight
10) Communicate the results using visualization, presentations
![Page 42: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/42.jpg)
DIKUW (from PAST to FUTURE)
D - Data: Raw; numbers, letters, symbols
I - Information: What; description, context, relationships
K - Knowledge: How to; extract, test, instruction
U - Understanding: Why; cause & effect, proved, known unknowns
W - Wisdom: When; prediction, what's best, unknown unknowns
Roles along the scale: Data Engineer, Data Analyst, Data Miner, Data Scientist
![Page 43: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/43.jpg)
Data Scientist vs Data Analyst
Data Scientist
● Familiarity with database systems, e.g. MySQL
● Better to be familiar with Java, Python
● Should have a clear understanding of various analytical functions (median, rank, etc.) and how to use them on data sets
● Perfection in mathematics, statistics, correlation, data mining, etc.
● Deep statistical insights and machine learning
Data Analyst
● Familiarity with data warehousing and business intelligence concepts
● In-depth exposure to SQL and analytics
● Strong understanding of Hadoop-based analytics
● Perfection regarding the tools and components of data architecture
● Proficiency in decision making
● Data analysis has generally been used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries.
● Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.
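The analytical functions mentioned in the comparison (median, rank) can be illustrated briefly. The sketch below uses `statistics.median` from the Python standard library; `dense_rank` is a hypothetical helper written for this example, and the frog counts are made up.

```python
from statistics import median


def dense_rank(values):
    """Rank values from largest (rank 1) downwards; equal values share a rank."""
    distinct = sorted(set(values), reverse=True)
    position = {value: index + 1 for index, value in enumerate(distinct)}
    return [position[value] for value in values]


frog_counts = [3, 7, 7, 1, 5]   # invented observations, one per pond

mid = median(frog_counts)       # middle value of the sorted observations
ranks = dense_rank(frog_counts)

print(mid)     # 5
print(ranks)   # [3, 1, 1, 4, 2]
```

The median is robust to extreme values, which is one reason it is often preferred to the mean on messy real-world data; the dense ranking shows how ties (the two ponds with 7 frogs) are handled.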
![Page 44: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/44.jpg)
Challenges of data scientist
● Red tape: No access allowed
● Unknown need: What's the organization's goal?
● Terminology: What's a wonkulator?
● Real world data: Messy, noisy, missing
● Analysis distrust: ...but I don't like that result
![Page 45: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/45.jpg)
References
● Zhukov, Leonid. Data Scientists. Higher School of Economics. National Research University.
● http://bit.ly/1kduMvA
● http://bit.ly/1orF9DL
● http://bit.ly/1tMBBvQ
● http://bit.ly/1kJ9gU8
● http://bit.ly/TS9H5e
● http://bit.ly/1jZR0WA
![Page 46: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/46.jpg)
APPLICATIONS of DATA SCIENCE
by
Manasa Rath
![Page 47: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/47.jpg)
Reaching to Data Science
![Page 48: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/48.jpg)
APPLICATIONS
agriculture
pharmacy
energy
retail
tourism
real estate
import-export
finance
business
services
![Page 49: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/49.jpg)
Applications in Education sector
-Survey done by the Pearson group to improve learning software and course materials, for better quality and efficacy in learning
-Tools used are Python, R, Google BigQuery
![Page 50: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/50.jpg)
Data Science in Healthcare Industry
-A group has been diagnosed with Type 2 diabetes, and some subset of this group has developed complications
-We would like to know whether there is any pattern to the complications and whether the probability of complication can be predicted and therefore acted upon
Healthcare Use Database Snippet
![Page 51: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/51.jpg)
Extracting Interesting Patterns of Health outcomes from Healthcare System Care
Is the pattern robust and predictive?
OBSERVATIONS
What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
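A question of this kind translates directly into a filter followed by a proportion. The sketch below is only an illustration: the patient records, field names and the `incidence` function are all invented, and a real study would need far more data and proper validation before any pattern could be called robust.

```python
# Hypothetical patient records; every value here is invented for illustration.
patients = [
    {"age": 52, "medications": 7, "complication": True},
    {"age": 41, "medications": 8, "complication": False},
    {"age": 35, "medications": 9, "complication": True},
    {"age": 60, "medications": 3, "complication": False},
    {"age": 45, "medications": 7, "complication": True},
]


def incidence(records, min_age, min_meds):
    """Share of the selected cohort that developed a complication.

    The cohort is everyone older than min_age who takes more than
    min_meds medications. Returns None if the cohort is empty.
    """
    cohort = [
        r for r in records
        if r["age"] > min_age and r["medications"] > min_meds
    ]
    if not cohort:
        return None
    return sum(r["complication"] for r in cohort) / len(cohort)


rate = incidence(patients, min_age=37, min_meds=6)
print(f"Incidence in cohort: {rate:.0%}")
```

Returning `None` for an empty cohort avoids a division by zero and makes the "no data" case explicit to the caller.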
![Page 52: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/52.jpg)
Remarks
-When predictive accuracy becomes a primary objective, the computer tends to play a significant role in model building and decision making
-This calls for an integrated skill set spanning mathematics, statistics, AI, databases and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions
![Page 53: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/53.jpg)
Applications in Social Networking sites
![Page 54: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/54.jpg)
![Page 55: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/55.jpg)
![Page 56: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/56.jpg)
Key Points
-The ability to interpret unstructured data and integrate it with numbers further increases our ability to extract useful knowledge in real time and act on it
![Page 57: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/57.jpg)
References
1. Data Science and Prediction by Vasant Dhar http://bit.ly/1tiRvMr
![Page 58: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/58.jpg)
Workflow of Data Science
Dibakar Sen
![Page 59: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/59.jpg)
Work flow of Data Science
● The work flow process consists of three major activities:
-Organising
-Packaging
-Delivering
![Page 60: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/60.jpg)
Work flow Phases
[Diagram of the work flow phases; surviving labels: Understanding of data, Evaluation]
![Page 61: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/61.jpg)
Understanding of Data
- set objectives or goal
- set data fields
- data collection procedure
![Page 62: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/62.jpg)
Preparation Phase
![Page 63: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/63.jpg)
Preparation Phase
● Acquire data
The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, e.g.:
-Existing data can be used (e.g., U.S. Census data sets).
-Data can be automatically generated by computer software.
-Data can be manually entered into a spreadsheet or text file by a human, for example through a survey.
![Page 64: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/64.jpg)
Preparation Phase
● Reformat and clean data
-Before analysis begins, we need to verify that the data are accurate and that the variables are well named and properly labeled.
-We have to store the data in the desired format.
-Verify the sample and variables:
 - Do the variables have the correct values?
 - Are missing data coded appropriately?
 - Are the data internally consistent?
 - Is the sample size correct? etc.
-Programmers reformat and clean data either by writing scripts or by manually editing data in, say, a spreadsheet.
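A script-based version of these verification questions might look like the following sketch. The records, the allowed size categories, the expected sample size and the `verify` function are all assumptions made for the example; real checks depend on the data set at hand.

```python
# Assumed expectations for a hypothetical lily-pad survey.
EXPECTED_SAMPLE_SIZE = 4
VALID_SIZES = {"small", "medium", "large"}

records = [
    {"pad_id": 1, "size": "small", "inches": 4.0},
    {"pad_id": 2, "size": "large", "inches": 11.5},
    {"pad_id": 3, "size": "huge", "inches": 9.0},      # value outside the allowed set
    {"pad_id": 4, "size": "medium", "inches": None},   # missing measurement
]


def verify(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    # Is the sample size correct?
    if len(rows) != EXPECTED_SAMPLE_SIZE:
        problems.append(
            f"sample size is {len(rows)}, expected {EXPECTED_SAMPLE_SIZE}"
        )
    for row in rows:
        # Do the variables have the correct values?
        if row["size"] not in VALID_SIZES:
            problems.append(f"pad {row['pad_id']}: unexpected size {row['size']!r}")
        # Are missing data coded appropriately (None rather than, say, 0 or -1)?
        if row["inches"] is None:
            problems.append(f"pad {row['pad_id']}: inches is missing")
        elif row["inches"] <= 0:
            problems.append(f"pad {row['pad_id']}: inches must be positive")
    return problems


problems = verify(records)
for problem in problems:
    print(problem)
```

Collecting every problem before reporting, instead of stopping at the first one, gives a fuller picture of how much cleaning a data set needs.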
![Page 65: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/65.jpg)
Analysis Phase
![Page 66: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/66.jpg)
Analysis Phase
● Data Analysis
- The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data.
- Different "scripting" languages such as Python, Perl, R, and MATLAB are used to analyze the data. However, data scientists also use compiled languages such as C, C++, and Fortran when appropriate.
![Page 67: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/67.jpg)
● In the analysis phase, the programmer engages in a repeated iteration cycle of editing scripts, executing to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.
![Page 68: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/68.jpg)
Reflection/Evaluation Phase
![Page 69: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/69.jpg)
Reflection / Evaluation Phase
While the analysis phase involves programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist might perform the following types of reflection:
-Take notes
-Hold meetings
-Make comparisons and explore alternatives
![Page 70: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/70.jpg)
Dissemination Phase
![Page 71: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/71.jpg)
Dissemination Phase
The final phase of data science is disseminating results: preparing reports in order to communicate findings to the appropriate audience. Results are most commonly presented in the form of written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications.
● Beyond presenting results in written form, some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems.
![Page 72: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/72.jpg)
References
● http://bit.ly/1jZcx2I
● http://bit.ly/1jZeTyN
● http://bit.ly/1hbQuWx
![Page 73: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/73.jpg)
Challenges in Workflow of Data Science
Jayanta Kr. Nayek
![Page 74: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/74.jpg)
Preparation phase
Acquire data:
-Keeping track of provenance:
 -Where each piece of data comes from and whether it is still up-to-date.
-Data management:
 -Programmers must assign names to data files that they create or download and then organize those files into directories.
 -When they create or download new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences.
-Storage:
 -Sometimes there is so much data that it cannot fit on a single hard drive, so it must be stored on remote servers.
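One lightweight way to keep track of provenance and versions is to store a small record next to each data file. The sketch below is an assumption-laden illustration: the record fields, the file name, the example.org URL and both function names are invented, and a content hash is only one of several possible ways to detect that a file has changed.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(name, content, source):
    """Describe where a piece of data came from and what it contained."""
    return {
        "file": name,
        "source": source,
        "sha256": hashlib.sha256(content).hexdigest(),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


def has_changed(record, new_content):
    """True if new_content differs from the content the record describes."""
    return hashlib.sha256(new_content).hexdigest() != record["sha256"]


# Hypothetical file contents and source location.
original = b"pond,lily_pads\nnorth,12\n"
record = provenance_record("ponds_v1.csv", original, "https://example.org/ponds.csv")
print(json.dumps(record, indent=2))

# A later download with an extra row is detected as a new version.
updated = original + b"south,7\n"
print(has_changed(record, updated))
```

Storing the retrieval time alongside the hash answers both questions raised above: where the data came from, and whether the local copy is still up to date.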
![Page 75: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/75.jpg)
Preparation Phase
Reformat and clean data:
-A related problem is that raw data often contains semantic errors (errors in logic or arithmetic that must be detected at run time), missing entries, or inconsistent formatting, so it needs to be "cleaned" prior to analysis.
-Data integration:
 -Data integration involves combining data residing in different sources and providing users with a unified view of these data.
-Heterogeneous data:
 -Data integration involves synchronizing huge quantities of variable, heterogeneous data resulting from internal legacy systems (old or outdated methods, technologies, computer systems, or application programs) that vary in data format. Legacy systems may have been created around flat file, network, or hierarchical databases.
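The idea of a unified view over heterogeneous sources can be shown on a very small scale. In the sketch below, a semicolon-delimited flat file stands in for a legacy system and a JSON document for a newer one; both data sets, the field names and the assumption that the legacy dates are written DD/MM/YYYY are invented for the example.

```python
import csv
import io
import json

# Two hypothetical sources that describe the same kind of entity differently.
LEGACY_FLAT_FILE = "ID;NAME;JOINED\n001;Asha;03/11/2013\n002;Ravi;17/01/2014\n"
MODERN_JSON = '[{"id": 3, "name": "Meera", "joined": "2014-06-30"}]'


def from_flat_file(text):
    """Map legacy rows onto the common schema (assumes DD/MM/YYYY dates)."""
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        day, month, year = row["JOINED"].split("/")
        yield {
            "id": int(row["ID"]),
            "name": row["NAME"],
            "joined": f"{year}-{month}-{day}",
        }


def from_json(text):
    """Map JSON items onto the common schema (dates are already ISO formatted)."""
    for item in json.loads(text):
        yield {"id": int(item["id"]), "name": item["name"], "joined": item["joined"]}


def unified_view(*sources):
    """Combine records from all sources, keyed by id and sorted by id."""
    merged = {}
    for source in sources:
        for record in source:
            merged[record["id"]] = record
    return [merged[key] for key in sorted(merged)]


view = unified_view(from_flat_file(LEGACY_FLAT_FILE), from_json(MODERN_JSON))
for record in view:
    print(record)
```

Each source gets its own adapter that translates into one shared schema, so adding another legacy format means writing one more adapter and leaving the rest untouched.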
![Page 76: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/76.jpg)
Preparation Phase
● Data Integration Problems:
-Unanticipated costs:
 -Labor costs for initial planning, evaluation, programming and additional data acquisition
 -Software and hardware purchases
 -Unanticipated technology changes/advances
 -Both labor and the direct costs of data storage and maintenance
-Lack of data management expertise:
 -The support required to engage everyone in the agency and convey to them the need for and benefits of data integration is unlikely to flow from leaders who lack awareness of or commitment to those benefits.
![Page 77: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/77.jpg)
Preparation Phase
● Data transmission:
-It is the physical transfer of data over a point-to-point or point-to-multipoint communication channel.
-Cloud data storage has become popular as cloud technologies have developed.
-Network bandwidth is the bottleneck in cloud and distributed systems, especially when the volume of communication is large.
-On the other hand, cloud storage also raises data security problems, such as the need for data integrity checking.
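One common way to approach data integrity checking is to compare a cryptographic digest computed before upload with one computed after download. The sketch below uses SHA-256 from Python's standard library; the sample payloads are hypothetical, and the slide does not say which integrity mechanism the presenters had in mind.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Hex digest of the payload; any change to the bytes changes the digest."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_digest: str) -> bool:
    """True when the received bytes match the digest recorded by the sender."""
    return sha256_digest(data) == expected_digest


# Hypothetical payload stored in the cloud and later downloaded.
original = b"sensor_id,reading\n17,3.2\n"
digest_before_upload = sha256_digest(original)

downloaded_ok = original
downloaded_corrupt = b"sensor_id,reading\n17,8.2\n"   # one character altered
```

The digest is small enough to keep locally, so the check does not require a second copy of the data.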
![Page 78: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/78.jpg)
Analysis Phase
● Data inconsistency and incompleteness:
-A number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies.
● Scalability:
-Scalability is the biggest and most important challenge in Big Data analysis.
-Over the last few decades, researchers have paid more attention to accelerating analysis algorithms to cope with increasing volumes of data, and to speeding up processors in line with Moore’s Law.
● Data curation:
-Data curation is aimed at data discovery and retrieval, data quality assurance, value addition, reuse and preservation over time.
-Existing database management tools are unable to process Big Data that grows so large and complex.
![Page 79: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/79.jpg)
Analysis Phase
● Timeliness:
-For real-time Big Data applications, such as navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and the Internet of Things, timeliness is the top priority. How can we guarantee a timely response when the volume of data to be processed is very large?
● File and metadata management:
-Repeatedly editing and executing scripts while iterating on experiments produces numerous output files, such as intermediate data, textual reports, tables, and graphical visualizations.
-This leads to data management problems because of the abundance of files and because programmers often later forget their own ad-hoc naming conventions.
● Data security:
-Firstly, the size of Big Data is extremely large, which challenges existing protection approaches.
-Secondly, it also leads to a much heavier security workload.
![Page 80: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/80.jpg)
Analysis Phase
● Absolute running times: Scripts might take a long time to terminate, either because large amounts of data are being processed or because the algorithms are slow.
● Incremental running times: Scripts might take a long time to terminate after minor incremental code edits made while iterating on analyses, which wastes time re-computing almost the same results as previous runs.
● Crashes from errors: Scripts might crash prematurely due to errors in the code or inconsistencies in the data sets. Programmers often need to endure several rounds of debugging before their scripts can terminate with useful results.
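The incremental-running-time problem is often eased by caching the results of expensive steps and recomputing only when their input changes. The following is a minimal sketch of that idea, not a technique named in the slides: the cache is an in-memory dictionary standing in for an on-disk store, and `slow_summary` is a hypothetical analysis step.

```python
import hashlib
import json

_cache = {}              # in-memory stand-in for an on-disk result cache
calls = {"count": 0}     # counts real executions, for demonstration only


def cache_key(step_name, payload):
    """Key a result by the step name and a hash of its (JSON-serialisable) input."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return step_name + ":" + hashlib.sha256(blob).hexdigest()


def cached_step(step_name, func, payload):
    """Run func(payload) only if this exact input has not been seen before."""
    key = cache_key(step_name, payload)
    if key not in _cache:
        _cache[key] = func(payload)
    return _cache[key]


def slow_summary(values):
    """Hypothetical expensive analysis step."""
    calls["count"] += 1
    return {"n": len(values), "mean": sum(values) / len(values)}


data = [2.0, 4.0, 6.0]
first = cached_step("summary", slow_summary, data)
second = cached_step("summary", slow_summary, data)          # served from the cache
third = cached_step("summary", slow_summary, data + [8.0])   # input changed, recomputed
```

A real pipeline would also need to invalidate the cache when the code of a step changes, which this sketch does not attempt.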
![Page 81: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/81.jpg)
Reflection Phase
● Take notes: Since notes are a form of data, the usual data management problems arise in note-taking, most notably how to organize notes and link them with the context in which they were originally written.
● Make comparisons and explore alternatives: Data scientists must organize, manage, and compare their output graphs to gain insights and ideas for what alternative hypotheses to explore.
![Page 82: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/82.jpg)
Dissemination Phase
● Functionalities:
-To convey information easily by revealing the knowledge hidden in complex, large-scale data sets, both aesthetic form and functionality are necessary.
-Current tools mostly perform poorly in functionality and response time.
● Scalability:
-Data visualization aims to represent knowledge more intuitively and effectively by using different graphs. It is particularly difficult to carry out because of the large size and high dimensionality of Big Data.
![Page 83: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/83.jpg)
Dissemination Phase
● Difficult to distribute research code: Some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems. It is difficult to distribute research code in a form that other people can easily execute on their own computers.
● Difficult to reproduce the results: It is even difficult to reproduce the results of one's own experiments a few months or years in the future, since one's own operating system and software inevitably get upgraded in some incompatible manner such that the original code no longer runs.
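One small step toward reproducibility is to record the software environment next to every set of results, so that a later reader at least knows what the code originally ran on. The sketch below is an assumption about how this could be done with Python's standard library; the function name `environment_manifest` and the `experiment` label are hypothetical.

```python
import json
import platform
import sys


def environment_manifest(extra=None):
    """Collect basic facts about the interpreter and operating system."""
    manifest = {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "argv": list(sys.argv),
    }
    if extra:
        manifest.update(extra)   # e.g. experiment name, data set version
    return manifest


# Serialise the manifest so it can be saved alongside the output files.
manifest_json = json.dumps(environment_manifest({"experiment": "demo-run"}),
                           indent=2, sort_keys=True)
```

This records the environment but does not recreate it; pinning library versions or packaging the whole environment goes further.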
![Page 84: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/84.jpg)
Reference
● Chen, Philip C.L. and Zhang, Chun-Yang (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. Elsevier. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China.
● http://bit.ly/1jZcx2I
● http://1.usa.gov/SNspKm
![Page 85: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/85.jpg)
TECHNOLOGY and Tools for DATA SCIENCE
TANMAY MONDAL & MANASH KUMAR
![Page 86: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/86.jpg)
We need
● Organise Data
● Analyse Data
● Package and Deliver Data
![Page 87: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/87.jpg)
Data Science Tools
Language: Java, R, Python, ...
Databases/Data Warehouses: Apache Cassandra, Apache HBase, MongoDB, ...
Data Mining: RapidMiner/RapidAnalytics, Orange, Weka, ...
File Systems: Gluster, Hadoop Distributed File System, ...
![Page 88: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/88.jpg)
Data Science Tools
Big Data Search: Lucene, Solr, ...
Data Aggregation and Transfer: Sqoop, Flume, ...
Miscellaneous Big Data Tools: Hadoop, Avro, Zookeeper, ...
![Page 89: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/89.jpg)
![Page 90: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/90.jpg)
What is Hadoop?
● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
[Figure: nodes in a Hadoop cluster]
![Page 91: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/91.jpg)
Why Hadoop?
• Handles enormous data volumes.
• Cost-effective.
• Scalable.
• Fault tolerant.
![Page 92: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/92.jpg)
Origin of Hadoop
• Google introduced to the world two key technologies for handling Big Data: Google File System (a distributed file system technology) in 2003 and MapReduce (a framework for a distributed compute model) in 2004.
• Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
• In February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop.
• First release of Apache Hadoop in September 2007
![Page 93: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/93.jpg)
When should we go for Hadoop ?
Data is too huge
Unstructured data
Parallelism
Processes are independent
Need better scalability
![Page 94: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/94.jpg)
The Hadoop Ecosystem
● HDFS - Hadoop Distributed File System.
● MapReduce - A distributed framework for executing work in parallel.
● Hive - A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
● Pig - A high-level platform for creating MapReduce programs used with Hadoop.
● HBase - A non-relational, distributed database system.
● ..........
![Page 95: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/95.jpg)
The Major Components of Hadoop
Hadoop uses its own distributed file system, HDFS, which makes data available to multiple computing nodes.
Hadoop uses MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
![Page 96: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/96.jpg)
HDFS
Hierarchical UNIX-like file system for data storage.
Splits large files into blocks.
Stores files in blocks across many nodes in a cluster.
Distributes and replicates blocks to different nodes.
Has a master-slave architecture.
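The splitting and replication ideas on this slide can be illustrated with a toy simulation. This is not the HDFS implementation: the block size, node names, and round-robin placement below are assumptions chosen for readability, and real HDFS placement also takes rack topology into account.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement


# Hypothetical 10-byte "file", 4-byte blocks, four nodes, three replicas per block.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks),
                           ["node-a", "node-b", "node-c", "node-d"],
                           replication=3)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block.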
![Page 97: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/97.jpg)
HDFS Architecture
![Page 98: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/98.jpg)
HDFS ...
NameNode
●Runs on a single node as a master process
●Holds file metadata (which blocks are where)
●Directs client access to files in HDFS
SecondaryNameNode
●Maintains a copy of the NameNode metadata
DataNode
●Stores data in the local file system
●Periodically sends a report of all existing blocks to the NameNode
![Page 99: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/99.jpg)
WHAT IS MAP REDUCE?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
![Page 100: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/100.jpg)
![Page 101: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/101.jpg)
Map Reduce Paradigm
Data processing system with two key phases
Map
Perform a map function on input key/value pairs to generate intermediate key/value pairs
Reduce
Perform a reduce function on intermediate key/value groups to generate output key/value pairs
![Page 102: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/102.jpg)
Map Reduce Daemons
•JobTracker (Master)
-Monitors job and task progress
-Manages MapReduce jobs
-Assigns tasks to different nodes
•TaskTracker (Slave)
-Creates individual map and reduce tasks
-Reports task status to JobTracker
-Runs on same node as DataNode service
![Page 103: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/103.jpg)
![Page 104: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/104.jpg)
Hadoop Map Reduce Components
Map Phase
Input Format
Record Reader
Mapper
Combiner
Reduce Phase
Shuffle
Sort
Reducer
Output Format
![Page 105: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/105.jpg)
How does Map Reduce work?
➢The runtime partitions the input and provides it to different Map instances.
➢Map: (key, value) → (key’, value’)
➢The runtime collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.
➢Map and Reduce are user-written functions in Java.
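The flow above can be imitated on a single machine to make the roles of the runtime and of the user-written functions concrete. The sketch below is written in Python for brevity rather than Java, and `run_mapreduce` is a hypothetical stand-in for the Hadoop runtime, not part of any Hadoop API; the station readings are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_splits=2):
    """Single-process imitation of the runtime: partition, map, shuffle, reduce."""
    # 1. Partition the input into splits, one per simulated Map instance.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2. Map: each (key, value) yields zero or more (key', value') pairs.
    intermediate = []
    for split in splits:
        for key, value in split:
            intermediate.extend(map_fn(key, value))

    # 3. Shuffle: group values so each reducer sees all pairs with the same key'.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Reduce: one call per distinct key'.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# User-written functions: highest reading per station (hypothetical data).
readings = [(0, "delhi,41"), (1, "kolkata,36"), (2, "delhi,44"), (3, "kolkata,38")]


def map_reading(_offset, line):
    station, temperature = line.split(",")
    return [(station, int(temperature))]


def reduce_max(_station, temperatures):
    return max(temperatures)


max_by_station = run_mapreduce(readings, map_reading, reduce_max)
```

Only `map_reading` and `reduce_max` correspond to what a user would write; everything inside `run_mapreduce` is work the framework does on the user's behalf, across many machines.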
![Page 106: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/106.jpg)
![Page 107: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/107.jpg)
WORD COUNT IN MAP REDUCE
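Word count is the customary first MapReduce example. A minimal single-machine sketch in Python follows (a real Hadoop job would typically be written in Java); the three input lines are hypothetical, and the combiner step shows the optional map-side pre-aggregation listed among the Hadoop MapReduce components.

```python
from collections import Counter


def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def combine(pairs):
    """Combiner: pre-aggregate counts on the map side to shrink shuffle traffic."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(all_pairs):
    """Reduce: sum the partial counts for each word."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]   # hypothetical input

shuffled = []
for line in lines:                      # each line plays the role of one input split
    shuffled.extend(combine(map_words(line)))

word_counts = reduce_counts(shuffled)
```

In the second line the combiner turns two `("car", 1)` pairs into one `("car", 2)` pair before the shuffle, which is the saving a combiner provides.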
![Page 108: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/108.jpg)
Validation of data extract and load into EDW (Enterprise Data Warehouse)
Once the MapReduce process is completed and the data output files are generated, the data is moved to the enterprise data warehouse or to other transactional systems, depending on the requirement.
![Page 109: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/109.jpg)
USERS OF HADOOP
Yahoo! - More than 100,000 CPUs in 40,000 computers running Hadoop. Produces data that was used in every Yahoo! Web search query.
Facebook - In 2010 Facebook claimed that they had the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012 they announced the data had grown to 100 PB. Each (commodity) node has 8 cores and 12 TB of storage.
![Page 110: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/110.jpg)
USERS OF HADOOP
Adobe - Adobe uses Apache Hadoop and Apache HBase in several areas, from social services to structured data storage and processing for internal use. Currently has about 30 nodes running HDFS.
eBay - 532-node cluster (8 * 532 cores, 5.3 PB). Heavy usage of Java MapReduce, Apache Pig, Apache Hive and Apache HBase, used for search optimization and research.
![Page 111: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/111.jpg)
Twitter
Uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter.
GBIF (Global Biodiversity Information Facility)
Nonprofit organization that focuses on making scientific data on biodiversity available via the Internet
18 nodes running a mix of Apache Hadoop and Apache HBase
![Page 112: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/112.jpg)
University of Glasgow
30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage). To facilitate information retrieval research & experimentation, particularly for TREC
Greece.com
Using Apache Hadoop for analyzing data for millions of images, log analysis, data mining
![Page 113: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/113.jpg)
References
http://bit.ly/1km1e46
http://bit.ly/Rzuzfz
http://yhoo.it/1pheFVK
Big data: Testing Approach to Overcome Quality Challenges, by Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja.
![Page 114: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/114.jpg)
Machine Learning
Samhati Soor
![Page 115: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/115.jpg)
What is it?
Learning is a process of knowledge acquisition with a specific purpose.
Machine learning is the study of how to use computers to simulate human learning activities.
[Diagram labels: Training Set, Learning Algorithm, Hypothesis, Input, Predicted Output, Feedback]
![Page 116: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/116.jpg)
Why Machine Learning is Possible?
Mass storage
-More data available
Higher performance of computers
-Larger memory for handling the data
-Greater computational power for calculation and even online learning
![Page 117: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/117.jpg)
Basic Structure of the Machine Learning System
[Diagram labels: External environment, Corpus study, Knowledge representation, Execution, Machine Learning Model]
![Page 118: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/118.jpg)
The Goal of Machine Learning is...
to create a predictive model that is indistinguishable from a correct model.
[Figure panels: Without Logic, With Logic]
![Page 119: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/119.jpg)
Two Phases
Machine learning methods are broken into two phases:
Training
Application
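The two phases can be made concrete with a very small model. The sketch below fits a straight line by ordinary least squares in the training phase and then uses it for prediction in the application phase; the choice of linear regression and the sample points are assumptions for illustration, since the slide does not name a particular method.

```python
def train(xs, ys):
    """Training phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_model(model, x):
    """Application phase: use the learned parameters on unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training set generated from y = 2x + 1.
model = train([1, 2, 3, 4], [3, 5, 7, 9])
prediction = apply_model(model, 10)
```

The training data is needed only to produce `model`; once the parameters are learned, the application phase works from the model alone.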
![Page 120: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/120.jpg)
Types of Machine Learning
Main types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Other types:
1. Semi-supervised learning
2. Time-series forecasting
3. Anomaly detection
4. Active learning
![Page 121: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/121.jpg)
The Main Research Work in the Machine Learning Field
Task-oriented research
Cognitive simulation
Theoretical analysis
![Page 122: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/122.jpg)
Data Science and Machine Learning
If we give the computer rules and/or algorithms to automatically search through our data to “learn” how to recognize patterns and make complex decisions (such as identifying spam emails), we are implementing machine learning.
In data science, data scientists use both statistical techniques and machine learning algorithms to identify patterns and structure in data.
![Page 123: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/123.jpg)
Role of Machine Learning in Data Science
https://doubleclix.wordpress.com/category/data-science/
![Page 124: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/124.jpg)
A Simple Implementation
Suppose our model consists of the probability θ of a coin landing heads (with a prior over θ), while the data consist of the results of N coin flips.
We observe some data.
Our goal is to determine the model from the data; that is, we want the probability of the model given the data, p(model|data).
![Page 125: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/125.jpg)
Using conditional probability,
p(data|model) = p(data and model) / p(model) --(1)
p(model|data) = p(data and model) / p(data) --(2)
From (1) and (2) we get,
p(data|model) * p(model) = p(model|data) * p(data)
That implies:
p(model|data) = (p(data|model) * p(model)) / p(data)
posterior = (likelihood * prior) / evidence
![Page 126: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/126.jpg)
The likelihood distribution describes the likelihood of the data given the model; it reflects our assumptions about how the data was generated.
The prior distribution describes our assumptions about model before observing the data.
The posterior distribution describes our knowledge of model, incorporating both the data and the prior.
The evidence is useful in model selection.
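For the coin example, the four quantities can be computed numerically on a grid of candidate values for θ. The sketch below assumes a uniform prior and a hypothetical sequence of eight flips; neither is specified in the slides, and the grid approximation is only one of several ways to evaluate the posterior.

```python
def coin_posterior(flips, grid_size=101):
    """Grid approximation of p(theta | data) for a coin with P(heads) = theta."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size            # assumption: uniform prior

    heads = sum(flips)
    tails = len(flips) - heads
    # Likelihood p(data | theta) of the observed sequence of independent flips.
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]

    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)                     # p(data), summed over the grid
    posterior = [u / evidence for u in unnormalised]
    return thetas, posterior, evidence


flips = [1, 1, 0, 1, 1, 0, 1, 1]      # hypothetical data: 6 heads, 2 tails
thetas, posterior, evidence = coin_posterior(flips)

# Grid value of theta with the highest posterior probability.
best_index = max(range(len(posterior)), key=posterior.__getitem__)
most_probable_theta = thetas[best_index]
```

With a uniform prior the posterior peaks at the observed fraction of heads, 6/8; a different prior would pull the peak toward the values that prior favours.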
![Page 127: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/127.jpg)
Working Method of a Predictive Modeler and a Data Scientist
A predictive modeler may use a machine learning approach to predict a value or the likelihood of an outcome, given a number of input variables.
A data scientist applies these same approaches on large data sets, writing code and using software adapted to work on big data.
![Page 128: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/128.jpg)
The available library of statistical and machine learning algorithms for evaluating and learning from big data is growing, but is not yet as comprehensive as the algorithms available for the non-distributed world.
The algorithms vary by product, so it is important to understand what is and is not available.
Not all algorithms familiar to the statistician and data miner are easily converted to the distributed computing environment.
The bottom line is that, while fitting models on big data has the potential benefit of greater predictive power, some of the costs are loss of flexibility in algorithm choices and/or extensive programming time.
Prospective
![Page 129: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/129.jpg)
References
Machine Learning and Data Mining Lecture Notes, CSC 411/D11, Computer Science Department, University of Toronto. Version: February 6, 2012.
Mitchell, Tom M. The Discipline of Machine Learning. July 2006. CMU-ML-06-108, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Schraudolph, Nic. Statistical Machine Learning.
http://bit.ly/1oFt1ws
http://bit.ly/1oFtNty
![Page 130: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/130.jpg)
Conclusion
Shiv Shakti Ghosh
![Page 131: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/131.jpg)
Research Areas
Cloud computing
Databases and Database Management Systems
Natural language processing
Signal Processing
Computer vision
![Page 132: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/132.jpg)
Cloud computing
Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time.
It specifically refers to a server connected through a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN).
![Page 133: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/133.jpg)
Issues
Privacy - The increased use of cloud computing services such as Gmail and Google Docs has brought privacy concerns to the fore. Greater use of these services gives service providers access to a plethora of data, which carries an immense risk of that data being disclosed either accidentally or deliberately.
![Page 134: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/134.jpg)
Contd..
Legal - Certain legal issues arise with cloud computing, including trademark infringement, security concerns and sharing of proprietary data resources.
Vendor lock-in - Because cloud computing is still relatively new, standards are still being developed. Many cloud platforms and services are built on the specific standards, tools and protocols developed by a particular vendor for its particular cloud offering. This is a major challenge for interoperability.
![Page 135: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/135.jpg)
Research areas
Open interoperation across cloud solutions at IaaS, PaaS and SaaS levels
Managing multi-tenancy at large scale and in heterogeneous environments
Dynamic and seamless elasticity from private clouds to public clouds for unusual and/or infrequent requirements
Data management in a cloud environment, taking the technical and legal constraints into consideration
![Page 136: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/136.jpg)
Databases & DBMS
A database is an organized collection of data. The data are typically organized in a way that supports processes requiring this information.
Database management systems (DBMSs) are specially designed software applications that interact with the user, other applications, and the database itself to capture and analyze data.
![Page 137: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/137.jpg)
Issues
Data definition – Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data.
Update – Inserting, modifying, and deleting data.
Retrieval – Obtaining information either for end-user queries and reports or for processing by applications.
Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.
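The four groups of DBMS functions listed above can be seen in miniature with SQLite, which ships with Python. The table name, columns, and rows below are hypothetical, and the final step only hints at the administration role by showing a failed transaction being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# Data definition: create a new data structure.
conn.execute(
    "CREATE TABLE journal (id INTEGER PRIMARY KEY, title TEXT NOT NULL, publisher TEXT)"
)

# Update: insert and modify data.
conn.executemany(
    "INSERT INTO journal (title, publisher) VALUES (?, ?)",
    [("EPJ Data Science", "Springer"), ("Data Science Journal", "CODATA")],
)
conn.execute(
    "UPDATE journal SET publisher = ? WHERE title = ?",
    ("SpringerOpen", "EPJ Data Science"),
)
conn.commit()

# Retrieval: query the data for a report or an application.
titles = [row[0] for row in conn.execute("SELECT title FROM journal ORDER BY title")]

# Administration (integrity and recovery): a failed transaction is rolled back.
try:
    with conn:                                   # commits on success, rolls back on error
        conn.execute("DELETE FROM journal")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

remaining = conn.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
```

The rows deleted inside the failed block are still present afterwards, which is the atomicity guarantee that transaction research in databases is concerned with.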
![Page 138: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/138.jpg)
Research areas
Research activity includes theory and development of prototypes and models. Notable research topics include the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more.
![Page 139: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/139.jpg)
NLP
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.
![Page 140: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/140.jpg)
Human-level natural language processing is an AI-complete problem; that is, it is equivalent to making computers as intelligent as people. NLP's future is therefore tied closely to the development of AI in general.
As natural language understanding improves, computers will be able to learn from the information online and apply what they learned in the real world.
In the future, humans may not need to code programs, but will dictate to a computer in a human natural language, and the computer will understand and act upon the instructions.
![Page 141: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/141.jpg)
Signal Processing
Signal processing is an area of systems engineering, electrical engineering and applied mathematics that deals with operations on or analysis of analog as well as digitized signals, representing time-varying or spatially varying physical quantities.
Signals of interest can include sound, electromagnetic radiation, images, and sensor readings, for example biological measurements such as electrocardiograms, control system signals, telecommunication transmission signals, and many others.
![Page 142: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/142.jpg)
Computer vision
Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image.
![Page 143: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/143.jpg)
Data Science Higher Education programmes 2014
Programs in 2014

| Institute / Organization | Course |
| --- | --- |
| Indiana University, Indiana, US * | Online Certificate in Data Science (January 2014) |
| University of California, Berkeley | Master of Information and Data Science program |
| Saint Peter's University, US ** | Master of Science in Data Science program |
| Worcester Polytechnic Institute, Worcester, Massachusetts, US | Master of Science in Data Science program |
| University of Virginia, US *** | Master of Science in Data Science |

* The program consists of 12 credits, including cloud computing, data management and data analysis.
** The program’s curriculum will include topics such as decision analysis and optimization, predictive modeling, data mining and visualization.
*** A professional program to prepare students for the use of data analysis in major industries such as health care, business, and science.
![Page 144: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/144.jpg)
Conferences on Data Science 2014
International Conference on Data Science and Engineering (26-28 August 2014)
Hosted by: School of Computer Science Studies, Cochin University of Science & Technology. Co-sponsored by IEEE Kerala.
DataEDGE Conference: A new vision for data science (May 8–9, 2014, Berkeley, CA)
Discussions will be on the way organizations are using data to address business and social issues, about the challenges of working with data at scale, and about the most pressing questions and debates facing data scientists today.
![Page 145: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/145.jpg)
O’REILLY Strata is organising three conferences:
● New York (October 15-17, 2014): Discussions will be on complex issues and opportunities brought to business by big data, data science, and pervasive computing.
● Barcelona, Spain (November 19–21, 2014): Discussions will be on big data analytics.
● San Jose, CA (February 18–20, 2015)
![Page 146: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/146.jpg)
ASE (Academy of Science and Engineering) is organising three conferences:
● Stanford University, CA, USA (May 27 - May 31, 2014)
● Tsinghua University, Beijing, China (August 4-7, 2014)
● Harvard University, Cambridge, MA, US (December 15-19, 2014)
IEEE International Conference on Big Data Science and Engineering (Tsinghua University, Beijing, China, 24-26 Sept. 2014).
The 2014 International Conference on Data Science and Advanced Analytics (October 30 - November 1, 2014, Shanghai, China).
![Page 147: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/147.jpg)
Journals of Data Science
● Journal of Data Science: an international journal devoted to applications of statistical methods at large. The online version is free; the hard copy version costs 300 USD/year.
● CODATA Data Science Journal: published by CODATA.
● EPJ Data Science: a SpringerOpen journal.
● International Journal of Data Science: Inderscience Publishers.
![Page 148: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/148.jpg)
![Page 149: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/149.jpg)
![Page 150: Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg](https://reader037.fdocuments.in/reader037/viewer/2022110313/55c30ee4bb61ebcc738b4744/html5/thumbnails/150.jpg)