Introduction to data science intro,ch(1,2,3)


Data Science

An emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.

1

Web page

Much of the data in the world is non-numeric and unstructured.

Unstructured means that the data are not arranged in neat rows and columns. Think of a web page.

2

$

3

Data architecture

Data acquisition

Data analysis

Data archiving

4

Data architect

Providing input on how the data would need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.

5

Data acquisition

Focuses on how the data are collected and, importantly, how the data are represented prior to analysis and presentation.

Tool example: barcode

Different barcodes are used for the same product (for example, for different sized boxes of cereal).
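The barcode point can be sketched in R. This is only an illustration: the barcode values and product names are made up, and `barcodeToProduct` is a hypothetical lookup, not part of any real system.

```r
# Hypothetical barcodes (made-up values) mapped to product names.
# Two different barcodes point to the same product in different box sizes.
barcodeToProduct <- c(
  "0001112223334" = "cereal",
  "0001112223341" = "cereal",
  "0009998887776" = "milk"
)

# Look up the product for each scanned barcode.
scanned <- c("0001112223334", "0001112223341", "0009998887776")
products <- unname(barcodeToProduct[scanned])

# Count sales per product rather than per barcode.
salesPerProduct <- table(products)
print(salesPerProduct)
```

How the barcodes are represented before analysis decides whether the two cereal boxes are counted as one product or two.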

6

Data analysis

Using portions of data (samples) to make inferences about the larger context, and visualizing the data by presenting it in tables, graphs, and even animations.
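A minimal R sketch of using a sample to make an inference about the larger context. The "population" here is simulated (an assumption for illustration only), and the sample size of 200 is arbitrary.

```r
# Made-up "population": 10,000 simulated ages (an assumption for illustration).
set.seed(42)                       # make the random draws repeatable
population <- sample(1:90, size = 10000, replace = TRUE)

# Take a small sample and use it to estimate the population mean.
mySample <- sample(population, size = 200)
sampleMean <- mean(mySample)
populationMean <- mean(population)

# Present the comparison as a small table.
print(data.frame(sampleMean, populationMean))
```

The sample mean will usually land close to the population mean, even though only a small portion of the data was examined.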

7

Data archiving

Preservation of collected data in a form that makes it highly reusable. "Data curation" is a difficult challenge because it is so hard to anticipate all of the future uses of the data.

Example (Twitter):

Geocodes: data that show the geographical location from which a tweet was sent could be a useful element to store with the data.
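As a rough sketch of the geocode idea in R: the tweets, coordinates, and column names below are all made up for illustration and do not reflect Twitter's actual data format.

```r
# Made-up tweets with hypothetical geocode columns (latitude, longitude).
tweets <- data.frame(
  text = c("Good morning!", "Traffic is bad today"),
  latitude = c(40.71, 34.05),
  longitude = c(-74.01, -118.24),
  stringsAsFactors = FALSE
)

# Because the location was archived with each tweet, a later analysis
# can filter by place even if that use was not planned at collection time.
eastOfMinus100 <- tweets[tweets$longitude > -100, ]
print(eastOfMinus100$text)
```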

8

Learning the application domain

Communicating with data users

Seeing the big picture of a complex system

Knowing how data can be represented: metadata

Data transformation and analysis

Visualization and presentation

Attention to quality

Ethical reasoning: privacy

9

About Data

Data comes from the Latin word "datum," meaning a "thing given."

10

za15id05v2005kamel

11

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point”

CLAUDE SHANNON

Yes = 1

No = 0

Maybe = 01

ASCII
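The bit coding on this slide can be explored in R. A small sketch, where `bitsNeeded` is a hypothetical helper written for this example: one bit is enough for two options, a third option such as "maybe" needs a second bit, and ASCII stores each character as a number made of bits.

```r
# One bit distinguishes two options (yes/no); three options need two bits.
# bitsNeeded is a hypothetical helper, not a built-in R function.
bitsNeeded <- function(nOptions) ceiling(log2(nOptions))
bitsNeeded(2)   # yes / no
bitsNeeded(3)   # yes / no / maybe

# ASCII assigns each character a number, which is stored as bits.
codeForA <- utf8ToInt("A")                             # 65 in ASCII
bitsForA <- rev(as.integer(intToBits(codeForA))[1:8])  # lowest 8 bits, high bit first
bitString <- paste(bitsForA, collapse = "")            # "01000001"
print(bitString)
```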

12

Identifying Data Problems

Data science is an applied activity, and data scientists serve the needs and solve the problems of data users.

Hint:

The data scientist may never actually become a farmer, but if you are going to identify a data problem that a farmer has, you have to learn to think like a farmer, to some degree.

3 questions:

Ask subject matter experts

Ask about anomalies

Ask about risks and uncertainty

13

Introduction to R

R is an open source software program: an integrated suite of software facilities for data manipulation, calculation, and graphical display. Among other things, it has:

an effective data handling and storage facility.

a suite of operators for calculations on arrays, in particular matrices,

a large, coherent, integrated collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either directly at the computer or on hardcopy.

14

Additional Pros:

R was among the first analysis programs to integrate capabilities for drawing data directly from the Twitter® social media platform.

The extensibility of R means that new modules are being added all the time by volunteers.

The lessons one learns in working with R are almost universally applicable to other programs and environments.

15

Cons:

R is "command line" oriented.

R is not especially good at giving feedback or error messages.

16

How to write a text:

myText <- "this is a piece of text"

Create a data set:

myFamilyAges <- c(43, 42, 12, 8, 5)

c(): Concatenates data elements together

Assignment arrow: <-

Some mathematical functions:

sum(): Adds data elements

range(): Min value and max value

mean(): The average
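Put together, the commands on this slide can be typed at the R prompt as follows. The values in the comments are what these functions return for this particular data set.

```r
# Store a piece of text and a small data set using the assignment arrow.
myText <- "this is a piece of text"
myFamilyAges <- c(43, 42, 12, 8, 5)   # c() concatenates the five ages

# Apply the three functions to the data set.
sum(myFamilyAges)    # 110
range(myFamilyAges)  # 5 43
mean(myFamilyAges)   # 22
```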

17

18