Introduction to Big Data

Chapter 1 & 2 (Week 1): Course overview & introduction

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo

[email protected]

01 Course Overview: Introduction to Big Data

Contents

1. Course Overview

Brief introduction of professor & course

Object & Aim of the course

Assignments & Quiz

Evaluation

2. Introduction to Big Data

Definition of Big Data

Key techniques in Data Science

Core technology of Informatics

Copyright ⓒ 2018 All rights reserved by Korea University

Course Overview: Course information

Introduction to Big Data, DCCS208(02), Fall 2019.

Lecture time: Wed. (6,7) and Thu. (6)

Location: Wed. (7-310) and Thu. (7-315)

Course category: Major elective

Level: Junior / Senior

Course Overview: Definition of Big Data (Cont.)

Which is bigger, an elephant or a rat?

[Figure: an elephant vs. a rat]

Course Overview: Definition of Big Data (Cont.)

What is Data?

Columns are attributes (dimensions; features; variables). Rows are objects (samples; individuals).

ID          Height   Weight   Age
Student 1   189 cm   81 kg    24
Student 2   210 cm   90 kg    26
Student 3   191 cm   92 kg    27
…           …        …        …
Student N   162 cm   71 kg    21
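The objects-by-attributes layout above maps directly onto an R data frame. The following is a minimal sketch, assuming the column names `height_cm`, `weight_kg`, and `age` (they are illustrative, not from the slide) and using only the three students whose values are fully listed.

```r
# Rows are objects (students); columns are attributes.
# Only the three fully listed students from the slide are included.
students <- data.frame(
  id        = c("Student 1", "Student 2", "Student 3"),
  height_cm = c(189, 210, 191),
  weight_kg = c(81, 90, 92),
  age       = c(24, 26, 27)
)

n_objects    <- nrow(students)       # sample size (number of rows)
n_attributes <- ncol(students) - 1   # dimensionality, excluding the id column

print(students)
cat("Sample size:", n_objects, "\n")
cat("Dimensionality:", n_attributes, "\n")
```

Here `nrow()` counts the objects and `ncol()` counts the columns, so the two numbers correspond to sample size and dimensionality.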

Course Overview: Definition of Big Data (Cont.)

In a narrow sense, Big Data refers only to a large sample size (many objects).

In a broad sense, Big Data refers to both a large sample size and high dimensionality (many attributes).

Course Overview: Definition of Big Data (Cont.)

3V’s (Volume, Velocity, and Variety)

Course Overview: Definition of Big Data (Cont.)

5V’s (Volume, Velocity, Variety, Veracity, and Value)

Volume: Data size

Velocity: Data production speed

Variety: Data originating from various sources

Veracity: Data accuracy (trustworthiness)

Value: Data value

Course Overview: Relationship between Big Data & Data Science

The amount of data and information is not directly correlated with knowledge generation.

However, the demand for data scientists will keep growing.

Course Overview: Job market of Big Data

Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham

It is time to prepare academic courses that train data analysts in numbers commensurate with demand.

Course Overview: Object & Aim of the course

Students who take this course can expect to learn:

Introduction to Big Data

Concept of Big Data

Computational approaches for Big Data

Statistical approaches for Big Data

Visualization for Big Data

R programming

Basic Skill in Data Science

Course Overview: Course schedule (Before Mid-term exam)

Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, computer software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam

Course Overview: Course schedule (After Mid-term exam)

Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam

Course Overview: Two types of lectures per week

There are two representative programming languages for Big Data analysis: R and Python.

R will be used in this class.

No prior knowledge of the R language is required, because I plan to provide example code for students to practice with.

https://cran.r-project.org/

Wed. (2 hrs): Lecture for theory

Thu. (1 hr): Hands-on lecture

The methods learned in the theory class will be practiced in the computer lab on Thursday.

Course Overview: Exam, Quiz, and Homework

Quiz

There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).

Homework

There will be four assignments.

Each will be a report on the theory and practice of data analysis learned in class.

Midterm and Final exams

There will be two exams.

They will test your understanding of the basic computational and statistical algorithms.

Course Overview: Evaluation plan

Absolute grading system

Score ≥ 95, you will get A+

Score ≥ 90, you will get A

Score ≥ 85, you will get B+

and...

Component    Weight
Midterm      30%
Final        30%
Quiz         10%
Assignment   20%
Attendance   10%
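The weighting above amounts to a weighted sum of component scores. The sketch below assumes the percentages pair with the components in the order they are listed, and the student scores are hypothetical values made up for illustration. Only the three cutoffs stated on the slide are encoded; grades below B+ are left open because the slide does not give them.

```r
# Component weights, paired with components in listed order (an assumption).
weights <- c(midterm = 0.30, final = 0.30, quiz = 0.10,
             assignment = 0.20, attendance = 0.10)

# Hypothetical component scores for one student, each on a 0-100 scale.
scores <- c(midterm = 90, final = 95, quiz = 100,
            assignment = 90, attendance = 100)

# Weighted total; names are matched explicitly so the order cannot drift.
total <- sum(weights * scores[names(weights)])

# Only the cutoffs stated on the slide are encoded here.
letter <- if (total >= 95) {
  "A+"
} else if (total >= 90) {
  "A"
} else if (total >= 85) {
  "B+"
} else {
  "below B+ (cutoff not stated on the slide)"
}

cat("Total:", total, "Grade:", letter, "\n")
```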

Course Overview: Textbook

No Textbook

This course will proceed based on the presentation slides.

I will upload the presentation slides to Blackboard and my homepage.

Homepage: https://scholar.harvard.edu/msseo

Teaching >> Introduction to Big Data >> Related Materials

Reference 1 (Kor. Version)

R for Practical Data Analysis

(free online textbook)

http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf

Reference 2 (Eng. Version)

Introduction to Data Science by Rafael A. Irizarry, 2019.

(free online textbook)

https://rafalab.github.io/dsbook/

Reference 3 (Eng. Version)

R for Data Science by Hadley Wickham and Garrett Grolemund.

(free online textbook)

https://r4ds.had.co.nz/

Course Overview: Contact information

Prof. Minseok Seo

Location: 7-203

Tel: 044-860-1379

Email: [email protected]

TA. Heechan Chae

Location: 7-328

Email: [email protected]

If you have any questions about the course, please email me and I will reply as soon as I see it.

If you need to meet in person, please make an appointment by email first.

I will be available Mon: 12:00 - 17:00 | Wed: 10:00 - 13:00 | Thu: 10:00 - 13:00.

End of Orientation

Contents

1. Course Overview

Brief introduction of professor & course

Object & Aim of the course

Assignments & Quiz

Evaluation

2. Introduction to Big Data

Concept of Big Data

Key techniques in Data Science for Big Data

Characteristics of Big Data: Recap of the concept of Big Data

5V’s (Volume, Velocity, Variety, Veracity, and Value)

Volume: Data size

Velocity: Data production speed

Variety: Data originating from various sources

Veracity: Data accuracy (trustworthiness)

Value: Data value

Petabyte era

[Company logo] transferred about 197 PB of data through its network each day (2018)

[Company logo] processed about 24 petabytes daily (2009)

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)

1,000 PB = 1 exabyte (EB)

In fact, we can say that we have already entered the exabyte era.
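The unit conversions above can be checked with a few lines of arithmetic. This sketch uses only the 197 PB per day figure and the decimal unit definitions from the slide; the variable names are illustrative. It shows why the exabyte claim is plausible: at that rate a single network moves one exabyte in roughly five days.

```r
# Decimal (SI) unit sizes in bytes, as defined on the slide.
bytes_per_tb <- 1e12
bytes_per_pb <- 1e15
bytes_per_eb <- 1e18

daily_pb <- 197   # daily traffic figure quoted for 2018

# The same daily volume expressed in terabytes.
daily_tb <- daily_pb * bytes_per_pb / bytes_per_tb

# Days needed to accumulate one exabyte at this rate.
days_to_one_eb <- bytes_per_eb / (daily_pb * bytes_per_pb)

# Yearly volume in exabytes, assuming a constant rate over 365 days.
yearly_eb <- daily_pb * 365 * bytes_per_pb / bytes_per_eb

cat("Daily volume in TB:", daily_tb, "\n")
cat("Days to reach 1 EB:", round(days_to_one_eb, 2), "\n")
cat("Yearly volume in EB:", yearly_eb, "\n")
```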

Characteristics of Big Data: How do you recognize whether it is Big Data or not?

Computer Scientist

"My computer is low on memory for handling this data!!"

That is Big Data.

"No!!!! This data is over 2 TB. Where do I store it?????"

That is Big Data.

In short, if processing data on your computer is driving you into a panic, it is probably because of Big Data.
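One rough way to anticipate the memory problem described above is to estimate a dataset's size before loading it. The sketch below assumes every value is stored as an R double (8 bytes) and uses hypothetical dimensions chosen only for illustration; real objects carry a small amount of extra overhead.

```r
# Hypothetical dataset dimensions, chosen only for illustration.
n_rows <- 1e9    # objects (samples)
n_cols <- 100    # attributes (features)

bytes_per_double <- 8                        # size of one R numeric value
estimated_bytes  <- n_rows * n_cols * bytes_per_double
estimated_gb     <- estimated_bytes / 1e9    # decimal gigabytes

# Check the 8-bytes-per-value assumption on a small vector that does fit.
small <- numeric(1e6)
small_bytes <- as.numeric(object.size(small))

cat("Estimated size:", estimated_gb, "GB\n")
cat("Measured size of 1e6 doubles:", small_bytes, "bytes\n")
```

If the estimate exceeds the memory of the machine, the data cannot be held in a single in-memory object and has to be processed in pieces or on other hardware.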

Characteristics of Big Data: How do you recognize whether it is Big Data or not?

Statistician

"When will this calculation end? I have been waiting for 10 years ..."

"Dimensionality is too high!!!! I can’t build a statistical model using this data!!!"

That is Big Data.

In short, if analyzing data on your computer is driving you into a panic, it is probably because of Big Data.
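The statistician's complaint about dimensionality can be reproduced on a tiny scale. The sketch below uses simulated random data (not course data) with more attributes than objects, and fits an ordinary linear model with `lm()`. It is meant only as an illustration of why a standard model breaks down when dimensionality exceeds sample size.

```r
set.seed(1)   # fixed seed so the sketch is reproducible
n <- 5        # sample size (objects)
p <- 10       # dimensionality (attributes)

# Simulated predictors and response, purely for illustration.
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)

# Ordinary least squares has p + 1 unknowns (intercept included) but only
# n observations, so lm() cannot estimate all of them and reports NA.
fit <- lm(y ~ x)
n_unestimated <- sum(is.na(coef(fit)))

cat("Coefficients requested:", length(coef(fit)), "\n")
cat("Coefficients left as NA:", n_unestimated, "\n")
```

With 5 observations only 5 coefficients can be estimated, so the remaining ones come back as `NA`.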

Core technologies of the Big Data era: IT technologies to resolve issues arising from Big Data

Difficulties arise in both hardware and software.

Software: Prescreening techniques, Data visualization, Feature selection

Hardware: Parallel processing, Cloud computing, Distributed processing

However, students can tackle the software difficulties.
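As one concrete software-side technique, prescreening can be as simple as dropping attributes that barely vary. The sketch below is a variance filter on simulated data; the filter, the column names, and the threshold of 0.1 are all assumptions chosen for illustration and are not necessarily the method taught in this course.

```r
set.seed(42)   # fixed seed so the sketch is reproducible
n <- 100

# Simulated data: two informative attributes and two uninformative ones.
data <- data.frame(
  informative_1 = rnorm(n, mean = 0, sd = 5),
  informative_2 = rnorm(n, mean = 10, sd = 3),
  constant      = rep(1, n),               # carries no information
  near_constant = c(rep(0, n - 1), 0.01)   # almost no variation
)

# Keep only attributes whose variance exceeds a hypothetical threshold.
threshold <- 0.1
variances <- sapply(data, var)
kept      <- names(variances)[variances > threshold]
screened  <- data[, kept, drop = FALSE]

cat("Attributes before:", ncol(data), "\n")
cat("Attributes after: ", ncol(screened), "\n")
```

Reducing the number of attributes this way lowers both the memory footprint and the dimensionality that later models have to handle.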

Computational language for Big Data: R and Python

There are two representative programming languages for Big Data analysis: R and Python.

The R programming language (free and relatively easy) will be used for the hands-on lectures.

Let’s visit the R homepage:

https://cran.r-project.org/

Wed. (2 hrs): Lecture for theory

Thu. (1 hr): Hands-on lecture

Install R: (Step 1) Download the R installer

Install R: (Step 2) Download RStudio

Download RStudio from https://www.rstudio.com/products/rstudio/download/

Install R: (Step 3) Install R and RStudio

What is R

R is an interpreted computer language.

Procedures written in C, C++, and other languages can be interfaced for efficiency.

System commands can be called from within R.

R is used for data manipulation, statistics, and graphics.
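The three uses named above (data manipulation, statistics, and graphics) can each be shown in a line or two. The sketch below uses `mtcars`, a small dataset that ships with R, so nothing needs to be downloaded; the choice of dataset and of a linear model is purely illustrative.

```r
# 'mtcars' ships with R (datasets package).
data(mtcars)

# Data manipulation: keep cars with 4 cylinders and add a derived column.
four_cyl <- subset(mtcars, cyl == 4)
four_cyl$km_per_litre <- four_cyl$mpg * 0.425144   # US mpg to km per litre

# Statistics: a summary value and a simple linear model of mpg on weight.
mean_mpg <- mean(mtcars$mpg)
fit <- lm(mpg ~ wt, data = mtcars)

# Graphics: scatter plot with the fitted regression line.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

cat("Four-cylinder cars:", nrow(four_cyl), "\n")
cat("Mean mpg:", round(mean_mpg, 2), "\n")
```

When run in RStudio, the plot appears in the Plots pane; when run as a script, R writes it to a file instead.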

R, S, and S-plus (History of R)

S: an interactive environment for data analysis, developed at Bell Laboratories since 1976

1988 - S2: RA Becker, JM Chambers, A Wilks

1992 - S3: JM Chambers, TJ Hastie

1998 - S4: JM Chambers

Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: “S-plus”.

Implementation languages: C, Fortran.

R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.

Since 1997: international “R-core” team of ca. 15 people with access to a common CVS archive.

What R does and does not do

Possible

(1) data handling and storage: numeric, textual

(2) matrix algebra

(3) hash tables and regular expressions

(4) high-level data analytic and statistical functions

(5) OOP (classes)

(6) Graphics

(7) Programming language: loops, branching, subroutines, etc.
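A few of the capabilities listed above can be tried directly in the console. The sketch below touches items (2), (3), and (7); the matrix, the key-value pair, and the function name `count_even` are made up for illustration, and the course code string is the one from the title slide.

```r
# (2) Matrix algebra: solve the linear system A %*% x = b.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
x <- solve(A, b)

# (3) Regular expressions: extract the first run of digits from a course code.
code <- "DCCS208(02)"
course_number <- regmatches(code, regexpr("[0-9]+", code))

# (3) Hash table: an environment maps keys to values.
capitals <- new.env(hash = TRUE)
assign("Korea", "Seoul", envir = capitals)
capital_of_korea <- get("Korea", envir = capitals)

# (7) Programming constructs: a loop with branching inside a function.
count_even <- function(values) {
  n_even <- 0
  for (v in values) {
    if (v %% 2 == 0) n_even <- n_even + 1
  }
  n_even
}

cat("Solution x:", x, "\n")
cat("Course number:", course_number, "\n")
cat("Even numbers in 1..10:", count_even(1:10), "\n")
```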

Impossible

(1) R is not a database, but connects to DBMSs

(2) R has no GUI, but connects to Java, Tcl/Tk

(3) R is fundamentally very slow, but allows you to call your own C/C++ code

(4) R has no spreadsheet view of data, but connects to Excel/MS Office

(5) R has no professional or commercial support

But all R users in the world are developers (the power of collective intelligence).

If you make a meaningful package, you can publish it right away.

Therefore, the latest algorithms become available in R faster than in any other programming language.

Install R: (Step 3) Install R and RStudio

End of Slide