Introduction to Big Data
Chapter 1 & 2 (Week 1): Course overview & introduction
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok Seo
01. Course Overview (Introduction to Big Data)
Contents
1. Course Overview
   Brief introduction of professor & course
   Object & Aim of the course
   Assignments & Quiz
   Evaluation
2. Introduction to Big Data
   Definition of Big Data
   Key techniques in Data Science
   Core technology of Informatics
Copyright ⓒ 2018 All rights reserved by Korea University
Course Overview: Course information
Introduction to Big Data, DCCS208(02), Fall 2019.
Lecture time: Wed. (6,7) and Thu. (6)
Location: Wed. (7-310) and Thu. (7-315)
Completion division: Major elective subject
Level: Junior / Senior
Course Overview: Definition of Big Data (Cont.)
Which is bigger, an elephant or a rat?
Course Overview: Definition of Big Data (Cont.)
What is Data?
Columns: Attributes (Dimensions; Features; Variables)
Rows: Objects (Samples; Individuals)

ID         Height   Weight   Age
Student 1  189 cm   81 kg    24
Student 2  210 cm   90 kg    26
Student 3  191 cm   92 kg    27
…          …        …        …
Student N  162 cm   71 kg    21
Course Overview: Definition of Big Data (Cont.)
In a narrow sense, Big Data refers only to a large sample size (many objects).
In a broad sense, Big Data refers to both a large sample size and high dimensionality (many attributes).
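The two quantities in this definition can be read directly off a data table. The following is a minimal sketch in Python (one of the two languages named in these slides; the course itself uses R). It reuses the first three student rows from the example table. The function name `shape` and the column names are hypothetical, chosen only for illustration.

```python
# Toy table: each row is an object (sample), each key is an attribute.
students = [
    {"id": "Student 1", "height_cm": 189, "weight_kg": 81, "age": 24},
    {"id": "Student 2", "height_cm": 210, "weight_kg": 90, "age": 26},
    {"id": "Student 3", "height_cm": 191, "weight_kg": 92, "age": 27},
]


def shape(rows, id_key="id"):
    """Return (sample size, dimensionality) for a list of row dictionaries."""
    n_samples = len(rows)
    # Every column except the identifier counts as an attribute.
    attributes = {key for row in rows for key in row if key != id_key}
    return n_samples, len(attributes)


n, p = shape(students)
print(f"sample size n = {n}, dimensionality p = {p}")
```

In the narrow sense only `n` has to be large; in the broad sense either `n` or `p` (or both) can make the data "big".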
Course Overview: Definition of Big Data (Cont.)
3V’s (Volume, Velocity, and Variety)
Course Overview: Definition of Big Data (Cont.)
5V’s (Volume, Velocity, Variety, Veracity, and Value)
Volume: Data size
Velocity: Data production speed
Variety: Data originating from various sources
Veracity: Data accuracy (trustworthiness)
Value: Data value
Course Overview: Relationship between Big Data & Data Science
The amount of data and information is not directly correlated with knowledge generation.
But the demand for data scientists will keep growing.
Course Overview: Job market of Big Data
Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham
It is time to prepare an academic course that trains data analysts in numbers commensurate with demand.
Course Overview: Object & Aim of the course
Students who take this course can expect to learn:
Introduction to Big Data
Concept of Big Data
Computational approaches for Big Data
Statistical approaches for Big Data
Visualization for Big Data
R programming
Basic Skill in Data Science
Course Overview: Course schedule (Before Mid-term exam)
Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, computer software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam
Course Overview: Course schedule (After Mid-term exam)
Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam
Course Overview: Two types of lectures per week
There are two representative computer languages for Big Data analysis, R and Python.
R will be used in this class.
No prior knowledge of the R language is required, because I plan to provide example code for students to practice with.
https://cran.r-project.org/
Wed. (2 hrs): Lecture for theory
Thu. (1 hr): Hands-on lecture
The methods learned in the theory class will be practiced in the computer lab on Thursday.
Course Overview: Exam, Quiz, and Homework
Quiz
There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).
Homework
There will be four assignments.
Each will be a report on the theory and practice of data analysis learned in class.
Midterm and Final exams
There will be two exams.
They will test your understanding of the basic computational and statistical algorithms.
Course Overview: Evaluation plan
Absolute grading system
Score ≥ 95, you will get A+
Score ≥ 90, you will get A
Score ≥ 85, you will get B+
and...
Components: Midterm | Final | Quiz | Assignment | Attendance
Weights:    30% | 30% | 10% | 20% | 10%
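As a rough illustration of how the evaluation plan combines into a final grade, here is a minimal sketch in Python (one of the two languages named in these slides; the course itself uses R). It assumes the five weights pair with the five components in the order they are listed, and it encodes only the three cutoffs stated above, since the lower cutoffs are not given. The names `WEIGHTS_PERCENT`, `total_score`, and `letter_grade` are hypothetical, and the example scores are made up.

```python
# Assumption: weights pair with components in the order listed on the slide.
WEIGHTS_PERCENT = {
    "midterm": 30,
    "final": 30,
    "quiz": 10,
    "assignment": 20,
    "attendance": 10,
}

# Only the cutoffs stated on the slide; lower grades are not specified there.
CUTOFFS = [(95, "A+"), (90, "A"), (85, "B+")]


def total_score(scores):
    """Weighted total on a 0-100 scale; each component score is on a 0-100 scale."""
    return sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT) / 100


def letter_grade(total):
    """Return the letter grade, or None when the total is below the stated cutoffs."""
    for cutoff, grade in CUTOFFS:
        if total >= cutoff:
            return grade
    return None


# Made-up scores for one student.
example = {"midterm": 90, "final": 95, "quiz": 100, "assignment": 85, "attendance": 100}
total = total_score(example)
print(total, letter_grade(total))
```

Because the grading is absolute, the result depends only on the student's own weighted total, not on how classmates perform.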
Course Overview: Textbook
No Textbook
This course will proceed based on the presentation slides.
I will upload the presentation slides to Blackboard and my homepage.
Homepage: https://scholar.harvard.edu/msseo
Teaching >> Introduction to Big Data >> Related Materials
Reference 1 (Kor. Version)
R for Practical Data Analysis
(free online textbook)
http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
Reference 2 (Eng. Version)
Introduction to Data Science by Rafael A. Irizarry, 2019.
(free online textbook)
https://rafalab.github.io/dsbook/
Reference 3 (Eng. Version)
R for Data Science by Hadley Wickham and Garrett Grolemund.
(free online textbook)
https://r4ds.had.co.nz/
Course Overview: Contact information
Prof. Minseok Seo
Location: 7-203
Tel: 044-860-1379
Email: [email protected]
TA. Heechan Chae
Location: 7-328
Email: [email protected]
If you have any questions about the course, please email me and I will reply as soon as I see it.
If you need to meet in person, please make an appointment by email first.
I am available Mon: 12:00 - 17:00 | Wed: 10:00 - 13:00 | Thu: 10:00 - 13:00.
End of Orientation
Contents
1. Course Overview
   Brief introduction of professor & course
   Object & Aim of the course
   Assignments & Quiz
   Evaluation
2. Introduction to Big Data
   Concept of Big Data
   Key techniques in Data Science for Big Data
Characteristics of Big Data: Recap of the concept of Big Data
5V’s (Volume, Velocity, Variety, Veracity, and Value)
Volume: Data size
Velocity: Data production speed
Variety: Data originating from various sources
Veracity: Data accuracy (trustworthiness)
Value: Data value
Petabyte era
One company transferred about 197 PB of data through its network each day (2018).
Another processed about 24 PB daily (2009).
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)
1000 PB = 1 exabyte (EB)
In fact, we can say that we have already entered the exabyte
era.
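The unit arithmetic above can be checked with a few lines of code. This is a minimal sketch in Python (one of the two languages named in these slides; the course itself uses R). It uses decimal units as on the slide (1 PB = 10^15 bytes). The table `UNITS` and the function `convert` are hypothetical names, and the yearly total assumes the 197 PB daily figure holds for all 365 days.

```python
# Decimal (SI) units, matching 1 PB = 10^15 bytes on the slide.
UNITS = {
    "B": 1,
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value, from_unit, to_unit):
    """Convert a data size between decimal (SI) units."""
    return value * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "PB", "TB"))      # petabytes to terabytes
print(convert(1000, "PB", "EB"))   # petabytes to exabytes

# Assumption: 197 PB per day sustained over a 365-day year, expressed in EB.
print(convert(197 * 365, "PB", "EB"))
```

Under that assumption a single network carries roughly 72 EB per year, which is the sense in which the exabyte era has already begun.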
Characteristics of Big Data: How do you recognize if it's big data or not?
Computer Scientist
"My computer is running low on memory while handling this data!!" That is Big Data.
"No!!!! This data is over 2 TB. Where do I store it?????" That is Big Data.
In short, if data processing on your computer leaves you in a panic, the cause is probably Big Data.
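Whether a data set will exhaust memory or storage can be estimated before loading it. The following is a minimal sketch in Python (one of the two languages named in these slides; the course itself uses R). It assumes a dense numeric table in which every value takes 8 bytes (a double-precision number). The 16 GB memory size and both table shapes are hypothetical, while the 2 TB limit is the figure from the slide.

```python
GB = 10**9
TB = 10**12


def table_size_bytes(n_samples, n_features, bytes_per_value=8):
    """Bytes needed for a dense numeric table, assuming 8-byte values by default."""
    return n_samples * n_features * bytes_per_value


def fits(n_samples, n_features, available_bytes):
    """True when the dense table fits in the given memory or storage budget."""
    return table_size_bytes(n_samples, n_features) <= available_bytes


ram = 16 * GB   # hypothetical laptop memory
disk = 2 * TB   # the 2 TB figure from the slide

small = (1_000_000, 100)       # one million samples, 100 features
large = (1_000_000_000, 500)   # one billion samples, 500 features

for name, (n, p) in {"small": small, "large": large}.items():
    size = table_size_bytes(n, p)
    print(name, size / GB, "GB",
          "| fits in RAM:", fits(n, p, ram),
          "| fits on disk:", fits(n, p, disk))
```

Both the sample size and the dimensionality enter the product, so growth in either one can push a data set past what a single machine can hold.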
Characteristics of Big Data: How do you recognize if it's big data or not?
Statistician
"When will this calculation end? I have already been waiting for 10 years..."
"The dimensionality is too high!!!! I can't build a statistical model using this data!!!"
That is Big Data.
In short, if data analysis on your computer leaves you in a panic, the cause is probably Big Data.
Core technologies of the Big Data era: IT technologies to resolve issues arising from Big Data
Difficulties arise in both hardware and software.
Software: Prescreening techniques, Data visualization, Feature selection
Hardware: Parallel processing, Cloud computing, Distributed processing
But students can tackle the software difficulties.
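Parallel processing, listed above, follows a split, apply, and combine pattern: the data is cut into chunks, each chunk is processed independently, and the partial results are merged. The following is a minimal sketch in Python (one of the two languages named in these slides; the course itself uses R). The function names `split` and `parallel_sum_of_squares` are hypothetical. Threads are used only to keep the sketch portable; for CPU-bound work in CPython, process-based workers (for example `ProcessPoolExecutor`) or a cluster would normally be needed to get a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done independently on each chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Split the data, process the chunks concurrently, then combine the results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1, 1001))
print(parallel_sum_of_squares(data))
```

Distributed processing applies the same pattern across several machines instead of several workers on one machine.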
Computational language for Big Data: R and Python
There are two representative computer languages for Big Data analysis, R and Python.
The R programming language (free and relatively easy) is used for the hands-on lectures.
Let's visit the R homepage:
https://cran.r-project.org/
Wed. (2 hrs): Lecture for theory
Thu. (1 hr): Hands-on lecture
Install R: (Step 1) Download the R installer
Install R: (Step 2) Download RStudio
Download RStudio from https://www.rstudio.com/products/rstudio/download/
Install R: (Step 3) Install R and RStudio
What is R?
R is an interpreted computer language.
Procedures written in C, C++, and other languages can be interfaced with R for efficiency.
System commands can be called from within R.
R is used for data manipulation, statistics, and graphics.
R, S, and S-plus (History of R)
S: an interactive environment for data analysis, developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: “S-plus”.
Implementation languages: C, Fortran.
R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
Since 1997: an international “R-core” team of about 15 people with access to a common CVS archive.
What R does and does not do
Possible
(1) Data handling and storage: numeric, textual
(2) Matrix algebra
(3) Hash tables and regular expressions
(4) High-level data analytic and statistical functions
(5) OOP (classes)
(6) Graphics
(7) Programming language: loops, branching, subroutines, etc.
Impossible
(1) R is not a database, but it connects to DBMSs
(2) R has no GUI, but it connects to Java and Tcl/Tk
(3) R can be very slow, but it allows you to call your own C/C++ code
(4) R has no spreadsheet view of data, but it connects to Excel/MS Office
(5) R has no professional or commercial support
However, every R user in the world can be a developer (the power of collective intelligence).
If you write a meaningful package, you can publish it almost immediately.
As a result, the latest algorithms often become available in R sooner than in other programming languages.
End of Slide