Introduction to Big Data - Harvard University… · Introduction to Big Data Chapter 3 & 4 (Week...

Introduction to Big Data Chapter 3 & 4 (Week 2) Applications of Big Data DCCS208(02) Korea University 2019 Fall Asst. Prof. Minseok Seo [email protected]


Introduction to Big Data

Chapter 3 & 4 (Week 2): Applications of Big Data

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo [email protected]


Contents

1. Applications of Big Data

• Diverse applications

2. Workflow of Data Science

• General workflow

• 1st step (Problem definition)

• Practice to define problem

• Practice to imagine required data


Copyright ⓒ 2018 All rights reserved by Korea University

Applications: Natural language processing and voice recognition


Applications: Netflix & YouTube recommendation system


Applications: Chatbot

fXck yoX!!! <- Please don't try that, for the sake of the chatbot's future.


Applications: Color recovery for B&W pictures


Applications: Resolution recovery for poor-quality pictures


Applications: Motion detection

https://www.youtube.com/watch?v=pW6nZXeWlGM&feature=youtu.be


Applications: Image captioning


Applications: New image generation


Applications: Autonomous car


Applications: Robotics


But what if I asked you to build these things right now?


Main goal of this course: Mindset of this course

A journey of a thousand miles begins with a single step!

After taking this class, you should be able to:

• identify which types of data these applications require.

• imagine the data structure for these applications.

• know which techniques these applications require, even if you don't know their exact mathematical or statistical formulas.


General workflow for Data Science: Diagram of workflow


Another workflow for Data Science: Diagram of workflow


1st step for Big Data Science: Problem definition

This is the first step in any piece of work.

• You can define a problem by talking with someone.

• You can also identify issues while wrestling with a question yourself.

• Someone else may tell you what they find inconvenient.

• You can read news articles and come up with new ideas.

• Ideas can come to mind during unrelated activities.

...


What was inconvenient? Think about why this technology came about


Applications: Netflix & YouTube recommendation system


Applications: Chatbot


Applications: Color recovery for B&W pictures


Workflow for Data Science: Diagram of workflow


Workflow for Data Science: Diagram of workflow


2nd step for Big Data Science: Experimental design for getting data

Before attempting to collect data, think about what data you need to collect for your purpose.

• Which features are important?

• Number of features

• Number of samples

• Types of features

• Target individuals

• ...

This topic is covered in the "Experimental Design" course in the Department of Statistics.


Tabular Data: Structured data

What is a table?

• A table is a collection of rows and columns

• Each row has an index

• Each column has a name

• A cell is specified by an (index, name) pair

• A cell may or may not have a value
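The definition above can be sketched in a few lines of plain Python. This is only an illustration, not how any real library stores tables: the `Table` class, its method names, and the sample values are all hypothetical.

```python
from typing import Any, Dict, Hashable, List, Optional


class Table:
    """A tiny illustrative table: named columns, indexed rows, optional cell values."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = list(columns)
        # Maps row index -> {column name -> value}; an absent key means "no value".
        self.rows: Dict[Hashable, Dict[str, Any]] = {}

    def set_cell(self, index: Hashable, name: str, value: Any) -> None:
        if name not in self.columns:
            raise KeyError(f"unknown column: {name}")
        self.rows.setdefault(index, {})[name] = value

    def get_cell(self, index: Hashable, name: str) -> Optional[Any]:
        if name not in self.columns:
            raise KeyError(f"unknown column: {name}")
        # Returns None when the row exists but the cell has no value.
        return self.rows[index].get(name)


table = Table(columns=["name", "age"])
table.set_cell(0, "name", "Jones")
table.set_cell(0, "age", 18)
table.set_cell(1, "name", "Smith")  # age is left empty on purpose

print(table.get_cell(0, "age"))  # 18
print(table.get_cell(1, "age"))  # None
```

Each cell is addressed by an (index, name) pair, and the second lookup shows a cell that exists in the table but holds no value.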


Tabular Data: Structured data


Tabular Data: CSV format (comma-separated values)
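As a rough illustration of the format, the sketch below parses a small CSV string with Python's standard `csv` module. The sample text is made up for this example and is not the content of the original slide.

```python
import csv
import io

# Hypothetical CSV text: the first line is the header (column names),
# and every following line is one row of comma-separated cells.
csv_text = """sid,name,age
53666,Jones,18
53688,Smith,18
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)      # ['sid', 'name', 'age']
print(rows[0]["name"])        # Jones
print(type(rows[0]["age"]))   # <class 'str'>
```

Every value comes back as a string: a CSV file stores column names but no type information, so the reader has to decide how to interpret each column.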


The structure spectrum: Structured or not

Structured (schema-first)

• Relational Database

• Formatted Messages

Semi-Structured (schema-later)

• Documents

• XML

• Tagged Text/Media

Unstructured (schema-never)

• Plain Text

• Media
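One way to picture the spectrum is to express the same fact at each level of structure. The sketch below is an assumption-laden illustration: the record, the XML snippet, and the sentence are invented for this example.

```python
import xml.etree.ElementTree as ET

# Structured (schema-first): column names and order are fixed before any data arrives.
schema = ("sid", "name", "age")
record = ("53666", "Jones", 18)
structured = dict(zip(schema, record))

# Semi-structured (schema-later): tags travel with the data, and elements may vary per record.
xml_text = "<student><sid>53666</sid><name>Jones</name><club>chess</club></student>"
root = ET.fromstring(xml_text)
semi_structured = {child.tag: child.text for child in root}

# Unstructured (schema-never): plain text; any structure must be inferred by the reader.
unstructured = "Jones (student 53666) joined the chess club this fall."

print(structured["name"])        # Jones
print(sorted(semi_structured))   # ['club', 'name', 'sid']
print("Jones" in unstructured)   # True
```

The structured record can be queried by field name immediately, the XML record carries its own field names, and the plain text would need parsing or a model before a program could answer the same question.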


When people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That’s really all the word database means.

-- Jim Gray, “The Fourth Paradigm”


Key concept: Structured data

A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using a given data model.


The Relational Model: Structured data

The Relational Model is ubiquitous

• MySQL, PostgreSQL, Oracle, DB2, SQL Server, …

Foundational work done at

• IBM Santa Teresa Labs (now IBM Almaden): "System R"

• UC Berkeley CS: the "Ingres" system

Object-oriented concepts have been merged in

• Early work: POSTGRES research project at Berkeley

As has support for XML (semi-structured data)

E. F. "Ted" Codd, Turing Award 1981


Example: Instance of student relation

sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8

CREATE TABLE Students(
    sid   CHAR(20),
    name  CHAR(20),
    login CHAR(10),
    age   INTEGER,
    gpa   FLOAT
)
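To try the schema and the instance above, one option is Python's built-in `sqlite3` module. This is a minimal sketch, assuming an in-memory SQLite database; the slide does not say which database system was used, and SQLite does not enforce the `CHAR(n)` length limits. The `SELECT` query is an added illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE Students(
        sid   CHAR(20),
        name  CHAR(20),
        login CHAR(10),
        age   INTEGER,
        gpa   FLOAT
    )
    """
)

# The three rows from the student relation instance.
rows = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
]
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", rows)

# Because the schema names every column, queries can refer to columns by name.
result = conn.execute(
    "SELECT name, gpa FROM Students WHERE age = 18 ORDER BY gpa DESC"
).fetchall()
print(result)  # [('Jones', 3.4), ('Smith', 3.2)]
conn.close()
```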


Data model (Tabular): Python (pandas)

DataFrame: a dict of Series objects

• Each Series object represents a column

Series: a named, ordered dictionary

• The keys of the dictionary are the indexes

• Built on NumPy’s ndarray

• Values can be any NumPy data type object

Data stored in memory

Operations performed from the Python shell
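The data model described above can be seen directly in pandas. The sketch below assumes pandas is installed and reuses the student values from the earlier example; the variable names are illustrative.

```python
import pandas as pd

# A Series is a named, ordered mapping from index labels to values.
age = pd.Series([18, 18, 19], index=[53666, 53688, 53650], name="age")
gpa = pd.Series([3.4, 3.2, 3.8], index=[53666, 53688, 53650], name="gpa")

# A DataFrame behaves like a dict of Series that share one index.
students = pd.DataFrame({"age": age, "gpa": gpa})

print(age[53650])                                  # 19
print(list(students.columns))                      # ['age', 'gpa']
print(students.loc[53650, "gpa"])                  # 3.8
print(type(students["age"].to_numpy()).__name__)   # ndarray
```

Selecting a column with `students["age"]` returns a Series, and `to_numpy()` exposes the NumPy array that backs it.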


Operations (Tabular): Python

• integrate (join), transform, clean, impute

• aggregate: sum, count, average, max, min

• sort

• pivot

• Relational

  • union, intersection, difference, cartesian product (CROSS JOIN)

  • select/filter, project

  • join: natural join (INNER JOIN), theta join, semi-join, etc.

  • rename
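A few of the operations listed above might look like this in pandas. This is a sketch under assumptions: pandas is installed, the `students` values come from the earlier student example, and the `enrolled` table is hypothetical, invented here only to demonstrate a join.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "sid": [53666, 53688, 53650],
        "name": ["Jones", "Smith", "Smith"],
        "age": [18, 18, 19],
        "gpa": [3.4, 3.2, 3.8],
    }
)
# Hypothetical second table, made up for this example.
enrolled = pd.DataFrame({"sid": [53666, 53688], "course": ["DCCS208", "DCCS208"]})

# select/filter rows, then project onto two columns
older = students[students["age"] >= 19][["sid", "name"]]

# join: inner join on the shared key column
joined = students.merge(enrolled, on="sid", how="inner")

# aggregate: mean GPA per age group
mean_gpa = students.groupby("age")["gpa"].mean()

# sort by GPA, highest first
ranked = students.sort_values("gpa", ascending=False)

# rename a column
renamed = students.rename(columns={"gpa": "grade_point_average"})

print(older["sid"].tolist())    # [53650]
print(len(joined))              # 2
print(round(mean_gpa[18], 2))   # 3.3
print(ranked["sid"].tolist())   # [53650, 53666, 53688]
```

Each operation returns a new DataFrame or Series and leaves `students` unchanged, which makes it easy to chain steps from the Python shell.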


Data model (Tabular): R

data.frame: a list of vector objects

• Each vector object represents a column

Possible vector types

• logical, integer, double, complex, character, raw

Data stored in memory

Operations performed from the R shell


What’s wrong with Tables?

• Too limited in structure?

• Too rigid?

• Too old-fashioned?


Beyond tables


But Structure Matters!

[Figure: Functionality (vertical axis) versus Time (and cost) (horizontal axis), comparing Structured (schema-first), Unstructured (schema-less), and Dataspaces (pay-as-you-go)]

Structure enables computers to help users manipulate and maintain the data.


End of Slide