Introduction to Big Data - Harvard University… · Introduction to Big Data Chapter 3 & 4 (Week...

Post on 21-May-2020

Introduction to Big Data

Chapter 3 & 4 (Week 2)

Applications of Big Data

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo

mins@korea.ac.kr

Contents

1. Applications of Big Data

Diverse applications

2. Workflow of Data Science

General Workflow

1st step (Problem definition)

Practice to define problem

Practice to imagine required data

3 / 20copyrightⓒ 2018 All rights reserved by Korea University

Applications

Natural language processing and voice recognition

Applications

Netflix & YouTube recommendation system

Applications

ChatBot

fXck yoX!!! <- Please don’t try that, for the chatbot’s future.

Applications

Color recovery for B&W pictures

Applications

Resolution recovery for poor-quality pictures

Applications

Motion detection

https://www.youtube.com/watch?v=pW6nZXeWlGM&feature=youtu.be

Applications

Image captioning

Applications

New image generation

Applications

Autonomous car

Applications

Robotics

But what if I asked you to build these things right now?

Main goal of this course

Mindset of this course

A journey of a thousand miles begins with a single step!

After taking this class, you should be able to:

• identify which types of data these applications require.

• imagine a data structure for these applications.

• know which techniques these applications require, even if you don't know the exact mathematical or statistical formulas behind them.

General workflow for Data Science

Diagram of workflow

Another workflow for Data Science

Diagram of workflow

1st step for Big Data Science

Problem definition

This is the first step in every project.

• You can define a problem by talking with someone.

• You can also find issues while wrestling with a problem on your own.

• Someone else may tell you what they find uncomfortable.

• You can also read news articles and come up with new ideas.

• Ideas can come to mind during unrelated activities.

...

What was uncomfortable?

Think about why this technology came about

Applications

Netflix & YouTube recommendation system

Applications

ChatBot

Applications

Color recovery for B&W pictures

Workflow for Data Science

Diagram of workflow

Workflow for Data Science

Diagram of workflow

2nd step for Big Data Science

Experimental design for getting data

Before attempting to collect data, think about what data you need to collect for your purpose.

• Which features are important?

• # of features

• # of samples

• Types of features

• Target individual

• ...

This is covered in the "Experimental Design" course in the Department of Statistics.

Tabular Data

Structured data

What is a table?

• A table is a collection of rows and columns

• Each row has an index

• Each column has a name

• A cell is specified by an (index, name) pair

• A cell may or may not have a value
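
The definition above can be sketched in plain Python. This is only a minimal illustration, not a library API: the `table` dictionary, the `cell` helper, and the sample values are assumptions made for the example.

```python
# A tiny table: each row has an index, each column has a name.
# All names and values here are illustrative assumptions.
columns = ["name", "age", "city"]

# Rows keyed by index; a missing key means the cell has no value.
table = {
    0: {"name": "Kim", "age": 21, "city": "Seoul"},
    1: {"name": "Lee", "age": 23},           # no value for "city"
    2: {"name": "Park", "city": "Sejong"},   # no value for "age"
}

def cell(table, index, name):
    """Return the value at (index, name), or None if the cell is empty."""
    return table[index].get(name)

print(cell(table, 0, "city"))   # Seoul
print(cell(table, 1, "city"))   # None
```

Storing rows by index and looking up the column name inside each row mirrors the (index, name) addressing; a missing key stands in for a cell without a value.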

Tabular Data

Structured data

Tabular Data

CSV format (comma-separated values)
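
As a small illustration of the format, the sketch below parses CSV text with Python's standard `csv` module. The sample rows reuse the student records that appear in the relational example of these slides; the variable names are assumptions made for the example.

```python
import csv
import io

# Sample CSV text: the first line is the header row with the column names.
csv_text = """sid,name,age,gpa
53666,Jones,18,3.4
53688,Smith,18,3.2
53650,Smith,19,3.8
"""

# DictReader uses the header row as keys; every value is read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so numeric columns are converted explicitly.
for row in rows:
    row["age"] = int(row["age"])
    row["gpa"] = float(row["gpa"])

print(rows[0]["name"], rows[0]["gpa"])
```

The explicit conversion step is the main point: a CSV file stores a table's cells as text, and the reader has to decide what type each column should be.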

The structure spectrum

Structured or not

Structured (schema-first)

• Relational Database

• Formatted Messages

Semi-Structured (schema-later)

• Documents

• XML

• Tagged Text/Media

Unstructured (schema-never)

• Plain Text

• Media
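
To make the three points on the spectrum concrete, here is a hedged sketch that holds roughly the same student information in each form. The field names, the `clubs` field, and the sentence of plain text are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Structured (schema-first): field names and order are fixed before data arrives.
SCHEMA = ("sid", "name", "age")
structured = ("53666", "Jones", 18)
record = dict(zip(SCHEMA, structured))

# Semi-structured (schema-later): tags travel with the data, and the set of
# fields may differ from one record to the next.
xml_doc = ET.fromstring("<student><sid>53666</sid><name>Jones</name></student>")
json_doc = json.loads('{"sid": "53666", "name": "Jones", "clubs": ["chess"]}')

# Unstructured (schema-never): plain text; any structure must be inferred.
plain_text = "Jones (student 53666) is 18 years old."

print(record["name"])              # field access through the schema
print(xml_doc.find("name").text)   # field access through a tag
print("Jones" in plain_text)       # only text search is available
```

The further right on the spectrum, the less the computer can do without extra work: the tuple can be queried by field, the tagged documents by tag, and the plain text only by searching.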

When people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That’s really all the word database means.

-- Jim Gray, “The Fourth Paradigm”

Key concept: Structured Data

Structured data

A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using a given data model.
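
One way to picture the distinction: the data model is "tables of typed columns", and a schema describes one particular table in that model. The sketch below is an assumption-laden illustration; representing a schema as a Python dictionary and the `conforms` helper are both invented for the example.

```python
# A schema for one collection (Students), written as column name -> Python type.
# This dictionary-based representation is an assumption made for illustration.
students_schema = {"sid": str, "name": str, "login": str, "age": int, "gpa": float}

def conforms(row, schema):
    """Return True if the row has exactly the schema's columns with matching types."""
    if set(row) != set(schema):
        return False
    return all(isinstance(row[col], typ) for col, typ in schema.items())

good = {"sid": "53666", "name": "Jones", "login": "jones@cs", "age": 18, "gpa": 3.4}
bad = {"sid": "53688", "name": "Smith", "age": "eighteen"}  # missing columns, wrong type

print(conforms(good, students_schema))   # True
print(conforms(bad, students_schema))    # False
```

Because the schema is itself data, a program can check rows against it, which is what makes a collection self-describing.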

The Relational Model

Structured data

The Relational Model is ubiquitous

• MySQL, PostgreSQL, Oracle, DB2, SQL Server, …

Foundational work done at

• IBM Santa Teresa Labs (now IBM Almaden): “System R”

• UC Berkeley CS: the “Ingres” system

Object-oriented concepts have been merged in

• Early work: POSTGRES research project at Berkeley

Support for XML (semi-structured data) has been merged in as well

E. F. “Ted” Codd, Turing Award 1981

Example

Instance of student relation

sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8

CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)
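
The statement above can be tried out with Python's built-in `sqlite3` module. This is a sketch under the assumption that SQLite is an acceptable stand-in for the database used in class; SQLite accepts the declared types but does not enforce the `CHAR` lengths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE Students ("
    "sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)

# The three rows of the student relation instance.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# Select the students named Smith, ordered by GPA.
smiths = conn.execute(
    "SELECT sid, gpa FROM Students WHERE name = ? ORDER BY gpa", ("Smith",)
).fetchall()
print(smiths)
```

The `CREATE TABLE` statement is the schema, and the inserted rows are one instance of it.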

Data model (Tabular)

Python

DataFrame: a dict of Series objects

• Each Series object represents a column

Series: a named, ordered dictionary

• The keys of the dictionary are the indexes

• Built on NumPy’s ndarray

• Values can be any NumPy data type object

Data stored in memory

Operations performed from Python shell
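
The column-oriented model described on this slide can be imitated with plain dictionaries. The sketch below is a stand-in written with the standard library only; it is not the actual DataFrame or Series implementation, and the names `frame` and `index` are assumptions made for the example.

```python
# Plain-Python stand-in for the model described above (not a real DataFrame).
# A "series" is an ordered dict from index to value; a "frame" is a dict of series.
index = ["53666", "53688", "53650"]

frame = {
    "name": dict(zip(index, ["Jones", "Smith", "Smith"])),
    "age": dict(zip(index, [18, 18, 19])),
    "gpa": dict(zip(index, [3.4, 3.2, 3.8])),
}

# Selecting a column returns one series; its keys are the row indexes.
gpa = frame["gpa"]
print(list(gpa))       # row indexes, in insertion order
print(gpa["53650"])    # value stored at one index

# A row is assembled by reading the same index from every column.
row = {col: series["53666"] for col, series in frame.items()}
print(row)
```

Storing data by column makes whole-column operations cheap, while reading a single row means visiting every column.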

Operations (Tabular)

Python

• integrate (join), transform, clean, impute

• aggregate: sum, count, average, max, min

• sort

• pivot

• Relational

  • union, intersection, difference, cartesian product (CROSS JOIN)

  • select/filter, project

  • join: natural join (INNER JOIN), theta join, semi-join, etc.

  • rename
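
A few of the operations listed above, sketched with SQL through Python's built-in `sqlite3` module. The `Enrolled` table and the course code `STAT101` are hypothetical additions made only so that a join has something to match against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, age INTEGER, gpa REAL);
CREATE TABLE Enrolled (sid TEXT, course TEXT);  -- hypothetical second table
INSERT INTO Students VALUES
    ('53666', 'Jones', 18, 3.4), ('53688', 'Smith', 18, 3.2), ('53650', 'Smith', 19, 3.8);
INSERT INTO Enrolled VALUES
    ('53666', 'DCCS208'), ('53650', 'DCCS208'), ('53650', 'STAT101');
""")

# select/filter + project + sort: rows with age 18, only the name column.
names = conn.execute(
    "SELECT name FROM Students WHERE age = 18 ORDER BY name"
).fetchall()

# aggregate: count and average GPA per age group.
by_age = conn.execute(
    "SELECT age, COUNT(*), AVG(gpa) FROM Students GROUP BY age ORDER BY age"
).fetchall()

# join: match students to their enrolled courses on sid.
joined = conn.execute(
    "SELECT s.name, e.course FROM Students s JOIN Enrolled e ON s.sid = e.sid "
    "ORDER BY s.sid, e.course"
).fetchall()

print(names)
print(by_age)
print(joined)
```

Student 53688 has no row in `Enrolled`, so the inner join drops that student from the result.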

Data model (Tabular)

R

data.frame: a list of vector objects

• Each vector object represents a column

Possible vector types

• logical, integer, double, complex, character, raw

Data stored in memory

Operations performed from the R shell

What’s wrong with Tables?

• Too limited in structure?

• Too rigid?

• Too old-fashioned?

Beyond tables

But Structure Matters!

[Figure: Functionality (vertical axis) versus Time (and cost) (horizontal axis), with curves for Structured (schema-first), Unstructured (schema-less), and Dataspaces (pay-as-you-go)]

Structure enables computers to help users manipulate and maintain the data.

End of Slide