Introduction to Big Data
Chapter 3 & 4 (Week 2): Applications of Big Data
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok [email protected]
Contents
1. Applications of Big Data
• Diverse applications
2. Workflow of Data Science
• General Workflow
• 1st step (Problem definition)
• Practice to define problem
• Practice to imagine required data
Copyright ⓒ 2018 All rights reserved by Korea University
Applications: Natural language processing and voice recognition
Applications: Netflix & YouTube recommendation system
Applications: ChatBot
"fXck yoX!!!" <- Please don’t try that, for the sake of the chatbot’s future.
Applications: Color recovery for B&W pictures
Applications: Resolution recovery for poor-quality pictures
Applications: Motion detection
https://www.youtube.com/watch?v=pW6nZXeWlGM&feature=youtu.be
Applications: Image captioning
Applications: New image generation
Applications: Autonomous car
Applications: Robotics
But what if I asked you to build these things right now?
Main goal of this course: Mindset of this course
A journey of a thousand miles begins with a single step!
After taking this class, you should be able to:
• identify which types of data these applications require.
• imagine the data structures behind these applications.
• know which techniques these applications require, even if you don't know the exact mathematical or statistical formulas behind them.
General workflow for Data Science: Diagram of workflow
Another workflow for Data Science: Diagram of workflow
1st step for Big Data Science: Problem definition
This is the first step in every piece of work.
• You can define a problem by talking with someone.
• You can also identify issues while wrestling with a question on your own.
• Someone else may tell you what they find uncomfortable.
• You can also read news articles and come up with new ideas.
• Ideas can come to mind during unrelated activities.
...
What was uncomfortable? Think about why this technology came about
Applications: Netflix & YouTube recommendation system
Applications: ChatBot
Applications: Color recovery for B&W pictures
Workflow for Data Science: Diagram of workflow
2nd step for Big Data Science: Experimental design for getting data
Before attempting to collect data, think about what data you need to collect for your purpose.
• Which features are important?
• # of features
• # of samples
• Types of features
• Target individuals
• ...
This topic is covered in the "Experimental Design" course in the Department of Statistics.
Tabular Data: Structured data
What is a table?
• A table is a collection of rows and columns
• Each row has an index
• Each column has a name
• A cell is specified by an (index, name) pair
• A cell may or may not have a value
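The table definition above can be sketched in Python. This is a minimal illustration that assumes the pandas library; the rows reuse the student example from these slides, with one gpa deliberately left empty to show a cell without a value.

```python
import pandas as pd

# Illustrative data: the last gpa is left empty on purpose.
table = pd.DataFrame(
    {"name": ["Jones", "Smith", "Smith"], "gpa": [3.4, 3.2, None]},
    index=[53666, 53688, 53650],  # each row has an index
)

# Each column has a name.
print(list(table.columns))

# A cell is specified by an (index, name) pair.
print(table.loc[53666, "gpa"])

# A cell may or may not have a value: the empty gpa is stored as NaN.
print(pd.isna(table.loc[53650, "gpa"]))
```

Using `.loc[index, name]` mirrors the (index, name) pair from the slide; other lookups exist, but this one maps most directly onto the definition.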
Tabular Data: CSV format (comma-separated values)
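A minimal sketch of the CSV format, assuming the student relation shown elsewhere in these slides has been saved as comma-separated text. It uses only Python's standard csv module, and the variable names are illustrative.

```python
import csv
import io

# Illustrative CSV text; the first line is the header with the column names.
csv_text = """sid,name,login,age,gpa
53666,Jones,jones@cs,18,3.4
53688,Smith,smith@eecs,18,3.2
53650,Smith,smith@math,19,3.8
"""

# DictReader maps each data row to a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores every value as text, so numeric columns must be converted.
ages = [int(row["age"]) for row in rows]
mean_gpa = sum(float(row["gpa"]) for row in rows) / len(rows)

print(rows[0]["name"])
print(ages)
print(round(mean_gpa, 2))
```

Note that the file itself carries no type information: whether "18" is a number or a string is decided by the reader, not by the format.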
The structure spectrum: Structured or not
Structured (schema-first)
• Relational Database
• Formatted Messages
Semi-Structured (schema-later)
• Documents
• XML
• Tagged Text/Media
Unstructured (schema-never)
• Plain Text
• Media
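One way to picture the three points on the spectrum is to store the same student in each style. This is only a sketch using the Python standard library; the JSON documents, the "clubs" field, and the sentence are invented for illustration.

```python
import json
import re

# Structured (schema-first): column names are fixed before any data arrives.
schema = ("sid", "name", "age")
structured_row = (53666, "Jones", 18)
record = dict(zip(schema, structured_row))

# Semi-structured (schema-later): each document carries its own field names,
# and different documents may carry different fields.
docs = [
    json.loads('{"sid": 53666, "name": "Jones", "age": 18}'),
    json.loads('{"sid": 53688, "name": "Smith", "clubs": ["chess"]}'),
]
fields_seen = sorted({key for doc in docs for key in doc})

# Unstructured (schema-never): plain text; any structure must be extracted.
text = "Jones (student 53666) turned 18 this year."
match = re.search(r"student (\d+)", text)
sid_from_text = int(match.group(1)) if match else None

print(record)
print(fields_seen)
print(sid_from_text)
```

The further right on the spectrum, the more work moves from the storage format into the code that reads the data.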
“When people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That’s really all the word database means.”
-- Jim Gray, “The Fourth Paradigm”
Key concept: Structured Data
A data model is a collection of concepts for describing data.
A schema is a description of a particular collection of data, using a given data model.
The Relational Model: Structured data
• The Relational Model is ubiquitous: MySQL, PostgreSQL, Oracle, DB2, SQL Server, …
• Foundational work done at
  • IBM Santa Teresa Labs (now IBM Almaden): “System R”
  • UC Berkeley CS: the “Ingres” system
• Object-oriented concepts have been merged in
  • Early work: POSTGRES research project at Berkeley
• As has support for XML (semi-structured data)
E. F. “Ted” Codd, Turing Award 1981
Example: Instance of student relation

sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8

CREATE TABLE Students (
    sid CHAR(20),
    name CHAR(20),
    login CHAR(10),
    age INTEGER,
    gpa FLOAT
)
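The schema and instance above can be tried out with Python's built-in sqlite3 module. This is a sketch under the assumption that an in-memory SQLite database is an acceptable stand-in for the relational systems named in these slides; the query at the end is an added illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema from the slide.
conn.execute(
    "CREATE TABLE Students("
    "sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)

# The instance from the slide: one tuple per row of the relation.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# Illustrative query: filter rows by age, then project two columns.
young = conn.execute(
    "SELECT name, login FROM Students WHERE age = 18 ORDER BY sid"
).fetchall()
print(young)
conn.close()
```

The schema is declared once, before any data is inserted, which is what "schema-first" means in practice.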
Data model (Tabular): Python
• DataFrame: a dict of Series objects
  • Each Series object represents a column
• Series: a named, ordered dictionary
  • The keys of the dictionary are the indexes
  • Built on NumPy’s ndarray
  • Values can be any NumPy data type object
• Data stored in memory
• Operations performed from Python shell
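A short sketch of this data model, assuming pandas and NumPy are installed. The two columns are taken from the student example in these slides; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A Series is a named, ordered mapping from index labels to values.
age = pd.Series([18, 18, 19], index=[53666, 53688, 53650], name="age")
gpa = pd.Series([3.4, 3.2, 3.8], index=[53666, 53688, 53650], name="gpa")

# Look up a value by its index label, like a dictionary key.
print(age[53650])

# The values are backed by a NumPy ndarray.
print(isinstance(age.to_numpy(), np.ndarray))

# A DataFrame behaves like a dict of Series that share one index.
students = pd.DataFrame({"age": age, "gpa": gpa})
print(list(students.keys()))   # column names act as the dict keys
print(students["gpa"].dtype)   # each column is a Series with a NumPy dtype
```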
Operations (Tabular): Python
• integrate (join), transform, clean, impute
• aggregate: sum, count, average, max, min
• sort
• pivot
• Relational
  • union, intersection, difference, cartesian product (CROSS JOIN)
  • select/filter, project
  • join: natural join (INNER JOIN), theta join, semi-join, etc.
  • rename
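Several of the operations listed above can be sketched with pandas. The students table follows the example in these slides; the enrolled table, its columns, and all variable names are hypothetical and exist only for this illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "sid": [53666, 53688, 53650],
    "name": ["Jones", "Smith", "Smith"],
    "age": [18, 18, 19],
    "gpa": [3.4, 3.2, 3.8],
})
# Hypothetical second table, invented for this example.
enrolled = pd.DataFrame({
    "sid": [53666, 53666, 53650],
    "course": ["DB", "Stats", "DB"],
    "grade": [90, 85, 70],
})

# select/filter rows, then project columns
older = students[students["age"] >= 19][["sid", "name"]]

# join: inner join on the shared sid column
joined = students.merge(enrolled, on="sid", how="inner")

# aggregate: average grade per course
mean_grade = joined.groupby("course")["grade"].mean()

# sort: highest gpa first
by_gpa = students.sort_values("gpa", ascending=False)

# pivot: one row per student, one column per course
wide = joined.pivot(index="sid", columns="course", values="grade")

# rename a column
renamed = students.rename(columns={"gpa": "grade_point_average"})

print(joined)
print(wide)
```

After the pivot, a student with no grade in a course gets an empty cell, which ties back to the point that a cell may or may not have a value.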
Data model (Tabular): R
• data.frame: a list of vector objects
  • Each vector object represents a column
• Possible vector types: logical, integer, double, complex, character, raw
• Data stored in memory
• Operations performed from the R shell
What’s wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old fashioned?
Beyond tables
But Structure Matters!
[Figure: Functionality versus Time (and cost), comparing Structured (schema-first), Unstructured (schema-less), and Dataspaces (pay-as-you-go)]
Structure enables computers to help users manipulate and maintain the data.
End of Slide