Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break....

44
Big Data Definitions Giovanni Giuffrida November 29, 2018

Transcript of Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break....

Page 1: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data Definitions

Giovanni Giuffrida

November 29, 2018

Page 2: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,
Page 3: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,
Page 4: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,
Page 5: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Roots

• Datification is the revolutionbehind Big Data

• People have always tried toquantify the World around them

• Always-connected digitaltechnologies made this a reality

Page 6: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

History repeats itself

• 90’: Artificial Intelligence, Expert systems

• 00’: Data/Text mining, Knowledge discovery, Machinelearning

• Today: Big Data, DMP, Deep learning

Same goal: Extracting meaningful info from data

So... what’s the difference today??

Page 7: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

But now industry is really serious about it.

• In the past, mostly limited toUniversities and research centers

• Companies now aggresivelyinvesting big $$$ in Big Data

• Shift budgets from offline to online

• They want more control on theirinvestments: transparency andcontrol

• Big data 1.0 to big data 2.0 forcompanies

Page 8: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

But now industry is really serious about it. Why?

• Data culture: Companies now appreciate value of data torun their business

• Data size: Today’s data at companies disposal is really big.Traditional methods break.

• Technology: Tech infrastructure to handle big data is nowavailable (e.g., Hadoop, NoSql, CouchBase, MongoDB,Dynamo, Hbase, RedShift, etc.)

Page 9: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

A not so far-fetched scenario

“One of your loyal customers posts on Facebook that she’s goingshopping at one of your stores today. You know that she justpurchased a pair of pants online last week, and that her abandonedonline shopping cart has a few cute tops in it to go with the pants.She goes to the store, the retail assistant is able to identify whoshe is and brings out the tops she abandoned online to try on withher new pants. But since your customer isn’t wearing her newpants, the retail assistant knows which size pants to go grab. Thenwhile shopping, your customer gets a 25% off coupon delivered toher smartphone-good for today only.”

Page 10: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data according to Oxford Dictionary

big data n. Computing (also with capital initials) data of a verylarge size, typically to the extent that its manipulation and

management present significant logistical challenges; (also) thebranch of computing involving such data.

Page 11: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

A simpler view

Anything that does not fit in Excel

Page 12: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

How big is big?

Page 13: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

How big is big?

Quintillion = 1018

Page 14: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

How big is big?

Quintillion = 1018

Page 15: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

How big is big? Let’s try to measure it

• Many attempts to measure the world information of all types• Prof. Martin Hilbert from USC:

• In 2000: Only 25% of the world information was digital• In 2007: Only 7% of the world information was analog• In 2013: 1200 Exabyte (1B gigabytes) of overall data, only 2%

analog• If it were books: Cover entire US surface 52 layers of book• If it were CD-ROMs: 5 separate piles to the moon

Page 16: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

http://www.ibmbigdatahub.com/sites/default/files/infographic file/4-Vs-of-big-data.jpg

Page 17: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

According to Gartner

• Big Data according to Gartner:Big data is high-Volume, high-Velocity and high-Varietyinformation assets that demand cost-effective, innovativeforms of information processing for enhanced insight anddecision making.

• Value and Veracity added later... not so meanigful

Page 18: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Volume: Data at rest

• About: Amount of data

• Unit: bytes

• Information about thegeneral population,education, health, medicine,travel, geographic locations,shopping, financialtransactions, jobs, scientificexperiments, emails, sensors,texts, photos, videos,activity on social networks,etc.

• How much is “Big Data”?

Page 19: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Velocity: Data in motion

• About: Moving data

• Unit: Bytes per second• Two possible interpretations

• Data Generation Rate• Data Processing Rate

• Every minute (2016):• 150M emails sent• 2.4M searches on Google• 30M WhatsApp messages• 700K logins on Facebook

• How much is “Big Data”?

Page 20: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Variety: Many possible forms

• About: Form of data• Three basic types of data

• Structured = Data in a fixed fieldwithin a record (spreadsheets,Relational Database)

• Semi-Structured = XML, JSON,CSV

• Unstructured = Data storedwithout any model, or that doesnot have any organisation

• Any of those types can be big

• Only 20% of data today is“structured”

Page 21: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Veracity: Uncertain data

• Uncertainty due to many factors• Incompleteness• Inconsistency• Ambiguity• Model approximation• Technical constraints• etc..

• Often overlooked

• But... it could as important as the other Vs

Page 22: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Value: Actionable insights

• People do not need data, they need insights

• Hidden in the data

• “Value is a concentrated data-juice”

• Gaining correct but irrelevant information is a (big) waste oftime

• Close interaction between technical/analytics team andbusiness managers may alleviate the problem

Page 23: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

How big is Big? Considerations

• Frankly... No formal definition yet

• These numbers outstrip our machines and our imagination

• Digital data generation accelerated tremendously in recentyears

• Technology to process all this is behind

Page 24: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data Science is a process

• Science and Crafts can be taught

• Creativity may came throughexperience

• Common Sense? You need to have it

• As many crafting processes all stepsmay be well understood!

Page 25: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data Science is multidisciplinary

Page 26: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Interests towards Big Data

“Big Data” vs “Business Intelligence”

Page 27: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Interests towards Big Data

“Big Data” vs “Business Intelligence” vs “Hadoop”

Page 28: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

From Big Data 1.0 to 2.0

• Similar trend as Web 1.0 to 2.0

• Infrastructure for Big Data in place today• Many technologies getting consolidated

• NoSQL, Hadoop, Redshift, etc.• Still missing standard• Scientific communities also working on it

• Infrastructure for storing Big Data in place today

• Big Data 1.0: Now we can process huge amount of data

• Big Data 2.0: Ok, now, what to do with it??

• “Data Science” as a new “sexy” field of research andapplication

• Some companies are already in the 2.0, e.g., Amazon andNetflix

Page 29: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

From Big Data 1.0 to 2.0

• Data Science is becoming ubiquitous• Analytical thinking will be pervasive in companies• Managers should have a basic understand of fundamental

principles• Managers will lead data-analytics teams and data-analytics

projects• Companies culture should support data-driven processes• All aspects: from production to marketing to sales• Investors as well

• Necessary to support data-driven decisions or to identifydata-oriented threaths from competitors

Page 30: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data Main Challenges

• A lot of data

• Complex data

• Ever increasing data

• Many different sources

• Siloed data

• Duplicated data

• Wrong and dirty data

• Understanding data

• Understanding businessneeds

• ... Matching those two

• Presenting data,communicating results

• Infeer new data?

• Privacy concern

Page 31: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data Universe

• Survey by O’Reilly(2014)

• 300+ toolsavailable

• Most people use 6to 10

• Best professionalsuse up to 20 tools

• Not well definedstandards yet

• Common: SQL, R,Python, Excel

Page 32: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data investments

Page 33: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Big Data investments... be careful

Page 34: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

Industry attitute towards Big Data

• Viewpoint report DNV-GL and Eurisko

• February 2016

• 1189 professionals world-wide interviewed

• 82 leaders selected

Page 35: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“52% of enterprises globally see Big Data as an opportunity, andonly 4.5% see it as a threat”

Page 36: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“51% of enterprises with 1K+ employees are planning to increasetheir investments in Big Data in the next two to three years”

Page 37: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“44.9% of all enterprises rate Big Data capabilities as importantand very important to their strategic direction”

Page 38: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“Despite widespread optimism regarding Big Data’s potential, only23.5% of enterprises have a clear strategy today”

Page 39: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“Enhancing information management (27.61%), implementing &integrating new technologies and methods (24.8%), makingchanges to culture and organization (16%) and creating new

business models and go-to-market strategies (15.4%) are the topfour actions companies have taken with Big Data”

Page 40: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

“Increased efficiency (22.6%), improved decision making (16.3%)and improved customer experience and engagement (15.6%) are

the top three benefits of enterprise-wide Big Data adoption today”

Page 41: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

The new “Data Scientist” job profile

McKinsey: “There will be a shortage of talent necessary fororganizations to take advantage of big data. By 2018, theUnited States alone could face a shortage of 140,000 to190,000 people with deep analytical skills as well as 1.5million managers and analysts with the know-how to use theanalysis of big data to make effective decisions.”

Nerds are becoming “sexy”

Page 42: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,

The “Data Scientist” must-have skills

• Basic statistics

• Basic statistical programming: R and Python, SQL

• Machine learning concepts

• Data Munging: Data very often are dirty... need to clean andnormalize those

• Data Visualization: Many tools available

• Thinking like a “Data Scientist”: see data everywhere

• Thinking like a problem solver

• Exceptional oral and written communication abilities

• Customer oriented approach

Page 43: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,
Page 44: Big Data Definitions - unict.itggiuffrida/LM/big_data_definitions.pdfTraditional methods break. Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSql,