Big Data Definitions

Giovanni Giuffrida

November 29, 2018

Roots

• Datafication is the revolution behind Big Data

• People have always tried to quantify the world around them

• Always-connected digital technologies made this a reality

History repeats itself

• ’90s: Artificial Intelligence, Expert systems

• ’00s: Data/Text mining, Knowledge discovery, Machine learning

• Today: Big Data, DMP, Deep learning

Same goal: Extracting meaningful info from data

So... what’s the difference today??

But now industry is really serious about it.

• In the past, mostly limited to universities and research centers

• Companies now aggressively investing big $$$ in Big Data

• Shift budgets from offline to online

• They want more control over their investments: transparency and control

• From Big Data 1.0 to Big Data 2.0 for companies

But now industry is really serious about it. Why?

• Data culture: Companies now appreciate the value of data to run their business

• Data size: Today’s data at companies’ disposal is really big. Traditional methods break.

• Technology: Tech infrastructure to handle big data is now available (e.g., Hadoop, NoSQL, Couchbase, MongoDB, Dynamo, HBase, Redshift, etc.)

A not so far-fetched scenario

“One of your loyal customers posts on Facebook that she’s going shopping at one of your stores today. You know that she just purchased a pair of pants online last week, and that her abandoned online shopping cart has a few cute tops in it to go with the pants. She goes to the store, the retail assistant is able to identify who she is and brings out the tops she abandoned online to try on with her new pants. But since your customer isn’t wearing her new pants, the retail assistant knows which size pants to go grab. Then while shopping, your customer gets a 25% off coupon delivered to her smartphone, good for today only.”

Big Data according to Oxford Dictionary

big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.

A simpler view

Anything that does not fit in Excel

How big is big?

Quintillion = 10^18

How big is big? Let’s try to measure it

• Many attempts to measure the world’s information of all types

• Prof. Martin Hilbert from USC:
  • In 2000: Only 25% of the world’s information was digital
  • In 2007: Only 7% of the world’s information was analog
  • In 2013: 1,200 exabytes of overall data (1 exabyte = 1 billion gigabytes), only 2% analog
  • If it were books: they would cover the entire US surface in 52 layers
  • If it were CD-ROMs: they would form 5 separate piles reaching to the Moon
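To make these magnitudes concrete, the following minimal Python sketch converts the 1,200-exabyte estimate quoted above into bytes and gigabytes. It assumes decimal (SI) prefixes, where 1 exabyte = 10^18 bytes (one quintillion); the constant and function names are illustrative only.

```python
# Assumption: decimal (SI) prefixes, i.e. 1 GB = 10**9 bytes, 1 EB = 10**18 bytes.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_EXABYTE = 10**18  # one quintillion bytes


def exabytes_to_bytes(exabytes: int) -> int:
    """Convert a whole number of exabytes to bytes."""
    return exabytes * BYTES_PER_EXABYTE


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert a whole number of exabytes to gigabytes."""
    return exabytes * BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE


total_eb = 1200  # the 2013 estimate quoted in the slide
print(f"{total_eb} EB = {exabytes_to_bytes(total_eb):.1e} bytes")
print(f"{total_eb} EB = {exabytes_to_gigabytes(total_eb):,} GB")
```

Python integers have arbitrary precision, so these values are exact rather than floating-point approximations.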

http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg

According to Gartner

• Big Data according to Gartner: Big data is high-Volume, high-Velocity and high-Variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

• Value and Veracity added later... not so meaningful

Volume: Data at rest

• About: Amount of data

• Unit: bytes

• Information about the general population, education, health, medicine, travel, geographic locations, shopping, financial transactions, jobs, scientific experiments, emails, sensors, texts, photos, videos, activity on social networks, etc.

• How much is “Big Data”?

Velocity: Data in motion

• About: Moving data

• Unit: Bytes per second

• Two possible interpretations
  • Data Generation Rate
  • Data Processing Rate

• Every minute (2016):
  • 150M emails sent
  • 2.4M searches on Google
  • 30M WhatsApp messages
  • 700K logins on Facebook

• How much is “Big Data”?
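As a rough illustration of the data generation rate, the per-minute figures above can be turned into per-second rates. The sketch below is only arithmetic on the slide’s 2016 numbers; the dictionary keys and the function name are illustrative. These are event counts, not bytes: converting to bytes per second would require an average message size, which the slides do not give.

```python
# Per-minute event counts for 2016, as quoted in the slide.
PER_MINUTE_2016 = {
    "emails sent": 150_000_000,
    "Google searches": 2_400_000,
    "WhatsApp messages": 30_000_000,
    "Facebook logins": 700_000,
}


def per_second(events_per_minute: int) -> float:
    """Convert an events-per-minute count into events per second."""
    return events_per_minute / 60


rates = {name: per_second(count) for name, count in PER_MINUTE_2016.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} per second")
```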

Variety: Many possible forms

• About: Form of data

• Three basic types of data
  • Structured = Data in a fixed field within a record (spreadsheets, relational databases)
  • Semi-Structured = XML, JSON, CSV
  • Unstructured = Data stored without any model, or that does not have any organisation

• Any of those types can be big

• Only 20% of data today is “structured”
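The three types can be contrasted with a small Python sketch. All field names and values here are invented for illustration; the point is only how each form is accessed: by a fixed schema, by self-describing keys, or by inference from raw text.

```python
import json

# Structured: a fixed set of fields in a fixed order, as in a table row.
# (Hypothetical schema and values.)
FIELDS = ("customer_id", "name", "age")
structured_row = ("C001", "Alice", 42)

# Semi-structured: self-describing keys; records may nest or omit fields.
semi_structured = json.loads(
    '{"customer_id": "C001", "name": "Alice", "tags": ["loyal", "online"]}'
)

# Unstructured: free text with no data model; any field must be inferred.
unstructured = "Going shopping at my favourite store today!"

# Access pattern for each form.
age = structured_row[FIELDS.index("age")]       # positional, schema-driven
tags = semi_structured.get("tags", [])          # key-based, field is optional
mentions_store = "store" in unstructured.lower()  # crude text inspection

print(age, tags, mentions_store)
```

The structured row is the easiest to query but the least flexible, while the unstructured text can hold anything but needs extra processing before it yields a usable field.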

Veracity: Uncertain data

• Uncertainty due to many factors
  • Incompleteness
  • Inconsistency
  • Ambiguity
  • Model approximation
  • Technical constraints
  • etc.

• Often overlooked

• But... it could be as important as the other Vs

Value: Actionable insights

• People do not need data, they need insights

• Hidden in the data

• “Value is a concentrated data-juice”

• Gaining correct but irrelevant information is a (big) waste of time

• Close interaction between technical/analytics team and business managers may alleviate the problem

How big is Big? Considerations

• Frankly... No formal definition yet

• These numbers outstrip our machines and our imagination

• Digital data generation accelerated tremendously in recent years

• Technology to process all this is lagging behind

Big Data Science is a process

• Science and Crafts can be taught

• Creativity may come through experience

• Common Sense? You need to have it

• As in many crafting processes, all steps may be well understood!

Big Data Science is multidisciplinary

Interest in Big Data

“Big Data” vs “Business Intelligence” vs “Hadoop”

From Big Data 1.0 to 2.0

• Similar trend as Web 1.0 to 2.0

• Infrastructure for Big Data in place today

• Many technologies getting consolidated
  • NoSQL, Hadoop, Redshift, etc.
  • Still missing a standard
  • Scientific communities also working on it

• Infrastructure for storing Big Data in place today

• Big Data 1.0: Now we can process huge amounts of data

• Big Data 2.0: Ok, now, what to do with it??

• “Data Science” as a new “sexy” field of research and application

• Some companies are already at 2.0, e.g., Amazon and Netflix

From Big Data 1.0 to 2.0

• Data Science is becoming ubiquitous
  • Analytical thinking will be pervasive in companies
  • Managers should have a basic understanding of fundamental principles
  • Managers will lead data-analytics teams and data-analytics projects
  • Company culture should support data-driven processes
  • All aspects: from production to marketing to sales
  • Investors as well

• Necessary to support data-driven decisions or to identify data-oriented threats from competitors

Big Data Main Challenges

• A lot of data

• Complex data

• Ever increasing data

• Many different sources

• Siloed data

• Duplicated data

• Wrong and dirty data

• Understanding data

• Understanding business needs

• ... Matching those two

• Presenting data, communicating results

• Infer new data?

• Privacy concerns

Big Data Universe

• Survey by O’Reilly (2014)

• 300+ tools available

• Most people use 6 to 10

• Best professionals use up to 20 tools

• No well-defined standards yet

• Common: SQL, R, Python, Excel

Big Data investments

Big Data investments... be careful

Industry attitude towards Big Data

• Viewpoint report by DNV-GL and Eurisko

• February 2016

• 1189 professionals world-wide interviewed

• 82 leaders selected

“52% of enterprises globally see Big Data as an opportunity, and only 4.5% see it as a threat”

“51% of enterprises with 1K+ employees are planning to increase their investments in Big Data in the next two to three years”

“44.9% of all enterprises rate Big Data capabilities as important and very important to their strategic direction”

“Despite widespread optimism regarding Big Data’s potential, only 23.5% of enterprises have a clear strategy today”

“Enhancing information management (27.61%), implementing & integrating new technologies and methods (24.8%), making changes to culture and organization (16%) and creating new business models and go-to-market strategies (15.4%) are the top four actions companies have taken with Big Data”

“Increased efficiency (22.6%), improved decision making (16.3%) and improved customer experience and engagement (15.6%) are the top three benefits of enterprise-wide Big Data adoption today”

The new “Data Scientist” job profile

McKinsey: “There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

Nerds are becoming “sexy”

The “Data Scientist” must-have skills

• Basic statistics

• Basic statistical programming: R and Python, SQL

• Machine learning concepts

• Data Munging: Data are very often dirty... they need to be cleaned and normalized

• Data Visualization: Many tools available

• Thinking like a “Data Scientist”: see data everywhere

• Thinking like a problem solver

• Exceptional oral and written communication abilities

• Customer-oriented approach
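The data munging skill listed above can be illustrated with a minimal Python sketch. It assumes a hypothetical list of customer records (all names, fields, and values are invented) and shows three common cleaning steps: normalizing text, dropping incomplete rows, and removing duplicates. It is a sketch of the idea, not a general-purpose cleaning routine.

```python
def clean_records(records):
    """Normalize names and emails, drop rows without an email, remove duplicates.

    Two records are treated as duplicates when their normalized emails match
    (an assumption made for this example).
    """
    seen_emails = set()
    cleaned = []
    for rec in records:
        # Normalize: trim whitespace and lowercase the email.
        email = (rec.get("email") or "").strip().lower()
        if not email:
            continue  # incomplete record: no usable email
        if email in seen_emails:
            continue  # duplicate of an earlier record
        seen_emails.add(email)
        # Normalize: collapse repeated spaces and fix capitalization.
        name = " ".join((rec.get("name") or "").split()).title()
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical dirty input.
raw = [
    {"name": "  alice  ROSSI ", "email": "Alice@Example.com "},
    {"name": "Alice Rossi", "email": "alice@example.com"},  # duplicate
    {"name": "Bob Bianchi", "email": None},                 # incomplete
    {"name": "carla verdi", "email": "carla@example.com"},
]

cleaned = clean_records(raw)
for rec in cleaned:
    print(rec)
```

In practice the same steps are usually done with the tools named in the list (SQL, R, or Python libraries), but the logic stays the same: decide what counts as missing, what counts as a duplicate, and what the normalized form of each field is.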