Big Data: Data Analysis Boot Camp What is Big Data?ccartled/Teaching/2018-Spring/DataAnalysi… ·...
Intro. What is Big Data Big Data’s Vs What sets BD apart Real definitions Ethics Q & A Conclusion References Files
Big Data: Data Analysis Boot Camp
What is Big Data?
Chuck Cartledge, PhD
19 January 2018
-
Table of contents (1 of 1)
1 Intro.
2 What is Big Data
    And, why is it interesting?
3 Big Data’s Vs
    Classical definition
    Data sources and types
4 What sets BD apart
    Statistics and BD
5 Real definitions
    Pragmatic and practical
6 Ethics
    A simple idea in pictures
7 Q & A
8 Conclusion
9 References
10 Files
-
What are we going to cover?
We’re going to talk about:
What is Big Data?
What is Big Data, beyond the marketing hype?
What sets Big Data apart?
What is a practical definition of Big Data?
-
And, why is it interesting?
“Big data has emerged as a technology term and trend that is complementary to and considered to be equally as transformational as the cloud computing model. . . . represented as an ’old’ or ’new’ capability depending on the perspective of those defining it, . . . ”
Lee Badger [8]
“Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).”
Jules J. Berman [3]
-
Classical definition
Doug Laney, META Group
The origin of “Big Data” ideas and definitions.
Started in the e-commerce mergers and acquisitions arena
Used to explain why traditional Relational Database Management Systems (RDBMS) wouldn’t scale
Intended audience was non-technical management
Image from [7].
Take away: traditional RDBMS don’t/won’t scale and different approaches are needed.
-
Laney’s original BD Vs
Image from [7].
-
Volume — what does it mean for Big Data?
How much is there? And, how do we store it?
Store relational records?
Store transactional records?
How long to keep data available?
How to access data?
How to migrate data?
Image from [4].
See http://en.wikipedia.org/wiki/Metric_prefix for a list of prefixes.
-
Velocity — what does it mean for Big Data?
Frequency of data generation/delivery
Think of data from a device or sensor, robots, click logs
Real-time analysis is a small share (9%) [12].
Most Big Data analytics is batch
Known as “Little’s Law” [9]
Take away: data is generated at high speed, and it must be analyzed before the next set of data is delivered.
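Little's Law [9] states L = λW: the long-run average number of items in a system equals the average arrival rate times the average time each item spends in the system. A minimal sketch of what that could mean for velocity, assuming a hypothetical data feed (the function name and the numbers are illustrative and do not come from the slides):

```python
def items_in_system(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law, L = lambda * W, for long-run averages in a stable system."""
    return arrival_rate_per_s * time_in_system_s


# Hypothetical feed: 400 records arrive per second and each record
# spends 0.25 seconds in the analysis pipeline.
in_flight = items_in_system(400.0, 0.25)
print(in_flight)  # average number of records in the pipeline at any moment
```

If the time spent per record grows while the arrival rate stays the same, the number of records held in the system grows in proportion, which is one way to read the take-away above.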
-
Variety — what does it mean for Big Data?
Not all data is the same.
Data from a multitude of different sources.
Not all data is useful.
Data is lost during “normalization”
Hopefully not important data; when in doubt, keep it somehow
Gets away from relational databases
-
The original Vs have been expanded
Lots more Vs.
1 Vagueness
2 Validity
3 Value
4 Variability
5 Variety
6 Velocity
7 Venue
8 Veracity
9 Viability
10 Vincularity
11 Virility
12 Viscosity
13 Visibility
14 Visible
15 Visualization
16 Vitality
17 Vocabulary
18 Volatility
19 Volume
We’ll talk about these later.
-
Data sources and types
The Big Data challenges.
Heterogeneity
Scale
Timeliness
Complexity
Privacy
The Big Data user changes the question [1].
-
Statistics and BD
Important ideas from statistics
How “good” an answer do you want?
Questions that need to be answered:
How accurately do you need the answer?
What level of confidence do you intend to use?
What is your current estimate of the answer you’re after?
Image from [6].
The greater the tolerance for error, the fewer samples needed.
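The three questions above map onto the usual sample-size formula for estimating a proportion, n = z² p (1 − p) / E², where E is the margin of error, z comes from the confidence level, and p is the current estimate. A minimal sketch, assuming that formula is the one intended here (the function and parameter names are illustrative, and p = 0.5 is the conservative default when no estimate is available):

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error: float, confidence: float, estimate: float = 0.5) -> int:
    """Samples needed to estimate a proportion within +/- margin_of_error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * estimate * (1 - estimate) / margin_of_error ** 2
    return math.ceil(n)  # round up: a fraction of a sample is not possible


tight = sample_size(margin_of_error=0.03, confidence=0.95)
loose = sample_size(margin_of_error=0.05, confidence=0.95)
print(tight, loose)
```

Loosening the margin of error from 3 to 5 percentage points cuts the required sample to well under half, which is the point of the slide's closing line.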
-
If you have some pre-knowledge of the “population” then you only need to sample a very small number of “individuals” to get a good enough answer [15].
-
How sampling differs from “Big Data”
Sampling – start with a preconceived idea of the outcome
Sampling – few data points, extremely valuable (n = 1000)
Big data – you don’t know what the data holds
Big data – many data points, extremely cheap (n = all)
Leadership role changes from investigator to data [10].
Large data sets are messy, incomplete, inconsistent, and error-prone. They require lots of data munging and data wrangling.
-
We’ll be covering virtually “bleeding edge” stuff.
Data too big for a single machine.
Processing too long for a single machine.
Question/analysis is parallelizable.
Image from [13].
-
Lots of places, lots of it, and fast.
We are “drowning” in Big Data.
230,000,000 tweets per day [5]
2,700,000,000 Facebook likes per day [2]
100 hours of YouTube video every minute [16]
Clickstream left on servers
Our wearable devices are contributing to this avalanche of data.
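To put the daily totals above on a per-second footing, a small worked conversion may help. This is only arithmetic on the figures quoted from [5] and [2]; the helper name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


tweets_per_second = per_second(230_000_000)    # figure quoted from [5]
likes_per_second = per_second(2_700_000_000)   # figure quoted from [2]
print(round(tweets_per_second), round(likes_per_second))
```

These are averages over a whole day; real traffic is bursty, so peak rates would be higher.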
-
With all this data, what kinds of questions can we ask?
How is data from one data set related to data in another?
Are the relationships one-to-one, one-to-many, or many-to-many?
Is the data “clean” or not?
What are we trying to find from the data?
The details of the questions depend on the data and what we are interested in finding.
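The relationship question can be checked mechanically once two data sets share a key. A minimal sketch, assuming each data set has been reduced to a plain list of key values (the function name and the sample keys are hypothetical):

```python
from collections import Counter


def cardinality(left_keys, right_keys) -> str:
    """Classify how the keys shared by two data sets relate to each other."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    shared = left_counts.keys() & right_counts.keys()
    if not shared:
        return "no shared keys"
    # A side is "many" if any shared key appears on it more than once.
    left_many = any(left_counts[key] > 1 for key in shared)
    right_many = any(right_counts[key] > 1 for key in shared)
    if left_many and right_many:
        return "many-to-many"
    if left_many or right_many:
        return "one-to-many"
    return "one-to-one"


# Hypothetical customer IDs in a customer table and in an orders table.
print(cardinality([101, 102, 103], [101, 101, 102]))
```

Keys that appear on only one side are ignored here; counting them would be one way to start on the "is the data clean?" question.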
-
Some questions are easily stated, . . .
Which of these questions are amenable to Big Data processing (and why)?
1 a[i] = b[i] + c[i]
2 a[i] = f(b)
3 a[i] = a[i-1] + b[i-1]
4 a = b + c
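One possible reading of the first and third questions, sketched in code. In question 1 every element depends only on its own index, so the work can be cut into chunks that run independently. In question 3, as written, each element needs the previous one. The function names, the chunk size, and the use of threads are assumptions for illustration; threads here show independence of the chunks, not a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def add_chunk(pair):
    """Add two equally sized chunks element by element."""
    b_chunk, c_chunk = pair
    return [x + y for x, y in zip(b_chunk, c_chunk)]


def elementwise_sum(b, c, chunk_size=2):
    """Question 1: a[i] = b[i] + c[i]. Chunks do not depend on each other."""
    chunks = [(b[i:i + chunk_size], c[i:i + chunk_size])
              for i in range(0, len(b), chunk_size)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(add_chunk, chunks))  # map keeps chunk order
    return [value for chunk in results for value in chunk]


def running_sum(a0, b):
    """Question 3: a[i] = a[i-1] + b[i-1]. Each step waits for the last one."""
    a = [a0]
    for i in range(1, len(b) + 1):
        a.append(a[i - 1] + b[i - 1])
    return a


print(elementwise_sum([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
print(running_sum(0, [1, 2, 3]))
```

The recurrence in question 3 is a running total, and parallel prefix-sum algorithms exist for it, so "sequential as written" is not the same as "impossible to parallelize".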
-
Does the tweet sentiment change over time?
-
Who sends what type of tweet?
-
Where do tweets come from?
-
Pragmatic and practical
A pragmatic definition
“. . . big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”
Mayer-Schönberger and Cukier [10]
-
A practical definition based on “people” time.
If:
Your data won’t fit into one machine or application, or
You are waiting too long for an answer
then:
You have a Big Data problem that requires Big Data tools and techniques.
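The rule above can be written as a small predicate. This is only a sketch: the function name, the parameter names, and the numbers in the usage lines are hypothetical, and what counts as "too long" is a judgment left to the people doing the waiting.

```python
def is_big_data_problem(data_bytes: int,
                        machine_capacity_bytes: int,
                        wait_seconds: float,
                        acceptable_wait_seconds: float) -> bool:
    """True if the data will not fit on one machine OR the answer takes too long."""
    too_big = data_bytes > machine_capacity_bytes
    too_slow = wait_seconds > acceptable_wait_seconds
    return too_big or too_slow


GIB = 1024 ** 3
# Hypothetical case: 40 GiB of data, 16 GiB available, answer arrives in 30 s.
print(is_big_data_problem(40 * GIB, 16 * GIB, 30.0, 60.0))
```

Either condition alone is enough, which matches the "or" in the definition.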
-
A simple idea in pictures
“In addition to the foundational and translational skills training that students receive, they would also benefit from a better understanding of ethics and social context of data . . . ”
NAS [11]
Image from [14].
-
Ethics in Big Data [11]
Data confidence – avoiding overconfidence and the inclination to draw stronger-than-appropriate conclusions
Data context – understand the context of data sets before they are processed
Fairness – treat all [data] equitably and avoid bias
Privacy – privacy with respect to how data are collected and analyzed
Stewardship – supervision of a data set at all stages of existence
Validity – ensure that the data set contains valid information
This area could be a full credit college course.
-
Q & A time.
Q: Name two families whose kids won’t join the Marines.
A: The Halls of Montezuma and the Shores of Tripoli.
-
What have we covered?
Big Data is all around us.
Big Data is about volume, variety, velocity, and getting answers quickly.
Some Big Data questions are easy to state, but impossible to answer.
Dealing with Big Data can raise real ethical questions.
Next: What is R?
-
References (1 of 4)
[1] Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, and Michael Franklin, Challenges and Opportunities with Big Data, Purdue e-Pubs (2011).
[2] Anson Alexander, Facebook User Statistics 2012 [Infographic], ansonAlex.com (2012).
[3] Jules J Berman, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Newnes, 2013.
[4] Applied Innovations, Track website visitors, http://www.appliedi.net/blog/track-website-visitors/, 2010.
-
References (2 of 4)
[5] Joab Jackson, The Big Promise of Big Data, Business Software (2012).
[6] James Klurfeld, Making sense of the campaign: The truth about polling, http://drc.centerfornewsliteracy.org/resource/making-sense-campaign-truth-about-polling, 2016.
[7] Doug Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, META Group Research Note 6 (2001).
[8] Lee Badger, David Bernstein, Robert Bohn, US Government Cloud Computing Technology Roadmap Volume I, Tech. report, National Institute of Standards and Technology, 2014.
-
References (3 of 4)
[9] John DC Little, A Proof for the Queuing Formula: L = λW, Operations Research 9 (1961), no. 3, 383–387.
[10] Viktor Mayer-Schönberger and Kenneth Cukier, Big data: A revolution that will transform how we live, work, and think, Houghton Mifflin Harcourt, 2013.
[11] National Academies of Sciences, Engineering, and Medicine, Envisioning the data science discipline: The undergraduate perspective: Interim report, National Academies of Science, 2017.
[12] Philip Russom, Big Data Analytics, TDWI Best Practices Report, Fourth Quarter (2011).
-
References (4 of 4)
[13] Andy Shaw, Leading edge or bleeding edge? (reflecting on innovation), http://poopengineer.blogspot.com/2015/04/leading-edge-or-bleeding-edge.html, 2015.
[14] European Union Staff, Ethics, https://ec.europa.eu/programmes/horizon2020/en/h2020-section/ethics, 2017.
[15] Mario F Triola, Essentials of statistics, Pearson Addison Wesley, Boston, MA, USA, 2008.
[16] YouTube, Statistics, http://www.youtube.com/yt/press/statistics.html.
-
Files of interest
1 The V’s of Big data
-
How Many Vs are there in Big Data?
Chuck Cartledge
Sunday 6th March, 2016 at 13:32
Contents
1 Introduction
2 Big Data Vs
3 Conclusion
List of Tables
1 Big Data Vs.
1 Introduction
Doug Laney is widely credited with originating the first big Vs of Big Data: volume, variety, and velocity. Since then the number of Vs has grown and grown. I will list all the Vs that I can find in chronological order until I get tired, to see if there is a pattern in their growth.
2 Big Data Vs
Table 1 chronicles some of the Big Data Vs.
-
Table 1: Big Data Vs.

| Num. | Year | V | Definition | Source |
|------|------|---|------------|--------|
| 1 | 2001 | Variety | . . . no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics. | [8, 10] |
| 2 | 2001 | Velocity | E-commerce has also increased point-of-interaction (POI) speed and, consequently, the pace data used to support interactions and generated by interactions. | [8] |
| 3 | 2001 | Volume | E-commerce channels increase the depth/breadth of data available about a transaction (or any point of interaction). | [8] |
| 4 | 2013 | Validity | . . . is the data correct and accurate for the intended use. | [2, 9, 10, 11, 14] |
| 5 | 2013 | Value | How to determine the prescriptive value of data? | [2, 5, 9, 13, 14, 15, 7, 6, 3, 1] |
| 6 | 2013 | Variability | Many options or variable interpretations can confuse interpretation. | [2, 5, 10, 13, 15] |
| 7 | 2013 | Veracity | . . . to the biases, noise and abnormality in data. | [2, 5, 9, 11, 14, 15, 12, 6, 3, 4, 1] |
| 8 | 2013 | Viability | . . . can the data be analyzed in a way that makes it decision-relevant? | [5, 10] |
| 9 | 2013 | Virility | . . . Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events. | [15] |
| 10 | 2013 | Viscosity | . . . used to describe the latency or lag time in the data relative to the event being described. | [15] |
| 11 | 2013 | Visibility | . . . the state of being able to see or be seen - is implied. | [9, 14, 10] |
| 12 | 2013 | Visualization | Making all that vast amount of data comprehensible in a manner that is easy to understand and read. With the right analyses and visualizations, raw data can be put to use otherwise raw data remains essentially useless. | [13] |
| 13 | 2013 | Volatility | . . . how long is data valid and how long should it be stored. | [10, 11] |
| 14 | 2014 | Vagueness | . . . confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.) | [2] |
| 15 | 2014 | Venue | . . . distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud. | [2] |
| 16 | 2014 | Vocabulary | . . . schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance. | [2] |
| 17 | 2015 | Vincularity | . . . it implies connectivity or linkage. | [10] |
| 18 | 2015 | Visible | We live in an increasingly visual world and the statistics of increase in the number of images and videos shared on the Internet is staggering. | [10] |
| 19 | 2015 | Vitality | . . . criticality of the data is another concept that is crucial and is embedded in the concept of Value. | [10] |
-
3 Conclusion
Doug Laney’s initial 3Vs were based on his experiences in a business mergers and acquisitions environment. The terms and ideas were easy to understand and to relate to. Many, many people have hopped on the “V” bandwagon, and will probably continue to for the foreseeable future. From a practical standpoint, if your data won’t fit in one machine, or the machine takes too long to return an answer, then you have a Big Data problem, regardless of how many Vs apply.
An alternative set of attributes of Big Data, based on [16], is:
Size: the volume of the datasets is a critical factor, or
Complexity: the structure, behaviour and permutations of the datasets is a critical factor, or
Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.
Any two of these attributes meet the requirements of a Big Data problem.
-
References
[1] Marcos D Assunção, Rodrigo N Calheiros, Silvia Bianchi, Marco AS Netto, and Rajkumar Buyya, Big data computing and clouds: Trends and future directions, Journal of Parallel and Distributed Computing 79 (2015), 3–15.
[2] Kirk Borne, Top 10 big data challenges - a serious look at 10 big data vs, https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s, 2014.
[3] Yuri Demchenko, Paola Grosso, Cees De Laat, and Peter Membrey, Addressing big data issues in scientific data infrastructure, Collaboration Technologies and Systems (CTS), 2013 International Conference on, IEEE, 2013, pp. 48–55.
[4] Xin Luna Dong and Divesh Srivastava, Big data integration, Data Engineering (ICDE), 2013 IEEE 29th International Conference on, IEEE, 2013, pp. 1245–1248.
[5] Seth Grimes, Big data: Avoid ’wanna v’ confusion, http://www.informationweek.com/big-data/big-data-analytics/big-data-avoid-wanna-v-confusion/d/d-id/1111077?, 2013.
[6] Pascal Hitzler and Krzysztof Janowicz, Linked data, big data, and the 4th paradigm, Semantic Web 4 (2013), no. 3, 233–235.
[7] Stephen Kaisler, Frank Armour, Juan Antonio Espinosa, and William Money, Big data: Issues and challenges moving forward, System Sciences (HICSS), 2013 46th Hawaii International Conference on, IEEE, 2013, pp. 995–1004.
[8] Doug Laney, 3d data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).
[9] Rob Livingstone, The 7 vs of big data, http://rob-livingstone.com/2013/06/big-data-or-black-hole/, 2013.
[10] Rajiv Maheshwari, 3 vs or 7 vs - whats the value of big data?, https://www.linkedin.com/pulse/3-vs-7-whats-value-big-data-rajiv-maheshwari, 2015.
[11] Kevin Normandeau, Beyond volume, variety and velocity is the issue of big data veracity, http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/, 2013.
[12] Wullianallur Raghupathi and Viju Raghupathi, Big data analytics in healthcare: promise and potential, Health Information Science and Systems 2 (2014), no. 1, 3.
[13] BI Staff, Why the 3vs are not sufficient to describe big data, https://datafloq.com/read/3vs-sufficient-describe-big-data/166, 2013.
[14] University of Technology Staff, The 7 vs of big data, http://mbitm.uts.edu.au/feed/7-vs-big-data, 2013.
[15] Bill Vorhies, How many vs in big data the characteristics that define big data, http://data-magnum.com/how-many-vs-in-big-data-the-characteristics-that-define-big-data/, 2013.
[16] Jonathan Stuart Ward and Adam Barker, Undefined by data: A survey of big datadefinitions, arXiv preprint arXiv:1309.5821 (2013).