Big Data Analytics - Universität Hildesheim · I \Big data is high-volume, high-velocity and...
Transcript of Big Data Analytics - Universität Hildesheim · I \Big data is high-volume, high-velocity and...
Big Data Analytics
Big Data Analytics
Prof. Dr. Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)Institute of Computer Science
University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
[based on original slides by Lucas Rego Drumond, ISMLL 2014]
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25
Big Data Analytics
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25
Big Data Analytics 1. What is Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25
Big Data Analytics 1. What is Big Data?
What is Big Data?
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25
Big Data Analytics 1. What is Big Data?
What is Big Data?
Some definitions:
I “A collection of data sets so large and complex that it becomesdifficult to process using on-hand database management tools ortraditional data processing applications.”http://en.wikipedia.org/wiki/Big data
I “Big data is high-volume, high-velocity and high-varietyinformation assets that demand cost-effective, innovative forms ofinformation processing for enhanced insight and decision making.”www.gartner.com/it-glossary/big-data/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 2 / 25
Big Data Analytics 1. What is Big Data?
Big Data — Dimensions (the “4 Vs”)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 3 / 25
Big Data Analytics 1. What is Big Data?
What is Big Data?
Big Data is about:
I Storing and accessing large amounts of (unstructured/complex) dataI querying
I Processing high volume data streams
I Decision support based on large dataI visualization, reportingI navigation / query interfacesI contextualization / sense making
I Building predictive models trained on large data
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 4 / 25
Big Data Analytics 2. Where to Find Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 5 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data in Physics (CERN)
I Large Hadron Collider has collected data from over 300 trillionproton-proton collisions
I Approx. 25 Petabytes per year
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 5 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data in Genetics (Ensembl)
I Ensembl database contains the genome of humans and 50 otherspecies
I “only” 250 GB
source: http://www.ensembl.org/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 6 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data in the Web (Google)
I 3.3 billion searches per day (on average)
I 30 trillion unique URLs identified on the Web
I 20 billion sites crawled a day
I In 2008 Google processed more than 20 Petabytes of data per day
Source:
– http://searchengineland.com/google-search-press-129925
– Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM51, 1 (January 2008), 107-113.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 7 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data in Social Media (Facebook)
I 1.28 billion users (1.23 billion monthly active in January 2014)I Size of user data stored by Facebook: 300 PetabytesI Average amount of data that Facebook takes in daily: 600 TerabytesI Size of Facebook’s Graph Search database: 700 Terabytes
Source: http://allfacebook.com/orcfile b130817
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 8 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data in Social Media (Twitter)
I Average number of tweets per day: 58 millionI Number of Twitter search engine queries every day: 2.1 billionI Total number of active registered Twitter users: 645,750,000
Source: http://www.statisticbrain.com/twitter-statistics/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 9 / 25
Big Data Analytics 2. Where to Find Big Data?
Big Data — Public Datasets
1000 Genomes Project DNA of 1700 humans 200 TBCommon Crawl Corpus 5G web pages 81 TB
Wikipedia / Freebase 1.9G subject/predicate/object triples 250 GBMillion Song Dataset audio features of 1M songs 280 GB
OpenStreetMap a map of earth 90 GB2000 US Census US census data 200 GB
PubChem library biological activities of small molecules 230 GB
NCDC weather data daily measurements from 9000 stations 20 GBOpen Library metadata of 20M books 7 GBTwitter 1.6G tweets 0.6 GB
CD 700 MB, DVD 4,7–17 GB, Blu-ray 25–100 GB, hard disc: 3–4 TB.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 10 / 25
Big Data Analytics 2. Where to Find Big Data?
How Large is 1 Petabyte (PB)?
I 35.7M years counted one byte per second,
I 254 years listening to music stored in CD quality (500MB/h)I 25 years watching DVDs
I 223.000 DVDs (a 4.7 GB)I but there are only 74.000 TV movies on IMDB !
(1.8M including TV episodes)
I can be stored on 341 harddisks a 3 TB/90e (30,000e)
I 96 days to read from standard harddisks sequentially (1030 MBits/s)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 11 / 25
Big Data Analytics 3. Big Data Applications
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 12 / 25
Big Data Analytics 3. Big Data Applications
What to do with Big Data?
Application examples:
I Test complex scientific hypotheses (physics, genetics)
I Index the web, including relevance feedback by users (web)
I Online personalized advertising (social media, esp. Google)
I Recommender systems (e-commerce, esp. Amazon)
I Media analysis, sentiment analysis, market research (social media)I e.g., Obama campaign 2012
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 12 / 25
Big Data Analytics 3. Big Data Applications
More Applications & Case Studies
I T-Mobile USA:integrated Big Data across multiple IT systems to combine customertransaction and interactions data in order to better predict customerdefections
I By leveraging social media data along with transaction data from CRMand billing systems, customer defections is said to have been cut in halfin a single quarter.
I US Xpress:collects data elements ranging from fuel usage to tire condition totruck engine operations to GPS information
I Optimal fleet management
I McLaren’s Formula One racing team:real-time car sensor data during car races
I Real time identification of issues with its racing cars
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 13 / 25
Big Data Analytics 4. How to analyze Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 14 / 25
Big Data Analytics 4. How to analyze Big Data?
How to handle Big Data? — The BI Approach
DataWarehouse
I Static databases (snapshots)
I Structured data
I Centralized approaches
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 14 / 25
Big Data Analytics 4. How to analyze Big Data?
How to handle Big Data? — The Distributed Approach
I Massive Parallelism
I Heterogeneous datasources
I Unstructured data
I Data streams
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 15 / 25
Big Data Analytics 4. How to analyze Big Data?
Challenges
Challenges to deal with large volumes of data:
I Store and query large amounts of data in a distributed environmentefficiently
I distributed file systemsI nosql databases, distributed databases
I Process distributed large data efficientlyI execution environments
I Scale / distribute machine learning techniquesI distributed learning algorithms (message passing)I ML execution environments
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 16 / 25
Big Data Analytics 4. How to analyze Big Data?
Execution Environments: MapReduce
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 17 / 25
Big Data Analytics 4. How to analyze Big Data?
Execution Environments: GraphLab
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 18 / 25
Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 19 / 25
Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim
Usage of Big Data in Cooperative Research ProjectsI transportation
I REDUCTION: Danish Taxi fleet(14.000 vehicles, 1.5 million trips, 2.2 billion GPS measurements)
I e-commerce / recommender systemsI NetFlix dataset: 100 million transactionsI Rossmann Online / Compra
I technology enhanced learningI Whizz Education
(1200 exercises, 250,000 students, 30 million interactions)
I engineering data miningI Rolls Royce: jet engine vibrationI Detectino: ground penetrating radar
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 19 / 25
Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim
ISMLL Compute Cluster
I 61 compute nodes
I 840 cores, 1288 threads
I 2.2 TB RAM
I 183 TB hard disk capacity
I 10 TFlopsI special nodes:
I database serverI coprocessor compute node
(240 cores, 960 threads,4 TFlops)
I software:I simple scheduler
(sun grid engine)I Map–Reduce (hadoop)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 20 / 25
Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim
Data Analytics — A New Study Programme in 2015term module type ECTS
1 Advanced Machine Learning lecture 6Modern Optimization Techniques lecture 6Programming Machine Learning lab 6Seminar Data Analytics I seminar 4application module I misc. 6
2 Advanced Database Technologies lecture 6Data and Privacy Protection lecture 3methodological specialization I lecture 6Distributed Machine Learning lab 6Seminar Data Analytics II seminar 4Project (part I) project 6
3 Planning and Optimal Control lecture 6methodological specialization II lecture 6Project (part II) project 9Seminar Data Analytics III seminar 4application module II misc. 6
4 Master thesis and colloquium thesis 30
Total 120
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 21 / 25
Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim
Data Analytics — A New Study Programme in 2015I solid education on all fundamental aspects of data analytics at the
state-of-the-artI machine learningI analytical database technology & execution modelsI planning and control
I several methodological specializations:I Bayesian Networks, Computer Vision, etc.
I integrated application areaI media systems, software engineering, environmental sciences, computer
linguistics, information sciences, psychology (several still requested)
I hands-on lab courses
I a deep and fun, two term integrated group project
I fully internationally targeted (completely in English)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 22 / 25
Big Data Analytics 6. Conclusions
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 23 / 25
Big Data Analytics 6. Conclusions
Big Data — Chances . . .
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 23 / 25
Big Data Analytics 6. Conclusions
Big Data — . . . and Risks
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 24 / 25
Big Data Analytics 6. Conclusions
ConclusionsI Big Data Analytics addresses the analysis of large and complex data
(volume, variety, velocity, veracity)I at Petabyte scale (1024 Terabytes)
I Big Data emerges naturally in many different domains(science, web, e-commerce, social media, robotics)
I Big Data requires massively parallel infrastructure to be analyzedtimely
I data centers with hundreds and thousands of compute nodesI distributed databases, nosql databasesI execution environments (MapReduce)
I To exploit big data in a principled and optimal way, machine learningmethods have to be scaled and distributed,
I requiring innovations at the level of the learning algorithms,I also requiring special ML execution environments.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 25 / 25