Big Data: Opportunities and Challenges
Raja Chiky – [email protected]
OUTLINE
• About me
• What is Big Data?
• Evolution of Business Intelligence
• Big Data opportunities
• Big Data challenges
• Conclusion
24/10/2014
About me
• Associate professor in Computer Science – LISITE-RDI
• Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
• Research field: large-scale data management
1. Real-time and distributed processing of various data sources
2. Use of semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large-scale systems
5. Modeling and validation of complex systems
Data sources (figure labels): heterogeneous and dynamic data streams; heterogeneous and static data; sensors
What is Big Data?
Big Data: Buzzword!
New era
Where is all this data coming from?
More and more connected things
So, what is Big Data?
Big Data Elements: Volume, Variety, Velocity
Volume of data created worldwide
§ Dawn of time to 2003: 5 EB
§ …
§ 2012: 2.7 ZB
§ 2015: 10 ZB (E)
§ 1 YB = 10^24 bytes
§ 1 ZB = 10^21 bytes
§ 1 EB = 10^18 bytes
§ 1 PB = 10^15 bytes
§ 1 TB = 10^12 bytes
§ 1 GB = 10^9 bytes
Velocity of data
§ Walmart handles 1M transactions per hour
§ Google processes 24 PB of data per day
§ AT&T transfers 30 PB of data per day
§ 90 trillion emails are sent per year
§ World of Warcraft uses 1.3 PB of storage
§ Facebook, when it had a user base of 900M users, had 25 PB of compressed data
§ 400M tweets per day in June '12
§ 72 hours of video are uploaded to YouTube every minute
Variety of data
§ Radio
§ TV
§ News
§ E-mails
§ Facebook posts
§ Tweets
§ Blogs
§ Photos
§ Videos (user and paid)
§ RSS feeds
§ Wikipedia
§ GPS data
§ RFID
§ POS scanners
§ …
+ Veracity (IBM): information uncertainty
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
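The decimal units listed above make it easy to check orders of magnitude. The following is a minimal sketch, not taken from the cited source; the helper name `convert` is a hypothetical choice for illustration. It expresses the quoted 2012 figure (2.7 ZB) in exabytes and compares it with the 5 EB figure.

```python
# Decimal byte units as listed on the slide (1 GB = 10^9 bytes, and so on).
UNITS = {
    "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}

def convert(value, src, dst):
    """Convert `value` from unit `src` to unit `dst` (hypothetical helper)."""
    return value * UNITS[src] / UNITS[dst]

if __name__ == "__main__":
    zb_2012_in_eb = convert(2.7, "ZB", "EB")  # 2012 volume expressed in EB
    growth = zb_2012_in_eb / 5                # ratio to the 5 EB figure
    print(f"2.7 ZB is about {zb_2012_in_eb:.0f} EB, roughly {growth:.0f} times 5 EB")
```

Each step in the unit table is a factor of 1000, so moving from EB to ZB alone accounts for three orders of magnitude.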
Key factors
• Cheap storage
  – Recording everything is not expensive anymore
• Cloud computing
  – Cheap, on-demand computing resources from anywhere in the world and for everyone
• Business reasons
  – New insights arise that give competitive advantage
• Data in various forms everywhere: IoT and IoE, social networks, open data
• The way we interact with each other and with data / information
• …
Transforming our daily lives
Then: One size fits all
Now: Personalization & targeted selling
Source: Big Data Trends by David Feinleib
Fitness
Then: Manual tracking
Now: Focus on the goal
Source: Big Data Trends by David Feinleib
Customer service
Then: Reactive customer service
Now: Proactive customer service
Source: Big Data Trends by David Feinleib
Customer service: 360-degree view of the customer
Questions (figure labels): Why? What? How? When/Where? Who?
Data types (figure labels): operational data, behavioral data, descriptive data, interaction data, contextual data
Big Data opportunities
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
Opportunities: big data use cases
360° view of the customer
• Integration of data from social networks, CRM, transactional data, etc.
• Example: T-Mobile, telecom operator -> 50% reduction in customer churn in one quarter
E-reputation
• Sentiment analysis, proactive monitoring of social networks
• Example: Nestlé, food group -> gained 4 places in the Reputation Institute's index thanks to 24/7 interaction
Optimization
• Predictive analysis for anomaly detection, process optimization using sensors and operational data
• Example: Union Pacific Railroad -> fewer train derailments, more train shipments, lower carbon emissions
Public security
• Monitoring social networks, integration of spatial data and sensors
• Example: Serious Request 2012 -> monitoring of crowd movements with Twitter and sensors, locating security forces, integration with GIS
Evolution of Business Intelligence
Figure: three generations of BI architectures (Static Data, Semantic Data, Stream (Big) Data), each drawn in layers: data sources, gathering information, store, user interaction, output
• Static data: databases; ETL/batch processing; data warehouse; ad-hoc queries and analytics; static report
• Semantic data: structured/unstructured data; semantic ETL/batch processing; triple store; flexible queries / SPARQL; visual analytics
• Stream (Big) data: sensors, static data and data streams; semantic ETL and stream processing; load shedding; databases/triplestores; knowledge enrichment; continuous queries / business rules; real-time analytics and real-time visual analytics; retroaction
What are Big Data Challenges?
Big Data workflow
1. Capture
2. Store
3. Analyze
4. Visualize
Challenges arise in all these steps
Challenges: Data Collection
• Heterogeneity of sources
  – Company databases => silos
  – Sensor networks, intelligent objects
  – Data streams: social networks, financial information, etc.
• Data velocity
• Data provenance and quality
Types of data used in Big Data initiatives
Chart categories: internal data, traditional sources, "new data"
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
Challenges: Data Collection – Velocity
Examples of stream sources: website logs, network monitoring, financial services, eCommerce, traffic control, power consumption, weather forecasting
What is a data stream?
• Golab & Özsu (2003): "A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."
• Massive volumes of data; items arrive at a high rate.
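Because a stream cannot be stored in its entirety, aggregates have to be computed in a single pass with bounded memory. The sketch below illustrates that idea with a running mean; the simulated source `sensor_stream` and the function names are assumptions made for the example, not part of any specific system.

```python
import random
from typing import Iterable, Iterator

def sensor_stream(n: int, seed: int = 0) -> Iterator[float]:
    """Simulated readings; stands in for an unbounded real-time source."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.uniform(15.0, 25.0)

def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, keeping only O(1) state."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean

if __name__ == "__main__":
    for i, m in enumerate(running_mean(sensor_stream(5)), start=1):
        print(f"after {i} items: mean = {m:.2f}")
```

Each item is seen once and then discarded; only the count and the current mean are retained, whatever the length of the stream.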
Data Stream Management Systems
DBMS vs. DSMS
Data model
  DBMS: permanent updatable relations
  DSMS: streams and permanent updatable relations
Storage
  DBMS: data is stored on disk
  DSMS: permanent relations are stored on disk; streams are processed on the fly
Query
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
Performance
  DBMS: large volumes of data
  DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing
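To make "windowing" and "continuous queries" concrete, here is a minimal sketch of a time-based sliding-window average that is re-evaluated on every arrival. It is plain Python, not the query syntax of any particular DSMS; the class name `SlidingWindowAvg` is hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque

class SlidingWindowAvg:
    """Continuous query: average of the values received in the last `width` seconds."""

    def __init__(self, width: float):
        self.width = width
        self.window = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def insert(self, ts: float, value: float) -> float:
        """Add one item and return the updated query result."""
        self.window.append((ts, value))
        self.total += value
        # Expire items that fell out of the window (ts - width, ts]
        while self.window and self.window[0][0] <= ts - self.width:
            _, old = self.window.popleft()
            self.total -= old
        return self.total / len(self.window)

if __name__ == "__main__":
    query = SlidingWindowAvg(width=10)
    for ts, value in [(0, 10.0), (5, 20.0), (12, 30.0)]:
        print(ts, query.insert(ts, value))
```

Unlike a one-time query, the result is refreshed as data arrives, and only the items inside the window are kept in memory.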
Challenges: Data Collection – Data provenance and quality
• Data provenance: provenance refers to the information that describes data in sufficient detail to facilitate reproduction and enable validation of results.
• Data quality: validity and consistency of the data. Is it up to date and fit for the targeted use case?
Source: Patrick McDaniel, Kevin Butler, Steve McLaughlin, Radu Sion, Erez Zadok, and Marianne Winslett, Towards a secure and efficient system for end-to-end provenance, 2010.
Challenges in data storage
• Large amounts of data
  – Need to use a highly distributed architecture
• Massive queries
  – Avoid joins since they are very time-consuming
• Evolutionary schema
  – Flexibility and scalability
• Predictable and low latency
• High availability
• Elasticity: horizontal extensibility (scale out)
• No need for: transactions / strong consistency / complex queries
Limitations of RDBMS
"If the only tool you have is a hammer, you tend to see every problem as a nail." (Abraham Maslow)
NoSQL: Not Only SQL, Not Only Relational
• NoSQL => Not Only SQL
• SQL must not die, but other storage solutions should be considered for specific applications
• Exact name: non-relational databases
CAP theorem (E. Brewer, 2000; proved by S. Gilbert and N. Lynch, 2002)
"CAP theorem": C-A-P: choose two.
• C: Consistency
• A: Availability
• P: Partition tolerance
Claim: every distributed system is on one side of the triangle.
• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others.
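The AP and CP behaviors can be illustrated with a toy model. The sketch below simulates a register replicated on two nodes and a switchable network partition; the class `TwoReplicaStore` and its `mode` flag are hypothetical names for this illustration and do not correspond to any real database API.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None

class TwoReplicaStore:
    """Toy two-replica register; `mode` is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, replica, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: deny service without the peer's agreement
                raise RuntimeError("unavailable: cannot reach peer replica")
            # Availability first: accept locally, replicas may now diverge
            replica.value = value
            return
        replica.value = value
        peer.value = value  # no partition: replicate synchronously

    def read(self, replica):
        # Simplification: reads are always served locally
        return replica.value

if __name__ == "__main__":
    ap = TwoReplicaStore("AP")
    ap.write(ap.a, 1)
    ap.partitioned = True
    ap.write(ap.a, 2)
    print(ap.read(ap.a), ap.read(ap.b))  # the two replicas now disagree
```

In CP mode the same write during a partition raises an error instead, so both replicas keep returning the last agreed value.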
NoSQL Taxonomy
Data models: key-value, document, column, graph
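As a rough illustration of how the four families differ, the sketch below stores the same small fact in each data model using plain Python structures. No real database is involved, and the record layout and the helper `followers_of` are assumptions made for the example.

```python
import json

# The same fact ("user u1, named Ada, follows user u2") in four data models.

# Key-value: an opaque value looked up by a single key
key_value = {"user:u1": '{"name": "Ada", "follows": ["u2"]}'}

# Document: the store understands the nested structure of each value
document = {"u1": {"name": "Ada", "follows": ["u2"]}}

# Column (column family): row key -> column family -> column -> value
column = {"u1": {"profile": {"name": "Ada"}, "follows": {"u2": True}}}

# Graph: nodes and labeled edges are first-class
graph = {
    "nodes": {"u1": {"name": "Ada"}, "u2": {}},
    "edges": [("u1", "FOLLOWS", "u2")],
}

def followers_of(g, user):
    """Graph-style traversal: who follows `user`?"""
    return [src for (src, label, dst) in g["edges"]
            if label == "FOLLOWS" and dst == user]

if __name__ == "__main__":
    # With a key-value store, the client must parse the value itself
    print(json.loads(key_value["user:u1"])["name"])
    print(followers_of(graph, "u2"))
```

The trade-off is between simplicity of access (key-value) and the richness of queries the store can answer on its own (document, column, graph).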
Challenges in Data Analytics
• Problems in large-scale analytics
  – Distributed computation efficiency
  – Evaluate performance gains from distribution
  – Bringing data to the processor
  – Efficient parallel algorithms (statistics, summaries)
• Speed analytics
  – Streaming computations
  – Load balancing
  – Load shedding
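Load shedding means deliberately discarding part of the input when arrivals exceed processing capacity, so that the system degrades gracefully instead of crashing. The sketch below shows the simplest possible policy, dropping new arrivals when a bounded buffer is full; the class name `LoadShedder` is hypothetical, and more elaborate policies (for example random sampling or value-based dropping) are not shown.

```python
from collections import deque

class LoadShedder:
    """Bounded buffer that drops incoming items when it is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0  # number of items shed so far

    def offer(self, item) -> bool:
        """Try to enqueue `item`; return False if it was shed."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(item)
        return True

    def poll(self):
        """Remove and return the oldest buffered item, or None if empty."""
        return self.queue.popleft() if self.queue else None

if __name__ == "__main__":
    shedder = LoadShedder(capacity=2)
    for item in range(5):  # burst of 5 arrivals with no consumer
        shedder.offer(item)
    print(len(shedder.queue), shedder.dropped)
```

Counting the dropped items matters in practice: results computed on the remaining data are approximate, and the drop count indicates how approximate.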
Challenges in Data Access and Visualization
• The main goal of data visualization is to communicate information clearly and effectively through graphical means
• Deliver the results of the analytics workflow to fast-response systems such as real-time query interfaces
"Visualization is a form of knowledge compression" - David McCandless
Big Data: Technological challenges
• Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, Hadoop/MapReduce ecosystem
• New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI
• Data processing: supercomputers, distributed or massively parallel computing
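The MapReduce programming model mentioned above can be sketched in a few lines. The example below is a single-process word count that mimics the map, shuffle and reduce phases; it does not use Hadoop's API, and the function names are chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word of one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return reduce_phase(shuffle(mapped))

if __name__ == "__main__":
    print(word_count(["big data", "Big challenges"]))
```

In a real cluster the map calls run in parallel on separate input splits and the shuffle moves data across the network, which is where most of the distribution cost lies.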
Conclusion: Big Data challenges
• Semantic information aggregation
  – Information aggregation: "too much data to assimilate but not enough knowledge to act"
• Distributed and real-time processing
  – Design of real-time and distributed algorithms for stream processing and information aggregation
  – Distribution and parallelization of data mining algorithms
  – Optimizing resources
• Visual analytics and user modeling
  – Dynamic user model
  – Novel visualizations for very large datasets
• Data protection
IEEE Metro Area Smart Tech Workshop on Distributed Data Streaming, Dec 5, 2014, Paris
• 08h00: Registration - Breakfast
• 08h50: Room L012 - Welcome
• 09h00: Room L012 - Introduction to Distributed Data Streaming - Speaker: Raja Chiky (ISEP)
• 10h15: Coffee break
• 10h45: Room L012 - Real World Issues in Supervised Classification for Data Streams - Speaker: Vincent Lemaire (Orange Labs)
• 11h30: Room L012 - Use Case 1 - Finance - Speaker: Antoine Chambille (Quartet FS)
• 12h00: Room L012 - Use Case 2 - Smart metering - Speaker: Marie-Luce Picard (EDF R&D)
• 12h30: Lunch offsite
• 14h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 1st part
• 15h30: Coffee break
• 16h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 2nd part
• 17h30: Reception onsite
Thanks to Marie-Aude Aufaure (ECP) and Sylvain Lefebvre (ISEP)
Linked & Big Data Academic Chair
Big Data: Volume, Variety, Velocity, Veracity, … Value
– Integrate, aggregate, analyze and visualize large data sets, whatever their type, provenance or flow speed …
Linked Data: Web of data, Semantic Web
– A set of principles and good practices for linking, publishing and searching web data
– Structure and semantically enrich RDF data, with very high scalability
-> Big Linked Data / Linked Big Data
Our value proposition
– Semantic aggregation from textual and non-textual streams
– Manage semantic heterogeneity, real-time and distributed processing
– Ensure data quality and veracity
– Visual analytics
Semantic Technologies
Living Lab