February 2016 Lecture Note in Computer Science: An Introduction to Big Data Technologies

Dr. M. Naci Akkøk

Chief Architect, Oracle Nordic, Vollsveien 2A, 1366 Lysaker, Norway

[email protected]

https://www.oracle.com/no/index.html https://no.linkedin.com/in/dr-m-naci-akkøk-4a8a6b1

Abstract. There are always new directions and new buzzwords in Computer Science and its brother Information Technology. Big Data is one of the newcomers that has started to get market attention recently. Market attention, as in many other cases, naturally led to vendors offering Big Data products, creating the illusion that Big Data is a technology. This lecture note offers a brief history of Big Data and (indirectly) defines it as a number of technologies attempting to address certain modern needs and requirements. One of the main takeaways of this lecture note is a brief, experience-based account of where and how these various technologies can (or should) be used in addressing Big Data needs.

1 Introduction, definition

What we call Big Data has been around for some time, though it started receiving market attention relatively recently. There are many claims as to when and where it started, but if we look at the use of the term Big Data, it seems to have first been used in two publications in 2008: a report from the Computing Community Consortium [1] and an editorial in Nature [2]. There is a short but good summary of the history of Big Data up until the end of 2013 by Forbes [3], which I will not repeat here.

1.1 The claim: Big Data as perceived

When the market value of Big Data was estimated at around US$ 10 billion in 2010-2011, many major vendors (such as Oracle Corporation, IBM, Microsoft and others) started to offer Big Data products. The products were primarily software packaged or built around Hadoop, an open-source software platform managed today by the Apache Software Foundation [4].


Hadoop was created by Douglass Read Cutting (Doug Cutting), using Google Labs' MapReduce algorithm [5]. The algorithm, published in 2004, enabled computations to be parallelized across clusters of servers, offering a potentially affordable solution to large-scale data processing and horizontal scalability challenges.

Hadoop is not only MapReduce; it is a software library that was initially built with MapReduce as its basis. This is how Hadoop is described by the Apache Software Foundation:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

A more beginner-friendly description of Hadoop and its components can also be found on the ReadWrite technology news site [6]. Initially, Hadoop was the perceived definition of Big Data. Later on, other products/technologies were introduced for large-scale data storage/retrieval and processing, claiming to be the “answer” to Big Data.

1.2 The reality: Big Data as applied

The claim that Big Data is a single product or a single technology like Hadoop is a misleading simplification. The reality is closer to Big Data being a set of (modern) needs that require a set of (newer or improved) technologies depending upon the problem at hand. Big Data is therefore often defined through four major problem areas called the “4 Vs”, depicted below.

Fig. 1. The four challenges defining Big Data in practice (“the four Vs”)


• Volume is intuitive: it points to the challenge of very large amounts of data and the need to store, retrieve and process them. This challenge relates to the areas of computing we know as “large-scale” and “high-capacity” computing.

• Velocity represents the challenges caused by, for example, streaming or time-series data, requiring acquisition without loss and real-time or near-real-time (live) processing, as in “real-time” or “high-performance” computing.

• Variety represents the challenge of disparate data sources with varying data formats, technology platforms etc. Often, variety is also the challenge of connecting to and using/processing data from its source directly, without having to pull all data into a central storage. This challenge relates to the areas of computing we know as “heterogeneous/autonomous distributed database systems”. Variety may also refer to the fact that the data itself can be structured, semi-structured or unstructured.

• Value represents the challenge of processing high-volume, fast and varied data to get value out of it. Analysis of such data is a challenge in itself and is often referred to as “Big Data Analytics”, with categories of its own such as:

o Business Intelligence (BI) or reporting, which applies to volume and possibly variety of data, but is not necessarily a velocity challenge, because it often operates on stored data and in a batch-processing style.

o Data mining or discovery, which helps analyze data from different perspectives for categorization, anomaly detection, identification of (not immediately apparent) patterns and relationships etc. One popular use of data discovery or data mining is predictive analytics, often used to predict events that need to be handled before they happen. Typically, discovery is like reporting, i.e., performed in batch on volume data from a variety of sources.

o Live data (event or stream) processing is analyzing real-time or near-real-time data as it streams through, often for monitoring purposes, for detecting deviations and other live processing needs. This clearly involves the velocity aspect, but can also involve both variety and volume.


1.3 The typical Big Data solution

Thus, Big Data is more of a solution to a certain set of requirements, which typically involve a combination of volume, velocity, variety and value as described above. A generic Big Data solution may involve the following steps (Fig. 2):

Fig. 2. Generic Big Data solution and platform

The stages are self-explanatory to a large degree, and not even new, but there are some points worth noting. Starting from the left and bottom:

• Big Data sources can be many: social media, databases, documents, sensors, other data-generating equipment etc. The “sensor and connector development/test” stage represents the data-source side of the flow of information, which is an inherent part of any Big Data value chain and can be seen as part of the platform (middleware, security etc.).

• The middleware platform itself is needed to integrate the various stages and provide functionality like security, and also because a Big Data solution, like many other modern systems, is often a distributed solution.

• The data acquisition stage may not always mean high-speed data acquisition (“Velocity”) or a “Variety” of sources as implied in Fig. 2, but Velocity and Variety are often the factors that create the Big Data challenge as compared to regular systems. The challenge is often two-fold: get the data without losing any, and monitor/process the data in real time, meaning fast transformations (including some real-time analysis), storage/ingestion and visualization in real time. Note that the designation “real-time” may not be accurate, and that “live” (meaning almost real-time) is probably a better term.

• The data storage and archival stage is what is often perceived as the real Big Data challenge: being able to store very large amounts of data in an affordable and scalable manner, creating live (on-line) archives or data reservoirs that can be accessed and processed with sufficient performance, unlike off-line archives (for example tape archives). Yet another note: even if the incoming data is neither fast nor large in volume, requirements like “never having to delete data and data-logs” may justify the need for Big Data storage.

• The analysis and reporting stage is where the “Value” is extracted from the data. It implies high-performance processing of large amounts of data for reporting, for data discovery or data mining, for predictive analytics, for machine learning etc. It can also imply a need to compare or correlate live data (from the initial acquisition stage) with large amounts of stored historical data, as in cases of live tracking and fraud detection.

So the acquire-store-analyze chain is not new at all, as mentioned above. What is new is the need for very high-performance and very high-volume storage and processing of data, often from disparate sources, and often for more than classical BI-style analysis, which requires newer analysis tools like discovery, prediction and learning tools that can operate on fast and voluminous data.
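
As a toy, single-machine sketch of that chain (the stages, the record format and the in-memory “archive” are all hypothetical stand-ins; a real solution distributes each stage across many machines), the flow can be pictured as three composable steps in Python:

    def acquire():
        """Acquisition stage: receive raw records from sensors, logs, feeds, ..."""
        yield {"ts": "2016-02-09T12:00:00Z", "value": 21.5}
        yield {"ts": "2016-02-09T12:00:01Z", "value": 22.5}

    def store(records, archive):
        """Storage stage: ingest into an affordable, scalable store (here: a list)."""
        for record in records:
            archive.append(record)
            yield record

    def analyze(records):
        """Analysis stage: extract value, here a trivial average of the readings."""
        values = [r["value"] for r in records]
        return sum(values) / len(values)

    archive = []
    print(analyze(store(acquire(), archive)))   # 22.0
    print(len(archive))                         # 2 records kept for later analysis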

The generic Big Data solution captures some of the essentials, but we need to take a brief look at some (actual) example use-cases to understand the challenges and the composite nature of the solutions better.

1.4 Some Big Data example use-cases and their requirements

The following are taken from actual cases and abstracted to form something like a pattern representing a category of Big Data use-cases:

• Life-critical systems testing and monitoring: A typical example here is aircraft testing, where practically all components of the aircraft generate data through numerous sensors during laboratory or flight testing. Depending upon the type and duration of the test, data may be coming in at a speed of several gigabytes per second and for several hours per test. Assuming about 1 GB/sec (not the maximum for aircraft) for 2 hours, this would mean about 7200 GB, or approximately 7 terabytes, per test flight. In this category of use-cases, both Velocity and Volume are relevant.

• Traffic regulation/facilitation & road safety: This is a slightly different category of use-cases but with similar needs, where both Velocity and Volume are relevant. There are many potential use-cases in this category:


o Cases where road sensors, vehicle sensors and GPS information are used to sense and regulate traffic flow,

o Cases where certain traffic signs can be removed, with GPS information used to show the relevant sign(s) on, for example, heads-up displays in the vehicles,

o Cases where a dangerous and icy turn experienced by a vehicle can be transmitted automatically as a warning to a road alarm center and to other following vehicles,

o Cases where a private daily driving route can be made public to a select trusted group so that they can choose to be picked up on the way, enabling effective car-pooling and sharing.

Fig. 3. Some actual road & traffic related use-cases

Most of these use-cases require continuous collection and analysis/monitoring of vehicle, road and GPS data, which naturally builds up to very large volumes over time.

This would become a very long lecture note if we tried to detail every single use-case category, so we will simply list some here, letting the reader speculate on how much volume, velocity, variety and value each use-case category may represent:

• Fraud detection: Fraud detection in credit/debit card operations may not have the same 4V requirements as tax fraud operations. Both require or imply volume, at least over time. Which one do you think would also require velocity? Variety? Value (or which types of analysis/processing)?

• Performance/quality monitoring: Student learning patterns and performance can be monitored/analyzed through monitoring student devices, the quality of a building and its materials through data from embedded sensors, errors in a manufacturing process for the purpose of improvement, or the success of a marketing/sales process. Again, each case may require different combinations of the 4 Vs.

• Social monitoring and behavior monitoring/planning: This has been around for a while, and implies monitoring of social media as well as system events from systems like Enterprise Resource Planning (ERP), Customer Relationship Management (CRM) and Customer Experience (CX) systems for understanding customer or user behavior, product acceptance, product quality and return rates, customer satisfaction etc. Batch-style analysis/processing is often the solution here, though certain related cases may require close to real-time processing.

• Predictive analytics: Relates to areas like predictive maintenance, where streaming and stored information is used to predict failures or other relevant anomalies so as to be able to remedy the situation without costly breaks in operation. Not all predictive analytics cases fall under the Big Data category, but some definitely do. This is especially the case when both streaming and historical data are needed, for example for comparing live data to historical data and for detecting patterns in order to predict and to improve prediction algorithms (through machine learning). The solution will then have to involve both large volumes of data and high-performance processing of the stored volume as well as of the incoming high-speed data.

• Innovative defense: This is typical of cases like network defense, where complex relationships and patterns have to be identified for live (preventive) actions to be decided.

This list of use-cases (actually use-case categories) is not exhaustive. But the list does show that velocity will definitely be as important a challenge as volume in many cases, and that velocity will often mean volume, but not the other way around. Do also note that velocity refers both to high-speed data (ingestion, acquisition) and to the speed of processing of high-speed and large-volume data. These example use-cases also imply that variety may be “natural” in many big and real-time data processing cases and should be watched out for, and that value extraction will need tools beyond regular BI and analysis tools.

2 Big Data technologies

The fact that Big Data is more of a need to respond to a combination of volume, velocity, variety and value requirements has naturally kicked off the creation of new approaches and technologies in all those areas. One obvious technology that had to be improved was database technology, mainly because the database is one of the main components in the value chain.

Next, we are going to look at how database technologies have responded to the composite and complex requirements/needs of Big Data in all four areas, also because many perceive these newer database technologies as Big Data technologies.

2.1 NoSQL databases in Big Data

The term NoSQL, as we all know by now, does not mean “not SQL” but stands for “not only SQL”, indicating the need to supplement the SQL world, or relational database management system (RDBMS) technology, with alternative DBMS technologies to support modern high-capacity (HC) and high-performance (HP) data management needs, like the Big Data and fast data needs we have been looking at above.

Fig. 4. NoSQL DBMS technologies


Figure 4 above lists the five main categories of NoSQL DBMS technologies that are of interest in the Big (and fast) Data domain. We will not be looking at every category in detail, but will try to provide a brief description of three of them:

• Graph family database management systems
• Key-Value Pair (KVP) family database management systems
• Hadoop Distributed File System (HDFS) and other distributed file systems families

2.1.1 Graph family database management systems

You may have met this kind of DBMS under different names, such as triple-stores, RDF databases or property graphs.

First, an important note: we shall not be focusing upon property graphs here, primarily because there is no well-established standard for property graphs. For those who are interested, take a look at Tinkerpop (now part of the Apache Software Foundation) on GitHub [7], which is considered to be the de facto standard for property graphs.

Instead, we will be looking at the W3C RDF standard. RDF stands for Resource Description Framework [8], and is a standard maintained by W3C within its Semantic Web standards [9].

Below (Fig. 5) is a sketch of a “Subject-Predicate-Object” (SPO) triplet that forms the basic element of a graph, together with a hypothetical graph. The logic of an SPO triplet is that it “answers” true if the predicate holds between a given subject and object instance.

Fig. 5. A graph or triple-store triplet and a graph

A rudimentary example is “S: Eve P: is-friends-with O: Adam”, which could be viewed as evaluating to true if Eve is in fact friends with Adam (i.e., if that actual triplet exists in the graph) and to false otherwise. This mode of evaluating or verifying predicates (which can be thought of as associations) between subject and object instances is used in querying or searching a graph, where triplets that evaluate to true are returned as the result of a query on the graph.
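
As a minimal sketch of the SPO idea, the snippet below builds the Eve/Adam triplet with the open-source rdflib Python library and checks whether a given triplet exists in the graph; the namespace and node names are made up for the example.

    from rdflib import Graph, Namespace

    # Hypothetical namespace for our example subjects, predicates and objects.
    EX = Namespace("http://example.org/people/")

    g = Graph()
    g.add((EX.Eve, EX.isFriendsWith, EX.Adam))   # one S-P-O triplet

    # "Answering" a predicate: true if the triplet exists in the graph.
    print((EX.Eve, EX.isFriendsWith, EX.Adam) in g)   # True
    print((EX.Eve, EX.isFriendsWith, EX.Bob) in g)    # False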

Graphs have many interesting properties. One of them is the fact that they are “schema-independent”. A major challenge with schema dependency is that the world and its model change, implying changes in the underlying data structures and database schemas. Database systems like RDBMSs, which require the schema to be defined up-front, make such change difficult and costly.

For those who think in terms of tables, a graph database structure or a triplet could be viewed as a schema with three attributes, equivalent to a table with three columns as in Figure 6a below, which makes the reason why a graph is schema-independent obvious.

Fig. 6. Graph (triplet) representation in table form, and linked graphs

The schema independence of the graph DBMS makes it well suited for the Big Data cases where Variety is a major challenge, but let us look at another graph DBMS characteristic (the capability to link data as in Figure 6b) before we elaborate on its use in a Big Data setting.

W3C defines RDF as well as a set of related technologies like the Linked Data Platform [10] and a set of related tools and languages like the Web Ontology Language (OWL) [11] that collectively offer a unique possibility to link data across distributed data sources. These tools and technologies extend RDF to set up pre-defined relationships between subjects or objects (the nodes), or what RDF refers to as Classes and Objects.

One category of relationships (in OWL) is the equality and inequality group of relationships, including for example the “sameAs” relationship. As demonstrated in Figure 6b, one can define relationships between the nodes of two distinct graphs, thus linking the two graphs. This, in effect, is like extending the search or query space to include both graphs (i.e., two distinct databases) without further integration hassle.
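
Continuing the same kind of rdflib sketch (again with made-up namespaces standing in for two independent data sources), declaring that a node in one graph denotes the same entity as a node in another graph is just one more triplet:

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    HR  = Namespace("http://example.org/hr/")    # hypothetical data source A
    CRM = Namespace("http://example.org/crm/")   # hypothetical data source B

    g = Graph()
    g.add((HR.Eve, HR.worksAt, HR.OsloOffice))         # triplet from source A
    g.add((CRM.EvaSmith, CRM.hasStatus, CRM.Premium))  # triplet from source B

    # State that the two nodes denote the same real-world entity,
    # effectively extending the query space across both graphs.
    g.add((HR.Eve, OWL.sameAs, CRM.EvaSmith))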


This is not limited to graphs in the same DBMS space, nor is it limited to linking graph databases only. The Linked Data Platform mentioned above defines a set of rules for HTTP operations on Web resources to provide an architecture for read-write linked data across the Web. This is what then becomes possible (see Figure 7):

Fig. 7. Linked data and meta-data

One of the classical dilemmas of distributed computing (and especially distributed databases) is the case of a data hub or a data master that integrates or collects many data sets from different data sources. The two extreme implementations of such a data hub are either creating one centralized database and pulling in all the data, or creating an integration or abstraction layer that routes (for example) queries to the relevant data sources and packages the results into one coherent query response.

The first case is impractical in Big Data cases where there are many databases and/or far too large amounts of data. The central database may become too big and heavy.

The second case is more attractive, because it has the possibility to leave the existing data sources intact and autonomous and still present a “hub” of all underlying data to the user. But it has its challenges: the underlying data sources will not necessarily have matching schemas, and the underlying technologies will not necessarily be the same. The W3C standard graph approach addresses these challenges. It provides the possibility of creating a single graph meta-data layer (represented as linked graphs in the integrated graph metadata and domain vocabularies layer of Figure 7 above) capable of accessing data from many different kinds of technologies, as shown in the data servers layer of the same figure, and from many different data sources and types, i.e., structured, semi-structured or unstructured data, represented by the data sources and data types layer of the figure.

Like relational databases, graph databases usually have their own query language. Graph databases that follow the W3C standards will typically use the W3C standard SPARQL query language [12].
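
A minimal SPARQL sketch, again using rdflib and a made-up friendship graph, could look like this:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/people/")
    g = Graph()
    g.add((EX.Eve, EX.isFriendsWith, EX.Adam))
    g.add((EX.Eve, EX.isFriendsWith, EX.Maria))

    # SPARQL: return every object for which the predicate holds with subject Eve.
    results = g.query("""
        PREFIX ex: <http://example.org/people/>
        SELECT ?friend
        WHERE { ex:Eve ex:isFriendsWith ?friend . }
    """)
    for row in results:
        print(row.friend)   # prints the URIs for Adam and Maria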

Note that it is relatively straightforward to convert schemas/models of various data sources (especially relational schemas) to graphs and back. There are tools and standards for that, like W3C's RDB2RDF [13].

This concludes our discussion of the role of graph DBMSs in Big Data: a graph DBMS is capable of addressing the Variety requirement (i.e., variety of sources, types, schemas and technologies) in a very effective manner.

For the sake of completeness, we need to mention that the centralized versus autonomous-and-distributed alternatives of the “hub” are not the only two alternatives. Hybrid solutions are also possible, allowing some data to be centralized and some to be left in autonomous sources. It is also possible to do this in a manner that only caches the data to be stored centrally, replacing it with other relevant data as needed.

Oracle RDF Semantic Graph [14] is an example of a graph DBMS that allows for hybrid data-level and meta-data-level integration and linking, with caching logic to decide dynamically the amount and kind of data to be centralized.

The graph DBMS also offers a light-weight structure that comes in handy when acquisition speed (i.e., Velocity) is a concern, but we will now look at an alternative DBMS technology that is even more effective in such cases.

2.1.2 Key-Value Pair (KVP) family database management systems

If the graph database can be viewed as a three-column table, the key-value pair database will be like a two-column table in comparison (see Figure 8 below).

Fig. 8. Key-value pair and graph database structures from a relational (table) view


As you have probably guessed already, a key-value pair database (also called a key-value store) is what is known as a hash table or a dictionary data structure in programming terms.

Like the graph database, the KVP database does not require a pre-defined schema. Thus, it also addresses the Variety (flexibility) needs of a Big Data solution.

Like the graph database, the KVP database is a simple and effective data structure, which translates into high performance when acquiring data (addressing the Velocity need of Big Data). In fact, it fits the [time-stamp, value] structure of time series naturally, and is often used as the basis for tools and applications for storing and processing time-series or streaming data. The operational historian, a specialized application for logging/processing time-based process data, is a typical example.
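
A minimal in-memory sketch of the key-value idea applied to time-series data (the sensor readings are made up, and a real KVP store adds persistence, replication and distribution on top of this structure):

    from datetime import datetime, timezone

    store = {}   # the whole "database": a dictionary of key-value pairs

    def put(key, value):
        store[key] = value

    def get(key):
        return store.get(key)

    # Time-series usage: [time-stamp, value] pairs from a hypothetical sensor.
    put(datetime(2016, 2, 9, 12, 0, 0, tzinfo=timezone.utc), 21.4)
    put(datetime(2016, 2, 9, 12, 0, 1, tzinfo=timezone.utc), 21.6)
    print(get(datetime(2016, 2, 9, 12, 0, 1, tzinfo=timezone.utc)))   # 21.6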

The KVP store usually does not have a standardized query language of its own, like SQL or SPARQL, but is usually accessed and queried programmatically, i.e., through APIs or the equivalent. This does not mean that no one is trying to write an SQL interface to some KVP store as we speak: there is a large community of SQL users, and many types of DBMSs offer an SQL interface to facilitate use. But since KVP stores often offer a programmatic interface instead, it should not come as a surprise that there are also MapReduce-style, programmatic processing frameworks that work directly against KVP stores.

Originally, the KVP DBMS was not a distributed DBMS, but that has been remedied in many available KVP stores through the use of mechanisms like sharding, which is a technique for partitioning and distributing a database across many servers. This distributes the database itself as well as the workload, and enables scalability. As we shall see in the next section, this is also the kind of logic behind the Hadoop Distributed File System (HDFS).
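
As a rough illustration of the sharding idea (the shard count and key format are hypothetical), each key can be mapped deterministically to one of a fixed set of servers:

    import hashlib

    NUM_SHARDS = 4   # hypothetical: four servers, each holding one shard

    def shard_for(key: str) -> int:
        """Map a key deterministically to one of the NUM_SHARDS servers."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    # Every client computes the same shard for the same key,
    # so reads and writes for that key always go to the same server.
    print(shard_for("sensor-42/2016-02-09T12:00:01Z"))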

Thus, in a Big Data setting, the KVP store is recommended where Variety and Velocity (especially high-performance acquisition/ingestion) are the dominating challenges. Note that a KVP store is often used in conjunction with, and/or as a front-end to, another more mainline kind of database, possibly a relational, graph or HDFS data store.

2.1.3 Hadoop Distributed File System (HDFS) and other distributed file systems

There is no hiding the fact that Big Data is associated with Hadoop. The best place for learning about Hadoop is the Apache Software Foundation [4], as we mentioned before, and there are of course many quick overviews, introductions and tutorials elsewhere [6, 15].


As you may have noticed, we are striving not to call Hadoop a DBMS, as we did for the other two (the graph DBMS and the KVP DBMS). This is a matter of definition, of course, but Hadoop is often referred to through its underlying Hadoop Distributed File System (HDFS).

To be more precise, it is a software library and a framework for inexpensive distributed storage and computing. The Apache Software Foundation's own definition is as follows:

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage”.

The distribution and processing mechanism itself originates from a technique called MapReduce, first published by Jeffrey Dean and Sanjay Ghemawat of Google in 2004 [5]. The following figure (Figure 9) summarizes storing and processing data in an HDFS environment.

MapReduce is actually a programming model for parallel processing of large data volumes distributed on a cluster of machines. As the name implies, it consists of a map() method (or function or procedure, depending upon your programming language/paradigm) and a reduce() method.

The map() method is used to group data sources according to some logical criterion (often a key). The map() method runs for each data entity, in parallel, on a number of processes (servers, nodes) reserved for mapping. The entities are often key-value pairs. The reduce() method then processes each group, again in parallel, returning a collection of values (i.e., the response to the query).

Note that MapReduce is a framework, applicable in many different ways and on both structured and unstructured data (files). Note also that frameworks like MapReduce make sense only in parallel-processing, multi-threaded environments.

MapReduce is also a paradigm. The principles underlying the MapReduce paradigm have existed for a while. As a matter of fact, the map() and reduce() functions are inspired by functional programming, though the HDFS versions are not the same as their functional counterparts.
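
A single-process Python stand-in for the distributed framework can show the shape of the two phases: map() emits key-value pairs, and reduce() aggregates each group of values sharing a key (on a real cluster, both phases run in parallel across many nodes). The word-count task below is only an illustrative example, not the Hadoop implementation itself.

    from collections import defaultdict

    def map_phase(records):
        """map(): emit (key, value) pairs for each input record."""
        for record in records:
            for word in record.split():
                yield (word, 1)

    def reduce_phase(pairs):
        """reduce(): group the emitted pairs by key and aggregate each group."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return {key: sum(values) for key, values in groups.items()}

    records = ["big data", "fast data", "big clusters"]
    print(reduce_phase(map_phase(records)))
    # {'big': 2, 'data': 2, 'fast': 1, 'clusters': 1}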

The HDFS style of storing and querying Big Data is summarized in Figure 9 below. Notice the machine park sketched in the middle part of the diagram: there are “slaves” doing the jobs, distributed and managed by a “master”.


The left side of the diagram simply represents the fact that the HDFS framework can accept data from many data sources, both structured and unstructured.

Fig. 9. Hadoop processing overview

The right side simply represents queries formulated and results returned. The queries can be formulated programmatically, which is the original Hadoop approach and requires programming skills. However, a number of alternative query languages have been developed for the Hadoop framework. They include SQL-like interfaces like Hive [16] and SQL interfaces like Impala [17, 18] and Big Data SQL [19] for the army of SQL users.
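
As a rough sketch of how such an SQL-like interface hides the programmatic layer, the snippet below issues a HiveQL query from Python via the third-party PyHive client; the host name, table and columns are hypothetical, and a real cluster will need its own connection details.

    from pyhive import hive   # third-party client for HiveServer2

    # Hypothetical connection; a real cluster's host, port and credentials will differ.
    conn = hive.Connection(host="hadoop-master.example.org", port=10000, username="student")
    cur = conn.cursor()

    # HiveQL looks like SQL but is compiled into jobs that run over files in HDFS.
    cur.execute("SELECT sensor_id, COUNT(*) FROM flight_test_readings GROUP BY sensor_id")
    for sensor_id, n in cur.fetchall():
        print(sensor_id, n)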

You have probably guessed it already, but let us state it explicitly: Hadoop is the better match for the Volume need of Big Data solutions. It provides inexpensive but effective and scalable storage and processing of large volumes of data.

2.2 Engineered systems

One should also know about the kind of Big Data solutions that come optimized and packaged, both hardware and software. This is one of the relatively newer trends in deployment/delivery, the other being “the cloud”. We shall not spend much time on this, but it is worth studying the example of such a system in Figure 10.


Fig. 10. An engineered system example: the Oracle Big Data Appliance (BDA)

The example above is Oracle's Big Data Appliance (Oracle BDA) [20], announced during Oracle OpenWorld on October 3rd, 2011. The value in an engineered system is that the hardware is optimized for its software, better addressing the need for larger storage and faster processing of large amounts of data. Re-phrased, it simply means that it addresses the needs of Big Data better, because the hardware is also honed for the task.

2.3 Analytic tools and technologies

This is a very short reminder section referring to the Value definition at the beginning of section 1.2, where approaches to processing and analyzing large volumes of data and/or fast data are listed and explained briefly:

• Event or stream processing of live (almost real-time or real-time) data
• Analysis/visualization of stored data, as in the case of regular Business Intelligence (BI) or reporting systems
• Data mining or discovery, including predictive analytics and machine learning, which is best done on very large amounts of data


3 End note

Whenever I talk about Big Data, I always mention trends and technologies like the Internet of Things, sensors, social media, smart phones, wearables, ingestibles, injectables etc., which are the trends that demand and generate data, simply because they are the reason we have a demand for Big Data solutions. A good understanding of these areas is highly recommended for those who truly want to master Big Data technologies and solutions.


References

1. Bryant RE, Katz RH, Lazowska ED. Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society; 2008. http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/Big_Data.pdf.

2. Editorial. Community cleverness required. Nature. 2008;455(7209):1-1. doi:10.1038/455001a.

3. Press G. A Very Short History Of Big Data. Forbes Tech. 2013:2. http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/. Accessed January 13, 2016.

4. Apache Software Foundation. Welcome to Apache™ Hadoop®! 2015. https://hadoop.apache.org/. Accessed January 13, 2016.

5. Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. In: 6th Symposium on Operating Systems Design and Implementation, USENIX Association; 2004. http://usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/index.html. Accessed February 3, 2016.

6. Profitt B. Hadoop: What It Is And How It Works. 2013. http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works. Accessed January 13, 2016.

7. GitHub. Defining a Property Graph. https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph. Accessed February 27, 2016.

8. W3C. RDF - Semantic Web Standards. https://www.w3.org/RDF/. Accessed February 17, 2016.

9. W3C. Semantic Web Standards. https://www.w3.org/2001/sw/wiki/Main_Page. Accessed February 17, 2016.

10. W3C. Linked Data Platform 1.0. https://www.w3.org/TR/2015/REC-ldp-20150226/. Accessed February 23, 2016.

11. W3C. OWL Web Ontology Language Overview. https://www.w3.org/TR/2004/REC-owl-features-20040210/#sameAs. Accessed February 23, 2016.

12. W3C. SPARQL Query Language for RDF. https://www.w3.org/TR/rdf-sparql-query/. Accessed February 24, 2016.

13. W3C. RDB2RDF - Semantic Web Standards. https://www.w3.org/2001/sw/wiki/RDB2RDF. Accessed February 23, 2016.

14. Oracle. Oracle Spatial and Graph - RDF Semantic Graph. http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/rdfsemantic-graph-1902016.html. Accessed February 23, 2016.

15. Ciurana E, Kalali M. Getting Started with Apache Hadoop. DZone Refcardz. 2010;(117):1-6.

16. Apache Software Foundation. Apache Hive. https://hive.apache.org/. Accessed February 25, 2016.

17. Kornacker M, Erickson J. Cloudera Impala: Real-Time Queries in Apache Hadoop. 2012. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/. Accessed March 18, 2014.

18. Olson M. Impala v Hive. 2013. http://vision.cloudera.com/impala-v-hive/. Accessed March 18, 2014.


19. Oracle. Unified Query for Big Data Systems. http://www.oracle.com/technetwork/database/bigdata-appliance/learnmore/bigdatasqloverview21jan2015-2408000.pdf?ssSourceSiteId=ocomen. Accessed February 25, 2016.

20. Oracle Big Data Appliance. In: Wikipedia, the free encyclopedia; 2015. https://en.wikipedia.org/wiki/Oracle_Big_Data_Appliance. Accessed January 13, 2016.