Bigdata notes

To keep things simple, we typically define Big Data using four Vs; namely, volume, variety, velocity, and veracity. We added the veracity characteristic recently in response to the quality and source issues our clients began facing with their Big Data initiatives. Some analysts include other V-based descriptors, such as variability and visibility, but we’ll leave those out of this discussion.

Volume is the obvious Big Data trait. At the start of this chapter we rhymed off all kinds of voluminous statistics that do two things: go out of date the moment they are quoted and grow bigger! We can all relate to the cost of home storage; we can remember geeking out and bragging to our friends about our new 1TB drive we bought for $500; it’s now about $60; in a couple of years, a consumer version will fit on your fingernail.

The thing about Big Data and data volumes is that the language has changed. Aggregation that used to be measured in petabytes (PB) is now referenced by a term that sounds as if it’s from a Star Wars movie: zettabytes (ZB). A zettabyte is a trillion gigabytes (GB), or a billion terabytes!

Since we’ve already given you some great examples of the volume of data in the previous section, we’ll keep this section short and conclude by referencing the world’s aggregate digital data growth rate. In 2009, the world had about 0.8ZB of data; in 2010, we crossed the 1ZB marker, and at the end of 2011 that number was estimated to be 1.8ZB (we think 80 percent is quite the significant growth rate). Six or seven years from now, the number is estimated (and note that any future estimates in this book are out of date the moment we saved the draft, and on the low side for that matter) to be around 35ZB, equivalent to about four trillion 8GB iPods! That number is astonishing considering it’s a low-sided estimate. Just as astounding are the challenges and opportunities that are associated with this amount of data.
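
A quick sanity check of the arithmetic above (a sketch only; it assumes decimal units, 1 ZB = 10^21 bytes, and treats the quoted figures as exact):

# Back-of-the-envelope check of the volume figures quoted above
# (decimal units: 1 GB = 10**9 bytes, 1 TB = 10**12, 1 ZB = 10**21).
GB, TB, ZB = 10**9, 10**12, 10**21

print(ZB / GB)   # 1e12 -> a zettabyte is a trillion gigabytes
print(ZB / TB)   # 1e9  -> ... or a billion terabytes

growth_2010_to_2011 = (1.8 - 1.0) / 1.0
print(f"{growth_2010_to_2011:.0%}")   # 80% -> the growth rate cited for 2011

ipods = 35 * ZB / (8 * GB)
print(f"{ipods:.2e}")   # ~4.4e12 -> roughly four trillion 8GB iPods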

The variety characteristic of Big Data is really about trying to capture all of the data that pertains to our decision-making process. Making sense out of unstructured data, such as opinion and intent musings on Facebook, or analyzing images, isn’t something that comes naturally for computers. However, this kind of data complements the data that we use to drive decisions today. Most of the data out there is semistructured or unstructured. (To clarify, all data has some structure; when we refer to unstructured data, we are referring to the subcomponents that don’t have structure, such as the freeform text in a comments field or the image in an auto-dated picture.)

Consider a customer call center; imagine being able to detect the change in tone of a frustrated client who raises his voice to say, “This is the third outage I’ve had in one week!” A Big Data solution would not only identify the terms “third” and “outage” as negative polarity trending to consumer vulnerability, but also the tonal change as another indicator that a customer churn incident is trending to happen. All of this insight can be gleaned from unstructured data. Now combine this unstructured data with the customer’s record data and transaction history (the structured data with which we’re familiar), and you’ve got a very personalized model of this consumer: his value, how brittle he’s become as your customer, and much more. (You could start this usage pattern by attempting to analyze recorded calls not in real time, and mature the solution over time to one that analyzes the spoken word in real time.)
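
As a rough sketch of that idea (not IBM’s implementation; the term list, the weights, and the assumption that a transcript and a tone-change flag are already available are all illustrative), a churn-risk score could combine the unstructured signals with a structured field such as recent incident count:

# Illustrative only: combine unstructured signals (negative terms, tonal change)
# with structured customer data to score churn risk. Terms and weights are made up.
NEGATIVE_TERMS = {"outage", "third", "cancel", "frustrated"}

def churn_risk(transcript: str, tone_shift_detected: bool, open_tickets: int) -> float:
    words = {w.strip('!,.?"').lower() for w in transcript.split()}
    hits = len(words & NEGATIVE_TERMS)
    score = 0.2 * hits                              # negative-polarity terms
    score += 0.3 if tone_shift_detected else 0.0    # raised voice / tonal change
    score += 0.1 * min(open_tickets, 3)             # structured data: recent incidents
    return min(score, 1.0)

print(churn_risk("This is the third outage I've had in one week!",
                 tone_shift_detected=True, open_tickets=2))   # ~0.9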

An IBM business partner, TerraEchos, has developed one of the most sophisticated sound classification systems in the world. This system is used for real-time perimeter security control; a thousand sensors are buried underground to collect and classify detected sounds so that appropriate action can be taken (dispatch personnel, dispatch aerial surveillance, and so on) depending on the classification. Consider the problem of securing the perimeter of a nuclear reactor that’s surrounded by parkland. The TerraEchos system can near-instantaneously differentiate the whisper of the wind from a human voice, or the sound of a human footstep from the sound of a running deer. In fact, if a tree were to fall in one of its protected forests, TerraEchos can affirm that it makes a sound even if no one is around to hear it. Sound classification is a great example of the variety characteristic of Big Data.

One of our favorite but least understood characteristics of Big Data is velocity. We define velocity as the rate at which data arrives at the enterprise and is processed or well understood. In fact, we challenge our clients to ask themselves, once data arrives at their enterprise’s doorstep: “How long does it take you to do something about it or know it has even arrived?”

Think about it for a moment. The opportunity cost clock on your data starts ticking the moment the data hits the wire. As organizations, we’re taking far too long to spot trends or pick up valuable insights. It doesn’t matter what industry you’re in; being able to more swiftly understand and respond to data signals puts you in a position of power. Whether you’re trying to understand the health of a traffic system, the health of a patient, or the health of a loan portfolio, reacting faster gives you an advantage. Velocity is perhaps one of the most overlooked areas in the Big Data craze, and one in which we believe that IBM is unequalled in the capabilities and sophistication that it provides.

In the Big Data craze that has taken the marketplace by storm, everyone is fixated on at-rest analytics, using optimized engines such as the Netezza technology behind the IBM PureData System for Analytics or Hadoop to perform analysis that was never before possible, at least not at such a large scale. Although this is vitally important, we must nevertheless ask: “How do you analyze data in motion?” This capability has the potential to provide businesses with the highest level of differentiation, yet it seems to be somewhat overlooked. IBM InfoSphere Streams (Streams), part of the IBM Big Data platform, provides a real-time streaming data analytics engine. Streams is a platform that provides fast, flexible, and scalable processing of continuous streams of time-sequenced data packets. We’ll delve into the details and capabilities of Streams in Part III, “Analytics for Big Data in Motion.”
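
Streams itself is programmed with its own language and toolkits, so purely as a hypothetical illustration of what analyzing data in motion means, here is a tiny sliding-window monitor over an unbounded feed; the sensor_readings and dispatch names in the usage comment are placeholders:

# Hypothetical sketch of in-motion analytics: keep a small sliding window over an
# unbounded stream and flag anomalies as records arrive, without storing the feed.
from collections import deque

def monitor(stream, window=100, threshold=3.0):
    recent = deque(maxlen=window)
    for value in stream:                          # stream is any (possibly infinite) iterable
        if len(recent) == window:
            mean = sum(recent) / window
            spread = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
            if spread and abs(value - mean) > threshold * spread:
                yield ("anomaly", value)          # act on the insight immediately
        recent.append(value)

# usage: for alert in monitor(sensor_readings()): dispatch(alert)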

You might be thinking that velocity can be handled by Complex Event Processing (CEP) systems, and although they might seem applicable on the surface, in the Big Data world they fall very short. Stream processing enables advanced analysis across diverse data types with very high messaging data rates and very low latency (μs to s). For example, one financial services sector (FSS) client analyzes and correlates over five million market messages/second to execute algorithmic option trades with an average latency of 30 microseconds. Another client analyzes over 500,000 Internet protocol detail records (IPDRs) per second, more than 6 billion IPDRs per day, on more than 4PB of data per year, to understand the trending and current-state health of their network. Consider an enterprise network security problem. In this domain, threats come in microseconds, so you need technology that can respond and keep pace. However, you also need something that can capture lots of data quickly and analyze it to identify emerging signatures and patterns in the network packets as they flow across the network infrastructure.

Finally, from a governance perspective, consider the added benefit of a Big Data analytics velocity engine: If you have a powerful analytics engine that can apply very complex analytics to data as it flows across the wire, and you can glean insight from that data without having to store it, you might not have to subject this data to retention policies, and that can result in huge savings for your IT department.

Today’s CEP solutions are targeted to approximately tens of thousands of messages/second at best, with seconds-to-minutes latency. Moreover, the analytics are mostly rules-based and applicable only to traditional data types (as opposed to the TerraEchos example earlier). Don’t get us wrong; CEP has its place, but it has fundamentally different design points. CEP is a non-programmer-oriented solution for the application of simple rules to discrete, “complex” events.

Note that not a lot of people are talking about Big Data velocity, because there aren’t a lot of vendors that can do it, let alone integrate at-rest technologies with velocity to deliver economies of scale for an enterprise’s current investment. Take a moment to consider the competitive advantage that your company would have with an in-motion, at-rest Big Data analytics platform, by looking at Figure 1-1 (the IBM Big Data platform is covered in detail in Chapter 3).

You can see how Big Data streams into the enterprise; note the point at which the opportunity cost clock starts ticking on the left. The more time that passes, the less the potential competitive advantage you have, and the less return on data (ROD) you’re going to experience. We feel this ROD metric will be one that will dominate the future IT landscape in a Big Data world: we’re used to talking about return on investment (ROI), which talks about the entire solution investment; however, in a Big Data world, ROD is a finer granularization that helps fuel future Big Data investments.

Traditionally, we’ve used at-rest solutions (traditional data warehouses, Hadoop, graph stores, and so on). The T box on the right in Figure 1-1 represents the analytics that you discover and harvest at rest (in this case, it’s text-based sentiment analysis). Unfortunately, this is where many vendors’ Big Data talk ends. The truth is that many vendors can’t help you build the analytics; they can only help you to execute it. This is a key differentiator that you’ll find in the IBM Big Data platform.

Imagine being able to seamlessly move the analytic artifacts that you harvest at rest and apply that insight to the data as it happens in motion (the T box by the lightning bolt on the left). This changes the game. It makes the analytic model adaptive, a living and breathing entity that gets smarter day by day and applies learned intelligence to the data as it hits your organization’s doorstep. This model is cyclical, and we often refer to this as adaptive analytics because of the real-time and closed-loop mechanism of this architecture.

The ability to have seamless analytics for both at-rest and in-motion data moves you from the forecast model that’s so tightly aligned with traditional warehousing (on the right) and energizes the business with a nowcast model. The whole point is getting the insight you learn at rest to the frontier of the business so it can be optimized and understood as it happens. Ironically, the more times the enterprise goes through this adaptive analytics cycle…

Veracity is a term that’s being used more and more to describe Big Data; it refers to the quality or trustworthiness of the data. Tools that help handle Big Data’s veracity transform the data into trustworthy insights and discard noise.

Collectively, a Big Data platform gives businesses the opportunity to analyze all of the data (whole population analytics), and to gain a better understanding of your business, your customers, the marketplace, and so on. This opportunity leads to the Big Data conundrum: although the economics of deletion have caused a massive spike in the data that’s available to an organization, the percentage of the data that an enterprise can understand is on the decline. A further complication is that the data that the enterprise is trying to understand is saturated with both useful signals and lots of noise (data that can’t be trusted, or isn’t useful to the business problem at hand), as shown in Figure 1-2.

We all have firsthand experience with this; Twitter is full of examples of spambots and directed tweets, all of which is untrustworthy data. The 2012 presidential election in Mexico turned into a Twitter veracity example, with fake accounts that polluted political discussion, introduced derogatory hashtags, and more. Spam is nothing new to folks in IT, but you need to be aware that in the Big Data world, there is also Big Spam potential, and you need a way to sift through it and figure out what data can and can’t be trusted. Of course, there are words that need to be understood in context, jargon, and more (we cover this in Chapter 8).

As previously noted, embedded within all of this noise are useful signals: the person who professes a profound disdain for her current smartphone manufacturer and starts a soliloquy about the need for a new one is expressing monetizable intent. Big Data is so vast that quality issues are a reality, and veracity is what we generally use to refer to this problem domain. The fact that one in three business leaders don’t trust the information that they use to make decisions is a strong indicator that a good…

BIG DATA – Nathan Marz

1.5 Desired Properties of a Big Data System

The properties you should strive for in Big Data systems are as much about complexity as they are about scalability. Not only must a Big Data system perform well and be resource-efficient, it must be easy to reason about as well. Let's go over each property one by one. You don't need to memorize these properties, as we will revisit them as we use first principles to show how to achieve these properties.

1.5.1 Robust and fault-tolerant

Building systems that "do the right thing" is difficult in the face of the challenges of distributed systems. Systems need to behave correctly in the face of machines going down randomly, the complex semantics of consistency in distributed databases, duplicated data, concurrency, and more. These challenges make it difficult just to reason about what a system is doing. Part of making a Big Data system robust is avoiding these complexities so that you can easily reason about the system.

Additionally, it is imperative for systems to be "human fault-tolerant." This is an oft-overlooked property of systems that we are not going to ignore. In a production system, it's inevitable that someone is going to make a mistake sometime, like by deploying incorrect code that corrupts values in a database. You will learn how to bake immutability and recomputation into the core of your systems to make your systems innately resilient to human error. Immutability and recomputation will be described in depth in Chapters 2 through 5.
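
A minimal sketch of that idea (not code from the book): raw events are only ever appended, never updated in place, and every view is a pure function that can be recomputed from the full event history, so a bad deploy can at worst add bad records that you later remove and recompute around:

# Sketch: human fault tolerance via immutability + recomputation.
# Raw events are append-only; views are pure functions recomputed from all events.
events = []   # the immutable master dataset (append-only)

def record(event: dict) -> None:
    events.append(event)              # never update or delete in place

def account_balances(all_events) -> dict:
    balances = {}
    for e in all_events:              # recompute the view from scratch
        balances[e["account"]] = balances.get(e["account"], 0) + e["amount"]
    return balances

record({"account": "alice", "amount": 100})
record({"account": "alice", "amount": -30})
print(account_balances(events))       # {'alice': 70}
# If buggy code ever wrote bad events, delete those events and recompute the view;
# the good history is still intact because nothing was overwritten.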

1.5.2 Low latency reads and updates

The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds. On the other hand, the update latency requirements vary a great deal between applications. Some applications require updates to propagate immediately, while in other applications a latency of a few hours is fine. Regardless, you will need to be able to achieve low latency updates when you need them in your Big Data systems.

More importantly, you need to be able to achieve low latency reads and updates without compromising the robustness of the system. You will learn how to achieve low latency updates in the discussion of the "speed layer" in Chapter 7.

1.5.3 Scalable

Scalability is the ability to maintain performance in the face of increasing data and/or load by adding resources to the system. The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is accomplished by adding more machines.

1.5.4 General

A general system can support a wide range of applications. Indeed, this book wouldn't be very useful if it didn't generalize to a wide range of applications! The Lambda Architecture generalizes to applications as diverse as financial management systems, social media analytics, scientific applications, and social networking.

1.5.5 Extensible

You don't want to have to reinvent the wheel each time you want to add a related feature or make a change to how your system works. Extensible systems allow functionality to be added with a minimal development cost.

Oftentimes a new feature or change to an existing feature requires a migration of old data into a new format. Part of a system being extensible is making it easy to do large-scale migrations. Being able to do big migrations quickly and easily is core to the approach you will learn.

1.5.6 Allows ad hoc queries

Being able to do ad hoc queries on your data is extremely important. Nearly every large dataset has unanticipated value within it. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications. Ultimately, you can't discover interesting things to do with your data unless you can ask arbitrary questions of it. You will learn how to do ad hoc queries in Chapters 4 and 5 when we discuss batch processing.
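
As a toy illustration (the dataset and the question are invented), an ad hoc query is simply an arbitrary function run over the raw records, including a question nobody anticipated when the data was collected:

# Toy ad hoc query over raw, schema-light records: a question nobody planned for
# when the data was collected, answered by scanning the full dataset.
pageviews = [
    {"user": "u1", "url": "/pricing", "minute": 10},
    {"user": "u2", "url": "/docs",    "minute": 11},
    {"user": "u1", "url": "/signup",  "minute": 12},
]

# "How many distinct users viewed /pricing and later /signup?"
by_user = {}
for pv in pageviews:
    by_user.setdefault(pv["user"], []).append(pv)

converted = sum(
    1
    for visits in by_user.values()
    if any(p["url"] == "/pricing" and s["url"] == "/signup" and s["minute"] > p["minute"]
           for p in visits for s in visits)
)
print(converted)   # 1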

1.5.7 Minimal maintenance

Maintenance is the work required to keep a system running smoothly. This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production.

An important part of minimizing maintenance is choosing components that have as small an implementation complexity as possible. That is, you want to rely on components that have simple mechanisms underlying them. In particular, distributed databases tend to have very complicated internals. The more complex a system, the more likely something will go wrong and the more you need to understand about the system to debug and tune it.

You combat implementation complexity by relying on simple algorithms and simple components. A trick employed in the Lambda Architecture is to push complexity out of the core components and into pieces of the system whose outputs are discardable after a few hours. The most complex components used, like read/write distributed databases, are in this layer where outputs are eventually discardable. We will discuss this technique in depth when we discuss the "speed layer" in Chapter 7.

1.5.8 Debuggable

A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace for each value in the system exactly what caused it to have that value.
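
One hypothetical way to get that traceability, not prescribed by the excerpt above, is to carry provenance alongside each derived value so that any number in a view points back to the raw records that produced it:

# Hypothetical sketch: attach provenance to derived values so every number in a
# view can be traced back to the raw records that produced it.
raw = [
    {"id": 1, "account": "alice", "amount": 100},
    {"id": 2, "account": "alice", "amount": -30},
]

def balance_with_lineage(records, account):
    contributing = [r for r in records if r["account"] == account]
    total = sum(r["amount"] for r in contributing)
    return {"value": total, "from_record_ids": [r["id"] for r in contributing]}

print(balance_with_lineage(raw, "alice"))
# {'value': 70, 'from_record_ids': [1, 2]}  -> debugging starts from these ids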

Achieving all these properties together in one system seems like a daunting challenge. But by starting from first principles, these properties naturally emerge from the resulting system design. Let's now take a look at the Lambda Architecture, which derives from first principles and satisfies all of these properties.

Computing arbitrary functions on an arbitrary dataset in realtime is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.