IoT Under the Hood

Sparks Ignite, Inc.
A technology consulting firm. We build outcomes.


IoT Under the Hood: Taking an informed approach

7/22/15


IoT Under the Hood

There are any number of vendors and publications stating that IT departments need to invest big in Big Data and Big Analytics to meet the challenges of the Internet of Things.

Saying something does not make it true, even if you say it very loudly and very often. It just makes it noisy.

Let's swap out marketing and hype for logic and math and separate the signal from the noise.

We'll start with a clear problem definition and then work out an algorithmic approach to the problem.

Once we have a framework, we can more intelligently choose an implementation.


Essentially, there have been two fundamental changes:

one fundamental technical change: more devices are reporting more data more frequently.

one fundamental business change: provide more frequent and robust analytics.

Let's break down the requirements into something measurable.



More devices are reporting more data more frequently.

What are these “devices” again?
Sensors, programmable logic controllers (PLCs), RFID readers, etc.

What do we mean by “reporting”?
Each device is really only capable of generating a text-based log file. It could be fixed or variable length, XML or JSON, but it will be text.

Most importantly, all of these devices now have an IP address.




What do we mean by “more”?
For a device to be considered part of your Internet of Things, it must be connected to the network. At this point, “more” could mean connecting some or all of your existing devices, which has technical and security issues. It could mean connecting your supply chain partner's devices; same issues magnified. It could mean adding more devices. Plan for it to mean all of these things, and come up with a reasonable strategy for a staged onboarding process.

What do we mean by “data”?
This type of data is called time series data, which Wikipedia tells us “is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals.” It's typically geolocated as well.




What do we mean by “frequently”?
Normally, we are talking about anywhere between near real time and fifteen minutes. Time, or frequency, is money, and storage space is inexpensive but not free. The maximum speed at which a device can report data is not necessarily the speed at which you are best served receiving the data.

Frequency is a measurement that should be agreed upon by both the business and the data scientists. I have found that planning on an average frequency of one minute is reasonable and makes for easy estimations.

From experience, I have also found that although devices are more than happy to talk nonstop, not everything they say is worth listening to.

There are some technical advantages to pushing some basic quality control logic to the device, assuming it has the ability.



Provide more frequent and robust analytics.

What do we mean by “provide”?
There will need to be both ad-hoc and structured analytics and reporting. It is worth noting that data at scale is often not amenable to the same types of reports that are used for more modest, enterprise-size data.

What do we mean by “more frequent”?
For most use cases, the difference between an “advanced” and a “standard” analytics platform is the speed at which the data can be made available and actionable, not necessarily the level of detail. This difference can mean a report that provides advance warning of a potential system failure versus a detailed post mortem of broken equipment.




What do we mean by “robust”?
A robust data set, to a data scientist, means less work spent on cleansing, processing, and massaging the data and more time spent running, comparing, and fine-tuning algorithms. Many data scientists estimate that 80% of their work involves data munging. There is no way to classify that time other than as 'wasted'.

What do we mean by “analytics”?
Analytics, or analysis of data, is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. The major categories of analytics that are typically performed on time series data are on the following slides.



Time Series Data Analytics

Summarization: Given a time series Q containing n data points, where n is an extremely large number, create a (possibly graphic) approximation of Q which retains its essential features but fits on a single page, screen, etc.

Anomaly Detection: Given a time series Q, assumed to be normal, and an unannotated time series R, find all sections of R which contain anomalies or “surprising/interesting/unexpected” occurrences.

Segmentation: Given a time series Q containing n data points, construct a model Q1 from K piecewise segments (K << n), such that Q1 closely approximates Q.
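As a rough illustration of segmentation, the sketch below approximates a series with K equal-width segments, each represented by its mean. The equal-width, mean-per-segment choice and the name `piecewise_means` are assumptions made for this example; the slides do not prescribe a particular segmentation method.

```python
from statistics import mean

def piecewise_means(series, k):
    """Approximate `series` with k segments, each represented by its mean."""
    n = len(series)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(series)")
    # Segment i covers indices [i*n//k, (i+1)*n//k), so sizes differ by at most one.
    bounds = [i * n // k for i in range(k + 1)]
    return [mean(series[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

# Hypothetical readings: twelve points reduced to three segment means.
readings = [1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2]
model = piecewise_means(readings, 3)
print(model)
```

The same idea also serves as a crude summarization: a very long series can be reduced to as many points as fit on one screen.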




Indexing: Given a query time series Q, and some similarity/dissimilarity measure D(Q, C), find the most similar time series in database DB.

Clustering: Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q, C).

Classification: Given an unlabeled time series Q, assign it to one of two or more predefined classes.

Prediction: Given a time series Q containing n data points, predict the value at time n + 1.
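To make D(Q, C) concrete, here is a minimal sketch that assumes Euclidean distance as the dissimilarity measure and a plain linear scan as the "index". Both are assumptions for illustration, as are the series names; real systems use smarter measures and index structures.

```python
import math

def euclidean(q, c):
    """Dissimilarity D(Q, C) between two equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def most_similar(query, database):
    """Return the name of the series in `database` closest to `query`."""
    return min(database, key=lambda name: euclidean(query, database[name]))

# Hypothetical database of three short series.
db = {
    "pump_a": [1.0, 2.0, 3.0],
    "pump_b": [3.0, 2.0, 1.0],
    "pump_c": [1.0, 2.0, 2.5],
}
best = most_similar([1.0, 2.0, 2.9], db)
print(best)
```

Clustering and classification can be built on the same measure: group the series that sit close together, or assign a new series the class of its nearest labeled neighbor.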



Let's assume that we want our analytic platform to run on data refreshed every minute rather than data that was batch processed the day before. With 1,440 minutes in a day, that's roughly a 10^3 performance increase.

We have discussed before that a new device can be
● An existing device previously not connected to the network
● A new device installed on a new product type
● An interface to a partner's device
● An interface to a customer's device

Let's just say you have to deal with a million new devices.

We now have a problem definition with an order of magnitude estimation: provide analytic capability 10^3 times faster on time series data from 10^6 more devices.
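The back-of-the-envelope numbers behind that estimate can be checked in a few lines. This sketch assumes the one-minute average reporting frequency suggested earlier; the variable names are only illustrative.

```python
import math

MINUTES_PER_DAY = 24 * 60          # 1,440

# Refresh speedup: one daily batch versus one refresh per minute.
speedup = MINUTES_PER_DAY
speedup_magnitude = round(math.log10(speedup))   # nearest power of ten

# Data volume: assumes every new device reports once per minute on average.
new_devices = 10**6
readings_per_day = new_devices * MINUTES_PER_DAY

print(f"speedup ~ 10^{speedup_magnitude}")
print(f"readings per day: {readings_per_day:,}")
```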

Are there data structures and algorithms that can provide for such an increase in time series data?


Big O Review


Storing time series data means taking a large number of small files and persisting them using the time (at a minimum) as the natural order.

The emphasis is on insert, since there is no mandatory prerequisite for updating or deleting time series data.

This data will come from multiple sources so time (and possibly location) is really the only metric that they must have in common.



In order to rapidly query time series data, there needs to be a clear and relevant sequential index.

A reasonable key could be composed of metric name : timestamp : random 64-bit integer : geo tag. If you needed to retrieve a block of data containing all of the sensor readings from a particular device at a particular manufacturing plant in the third quarter of 2014, that would be readily available.

Note that this key also makes good input for a hash function. Since a hash table provides the fastest possible data retrieval (constant time), it is very important to ensure that the hash is well generated. A bad hash degrades a hash table to a linked list, and we will get data in O(n) rather than O(1).
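A minimal sketch of such a key, and of the third-quarter range query it makes cheap, might look like this. The field formats, the metric name `plant7.press3.temp`, the geo tag, and both helper functions are assumptions for illustration only.

```python
import bisect
import secrets
from datetime import datetime, timezone

def make_key(metric, when, geo_tag):
    """Build 'metric:timestamp:nonce:geo' so that keys sort by metric, then time."""
    ts = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    nonce = secrets.randbits(64)            # random 64-bit integer avoids collisions
    return f"{metric}:{ts}:{nonce:016x}:{geo_tag}"

def keys_in_range(sorted_keys, metric, start, end):
    """Return keys for `metric` with start <= date < end (dates as YYYYMMDD)."""
    lo = bisect.bisect_left(sorted_keys, f"{metric}:{start}")
    hi = bisect.bisect_left(sorted_keys, f"{metric}:{end}")
    return sorted_keys[lo:hi]

# Hypothetical readings on either side of the Q3 2014 boundaries.
days = [(2014, 6, 30), (2014, 7, 1), (2014, 9, 30), (2014, 10, 1)]
keys = sorted(
    make_key("plant7.press3.temp", datetime(y, m, d, tzinfo=timezone.utc), "41.88,-87.63")
    for y, m, d in days
)
q3_2014 = keys_in_range(keys, "plant7.press3.temp", "20140701", "20141001")
print(len(q3_2014))
```

Because the keys sort by metric and then by time, the quarter is one contiguous slice found with two binary searches, with no scan of unrelated data.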



Time series data unfortunately tends to be heavy on disk I/O.
● Seek Rate: the time it takes to position the drive head over the data to be read or written.
● Transfer Rate: the time it takes to move data between the controller and the host system (external rate) as well as between the disk surface and the controller (internal rate).

As far as disk operations are concerned, it is better to transfer than to seek and even then it's helpful to minimize the frequency of the transfers in favor of larger payloads.
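A toy cost model shows why larger payloads help. It assumes one seek per write, the millisecond seek time (1E-03) used in these slides, and a hypothetical transfer rate of 100 MB/s; the function name and record sizes are made up for the example.

```python
def write_time_seconds(n_records, record_bytes, records_per_write,
                       seek_s=1e-3, bytes_per_s=100e6):
    """Estimated time to persist n_records, paying one seek per write."""
    writes = -(-n_records // records_per_write)       # ceiling division
    seek_cost = writes * seek_s
    transfer_cost = n_records * record_bytes / bytes_per_s
    return seek_cost + transfer_cost

# One million 100-byte readings, written one at a time versus 1,000 at a time.
unbatched = write_time_seconds(1_000_000, 100, records_per_write=1)
batched = write_time_seconds(1_000_000, 100, records_per_write=1_000)
print(f"unbatched: {unbatched:.0f} s, batched: {batched:.0f} s")
```

Under these assumptions the transfer cost is identical in both cases; the whole difference comes from the number of seeks.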

Since the data is both sequential and immutable, we can have a reasonable expectation that this can be optimized.

The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds (1E-03), so we can short-circuit any deep conversations about partial stroking and hybrid drives.



We need a storage system that
1. minimizes seek time
2. is optimized for index searches
3. is optimized for sequential searches
4. is optimized for inserts

A storage system in this context refers to database engines, file systems, messaging platforms, etc.: anything that would store the data.

For estimation purposes, we are looking at a storage system of O(log n) or better. Since we are going to be looking at data at scale, we will need our performance to level off.



What algorithm are you using today for your storage systems?

Almost certainly a B (or B+) Tree. The primary value of a B tree is in storing data for efficient retrieval in a block-oriented storage context. Most file systems use this structure and it is used by every major relational database vendor for their key indexes.

B trees are great for random access. When inserting a record into a B Tree, you need to search the tree to find the location to insert the record. Since B-Trees are designed to be wide and shallow, there should be a minimal number of drive seeks.

B Tree inserts are O(log n), which is arguably the mathematical lower bound for balanced trees.
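To see what "wide and shallow" means in practice, this sketch counts how many levels a tree needs for a given number of keys. It is a simplification that assumes every node is completely full and holds `fanout` entries; the fanout of 100 is an arbitrary illustrative value.

```python
def btree_levels(n_keys, fanout):
    """Levels needed to address n_keys when every node holds `fanout` entries."""
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout      # each extra level multiplies capacity by the fanout
        levels += 1
    return levels

million = btree_levels(10**6, fanout=100)
billion = btree_levels(10**9, fanout=100)
print(million, billion)
```

Growing the data a thousandfold adds only a couple of levels, which is the logarithmic behavior described above. Each level is still a potential disk seek on insert.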

So can we do better?



There is a data structure called a Log Structured Merge Tree (LSM) and a storage model called Log Structured Storage (LSS) that provide the same estimated performance, O(log n), as a B Tree but offer two key potential improvements over B Trees for systems that are going to do large quantities of sequential writes:
● Moving seek time from disk to memory
● Moving from block data to log structured data



LSM Tree

An LSM-Tree is a hybrid tree model that uses two trees: C0 and C1. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are written to C0 and migrated from C0 to C1 based on a size threshold.

Insertions now run primarily at RAM rather than HDD speeds, or 1E-09 rather than 1E-03 seconds. Of course, they are written to disk, but that is where the LSS comes in.

Note that many production systems concurrently write to a commit log on disk and to C0, with the commit log getting deleted after flushing.
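The write path just described can be sketched in a few dozen lines. This is a toy model, not how any particular database implements it: plain Python lists stand in for the on-disk commit log and for C1, and each flush simply appends a sorted run instead of performing a real rolling merge. The class name `TinyLSM` and the threshold of four records are assumptions.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # stands in for the on-disk commit log
        self.c0 = {}           # in-memory component
        self.c1 = []           # stands in for disk: a list of sorted (key, value) runs

    def put(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.c0[key] = value                   # then the fast in-memory insert
        if len(self.c0) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        self.c1.append(sorted(self.c0.items()))  # one sequential write of a sorted run
        self.c0.clear()
        self.commit_log.clear()                  # log is no longer needed after the flush

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        for run in reversed(self.c1):            # newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=4)
for minute in range(5):
    db.put(f"temp:{minute:04d}", 20 + minute)
print(len(db.c0), len(db.c1), db.get("temp:0002"))
```

Every `put` touches only memory plus an append to the log; the sorting cost is paid once per flush rather than once per record.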



LSS

In a traditional storage system, there needs to be a considerable amount of overhead for updating and deleting existing members. In a log structured storage system, this overhead does not exist because a log structured storage system provides for an append-only sequence of data entries. Unlike a B+ Tree-based system, you don't find a location for new data, you merely append it to the end.

Because new records are always added to the end, there is never any need for searching a tree for insertion, like in a B-tree storage structure. This allows for extremely predictable horizontal scaling.




Providing concurrency and transactional semantics using Multiversion Concurrency Control (MVCC) is easier in LSS than in a B-tree since existing data is not modified. A view of the system at state Q at time A is just as valid as a view of Q at time B.

Being able to manage concurrency and transactions in a distributed environment just by using immutable objects is a key to successful software development projects.
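Here is a minimal sketch of why immutability makes versioned reads simple. It is not a full MVCC implementation: it assumes a single in-memory append-only list, and a "snapshot" is nothing more than the log length at the moment it was taken. The class and method names are hypothetical.

```python
class AppendOnlyLog:
    def __init__(self):
        self._entries = []      # entries are only ever appended, never changed

    def append(self, key, value):
        self._entries.append((key, value))
        return len(self._entries)           # the new version number

    def snapshot(self):
        return len(self._entries)           # a view is just a position in the log

    def read(self, key, version=None):
        """Latest value for `key` as of `version` (default: now)."""
        end = len(self._entries) if version is None else version
        for k, v in reversed(self._entries[:end]):
            if k == key:
                return v
        return None

log = AppendOnlyLog()
log.append("sensor-1", 20)
view_a = log.snapshot()
log.append("sensor-1", 25)
print(log.read("sensor-1", view_a), log.read("sensor-1"))
```

A reader holding `view_a` keeps seeing a consistent state no matter how many writers append after it, and no locks are involved.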



LSS/LSM versus B-Tree

While both options provide O(log n) performance, the LSS/LSM algorithm and data structure solution is clearly optimized for the IoT use case and can give us the order of magnitude speed increases that we need.

Because insertions into a Log Structured Merge Tree occur in memory rather than on disk, we are inserting into a medium that takes 1E-09 rather than 1E-03 seconds. For relational databases, that insertion can also occur in main memory, but only for GB-sized datasets, not TB-sized ones.

Once the LSM writes to disk, the Log Structured Storage system will always write to the end of the file with no searching or sorting. This actually occurs in O(1) time.



We have identified an optimal data structure and algorithm; now we need to identify the level of compliance needed for the data. You may have seen the CAP Theorem:



The CAP Theorem

Consistency: All clients see the same view of the data, even in the presence of updates.

Availability: All clients can find some replica of the data, even in the presence of failure.

Partition Tolerance: The system property holds even if the system is partitioned.

Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the CAP Theorem is to identify the consistency model you need.



For time series data from devices, availability and partition tolerance are key drivers. The data should never be lost and the system should not be unavailable. The data should be partitioned across multiple nodes based on a reasonable hash in order to avoid the hot-spotting problem that can arise with time series data indexes. This is an AP model.
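The hot-spotting concern can be illustrated with a short sketch. It assumes the partition is chosen by hashing a device identifier with SHA-256; the identifier format, the choice of hash, and the eight partitions are all illustrative assumptions rather than recommendations for any specific product.

```python
import hashlib
from collections import Counter

def partition_for(device_id, n_partitions):
    """Map a device to a partition using a stable, well-mixed hash."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

# Hypothetical fleet of 10,000 devices spread over 8 partitions.
devices = [f"sensor-{i:05d}" for i in range(10_000)]
load = Counter(partition_for(d, 8) for d in devices)
print(sorted(load.items()))
```

If the partition were derived from the timestamp alone, every device writing in the current minute would land on the same node. Hashing on the device spreads concurrent writes across the cluster while each device's readings stay together in time order.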

By contrast, consider a banking system. If a customer makes a transfer from checking to savings, anyone who looks at that data must see the same result. This would not be guaranteed if the checking and savings accounts were separately partitioned, so this is a CA model.

By the way, traditional relational databases are generally classed as CA, and CA is the only way to be fully ACID. NoSQL databases are generally either CP or AP.



So what are our choices?



Consistency: Commits are available across the entire distributed system.

Availability: The system remains accessible and operational at all times.

Partition Tolerance: Only a total system failure can cause the system to respond incorrectly.

Now, define your problem set(s) and pick two.

CA: Traditional relational databases

AP: Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak

CP: BigTable-like systems, MongoDB, HBase, Memcached, Redis



There are caveats here. Your company may value CP over AP, in which case you may prefer MongoDB or HBase. Your company may already have a Hadoop stack, and HBase is part of the basic ecosystem. MongoDB is very easy to query and stores its data as BSON (binary JSON), making it very easy to use JSON across the stack.

My personal bias is towards DataStax's Cassandra offering: it has elastic scalability and a flexible data model, and the Cassandra Query Language looks a lot like SQL.

IBM has offered a CouchDB-based database in its Bluemix platform since acquiring Cloudant. CouchDB is similar to MongoDB in its JSON integration, but (at the time of this writing) queries need to be structured as map-reduce rather than SQL.



To summarize:

The best data structure for inserting into our persistent storage engine for time series data would run in O(log n), or logarithmic, time.

B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM Tree will deliver better performance for inserting time series data. Proper configuration of the LSM Tree engine could move a substantial amount of the operations from disk to memory (1E-09 rather than 1E-03).

Since we need to process 1E06 times more data, moving as much processing as possible from 1E-03 to 1E-09 will absolutely get us there.

The consistency model will likely be availability and partition tolerance (AP).
