
The Bloor Group

WHITE PAPER

THE NATURE OF GRAPH DATA

The New Wave of Graphical Analytics and Its Areas of Application

Robin Bloor, Ph.D. & Rebecca Jozwiak


Executive Summary

In this white paper, we examine the characteristics, purposes and strengths of graph databases with an emphasis on the Cray® Urika® platform and the Cray Graph Engine. The paper will cover:

• The factors behind the need for relational database alternatives, including document and graph databases, because relational databases are:

- Incapable of handling hierarchical and graphical data structures well
- Not designed for massively parallel architectures
- Unable to swiftly process unstructured data

• The fundamental differences between relational, document and graph databases and the value of each

• How graph databases work, including definitions and functions of:

- Triples
- The semantic web
- RDF and RDF schema
- Ontologies and OWL
- SPARQL

• The Cray® Urika®-GX platform and the Cray Graph Engine (CGE)

• How the CGE achieves performance and scale

• The position of CGE in the market

• The growing number of business sectors adopting graph analytics applications, including:

- Pharmaceuticals, particularly for drug interactions, drug repurposing and precision medicine

- Biomedical research and genome sequencing
- Disease research involving chronic conditions
- Financial services for fraud detection and risk analysis
- Social media and gaming networks
- Large-scale retail services
- IT security, particularly for strengthening networks against cyberattacks
- News analytics

• Two use cases involving Cray’s technology:

- A cybersecurity use case where the customer was able to find, isolate and neutralize cyberthreats in real time
- A cancer research use case where a research firm improved the performance of hypothesis validation by 1,000 times

It is our view that graph analytics – and specifically the Cray Urika platform and the Cray Graph Engine – can provide extreme value over very large data sets. This type of technology differentiates itself from traditional databases by allowing users to explore patterns and relationships in data, a feat that is either impossible or difficult in relational databases.


The Emergence of Graph Databases

Let’s take stock. For the last 30 years or more the IT industry has employed relational databases for most of its application needs including most data analysis. Unless you worked on the front lines as a software developer, you could be forgiven for believing that relational database technology was all you needed to store and analyze the mass of corporate data that grew like bamboo, year after year.

We looked at data in that way because we stored it in that way. And for many purposes it is an excellent way to store and explore data: it proved excellent for OLTP, for the data warehouse and as a general-purpose database for data marts. There were, however, specific areas where it was weak. In particular, it was ineffective and inefficient for complex hierarchical data, which is used extensively in bill-of-materials processing, and it was also ineffective for processing graphical data structures.

New Database Engines

In the era of relational database dominance, applications for complex hierarchical data and graphical data were uncommon. As long as the volume of such data was relatively small, relational databases could be shoehorned into storing it, and users could compensate for the inadequacy of the database by doing the necessary data manipulation programmatically; where that failed, object databases were usually employed.

In recent years new varieties of databases emerged, purpose-built to cater directly for such inconvenient data structures. On the one hand there were document databases such as MongoDB and CouchDB, and on the other hand there were property graph databases such as Neo4j, and RDF graph databases such as the Cray Graph Engine.

Several factors led to these developments. Foremost was the move to parallelism in software. This had been prompted by the chip industry’s move to producing multicore processors, which are best exploited by parallel software. Volumes of data had continued to grow, year on year, at a compound annual growth rate of about 55 percent, according to IDC’s regularly published estimates. This began to stretch the capabilities of relational database products, most of which were specifically engineered to run on relatively small clusters. They did not have massively parallel architectures. To address this limitation, several more powerful relational databases emerged – “column store” databases, like HP Vertica – that were built for parallelism and stored data more efficiently.

Nevertheless the dominance of relational databases came to an end. The search engine businesses, Google and Yahoo, needed to process large volumes of unstructured data, and this was not easily achieved. Unstructured data is data that does not explicitly declare its metadata. It cannot be ingested into databases without considerable cost and effort. And in those days the volume of data on websites was far more than what a relational or any other kind of database could cope with. So the search companies designed their own file systems and built their own massively parallel environments for the sake of search. This eventually led to the emergence of Hadoop®, a massively parallel processing environment for data.

The primary use for document databases is to store hierarchical data – particularly web data. Rather than storing data in relational tables, they store and access it in JSON form as object structures, which is far more compatible with most programming languages and much faster for retrieving data. In reality, they were a new generation of object databases that had also been engineered to scale out, catering for very high data volumes and massively parallel processing. In most respects they were complementary to relational databases.

The graph databases were specifically built to process data in a graphical manner, storing data efficiently to enable queries that traverse networks of data. These are complementary to relational or document databases, and many of them have been engineered to scale out and process very large volumes of data.

Relational Data, Document Data and Graphical Data

Figure 1 shows the basic distinction between the three types of data (relational, document and graph) and is worth discussing for the sake of clarity.

Relational data structures require data to be stored in two-dimensional tables, each of which may be related to the other by key values. So in the simple example shown in Figure 1, an Order Header table has a one-to-many relationship with an Order Line table, related together by the Ord_No key. When a new order is added, the header goes into one table and the associated order lines are placed in the other. In a relational database all data is stored in tables like this, using common keys (table columns) to relate tables to each other.

The document data example shows data broken up into a cascade of categories. (Food is categorized into fruit, vegetables and meat. Fruit is categorized into fresh and juice, and so on.) This structure supports the definition of many keys at many levels. In the example, food, fruit, fresh, juice, vegetables, root, leaf, meat, pork and beef are all keys, and the data values residing beneath can be accessed by those keys. Alternatively, this particular example could be placed into a single table, but storing it as a hierarchical structure will (once data volumes grow large) provide faster access and is more likely to align with how a developer thinks about the data.

Figure 1. Relational, Document and Graphical Data Structures
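To make the relational/document contrast concrete, here is a minimal Python sketch. The values echo those in Figure 1, but the field names and structure are illustrative only and not tied to any particular product.

```python
# Relational style: two flat tables related by a shared key (Ord_No).
order_header = [
    {"Ord_No": "AA-0142", "Cust_ID": "RJ1960", "Date": "01/13/16"},
]
order_line = [
    {"Ord_No": "AA-0142", "Line": "01", "Product": "T0151", "Qty": 5, "Price": 10.50},
    {"Ord_No": "AA-0142", "Line": "02", "Product": "S2231", "Qty": 2, "Price": 22.00},
]

# Document style: the food taxonomy held as one nested, JSON-like structure,
# so a whole branch can be fetched by walking keys rather than joining tables.
food = {
    "fruit": {"fresh": ["apple", "pear"], "juice": ["orange", "lemon", "mango"]},
    "vegetables": {"root": ["carrot", "parsnip"], "leaf": ["lettuce", "kale"]},
    "meat": {"pork": ["rib", "chop"], "beef": ["rib", "steak", "brisket"]},
}

# The relational tables are combined by matching keys across tables...
lines_for_order = [line for line in order_line if line["Ord_No"] == "AA-0142"]
# ...whereas the document is accessed directly through its hierarchy of keys.
fresh_fruit = food["fruit"]["fresh"]
```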



The graph data example is distinctly different. It highlights the fact that graphical structures naturally record both verbs and nouns. The two nouns, buyer and book, are related together by the act of purchasing (a verb). Note that both “nouns” and “verbs” can have attributes. Again, when data volumes get large, it is preferable to hold data of this kind in a graphical structure because it is semantically meaningful to do so.

Note also that there are many situations where a class of nouns (person, for example) can relate to its own class, as in: Jane went to school with John, Jane has a sister called Janet, John worked with Janet, Justin likes Jane but does not like John. When considering graph data conceptually, it helps to see some simple graphs, as in Figure 2, which illustrates the relationships we described. As shown, some of the relationships are mutual (is the sibling of, went to school with, worked with) and some are not (Justin likes Jane does not necessarily mean that Jane likes Justin). In graphical terminology, Janet, John, Jane and Justin are nodes, and the relationships between them are called edges.

There are several things to note here. First, much more data could be added without changing the structure of the graph. We could add more data about the people (address, color of eyes, etc.) and more data about the relationships (such as when and where Jane worked with John). Second, there can be many relationships between any two people (who is older, who is taller, etc.), whereas we have shown at most two. Third, nodes can be connected by an edge to themselves (e.g., Justin draws pictures of himself). Finally, while in the example all nodes are of the same class (people), graphs often span many classes of things.

Graph databases enable new kinds of query. Aside from the usual set of database queries, they handle questions that can be resolved only by walking the graph from node to edge to node. A good way to think of this is by considering the well-known “six degrees of Kevin Bacon” game. All databases can handle questions about which movies a given actor played in, but when you have to play that game and, say, walk the graph from Charlie Chaplin to Kevin Bacon, you’ll encounter a vast number of possible paths and dead ends – and you’ll need a graph database. One path that works, by the way, is: Charlie Chaplin — Jackie Coogan (in The Kid) — Montgomery Clift (in Lonely Hearts) — Burt Lancaster (in From Here to Eternity) — Kevin Costner (in Field of Dreams) — Kevin Bacon (in JFK).
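To make “walking the graph” concrete, the sketch below runs a breadth-first search over a toy co-star graph written as a plain Python adjacency list. The data is a hand-built illustration of the path quoted above, not a real movie database; a graph database performs this kind of traversal natively and at far greater scale.

```python
from collections import deque

# Toy, hypothetical co-star graph: each actor maps to actors they appeared with.
costars = {
    "Charlie Chaplin": ["Jackie Coogan"],
    "Jackie Coogan": ["Charlie Chaplin", "Montgomery Clift"],
    "Montgomery Clift": ["Jackie Coogan", "Burt Lancaster"],
    "Burt Lancaster": ["Montgomery Clift", "Kevin Costner"],
    "Kevin Costner": ["Burt Lancaster", "Kevin Bacon"],
    "Kevin Bacon": ["Kevin Costner"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: walk the graph node -> edge -> node until the goal is reached."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connecting path exists

print(shortest_path(costars, "Charlie Chaplin", "Kevin Bacon"))
```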

Graph databases allow the user to examine the relationships between the data items as well as the data items themselves. It is not just walking the graph that provides insights but also the ability to find patterns. In our movie database example you might wish to examine the pattern of which actors worked with a small number of different directors. You might also wish to examine interesting subgraphs, such as “Which actors appeared in more than one Western or more than one Oscar-winning movie?” But pattern matching gives the ability to extend insight far beyond “who knows whom.” Being able to detect patterns is essential to understanding relationships, how and why those relationships occur and what relationships might form.

Figure 2. A Simple Graph


The Versatility of Graphical Data Structures

What may not be obvious when considering the three distinct types of database we discussed is that graphical data structures can easily represent both relational data structures and hierarchical data structures. The converse is not the case.

With relational data structures the relationship of one table to another can simply be thought of conceptually as an edge connecting two nodes of a graph. In practice, an order header has order lines, so there is an edge (representing the verb “has”) connecting each of the order lines to a given order header. Similarly, if we consider the hierarchy of food that we illustrated, which was in fact a series of categories, we can think of the hierarchical connections representing the verb “is.” Thus a rib is pork, which is meat, which is food.
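As a minimal sketch of that mapping, the snippet below represents both cases as (subject, predicate, object) triples using plain Python tuples; the identifiers are illustrative.

```python
# Each edge of the graph is held as a (subject, predicate, object) triple.
# The identifiers below are illustrative only.

# The relational relationship: an order header "has" its order lines.
order_edges = [
    ("order:AA-0142", "has", "line:AA-0142/01"),
    ("order:AA-0142", "has", "line:AA-0142/02"),
]

# The food hierarchy: each category "is" its parent category.
category_edges = [
    ("rib", "is", "pork"),
    ("pork", "is", "meat"),
    ("meat", "is", "food"),
]

# Walking the "is" edges answers questions such as "is a rib food?"
def is_a(item, category, edges):
    parents = {s: o for (s, p, o) in edges if p == "is"}
    while item in parents:
        item = parents[item]
        if item == category:
            return True
    return False

print(is_a("rib", "food", category_edges))  # True
```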

In the relational case the graphical representation doesn’t add anything useful, although in the hierarchical case it can be useful, especially with hierarchies that define complex bill-of-materials data structures. (A bill of materials records the relationship of a product to its components, which themselves may have components, and so on.) In the graphical example, the value of the graphical data structure is clear because the verb involved, “purchase,” has attributes that are worth recording. We have more to add to this, but before we do that it is probably worth discussing some general examples of graph database applications.

Where the Rubber Meets the Road

The world is permeated by a large number of networks: telephone networks, electricity supply networks, road networks, airport networks, computer networks, the world wide web, mobile phone networks, social networks and so on. The internet provoked a rapid and unprecedented proliferation of such networks. The connectivity of software and the growing, disparate pools of website information spawned thousands of entirely new businesses that rode on that network or on networks within that network.

As a consequence the importance of graphical data has mushroomed. The application of graph databases has gained traction in many disparate areas: in network management, for master data management (MDM), in IT security, in telecom, in fraud detection, in content management, for recommendation engines, for analyzing social networks of every kind, for analyzing multiplayer games, for risk analysis in banks and in the insurance industry, for bioinformatics and for gene sequencing and in many other areas of healthcare. Networks exhibit behavior, that behavior needs to be analyzed, and such analysis requires getting answers to graphical queries that scan the network and its data sources.

There is data to be mined and knowledge to be discovered, and a great deal of it is new and valuable, primarily because analytics on such data was not previously practical. Many processes in a network are marked out by paths or cascades of events that play out from node to node until they complete. This is where graph analytics can make a unique impact, analyzing such activities and aggregating multitudes of such activities to identify predictable patterns within them. Data volumes have grown so quickly that a dedicated graph database that is optimized for such analytical activity has become a necessity.


Graph Databases

As we noted there are three kinds of analytic database – relational, document and graph – each with different areas of focus. Within any given category you find different products that adopt different engineering approaches to implementation. Graph database products can be roughly divided between products like Neo4j that implement their own query language and data structures and those that have adopted the graph processing standards that were established by the World Wide Web Consortium (W3C).

In our view the standards-based approach to a graph database is likely to dominate in the long term, as standards tend to do. Graph databases built to these standards are called RDF databases or sometimes triple stores, because they consider data in terms of triples. RDF stands for resource description framework, a term that unfortunately is not self-explanatory. To appreciate its meaning, it may help if we step back a little and consider the meaning of data.

A great deal of the data we need to store and analyze is organized in such a way that the only way to capture and analyze it easily is in a graph database. Computer and networking technology is awash with languages at every level, whether it is the language of a protocol for sending messages between devices or a programming language executing a process or the English text of an email passed from one person to another. Language is our primary means of recording information, knowledge and meaning.

Triples

The primary reason for storing data logically as triples is that triples allow you to break human language down into subject-predicate-object form. A simple sentence such as “John kicked the ball” can be parsed as John (subject) kicked (predicate) the ball (object). A slightly more involved sentence, such as “John kicked the large ball,” breaks into two triples: “John kicked the ball” and “The ball was large.” In fact, any sentence in any language can be represented as a triple or a series of triples, and it can also be represented as a graph that connects the words together in a way that captures their meaning.

This reduction of data into triples is possible for all data, and it has the virtue of preserving all the meaning that data contains. As such, it provides a foundation for processing data semantically, but it can also easily accommodate all graph data and, if desired, less complex data structures such as those stored in relational or document databases.
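Written out as plain tuples (purely for readability; ball1 is an illustrative identifier for the particular ball being referred to), the second example sentence becomes:

```python
# "John kicked the large ball" decomposed into two subject-predicate-object triples.
triples = [
    ("John",  "kicked", "ball1"),
    ("ball1", "size",   "large"),
]
```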

The Semantic Web

Tim Berners-Lee, the inventor of the world wide web, coined the term “semantic web” in a Scientific American article published in 2001. He formally defined it as “a web of data that can be processed directly and indirectly by machines.” His vision was that the web would evolve to become a resource where computers would be capable of analyzing all web data – not just the content, but the links connecting content and all transactions of any kind which used it. If that were possible, it would become possible to automate, in a distributed manner, all the day-to-day trade and documentation of all web users, and also to resolve natural language requests from web users.

This is still a distant goal. Nevertheless the hope for such a capability led to the creation and evolution of a set of standards. For the purposes of understanding what an RDF database is and how it can be used, we only need to discuss the following standards: RDF, RDFS, SPARQL and OWL.


RDF and RDF Schema

In essence, RDF is a data model that represents data using subject-predicate-object expressions. In some ways, it can be thought of as similar to an entity-attribute-value (EAV) data model. Consider a simple example: “The rose is red.” The RDF representation is the triple rose(subject)-is(predicate)-red(object); the EAV representation is rose(entity)-color(attribute)-red(value). The virtue of RDF is that it is far more versatile and captures more meaning, because its predicates can express any verb rather than a single implied “has attribute” relationship.

Although it is not obvious from the example, the RDF subject, predicate and object actually represent resources, which explains why RDF stands for resource description framework. So, while we may not think of a “rose” as a resource, it is a word, defined in a dictionary, that we can use to denote a particular thing (a kind of flower). If we did not know the meaning of the word we would discover its meaning by accessing its definition.

RDFS (RDF schema) constitutes a way for RDF to define metadata. Triples such as rose (subject) is (predicate) a flower (object) can be used not just to establish the metadata “flower” for rose but to define whole taxonomies. For example: A lion belongs to the subfamily pantherinae, which belongs to the family felidae, which belongs to the order carnivora, which are mammals. Using such collections of triples you can establish the taxonomy for the whole animal kingdom or even all of life on Earth.

As should now be apparent, a collection of RDF statements can be represented and processed as a labelled, directed multigraph. For an RDF database there is a need to decide how to physically store such collections of triples. Triples can, for example, be stored in three-column tables (subject-predicate-object) with the subject being treated as a key. Alternatively, all subject-predicate-object triples with the same subject can be stored as a record, with the subject data acting as a key and appearing just once. And, naturally, there are other possibilities, such as a sparse-matrix data structure.
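As a concrete sketch, the snippet below builds a small RDF graph with the rdflib Python library (assumed to be installed; any RDF toolkit would serve), recording one plain statement and part of a taxonomy like the one described above. The example.org namespace is illustrative.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")   # illustrative namespace, not a real vocabulary

g = Graph()

# A plain statement: "The rose is red."
g.add((EX.rose, EX.color, Literal("red")))

# Part of a taxonomy expressed with RDF Schema: lion -> Pantherinae -> Felidae -> ...
g.add((EX.Lion, RDFS.subClassOf, EX.Pantherinae))
g.add((EX.Pantherinae, RDFS.subClassOf, EX.Felidae))
g.add((EX.Felidae, RDFS.subClassOf, EX.Carnivora))
g.add((EX.Carnivora, RDFS.subClassOf, EX.Mammal))

# A particular individual typed against the taxonomy.
g.add((EX.Elsa, RDF.type, EX.Lion))

# The collection of triples is itself a labelled, directed graph that can be
# serialized in standard formats such as Turtle.
print(g.serialize(format="turtle"))
```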

Ontologies and OWL

A taxonomy is a useful construct enabling us to define a meaningful set of categories and subcategories (a hierarchical structure) for data and metadata. But more sophistication than that is required to button down all the possibilities of human language. In particular there is a need to accurately define knowledge rather than just information. For that we have what are called ontologies. Practically, an ontology consists of a set of classes and individuals (categories and the things that belong to them) together with a set of property assertions and axioms (rules or constraints) that define how they are allowed to relate to each other.

The word “figment” provides a simple English-language example. Figment is a very restricted word, as it can apply only to the word “imagination.” You cannot have a figment of an orange or a figment of anything else other than imagination. So a precise ontology of the English language would include such a constraint. If you consider it, you quickly realize that there are many such constraints in human language. You cannot sing a sofa and you cannot sit on a song. Human language also has the difficulty of metaphorical constructions, such as “he sang like a canary.” The usual meaning we apply is not that some person imitated a canary, but that the person told everything they knew to the police. An English ontology needs to identify such sentences accurately.

Fortunately a great deal of work has already been done to construct ready-to-use ontologies, some in specific areas of knowledge and some more general. Many such ontologies deal with medicine and healthcare. They include ontologies for cellular processes, for diseases, for human anatomy, for genomics, for pharmaceutical engineering and so on. There are scientific ontologies, business ontologies and so on. There is even a search engine, Swoogle, which searches RDF resources, including ontologies – which are written in RDF – that are available on the web.

OWL, the Web Ontology Language, is based on RDF triples and is a knowledge representation language for authoring ontologies. OWL ontologies can import other ontologies, adding information from the imported ontology to the one that is being built or enhanced. Among other things, an ontology can help resolve ambiguities. A simple example might be that in one database customer data is kept in the customer table, while in another it is kept in the client table. This naming issue can be resolved in OWL by declaring customer and client to be the same.
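A minimal sketch of that kind of declaration, again using rdflib; the namespaces and identifiers are illustrative, not drawn from any real database.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

# Illustrative vocabularies for two databases that name the same concept differently.
DB1 = Namespace("http://example.org/db1/")
DB2 = Namespace("http://example.org/db2/")

g = Graph()

# Declare that "customer" in one database and "client" in the other denote
# the same class of thing, so queries over either name see both sets of data.
g.add((DB1.Customer, OWL.equivalentClass, DB2.Client))

# The same idea at the level of individual records: two identifiers, one entity.
g.add((DB1.customer_1042, OWL.sameAs, DB2.client_77315))
```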

SPARQL

SPARQL is the query language for RDF data in exactly the same way that SQL is the query language for relational data. Like SQL it provides a full set of analytic query operations corresponding to joins, sorts and data aggregations, as well as graph walking operations for traversing graphs and networks. Property paths (possible routes between specific nodes of a network) can also be queried in SPARQL. The results of a SPARQL query can be a result set or an RDF graph.

The power of SPARQL can be, and often is, increased by the use of an external ontology that allows data from different sources to be joined together. In these cases it resolves any differences in naming between multiple databases or other sources of RDF information, such as any of the four-million-plus web domains that include semantic web markup.
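As a sketch of what a SPARQL query looks like in practice, the snippet below runs a query containing a property path against a small rdflib graph like the one built earlier; the ex: vocabulary is illustrative.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Lion, RDFS.subClassOf, EX.Pantherinae))
g.add((EX.Pantherinae, RDFS.subClassOf, EX.Felidae))
g.add((EX.Felidae, RDFS.subClassOf, EX.Carnivora))
g.add((EX.Carnivora, RDFS.subClassOf, EX.Mammal))
g.add((EX.Elsa, RDF.type, EX.Lion))

# Find every individual that is (directly or transitively) some kind of mammal.
# rdf:type/rdfs:subClassOf* is a property path: one type edge followed by any
# number of subClassOf edges.
query = """
PREFIX ex:   <http://example.org/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?individual
WHERE {
  ?individual rdf:type/rdfs:subClassOf* ex:Mammal .
}
"""

for row in g.query(query):
    print(row.individual)   # http://example.org/Elsa
```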

RDF Database

Hopefully, you will have begun to understand from the above description of RDF and SPARQL what distinguishes an RDF database from other types of graph database. It implements a well-defined and well-tried set of graph processing standards.

In practice, databases are engineered for specific workloads: the relational database for transactions and table-oriented queries, the document database for transactions (with few joins) and hierarchical queries, and the graph database for graph queries. Graph databases perform extremely well for graphs and hence come into their own when the relationships between data items are at least as important as the data items themselves. They can also support semantic queries and applications (which are beyond the capabilities of other types of databases).

One example of an RDF database is the Cray Graph Engine (CGE).


Cray Urika-GX Platform and the Cray Graph Engine

Cray’s Urika-GX platform marries high performance computing to big data analytics. It integrates a variety of hardware and software technologies to manage analytics workflows including data preparation, data exploration, operational analytics, machine learning and streaming analytics, as well as graph analytics. The full software stack is depicted in Figure 3, which shows the three environments of Spark™, Hadoop and the Cray Graph Engine (CGE) supported by scheduling and cluster management software (YARN for Hadoop, Mesos™ for CGE, Marathon as the container orchestration platform and Mesos for cluster management).

In practice this allows the execution of a wide variety of analytics applications over large volumes of data distributed across the Cray cluster, and it also enables the workflows that may be needed to coordinate those applications with each other. What is distinctive about the Urika-GX platform, and what is likely to prove attractive to users, is the power of CGE.

CGE is a semantic database that uses RDF triples to store data and SPARQL as the query language, and it includes specific mathematical extensions to support classical graph algorithms. The point to note is that a SPARQL query can be used to select a particular graph or subgraph for analysis, and then an appropriate graph algorithm can be applied to analyze its structural properties in a way that would be impossible, or at least very convoluted, if it were attempted with SPARQL alone.

CGE provides the tools for capturing and organizing, as well as analyzing, large sets of graphical data. You can interactively analyze data using pattern matching and filtering, as well as by employing a multitude of algorithms. It offers optimized support for inferencing and deep graph analysis, and enables real-time analytics on the largest and most complex graphs.
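The select-then-analyze pattern can be illustrated in miniature with rdflib and plain Python. This is a conceptual sketch only, not CGE’s interface or its built-in graph functions, and network.ttl is a hypothetical dataset of network links.

```python
from collections import Counter
from rdflib import Graph

g = Graph()
g.parse("network.ttl", format="turtle")   # hypothetical RDF dataset of network links

# Step 1: use SPARQL to select the subgraph of interest (here, "connectsTo" edges).
edges = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?src ?dst WHERE { ?src ex:connectsTo ?dst . }
""")

# Step 2: apply a graph algorithm to the selected edges. As a trivial stand-in
# for the structural analysis described above, count node degrees to find the
# most highly connected nodes.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

for node, count in degree.most_common(5):
    print(node, count)
```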

Figure 3. The Cray Urika-GX platform stack


Inferencing is especially important. A graph database can significantly increase the knowledge it holds by allowing users to define sets of inferencing rules. These are employed during database builds and updates to automatically create new relationships between objects. For example, the simple ontological definition that a faculty member is also an employee, and an employee is also a person, can be set as two rules. This eliminates the need to explicitly include each such type for each corresponding item in the database; the inferencing rules generate the appropriate links. The rules thus give SPARQL queries access to inferred data as well as the raw data. They can be thought of as an effective means of data compression.
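A toy illustration of that materialization, using rdflib and the faculty/employee/person rules from the example above. Production inference engines, including CGE’s, are far more sophisticated; this sketch only shows the idea of rules generating links that queries can then see.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")   # illustrative namespace
g = Graph()

# The two rules, expressed as subclass relationships:
# a faculty member is also an employee, and an employee is also a person.
g.add((EX.FacultyMember, RDFS.subClassOf, EX.Employee))
g.add((EX.Employee, RDFS.subClassOf, EX.Person))

# The raw data records only the most specific type.
g.add((EX.DrSmith, RDF.type, EX.FacultyMember))

# Materialize inferred types: keep adding (item, type, superclass) triples
# until nothing new appears, so queries see inferred as well as raw data.
changed = True
while changed:
    changed = False
    for item, _, cls in list(g.triples((None, RDF.type, None))):
        for _, _, parent in list(g.triples((cls, RDFS.subClassOf, None))):
            if (item, RDF.type, parent) not in g:
                g.add((item, RDF.type, parent))
                changed = True

print((EX.DrSmith, RDF.type, EX.Person) in g)   # True
```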

Cross-database rules can also be established, making it possible to establish a relationship (for example, via a global ontology) between triples in different databases. A simple example would be jointly querying two databases created in different languages, say two pharmaceutical databases, one in English and the other in German. A global ontology would identify which fields were equivalent, but even in its absence it would be possible to create rules simply defining equivalent data between the two data sources.

Performance

It should be no surprise that CGE offers extreme scalability and extreme performance. It achieves this by combining sophisticated parallel software engineering with the unique performance capabilities of Cray’s HPC hardware. There is an important aspect to this that is particular to graph analytics. In contrast to queries on relational databases or document databases, graph queries tend to pull together data that is widely distributed through a network of data relationships rather than stored contiguously. Thus the result of a query is assembled in a manner more akin to random reads than serial reads. When graphs get very large, the ability to assemble such data quickly has a huge impact on response times.

As a graph database, CGE naturally takes advantage of Cray’s mastery of in-memory data processing. CGE also leverages Cray’s proficiency with high-speed networks, making effective use of the Cray Aries™ interconnect. This has particular benefits for graph queries: the interconnect is very fast at fetching small amounts of data from across the network, a capability that pays dividends for queries on very high volumes of graph data.

The upshot of such capabilities is that CGE is often an order of magnitude or two faster than competitive products running on a similarly sized system and can scale far beyond most such products. This is not just a matter of queries executing faster – at scale CGE can run workloads that are simply not possible on other products.

Cray Graph Engine (CGE) Market Position

CGE has attracted customers in very large organizations that had a clear need to implement graphical analytics on big data. It is, of course, unlikely to be prominent on the list of possible technology solutions for companies that are just “dipping their toe in the water.”

Companies most likely to be interested in the product are those that have encountered the limits, in terms of either speed or scalability, of less capable products. Such companies will likely fit the following profile:

• They have invested heavily in big data analytics and have a particular interest in analyzing data relationships and patterns in such relationships.

• The company’s current data models and schemas do not support a significant number of the queries that need to be posed. In these circumstances, it is often a graphical capability that will resolve the problem, and hence should be investigated.

• The company’s graphical analytics has encountered problems, perhaps because the queries have become too complex for the current environment or simply take too long. It is not unusual for such companies to discover that some of their queries are “unrunnable” on standard distributed-memory databases.

• They wish to analyze a wide range of new and disparate data sources, many of which may be inherently unstructured or poorly structured.

The Explosion of Graph Applications

In the past few years, graph analytics has established itself as an additional dimension to big data analytics. It had been a niche and quiescent area of activity for many years. While the various security agencies of the world have undoubtedly built and mined very large volumes of graphical data for many years, until recently there were no big data graph analytics products available to corporate users. That changed, leading to growing interest and a considerable increase in graph analytics activity.

Prior to the advent of Hadoop, most analytics activity was done on corporate data held in tabular form in data warehouses or data marts. Hadoop increased such activity, enabling analytics on much broader collections of data, particularly log file data and external data sources. It also stimulated interest in graph analytics by virtue of the open-source Apache Giraph software – an entry-level capability. The introduction of products like CGE stimulated additional investment in large-scale graph analytics.

There are many business areas where graph analytics is currently being adopted. They include:

• Pharmaceuticals, particularly in drug interactions and drug repurposing, but also in more esoteric areas such as precision medicine.
• Biomedical research, particularly in the area of genome sequencing.
• Disease research, primarily for common chronic conditions and particularly in cancer research.
• The financial services industry, touching many areas from insurance through fraud detection and the increasingly important area of risk analysis.
• Social networks of every type, including the increasingly popular and complex business of multiplayer games.
• Large-scale retail where graph analytics is proving increasingly useful in determining opportunities to up-sell and cross-sell products.
• IT security, particularly in the area of network security and cybersecurity.
• News analytics, using graph analytics to generate knowledge from news data streams.

An area where graph analytics is proving particularly important is data discovery. It is proving to be a widely applicable means of finding patterns in data that are not obvious and would probably not be discovered by common data mining techniques. Data discovery plays to a strength of graph analytics in its ability to span diverse collections of data and tease out relationships that may not be intuitively obvious. The graphical analysis of multiple datasets can provide a more comprehensive picture of the factors that impact business outcomes by including, for example, weather data, demographic data, local economic data, social media and so on. Graph analytics can identify unexpected correlations and determine their importance.

In the two use cases that follow we focus on graph analytics applications that were run using CGE on the Urika-GD appliance, the predecessor to the Urika-GX platform, describing how the power of Cray’s technology helped solve common and difficult challenges.

Cybersecurity

Before the internet, data could be compromised only via physical access, and such breaches were typically carried out by someone close to the data. Today, with vast networks spanning the globe, data is much more vulnerable. Organizations use firewalls, encryption techniques and other security technology, but cybercrime continues to evolve and often stays one step ahead of the game.

The consequences of a data breach can be devastating, not just to the individual or organization whose data was stolen, but also to the security professionals charged with keeping the data safe. Reputations can be ruined, and regulatory fines are hefty. Organizations simply cannot afford breaches of any kind.

Identifying threats after they happen is far easier than discovering an attack as it is happening. Because network traffic is so heavy – and data sources so disparate – analysts often need to decide between operating on a subset of data or constantly adjusting schemas to accommodate new data sources. Both of these approaches are flawed in combating a cyberattack, as neither provides fast query results that span all relevant datasets. Add to this the fact that commodity hardware typically cannot scale well as data volumes increase, and you have a recipe for disaster.

One Cray customer, a large government organization, realized that in order to discover threats before they caused irrevocable damage, it needed to invest in a scalable database capable of linking data from many sources and executing at lightning speed on in-memory hardware.

With Cray’s Urika-GD appliance, the organization was able to load network data from multiple, disparate sources into a single graph while maintaining the entity and relationship information within the data. The Urika-GD platform’s shared memory architecture meant that the entire graph of relationships could be held in memory, enabling blazing-fast query performance and real-time updates as new data streamed in.

The Urika-GD system allowed the customer to find, isolate and neutralize cyberthreats immediately.

Cancer Research

Big data is no stranger to the medical research community. Research labs collect vast amounts of data. Some of this data is public, such as The Cancer Genome Atlas, and some datasets are private. But, awkwardly, these data collections are stored in a variety of structures and formats.

One research firm, the Institute for Systems Biology (ISB), wanted to investigate and validate the relationships between genes and drugs by using existing datasets. It found that its traditional solutions failed in terms of performance – they simply were not fast enough. Analysts and data scientists were spending an egregious amount of time gathering data from multiple sources and building models to analyze the data. With time of the essence, this was not a practical approach.


Using Cray’s Urika-GD system, the ISB was able to dramatically accelerate its cancer research. Working with Cray’s team, the ISB developed and implemented a real-time graph analytics solution that required no partitioning or data modeling in order to test a hypothesis. This approach, combined with the ease of an RDF/SPARQL interface and multithreaded graph processors, enabled researchers to rapidly integrate their private data with public datasets, thus creating the possibility to quickly discover new relationships in the data.

Unlike traditional, query-based systems, graph analytics lets the data guide the user, which in turn fosters deeper knowledge discovery and greatly reduces time to insight. With Cray, the ISB team can now validate 1,000 hypotheses in the time it used to take to validate one. No current relational system could return results at that speed.

In Summary

While RDF databases like CGE are capable of running semantic queries and hence supporting semantic applications, most applications for such products currently focus on high-volume graphical data with the primary goal of teasing out new knowledge from extensive collections of data. This is a rich area of analytics activity primarily because such diverse data has not previously been subjected to graph analytics at scale.

It is fair to say that this area of analytics is in its infancy, yet it is already showing a great deal of promise in providing new insights and analytics solutions for those businesses that are motivated to assemble and analyze large volumes of disparate data. The ability to deeply explore and find meaningful patterns and relationships in diverse datasets offers a wealth of new business and analytical applications and opportunities.

As a closing thought, consider the following questions:

• Are your business goals in line with the nature of your data?

• Is your data strategy effective when it comes to analytics and knowledge discovery?

• Will your current infrastructure scale to support growing and disparate data sources?

If you answered “no” to any of these questions, supplementing your database environment with a graph analytics platform may be the right solution for your organization. Graph engines deliver speed, scale and insight over vast and complex data, and as such, their applications can provide an enormous competitive edge.


About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information.

The Bloor Group is the sole copyright holder of this publication.

Austin, TX 78720 | 512-524-3689