Blue Canopy Semantic Web Approach v25 brief


Knowledge at the Speed of Need: HADOOP Integration and the Semantic Web

National Cyber Practice

Blue Canopy White Paper
Written by Nick Savage, Robert Bergstrom

May 2013 Blue Canopy


Contents

Executive Summary
Solving the “Big Data” Challenge - Volume, Velocity, and Variety
Identifying a High Volume Data Challenge
Identifying a High Velocity Data Challenge
Identifying a High Variety Data Challenge
The Distinct Roles of HADOOP Integration versus the Semantic Web
The Blue Fusion Technical Architecture
Tools of the Trade
Conclusion


Executive Summary

In government and business alike, organizations must exchange trillions of dollars and petabytes of data for billions of consumers and workers, and do so in fractions of a second, every day. The data they produce ranges from continuous security log files to patient or client records that must remain online, or be archived offline, to comply with mandates and laws. In healthcare, for example, patient records must typically be retained for 7 to 10 years, whereas military patient records must be retained beyond the life of the soldier and his or her beneficiaries. How do you look for trends and patterns in that much data when its future value cannot be predicted or measured? Even before ingesting the data, you must decide where to store it to avoid read-versus-write contention and to query it effectively.

Another challenge for the technologist is managing streaming data generated rapidly by networks, web searches, sensors, or phone conversations: large volumes that may be read only infrequently, with limited computing and storage capacity available. How do you mine this data? How do you integrate and correlate this dynamically changing data with your structured data (i.e., databases, marts, or warehouses) when no foreign keys are available?

Data velocity is another operational challenge for institutions where time is critical: searching for one defined state in order to save lives, avoid catastrophic events, or recognize significant positive activity. Such systems must perform complex event processing, extracting information from multiple data sources quickly and reducing many events to a single activity, state, or alert; the analyst or expert system must then predict events based on patterns in the data and infer an anomaly or complex event. In real-world operations (e.g., situational awareness scenarios), a complex event requiring data and operational involvement from multiple federal agencies (e.g., the Intelligence Community, CDC, healthcare providers, FBI) and their systems must be identified quickly, before a bioterrorist can unleash a pandemic smallpox threat. Blue Fusion is the Blue Canopy solution that adapts to new information, new ontologies, and new relationships based on the domain and presents the results to the user in a simple way.

Organizations must also overcome disparate data challenges, managing stovepipe systems across the enterprise in both structured (e.g., database, XML) and unstructured (e.g., email, blobs, PowerPoint slides) formats to extract the most knowledge under differing data delivery requirements. How do you maintain performance and system tolerance while integrating and correlating these multiple data sources?

This paper provides the architectural details of the Blue Canopy technical approach, Blue Fusion, which is to design a loosely coupled architecture and develop an ontology-based


application environment that leverages the open source distributed Hadoop framework and the flexibility of the Semantic Web to produce a big data integration solution that is fast, dynamic, cost-effective, and reliable.

Solving the “Big Data” Challenge - Volume, Velocity, and Variety

You must first formulate a data processing profile and analyze the type of “Big Data” challenge the user and the organization face: whether you are addressing data velocity requirements driven by critical conditions, data volume issues (e.g., fusing data infrequently into a large integrated data store), or a third category in which the obstacle is consolidating a wide variety of data sources, types, and delivery systems into a seamless but loosely coupled environment. A well-conceived loosely coupled architecture accommodates combinations of the three data profiles (i.e., high volume, high velocity, and high variety). In the first case, we will explore a big data high volume scenario.

Identifying a High Volume Data Scenario

Figure 1 Banking Use Case - Hi Volume Data Scenario


(Hi-Volume Data) As depicted at a high level in Figure 1, in this banking use case the banking analyst runs analytics on very large amounts of data from possibly disparate data sources, identifying “gaps” and “overlaps” in a composite view of the subject that is not time sensitive. Forensics can be performed in this case to search for the pre-defined conditions and anomalies based on the ontology. To reiterate the objective, the emphasis is on addressing high volume, not data velocity; the disparate data must be ingested, collected, normalized, and analyzed.

The Situation

The banking system runs thousands of transactions daily. An internal criminal has inserted an algorithm that deposits “half pennies” into a “shadow” account, collecting funds from banking accounts across the region. This series of minuscule thefts may escape the analyst's notice, yet the funds rapidly accrue into millions of dollars. Without the aid of a forward-thinking expert system, the analyst may take months to uncover the pattern, if it is uncovered at all.

The Approach

The Blue Fusion approach rolls all of the banking information for the day, transactions and emails alike, into a nightly backup to be analyzed against the ontology definitions and business rules loaded into memory by the rules server. Those rules are created to look for correlations and patterns related to fraud and to generate an alert to the bank analyst. In this case, the system is instructed to seek out a pattern of unusually small transactions in large numbers that correlate with very large withdrawals, or with withdrawals made just below the $10,000 reporting threshold.

1. The Blue Fusion rules server determines where the information will be stored and/or queried once the data has been ingested into HDFS.

2. Once the data is map-reduced, a non-RDF triple is stored in the HBASE cache using the timestamp as the unique identifier. Why? HBASE does not currently recognize data graphs.

3. After each ingest of new data, all data is pushed to the RDF Database for storage and possible display. The notification system looks for the pre-defined alerts or conditions defined in the ontologies.

4. The triples are pushed to the RDF Database for immediate storage or SPARQL queries, and the result set from the query is displayed on the user's dashboard (a sketch of such a query follows this list).


5. Based on the user's rights and settings, the analyst may see the account number, the number of associated transactions, and the total amount stolen, whereas the investigator may have access to the culprit's name and other vital details.
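To make step 4 concrete, the following is a minimal sketch, using Apache Jena, of the kind of SPARQL query a dashboard might issue against the RDF database to surface the “half penny” pattern. The endpoint URL, the ex: vocabulary, and the thresholds are illustrative assumptions, not Blue Fusion configuration.

    import org.apache.jena.query.*;

    /** Minimal sketch: query an RDF store for accounts receiving many sub-cent deposits.
     *  The endpoint URL and the ex: vocabulary are illustrative assumptions. */
    public class FraudPatternQuery {
        public static void main(String[] args) {
            String sparql =
                "PREFIX ex: <http://example.org/banking#>\n" +
                "SELECT ?account (COUNT(?txn) AS ?deposits) (SUM(?amount) AS ?total)\n" +
                "WHERE {\n" +
                "  ?txn ex:creditedTo ?account ;\n" +
                "       ex:amount     ?amount .\n" +
                "  FILTER (?amount < 0.01)\n" +   // the sub-cent "half penny" deposits
                "}\n" +
                "GROUP BY ?account\n" +
                "HAVING (COUNT(?txn) > 1000)";     // flag unusually high deposit counts

            Query query = QueryFactory.create(sparql);
            // Assumes a SPARQL endpoint exposed by the RDF database; the URL is a placeholder.
            try (QueryExecution qexec =
                     QueryExecutionFactory.sparqlService("http://rdf-db.example/sparql", query)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.printf("Account %s: %s deposits totaling %s%n",
                            row.get("account"), row.get("deposits"), row.get("total"));
                }
            }
        }
    }

A companion query over withdrawals just below the $10,000 threshold would follow the same pattern; in practice both would be generated from the ontology definitions and business rules rather than hard-coded.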

Identifying a High Velocity Data Scenario

In the second case, we will explore a high velocity data scenario.

Figure 2 Bioterrorism Hi-Velocity Data Scenario

(Hi-Velocity) In this scenario the analyst must focus on mining near real-time and streaming data to resolve critical conditions. As depicted at a high level in Figure 2, in this bioterrorism use case the law enforcement analyst runs analytics on very large amounts of data from possibly disparate data sources, identifying “gaps” and “overlaps” in a composite view of the subject, but this time the information returned is time sensitive. Forensics can be performed in this case to search for the pre-defined conditions and anomalies based on the ontology. The emphasis is on addressing data velocity: the disparate data must be ingested, collected, map-reduced, and saved in triple stores to identify a state, based on monitoring the network, atmospheric sensors, and weather indicators in addition to the structured data contained in the transactional database, the business intelligence system, and the reporting system.


The Situation

Bioterrorists infect members of an operational cell with smallpox. Their plan is to fly on planes to football games held in domed facilities where 100,000 sports fans are in attendance during the late summer, while the weather is still very hot. Authorities will have seven to seventeen days to contain the emergency. Several civilians become infected prior to the planned attack and visit area hospitals showing symptoms of high fever and rash. After several days, the hospitals conduct blood tests, and the results return positive for smallpox. A sensor capable of detecting airborne pathogens generates data indicating elevated levels of the variola virus in a subway tunnel the terrorists traveled through from the airport.

The Operational Approach

The local hospitals continue to treat the patients based on limited information: a high fever is a common symptom of many diseases. Meanwhile, the CDC analyst monitors the systems, searching for any critical conditions related to pandemics. As the disease progresses, rash and lesions appear on the patients. Blood tests are conducted on the severely ill, and the results return positive for variola, indicating smallpox. At this stage, the race against the clock begins as the medical and law enforcement organizations are notified.

The Blue Fusion Approach

1. In the background, the Blue Fusion ontology rules created to look for correlations and patterns are processed by the Complex Event server (a sketch of such a correlation check follows this list).

2. The rules server queries the data for reports of large sums of capital distributed to a suspicious non-profit, criminal activity in 911 calls, web activity, and symptoms recorded for other patients and hospitals.

3. Initially this information was stored in the HBASE cache.

4. Within days, the Blue Fusion system receives information from the pathogen biosensor system, retained in the triple stores in the RDF database, and simultaneously the CDC receives a single blood test returned positive for a patient.

5. The CDC and health analysts alert the authorities. The expert system correlates the new data and retrieves the information to be processed in the RDF database, enabling additional queries for the law enforcement personnel tracking down the operational cell.

6. The community of interest mobilizes to immunize the population and capture the criminals.
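One way to express the correlation the Complex Event server watches for in steps 1 through 5 is an ASK query over the triple store, combining a positive variola lab result with an elevated pathogen sensor reading. This is a hedged sketch only: the vocabulary (ex:labResult, ex:pathogenLevel, and so on) and the threshold are invented for illustration and are not part of the Blue Fusion product.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    /** Sketch of a complex-event check: does any positive smallpox lab result coincide
     *  with an elevated variola sensor reading? Vocabulary and threshold are assumptions. */
    public class PandemicCorrelationCheck {
        static final String ASK_SMALLPOX_EVENT =
            "PREFIX ex: <http://example.org/biosurveillance#>\n" +
            "ASK {\n" +
            "  ?patient ex:labResult     ex:VariolaPositive .\n" +
            "  ?sensor  ex:pathogen      ex:Variola ;\n" +
            "           ex:pathogenLevel ?level .\n" +
            "  FILTER (?level > 0.8)\n" +        // illustrative alert threshold
            "}";

        public static boolean alertCondition(Model tripleStore) {
            try (QueryExecution qexec =
                     QueryExecutionFactory.create(ASK_SMALLPOX_EVENT, tripleStore)) {
                return qexec.execAsk();           // true => raise an alert (Red) condition
            }
        }

        public static void main(String[] args) {
            Model demo = ModelFactory.createDefaultModel();  // stand-in for the RDF database
            System.out.println("Alert condition present: " + alertCondition(demo));
        }
    }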


Identifying a High Variety Data Scenario

In the final case, we will explore the hybrid scenario, in which data fusion requires the developer to manage a high variety of data sources, types, and delivery requirements.

Figure 3 Hi-Variety Data Scenario

(Hi-Variety) In this financial hacker scenario, the analyst must focus on defining relationships and uncovering links in the data that identify, segment, and stratify subject populations based on a wide variety of variables, types, and structures, semantically integrating data to create a holistic, 360-degree view of the subject across a defined continuum. As depicted at a high level in Figure 3, in this financial use case the financial analyst and the law enforcement analyst work together to run analytics on very large amounts of data from possibly disparate data sources, identifying “gaps” and “overlaps” in a composite view of the subject where mobile applications exist to initiate trades, and where video must be merged with transactional data, emails, and flat files. Forensics can be performed in this case to search for the pre-defined conditions and anomalies based on the ontology. However, the system is in constant flux, building on the original ontologies defined in the repository to ensure the system adapts to and learns from the new relationships. The emphasis is on addressing flexibility and complex data integration: the disparate data must be ingested, collected, map-reduced, and saved in a very high number of triple stores to identify the relationships at the data graph level.


The Situation

An international hackers' organization has been hired to create havoc on the stock market by contaminating and limiting access to major banking, financial, and corporate sites, in addition to many high-profile social media sites; the criminal firm that hired them plans to exploit the stock market by placing short sales on the stock of the affected companies. As the global market plummets and panic ensues, the firm makes billions of dollars from the negative events, routing the gains to private accounts. One hacker who resembles a high-level, recently deceased executive has assumed his identity, complete with the required ID cards and access codes; the individual liquidates millions and disappears in the night. All parties are unaware of the executive's whereabouts and demise until his body is discovered the next day. In the meantime, the hackers have disabled blogs and email systems and penetrated the network, accessing personal credit card and banking information for millions of people. The identities are sold on the black market.

The Blue Fusion Approach

The Blue Fusion approach gathers big data from multiple sources. It first ingests the data exchanged between several large organizations: based on established data exchange agreements between the FBI and the affected financial institutions, the structured data is ingested into HDFS and the HBASE cache for integration with related emails and with security and network audit log files, held for future data mining once the highest-priority processing events have been completed. The rules server map-reduces the higher-priority real-time data (e.g., network intrusion detection alerts, phone activity) and stores the triples in the RDF database based on the ontologies that define the relationships between stock price changes and the formulas that identify anomalies in selling and buying patterns on the market. The expert system identifies the companies and individuals that capitalize on the major losses.

After a Blue Fusion alert is generated indicating that the missing executive's logout time did not coincide with his departure from the office, the location and description of related videos from his financial office reveal strange activity there. The link to the video is accessed through a query on the data graph stored in the RDF database, identifying the imposter in the FBI hacker database. The initial information stored in the HBASE cache is mined, and the relevant triples are moved to the RDF database (a sketch of such a cache write follows). The expert system correlates the new data and retrieves the information to be processed in the RDF database to conduct additional queries for law enforcement personnel. The communities of interest mobilize to trace the money trail, correlate the hackers' methods of operation from the hacker database, and locate their current whereabouts.
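As in step 2 of the banking scenario, non-RDF tuples are cached in HBASE with the timestamp as the unique identifier. A minimal sketch of such a write with the standard HBase Java client might look like the following; the table name, column family, and column qualifiers are placeholders, not Blue Fusion settings.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Sketch of writing a subject-predicate-object tuple to an HBase cache,
     *  row-keyed by timestamp as described in the banking scenario.
     *  Table name "fusion_cache" and family "t" are illustrative assumptions. */
    public class CacheTupleWriter {
        public static void cacheTuple(Connection conn, String subject,
                                      String predicate, String object) throws Exception {
            try (Table table = conn.getTable(TableName.valueOf("fusion_cache"))) {
                // A timestamp row key gives a time-ordered identifier for the tuple.
                Put put = new Put(Bytes.toBytes(Long.toString(System.currentTimeMillis())));
                put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("s"), Bytes.toBytes(subject));
                put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("p"), Bytes.toBytes(predicate));
                put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("o"), Bytes.toBytes(object));
                table.put(put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                cacheTuple(conn, "acct:12345", "ex:receivedDeposit", "0.005");
            }
        }
    }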

The Distinct Roles of HADOOP Integration versus the Semantic Web

Technically, what roles do the HADOOP framework and the Semantic Web provide?


In this case the HADOOP components fulfill several functions; the primary roles are storage, distributed processing, and fault tolerance, ingesting the data and serving as a data cache through HBASE. HBASE stores data in a large, scalable, distributed database that runs on HADOOP. From a technological perspective, HBASE is designed for fast reading rather than writing, so it is better to run queries on time-sensitive data in the Resource Description Framework (RDF) database. Why? Semantic Web graph data in RDF triple-store form (i.e., subject-predicate-object triples) requires indices to optimize queries over the data, and those indices do not exist in the HBASE database structure; significant Java coding, along with JENA and REST servers, would be required to poll for data changes and maintain notifications to the user's dashboard. When time is not critical, however, HDFS and HBASE are best used when it has yet to be determined what data mining must occur for a given data source. Based on time sensitivity, a rules server must determine whether the information should (a sketch of this routing decision follows the list below):

1. Remain in HDFS and be stored in the RDF database for immediate processing, or
2. Be batched at a later time and stored in HBASE for possible future mining.
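A minimal sketch of that routing decision follows, assuming a hypothetical IngestRecord type that carries a time-sensitivity flag; the class and method names are illustrative, not the Blue Fusion rules-server API.

    /** Hypothetical sketch of the time-sensitivity routing rule described above.
     *  StorageTarget and IngestRecord are illustrative types, not Blue Fusion classes. */
    public class StorageRouter {

        enum StorageTarget { RDF_DATABASE, HBASE_CACHE }

        /** A minimal stand-in for an ingested record and its data-profile metadata. */
        record IngestRecord(String source, boolean timeSensitive) { }

        /** Route time-sensitive records to the triple store, the rest to the cache. */
        static StorageTarget route(IngestRecord rec) {
            return rec.timeSensitive()
                    ? StorageTarget.RDF_DATABASE   // immediate SPARQL processing
                    : StorageTarget.HBASE_CACHE;   // batched for possible later mining
        }

        public static void main(String[] args) {
            System.out.println(route(new IngestRecord("bio-sensor feed", true)));      // RDF_DATABASE
            System.out.println(route(new IngestRecord("nightly e-mail dump", false))); // HBASE_CACHE
        }
    }

In practice this decision would be driven by the data processing profile described earlier (volume, velocity, variety) rather than a single boolean flag.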

Once the data is stored in the RDF database in graph form as triples, the analyst or the expert system can use SPARQL to execute queries against the data. It is recommended to leverage an RDF database with the capacity to store on the order of one billion to 50 billion triples, to ensure the various permutations of relationships for the domain are well captured while maintaining write and query performance.

What do the Ontology approach and the Semantic Web provide?

First and foremost, the ontological approach gives designers and developers the flexibility and adaptability to dynamically define the rules and relationships among structured, semi-structured, and unstructured data. BlueFusion uses the XML structure of incoming data and a reference ontology to automatically derive a conceptual graph representing the semantics of the data. The RDF schema contains the metadata (i.e., the description of the data). Most importantly, a conceptual graph is created for each data source according to the framework and can be saved in the RDF database to ensure the triples can be processed and queried using SPARQL (the SPARQL Protocol and RDF Query Language). This data graph represents the semantic integration of the original data sources. Integration ontologies can be extended to repurpose or share data, to support different tasks across disparate applications, and eventually to cross organizational boundaries if it is an internet-based application.

The automated agent evaluates incoming XML messages and compares the information to the integrated ontology, using an established reference ontology for lookup purposes. The reference ontology is implemented in OWL (the Web Ontology Language) and contains a model of XML document structure, along with rules and axioms to interpret a generic XML document. With regard to integration, each data instance is stored as an instance of a


concept, each of which has relations to other concepts that represent metadata such as time stamps and source data.
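A hedged sketch of the kind of lifting the automated agent performs: each element of an incoming XML message becomes a triple on the message resource, which is then added to a Jena model for storage and SPARQL querying. The namespace and element names are assumptions for illustration; the real agent derives its mapping from the OWL reference ontology rather than using element names directly.

    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    /** Sketch: lift a flat XML message into RDF triples under an assumed namespace.
     *  Here each child element simply becomes a property of the message resource. */
    public class XmlToRdfAgent {
        static final String NS = "http://example.org/fusion#";   // illustrative namespace

        public static Model lift(String xml, String messageId) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Model model = ModelFactory.createDefaultModel();
            Resource msg = model.createResource(NS + messageId);
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n instanceof Element e) {
                    // Element name becomes the predicate, text content becomes the object literal.
                    msg.addProperty(model.createProperty(NS, e.getTagName()),
                                    e.getTextContent().trim());
                }
            }
            return model;
        }

        public static void main(String[] args) throws Exception {
            String xml = "<patient><symptom>high fever</symptom><symptom>rash</symptom></patient>";
            lift(xml, "msg-001").write(System.out, "TURTLE");
        }
    }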

The Blue Fusion Architecture

Figure 4 The BlueFusion Technical Approach

The BlueFusion approach provides the analyst, the administrator, the developer, and most importantly the end user with the flexibility to expand the robustness of the enterprise by providing the methodology and the tools required to leverage the strengths of HADOOP and the Semantic Web.

As depicted in Figure 4, an administrator has the ability to build a configurable, loosely coupled architecture based on the use case and the available tools that are the “best fit” for the requirements.


Tools of the Trade

The administrator can perform the following tasks through the BlueFusion Admin Tool:

Configure Storage Requirements

1. Determine which data sources will exist for the life of the enterprise and remain as the system of record, and which will be purged at certain intervals:

A. Original XML documents created prior to ingestion into HDFS.

B. The cache, which contains the tuples in non-RDF format; this information cannot be queried with SPARQL, so custom Java coding is required.

C. The RDF Database Data Incubator, which contains the triple stores that may hold information for mining purposes where conditions have not yet been identified by the analyst or expert system but are ready to be analyzed or queried quickly. Based on the configuration, the data incubator may be a replica of the HBASE cache, but in RDF format, where data graphs are recognized and SPARQL can be executed.

2. Based on the business rules and ontologies, configure which triples will be stored in the High Priority Database (Red - Alert Condition Database) for immediate processing and which will be stored in the Medium Priority Database (Yellow - Discovery Condition Database), where partial criteria have been identified for a defined event. For example, in the case of bioterrorism, the early symptoms of some pandemic diseases are the same as those of many common diseases: the victim has a rash and a high fever. The analyst must have a way to keep monitoring that activity until the condition is correlated or eliminated as an opportunity or threat. Once the condition has been defined and processed through the dashboard, the event will be promoted to the Alert Condition Database so it can be identified earlier if it recurs as a future event (a sketch of this promotion follows).
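A hypothetical sketch of that promotion logic: a condition whose observed criteria only partially match the ontology-defined event stays in the Yellow (Discovery) database for continued monitoring, while a complete match is promoted to the Red (Alert) database. The criterion names and the classification method are illustrative stand-ins, not Blue Fusion configuration.

    import java.util.Set;

    /** Hypothetical sketch of promoting a Discovery (Yellow) condition to an Alert (Red)
     *  condition once all criteria of the ontology-defined event are satisfied. */
    public class ConditionPromoter {

        /** Criteria for a hypothetical smallpox alert, as they might be named in an ontology. */
        static final Set<String> SMALLPOX_ALERT = Set.of("high-fever", "rash", "variola-positive");

        enum Priority { RED_ALERT, YELLOW_DISCOVERY }

        /** Partial matches stay in the Yellow (Discovery) store for continued monitoring;
         *  a complete match is promoted to the Red (Alert) store for immediate processing. */
        static Priority classify(Set<String> observedCriteria) {
            return observedCriteria.containsAll(SMALLPOX_ALERT)
                    ? Priority.RED_ALERT
                    : Priority.YELLOW_DISCOVERY;
        }

        public static void main(String[] args) {
            System.out.println(classify(Set.of("high-fever", "rash")));                     // YELLOW_DISCOVERY
            System.out.println(classify(Set.of("high-fever", "rash", "variola-positive"))); // RED_ALERT
        }
    }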

Configure Access Control Levels, granting Create, Read, Update, and Delete (CRUD) rights to the following:

1. User Dashboard Features to determine what is presented to the end user based on user level and use case.

2. Visual & Command line SPARQL applets to ensure the analyst has access to query and perform analysis on the use case.

3. HADOOP Cache, to determine how long the data is stored and to set the ACL for the cache and related logs.


4. Semantic Materialized Views to ensure appropriate user level information is reported based on use case.

5. HADOOP Performance tuning features to manage the workload balance between the various data stores and distribution management.

6. Semantic Situational Awareness Data Profile performance variables, or a template based on the use case, to determine temporal and subject settings as they relate to data velocity, volume, and variety, along with workload balance tuning to spawn more expert agents to mine data.

The BlueFusion approach is to establish three separate RDF Database instances using HBASE as a cache.

This White Paper is for informational purposes only. BLUE CANOPY MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS WHITE PAPER. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Blue Canopy disclaims proprietary interest in the marks and names of others.

©Copyright 2011 Blue Canopy Group, LLC. All rights reserved. Reproduction in any manner whatsoever without the express written permission of Blue Canopy Group, LLC is strictly forbidden. Information in this document is subject to change without notice.
