Store and query billions of facts and relationships; infer new facts
Facts and relationships provide context for better search
Flexible data modeling—integrate and link data from different sources
Standards-based for ease of use and integration – RDF, SPARQL, and standard REST interfaces
Even better with Built-in Search and Bitemporal – Triples, documents, and data combined
Presenter
Presentation Notes
Short Description: Store RDF triples and query them using SPARQL—providing meaning and context to your data using the only database that can handle a combination of documents, data, and triples.

Long Description: Semantics provides a universal framework to describe and link different data so that it can be better understood and searched holistically, allowing both people and computers to see and discover relationships in the data. MarkLogic provides the capability to store and query linked data, including a native RDF triple store for storing and managing hundreds of billions of triples that can be queried with SPARQL—all right inside MarkLogic. Not only that, but MarkLogic combines the triple store with its document store, providing the capability to store and manage documents, data, and triples together so you can discover, understand, and make decisions in context. MarkLogic 8 extends the use of standard SPARQL so you can do analytics (aggregates) over triples, explore semantic graphs using property paths, and update semantic triples, all using the standard SPARQL 1.1 language over standard protocols. In addition, MarkLogic 8 lets you discover new facts and relationships with automatic inference.

Script for Presenting:

Enterprise triple store, document store, database ... combined. MarkLogic Semantics adds the capabilities of an enterprise triple store to its document store and database.

Store and query billions of facts and relationships; infer new facts. The triple store lets you store and query billions of facts (assertions) and relationships. Facts and relationships are represented as triples, made up of subject, predicate, and object. For example, we can represent the facts "John lives in London" and "London is in England" as triples like this:

Subject    Predicate    Object
John       livesIn      London
London     isIn         England

We can also infer new facts. From what we (as humans) know about "livesIn" and "isIn", we can infer that John lives in England.
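Those two facts could be written as triples in Turtle syntax like this (a minimal sketch; the `ex:` namespace IRI is invented for illustration):

```turtle
@prefix ex: <http://example.org/> .

# the two stated facts, each a subject-predicate-object triple
ex:John   ex:livesIn ex:London .
ex:London ex:isIn    ex:England .

# given a rule describing what livesIn/isIn mean, an inference
# engine can derive the unstated triple:
#   ex:John ex:livesIn ex:England .
```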
The triple store can do that too – you can specify rules that say exactly what a predicate means, and the triple store will infer new facts when querying. Many of these rules are specified in the RDFS and OWL specifications, and can be applied in MarkLogic queries out of the box.

Facts and relationships provide context for better search (see also slide 2): Imagine how much better a search application can be if the app has access to billions of facts and relationships. The app can leverage those facts in several ways (see future slide): find more relevant information by expanding the terms the user typed in; present more and better information about whatever the user is searching for; publish information dynamically to web or print or mobile.

Flexible data modeling – integrate and link data from different sources (see also slides 3 and 4): Triples are atomic and schemaless, so they are easy to share and easy to combine. When you model data as triples, it's easy to load the data as-is and query across all your data. You can also link data from different sources by creating new triples. For example, if you have information about the same customer from two sources, and one source calls the customer "cust123" while the other calls the same customer "cus_id_456", simply add a triple "cust123 sameAs cus_id_456" and you can query across all the information about that customer in a single simple query.

As well as creating and extracting your own triples, there are billions of triples available on the Open Linked Data web.
For example, you can download sections of DBpedia (the triples version of Wikipedia):
– Einstein was born in Germany
– Buzz Aldrin was on the crew of Apollo 11
– A labrador is a type of dog

Or you can download facts from GeoNames:
– London is in England
– London has a population of 7,504,800
– London is at lat/long position 51.5/-0.16667

Or you can go to data.gov to get facts about food from the Dept of Agriculture (http://data-gov.tw.rpi.edu/wiki/Dataset_1294):
– Pineapple juice has 140 calories per serving

See http://www.w3.org/wiki/DataSetRDFDumps for a partial listing of RDF data available for download and ingestion into MarkLogic. See http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog_-_Complete for a listing of Open Government RDF datasets.

Standards-based for ease of use and integration: MarkLogic Semantics is based on W3C standards. RDF describes the data model for facts and relationships (http://www.w3.org/RDF/). MarkLogic can load RDF files in all the popular RDF formats – RDF/XML, Turtle, RDF/JSON, N3, N-Triples, N-Quads, and TriG (http://docs.marklogic.com/guide/semantics/loading#id_70682). SPARQL is the W3C standard language for querying RDF. MarkLogic supports SPARQL 1.1, which includes paths, aggregates, and inserts/deletes (http://www.w3.org/TR/sparql11-query/ and http://www.w3.org/TR/sparql11-update/). MarkLogic also supports standard interfaces: http://www.w3.org/TR/sparql11-protocol/ defines a SPARQL endpoint, which is a standard REST endpoint for SPARQL queries, and http://www.w3.org/TR/sparql11-http-rdf-update/ defines the Graph Store HTTP Protocol, which is a standard REST endpoint for managing RDF graphs.

Even better with search, bitemporal: The real power of MarkLogic comes not from a single feature, but from the ability to combine features in a single, powerful query. Semantics isn't a product, it's a feature of a product. MarkLogic Semantics works particularly well with search (including geospatial search) and bitemporal.
Search (see also slides 6 and 7): In MarkLogic, you can embed triples in XML or JSON documents and run combination queries. You can combine SPARQL and cts:query in two ways: run a SPARQL query that is filtered by a cts:query condition, or embed a cts:triple-range-query (which returns a cts:query) in a cts:search.

For example, you might want to ask "show me all the people who met with John". If you have triples of the form "john metWith X", that's a simple SPARQL query. But if those triples are embedded in the documents where that fact was asserted or discovered – say, a police report or e-mail exchange – you can ask much richer questions such as "show me all the people who met with John, where the fact was discovered in the last 6 months and the source is a police report from a county in the eastern US and that report also mentions some kind of weapon and some kind of controlled substance".

Or you might want to ask "how many emails and tweets in my sample are generally positive?" If you have triples of the form "message1002 hasSentiment +9", that's a simple SPARQL query. But if those triples are embedded in the messages, you can ask much richer questions such as "show me snippets of all the messages that were overwhelmingly positive, and were sent by someone who is an executive of a Fortune 500 company, between these dates, and which mention the companies 'IBM' and 'Oracle', and mention a word that has something to do with takeovers or acquisitions".

Bitemporal: Bitemporal data management handles historical data along two different timelines, making it possible to rewind the information "as it actually was" in combination with "as it was recorded" at some point in time. It facilitates the creation of a complete audit trail of data. Since you can compose SPARQL and cts:query, you can do a bitemporal SPARQL query: simply run the SPARQL query with a cts:query constraint over one or both bitemporal axes.
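A minimal sketch of the first combination style (a SPARQL query restricted by a cts:query), assuming MarkLogic 8's sem:sparql/sem:store functions; the collection name and the metWith IRI are invented for illustration:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Only triples embedded in documents that match the cts:query are
   visible to the SPARQL query. Collection name and IRIs invented. :)
sem:sparql(
  'PREFIX ex: <http://example.org/>
   SELECT ?who WHERE { ex:john ex:metWith ?who }',
  (), (),
  sem:store((),
    cts:and-query((
      cts:collection-query("police-reports"),
      cts:word-query("weapon"))))
)
```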
Better Answers From Today’s Data Find more relevant information using facts as context
– Example: Search for "cardiac catheter"; show documents about "devices that stimulate nerves" and "implantable devices"
Present more, better information for more productive users
– Example: Search for "Ireland"; show facts about Ireland with search results
Publish information dynamically to web or print or mobile
– Example: BBC Sports page about a team, event, sport, person
– Example: Wiley Custom Select for customized learning materials
Presenter
Presentation Notes
Imagine how much better a search application can be if the app has access to billions of facts and relationships. The app can leverage those facts in several ways (see future slide):

Find more relevant information by expanding the terms the user typed in. For example, at BSI (British Standards Institute) users want to find all standards that apply to a product they are building. With MarkLogic, they can type in "cardiac catheter" and find standards that apply to "devices that stimulate nerves" and "implantable devices" (since a cardiac catheter stimulates the nerves of the heart, and it's implantable), even though those standards don't specifically mention the phrase "cardiac catheter". It's like having a domain expert looking over your shoulder and guiding your search! Under the covers we're expanding the search terms the user is typing in, mostly using ontologies and data sets that are freely available. This capability transforms a standards search from a very long, expert-driven process into a short, bullet-proof one. It's all about getting the search results the user actually wants to see, quickly and easily.

Present more/better information about whatever the user is searching for. When a user types in "Ireland", his intent is not to "find links to all the documents that contain the word 'Ireland'". Rather, he wants to understand, discover, and make decisions about "Ireland". Since the app has access to billions of facts, it shows selected facts about "Ireland" in an infopanel alongside the search results. This is something Google is doing more and more with its "Google Knowledge Graph". Now you can do that too, with your own application. You can also decorate the search snippets with facts about entities in the snippet. For example, if the search snippet mentions an author, pop up an infobox about that author.
It's all about getting as much relevant information in front of the user as possible, not just links to documents.

Publish information dynamically to web or print or mobile. Another way to "re-think the search page" is to dynamically create a mash-up of relevant content for each web page. The BBC pioneered Dynamic Semantic Publishing for users to see and navigate relevant content on the BBC Sports pages. Instead of creating a static page for each team, event, sport, or person that a user might be interested in, the URL becomes the search – so a visit to http://www.bbc.com/sport/football/teams/west-ham-united pulls together match reports, pictures, videos, news items, and league tables related to West Ham United. The app knows which items are related to West Ham by querying a sports ontology – so it knows, for example, that West Ham is a soccer team in England, based in London; they play in the Premier League; and Diafra Sakho is a current player; so the page includes a news story about Sakho, the Premier League table, and links to other soccer-related stories. This gives users an information-rich experience; but it's easy to maintain, up-to-the-minute, and error-free. The BBC had spectacular successes with this site reporting on the 2010 World Cup and 2012 London Olympics. For more on Dynamic Semantic Publishing at the BBC, see:
http://www.bbc.co.uk/blogs/legacy/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
http://www.bbc.co.uk/blogs/legacy/bbcinternet/2012/04/sports_dynamic_semantic.html
See also the BBC Sports Ontology at http://www.bbc.co.uk/ontologies/sport

Wiley Custom Select, Houghton Mifflin Harcourt, and others use MarkLogic to customize learning materials. Semantics could improve ease of use and accuracy in the custom publishing space. See:
http://www.marklogic.com/resources/custom-publishing-in-education-harnessing-technology-to-maximize-results/
http://www.marklogic.com/press-releases/announcing-wiley-custom-select-next-generation-custom-publishing-application-powered-by-marklogic/
http://customselect.wiley.com/

Further discussion points: Combination queries. In MarkLogic, you can embed triples in XML or JSON documents and run combination queries. You can combine SPARQL and cts:query in two ways: run a SPARQL query that is filtered by a cts:query condition, or embed a cts:triple-range-query (which returns a cts:query) in a cts:search. For example, you might want to ask "show me all the people who met with John". If you have triples of the form "john metWith X", that's a simple SPARQL query. But if those triples are embedded in the documents where that fact was asserted or discovered – say, a police report or e-mail exchange – you can ask much richer questions such as "show me all the people who met with John, where the fact was discovered in the last 6 months and the source is a police report from a county in the eastern US and that report also mentions some kind of weapon and some kind of controlled substance". Or you might want to ask "how many emails and tweets in my sample are generally positive?" If you have triples of the form "message1002 hasSentiment +9", that's a simple SPARQL query. But if those triples are embedded in the messages, you can ask much richer questions such as "show me snippets of all the messages that were overwhelmingly positive, and were sent by someone who is an executive of a Fortune 500 company, between these dates, and which mention the companies 'IBM' and 'Oracle', and mention a word that has something to do with takeovers or acquisitions".
This slide shows dynamically produced content boxes, all using RDF data published on other sites like DBpedia. The example on the left is Google, using RDFa, and the example on the right is a demo application built by MarkLogic that pulls RDF data about companies into the app from DBpedia.
Intelligent Data Layer Discover connections between entities
– Example: Show me papers cited by Steven Pinker, and papers cited by…
Walk hierarchies
– Example: Show me who directly owns Acme, and who ultimately owns them
Infer new facts for simple data modeling, powerful queries
– Example: If Acme owns Amertek, then Amertek is owned by Acme
– Example: If prod001 is a blue henley, it's also a blue shirt
Facts may be embedded in documents to keep context
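The ownership example above can be sketched in Turtle (the IRIs are invented for illustration; owl:inverseOf is the standard OWL predicate for this kind of rule):

```turtle
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Acme    ex:owns       ex:Amertek .   # stated fact
ex:ownedBy owl:inverseOf ex:owns .      # rule: if A owns B, then B is ownedBy A

# With an OWL rule set enabled, a query for "ex:Amertek ex:ownedBy ?x"
# infers ex:Acme even though that triple was never stored.
```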
Presenter
Presentation Notes
Discover connections between entities: A lot of people think of RDF as a graph data model – since the object of one triple can be the subject of another, it's easy to represent graphs of facts and relationships. So for example if you have triples that represent authors and papers, and triples that represent which papers cite which other papers, it's easy to ask questions like "which papers are cited by the papers that Steven Pinker wrote?" Then it's easy to extend that to "... and which papers did they cite? And which papers did they cite?" And so on. In this way, you can build up a view of who influenced whom; or who knows whom; or which business entity owns which business entity.

And you can discover connections that might not otherwise be found – Joe works for Acme; Acme is putting together a deal for Amertek to acquire Alchemax; Mary bought a large number of shares in Alchemax; Mary is married to Joe. Possible insider trading? You can find connections to some specified depth – find me everyone associated with Kevin Bacon, but only look 3 steps. Or you can take an unspecified number of steps – show me everyone who has had contact with this Ebola patient, and everyone who has had contact with them, and so on for as far as you can go. This is very difficult to do with the relational data model.

Walk hierarchies: A hierarchy is a special case of a graph. If you have triples that tell you which business entities own which business entities, it's easy to walk the hierarchy and find all business entities in the ownership chain. And you can say "find me the ultimate owner" even if you don't know how many steps away that is. Note this applies to terms (such as search terms) as well as to entities. So a thesaurus represented as triples is easy to manage, extremely flexible, and easy to combine with existing thesauri and ontologies; and it's easy to find terms at the top or bottom of the hierarchy.
So for example if you have triples that tell you a dog is a canine is a mammal, it's easy to ask questions like "what is the root classification for a dog?" (a mammal) and "what are the entities that are classified as mammals?" (dogs, cats, and so on) without knowing how many steps there are, and without enumerating every intermediate relationship. See http://www.w3.org/2004/02/skos/

Infer new facts for simple data modeling, powerful queries: With inference your data is intelligent – you can ask questions and get answers that depend on facts that aren't explicitly in the database. So for example, if you have a fact "Acme owns Amertek", you can answer the question "who is Amertek owned by?" simply by defining a rule "if A owns B then B is owned by A". Without inferencing you'd have to add another fact "Amertek is owned by Acme" if you wanted an answer to that question. Similarly, if you have a fact "prod001 is a blue henley" and a fact that "a henley is a kind of shirt", then with inferencing you can say "show me all blue shirts" and get the answer "prod001".

Facts may be embedded in documents to keep context: Facts/relationships may be "free-floating" (London is the capital of England) or they may be embedded in documents (John met with Mary ... and we know that from an FBI report in Dublin on July 11, 2011 that also mentioned 3 kinds of controlled substances and a known cartel leader).
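A hierarchy walk of that kind can be sketched as a SPARQL 1.1 property path (the ex: IRIs are invented for illustration; rdfs:subClassOf is the standard hierarchy predicate):

```sparql
PREFIX ex:   <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# every classification above "dog", however many steps away,
# without enumerating the intermediate relationships
SELECT ?class
WHERE { ex:Dog rdfs:subClassOf+ ?class }
```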
This slide shows some of the concepts that were mentioned on the last few slides, showing how a website built with linked data provides a more dynamic experience. A graph-based experience doesn't impose a linear direction the way a hierarchical architecture does. Users bring the x-axis: they navigate the graph in an order of their choice, establishing their own personalised temporal/sequential element to the experience. There is still some order – nodes are related to each other through modeled relationships – but these relationships can run across multiple axes, not just the most widely understood categorisation. Through linked data, the categories become less important – it's the content that counts. Reference: http://www.bbc.co.uk/academy/technology/software-engineering/semantic-web/article/art20130720153136618
Simpler Data Integration Flexible data modeling through triples
– Triples are atomic and schema-less
– Triples are easy to share, easy to combine, and readily available
Integrate data by adding links
– Load triples as-is, then add triples to link entities or documents
– No need to change the underlying data
– Example: cust123 (source1) is the sameAs cus_id_456 (source2)
– Example: cust123 hasOrderDoc /orders/ab42.json
Presenter
Presentation Notes
Flexible data modeling through triples: Triples are atomic – they are the ultimate in schemaless data modeling. That means you can take a million triples from DBpedia, a million from the CIA World Factbook, and a million that you derived from your own data, and put them in the same triple store. No need to map them all to some table schema, or even a document schema. Just throw them all in one bucket and query across them. (For relational thinkers: think of a triple as a single cell in a table, where the subject represents the table plus the primary key associated with some row; the predicate is the column; and the object is the cell value.) RDF's atomic data model makes it easy to combine triples from different sources. And since it's so easy to combine triples, it's easy to share them. And so triples are readily available.

Integrate data by adding links: Once you've loaded triples from different sources, how do you query across them? Let's suppose you're writing a Customer 360 app for a telecom company. You have facts about some customer from different parts of the company – from accounts, from support, from your website, and so on. You may also have facts about the same customer that came from different companies – perhaps you just acquired a cable TV company. All those different sources talk about the customer in a different way – for example, accounts refers to them as cust123 while support refers to them as cus_id_456. Simply add one new fact – "cust123 is the sameAs cus_id_456" (sameAs is a standard predicate in the OWL vocabulary – see http://www.w3.org/TR/owl-ref/#sameAs-def). Now you can write queries that treat cust123 and cus_id_456 as the same thing, so that any question about cust123 will return answers about cus_id_456 and vice versa. [Note: to make this happen automatically, use inference with a sameAs rule; otherwise, expand your SPARQL queries to include sameAs resources.] So you can load triples as-is, then add linking triples to integrate them.
With MarkLogic, you can also link resources to documents. If you already have documents that relate to cust123, simply add a triple that links the customer to that document. For example: cust123 hasOrderDoc /orders/ab42.json. [Note that if you embed a triple into a document, there's an implicit link – the triple index includes the document ID, so it knows where that triple came from.] You can also do data reconciliation with triples. Just as you can equate cust123 and cus_id_456, you can equate different spellings such as "John Smith" and "Jon Smith". This is simpler and faster than mapping all your sources to a single schema. You can load triples as-is and get value from querying them right away, even as you figure out which linking triples to add.
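The two linking triples just described, sketched in Turtle (the ex: IRIs are invented for illustration; owl:sameAs is the standard OWL predicate):

```turtle
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:cust123 owl:sameAs     ex:cus_id_456 .        # same customer, two sources
ex:cust123 ex:hasOrderDoc "/orders/ab42.json" .  # link the entity to a document
```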
How does "Euro zone" relate to "European Union", "Europe OECD", or "Europe"? How does a term such as "Small States" relate to "Least Developed Countries", "Lower Middle Income", or "Low & Middle Income"?
Presenter
Presentation Notes
Applied Relevance created an application on MarkLogic called Epinomy, a time series search engine that combines full-text search with business analytics for time series data. Time series data is the accumulation of measurements taken at successive points in time spaced at uniform time intervals, and is the most common form of structured data. The challenge Epinomy has addressed is figuring out how to combine time series data with other unstructured and constantly changing data such as global economic indicator data. For example, the World Bank publishes data for poverty, inflation, and GDP in the SKOS SDMX Data Cube format, a triples format for tracking economic indicators and doing statistical analysis. But there is lots of other economic data that is not already formatted for easy analysis. With relational databases, this challenge is difficult or even impossible to solve, but with MarkLogic Semantics, new data can be incorporated in days, not months.

Consider the difficulty in trying to search across various data sources for a common term such as "Euro zone." It means something different from "European Union", "Europe OECD", or "Europe." Or what about a term such as "Small States," which is different from "Least Developed Countries," "Lower Middle Income," or "Low & Middle Income." Semantics provides the ability to map all of these terms so that a user can perform natural language searches.

Semantics also allows the application to quickly create facets without pre-defining what they should be. Facets, or the categories of results typically grouped down a left-hand column on a webpage, are created in Epinomy entirely using triples. It happens dynamically on the fly, is dependent on the content loaded, and is presented fast to the user. Another challenge is when the same economic data is released multiple times. These multiple "vintages" of the same data would typically be a headache to deal with.
Semantics handles the various vintages of data by simply creating new sets of triples tagged as “vintage.” And, the natural language search was also designed so that a search can specifically return those vintage values.
"Semantic Web Technology" versus "Semantic Technology"? MarkLogic Semantics falls in the category of "Semantic Web Technologies", which is slightly different from what experts refer to as "Semantic Technologies." It gets confusing, however, because MarkLogic does work with partners that provide "Semantic Technologies." For example, MarkLogic works with Smartlogic to form a complete semantics stack, with MarkLogic storing and managing the triples data, and Smartlogic providing entity enrichment and ontology management. That said, we like to be accurate with the "semantics" of semantics.

Semantic Web Technologies refer to a family of specific W3C standards that allow an exchange of related data – whether it resides on the Web or within organizations. It requires a flexible data model (RDF), a query tool (SPARQL), and a common markup language (e.g. RDFa, Turtle, N-Triples). RDF allows you to deconstruct knowledge into pieces called triples, which are linked together in a graph-like representation that is without hierarchy. MarkLogic allows you to natively store and manage RDF triples and query them using SPARQL.

Semantic Technologies are a variety of linguistic tools and techniques, such as Natural Language Processing (NLP) and artificial intelligence, used to analyze unstructured text in order to classify and relate it. By identifying the parts of speech (a subject from a predicate, etc.), powerful algorithms can pinpoint entities (people, places, things, time, etc.), concepts, and categories. Once analyzed, text can be further enriched with vocabularies, dictionaries, taxonomies, and ontologies (so regardless of which representation is used, assets can be found, e.g. Coca-Cola, Coke, KO).
How have we extended our Enterprise NoSQL approach with Semantics?

At the STORAGE LAYER, MarkLogic scales horizontally and has all the capabilities you need from an enterprise database, including replication, failover, backup and recovery, ACID transactions, and government-grade security. We added the triple store at the storage layer, so now you have a triple store that scales horizontally and has all of those same enterprise capabilities.

At the next layer are the INDEXES. Some of those indexes are for full-text search; there are also range indexes for range queries over scalar values, geospatial indexes, and reverse query indexes for alerting. At that layer we added a TRIPLES INDEX and a TRIPLES CACHE to store triples in memory for efficient retrieval. All MarkLogic indexes are designed to work together, so you can do queries with any combination of indexes. The triples cache means you don't need to have all of your triple index in memory, unlike some triple stores. So the size of your triple store is not constrained by physical memory limits.

At the QUERY LAYER, MarkLogic has native SPARQL support. You can query using SPARQL only, or you can run SPARQL as part of a JavaScript or XQuery server-side program.

At the INTERFACE LAYER you can run SPARQL over REST. You can use MarkLogic's command-line tool for fast bulk-loading, MarkLogic Content Pump (mlcp), to load lots of triples very fast, using parallelization and fast-load techniques. You can also use standard REST endpoints to query (SPARQL endpoint) and manage graphs (Graph Store HTTP Protocol). (http://www.w3.org/TR/sparql11-protocol/ defines a SPARQL endpoint, which is a standard REST endpoint for SPARQL queries. http://www.w3.org/TR/sparql11-http-rdf-update/ defines the Graph Store HTTP Protocol, which is a standard REST endpoint for managing RDF graphs.)
With MarkLogic you can query across documents, facts, and metadata, and present results "in context", over REST or from a server-side program written in JavaScript or XQuery. All with the Enterprise robustness you need to run mission-critical applications.
– “Some maniac in a blue van just tried to run me down"
– "I got the first three letters of his license plate: ABC"
You need to look for similar incident reports
– Reports that mention a "blue van"
… around the same time
… around the same place
… with a license plate that starts with "ABC"
Combination Query Example
Presenter
Presentation Notes
What do we mean by a "combination query"? Suppose you work in a call center. Someone calls and says "some maniac in a blue van just tried to run me down – I got the first three letters of his license plate: ABC". You could look up ABC* in the vehicle licensing database. But that would give you lots of results, and probably wouldn't help much. You have a lot of information here – let's see if we can use all that context to find the driver of the blue van. If there really is a maniac in a blue van, there's probably another incident report that will give you more information. That incident report would mention a "blue van"; it would be around the same time and place; and if it has the license plate, it'll start with "ABC". But that's a really hard query!
Combination Query Example

<report>
  <title>Suspicious vehicle near airport</title>
  <date>2012-11-12Z</date>
  <type>observation/surveillance</type>
  <threat>
    <type>suspicious activity</type>
    <category>suspicious vehicle</category>
  </threat>
  <location>
    <lat>37.497075</lat>
    <long>-122.363319</long>
  </location>
  <triple>
    <subject>IRIID</subject>
    <predicate>isa</predicate>
    <object>license-plate</object>
  </triple>
  <triple>
    <subject>IRIID</subject>
    <predicate>value</predicate>
    <object>ABC 123</object>
  </triple>
  <description>A blue van with license plate ABC 123 was observed parked behind the airport sign…</description>
</report>
Presenter
Presentation Notes
Luckily your incident reports are all in MarkLogic! With MarkLogic you can query across all these kinds of information – a date range, a geospatial query, a full-text search, and a triples query, all in a single simple efficient query. With MarkLogic you can use all the context you have to search across all your information to get just exactly what you need to complete your task. This is the power of querying across documents, data, geospatial, and triples in a single, simple query.
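The blue-van search could be sketched as a single cts:search; this is an illustrative sketch only, not a tested program – the element names follow the sample report, while the collection name, query values, predicate IRI, and the assumed range and geospatial indexes are invented:

```xquery
xquery version "1.0-ml";

(: Illustrative sketch: assumes a date range index on <date> and a
   geospatial element-pair index on location/lat/long; the predicate
   IRI and collection name are invented. :)
cts:search(fn:collection("incident-reports"),
  cts:and-query((
    (: full-text :)
    cts:word-query("blue van"),
    (: around the same time :)
    cts:element-range-query(xs:QName("date"), ">=", xs:date("2012-11-10Z")),
    (: around the same place – within 10 miles :)
    cts:element-pair-geospatial-query(xs:QName("location"),
      xs:QName("lat"), xs:QName("long"),
      cts:circle(10, cts:point(37.497075, -122.363319))),
    (: license plate starting with "ABC" – prefix via a range trick :)
    cts:and-query((
      cts:triple-range-query((), sem:iri("value"), "ABC", ">="),
      cts:triple-range-query((), sem:iri("value"), "ABD", "<")
    ))
  )))
```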
Store and manage hundreds of billions of RDF triples
Query across documents, data, and triples
Triple index for sub-second search results
Triple cache for high performance across large clusters
Bulk-load triples via MarkLogic Content Pump
Provenance and reification by adding metadata
SPARQL 1.0+ over REST or XQuery
SPARQL calls from server-side programs with query restrictions
Standard SPARQL endpoint and graph store protocol support
XQuery helper modules for serializations and transitive closures
Updates, aggregates via MarkLogic APIs
Semantic enrichment with partners (e.g. Smartlogic, Temis, NetOwl)
Enterprise Features: ACID transactions, scalability and elasticity, HA/DR, government-grade security, monitoring and performance tools
MarkLogic 7
Everything in MarkLogic 7, plus:
– SPARQL 1.1, including updates and aggregates
– Graph traversal with property paths and transitive closures
– Automatic inference using rule sets
  – Supplied rule sets for RDFS, RDFS+, and OWL Horst
  – Support for user-defined rule sets
Property paths – example

## find papers that cite paperA, and papers that cite papers that cite paperA, and so on
SELECT ?s
WHERE {
?s c:cites*/dc:title "Paper A" .
}
ORDER BY ?s
Note: taken from Bob DuCharme's "Learning SPARQL" For more examples, see http://ea.marklogic.com/wp-content/uploads/2014/08/SPARQL-paths-examplesEA2.pdf
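The semantics of the `c:cites*` path (zero or more `c:cites` steps) can be sketched in plain Python as a reflexive-transitive closure over a set of triples. The triple data and the `path_star` helper are invented for illustration; this is not a MarkLogic or SPARQL API.

```python
# Triples as (subject, predicate, object) tuples
triples = {
    ("paperA", "dc:title", "Paper A"),
    ("paperB", "c:cites", "paperA"),
    ("paperC", "c:cites", "paperB"),
    ("paperD", "c:cites", "paperC"),
    ("paperE", "dc:title", "Unrelated"),
}

def path_star(triples, pred):
    """All (x, y) pairs connected by ZERO or more pred steps (like pred*)."""
    # Zero steps: every node is connected to itself
    pairs = {(n, n) for t in triples for n in (t[0], t[2])}
    edges = {(s, o) for (s, p, o) in triples if p == pred}
    changed = True
    while changed:                      # grow the closure to a fixpoint
        changed = False
        for (a, b) in list(pairs):
            for (c, d) in edges:
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs

# Equivalent of:  ?s c:cites*/dc:title "Paper A"
titled = {s for (s, p, o) in triples if p == "dc:title" and o == "Paper A"}
result = sorted({s for (s, x) in path_star(triples, "c:cites") if x in titled})
print(result)  # ['paperA', 'paperB', 'paperC', 'paperD']
```

Note that `paperA` itself matches, because `*` allows zero steps, exactly as in the SPARQL query above.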
SPARQL UPDATE operations

1. GRAPH MANAGEMENT – manipulate RDF graphs using the SPARQL 1.1 Update language
CREATE – create a graph
DROP – drop a graph and its contents
COPY – make the destination graph into a copy of the source graph; any content in the destination graph before this operation will be removed (think copy/paste)
MOVE – move the contents of the source graph into the destination graph, and remove them from the source graph; any content in the destination graph before this operation will be removed (think cut/paste)
ADD – add the contents of the source graph into the destination graph; keep the source graph intact; keep the initial contents of the destination graph intact
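The effect of COPY, MOVE, and ADD can be sketched with a plain Python dict mapping graph names to sets of triples. This is an illustration of the SPARQL 1.1 Update semantics only, not MarkLogic's implementation; the graph names and data are invented.

```python
# Graphs as: name -> set of (subject, predicate, object) triples
graphs = {
    "g1": {("John", "livesIn", "London")},
    "g2": {("London", "isIn", "England")},
}

def copy(graphs, src, dst):
    # COPY: dst becomes a copy of src; dst's prior content is removed (copy/paste)
    graphs[dst] = set(graphs.get(src, set()))

def move(graphs, src, dst):
    # MOVE: like COPY, then the source graph is removed (cut/paste)
    copy(graphs, src, dst)
    graphs.pop(src, None)

def add(graphs, src, dst):
    # ADD: merge src into dst; both src and dst's prior content are kept
    graphs.setdefault(dst, set()).update(graphs.get(src, set()))

add(graphs, "g1", "g2")
print(sorted(graphs["g2"]))
# [('John', 'livesIn', 'London'), ('London', 'isIn', 'England')]
```

The key distinctions mirror the bullets above: COPY and MOVE clobber the destination, ADD merges into it, and only MOVE removes the source.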
Note: arg5 is called $default-permissions, but you should set permissions explicitly
– See also sem:graph-set-permissions()
Presenter
Presentation Notes
Arg5 to sem:sparql-update() is called $default-permissions; if a SPARQL Update statement adds triples to a graph that doesn't yet exist, MarkLogic uses the default permissions for the new graph. BUT best practice is to set permissions explicitly – i.e. on purpose – either by doing a CREATE GRAPH (as in this slide) or by using sem:graph-set-permissions().
Note: arg5 is called $default-permissions, but you should set permissions explicitly
– See also sem.graphSetPermissions()
Presenter
Presentation Notes
Arg5 to sem:sparql-update() is called $default-permissions; if a SPARQL Update statement adds triples to a graph that doesn't yet exist, MarkLogic uses the default permissions for the new graph. BUT best practice is to set permissions explicitly – i.e. on purpose – either by doing a CREATE GRAPH (as in this slide) or by using sem.graphSetPermissions().
Locking options for sem:sparql():
– read-write: read-lock documents containing the triples being accessed
– write: no locks (because sem:sparql() doesn't write)
– Default is locking=read-write. Locking is ignored in a query transaction.
Presenter
Presentation Notes
In ML7, triple range queries lock neither triples nor the documents containing them. In ML8, we provide a locking option in sem:sparql-update and sem:sparql to specify the type of lock to be enforced.

sem:sparql-update always runs in an update transaction. If locking is set to read-write, the server read-locks documents containing the triples being accessed and write-locks documents being updated; this guarantees ACID properties when updating triples. If locking is set to write, the server only write-locks documents being updated. We allow locking=write for users who prefer performance to consistency.

sem:sparql usually runs in a query transaction, where the locking option is ignored. If sem:sparql runs in an update transaction (for example, sem:sparql and xdmp:node-replace in the same transaction): if locking is set to read-write, the server read-locks documents containing the triples being accessed and write-locks documents being updated, guaranteeing ACID properties; if locking is set to write, the server only write-locks documents being updated. We allow locking=write in an update transaction – e.g. the user may run sem:sparql and do a document-insert into docs that have nothing to do with triples.
Set a default ruleset for the database

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: add "subClassOf.rules" as a default ruleset for database "Documents" :)
let $config := admin:get-configuration()
let $dbid := admin:database-get-id($config, "Documents")
let $rules := admin:database-ruleset("subClassOf.rules")
let $c := admin:database-add-default-ruleset($config, $dbid, $rules)
return admin:save-configuration($c)

(: See also: admin:database-get-default-rulesets(), admin:database-delete-default-ruleset() :)
Specify a ruleset as part of your query

(: create a store that uses the RDFS ruleset for inferencing :)
let $rdfs-store := sem:ruleset-store("rdfs.rules", sem:store("no-default-rulesets"))
return
  (: use the store you just created – pass it into sem:sparql() :)
  sem:sparql('
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex: <http://example.com/>
    SELECT ?product
    FROM <http://marklogic.com/semantics/sb/products/inf-1>
    WHERE { ?product rdf:type ex:Shirt ; ex:color "blue" }
  ', (), (), $rdfs-store)
For full examples, see http://ea.marklogic.com/wp-content/uploads/2014/12/SPARQL-inference-examples-EA3-2.pdf
Presenter
Presentation Notes
Create a sem:store using sem:store() and sem:ruleset-store() Feed that store into sem:sparql() Note: "no-default-rulesets" option to sem:store determines whether this ruleset replaces the default, or is added to it
Create your own ruleset

(: create a rules file and insert it into the Schemas database :)
(: Note: run this from Query Console with "Content Source" set to "Schemas" :)
xdmp:document-insert('/rules/livesin.rules', text{'
  # my rules for inference
  prefix ex: <http://example.com/>
  prefix gn: <http://www.geonames.org/ontology#>
Use your own ruleset

(: find places that John Smith lives in – with inferencing, using my ruleset :)
let $my-store := sem:ruleset-store("/rules/livesin.rules", sem:store())
return
  (: use the store you just created – pass it in to sem:sparql() :)
  sem:sparql('
    PREFIX ex: <http://example.com/>
    PREFIX gn: <http://www.geonames.org/ontology#>
    SELECT ?person ?placeName
    FROM <http://marklogic.com/semantics/sb/customers/inf-1>
    WHERE { ?person ex:livesIn ?place . ?place gn:name ?placeName }
    ORDER BY ?person
  ', (), (), $my-store)
For full examples, see http://ea.marklogic.com/wp-content/uploads/2014/12/SPARQL-inference-examples-EA3-2.pdf
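The livesin.rules file above is shown only in part, but the forward-chaining semantics of a rule such as "livesIn(x, y) and isIn(y, z) implies livesIn(x, z)" – the John-lives-in-England example from earlier – can be sketched in plain Python as a fixpoint computation. The data and function names are invented for illustration; this is not MarkLogic's rule syntax or engine.

```python
triples = {
    ("John", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("England", "isIn", "UK"),
}

def infer_lives_in(triples):
    """Fixpoint of the rule: livesIn(x, y) & isIn(y, z) => livesIn(x, z)."""
    facts = set(triples)
    while True:
        # Apply the rule to every matching pair of facts
        new = {(x, "livesIn", z)
               for (x, p1, y) in facts if p1 == "livesIn"
               for (y2, p2, z) in facts if p2 == "isIn" and y2 == y}
        if new <= facts:          # nothing new inferred: fixpoint reached
            return facts
        facts |= new

inferred = infer_lives_in(triples)
print(("John", "livesIn", "England") in inferred)  # True
print(("John", "livesIn", "UK") in inferred)       # True
```

This is the same idea as a ruleset-backed sem:store: queries against the store see both the asserted triples and everything the rules can derive from them.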
Inference rules – Summary

Choose an appropriate ruleset
– the right level of inference – can use more than one ruleset
Set a default ruleset for the database – Admin UI or XQuery/JavaScript API
Specify a ruleset as part of your query – create a sem:store using your ruleset location(s) – include or override default ruleset – ruleset location is resolved from Schemas database, then $MARKLOGIC/Config
Create your own ruleset – text file inserted in Schemas database
Tips on Inference

Use the fewest rules that you actually need
– Query performance slows as you add rules
– Database default + query-time ruleset(s) gives great flexibility
Consider doing inference in your query, possibly with paths
– Gives you the most control, best performance, most predictable results
## find all blue shirts (including henleys) without inference
SELECT ?product
FROM <http://marklogic.com/semantics/sb/products/inf-1>
WHERE { ?product rdf:type/rdfs:subClassOf* ex:Shirt .
        ?product ex:color "blue" }
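The `rdf:type/rdfs:subClassOf*` pattern above can be sketched in plain Python: walk the subClassOf chain from each product's type and keep products whose chain reaches ex:Shirt. The data and helper names are invented for illustration; this shows the query-time alternative to inference, not MarkLogic's evaluator.

```python
triples = {
    ("henley1", "rdf:type", "ex:Henley"),
    ("tee1", "rdf:type", "ex:TShirt"),
    ("ex:Henley", "rdfs:subClassOf", "ex:Shirt"),
    ("ex:TShirt", "rdfs:subClassOf", "ex:Shirt"),
    ("ex:Shirt", "rdfs:subClassOf", "ex:Clothing"),
    ("henley1", "ex:color", "blue"),
    ("tee1", "ex:color", "red"),
}

def superclasses(cls, triples):
    """cls plus everything reachable via rdfs:subClassOf (i.e. subClassOf*)."""
    seen, todo = set(), [cls]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo += [o for (s, p, o) in triples
                     if s == c and p == "rdfs:subClassOf"]
    return seen

# Equivalent of: ?product rdf:type/rdfs:subClassOf* ex:Shirt . ?product ex:color "blue"
blue_shirts = sorted(
    s for (s, p, o) in triples
    if p == "rdf:type" and "ex:Shirt" in superclasses(o, triples)
    and (s, "ex:color", "blue") in triples)
print(blue_shirts)  # ['henley1']
```

The path walk finds the henley even though its type is ex:Henley, not ex:Shirt – the same result inference would give, but computed per query, with no rules to maintain.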