Store and query billions of facts and relationships; infer new facts
Facts and relationships provide context for better search
Flexible data modeling—integrate and link data from different sources
Standards-based for ease of use and integration – RDF, SPARQL, and standard REST interfaces
Even better with Built-in Search and Bitemporal – Triples, documents, and data combined
Presenter
Presentation Notes
Short Description: Store RDF triples and query them using SPARQL—providing meaning and context to your data using the only database that can handle a combination of documents, data, and triples.

Long Description: Semantics provides a universal framework to describe and link different data so that it can be better understood and searched holistically, allowing both people and computers to see and discover relationships in the data. MarkLogic provides the capability to store and query linked data, including a native RDF triple store for storing and managing hundreds of billions of triples that can be queried with SPARQL—all right inside MarkLogic. Not only that, but MarkLogic combines the triple store with its document store, providing the capability to store and manage documents, data, and triples together so you can discover, understand, and make decisions in context. MarkLogic 8 extends the use of standard SPARQL so you can do analytics (aggregates) over triples, explore semantic graphs using property paths, and update semantic triples, all using the standard SPARQL 1.1 language over standard protocols. In addition, MarkLogic 8 lets you discover new facts and relationships with automatic inference.

Script for Presenting:

Enterprise triple store, document store, database ... combined. MarkLogic Semantics adds the capabilities of an enterprise triple store to its document store and database.

Store and query billions of facts and relationships; infer new facts. The triple store lets you store and query billions of facts (assertions) and relationships. Facts and relationships are represented as triples, made up of subject, predicate, and object. For example, we can represent the facts "John lives in London" and "London is in England" as triples like this:

Subject    Predicate    Object
John       livesIn      London
London     isIn         England

We can also infer new facts. From what we (as humans) know about "livesIn" and "isIn", we can infer that John lives in England.
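Those two facts could be written as triples in Turtle syntax like this (a minimal sketch; the `ex:` namespace IRI is invented for illustration):

```turtle
@prefix ex: <http://example.org/> .

# the two stated facts, each a subject-predicate-object triple
ex:John   ex:livesIn ex:London .
ex:London ex:isIn    ex:England .

# given a rule describing what livesIn/isIn mean, an inference
# engine can derive the unstated triple:
#   ex:John ex:livesIn ex:England .
```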
The triple store can do that too – you can specify rules that say exactly what a predicate means, and the triple store will infer new facts when querying. Many of these rules are specified in the RDFS and OWL specifications, and can be applied in MarkLogic queries out of the box.

Facts and relationships provide context for better search (see also slide 2): Imagine how much better a search application can be if the app has access to billions of facts and relationships. The app can leverage those facts in several ways (see future slide): find more relevant information by expanding the terms the user typed in; present more and better information about whatever the user is searching for; publish information dynamically to web or print or mobile.

Flexible data modeling – integrate and link data from different sources (see also slides 3 and 4): Triples are atomic and schemaless, so they are easy to share and easy to combine. When you model data as triples, it's easy to load the data as-is and query across all your data. You can also link data from different sources by creating new triples. For example, if you have information about the same customer from two sources, and one source calls the customer "cust123" while the other calls the same customer "cus_id_456", simply add a triple "cust123 sameAs cus_id_456" and you can query across all the information about that customer in a single simple query.

As well as creating and extracting your own triples, there are billions of triples available on the Open Linked Data web.
For example, you can download sections of DBpedia (the triples version of Wikipedia):
– Einstein was born in Germany
– Buzz Aldrin was on the crew of Apollo 11
– A labrador is a type of dog

Or you can download facts from GeoNames:
– London is in England
– London has a population of 7,504,800
– London is at lat/long position 51.5/-0.16667

Or you can go to data.gov to get facts about food from the Dept of Agriculture (http://data-gov.tw.rpi.edu/wiki/Dataset_1294):
– Pineapple juice has 140 calories per serving

See http://www.w3.org/wiki/DataSetRDFDumps for a partial listing of RDF data available for download and ingestion into MarkLogic. See http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog_-_Complete for a listing of Open Government RDF datasets.

Standards-based for ease of use and integration: MarkLogic Semantics is based on W3C standards. RDF describes the data model for facts and relationships (http://www.w3.org/RDF/). MarkLogic can load RDF files in all the popular RDF formats – RDF/XML, Turtle, RDF/JSON, N3, N-Triples, N-Quads, and TriG (http://docs.marklogic.com/guide/semantics/loading#id_70682). SPARQL is the W3C standard language for querying RDF. MarkLogic supports SPARQL 1.1, which includes paths, aggregates, and inserts/deletes (http://www.w3.org/TR/sparql11-query/ and http://www.w3.org/TR/sparql11-update/). MarkLogic also supports standard interfaces: http://www.w3.org/TR/sparql11-protocol/ defines a SPARQL endpoint, which is a standard REST endpoint for SPARQL queries, and http://www.w3.org/TR/sparql11-http-rdf-update/ defines the Graph Store HTTP Protocol, which is a standard REST endpoint for managing RDF graphs.

Even better with search, bitemporal: The real power of MarkLogic comes not from a single feature, but from the ability to combine features in a single, powerful query. Semantics isn't a product, it's a feature of a product. MarkLogic Semantics works particularly well with search (including geospatial search) and bitemporal.
Search (see also slides 6 and 7): In MarkLogic, you can embed triples in XML or JSON documents and run combination queries. You can combine SPARQL and cts:query in two ways: run a SPARQL query that is filtered by a cts:query condition, or embed a cts:triple-range-query (which returns a cts:query) in a cts:search.

For example, you might want to ask "show me all the people who met with John". If you have triples of the form "john metWith X", that's a simple SPARQL query. But if those triples are embedded in the documents where that fact was asserted or discovered – say, a police report or e-mail exchange – you can ask much richer questions such as "show me all the people who met with John, where the fact was discovered in the last 6 months and the source is a police report from a county in the eastern US and that report also mentions some kind of weapon and some kind of controlled substance".

Or you might want to ask "how many emails and tweets in my sample are generally positive?" If you have triples of the form "message1002 hasSentiment +9", that's a simple SPARQL query. But if those triples are embedded in the messages, you can ask much richer questions such as "show me snippets of all the messages that were overwhelmingly positive, and were sent by someone who is an executive of a Fortune 500 company, between these dates, and which mention the companies 'IBM' and 'Oracle', and mention a word that has something to do with takeovers or acquisitions".

Bitemporal: Bitemporal data management handles historical data along two different timelines, making it possible to rewind the information "as it actually was" in combination with "as it was recorded" at some point in time. It facilitates the creation of a complete audit trail of data. Since you can compose SPARQL and cts:query, you can do a bitemporal SPARQL query: simply run the SPARQL query with a cts:query constraint over one or both bitemporal axes.
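A minimal sketch of the first combination style (a SPARQL query restricted by a cts:query), assuming MarkLogic 8's sem:sparql/sem:store functions; the collection name and the metWith IRI are invented for illustration:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Only triples embedded in documents that match the cts:query are
   visible to the SPARQL query. Collection name and IRIs invented. :)
sem:sparql(
  'PREFIX ex: <http://example.org/>
   SELECT ?who WHERE { ex:john ex:metWith ?who }',
  (), (),
  sem:store((),
    cts:and-query((
      cts:collection-query("police-reports"),
      cts:word-query("weapon"))))
)
```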
Better Answers From Today’s Data Find more relevant information using facts as context
– Example: Search for "cardiac catheter"; show documents about "devices that stimulate nerves" and "implantable devices"
Present more, better information for more productive users
– Example: Search for "Ireland"; show facts about Ireland with search results
Publish information dynamically to web or print or mobile
– Example: BBC Sports page about a team, event, sport, person
– Example: Wiley Custom Select for customized learning materials
Presenter
Presentation Notes
Imagine how much better a search application can be if the app has access to billions of facts and relationships. The app can leverage those facts in several ways (see future slide):

Find more relevant information by expanding the terms the user typed in. For example, at BSI (British Standards Institute) users want to find all standards that apply to a product they are building. With MarkLogic, they can type in "cardiac catheter" and find standards that apply to "devices that stimulate nerves" and "implantable devices" (since a cardiac catheter stimulates the nerves of the heart, and it's implantable), even though those standards don't specifically mention the phrase "cardiac catheter". It's like having a domain expert looking over your shoulder and guiding your search! Under the covers we're expanding the search terms the user is typing in, mostly using ontologies and data sets that are freely available. This capability transforms a standards search from a very long, expert-driven process into a short, bullet-proof one. It's all about getting the search results the user actually wants to see, quickly and easily.

Present more/better information about whatever the user is searching for. When a user types in "Ireland", his intent is not to "find links to all the documents that contain the word 'Ireland'". Rather, he wants to understand, discover, and make decisions about "Ireland". Since the app has access to billions of facts, it shows selected facts about "Ireland" in an infopanel alongside the search results. This is something Google is doing more and more with its "Google Knowledge Graph". Now you can do that too, with your own application. You can also decorate the search snippets with facts about entities in the snippet. For example, if the search snippet mentions an author, pop up an infobox about that author.
It's all about getting as much relevant information in front of the user as possible, not just links to documents.

Publish information dynamically to web or print or mobile. Another way to "re-think the search page" is to dynamically create a mash-up of relevant content for each web page. The BBC pioneered Dynamic Semantic Publishing for users to see and navigate relevant content on the BBC Sports pages. Instead of creating a static page for each team, event, sport, or person that a user might be interested in, the URL becomes the search – so a visit to http://www.bbc.com/sport/football/teams/west-ham-united pulls together match reports, pictures, videos, news items, and league tables related to West Ham United. The app knows which items are related to West Ham by querying a sports ontology – so it knows, for example, that West Ham is a soccer team in England, based in London; they play in the Premier League; and Diafra Sakho is a current player; so the page includes a news story about Sakho, the Premier League table, and links to other soccer-related stories. This gives users an information-rich experience; but it's easy to maintain, up-to-the-minute, and error-free. The BBC had spectacular successes with this site reporting on the 2010 World Cup and 2012 London Olympics. For more on Dynamic Semantic Publishing at the BBC, see:
http://www.bbc.co.uk/blogs/legacy/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
http://www.bbc.co.uk/blogs/legacy/bbcinternet/2012/04/sports_dynamic_semantic.html
See also the BBC Sports Ontology at http://www.bbc.co.uk/ontologies/sport

Wiley Custom Select, Houghton Mifflin Harcourt, and others use MarkLogic to customize learning materials. Semantics could improve ease of use and accuracy in the custom publishing space. See:
http://www.marklogic.com/resources/custom-publishing-in-education-harnessing-technology-to-maximize-results/
http://www.marklogic.com/press-releases/announcing-wiley-custom-select-next-generation-custom-publishing-application-powered-by-marklogic/
http://customselect.wiley.com/

Further discussion points: Combination queries. In MarkLogic, you can embed triples in XML or JSON documents and run combination queries. You can combine SPARQL and cts:query in two ways: run a SPARQL query that is filtered by a cts:query condition, or embed a cts:triple-range-query (which returns a cts:query) in a cts:search. For example, you might want to ask "show me all the people who met with John". If you have triples of the form "john metWith X", that's a simple SPARQL query. But if those triples are embedded in the documents where that fact was asserted or discovered – say, a police report or e-mail exchange – you can ask much richer questions such as "show me all the people who met with John, where the fact was discovered in the last 6 months and the source is a police report from a county in the eastern US and that report also mentions some kind of weapon and some kind of controlled substance". Or you might want to ask "how many emails and tweets in my sample are generally positive?" If you have triples of the form "message1002 hasSentiment +9", that's a simple SPARQL query. But if those triples are embedded in the messages, you can ask much richer questions such as "show me snippets of all the messages that were overwhelmingly positive, and were sent by someone who is an executive of a Fortune 500 company, between these dates, and which mention the companies 'IBM' and 'Oracle', and mention a word that has something to do with takeovers or acquisitions".
This slide shows dynamically produced content boxes, all using RDF data published on other sites like DBpedia. The example on the left is Google, using RDFa, and the example on the right is a demo application built by MarkLogic that pulls RDF data about companies into the app from DBpedia.
Intelligent Data Layer Discover connections between entities
– Example: Show me papers cited by Steven Pinker, and papers cited by…
Walk hierarchies
– Example: Show me who directly owns Acme, and who ultimately owns them
Infer new facts for simple data modeling, powerful queries
– Example: If Acme owns Amertek, then Amertek is owned by Acme
– Example: If prod001 is a blue henley, it's also a blue shirt
Facts may be embedded in documents to keep context
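The ownership example above can be sketched in Turtle (the IRIs are invented for illustration; owl:inverseOf is the standard OWL predicate for this kind of rule):

```turtle
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Acme    ex:owns       ex:Amertek .   # stated fact
ex:ownedBy owl:inverseOf ex:owns .      # rule: if A owns B, then B is ownedBy A

# With an OWL rule set enabled, a query for "ex:Amertek ex:ownedBy ?x"
# infers ex:Acme even though that triple was never stored.
```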
Presenter
Presentation Notes
Discover connections between entities: A lot of people think of RDF as a graph data model – since the object of one triple can be the subject of another, it's easy to represent graphs of facts and relationships. So for example if you have triples that represent authors and papers, and triples that represent which papers cite which other papers, it's easy to ask questions like "which papers are cited by the papers that Steven Pinker wrote?" Then it's easy to extend that to "... and which papers did they cite? And which papers did they cite?" And so on. In this way, you can build up a view of who influenced whom; or who knows whom; or which business entity owns which business entity.

And you can discover connections that might not otherwise be found – Joe works for Acme; Acme is putting together a deal for Amertek to acquire Alchemax; Mary bought a large number of shares in Alchemax; Mary is married to Joe. Possible insider trading? You can find connections to some specified depth – find me everyone associated with Kevin Bacon, but only look 3 steps. Or you can take an unspecified number of steps – show me everyone who has had contact with this Ebola patient, and everyone who has had contact with them, and so on for as far as you can go. This is very difficult to do with the relational data model.

Walk hierarchies: A hierarchy is a special case of a graph. If you have triples that tell you which business entities own which business entities, it's easy to walk the hierarchy and find all business entities in the ownership chain. And you can say "find me the ultimate owner" even if you don't know how many steps away that is. Note this applies to terms (such as search terms) as well as to entities. So a thesaurus represented as triples is easy to manage, extremely flexible, and easy to combine with existing thesauri and ontologies; and it's easy to find terms at the top or bottom of the hierarchy.
So for example if you have triples that tell you a dog is a canine is a mammal, it's easy to ask questions like "what is the root classification for a dog?" (a mammal) and "what are the entities that are classified as mammals?" (dogs, cats, and so on) without knowing how many steps there are, and without enumerating every intermediate relationship. See http://www.w3.org/2004/02/skos/

Infer new facts for simple data modeling, powerful queries: With inference your data is intelligent – you can ask questions and get answers that depend on facts that aren't explicitly in the database. So for example, if you have a fact "Acme owns Amertek", you can answer the question "who is Amertek owned by?" simply by defining a rule "if A owns B then B is owned by A". Without inferencing you'd have to add another fact "Amertek is owned by Acme" if you wanted an answer to that question. Similarly, if you have a fact "prod001 is a blue henley" and a fact that "a henley is a kind of shirt", then with inferencing you can say "show me all blue shirts" and get the answer "prod001".

Facts may be embedded in documents to keep context: Facts/relationships may be "free-floating" (London is the capital of England) or they may be embedded in documents (John met with Mary ... and we know that from an FBI report in Dublin on July 11, 2011 that also mentioned 3 kinds of controlled substances and a known cartel leader).
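A hierarchy walk of that kind can be sketched as a SPARQL 1.1 property path (the ex: IRIs are invented for illustration; rdfs:subClassOf is the standard hierarchy predicate):

```sparql
PREFIX ex:   <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# every classification above "dog", however many steps away,
# without enumerating the intermediate relationships
SELECT ?class
WHERE { ex:Dog rdfs:subClassOf+ ?class }
```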
This slide shows some of the concepts that were mentioned on the last few slides, showing how a website built with linked data provides a more dynamic experience. A graph-based experience doesn't impose a linear direction the way a hierarchical architecture does. Users bring the x-axis: they navigate the graph in an order of their choice, establishing their own personalised temporal/sequential element to the experience. There is still some order – nodes are related to each other through modeled relationships – but these relationships can run across multiple axes, not just the most widely understood categorisation. Through linked data, the categories become less important – it's the content that counts. Reference: http://www.bbc.co.uk/academy/technology/software-engineering/semantic-web/article/art20130720153136618
Simpler Data Integration Flexible data modeling through triples
– Triples are atomic and schema-less
– Triples are easy to share, easy to combine, and readily available
Integrate data by adding links
– Load triples as-is, then add triples to link entities or documents
– No need to change the underlying data
– Example: cust123 (source1) is the sameAs cus_id_456 (source2)
– Example: cust123 hasOrderDoc /orders/ab42.json
Presenter
Presentation Notes
Flexible data modeling through triples: Triples are atomic – they are the ultimate in schemaless data modeling. That means you can take a million triples from DBpedia, a million from the CIA World Factbook, and a million that you derived from your own data, and put them in the same triple store. No need to map them all to some table schema, or even a document schema. Just throw them all in one bucket and query across them. (For relational thinkers: think of a triple as a single cell in a table, where the subject represents the table plus the primary key associated with some row; the predicate is the column; and the object is the cell value.) RDF's atomic data model makes it easy to combine triples from different sources. And since it's so easy to combine triples, it's easy to share them. And so triples are readily available.

Integrate data by adding links: Once you've loaded triples from different sources, how do you query across them? Let's suppose you're writing a Customer 360 app for a telecom company. You have facts about some customer from different parts of the company – from accounts, from support, from your website, and so on. You may also have facts about the same customer that came from different companies – perhaps you just acquired a cable TV company. All those different sources talk about the customer in a different way – for example, accounts refers to them as cust123 while support refers to them as cus_id_456. Simply add one new fact – "cust123 is the sameAs cus_id_456" (sameAs is a standard predicate in the OWL vocabulary – see http://www.w3.org/TR/owl-ref/#sameAs-def). Now you can write queries that treat cust123 and cus_id_456 as the same thing, so that any question about cust123 will return answers about cus_id_456 and vice versa. [Note: to make this happen automatically, use inference with a sameAs rule; otherwise, expand your SPARQL queries to include sameAs resources.] So you can load triples as-is, then add linking triples to integrate them.
With MarkLogic, you can also link resources to documents. If you already have documents that relate to cust123, simply add a triple that links the customer to that document. For example: cust123 hasOrderDoc /orders/ab42.json. [Note that if you embed a triple into a document, there's an implicit link – the triple index includes the document ID, so it knows where that triple came from.] You can also do data reconciliation with triples. Just as you can equate cust123 and cus_id_456, you can equate different spellings such as "John Smith" and "Jon Smith". This is simpler and faster than mapping all your sources to a single schema. You can load triples as-is and get value from querying them right away, even as you figure out which linking triples to add.
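The two linking triples just described, sketched in Turtle (the ex: IRIs are invented for illustration; owl:sameAs is the standard OWL predicate):

```turtle
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:cust123 owl:sameAs     ex:cus_id_456 .        # same customer, two sources
ex:cust123 ex:hasOrderDoc "/orders/ab42.json" .  # link the entity to a document
```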
How does "Euro zone" relate to "European Union", "Europe OECD", or "Europe"? How does a term such as "Small States" relate to "Least Developed Countries", "Lower Middle Income", or "Low & Middle Income"?
Presenter
Presentation Notes
Applied Relevance created an application on MarkLogic called Epinomy, a time series search engine that combines full-text search with business analytics for time series data. Time series data is the accumulation of measurements taken at successive points in time spaced at uniform time intervals, and is the most common form of structured data. The challenge Epinomy has addressed is figuring out how to combine time series data with other unstructured and constantly changing data such as global economic indicator data. For example, the World Bank publishes data for poverty, inflation, and GDP in the SKOS SDMX Data Cube format, a triples format for tracking economic indicators and doing statistical analysis. But there is lots of other economic data that is not already formatted for easy analysis. With relational databases, this challenge is difficult or even impossible to solve, but with MarkLogic Semantics, new data can be incorporated in days, not months.

Consider the difficulty in trying to search across various data sources for a common term such as "Euro zone." It means something different from "European Union", "Europe OECD", or "Europe." Or what about a term such as "Small States," which is different from "Least Developed Countries," "Lower Middle Income," or "Low & Middle Income." Semantics provides the ability to map all of these terms so that a user can perform natural language searches.

Semantics also allows the application to quickly create facets without pre-defining what they should be. Facets, or the categories of results typically grouped down a left-hand column on a webpage, are created in Epinomy entirely using triples. It happens dynamically on the fly, is dependent on the content loaded, and is presented fast to the user. Another challenge is when the same economic data is released multiple times. These multiple "vintages" of the same data would typically be a headache to deal with.
Semantics handles the various vintages of data by simply creating new sets of triples tagged as “vintage.” And, the natural language search was also designed so that a search can specifically return those vintage values.
"Semantic Web Technology" versus "Semantic Technology"? MarkLogic Semantics falls in the category of "Semantic Web Technologies", which is slightly different from what experts refer to as "Semantic Technologies." It gets confusing, however, because MarkLogic does work with partners that provide "Semantic Technologies." For example, MarkLogic works with Smartlogic to form a complete semantics stack, with MarkLogic storing and managing the triples data, and Smartlogic providing entity enrichment and ontology management. That said, we like to be accurate with the "semantics" of semantics.

Semantic Web Technologies refer to a family of specific W3C standards that allow an exchange of related data – whether it resides on the Web or within organizations. It requires a flexible data model (RDF), a query tool (SPARQL), and a common markup language (e.g. RDFa, Turtle, N-Triples). RDF allows you to deconstruct knowledge into pieces called triples, which are linked together in a graph-like representation that is without hierarchy. MarkLogic allows you to natively store and manage RDF triples and query them using SPARQL.

Semantic Technologies are a variety of linguistic tools and techniques, such as Natural Language Processing (NLP) and artificial intelligence, used to analyze unstructured text in order to classify and relate it. By identifying the parts of speech (a subject from a predicate, etc.), powerful algorithms can pinpoint entities (people, places, things, time, etc.), concepts, and categories. Once analyzed, text can be further enriched with vocabularies, dictionaries, taxonomies, and ontologies (so regardless of which representation is used, assets can be found, e.g. Coca-Cola, Coke, KO).
How have we extended our Enterprise NoSQL approach with Semantics?

At the STORAGE LAYER, MarkLogic scales horizontally and has all the capabilities you need from an enterprise database, including replication, failover, backup and recovery, ACID transactions, and government-grade security. We added the triple store at the storage layer, so now you have a triple store that scales horizontally and has all of those same enterprise capabilities.

At the next layer are the INDEXES. Some of those indexes are for full-text search; there are also range indexes for range queries over scalar values, geospatial indexes, and reverse query indexes for alerting. At that layer we added a TRIPLES INDEX and a TRIPLES CACHE to store triples in memory for efficient retrieval. All MarkLogic indexes are designed to work together, so you can do queries with any combination of indexes. The triples cache means you don't need to have all of your triple index in memory, unlike some triple stores. So the size of your triple store is not constrained by physical memory limits.

At the QUERY LAYER, MarkLogic has native SPARQL support. You can query using SPARQL only, or you can run SPARQL as part of a JavaScript or XQuery server-side program.

At the INTERFACE LAYER you can run SPARQL over REST. You can use MarkLogic's command-line tool for fast bulk-loading, MarkLogic Content Pump (mlcp), to load lots of triples very fast, using parallelization and fast-load techniques. You can also use standard REST endpoints to query (SPARQL endpoint) and manage graphs (Graph Store HTTP Protocol). (http://www.w3.org/TR/sparql11-protocol/ defines a SPARQL endpoint, which is a standard REST endpoint for SPARQL queries. http://www.w3.org/TR/sparql11-http-rdf-update/ defines the Graph Store HTTP Protocol, which is a standard REST endpoint for managing RDF graphs.)
With MarkLogic you can query across documents, facts, and metadata, and present results "in context", over REST or from a server-side program written in JavaScript or XQuery. All with the Enterprise robustness you need to run mission-critical applications.
– “Some maniac in a blue van just tried to run me down"
– "I got the first three letters of his license plate: ABC"
You need to look for similar incident reports
– Reports that mention a "blue van"
… around the same time
… around the same place
… with a license plate that starts with "ABC"
Combination Query Example
Presenter
Presentation Notes
What do we mean by a "combination query"? Suppose you work in a call center. Someone calls and says "some maniac in a blue van just tried to run me down – I got the first three letters of his license plate: ABC". You could look up ABC* in the vehicle licensing database. But that would give you lots of results, and probably wouldn't help much. You have a lot of information here – let's see if we can use all that context to find the driver of the blue van. If there really is a maniac in a blue van, there's probably another incident report that will give you more information. That incident report would mention a "blue van"; it would be around the same time and place; and if it has the license plate, it'll start with "ABC". But that's a really hard query!
Combination Query Example

<report>
  <title>Suspicious vehicle near airport</title>
  <date>2012-11-12Z</date>
  <type>observation/surveillance</type>
  <threat>
    <type>suspicious activity</type>
    <category>suspicious vehicle</category>
  </threat>
  <location>
    <lat>37.497075</lat>
    <long>-122.363319</long>
  </location>
  <triple>
    <subject>IRIID</subject>
    <predicate>isa</predicate>
    <object>license-plate</object>
  </triple>
  <triple>
    <subject>IRIID</subject>
    <predicate>value</predicate>
    <object>ABC 123</object>
  </triple>
  <description>A blue van with license plate ABC 123 was observed parked behind the airport sign…</description>
</report>
Presenter
Presentation Notes
Luckily your incident reports are all in MarkLogic! With MarkLogic you can query across all these kinds of information – a date range, a geospatial query, a full-text search, and a triples query, all in a single simple efficient query. With MarkLogic you can use all the context you have to search across all your information to get just exactly what you need to complete your task. This is the power of querying across documents, data, geospatial, and triples in a single, simple query.
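The blue-van search could be sketched as a single cts:search; this is an illustrative sketch only, not a tested program – the element names follow the sample report, while the collection name, query values, predicate IRI, and the assumed range and geospatial indexes are invented:

```xquery
xquery version "1.0-ml";

(: Illustrative sketch: assumes a date range index on <date> and a
   geospatial element-pair index on location/lat/long; the predicate
   IRI and collection name are invented. :)
cts:search(fn:collection("incident-reports"),
  cts:and-query((
    (: full-text :)
    cts:word-query("blue van"),
    (: around the same time :)
    cts:element-range-query(xs:QName("date"), ">=", xs:date("2012-11-10Z")),
    (: around the same place – within 10 miles :)
    cts:element-pair-geospatial-query(xs:QName("location"),
      xs:QName("lat"), xs:QName("long"),
      cts:circle(10, cts:point(37.497075, -122.363319))),
    (: license plate starting with "ABC" – prefix via a range trick :)
    cts:and-query((
      cts:triple-range-query((), sem:iri("value"), "ABC", ">="),
      cts:triple-range-query((), sem:iri("value"), "ABD", "<")
    ))
  )))
```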
Store and manage hundreds of billions of RDF triples
Query across documents, data, and triples
Triple index for sub-second search results
Triple cache for high performance across large clusters
Bulk-load triples via MarkLogic Content Pump
Provenance and reification by adding metadata
SPARQL 1.0+ over REST or XQuery
SPARQL calls from server-side programs with query restrictions
Standard SPARQL endpoint and graph store protocol support
XQuery helper modules for serializations and transitive closures
Updates, aggregates via MarkLogic APIs
Semantic enrichment with partners (e.g. Smartlogic, Temis, NetOwl)
Enterprise Features: ACID transactions, scalability and elasticity, HA/DR, government-grade security, monitoring and performance tools
MarkLogic 7
Everything in MarkLogic 7, plus:
– SPARQL 1.1, including updates and aggregates
– Graph traversal with property paths and transitive closures
– Automatic inference using rule sets
  – Supplied rule sets for RDFS, RDFS+, and OWL Horst
  – Support for user-defined rule sets
Property paths – example

## find papers that cite paperA, and papers that cite papers that cite paperA, and so on
SELECT ?s
WHERE {
?s c:cites*/dc:title "Paper A" .
}
ORDER BY ?s
Note: taken from Bob DuCharme's "Learning SPARQL" For more examples, see http://ea.marklogic.com/wp-content/uploads/2014/08/SPARQL-paths-examplesEA2.pdf
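The semantics of the `c:cites*` path (zero or more `c:cites` steps) can be sketched in plain Python as a reflexive-transitive closure over a set of triples. The triple data and the `path_star` helper are invented for illustration; this is not a MarkLogic or SPARQL API.

```python
# Triples as (subject, predicate, object) tuples
triples = {
    ("paperA", "dc:title", "Paper A"),
    ("paperB", "c:cites", "paperA"),
    ("paperC", "c:cites", "paperB"),
    ("paperD", "c:cites", "paperC"),
    ("paperE", "dc:title", "Unrelated"),
}

def path_star(triples, pred):
    """All (x, y) pairs connected by ZERO or more pred steps (like pred*)."""
    # Zero steps: every node is connected to itself
    pairs = {(n, n) for t in triples for n in (t[0], t[2])}
    edges = {(s, o) for (s, p, o) in triples if p == pred}
    changed = True
    while changed:                      # grow the closure to a fixpoint
        changed = False
        for (a, b) in list(pairs):
            for (c, d) in edges:
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs

# Equivalent of:  ?s c:cites*/dc:title "Paper A"
titled = {s for (s, p, o) in triples if p == "dc:title" and o == "Paper A"}
result = sorted({s for (s, x) in path_star(triples, "c:cites") if x in titled})
print(result)  # ['paperA', 'paperB', 'paperC', 'paperD']
```

Note that `paperA` itself matches, because `*` allows zero steps, exactly as in the SPARQL query above.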
SPARQL UPDATE operations

1. GRAPH MANAGEMENT – manipulate RDF graphs using the SPARQL 1.1 Update language
CREATE – create a graph
DROP – drop a graph and its contents
COPY – make the destination graph into a copy of the source graph; any content in the destination graph before this operation will be removed (think copy/paste)
MOVE – move the contents of the source graph into the destination graph, and remove them from the source graph; any content in the destination graph before this operation will be removed (think cut/paste)
ADD – add the contents of the source graph into the destination graph; keep the source graph intact; keep the initial contents of the destination graph intact
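The effect of COPY, MOVE, and ADD can be sketched with a plain Python dict mapping graph names to sets of triples. This is an illustration of the SPARQL 1.1 Update semantics only, not MarkLogic's implementation; the graph names and data are invented.

```python
# Graphs as: name -> set of (subject, predicate, object) triples
graphs = {
    "g1": {("John", "livesIn", "London")},
    "g2": {("London", "isIn", "England")},
}

def copy(graphs, src, dst):
    # COPY: dst becomes a copy of src; dst's prior content is removed (copy/paste)
    graphs[dst] = set(graphs.get(src, set()))

def move(graphs, src, dst):
    # MOVE: like COPY, then the source graph is removed (cut/paste)
    copy(graphs, src, dst)
    graphs.pop(src, None)

def add(graphs, src, dst):
    # ADD: merge src into dst; both src and dst's prior content are kept
    graphs.setdefault(dst, set()).update(graphs.get(src, set()))

add(graphs, "g1", "g2")
print(sorted(graphs["g2"]))
# [('John', 'livesIn', 'London'), ('London', 'isIn', 'England')]
```

The key distinctions mirror the bullets above: COPY and MOVE clobber the destination, ADD merges into it, and only MOVE removes the source.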
Note: arg5 is called $default-permissions, but you should set permissions explicitly
– See also sem:graph-set-permissions()
Presenter
Presentation Notes
Arg5 to sem:sparql-update() is called $default-permissions; if a SPARQL Update statement adds triples to a graph that doesn't yet exist, MarkLogic uses the default permissions for the new graph. BUT best practice is to set permissions explicitly – i.e. on purpose – either by doing a CREATE GRAPH (as in this slide) or by using sem:graph-set-permissions().
Note: arg5 is called $default-permissions, but you should set permissions explicitly
– See also sem.graphSetPermissions()
Presenter
Presentation Notes
Arg5 to sem:sparql-update() is called $default-permissions; if a SPARQL Update statement adds triples to a graph that doesn't yet exist, MarkLogic uses the default permissions for the new graph. BUT best practice is to set permissions explicitly – i.e. on purpose – either by doing a CREATE GRAPH (as in this slide) or by using sem.graphSetPermissions().
Locking options for sem:sparql():
– read-write: read-lock documents containing the triples being accessed
– write: no locks (because sem:sparql() doesn't write)
– Default is locking=read-write. Locking is ignored in a query transaction.
Presenter
Presentation Notes
In ML7, triple range queries lock neither triples nor the documents containing them. In ML8, we provide a locking option in sem:sparql-update and sem:sparql to specify the type of lock to be enforced.

sem:sparql-update always runs in an update transaction. If locking is set to read-write, the server read-locks documents containing the triples being accessed and write-locks documents being updated; this guarantees ACID properties when updating triples. If locking is set to write, the server only write-locks documents being updated. We allow locking=write for users who prefer performance to consistency.

sem:sparql usually runs in a query transaction, where the locking option is ignored. If sem:sparql runs in an update transaction (for example, sem:sparql and xdmp:node-replace in the same transaction): if locking is set to read-write, the server read-locks documents containing the triples being accessed and write-locks documents being updated, guaranteeing ACID properties; if locking is set to write, the server only write-locks documents being updated. We allow locking=write in an update transaction – e.g. the user may run sem:sparql and do a document-insert into docs that have nothing to do with triples.
Set a default ruleset for the database

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: add "subClassOf.rules" as a default ruleset for database "Documents" :)
let $config := admin:get-configuration()
let $dbid := admin:database-get-id($config, "Documents")
let $rules := admin:database-ruleset("subClassOf.rules")
let $c := admin:database-add-default-ruleset($config, $dbid, $rules)
return admin:save-configuration($c)

(: See also: admin:database-get-default-rulesets(), admin:database-delete-default-ruleset() :)
Specify a ruleset as part of your query

(: create a store that uses the RDFS ruleset for inferencing :)
let $rdfs-store := sem:ruleset-store("rdfs.rules", sem:store("no-default-rulesets"))
return
  (: use the store you just created – pass it into sem:sparql() :)
  sem:sparql('
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex: <http://example.com/>
    SELECT ?product
    FROM <http://marklogic.com/semantics/sb/products/inf-1>
    WHERE { ?product rdf:type ex:Shirt ; ex:color "blue" }
  ', (), (), $rdfs-store)
For full examples, see http://ea.marklogic.com/wp-content/uploads/2014/12/SPARQL-inference-examples-EA3-2.pdf
Presenter
Presentation Notes
Create a sem:store using sem:store() and sem:ruleset-store() Feed that store into sem:sparql() Note: "no-default-rulesets" option to sem:store determines whether this ruleset replaces the default, or is added to it
Create your own ruleset

(: create a rules file and insert it into the Schemas database :)
(: Note: run this from Query Console with "Content Source" set to "Schemas" :)
xdmp:document-insert('/rules/livesin.rules', text{'
  # my rules for inference
  prefix ex: <http://example.com/>
  prefix gn: <http://www.geonames.org/ontology#>
Use your own ruleset

(: find places that John Smith lives in – with inferencing, using my ruleset :)
let $my-store := sem:ruleset-store("/rules/livesin.rules", sem:store())
return
  (: use the store you just created – pass it in to sem:sparql() :)
  sem:sparql('
    PREFIX ex: <http://example.com/>
    PREFIX gn: <http://www.geonames.org/ontology#>
    SELECT ?person ?placeName
    FROM <http://marklogic.com/semantics/sb/customers/inf-1>
    WHERE { ?person ex:livesIn ?place . ?place gn:name ?placeName }
    ORDER BY ?person
  ', (), (), $my-store)
For full examples, see http://ea.marklogic.com/wp-content/uploads/2014/12/SPARQL-inference-examples-EA3-2.pdf
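The livesin.rules file above is shown only in part, but the forward-chaining semantics of a rule such as "livesIn(x, y) and isIn(y, z) implies livesIn(x, z)" – the John-lives-in-England example from earlier – can be sketched in plain Python as a fixpoint computation. The data and function names are invented for illustration; this is not MarkLogic's rule syntax or engine.

```python
triples = {
    ("John", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("England", "isIn", "UK"),
}

def infer_lives_in(triples):
    """Fixpoint of the rule: livesIn(x, y) & isIn(y, z) => livesIn(x, z)."""
    facts = set(triples)
    while True:
        # Apply the rule to every matching pair of facts
        new = {(x, "livesIn", z)
               for (x, p1, y) in facts if p1 == "livesIn"
               for (y2, p2, z) in facts if p2 == "isIn" and y2 == y}
        if new <= facts:          # nothing new inferred: fixpoint reached
            return facts
        facts |= new

inferred = infer_lives_in(triples)
print(("John", "livesIn", "England") in inferred)  # True
print(("John", "livesIn", "UK") in inferred)       # True
```

This is the same idea as a ruleset-backed sem:store: queries against the store see both the asserted triples and everything the rules can derive from them.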
Inference rules – Summary

Choose an appropriate ruleset
– the right level of inference – can use more than one ruleset
Set a default ruleset for the database – Admin UI or XQuery/JavaScript API
Specify a ruleset as part of your query – create a sem:store using your ruleset location(s) – include or override default ruleset – ruleset location is resolved from Schemas database, then $MARKLOGIC/Config
Create your own ruleset – text file inserted in Schemas database
Tips on Inference

Use the fewest rules that you actually need
– Query performance slows as you add rules
– Database default + query-time ruleset(s) gives great flexibility
Consider doing inference in your query, possibly with paths
– Gives you the most control, best performance, most predictable results
## find all blue shirts (including henleys) without inference
SELECT ?product
FROM <http://marklogic.com/semantics/sb/products/inf-1>
WHERE { ?product rdf:type/rdfs:subClassOf* ex:Shirt .
        ?product ex:color "blue" }
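The `rdf:type/rdfs:subClassOf*` pattern above can be sketched in plain Python: walk the subClassOf chain from each product's type and keep products whose chain reaches ex:Shirt. The data and helper names are invented for illustration; this shows the query-time alternative to inference, not MarkLogic's evaluator.

```python
triples = {
    ("henley1", "rdf:type", "ex:Henley"),
    ("tee1", "rdf:type", "ex:TShirt"),
    ("ex:Henley", "rdfs:subClassOf", "ex:Shirt"),
    ("ex:TShirt", "rdfs:subClassOf", "ex:Shirt"),
    ("ex:Shirt", "rdfs:subClassOf", "ex:Clothing"),
    ("henley1", "ex:color", "blue"),
    ("tee1", "ex:color", "red"),
}

def superclasses(cls, triples):
    """cls plus everything reachable via rdfs:subClassOf (i.e. subClassOf*)."""
    seen, todo = set(), [cls]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo += [o for (s, p, o) in triples
                     if s == c and p == "rdfs:subClassOf"]
    return seen

# Equivalent of: ?product rdf:type/rdfs:subClassOf* ex:Shirt . ?product ex:color "blue"
blue_shirts = sorted(
    s for (s, p, o) in triples
    if p == "rdf:type" and "ex:Shirt" in superclasses(o, triples)
    and (s, "ex:color", "blue") in triples)
print(blue_shirts)  # ['henley1']
```

The path walk finds the henley even though its type is ex:Henley, not ex:Shirt – the same result inference would give, but computed per query, with no rules to maintain.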