A brief introduction to linked data

7/3/2018 - Joep Meindertsma

Linked data is a way to structure and share information using links. These links make data more meaningful and useful. To understand why, let's take a piece of information and upgrade its data quality step by step, until it's linked data. In the later paragraphs, we'll get a little more technical: we'll discuss the RDF data model, serialization formats, ontologies and publishing strategies. If you're just interested in why linked data is awesome, skip to the advantages of linked data.

Human Language

Tim was born in London on the 8th of June, 1955.

Humans understand what this sentence means, but to a computer, this is just a string of characters. If we wanted an application to do something with this sentence, such as display Tim's birthdate, we'd need the computer to understand English. A simpler solution would be to structure our information in a way that's useful to a computer.

Tables

If we put the information in a table, we can simply let the computer read the birthDate field for Tim.

| name | birthPlace | birthDate  |
|------|------------|------------|
| Tim  | London     | 06-08-1955 |

Great! By structuring data, computers can be programmed to do more useful things with it.

But now someone else wants to use this data and has a couple of questions.

  • Who is Tim?
  • Which London do you mean, the big one in the UK or the smaller one in Canada?
  • Does 06-08 mean June 8th or August 6th?

Links

Now, let's add links to our data:

| name                                          | birthPlace                                   | [birthDate](http://schema.org/birthDate) |
|-----------------------------------------------|----------------------------------------------|------------------------------------------|
| [Tim](https://www.w3.org/People/Berners-Lee/) | [London](http://dbpedia.org/resource/London) | 1955-06-08                               |

By adding these links, others can answer all previous questions by themselves. The links solve three problems:

  • Links provide extra information. Follow the link to Tim to find out more about him.
  • Links remove ambiguity. We now know exactly which London we're talking about.
  • Links add standardization. The birthDate link tells us we need to use the YYYY-MM-DD notation.

These three characteristics make linked data more reusable. The data quality has been improved because other people and machines can now interpret and use the information more reliably.

Let's look at the questions about the first table again. The ambiguity in the table is obvious to someone who reuses the data, but it was not apparent to the creator of the table. I made the table, so I knew which Tim and which London I was talking about, and I knew how the birthdate should be read. There was no ambiguity for me.

This closed worldview is the root cause of many of the problems in digital systems today. We tend to ignore the information that is stored in the context of data. Developers tend to make software that produces data that only their systems can fully understand. They have their own assumptions, identifiers, and models. Linked data solves this problem by removing all ambiguity about what data represents and how it should be interpreted.

Statements & the RDF data model

In the tables above, we were making two separate statements about Tim: one about his birthdate and one about his birthplace. Each statement had its own cell in the table. In linked data, these statements are often called triples. That's because every triple statement has three parts: a subject, a predicate, and an object.

| Subject | Predicate  | Object     |
|---------|------------|------------|
| Tim     | birthPlace | London     |
| Tim     | birthDate  | 1955-06-08 |

A bunch of statements about a single subject (such as Tim) is called a resource. That's why we call this data model the Resource Description Framework: RDF. RDF is the de facto standard for linked data.

Instead of using a table of triples, we could visualize the RDF data as a graph.

A visualization of the above triples in a graph

The object of the first triple, for the birthPlace, contains a link (an IRI) to some other resource (London). The object of the second triple (the birthDate) is not a link, but a so-called literal value. The literal value cannot have any properties since it's not a resource.

Calling RDF statements triples can be a little confusing, because the literal values mentioned earlier can consist of multiple fields themselves. A literal consists of a value, a datatype and a language, which means that these triples would take up five columns in your database. In most serialization formats, the datatype and language fields are optional. A datatype is always a link, and the default datatype for literals is xsd:string.
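To make these literal fields concrete, here's a minimal sketch in Python using the rdflib library (just one of many RDF toolkits; the schema:name triple is my own illustrative addition):

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

schema = Namespace("http://schema.org/")
tim = URIRef("https://www.w3.org/People/Berners-Lee/")

g = Graph()
# A literal with an explicit datatype; the datatype is itself a link:
g.add((tim, schema.birthDate, Literal("1955-06-08", datatype=XSD.date)))
# A literal with a language tag instead of a datatype:
g.add((tim, schema.name, Literal("Tim Berners-Lee", lang="en")))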

That's a lot of new words and concepts, and they can be a bit confusing at first. However, these concepts will appear all the time when you're actually working with linked data, so try to get an accurate mental model of these concepts.

Let's take a step back and reflect. What can we say about the RDF model, looking at how it works? First, RDF is actually a very simple model: you can represent anything in RDF with just three (or five) columns. Second, note that it is not possible to add extra information to edges (the arrows in the graph). This is different from most graph models, where edges can have their own properties. Another characteristic of the RDF model is that it is really easy to combine two RDF graphs; integrating two datasets is a luxury that most data models don't have. Finally, having a database model that is decoupled from your application models means high extensibility and flexibility: changing your model or adding properties does not require any schema changes. This is what makes RDF so great for systems that change over time.
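To show how cheap combining graphs is, here's a small sketch (again Python with rdflib): merging two datasets is a single set union, and no schema migration is involved:

from rdflib import Graph

g1 = Graph()
g1.parse(data="""
  @prefix schema: <http://schema.org/>.
  <https://www.w3.org/People/Berners-Lee/> schema:birthDate "1955-06-08".
""", format="turtle")

g2 = Graph()
g2.parse(data="""
  @prefix schema: <http://schema.org/>.
  <https://www.w3.org/People/Berners-Lee/> schema:birthPlace <http://dbpedia.org/resource/London>.
""", format="turtle")

# Merging two RDF graphs is just a union of their triples.
merged = g1 + g2
print(len(merged))  # 2 triples, no collisions, no schema changes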

RDF Serialization

Let's get a little more technical (feel free to skip to Ontologies if you don't like all this code). RDF is just a data model, not a serialization format. This is different from JSON or XML, for example, which are both data models and serialization formats.

In other words: the subject, predicate, object model can be represented in several ways. For example, here are the same triples from the table and the graph above, serialized in the Turtle format:

<https://www.w3.org/People/Berners-Lee/> <http://schema.org/birthDate> "1955-06-08".
<https://www.w3.org/People/Berners-Lee/> <http://schema.org/birthPlace> <http://dbpedia.org/resource/London>.

The <> symbols indicate IRIs and the "" symbols indicate literal values.

This example doesn't look as good as the graph above, right? Long URLs tend to take up a lot of space and make the data a bit tough to read. We can use namespaces (denoted with @prefix) to compress RDF data and make it more readable.

@prefix tim: <https://www.w3.org/People/Berners-Lee/>.
@prefix schema: <http://schema.org/>.
@prefix dbpedia: <http://dbpedia.org/resource/>.

tim: schema:birthDate "1955-06-08".
tim: schema:birthPlace dbpedia:London.

You could also express the same RDF triples as JSON-LD:

{
  "@context": {
    "schema": "http://schema.org/",
    "dbpedia": "http://dbpedia.org/resource/"
  },
  "@id": "https://www.w3.org/People/Berners-Lee/",
  "schema:birthDate": "1955-06-08",
  "schema:birthPlace": {
    "@id": "dbpedia:London"
  }
}

Or as HTML with some extra RDFa attributes:

<div xmlns="http://www.w3.org/1999/xhtml"
  prefix="
    schema: http://schema.org/
    rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
    rdfs: http://www.w3.org/2000/01/rdf-schema#"
  >
  <p typeof="rdfs:Resource" about="https://www.w3.org/People/Berners-Lee/">
    Tim
    <span rel="schema:birthPlace" resource="http://dbpedia.org/resource/London">
      was born in London
    </span>
    <span property="schema:birthDate" content="1955-06-08">
      on the 8th of June, 1955
    </span>
  </p>
</div>

The Turtle, JSON-LD, and HTML+RDFa examples each contain the same RDF triples and can be automatically converted into one another. You can try this for yourself and discover even more RDF serialization formats, such as microformats, RDF/XML (don't use this, please) and N-Triples.
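If you want to try the conversion programmatically, here's a minimal sketch using Python and rdflib (JSON-LD support is built into rdflib 6+; older versions need the separate rdflib-jsonld plugin):

from rdflib import Graph

turtle = """
@prefix schema: <http://schema.org/>.
<https://www.w3.org/People/Berners-Lee/>
  schema:birthDate "1955-06-08";
  schema:birthPlace <http://dbpedia.org/resource/London>.
"""

g = Graph()
g.parse(data=turtle, format="turtle")
# The same triples, serialized to another format:
print(g.serialize(format="json-ld"))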

The number of serialization options for RDF might be a bit intimidating, but you shouldn't feel the need to understand and know every single one. The important thing to remember is that there are a lot of options, all compatible with each other because they share the RDF data model.

Update: I've written an article about when to choose which RDF serialization format!

Ontologies

Let's say a bit more about Tim. First of all, it might be useful to specify that Tim is a person:

@prefix tim: <https://www.w3.org/People/Berners-Lee/>.
@prefix schema: <http://schema.org/>.
@prefix dbpedia: <http://dbpedia.org/resource/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

tim: a foaf:Person;
  schema:birthDate "1955-06-08";
  schema:birthPlace dbpedia:London.

We've referred to foaf:Person to specify that Tim is an instance of the class Person (the a in the snippet above is Turtle shorthand for rdf:type). FOAF (Friend Of A Friend) is an ontology designed to describe data related to people in social networks. It defines the concept of a Person and some attributes, such as a profile image. We used the schema.org ontology for the concepts of birthDate and birthPlace.

There exist many ontologies, ranging from organizations (describing concepts like memberships) to pizzas (describing concepts like ingredients). These ontologies are themselves described in RDF as well. A powerful and popular way to describe ontologies is OWL, the Web Ontology Language. The newer SHACL standard helps to define the shape of RDF data and can be used to express constraints.

An ontology described in RDF is a machine-readable data model. This opens up some really cool possibilities. You can generate documentation. You can use reasoners to infer new knowledge about your data. You can even generate forms and other UI components in React using libraries such as Link-Redux.
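As a small taste of reasoning, here's a sketch in Python using rdflib together with the owlrl reasoner. We state that every foaf:Person is a foaf:Agent (a subclass relation that the FOAF ontology itself declares), and the reasoner derives a triple we never wrote down:

from rdflib import Graph, Namespace, RDF, RDFS, URIRef
import owlrl  # pip install owlrl

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
tim = URIRef("https://www.w3.org/People/Berners-Lee/")

g = Graph()
g.add((tim, RDF.type, FOAF.Person))
g.add((FOAF.Person, RDFS.subClassOf, FOAF.Agent))

# Compute the RDFS closure: inferred triples are added to the graph.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((tim, RDF.type, FOAF.Agent) in g)  # True, although we never stated it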

The power of ontologies goes far beyond this, but that probably deserves its own article.

Publishing linked data

Linked data is meant to be shared. We can do this in several ways:

Firstly, there's the data dump. Serialize your RDF the way you like and make it accessible as a single file. It's the easiest and often the cheapest way to publish your data. However, if someone just wants to know something about a single subject (or resource) in your data dump, they'd have to download the entire thing. That's cumbersome and makes your data less reusable than it could be. All processing and querying efforts are left to the downloader. Furthermore, data dumps are hard to manage and therefore likely to be outdated.

Subject pages to the rescue! Make the RDF data available through HTTP at the location where you'd expect it: the same link as the resource IRI. Doing this makes your data truly linked, since every resource can now be downloaded separately and automatically. Subject pages can be either static or dynamic. Static subject pages are simply RDF files hosted at some URL. Sharing static subject pages is very simple, but static data is hard to maintain or edit. Dynamic pages are generated by a server, so the underlying data can be edited using whatever framework you like. Another advantage of dynamic subject pages is that you can serialize to many different formats: you can show HTML to humans and RDF to computers. For example, our project Argu (an online democracy and discussion tool) works like this. Visit a subject page (e.g. argu.co/nederland/m/46); if you want the same content as linked data, add a serialization extension (e.g. .ttl) or use an HTTP Accept header. Note that even though this project serializes all sorts of RDF formats, it does not use an RDF database / triple store internally.
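To give an impression of content negotiation, here's a sketch in Python with the requests library. It asks for Turtle instead of HTML at Tim's well-known profile URL (assuming the server still supports content negotiation there):

import requests

# The Accept header tells the server which serialization we prefer.
response = requests.get(
    "https://www.w3.org/People/Berners-Lee/card",
    headers={"Accept": "text/turtle"},
)
print(response.headers.get("Content-Type"))
print(response.text[:300])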

Perhaps the most popular and easiest way to publish linked data is with annotated pages. Remember the RDFa serialization format discussed above? That's annotated pages. Using RDFa or Microdata in your existing web pages provides some benefits, especially for SEO. For example, you can get those cool boxes in Google search previews, which show things like star ratings. However, annotated pages are more suited to adding a bit of spice to your existing webpage than to making huge datasets available. Parsing (reading) RDFa from a large HTML document will always be more expensive than reading Turtle or another simple triple-based RDF format.

A radically different way to share your linked data is through a SPARQL endpoint. SPARQL is a query language, like SQL, designed to perform complex search queries on large RDF graphs. With SPARQL, you can run queries such as 'which pianists live in the Netherlands?' or 'which proteins are involved in signal transduction and related to pyramidal neurons?'. SPARQL is without any doubt extremely powerful, but using it as the only way of sharing RDF might not be the best idea. The subjects that you define are URLs, and these should resolve. Having a SPARQL endpoint is a nice bonus, but making the subjects themselves available at their URLs should have priority. If you want a SPARQL endpoint, you will need to store your RDF data in a triple store with SPARQL support; various proprietary and open-source ones (e.g. Apache Jena) exist. Ask yourself whether your users will need to run complex queries on your data, and keep in mind that an open SPARQL endpoint can have very inconsistent performance.
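Here's what a simple SPARQL query looks like in practice. This sketch runs it with Python and rdflib's built-in query engine; against a public endpoint you'd send the same query over HTTP instead:

from rdflib import Graph

g = Graph()
g.parse(data="""
  @prefix schema: <http://schema.org/>.
  <https://www.w3.org/People/Berners-Lee/>
    schema:birthPlace <http://dbpedia.org/resource/London>.
""", format="turtle")

# Find every resource born in London:
results = g.query("""
  PREFIX schema: <http://schema.org/>
  SELECT ?person WHERE {
    ?person schema:birthPlace <http://dbpedia.org/resource/London>.
  }
""")
for row in results:
    print(row.person)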

Other technologies like Linked Data Fragments and HDT allow for even more efficient sharing and storing of linked data.

Remember that you don't necessarily need a triple store or SPARQL to share linked data. If you're importing linked data from other sources, you're probably going to need a triple store, because you can't know in advance what kind of data models you're going to get. However, if you have a constrained and clear schema (which most applications have!) and you want to make your app's data available as linked data, you can simply keep using your existing database. What you need to do is serialize your data to some RDF format. This might mean adding an @context object to your JSON bodies, which maps JSON keys to RDF properties, as the sketch below shows. Or it might mean using some RDF serializer and creating the mappings from internal concepts to external URLs inside your app.
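For example, a plain JSON response could be upgraded to JSON-LD like this (a sketch; the keys and the @id URL are illustrative):

import json

# A plain JSON body from a hypothetical existing API:
person = {"name": "Tim", "birthDate": "1955-06-08"}

# Adding an @context maps the existing keys to RDF properties,
# turning the same JSON into valid JSON-LD:
person_ld = {
    "@context": {
        "name": "http://schema.org/name",
        "birthDate": "http://schema.org/birthDate",
    },
    "@id": "https://www.w3.org/People/Berners-Lee/",
    **person,
}
print(json.dumps(person_ld, indent=2))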

Note that there's a difference between linked data and linked open data. Although linked data is a great choice for publishing open data (it's also known as 5-star open data), you don't have to make your linked data accessible to others. It's perfectly possible to secure linked data using OAuth, WebID, ACLs or other methods.

Advantages of linked data

  • Links provide a path to extra information on something, since you can follow them. If you link to other linked data resources, it means that machines can traverse these graphs as well.
  • Links remove ambiguity, so it becomes very clear what is being stated.
  • Linked data enables a decentralized architecture. Since URLs can point directly to the source of the data, even when that data lives on a completely different domain and server, datasets can be connected to each other.
  • Linked data stays at the source, so it does not have to be copied as much. A user can simply request one specific part of the data, without having to download the entire dataset. This prevents a lot of expensive issues related to data duplication.
  • You don't need new APIs and API descriptions, since you can just use HTTP + Content Negotiation to fetch specific items. The data itself is browsable, like webpages are.
  • You can easily merge linked datasets without any collisions in identifiers. This is because URLs are unique even across multiple domains.
  • Linked data can be converted to many serialization formats. This blogpost compares them. Since RDF contains more information, it's easy to convert Linked Data to JSON (for example), but the other way around is more difficult.
  • Linked data is a standard with many available tools, libraries and query options (e.g. SPARQL).
  • Linked data is highly extensible, as anyone can use their own URLs for classes, predicates and datatypes.

Disadvantages of linked data

  • Creating new linked data can be more time consuming, since you are expected to use (working) links instead of the words that come to mind.
  • It can be a bit confusing at first, especially the plurality of serialization formats.
  • Handling sequential data / arrays in RDF is more difficult than it should be.
  • Having a good URL strategy becomes more important, especially when people will use your linked data.
  • Rendering RDF data in a fancy GUI / web application can be tricky (check out our link-redux library for rendering linked data in React).
  • Re-using it often requires mapping efforts. In Object Oriented environments (e.g. JavaScript), developers tend to use dot syntax to navigate data, e.g. when accessing a key in a JSON object, such as myObject.someProperty. With RDF, these keys are (long) URLs, so this might require some RDF ORM.
  • Few people are familiar with linked data, and there is a bit of a learning curve.

Further reading

If you want to learn more about the vision behind the semantic web and linked data, read the 2006 paper by some of the original inventors. If you're looking for inspiration and example projects, check out the Linked Open Data Cloud. For getting a better grasp on the RDF data model, I recommend my article on RDF serialization formats. If you want to learn more about reasoning and ontologies, try the W3C OWL primer. For SPARQL, the Apache Jena tutorial could help. Check out the /r/semanticweb community on Reddit for interesting posts and discussions. There's a fairly active Gitter channel where people are eager to help you out. Here's a list of some interesting Twitter accounts you might want to follow. Check out the other articles of the Ontola Linked Data Blog.

If you want to get help with your linked data project, feel free to send me an email!