The Closure of 500M owl:sameAs Statements

Wouter Beek, Joe Raad, Jan Wielemaker, and Frank van Harmelen

June 7th, 2018


Linked Data requires a formal understanding of identity
(i.e. owl:sameAs).

You must have both

Formal meaning

〈x, owl:sameAs, y〉
means that
(∀P)(Px ↔ Py)

Linked Data

“Include links to other URIs, to discover more things.”
Linked Data principle 4, Tim Berners-Lee

Similarity is not good enough

“SKOS exactMatch indicates a high degree of confidence that two concepts can be used interchangeably across a wide range of information retrieval applications”
SKOS specification, 2009

The only thing worse than owl:sameAs is ‘clever’ replacements for owl:sameAs

lexvo:nearlySameAs lexvo:somewhatSameAs owl:sameAs

lexvo:nearlySameAs lexvo:nearlySameAs lexvo:somewhatSameAs ?

owl:sameAs lexvo:somewhatSameAs bbc:sameAs ?

We need an enabler for empirical research into how owl:sameAs is actually being used.

The analytic approach (“people make mistakes” / “it's just noise”) is not enough.

Requirements

A performant and cost-effective solution for determining whether two things are (claimed to be) the same.

This solution must scale to the LOD Cloud.

This solution must be formally interpretable (no skos:exactMatch, rdfs:seeAlso).

It must be calculated incrementally.


Formal properties of Identity

Identity is the smallest equivalence relation. It is:

  • reflexive (x,x)
  • symmetric (x,y) → (y,x)
  • transitive (x,y) ∧ (y,z) → (x,z)


Explicit identity relation over {:a,:b,:c,:d}:

                  :a owl:sameAs :b
                  :d owl:sameAs :b

Then the implicit identity relation is:

                    :a owl:sameAs :a
                    :a owl:sameAs :b
                    :a owl:sameAs :d
                    :b owl:sameAs :a
                    :b owl:sameAs :b
                    :b owl:sameAs :d
                    :c owl:sameAs :c
                    :d owl:sameAs :a
                    :d owl:sameAs :b
                    :d owl:sameAs :d
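The example above can be reproduced with a naive fixed-point computation that applies the reflexive, symmetric, and transitive rules until nothing new is derived. This is only an illustrative sketch: the function name `identity_closure` is hypothetical, and this quadratic approach is not meant to scale to hundreds of millions of statements.

```python
def identity_closure(nodes, explicit_pairs):
    """Return the reflexive, symmetric, transitive closure as a set of pairs."""
    closure = set(explicit_pairs)
    # Reflexivity: every node is identical to itself.
    closure |= {(n, n) for n in nodes}
    changed = True
    while changed:
        # Symmetry: (x, y) -> (y, x)
        new = {(y, x) for (x, y) in closure}
        # Transitivity: (x, y) and (y, z) -> (x, z)
        new |= {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        changed = not new <= closure
        closure |= new
    return closure


nodes = [":a", ":b", ":c", ":d"]
explicit = [(":a", ":b"), (":d", ":b")]
closure = identity_closure(nodes, explicit)
for s, o in sorted(closure):
    print(s, "owl:sameAs", o)
```

Running it on the four-node example lists the same ten pairs as shown above.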

Where do we go for the explicit identity relation?


Fernández et al. 2017

Extract the explicit identity relation

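                prefix owl: <http://www.w3.org/2002/07/owl#>
                # Keep only pairs with ?s < ?o: this drops reflexive pairs
                # and one direction of each symmetric pair.
                construct {
                  ?s owl:sameAs ?o
                } where {
                  select distinct ?s ?o {
                    ?s owl:sameAs ?o
                    filter(?s < ?o)
                  }
                }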

558.9M → 331M triples

Compaction: 2.8M reflexive and 225M duplicate symmetric triples
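The compaction step can be pictured as follows. This is a minimal in-memory sketch, assuming the intent is to drop reflexive pairs and keep one canonical orientation of each symmetric pair; the function name `compact` is hypothetical and this is not the pipeline used to produce the numbers above.

```python
def compact(pairs):
    """Drop reflexive pairs and keep one orientation of each symmetric pair."""
    seen = set()
    for s, o in pairs:
        if s == o:
            continue  # reflexive pair: carries no information
        # Canonical orientation: lexicographically smaller term first.
        pair = (s, o) if s < o else (o, s)
        if pair not in seen:
            seen.add(pair)
            yield pair


explicit = [(":a", ":b"), (":b", ":a"), (":a", ":a"), (":d", ":b")]
compacted = list(compact(explicit))
print(compacted)
```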

Create HDT: 4 hours (1 CPU core); 4.5GB + 2.2GB index

Calculate the implicit identity relation

RDF nodes N
key : ID ↦ Ƥ(N)
val : N ↦ ID
Identity closure for x := key(val(x))

Add explicit identity pair (x,y):

  1. Both are new: x ↦ id, y ↦ id, id ↦ {x,y}
  2. Only y is new: y ↦ val(x), val(x) ↦ key(val(x)) ∪ {y}
  3. Both are old: val(x) ↦ key(val(x)) ∪ key(val(y)),
    ∀ y'∈key(val(y)) . y' ↦ val(x)
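The update rules above can be sketched with two dictionaries standing in for the key and val maps. This is an assumption-laden illustration: it uses in-memory Python dicts rather than RocksDB, the class name `IdentityClosure` is hypothetical, it adds the symmetric "only x is new" case, and it merges the smaller set into the larger one, which the rules above do not prescribe.

```python
import itertools


class IdentityClosure:
    def __init__(self):
        self.key = {}  # ID -> identity set (set of nodes)
        self.val = {}  # node -> ID
        self._ids = itertools.count()

    def add(self, x, y):
        """Add the explicit identity pair (x, y)."""
        x_id, y_id = self.val.get(x), self.val.get(y)
        if x_id is None and y_id is None:
            # 1. Both are new: create a fresh identity set.
            new_id = next(self._ids)
            self.val[x] = self.val[y] = new_id
            self.key[new_id] = {x, y}
        elif y_id is None:
            # 2. Only y is new: add y to the identity set of x.
            self.val[y] = x_id
            self.key[x_id].add(y)
        elif x_id is None:
            # Symmetric variant of case 2: only x is new.
            self.val[x] = y_id
            self.key[y_id].add(x)
        elif x_id != y_id:
            # 3. Both are old: merge the two identity sets.
            # Assumption: move the smaller set to limit val updates.
            if len(self.key[x_id]) < len(self.key[y_id]):
                x_id, y_id = y_id, x_id
            moved = self.key.pop(y_id)
            self.key[x_id] |= moved
            for node in moved:
                self.val[node] = x_id

    def closure(self, x):
        """Identity closure of x, i.e. key(val(x)); unseen nodes are singletons."""
        if x not in self.val:
            return {x}
        return self.key[self.val[x]]


ic = IdentityClosure()
ic.add(":a", ":b")
ic.add(":d", ":b")
print(sorted(ic.closure(":a")))
print(sorted(ic.closure(":c")))
```

Because each pair is processed on its own, new statements can be added later without recomputing the closure from scratch.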

5 hours (2 CPU cores); 9.3GB disk (RocksDB)


Analysis: explicit identity relation

Aggregation by namespace

Aggregate 558.9M owl:sameAs statements into 2,618 namespaces, 10,791 edges, 142 components
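A rough sketch of such an aggregation is shown below. How a namespace is derived from an IRI is an assumption here (everything up to and including the last `#` or `/`); the function names and the example.org / example.com IRIs are hypothetical.

```python
from collections import Counter


def namespace(iri):
    """Assumed namespace: the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[:cut + 1] if cut >= 0 else iri


def aggregate_by_namespace(pairs):
    """Count owl:sameAs statements per (subject namespace, object namespace) edge."""
    edges = Counter()
    for s, o in pairs:
        edges[(namespace(s), namespace(o))] += 1
    return edges


pairs = [
    ("http://example.org/res/Paris", "http://example.com/id/paris"),
    ("http://example.org/res/Rome", "http://example.com/id/rome"),
    ("http://example.org/res/Rome", "http://example.org/res/Roma"),
]
edges = aggregate_by_namespace(pairs)
for (src, dst), count in edges.items():
    print(src, "->", dst, count)
```

An edge whose source and target namespace coincide corresponds to an internal link within one namespace.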

Relatively few namespaces have internal links. (An indicator that RDF datasets enforce the Unique Name Assumption (UNA) internally.)

Domain-specific identity hubs:

  • Bibliographic datasets:
  • Geographic datasets:
  • Biochemistry datasets:
  • Online reviews:

Analysis: implicit identity relation

Number of identity sets in the implicit identity relation

5,044,948,869 singleton identity sets

48,999,148 non-singleton identity sets

Non-singleton identity sets

31,337,556 identity sets (63.96%) have cardinality 2

The largest identity set has cardinality 177,794. It includes Albert Einstein, the countries of the world, and the empty string. This single set is responsible for 31,610,706,436 statements (177,794², about 90%) of the implicit identity relation.

Kernel calculation

The kernel is the minimum number of owl:sameAs triples needed to express the full materialization.

Nice use case: stream through the full identity closure.

Processing time: 55.6 seconds (3 CPU cores)

Kernel size: 130,673,158 triples

0.37% of the implicit identity relation

23.4% of the explicit identity relation
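The relationship between identity sets, kernel, and closure can be sketched as follows, assuming that an identity set of n members needs n − 1 triples in the kernel (for instance a star from one member to all others) and contributes n² pairs to the closure. The function names are hypothetical, and this is not the streaming implementation timed above.

```python
def kernel_size(identity_sets):
    """Minimum number of triples: n - 1 per identity set of size n."""
    return sum(len(s) - 1 for s in identity_sets)


def closure_size(identity_sets):
    """Number of pairs in the full closure: n * n per identity set of size n."""
    return sum(len(s) ** 2 for s in identity_sets)


def kernel_triples(identity_sets):
    """One possible kernel: link the smallest member of each set to the others."""
    for s in identity_sets:
        members = sorted(s)
        for other in members[1:]:
            yield (members[0], other)


sets = [{":a", ":b", ":d"}, {":c"}]
print(kernel_size(sets), closure_size(sets))
print(list(kernel_triples(sets)))
# The largest identity set alone contributes 177,794 squared pairs.
print(177_794 ** 2)
```

Singleton identity sets contribute nothing to the kernel, which is why the kernel is so much smaller than the closure.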

Use cases / Why are we doing this?

owl:sameAs triples about ‘Barack Obama’

Formal semantics: all these identifiers denote the exact same thing.

But are they really the same thing?



‘Barack Obama’ after community detection

purple: person
green: president
blue: senator
orange: government

Future work

Thank you!

Try it yourself:
