Quality & Scalability of Linked Data

October 10th, 2016

Wouter Beek (w.g.j.beek@vu.nl)

What is the problem?

Most data cannot be automatically:

  • found
  • read
  • queried
  • reasoned over

Many PhD students' worst nightmare…

Problem 1: Most data cannot be found

SotA findability comparable to 1995 Yahoo! index

Problem 2: Most data cannot be read

  • Character encoding issues
  • Socket errors
  • Protocol errors
  • Corrupted archives
  • Authentication problems
  • Syntax errors
  • Wrong metadata
  • Lexical form ↛ value
  • Non-canonical lexical form
  • Logically inconsistent
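
The last few items in this list can be made concrete with a small check. The following is a minimal sketch, not the LOD Laundromat implementation: it only covers xsd:integer, it skips the whitespace normalization that XML Schema applies before validation, and the function name `check_integer_literal` is a hypothetical one chosen for illustration.

```python
import re

# Lexical space of xsd:integer: an optional sign followed by one or more digits.
# Simplification (assumption): surrounding whitespace is not collapsed first.
_XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def check_integer_literal(lexical_form):
    """Classify an xsd:integer lexical form.

    Returns (status, canonical), where status is 'invalid' (the lexical
    form maps to no value), 'non-canonical' (it maps to a value but is not
    the canonical spelling of it) or 'canonical'.
    """
    if not _XSD_INTEGER.fullmatch(lexical_form):
        return ("invalid", None)
    # Round-tripping through the value drops '+' signs and leading zeros.
    canonical = str(int(lexical_form))
    status = "canonical" if canonical == lexical_form else "non-canonical"
    return (status, canonical)
```

For example, "abc" is a lexical form without a value, while "+01" has the value 1 but is not written in its canonical form "1".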

Problem 3: Most data cannot be queried

Problem 4: Most data cannot be reasoned over

  • Web-scale reasoning is only performed in the lab
  • Federation does not scale to thousands of endpoints
  • Real-world reasoning immediately goes ex falso quodlibet

How to solve this?


Beek & Rietveld & Bazoobandi & Wielemaker & Schlobach, “LOD Laundromat: A Uniform Way of Publishing Other People’s Dirty Data”, International Semantic Web Conference, 2014. Best Linked Open Data Application Award 2015.

How to query >30B statements (1/2)

How to query >30B statements (2/2)

Rietveld & Verborgh & Beek & Vander Sande & Schlobach, “Linked Data-as-a-Service: The Semantic Web Redeployed”, European Semantic Web Conference, 2015.

Built on top of the ClioPatria triple store: github.com/ClioPatria/ClioPatria

Wielemaker & Beek & Hildebrand & Van Ossenbruggen, “ClioPatria: A SWI-Prolog Infrastructure for the Semantic Web”, Semantic Web Journal, 2016.

Beek & Wielemaker, “SWISH: An Integrated Semantic Web Notebook”, International Semantic Web Conference, 2016 (to appear).

Alt. SW layer cake

Beek & Rietveld & Schlobach & Van Harmelen, “LOD Laundromat: Why the Semantic Web Needs Centralization (Even If We Don't Like It)”, IEEE Internet Computing, 20 (2), pp. 78-81, 2016.

Semantic Search Engine

Ilievski & Beek & Van Erp & Rietveld & Schlobach, “LOTUS: Adaptive Text Search for Big Linked Data”, European Semantic Web Conference, 2016. 2nd place, Linked Data Challenge, 2016.



Why bother?

How generalizable is SW research?

ISWC 2014 Research Track:

  • 17 datasets overall; avg. 2 per paper
  • Data was cleaned locally and deleted afterwards

Reproducing “Linked Data Best Practices” (Schmachtenberg 2014)

Original                         | LOD Lab
Prefix  #datasets  %datasets     | Prefix     #documents  %documents
rdf     996        98.22%        | rdf        639,575     98.40%
rdfs    736        72.58%        | time       443,222     68.19%
foaf    701        69.13%        | cube       155,460     23.92%
dcterm  568        56.01%        | sdmxdim    154,940     23.84%
owl     370        36.49%        | worldbank  147,362     22.67%
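
A table of this kind boils down to counting, per namespace, how many documents use it at least once. The sketch below illustrates that counting step only. It is not the procedure from either study: the input format (a dict from document name to its IRIs), the rule for cutting a namespace at the last '#' or '/', and the function names are all assumptions made for the example.

```python
from collections import Counter


def namespace_of(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def namespace_usage(documents):
    """Map each namespace to (#documents using it, % of all documents).

    `documents` maps a document name to an iterable of the IRIs that
    occur in that document (hypothetical input format).
    """
    counts = Counter()
    for iris in documents.values():
        # A set ensures each namespace counts at most once per document.
        counts.update({namespace_of(iri) for iri in iris})
    total = len(documents)
    return {ns: (n, 100.0 * n / total) for ns, n in counts.items()}
```

Counting per document rather than per statement is what makes the percentages comparable across very small and very large documents.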

Rietveld & Beek & Schlobach, “LOD Lab: Experiments at LOD Scale”, International Semantic Web Conference, 2015 (best paper award).

Calculate a metric over all WoD documents:

                      frank documents --downloadUri |


Beek & Rietveld, “Frank: The LOD Cloud at your Fingertips” in ESWC Developers Workshop, 2015.
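
The right-hand side of the pipeline is not shown on the slide. As one possible illustration, the sketch below reads download URIs line by line and computes a very simple metric, the number of statements, for each document. It assumes that every URI serves gzip-compressed N-Triples with one statement per line; the function names `count_statements`, `metric_for` and `report` are hypothetical and not part of Frank.

```python
import gzip
import urllib.request


def count_statements(lines):
    """Count the non-empty, non-comment lines of an N-Triples stream."""
    return sum(
        1 for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


def metric_for(download_uri):
    """Fetch one document and return its statement count.

    Assumption: the URI serves gzip-compressed N-Triples.
    """
    with urllib.request.urlopen(download_uri) as response:
        with gzip.open(response, mode="rt", encoding="utf-8") as lines:
            return count_statements(lines)


def report(uris, out):
    """Write one 'URI<TAB>metric' line to `out` per URI read from `uris`."""
    for uri in uris:
        uri = uri.strip()
        if uri:
            print(uri, metric_for(uri), sep="\t", file=out)
```

Calling `report(sys.stdin, sys.stdout)` at the end of a script would let it sit behind the pipe; any other per-document metric can be swapped in for `count_statements`.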

LOD Laundromat gives a Web-scale Data Quality report

Rietveld & Beek & Hoekstra & Schlobach, “Meta-Data for a Lot of LOD”, Semantic Web Journal, (to be published).

Large-scale Data Quality Improvement (1/2): Datatypes

Large-scale Data Quality Improvement (2/2): Language tags

Beek & Ilievski & Debattista & Schlobach & Wielemaker, ‘Literally Better: Analyzing and Improving the Quality of Literals’, under submission.

Evaluation results for ±600,000 datasets

De Rooij & Beek & Bloem & Schlobach & Van Harmelen, ‘Are Names Meaningful? Quantifying Social Meaning on the Semantic Web’ in ISWC, 2016.

Contextual semantics for owl:sameAs

Beek & Schlobach & Van Harmelen, ‘A Contextualised Semantics for owl:sameAs’ in International Semantic Web Conference, pp. 405-419, 2016.

Thank you!

WWW: wouterbeek.com

Mail: w.g.j.beek@vu.nl

Triply: triply.cc