HDT Tutorial: LOD Laundromat


Wouter Beek w.g.j.beek@vu.nl, Javier Fernández javier.fernandez@wu.ac.at, Ruben Verborgh ruben.verborgh@ugent.be

Data scientists spend N% of their time on data cleaning

lodlaundromat.org



Beek & Rietveld & Bazoobandi & Wielemaker & Schlobach, “LOD Laundromat: A Uniform Way of Publishing Other People’s Dirty Data”, International Semantic Web Conference, 2014. Best Linked Open Data Application Award 2015.

Data cleaning

  • Character encoding issues
  • Socket errors
  • Protocol errors
  • Corrupted archives
  • Authentication problems
  • Syntax errors
  • Incorrect metadata

How to query >30B statements (1/2)

How to query >30B statements (2/2)

Rietveld & Verborgh & Beek & Vander Sande & Schlobach, “Linked Data-as-a-Service: The Semantic Web Redeployed”, European Semantic Web Conference, 2015.

A Future Laundromat?

  • More data (> 100B triples)
  • Lexical forms must map to their value space
  • Canonical lexical forms
  • IRI Closure over percent encoding and HTTP(S)?
  • Snapshots over time
  • What is a dataset?
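The two bullets on lexical forms can be illustrated with a minimal sketch. It covers only xsd:boolean and xsd:integer, and the function name `canonical_lexical_form` is hypothetical (it is not part of LOD Laundromat). A lexical form that does not map to any value of its datatype is rejected; one that does is rewritten to its canonical form.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

def canonical_lexical_form(lexical: str, datatype: str) -> str:
    """Return the canonical lexical form of a literal.

    Raises ValueError when the lexical form does not map to a value
    of the datatype. Hypothetical helper; handles two datatypes only.
    """
    # Both datatypes collapse surrounding XML whitespace.
    text = lexical.strip(" \t\r\n")
    if datatype == XSD + "boolean":
        if text in ("true", "1"):
            return "true"
        if text in ("false", "0"):
            return "false"
        raise ValueError(f"not an xsd:boolean: {lexical!r}")
    if datatype == XSD + "integer":
        # Optional sign followed by one or more ASCII digits.
        if re.fullmatch(r"[+-]?[0-9]+", text) is None:
            raise ValueError(f"not an xsd:integer: {lexical!r}")
        # int() drops the leading '+' and leading zeros; -0 becomes 0.
        return str(int(text))
    raise ValueError(f"unsupported datatype: {datatype}")
```

With this sketch, `"+007"` and `"7"` typed as xsd:integer end up as the same literal, so they are no longer counted as two distinct terms.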

LOD-a-lot

LOD-a-lot: Obtaining statistics

triples 28,362,198,927
subjects 3,214,347,198
predicates 1,168,932
objects 3,178,409,386
subjects & objects (shared) 1,298,808,567
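A short worked calculation on these figures, assuming the last row counts terms that occur both as a subject and as an object. Under that assumption, inclusion-exclusion gives the number of distinct terms in subject or object position.

```python
# Figures copied from the statistics above.
triples = 28_362_198_927
subjects = 3_214_347_198
objects = 3_178_409_386
shared = 1_298_808_567  # assumption: terms used as both subject and object

# Inclusion-exclusion: shared terms are otherwise counted twice.
distinct_nodes = subjects + objects - shared
print(f"{distinct_nodes:,}")  # 5,093,948,017

# Average number of triples per distinct subject.
triples_per_subject = triples / subjects
print(round(triples_per_subject, 2))
```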

Low-cost LOD access

  • 1 file
  • 28,362,198,927 unique triples
  • >650K data documents
  • 524 GB of disk space
  • 15.7 GB of RAM
  • €305,- hardware cost
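To put these numbers in proportion, a back-of-the-envelope calculation follows. It assumes decimal gigabytes (10^9 bytes); with binary gigabytes the per-triple figure would be somewhat higher.

```python
triples = 28_362_198_927
disk_bytes = 524 * 10**9   # assumption: 1 GB = 10^9 bytes
hardware_eur = 305

# Average on-disk footprint of a single triple.
bytes_per_triple = disk_bytes / triples

# Hardware cost per billion triples.
eur_per_billion_triples = hardware_eur / (triples / 10**9)

print(f"{bytes_per_triple:.1f} bytes per triple")
print(f"{eur_per_billion_triples:.2f} EUR per billion triples")
```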

HDT: Go Where No SPARQL Endpoint Has Gone Before

  • Term enumerators
  • Triple enumerators
  • Triple pattern cardinalities
  • Random triples from triple pattern
  • Random terms (by position / by type)
  • Exact prefix match
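The operations listed above can be sketched on a toy in-memory index. This is not the HDT API: the class `ToyTripleIndex` and all of its method names are hypothetical, and the sketch scans a plain list where HDT would answer from its compressed dictionary and triple structures. It is meant only to show what each operation returns.

```python
import random
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

class ToyTripleIndex:
    """Illustrative stand-in for an HDT file; linear scans only."""

    def __init__(self, triples):
        self.triples = sorted(set(triples))

    def match(self, s=None, p=None, o=None) -> Iterator[Triple]:
        """Triple enumerator: None acts as a wildcard."""
        for t in self.triples:
            if all(q is None or q == v for q, v in zip((s, p, o), t)):
                yield t

    def cardinality(self, s=None, p=None, o=None) -> int:
        """Number of triples matching a triple pattern."""
        return sum(1 for _ in self.match(s, p, o))

    def terms(self, position: int) -> list:
        """Term enumerator: 0 = subject, 1 = predicate, 2 = object."""
        return sorted({t[position] for t in self.triples})

    def random_triple(self, s=None, p=None, o=None) -> Optional[Triple]:
        """Random triple from a triple pattern, or None if none match."""
        matches = list(self.match(s, p, o))
        return random.choice(matches) if matches else None

    def random_term(self, position: int) -> str:
        """Random term by position."""
        return random.choice(self.terms(position))

    def terms_with_prefix(self, prefix: str, position: int) -> list:
        """Exact prefix match on terms in the given position."""
        return [t for t in self.terms(position) if t.startswith(prefix)]

EXAMPLE = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
]
index = ToyTripleIndex(EXAMPLE)
print(index.cardinality(p="rdf:type"))     # 2
print(index.terms_with_prefix("ex:", 0))   # ['ex:alice', 'ex:bob']
```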

Example: Enumerating schema

comparison
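A minimal sketch of schema enumeration, under the assumption that "schema" here means the classes used as objects of rdf:type together with the predicates in use. The function name `enumerate_schema` is hypothetical, and it iterates over plain tuples, whereas on HDT the same lists would come from the term enumerators.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def enumerate_schema(triples):
    """Return (classes, properties) observed in an iterable of triples."""
    classes, properties = set(), set()
    for _subject, predicate, obj in triples:
        # Every predicate in use is a property.
        properties.add(predicate)
        # Every object of rdf:type is a class.
        if predicate == RDF_TYPE:
            classes.add(obj)
    return sorted(classes), sorted(properties)

sample = [
    ("http://example.org/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/bob", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
]
classes, properties = enumerate_schema(sample)
```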

Example: Identity closure

558,943,116 owl:sameAs triples
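One common way to compute an identity closure is a union-find (disjoint-set) structure over the owl:sameAs pairs. The sketch below is a minimal in-memory version with a hypothetical function name; it is not necessarily the method used for the figure above, and at the scale of hundreds of millions of pairs a disk-backed or integer-ID variant would be needed.

```python
def identity_closure(same_as_pairs):
    """Group terms into identity sets given (a, b) owl:sameAs pairs."""
    parent = {}

    def find(x):
        # Register unseen terms as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two sets

    clusters = {}
    for term in list(parent):
        clusters.setdefault(find(term), set()).add(term)
    return list(clusters.values())

pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
identity_sets = identity_closure(pairs)
```

Because owl:sameAs is symmetric and transitive, `ex:a` and `ex:c` land in the same identity set even though no triple links them directly.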

Try it out yourself

