A Single-File Enabler for Data Science

Wouter Beek w.g.j.beek@vu.nl, Javier Fernández javier.fernandez@wu.ac.at, Ruben Verborgh ruben.verborgh@ugent.be

What is the cost of access?

Datasets we use

ISWC research papers use 2 datasets on average.

What is the cost of access?

You need a €10K+ cluster!

Low-cost LOD access

  • 1 file
  • 28,362,198,927 unique triples
  • >650K data documents
  • 524 GB of disk space
  • 15.7 GB of RAM
  • €305 hardware cost
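To put the figures above in proportion, the following back-of-the-envelope sketch derives the average storage per triple and the hardware cost per billion triples. It assumes that "GB" means 10^9 bytes, which the slides do not specify; the constant names are illustrative only.

```python
# Figures taken from the slide; the names are illustrative.
TRIPLES = 28_362_198_927
DISK_BYTES = 524 * 10**9   # assumption: decimal gigabytes (10^9 bytes)
HARDWARE_EUR = 305

# Average on-disk footprint of one triple in the single file.
bytes_per_triple = DISK_BYTES / TRIPLES

# Hardware cost normalised to one billion triples.
eur_per_billion_triples = HARDWARE_EUR / (TRIPLES / 1e9)

print(f"{bytes_per_triple:.1f} bytes per triple")
print(f"EUR {eur_per_billion_triples:.2f} per billion triples")
```

If the 524 GB figure is in binary gigabytes instead, the per-triple footprint is somewhat higher, but the order of magnitude is unchanged.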

How did we do it?

(1/3) LOD Laundromat

Beek & Rietveld et al., 2014, LOD Laundromat: a uniform way of publishing other people's dirty data

(2/3) Header Dictionary Triples (HDT)

Fernández & Martínez-Prieto & Gutiérrez, 2013, Binary RDF representation for publication and exchange (HDT)

(3/3) Linked Data Fragments (LDF)

Verborgh & Vander Sande et al., 2014, Web-Scale Querying through Linked Data Fragments

A Single-File Enabler for Data Science


  • Enumerate terms
  • Query for triple patterns
  • Retrieve metrics
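The operations listed above can be illustrated with a minimal sketch over an in-memory list of triples. This is not the HDT or LDF API: the function names `match` and `terms` and the toy data are hypothetical, and a real HDT file answers the same patterns from compressed indexes rather than by scanning.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy data; the prefixed names are placeholders, not taken from the dataset.
TRIPLES: List[Triple] = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "owl:sameAs", "ex:robert"),
]

def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t

def terms(triples: Iterable[Triple], position: int) -> List[str]:
    """Enumerate the distinct terms at one position (0=s, 1=p, 2=o)."""
    return sorted({t[position] for t in triples})

typed = list(match(TRIPLES, p="rdf:type"))   # pattern: ? rdf:type ?
predicates = terms(TRIPLES, 1)               # enumerate predicate terms
print(typed)
print(predicates)
```

Any combination of bound and unbound positions is one triple pattern, so the same `match` call covers all eight pattern shapes.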

Data Science use cases

  • obtaining statistics
  • enumerating schema
  • identity closure
  • graph navigation
  • query planning
  • random sampling for Machine Learning
  • generating specialized indexes
  • versioning
  • analyzing inconsistencies
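For the random sampling use case, one possible approach is reservoir sampling, which draws a uniform sample in a single pass without loading the data into memory. The sketch below is an assumption about how this could be done over a stream of triples, not the method used by the authors; `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Toy stream standing in for an iterator over triples.
stream = (("ex:s%d" % n, "ex:p", "ex:o%d" % n) for n in range(1000))
picked = reservoir_sample(stream, 5, random.Random(42))
print(picked)
```

The memory use depends only on the sample size `k`, which matters when the source holds billions of triples.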

Use case 1/3: Obtaining statistics

triples 28,362,198,927
subjects 3,214,347,198
predicates 1,168,932
objects 3,178,409,386
terms in both subject and object position 1,298,808,567
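The meaning of these counts can be made concrete with a small sketch that computes the same five statistics for a toy set of triples. The function name `statistics` and the data are hypothetical; with HDT the equivalent numbers would come from the file itself rather than from building sets in memory.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]

def statistics(triples: Iterable[Triple]) -> Dict[str, int]:
    """Count unique triples and distinct terms per position."""
    unique = set(triples)                 # duplicates count once
    subjects = {s for s, _, _ in unique}
    predicates = {p for _, p, _ in unique}
    objects = {o for _, _, o in unique}
    return {
        "triples": len(unique),
        "subjects": len(subjects),
        "predicates": len(predicates),
        "objects": len(objects),
        # Terms that occur as a subject in one triple and an object in another.
        "shared": len(subjects & objects),
    }

toy = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "owl:sameAs", "ex:robert"),
    ("ex:bob", "owl:sameAs", "ex:robert"),  # duplicate, ignored
]
stats = statistics(toy)
print(stats)
```

The "shared" count is what links triples into paths: only terms that appear in both positions can be traversed through.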

Use case 2/3: Enumerating schema


Use case 3/3: Identity closure

558,943,116 owl:sameAs triples
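An identity closure groups all terms that are connected through owl:sameAs links, directly or transitively. One common way to compute it is a union-find (disjoint-set) structure; the sketch below assumes this technique for illustration and uses hypothetical names and toy pairs, not the authors' implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple

def identity_closure(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group terms into equivalence classes from (x, y) sameAs pairs."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # merge the two classes

    classes: Dict[str, Set[str]] = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return list(classes.values())

# Toy pairs standing in for the subject and object of owl:sameAs triples.
same_as = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
closure = sorted(sorted(c) for c in identity_closure(same_as))
print(closure)
```

Symmetry and transitivity follow from the merge step, so the input pairs can arrive in any order and direction.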

Thank you!

Wouter Beek w.g.j.beek@vu.nl
Javier Fernández javier.fernandez@wu.ac.at
Ruben Verborgh ruben.verborgh@ugent.be