The Semantic Frontier

Going Where No SPARQL Endpoint Has Gone Before

Wouter Beek

November 6th, 2017

What is the cost of access?

Cost of access too high for scientists

ISWC research papers use 2 datasets on average.

You need a €100K+ something something cluster!

Scalability ‘solved’ in research

  • Large-scale storage
  • Large-scale querying
  • Integration in business processes
  • Large-scale reasoning
  • Large-scale everything

Where are these services running today (free or paid)?

Allocative Efficiency

What the consumer is willing to pay should equal the marginal cost of production

Why is allocative efficiency not reached?

  • Existing deployments are closed systems (no Metcalfe effect)
  • Negative incentive model for data that is good & open
  • Marginal cost of production goes up with №nodes

Negative incentive model

SPARQL endpoints are either valuable or available

Metcalfe's Law

The value of a network is proportional to the square of the number of connected nodes
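
To make the "no Metcalfe effect" point concrete, here is a minimal sketch comparing one connected network with the same nodes split over closed silos. The function names and the node counts are illustrative assumptions, and the constant of proportionality is left out.

```python
def network_value(nodes: int) -> int:
    """Metcalfe-style value: proportional to nodes squared (constant omitted)."""
    return nodes * nodes


def siloed_value(nodes: int, silos: int) -> int:
    """Total value when the same nodes are split evenly over isolated silos."""
    per_silo = nodes // silos
    return silos * network_value(per_silo)


# Hypothetical numbers: 1000 nodes, either fully connected or in 10 closed systems.
connected = network_value(1000)
closed = siloed_value(1000, 10)
print(connected, closed, connected // closed)
```

Under this simple model, splitting a network into k equal closed systems divides its total value by k.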

№Internet hosts


Our approach (180°)

Not a memory-based multi-node cluster

The cheapest thing in computing today: disk


The cost of access (our approach)

  • 1 file
  • 28,362,198,927 unique triples
  • >650K data documents
  • 524 GB of disk space
  • 15.7 GB of RAM
  • €305,- hardware cost
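
The figures above can be turned into a few back-of-the-envelope ratios. This is only a sketch: it assumes 1 GB = 10^9 bytes, which the slide does not specify, and the variable names are illustrative.

```python
# Figures taken from the slide.
TRIPLES = 28_362_198_927
DISK_GB = 524
RAM_GB = 15.7
HARDWARE_EUR = 305

# Assumption: decimal gigabytes. With binary gigabytes the result is about 7% higher.
BYTES_PER_GB = 10**9

bytes_per_triple = DISK_GB * BYTES_PER_GB / TRIPLES
ram_to_disk = RAM_GB / DISK_GB
triples_per_euro = TRIPLES / HARDWARE_EUR

print(f"{bytes_per_triple:.1f} bytes of disk per triple")
print(f"RAM is {ram_to_disk:.1%} of the disk footprint")
print(f"{triples_per_euro:,.0f} triples per euro of hardware")
```

Under that assumption the file needs fewer than 20 bytes of disk per triple, and the RAM footprint is a small fraction of the disk footprint.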


LOD-a-lot: Go Where No SPARQL Endpoint Has Gone Before

  • Term enumerators
  • Triple enumerators
  • Triple pattern cardinalities
  • Random triples from triple pattern
  • Random terms (by position / by type)
  • Exact prefix match
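
The operations listed above can be illustrated with a toy in-memory store. This is a hypothetical sketch, not the LOD-a-lot interface: the class name, method names, and example terms are assumptions, and the linear scans used here would be replaced by compressed indexes in a real on-disk file.

```python
import bisect
import random
from typing import Iterator, Optional

Triple = tuple[str, str, str]


class ToyTripleStore:
    """In-memory stand-in for a read-only triple file (hypothetical API)."""

    def __init__(self, triples: list[Triple]) -> None:
        self.triples = sorted(set(triples))
        # Sorted distinct terms, used for prefix search.
        self.terms = sorted({term for triple in self.triples for term in triple})

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Triple enumerator: yield triples matching a pattern (None = wildcard)."""
        for triple in self.triples:
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple

    def count(self, s=None, p=None, o=None) -> int:
        """Triple pattern cardinality."""
        return sum(1 for _ in self.match(s, p, o))

    def random_triple(self, s=None, p=None, o=None, rng=random) -> Triple:
        """Random triple from a triple pattern (raises IndexError if no match)."""
        return rng.choice(list(self.match(s, p, o)))

    def terms_at(self, position: int) -> list[str]:
        """Term enumerator by position: 0 = subject, 1 = predicate, 2 = object."""
        return sorted({triple[position] for triple in self.triples})

    def prefix(self, prefix: str) -> list[str]:
        """Prefix match: binary search to the first candidate, then scan forward."""
        start = bisect.bisect_left(self.terms, prefix)
        result = []
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            result.append(term)
        return result


# Made-up example data using prefixed names for brevity.
store = ToyTripleStore([
    ("ex:Cat", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
    ("ex:tom", "rdf:type", "ex:Cat"),
])
print(store.count(p="rdfs:subClassOf"))
print(store.prefix("ex:C"))
```

Enumerating a schema then amounts to calling `match` with a schema predicate such as `rdfs:subClassOf` and wildcards in the other positions.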

Example: Enumerating schema


Empirical semantics





Example: Identity closure

558,943,116 owl:sameAs triples
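
One common way to compute an identity closure is a union-find (disjoint-set) structure over the owl:sameAs pairs. The sketch below assumes that technique; the slides do not say how the closure was actually computed, and the function name and example terms are hypothetical.

```python
from collections import defaultdict


def identity_sets(same_as_pairs):
    """Group terms into sets that are connected by owl:sameAs statements."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        # Walk up to the representative of this term's set.
        root = term
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited term directly at the root.
        while parent[term] != root:
            parent[term], term = root, parent[term]
        return root

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for term in parent:
        groups[find(term)].add(term)
    return list(groups.values())


# Made-up terms: "a", "b", "c" are linked transitively; "x" and "y" form a second set.
sets = identity_sets([("a", "b"), ("b", "c"), ("x", "y")])
print(sorted(sorted(s) for s in sets))
```

Because owl:sameAs is symmetric and transitive, a single incorrect link merges two entire identity sets, which is why the cleaning step matters.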

Cleaning owl:sameAs

Community 1: Obama, the person (e.g. ባራክ_ኦባማ)

Community 2: Obama, the administration

Thank you!