LOD Laundromat

The Web of Data at Your Fingertips

Wouter Beek (w.g.j.beek@vu.nl)

Slides at http://wouterbeek.github.io

The first deployment of the Semantic Web has failed


  • After 15 years most data cannot be automatically:
    • found
    • read
    • queried
    • reasoned over (not covered in this presentation)

Problem 1

Most data cannot be found

  • State of the art is comparable to the Yahoo! index circa 1995: a hierarchy of links / catalogues (CKAN, LOV, VoID-store)
  • Most SW datasets are not available online.
  • Most online datasets are not registered in any catalogue.

Problem 2

Most data cannot be read

Freebase 'Monkey' (< 10% syntactically correct)

Problem 3

Most data cannot be queried

  • Data dumps are the most popular deployment strategy
  • Many live queriable datasets have a custom API
  • Most custom APIs are not self-describing
  • Many SPARQL endpoints enforce restrictions
  • Most SPARQL endpoints that do not enforce restrictions have low availability
  • Different SPARQL endpoints enforce different restrictions
  • Different SPARQL endpoints implement different subsets of different versions of the SPARQL standard
  • Web-scale federated querying has not even been considered
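One of the restrictions listed above is a cap on result-set size. A minimal sketch of how a client can work around such a cap by paging with LIMIT/OFFSET is shown below; `run_query`, `CAP`, and the simulated result set are all hypothetical stand-ins for a real HTTP call to a real endpoint.

```python
CAP = 100  # hypothetical per-request result-size restriction

DATA = [f"result-{i}" for i in range(250)]  # simulated full result set

def run_query(query: str, limit: int, offset: int) -> list[str]:
    """Simulated endpoint: honours LIMIT/OFFSET but never returns more than CAP rows."""
    effective = min(limit, CAP)
    return DATA[offset:offset + effective]

def fetch_all(query: str, page_size: int = 100) -> list[str]:
    """Collect the complete result set by paging with LIMIT/OFFSET."""
    results, offset = [], 0
    while True:
        page = run_query(query, page_size, offset)
        results.extend(page)
        if len(page) < page_size:  # short page: no more results
            break
        offset += page_size
    return results

rows = fetch_all("SELECT ?s WHERE { ?s ?p ?o }")
print(len(rows))  # 250
```

Note that this only works for endpoints whose restriction is a result cap; endpoints that restrict query complexity or CPU time cannot be paged around this way, which is part of why different restrictions at different endpoints are so problematic.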

Query endpoint availability according to LODStats

SPARQL cannot fulfill the role of SW query language

There are millions of data documents but only hundreds of live query endpoints with reasonable availability.

Existing deployment techniques are unable to close the gap between downloadable data dumps and live queryable data.

Data is growing faster than SPARQL deployment uptake.

Why is it not working?

[1] Distributed approach

Coordination techniques:

  • Standards
  • Guidelines
  • Best practices
  • Tools

Targeted towards humans

Inherently slow

Mixed results after 15 years

[2] Data navigation broken by design

[3] Broken cost-benefit model

  • Publishing large volumes of high-quality data is penalized.
  • Consuming large volumes of data / asking DDOS-like queries is free.

How to redeploy the Semantic Web?

  1. Data collection
  2. Data cleaning
  3. Data publishing
  4. Data consumption
  5. Text-based search

[1] Data collection

[2] Data cleaning

New solution for data cleaning

(1) Automate conformity to standards

"Days not decades"

(2) Tools → Web Service

Look at email
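The core of the cleaning idea, automatically enforcing conformance to the standard, can be sketched as a line-by-line scrubber that keeps only well-formed triples. The regex below is a deliberately simplified illustration of an N-Triples line, not the full grammar and not the actual LOD Laundromat pipeline; the example data is made up.

```python
import re

# Simplified N-Triples line: IRI subject, IRI predicate, and an IRI,
# literal (optionally typed or language-tagged), or blank-node object.
NT_LINE = re.compile(
    r'^<[^>]+>\s+<[^>]+>\s+'
    r'(<[^>]+>|"[^"]*"(\^\^<[^>]+>|@[A-Za-z-]+)?|_:\w+)\s*\.\s*$'
)

def clean(document: str) -> list[str]:
    """Return only the syntactically well-formed triples."""
    return [line for line in document.splitlines() if NT_LINE.match(line.strip())]

dirty = """\
<http://ex.org/a> <http://ex.org/p> "ok" .
<http://ex.org/a> <http://ex.org/p> missing-brackets .
<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .
not a triple at all
"""

print(len(clean(dirty)))  # 2
```

Running this once as a centralized web service, rather than shipping it as a tool every publisher must run, is what turns "decades" of gradual uptake into "days".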

LOD Laundromat

The Clean Linked Data programming platform



Linked Closed Data deployment

[3] Data publishing

Fix the cost/benefit model

(1) Publishing data is free

(2a) Asking more questions increases client costs

(2b) Asking more complex questions increases client costs


HDT: Disk-based, efficient yet queryable storage

SSD: Disks become faster and cheaper

LDF: BGP queries require client-side joins
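The LDF point can be made concrete: the server only answers single triple patterns, and a basic graph pattern (BGP) is evaluated by joining those fragments on the client, so more complex questions cost the client more. The data and names in this sketch are invented for illustration.

```python
# Toy triple store standing in for a Linked Data Fragments server.
TRIPLES = [
    ("alice", "foaf:knows", "bob"),
    ("alice", "foaf:name", "Alice"),
    ("bob", "foaf:name", "Bob"),
    ("carol", "foaf:name", "Carol"),
]

def fragment(s=None, p=None, o=None):
    """Server side: answer one triple pattern (None = wildcard)."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Client side: evaluate the BGP { alice foaf:knows ?x . ?x foaf:name ?n }
# by requesting one fragment per pattern and joining on ?x locally.
names = []
for _, _, x in fragment(s="alice", p="foaf:knows"):
    for _, _, n in fragment(s=x, p="foaf:name"):
        names.append(n)

print(names)  # ['Bob']
```

The server's work per request stays cheap and cacheable; the join cost scales with the client's question, which is exactly the cost/benefit correction (2a) and (2b) call for.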

[4] Data consumption

Data navigation fixed


Federated Resource Architecture for Networked Knowledge



$ ./frank statements --predicate foaf:name | head -n 5
		eurostat:void.rdf#Eurostat foaf:name "Eurostat".
		author:5ff33...1c4 foaf:name "Dong-Mei Shi".
		author:d873s...19b foaf:name "Feng-Xia Ma".
		author:fbbcf...54c foaf:name "Ya-Guang Chen".
		author:1ec76...f4b foaf:name "Jian Yu".

[5] Natural language text search


Linked Open Text Unleashed

LOTUS in numbers

# literals                12,018,939,378
# integers and dates       6,699,148,542
# indexed strings          5,319,790,836
# distinct sources               508,244
# distinct languages                 713
# hours to create index               56
disk space use                 484.77 GB

How generalizable is SW research?

17 datasets are used in total

1-6 datasets per article

2 datasets per article on average

Study the SW as a complex system

Formal meaning vs social meaning

		abox:item1024 rdf:type       tbox:Tent    .
		abox:item1024 tbox:soldAt    abox:shop72  .
		abox:shop72   rdf:type       tbox:Store   .
		fy:jufn1024   pe:ko9sap_     fyufnt:Ufou  .
		fy:jufn1024   fyufnt:tmffqt  fy:aHup      .
		fy:aHup       pe:ko9sap_     fyufnt:70342 .

These graphs denote the same models.

According to model theory, IRIs are individual constants or predicate letters whose names are chosen arbitrarily and thus carry no meaning.
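This claim can be illustrated with a small sketch: any bijective renaming of the vocabulary yields a graph with exactly the same models. The slide's second graph is roughly the first one with every character shifted by one position; the `shift` function below is an illustrative stand-in for that renaming, not the exact encoding used on the slide.

```python
def shift(name: str, k: int = 1) -> str:
    """Bijectively rename a term by shifting letters and digits by k."""
    out = []
    for c in name:
        if c.isalpha():
            base = ord("a") if c.islower() else ord("A")
            out.append(chr((ord(c) - base + k) % 26 + base))
        elif c.isdigit():
            out.append(str((int(c) + k) % 10))
        else:
            out.append(c)
    return "".join(out)

graph = [
    ("abox:item1024", "rdf:type", "tbox:Tent"),
    ("abox:item1024", "tbox:soldAt", "abox:shop72"),
    ("abox:shop72", "rdf:type", "tbox:Store"),
]

renamed = [tuple(shift(term) for term in triple) for triple in graph]
for triple in renamed:
    print(triple)

# The renaming is invertible, so graph structure (and hence entailment)
# is preserved: shifting back by -1 recovers the original graph.
assert [tuple(shift(t, -1) for t in tr) for tr in renamed] == graph
```

Because the renaming is a bijection, every model of one graph induces a model of the other; formally, nothing distinguishes `tbox:Tent` from `Ufou`. Whatever meaning the name carries is social, not model-theoretic, and that is what the hypothesis test below probes.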

Try to refute the hypothesis that names and meanings are independent.

Hypothesis testing over 544K datasets