LOD Laundromat

Cleaning Other People's Dirty Data

August 23rd, 2018

Wouter Beek (w.g.j.beek@vu.nl)

Why is data dirty?

  • Character encoding issues
  • Socket errors
  • Protocol errors
  • Corrupted archives
  • Authentication problems
  • Syntax errors
  • Wrong metadata
  • Lexical form ↛ value
  • Non-canonical lexical form
  • Logical inconsistencies

Approach

  1. Scrape all LOD
  2. Clean all LOD
  3. Publish all LOD in a standards-compliant form

Large-scale service / low hardware footprint

  • Hundreds of thousands of datasets
  • Tens of billions of triples

LOD Laundromat


Semantic Web Layer Cake

LOD Laundromat Layer Cake

Frank

Federated Resource Architecture for Networked Knowledge

https://github.com/LOD-Laundromat/Frank

W. Beek & L. Rietveld, 2015. “Frank: The LOD Cloud at your Fingertips” Extended Semantic Web Conference: Developers Workshop.

Howto

Program

  1. Upload data
  2. Download data
  3. Use metadata
  4. Use data
  5. Find data
  6. Build data applications

[1] Upload data

LOD Basket

[2.1] Download data: Web site

LOD Wardrobe

[2.2] Download data: Web service

http://download.lodlaundromat.org/MD5
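
MD5 in the URL above stands for the hash that identifies one cleaned document. A minimal Python sketch of building such a download URL follows. It assumes the identifier is a 32-character hexadecimal MD5 string; the function name download_url is hypothetical and not part of any LOD Laundromat client.

```python
import re

# Base URL taken from the slide; the document hash is appended to it.
DOWNLOAD_BASE = "http://download.lodlaundromat.org/"


def download_url(md5: str) -> str:
    """Return the download URL for the document identified by `md5`."""
    # Assumption: document identifiers are 32 hexadecimal characters.
    if not re.fullmatch(r"[0-9a-fA-F]{32}", md5):
        raise ValueError(f"not an MD5 hex digest: {md5!r}")
    return DOWNLOAD_BASE + md5.lower()
```

The resulting URL can be handed to any HTTP client or command-line downloader.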

[2.3] Download data: Frank

$ frank documents
$ frank statements

[3.1] Use metadata

Widgets

http://lodlaundromat.org/visualizations

[3.2] Metadata: dereference

http://lodlaundromat.org/resource/MD5

[3.3] Metadata: endpoint

prefix llo: <http://lodlaundromat.org/ontology/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ((?duplicates/xsd:double(?duplicates+?triples)) as ?relative) ?url {
 ?dataset
   llo:duplicates ?duplicates;
   llo:triples ?triples;
   llo:url ?url.
   filter(?triples > 0)
}
order by desc(?relative)
limit 50
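
The ranking expression in the query above can be read as plain arithmetic. The Python sketch below mirrors it, assuming (as the query's denominator suggests) that llo:triples counts the triples kept after cleaning and llo:duplicates counts the ones removed. The function name relative_duplicates is hypothetical.

```python
def relative_duplicates(duplicates: int, triples: int) -> float:
    """Share of duplicate statements among all statements of one dataset."""
    # Mirrors filter(?triples > 0): datasets without triples are skipped.
    if triples <= 0:
        raise ValueError("dataset has no triples")
    # Mirrors ?duplicates / xsd:double(?duplicates + ?triples).
    return duplicates / float(duplicates + triples)


# A dataset with 250 duplicates and 750 remaining triples.
print(relative_duplicates(250, 750))  # 0.25
```

The cast to a double in the query serves the same purpose as float() here: it forces a fractional result instead of integer division.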

[3.4] Metadata: frank

$ frank documents --minTriples=1000 --maxTriples=10000

[4.1] Use data

LOD Wardrobe

[4.1] Use data: Web site

http://ldf.lodlaundromat.org/MD5

[4.2] Use data: Web service

http://ldf.lodlaundromat.org/MD5

Parameters

  • subject
  • predicate
  • object
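
The three parameters above select a triple pattern within one document. A minimal Python sketch of composing such a request URL follows. It assumes the parameters are passed as ordinary URL query parameters and that omitted ones act as wildcards; the function name fragment_url is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Base URL taken from the slide; the document hash is appended to it.
LDF_BASE = "http://ldf.lodlaundromat.org/"


def fragment_url(md5: str,
                 subject: Optional[str] = None,
                 predicate: Optional[str] = None,
                 obj: Optional[str] = None) -> str:
    """Build the URL for a triple pattern over the document `md5`."""
    # Only the bound positions are sent; unbound ones are left out.
    pairs = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pairs if value is not None}
    url = LDF_BASE + md5
    return url + "?" + urlencode(params) if params else url
```

For example, passing only predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" would request all rdf:type statements in the document.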

[5.1] Find data: namespace

Namespace index

[5.2] Find data: resource

Resource index

[6] Build data-intensive applications

Cheatsheet

[1] Upload data
LOD Basket
[2] Download data
http://download.lodlaundromat.org/MD5
[3] Use metadata
Endpoint
[4] Use data
http://ldf.lodlaundromat.org/MD5
[5] Find data
http://index.lodlaundromat.org
[6] Build data applications

Thank you for your attention!