for Data Quality
September 12th, 2016
The value of a network is proportional
to the square of the number of connected nodes
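As a worked reading of this claim, assuming a constant of proportionality $k$ and a value function $V$ (both symbols are illustrative and not in the original slides):

$$V(n) = k\,n^{2} \quad\Longrightarrow\quad V(2n) = k\,(2n)^{2} = 4\,V(n)$$

Under that assumption, doubling the number of connected nodes quadruples the value of the network.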
So... how many connected nodes does the SW have?
Data growth is exponential
SW growth is linear
What is the problem?
After 15 years most data cannot be automatically:
- found
- read
- queried
- reasoned over
Many PhD students' worst nightmare…
Problem 1: Most data cannot be found
SotA findability comparable to 1995 Yahoo!
Problem 2: Most data cannot be read
E.g., Freebase <10% syntactically correct.
Current approaches are inherently slow: standards,
guidelines, best practices, tools, education.
This takes decades!
Why is data dirty?
- Character encoding issues
- Socket errors
- Protocol errors
- Corrupted archives
- Authentication problems
- Syntax errors
- Wrong metadata
- Lexical form ↛ value
- Non-canonical lexical form
- Logically inconsistent
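Two of the items above ("Lexical form ↛ value" and "Non-canonical lexical form") can be illustrated with a minimal sketch for `xsd:integer` literals. The function names are hypothetical and this is not the LOD Laundromat implementation; it only assumes the standard rule that an `xsd:integer` lexical form is an optional sign followed by decimal digits, with the canonical form having no `+` sign and no leading zeros.

```python
import re

# Lexical space of xsd:integer: an optional sign followed by one or more digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def parse_integer(lexical_form):
    """Map a lexical form to its integer value, or None if no value exists."""
    if _INTEGER.fullmatch(lexical_form) is None:
        return None
    return int(lexical_form)


def classify(lexical_form):
    """Classify a literal as 'invalid', 'non-canonical' or 'canonical'."""
    value = parse_integer(lexical_form)
    if value is None:
        # Lexical form does not map to any value.
        return "invalid"
    if str(value) != lexical_form:
        # Maps to a value, but is not written in the canonical way.
        return "non-canonical"
    return "canonical"


for form in ["42", "+042", "4.2"]:
    print(form, classify(form))
```

Non-canonical forms are still valid data; rewriting them to the canonical form is a cleaning step that changes the spelling of a literal but not its value.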
Problem 3: Most data cannot be queried
Problem 4: Most data cannot be reasoned over
- Web-scale reasoning is only performed in the lab
- Federation does not scale to thousands of
- Real-world reasoning immediately goes ex falso
How to solve this?
Beek & Rietveld & Bazoobandi & Wielemaker
& Schlobach “LOD laundromat: A Uniform Way of
Publishing Other People’s Dirty Data” ISWC,
How to query >30B statements (1/2)
How to query >30B statements (2/2)
Rietveld & Verborgh & Beek & Vander Sande
& Schlobach, “Linked Data-as-a-Service: The Semantic
Web Redeployed” ESWC 2015.
Part III: Why bother?
How generalizable is SW research?
ISWC 2014 Research Track:
- 17 datasets overall; avg. 2 per paper
- Data was cleaned locally and deleted afterwards
Reproducing “Linked Data Best Practices”
L. Rietveld & W. Beek & S. Schlobach, “LOD
Lab: Experiments at LOD Scale”, International Semantic
Web Conference, 2015 (Best Paper Award).
Calculate a metric over all WoD documents:
frank documents --downloadUri |
Beek & Rietveld, ‘Frank: The LOD Cloud at your
Fingertips’ in ESWC Developers Workshop,
Large-scale Data Quality Improvement (1/2):
Large-scale Data Quality Improvement (2/2): Language
Beek & Ilievski & Debattista & Schlobach
& Wielemaker, ‘Literally Better: Analyzing and
Improving the Quality of Literals’ under submission.
Evaluation results for ±600,000 datasets
De Rooij & Beek & Bloem & Schlobach &
Van Harmelen, ‘Are Names Meaningful? Quantifying Social
Meaning on the Semantic Web’ in ISWC, 2016.
Semantic Search Engine
Ilievski & Beek & Van Erp & Rietveld &
Schlobach, ‘LOTUS: Adaptive Text Search for Big Linked
Data’, ESWC 2016.
Part IV: Is semantic data quality
what we think it is?
owl:sameAs has 2 meanings
$$a = b \,\longleftrightarrow\, (\forall P)(Pa = Pb)$$
“Include links to other URIs, to discover more things.”
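The strict reading in the formula above quantifies over all properties. A minimal sketch of how that differs from indiscernibility under a restricted set of properties is given below. It is a toy illustration, not the formal semantics of the cited paper; the triples, the `ex:` names and both function names are hypothetical.

```python
def properties(triples, subject, predicates=None):
    """Return the (predicate, object) pairs stated about `subject`.

    If `predicates` is given, only those predicates are considered.
    """
    return {
        (p, o)
        for s, p, o in triples
        if s == subject and (predicates is None or p in predicates)
    }


def indiscernible(triples, a, b, predicates=None):
    """True if `a` and `b` share exactly the same considered properties."""
    return properties(triples, a, predicates) == properties(triples, b, predicates)


# Hypothetical toy data: two resources that agree on one property only.
TRIPLES = [
    ("ex:a", "ex:formula", "C9H8O4"),
    ("ex:b", "ex:formula", "C9H8O4"),
    ("ex:a", "ex:price", "5"),
]

# Over all stated properties the two resources differ (ex:price).
print(indiscernible(TRIPLES, "ex:a", "ex:b"))
# Restricted to ex:formula they cannot be told apart.
print(indiscernible(TRIPLES, "ex:a", "ex:b", {"ex:formula"}))
```

In this sketch a link between `ex:a` and `ex:b` would be wrong under the strict reading but defensible relative to the context `{ex:formula}`.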
Contextual semantics for
Beek & Schlobach & Van Harmelen, ‘A
Contextualised Semantics for
in International Semantic Web Conference,
pp. 405–419, 2016.