The Dataset
The UK government was one of the pioneers in releasing its data as open data. Much of that data is hard to use as published, however, and the Open Knowledge Foundation works on making sense of this UK Messy Open Data. The source files of this dataset can be retrieved here.
The dataset consists of a collection of disparate Excel files that are very difficult to query in a systematic way. We usually refer to such a collection as a messy spreadsheet collection (MSC).
Dataset Integration Workflow
To integrate this MSC on the Web, we use the Integrator, a framework for integrating any kind of MSC using Web technology. The dataset is converted to RDF and integrated using Web vocabularies. The result is served through this SPARQL endpoint (use the named graphs <urn:graph:uk-messy:raw-data>, <urn:graph:uk-messy:rules> and <urn:graph:uk-messy:release>). The following example query shows the resulting integration for a subset of the MSC:
PREFIX tablink: <http://bit.ly/cedar-tablink#>
PREFIX cedar: <http://bit.ly/cedar#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?obs ?year ?force ?type ?offence
FROM <urn:graph:uk-messy:release>
WHERE {
  ?obs a qb:Observation ;
       sdmx-dimension:refPeriod ?year ;
       sdmx-dimension:refArea ?force ;
       cedar:offenceType ?type ;
       cedar:population ?offence .
}
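A query like the one above can also be sent programmatically. The following is a minimal Python sketch using only the standard library, assuming the endpoint accepts queries over HTTP GET and returns SPARQL JSON results. The endpoint URL and the helper names (`build_request`, `rows_from_results`, `run_query`) are placeholders for illustration and are not part of the Integrator.

```python
import json
import urllib.parse
import urllib.request

# Placeholder URL (assumption): replace with the actual SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# A reduced version of the example query, limited to ten observations.
QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?obs
FROM <urn:graph:uk-messy:release>
WHERE { ?obs a qb:Observation . }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of dicts."""
    data = json.loads(payload)
    variables = data["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


def run_query(endpoint=ENDPOINT, query=QUERY):
    """Send the query over HTTP and return the flattened rows."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return rows_from_results(response.read().decode("utf-8"))
```

Calling `run_query()` requires network access to a real endpoint; `build_request` and `rows_from_results` can be exercised offline.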
Data Location Definitions
Data in this MSC is arbitrarily located in several places of the spreadsheet layout. To define precisely where data observations and dimensions are located, we mark up the source data with cell styles. A sample of this markup can be found here.
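To illustrate how style markup can drive the location of observations, here is a small Python sketch. It is not the Integrator's implementation: the style names (`Data`, `RowHeader`, `ColHeader`), the in-memory sheet, and the function `locate_observations` are all assumptions made for the example, and the sketch assumes a single header row with each row header placed to the left of its data cells.

```python
# Hypothetical style names; the styles used in the real markup may differ.
DATA, ROW_HEADER, COL_HEADER = "Data", "RowHeader", "ColHeader"

# Each cell is a (value, style) pair; a style of None means "not marked up".
SHEET = [
    [("", None), ("2010", COL_HEADER), ("2011", COL_HEADER)],
    [("Avon and Somerset", ROW_HEADER), (120, DATA), (134, DATA)],
    [("Notes", None), ("", None), ("", None)],
]


def locate_observations(sheet):
    """Pair every Data cell with the row and column headers that apply to it."""
    # First pass: remember which column each column header sits in.
    col_headers = {}
    for row in sheet:
        for c, (value, style) in enumerate(row):
            if style == COL_HEADER:
                col_headers[c] = value

    # Second pass: walk each row left to right, tracking the latest row header.
    observations = []
    for row in sheet:
        row_header = None
        for c, (value, style) in enumerate(row):
            if style == ROW_HEADER:
                row_header = value
            elif style == DATA:
                observations.append(
                    {"row": row_header, "column": col_headers.get(c), "value": value}
                )
    return observations
```

Unstyled cells, such as the notes row in the sample sheet, are skipped, which is the point of the markup: only cells explicitly marked as data or headers contribute to the output.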
Conciliation Rules
Dimensions in this MSC are implicitly defined. To make them explicit, we generate a set of context-aware conciliation mappings that can be easily written by domain experts. Example mapping files for several dimensions can be found here. A master metadata file serves as an index to all defined mapping files.
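The idea of a context-aware mapping can be sketched in Python as a lookup keyed by both the context a label appears in and the label itself. Everything in this sketch is hypothetical: the context names, the labels, the target dimension values, and the function `conciliate` are made up for illustration and do not reflect the format of the actual mapping files.

```python
# Hypothetical mappings: (context, raw label) -> (dimension, harmonized value).
# A context of None marks a fallback that applies in any context.
MAPPINGS = {
    ("table-a", "2010/11"): ("refPeriod", "2010"),
    ("table-b", "2010/11"): ("refPeriod", "2011"),
    (None, "Avon & Somerset"): ("refArea", "avon-and-somerset"),
}


def conciliate(context, label, mappings=MAPPINGS):
    """Resolve a raw label to an explicit dimension value, or None if unmapped."""
    # Normalize whitespace so stray spaces in a cell do not break the lookup.
    clean = " ".join(label.split())
    # Prefer a mapping specific to this context, then the context-free fallback.
    if (context, clean) in mappings:
        return mappings[(context, clean)]
    return mappings.get((None, clean))
```

In this sketch the same raw label resolves differently depending on where it appears, while labels with a single meaning are written once as a fallback.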