The Dataset

The International Institute of Social History gathers and publishes an important dataset on historical prices, wages and welfare. The source files of this dataset can be retrieved here.

The dataset consists of a collection of disparate Excel files that are very difficult to query in a systematic way. We usually refer to such collection as a messy spreadsheet collection (MSC). An example is shown in the picture below, showing two of such spreadsheets where rice prices of Gambia and Japan in the 19th century can only be compared through laborious data munging.

Dataset Integration Workflow

To integrate this MSC on the Web, we use the Integrator, a framework for integrating any kind of MSC using Web technology. The dataset is converted to RDF and integrated using Web vocabularies. The result is served through this SPARQL endpoint (use the named graphs <urn:graph:prices-wages:raw-data>, <urn:graph:prices-wages:rules> and <urn:graph:prices-wages:release>). The following example query shows the resulting integration for a subset of the MSC:

PREFIX tablink: <http://bit.ly/cedar-tablink#>
PREFIX cedar: <http://bit.ly/cedar#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?obs ?year ?good ?price
FROM <urn:graph:prices-wages:release>
WHERE {
    ?obs a qb:Observation ;
         sdmx-dimension:refPeriod ?year ;
         cedar:goodPrice ?good ;
         cedar:population ?price .
}

Data Location Definitions

Data in this MSC is arbitrary located in several places of the spreadsheet layout. To precisely define where data observations and dimensions are located, we mark up with styles the source data. A sample of this markup can be found here.

Conciliation Rules

Dimensions in this MSC are implicitly defined. In order to make them explicit, we generate a set of context-aware conciliation mappings that can be easily written by domain experts. Example mapping files on several dimensions can be found here. A master metadata file is used as an index to all defined mapping files.

Measurement Transformations

Data in this MSC needs to be transforomed in order to be fully integrated. To this end, we extend SPARQL into SPARQLSS (SPARQL Speaks Statistics), a Web friendly way of transforming Web data without quiting the Web ecosystem. An example implementation that calculates currency and volume transformations for this MSC can be found here.

Error Detection

To detect data errors and inconsistencies after the execution of previous stages, we use Linked Edit Rules (LER). LER allow us to encode domain constraints that the data cubes with which the data cubes must be compliant. First, we generate this set of LER using domain expert knowledge. The results of checking these LER against the data, containing 368 observations that do not meet these LER, are available here.