Introduction
OpenRefine (previously Google Refine) is “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.” (Source: https://openrefine.org/) Most crucially for projects intending to use Wikidata as a final data repository, OpenRefine allows direct manipulation of data in Wikidata via a reconciliation service and an editing extension, all available within a graphical user interface and requiring no coding skills.
Downloading and installing OpenRefine
OpenRefine is a desktop tool, which users download and run locally on their own machines. It is currently not available as a cloud-based service for security reasons – working with your data only on your own machine ensures it remains private until you are ready to share it on an open platform such as Wikidata.
The official releases of OpenRefine can be downloaded directly from the project’s website: https://openrefine.org/download.html. Once OpenRefine is downloaded, you can follow a simple set of installation instructions to set it up. Although OpenRefine is a desktop app that runs locally, it is browser-based: launching it opens a new tab in your default browser, where you access OpenRefine’s user interface.
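By default, the local OpenRefine instance serves its interface at http://127.0.0.1:3333, so if the browser tab does not open automatically you can navigate to that address yourself. The short Python sketch below simply checks whether a local instance is reachable; the address is the default and will differ if you have changed OpenRefine’s host or port settings.

```python
# Minimal check that a local OpenRefine instance is reachable.
# http://127.0.0.1:3333 is OpenRefine's default address; adjust it if you
# have configured a different host or port.
import urllib.request

OPENREFINE_URL = "http://127.0.0.1:3333"

try:
    with urllib.request.urlopen(OPENREFINE_URL, timeout=5) as response:
        print(f"OpenRefine is running at {OPENREFINE_URL} (HTTP {response.status})")
except OSError:
    print("No OpenRefine instance found - start the app and try again.")
```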
Useful tutorials for getting started are available on the OpenRefine website, and many user-generated tutorials can be found on YouTube. We recommend starting with the ones linked from the homepage: https://openrefine.org/.
Using OpenRefine to prepare, update and standardize your data
OpenRefine can be used to work with a range of data formats, e.g. CSV, Excel, JSON, RDF, and XML, among others. OpenRefine can also be used with data from Google Sheets by providing a URL to the spreadsheet. Once a file is uploaded or a URL submitted, OpenRefine creates a new project in which you can perform more sophisticated data-cleaning operations than are possible in Excel or Google Sheets. Inconsistencies in spelling, abbreviations, or date formats can easily be standardized across the sheet with basic commands. Columns containing multiple concepts separated by a standard symbol such as a comma or semicolon can also be split into separate columns, so that there is always a single entity or concept per field, ready for reconciliation with Wikidata. One important limitation worth keeping in mind is that while OpenRefine is a sophisticated tool for cleaning and standardizing existing data, it does not let you add new rows to a project; if you need to add new records, you will need to start a new project. It is therefore recommended that you only start working with OpenRefine once all of the main “subjects” of your data have been entered in your original data source files.
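All of these operations are performed through OpenRefine’s graphical interface, so no coding is required. For readers who also work with scripts, the following minimal pandas sketch illustrates the same kinds of cleaning steps; the input file artists.csv and its column names are hypothetical examples, not part of the JJKB workflow.

```python
# Illustrative only: comparable cleaning steps in pandas. The input file
# "artists.csv" and its column names are hypothetical examples.
import pandas as pd

df = pd.read_csv("artists.csv")

# Standardize inconsistent spellings and abbreviations in a column.
df["nationality"] = df["nationality"].replace(
    {"JP": "Japanese", "Jpn": "Japanese", "japanese": "Japanese"}
)

# Normalize dates to a single ISO 8601 format (YYYY-MM-DD).
df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Split a column holding several values separated by ";" into separate columns,
# so that each field contains a single concept, ready for reconciliation.
materials = df["materials"].str.split(";", expand=True).apply(lambda col: col.str.strip())
materials.columns = [f"material_{i + 1}" for i in range(materials.shape[1])]
df = pd.concat([df.drop(columns=["materials"]), materials], axis=1)

df.to_csv("artists_clean.csv", index=False)
```

In OpenRefine itself, the equivalents are text facets with clustering, cell transformations, and the column-splitting commands available from each column’s drop-down menu.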
Using OpenRefine to reconcile and validate your data prior to upload
Lastly, to complete your linked open data workflow with OpenRefine, you need to reconcile your data against your chosen repository database. This means connecting concepts from your dataset, such as artists or artworks, to the same concepts in the repository, identified by unique URIs. In some cases, if those concepts do not yet exist, they will need to be created, but only after the reconciliation process has shown that there is no existing entry for the artist or artwork in question. The goal is to avoid redundancy and maximise efficiency (in storage and search) through data reuse.
In the case of the JJKB, we are using OpenRefine to reconcile our dataset against Wikidata’s vast repository (although it is possible to configure custom API endpoints for other open data repositories, e.g. an independent instance of Wikibase, and reconcile your data against these). Wikidata comes preconfigured as the default reconciliation service in official OpenRefine releases. In addition, the Wikidata extension in OpenRefine enables users to log in directly to their Wikidata user account, build a custom schema mapping their data to Wikidata’s data model, and upload new or edit existing data in Wikidata. The schema-building tool consists of three helpful stages (a scripted illustration of a reconciliation lookup follows the list below):
1. Initial set-up stage
In the initial set-up stage, a drag-and-drop interface allows users to match columns from their dataset to the corresponding fields of Wikidata statements (i.e. item, property, or value fields). Descriptions and aliases for new items can be added here, too.
2. Issues stage
The Issues stage flags potential problems with your data before upload and allows you to correct invalid statements, so that the upload does not introduce mistakes.
3. Preview stage
Lastly, the Preview stage lets you see how your edits will look in Wikidata and allows a final check for mistakes before upload.
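Behind OpenRefine’s reconciliation interface sits a web API that follows the Reconciliation Service API specification: OpenRefine sends your cell values as queries and receives ranked candidate matches in return. The Python sketch below shows roughly what a single lookup involves; the endpoint URL (the community-run Wikidata reconciliation service) and the example query are assumptions for illustration, and within OpenRefine you never need to make such requests yourself.

```python
# Sketch of a single reconciliation lookup, following the Reconciliation
# Service API that OpenRefine uses behind its GUI. The endpoint URL and the
# example query are assumptions for illustration only.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://wikidata.reconci.link/en/api"  # assumed community-run Wikidata service

# One query batch: look up a name, restricted to items of type Q5 ("human").
queries = {"q0": {"query": "Katsushika Hokusai", "type": "Q5", "limit": 3}}
payload = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")

with urllib.request.urlopen(urllib.request.Request(ENDPOINT, data=payload)) as response:
    results = json.load(response)

# Each candidate carries an identifier, a label, a score, and a match flag.
for candidate in results["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate.get("score"), candidate.get("match"))
```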
NB: Please note that at some point, as you progress through your data collection and data reconciliation workflows, your data might get “out of sync”. This can happen if new research reveals additional data after you have completed the earlier reconciliation work, and perhaps even after you have uploaded most of your data. Because new rows cannot be added to OpenRefine projects, you may end up making manual edits to data in Wikidata while keeping spreadsheets with new rows that do not match your original OpenRefine projects. In such cases, you could recreate your OpenRefine projects from your updated spreadsheets and reconcile the data following the same steps as outlined above. You could also simply create new OpenRefine projects for any additional data revealed during the course of your research. Keep in mind that your OpenRefine projects do not have to be your canonical data source. Once the data are uploaded to Wikidata, the database can become that canonical source, and you can download the most up-to-date data directly from Wikidata via the SPARQL endpoint in a convenient format, to keep alongside your original spreadsheets. Read more about making your own data queries here.
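For example, the Wikidata Query Service (https://query.wikidata.org/sparql) can return your items in JSON or CSV form. The sketch below is a minimal illustration: the property P170 (“creator”) is a real Wikidata property, while the example QID and the query itself are placeholders you would replace with the items and properties relevant to your own project.

```python
# Minimal sketch: download up-to-date data from Wikidata's SPARQL endpoint.
# P170 ("creator") is a real property; the QID below is only an example and
# should be replaced with the item(s) relevant to your own project.
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
CREATOR_QID = "Q5586"  # example item; swap in your own

query = f"""
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P170 wd:{CREATOR_QID} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 20
"""

url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
request = urllib.request.Request(
    url, headers={"User-Agent": "ExampleDataSyncCheck/0.1 (contact: you@example.org)"}
)

with urllib.request.urlopen(request) as response:
    data = json.load(response)

for row in data["results"]["bindings"]:
    print(row["work"]["value"], "-", row["workLabel"]["value"])
```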
Resources
- Groves, A. 2016. Beyond Excel: how to start cleaning data with OpenRefine. Multimedia Information & Technology 42(2): 18-22.
- Sterner, E. 2019. Cleaning Collections Data Using OpenRefine. Issues in Science and Technology Librarianship, (92).