Data Collection

Introduction
Preparing your spreadsheets
1. Concept simplicity
2. Thinking in triples
3. One concept per field
4. One sheet per entity type
5. Descriptions before data
Assigning Properties

Introduction

Ideally, you will start thinking about how your data should be prepared for a linked data environment while you are collecting it. This saves potentially redundant work later, when you are editing and preparing your data for upload. The most common format for data collection is tabular: a spreadsheet prepared in MS Excel, or in another software application that can export CSV. (“CSV” stands for comma-separated values: the data for each entity is on its own line, and the values are always in the same order, separated by commas.)

Preparing your spreadsheets

When collecting and preparing data to be uploaded to a machine-readable database, there are several principles that are important to keep in mind.

1. Concept simplicity

It is important to realize that the full nuance and complexity of human knowledge cannot be meaningfully reduced to machine-readable data, nor is it necessary to do so in order to extract useful information when querying the database. To take maximum advantage of the affordances of Wikidata’s semantic environment, concepts should be kept relatively simple. Every mention of a concept can, however, be referenced to a particular textual narrative source, where richer detail may be provided. The concept serves as the hook that can be queried for.

For example, many of the works in JJKB feature complex arrangements of objects, which have been documented in varying degrees of detail. It may not be possible (or at least practical) to document the minutiae of every object’s characteristics or position in relation to an artwork instantiation in the database. However, it is possible to document that specific works had floor plans or scripts, and the latter can be referenced to specific textual documents. One can also document that specific works used mirrors, masks or metal cones as part of the installation or performance, in order to draw out common objects that may occur across works, without going into further detail as to the physical characteristics of these objects.

Simple concepts like a floor plan or a mirror already exist in Wikidata and are therefore readily usable across multiple data records. Attempting to create data records for highly specific concepts that may only ever be used in relation to one artwork is not meaningful in the context of a semantic environment, where the richness of connections across records creates the conditions for useful querying.

2. Thinking in triples

A spreadsheet is a flat table without the capacity to represent rich network connections. However, the rows and columns of a spreadsheet can be reimagined as triple relationships in order to make the data modeling and eventual data ingest process easier. In this scenario, the first column in the spreadsheet serves as the primary “subject” of the triple. The heading of each following column serves as the “predicate”, or the link (also referred to as the “property”), in the triple. Finally, the value of each cell in these further columns serves as the “object” to be linked to the subject (see the illustration below).
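The reading of rows and columns as triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column headings and cell values are hypothetical placeholders rather than JJKB records, and `rows_to_triples` is a helper name invented for this example.

```python
import csv
import io

# Hypothetical sample sheet; titles and values are placeholders, not JJKB data.
SAMPLE_CSV = """title,creator,inception
Work A,Artist X,1970
Work B,Artist X,1976
"""


def rows_to_triples(csv_text):
    """Read CSV text and return (subject, predicate, object) triples.

    The first column supplies the subject, each later column heading
    supplies the predicate, and each non-empty cell supplies the object.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    triples = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        subject = row[0]
        for predicate, value in zip(header[1:], row[1:]):
            if value.strip():
                triples.append((subject, predicate, value.strip()))
    return triples


triples = rows_to_triples(SAMPLE_CSV)
for triple in triples:
    print(triple)
```

Each printed tuple corresponds to one statement that could later be attached to the subject entity. Empty cells simply produce no triple.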

Here is an excerpt using JJKB data:

3. One concept per field

For both subjects and objects, each field, or cell, in the spreadsheet should hold only a single entity or concept. For example, instead of listing all contributors to an artwork performance in the same cell in the column for contributors, each contributor name should be listed in a separate cell, and the column header can be repeated as many times as necessary. Predicates and objects can be repeated within and across subjects as many times as necessary. Only subjects (in this case, in column A) must remain unique within a given spreadsheet.
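If a sheet already contains cells with several values, a small script can expand them into repeated columns. The sketch below assumes that the values are separated by semicolons; the helper `split_multivalue` and the sample names are hypothetical and only illustrate the idea.

```python
def split_multivalue(header, row, column, sep=";"):
    """Expand one delimited cell into several cells, repeating the column header."""
    new_header, new_row = [], []
    for name, cell in zip(header, row):
        if name == column:
            values = [v.strip() for v in cell.split(sep) if v.strip()]
            values = values or [""]  # keep one empty cell if nothing was listed
            new_header.extend([name] * len(values))
            new_row.extend(values)
        else:
            new_header.append(name)
            new_row.append(cell)
    return new_header, new_row


# Hypothetical row with three contributors crammed into one cell.
header = ["title", "contributor", "venue"]
row = ["Work A", "Person One; Person Two; Person Three", "Venue Y"]

new_header, new_row = split_multivalue(header, row, "contributor")
print(new_header)
print(new_row)
```

When several rows are processed this way, the number of repeated columns should be padded to the longest row so that the sheet stays rectangular.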

4. One sheet per entity type

The more heterogeneous the data in a spreadsheet, the more difficult data reconciliation and upload will be in later stages of the workflow (see below), so keeping a particular type of entity, or “record”, in the same spreadsheet will make the process more efficient later. For example, separate spreadsheets should be used to collect data for artworks, artists, exhibitions, etc. What counts as “subjects” in one spreadsheet may become “objects” in another.

5. Descriptions before data

Lastly, there is a particularity of Wikidata’s linked data model that is worth noting, even if it is not relevant to other linked data workflows. Each entity in Wikidata has several ways of being identified besides its title (or label). These include: a unique ID number (the letter Q followed by a unique combination of digits), which Wikidata generates automatically when the entity is created; a short description (up to 250 characters), which is used to disambiguate entities with the same title or label (e.g. the city of Paris in France vs. the city of Paris in Texas, US); and finally a list of alternate titles or aliases (e.g. NYC for the city of New York). Together, the unique ID number and the short description are very useful features of Wikidata, as they allow subjects to have the exact same label (or title) and still be unique, clearly distinguishable entities in the linked data database. Besides cases such as the city of Paris, this is particularly helpful for artworks, since many artworks bear the same name (e.g. Untitled) or, in the case of performance art, are distinct variations of the same artwork.

When preparing the sheets of data, the two columns following the first column (which defines the “subjects” of the triples) could be dedicated to the description and alias fields to be associated with each “subject” entity during upload to Wikidata. This makes the upload process more efficient, and also allows one to quickly distinguish between works bearing the same title (illustration below).
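Before upload, the label, description and alias columns can be checked automatically. The following sketch assumes a sheet whose first three columns are label, description and aliases; it flags rows with a missing or overlong description and rows whose label and description together do not distinguish them from another row. The sample rows and the helper name `check_rows` are hypothetical.

```python
from collections import Counter

MAX_DESCRIPTION_LENGTH = 250  # the limit quoted in the text above

# Hypothetical rows: (label, description, aliases)
rows = [
    ("Untitled", "1970 performance, first venue", "Untitled I"),
    ("Untitled", "1972 installation, second venue", ""),
    ("Untitled", "1970 performance, first venue", ""),
]


def check_rows(rows):
    """Return (row number, problem) pairs for rows that need attention."""
    problems = []
    seen = Counter((label, description) for label, description, _ in rows)
    for index, (label, description, _aliases) in enumerate(rows, start=1):
        if not description.strip():
            problems.append((index, "missing description"))
        elif len(description) > MAX_DESCRIPTION_LENGTH:
            problems.append((index, "description too long"))
        if seen[(label, description)] > 1:
            problems.append((index, "label and description not unique"))
    return problems


problems = check_rows(rows)
for row_number, problem in problems:
    print(row_number, problem)
```

Here rows 1 and 3 are flagged because they share both label and description, while row 2 passes because its description sets it apart.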

Here is an excerpt using JJKB data:

Assigning Properties

Once you have started collecting your data into spreadsheets and formatting them as advised above, it is important to make sure that the columns you are choosing as your “predicates”, or links between subjects and objects, actually match established properties in Wikidata. Sometimes this process is straightforward: if you want to link an artwork to its artist, for example, the correct property is the self-evident “creator”. Other properties may be less obvious and require familiarity with the way vocabularies are developed and used within Wikidata communities: the correct property to indicate the date of creation for an artwork, for example, is titled “inception”. The best way to identify which properties are used in your domain, and how, is to look up the relevant WikiProject page. There is an index of the available pages related to cultural projects here: https://www.wikidata.org/wiki/Category:Cultural_WikiProjects.
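One way to keep track of this matching is a small lookup table from column headings to Wikidata properties. The sketch below is an assumption-laden illustration: the property IDs P170 (“creator”) and P571 (“inception”) should be confirmed on wikidata.org before use, and the helper `map_columns` and the sample headings are hypothetical.

```python
# Column heading -> Wikidata property ID.
# The IDs are given as assumptions; confirm each one on wikidata.org.
PROPERTY_MAP = {
    "creator": "P170",
    "inception": "P571",
}


def map_columns(header):
    """Split the predicate columns into mapped and unmapped headings."""
    mapped, unmapped = {}, []
    for name in header[1:]:  # the first column holds the subjects
        key = name.strip().lower()
        if key in PROPERTY_MAP:
            mapped[name] = PROPERTY_MAP[key]
        else:
            unmapped.append(name)
    return mapped, unmapped


mapped, unmapped = map_columns(["title", "Creator", "inception", "date made"])
print(mapped)
print(unmapped)
```

Headings that end up in the unmapped list are the ones that still need to be looked up on the relevant WikiProject page.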

Some of these (e.g. the Visual Arts and the Digital and Performative Arts pages) have already been mentioned in the Introduction to Linked Open Data. It is worth noting that even if a particular property is listed as appropriate to use within a WikiProject page, occasionally some properties—and their application—may need to be treated creatively to fit the specific needs of unique or non-standard objects and scenarios. Reusing existing properties as much as possible ensures that query results are returned consistently across different collections submitted to Wikidata by various institutions. At the same time, some properties may need to be used differently across different contexts, and on occasion even new properties may need to be proposed following a specific procedure established within the Wikidata community.

An example of the former is the property “material used”. It is suggested under the Visual Arts vocabulary as a property suitable to indicate materials used in an artwork. At first glance this appears to suit the need to indicate that some of Joan Jonas’ installations use materials such as metal cones, masks, etc. as part of the artwork. However, due to a preset constraint of the Wikidata schema, the property “material used” can actually only be applied to a “base material” such as wood or canvas, but not more complex objects. An alternative property that does not have this constraint is simply “uses”, which is a super-property of “material used” – in other words it plays a similar role in the overall Wikidata data schema, but it applies to a broader range of values.
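The choice between the two properties can be expressed as a simple rule. This sketch assumes a hand-maintained list of base materials (only the two examples named above are included); the set `BASE_MATERIALS` and the function `choose_property` are hypothetical, and the real constraint is defined by Wikidata, not by this list.

```python
# Hand-maintained list of base materials; only the examples from the text are included.
BASE_MATERIALS = {"wood", "canvas"}


def choose_property(value):
    """Pick "material used" for base materials and the broader "uses" otherwise."""
    if value.strip().lower() in BASE_MATERIALS:
        return "material used"
    return "uses"


for value in ["canvas", "mask", "metal cone"]:
    print(value, "->", choose_property(value))
```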

A further example of the need to work with properties creatively is the application of the inverse properties “part of” and “has part” in the JJKB dataset. Typically, these properties are only applied to a series of works. However, the unique variations across instantiations of Joan Jonas’ performances meant that the work Mirage, for example, can be considered both as a work and as a group of works, the group consisting of each individual instantiation of the work at a particular venue or exhibition. The individual instantiations can be considered parts of the whole group, or parts of Mirage as a singular artwork, while being artworks in their own right, too. Linked data provides the flexibility to express such complex, even purposefully ambiguous relationships. Still, it is worth noting that researchers using the dataset for querying will need to be made aware of such particular uses of common properties such as the “part of” / “has part” pair.
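Because “part of” and “has part” are inverses, every statement in one direction implies a statement in the other. The sketch below fills in any missing inverse statements in a list of triples. The venue names are hypothetical placeholders and `add_inverse_parts` is a helper invented for this example.

```python
INVERSE = {"part of": "has part", "has part": "part of"}


def add_inverse_parts(triples):
    """Return the triples plus any missing inverse "part of" / "has part" statements."""
    result = list(triples)
    existing = set(triples)
    for subject, predicate, obj in triples:
        if predicate in INVERSE:
            mirrored = (obj, INVERSE[predicate], subject)
            if mirrored not in existing:
                result.append(mirrored)
                existing.add(mirrored)
    return result


# Hypothetical instantiations of Mirage; the venue names are placeholders.
part_triples = [
    ("Mirage (instantiation at Venue A)", "part of", "Mirage"),
    ("Mirage (instantiation at Venue B)", "part of", "Mirage"),
    ("Mirage", "has part", "Mirage (instantiation at Venue A)"),
]

completed = add_inverse_parts(part_triples)
for triple in completed:
    print(triple)
```

Only the statement for Venue B is added here, since the Venue A pair is already complete in both directions.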

Note that no new properties were suggested for addition to Wikidata’s existing vocabularies as a result of uploading data from this project into Wikidata. However, if other arts or cultural organizations choose to collaborate on the expression of complex performative or time-based media works in the future, new properties could be proposed. There is a higher chance of property proposals being approved by the Wikidata administrators if these are coordinated collectively and there are numerous supporters behind each proposal. New properties coordinated across several institutions or collections could then serve the unique needs of their artworks, and address the limitations of Wikidata’s current set of properties which have been developed primarily for traditional arts.
