Improving data quality

One of the greatest challenges facing users of government data has been the extraordinary diversity in data types, formats, quality, currency and other attributes of the data. In the first instance, just finding data that suits the user’s need can be hard; then it is often not machine readable, or not up to date. Even the most committed data users can give up at some point.

This platform has been on a journey over the past two years: not just to publish more and higher quality data while making public data more easily discoverable across jurisdictions, but to improve data literacy, internal publishing practices and the broader public sector culture around data. We are proud of what we have achieved to date. More agencies are now publishing more data more often, we are seeing higher levels of data literacy outside of the traditional data specialists, and this is contributing to a more data-driven public service. There remains much more to be done, and with the new Public Data agenda being driven from the Department of the Prime Minister and Cabinet we intend to take this work to the next level.

We want to help data users quickly identify which datasets are of high quality, and raise private sector confidence in building commercial products on government data. At the same time, we’ll help agencies identify specifically how to improve the quality of the data and APIs they publish. Below is a draft methodology for measuring some data quality aspects that data users care about. With your feedback, we will implement this over the coming months and then iterate as required.

This Data Quality Framework will apply to all Federal Government datasets. We would also be happy to work with any of our State and Territory colleagues should they wish to be involved. Our intention is to implement an almost fully automated approach, making it light touch for Data Custodians. Quality systems that rely on human input are generally not consistent across large catalogues, whereas automation ensures a consistent result across the entire collection.

Data Quality Framework

Below is our draft methodology to measure the most basic aspects of data quality that data users care about.  With your help this foundation can be built upon over time, including the possibility of adding specialist data quality metrics for particular data types (spatial, health, statistics) later on.  The following four criteria would be rated out of 5 stars and clearly visible on each dataset:

  • Format quality out of 5 stars – starting with a localised version of Tim Berners-Lee’s (TBL) 5 star plugin used by the UK to give a basic understanding of data formats. We will tweak the model slightly to take into account that a machine readable XLS is just as good as a machine readable CSV from a data user perspective. Under the default model, a non-machine readable CSV would get a higher score than a machine readable XLS, which is suboptimal from a data user’s perspective. We will iterate where useful.

  • Metadata quality out of 5 stars – we intend to check the metadata that matters most to data users.  As a starting point we intend to award a star for meeting each of the below:

    • whether the last actual update aligns to the update schedule indicated in the metadata;

    • whether a valid spatial context is indicated;

    • whether any data models, vocabularies, ontologies or other documentation is provided anywhere in the dataset; and

    • whether the licence is one recognised by the Open Definition as an open licence.

  • API quality out of 5 stars – APIs are critical for serious data users building persistent analysis, applications or visualisations on government data, so to raise public confidence in using government data we need to start looking at API quality. We actively host tabular, spatial and some relational data on the platform. We are looking at a 5 star quality ranking based on latency and uptime.

  • Public quality score out of 5 stars – The final quality indicator is a public rating: you, the data user, can rate each dataset from 1 to 5 stars.
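To make these four criteria concrete, here is a minimal sketch of how a dataset record might carry the ratings together. This is illustrative only: the class and field names are our assumptions, not an actual portal schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: names are assumptions, not a real schema.
@dataclass
class QualityRating:
    format_stars: int      # 0-5, format quality (TBL-derived)
    metadata_stars: int    # 0-5, one star per metadata check met
    api_stars: int         # 0-5, based on latency and uptime
    public_stars: float    # 0-5, average of public user ratings

    def summary(self) -> str:
        """Render the four scores as they might appear on a dataset page."""
        return (f"Format {self.format_stars}/5 | "
                f"Metadata {self.metadata_stars}/5 | "
                f"API {self.api_stars}/5 | "
                f"Public {self.public_stars:.1f}/5")
```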

We welcome your feedback on this rating methodology and are keen to read your views in the comments below.

Please note, our metadata standard is a DCAT profile mapped to ISO19115 (spatial) and a local metadata standard called AGLS. The schema is now used by most Australian Government portals at the Federal and Regional levels and has been mapped to the Australian National Data Service.

Background reading

We have considered the following quality frameworks in preparing the above:

  • The Common Assessment Framework for Open Data by the Web Foundation – useful indicators across a broad range of areas. The quality and performance metrics above align nicely with this work, but much of the current version would require human intervention or interpretation, so we will look to build on it in future iterations as the work matures.

  • The ABS Data Quality Framework – largely focused on metadata quality and statistics-specific quality indicators. We have considered the metadata aspects in the framework above, and will consider this Framework for statistics-specific datasets in future iterations.

  • The draft quality framework for spatial data being developed by the CRC for Spatial Information – we will consider this Framework for spatial-specific datasets in future iterations.

  • The Open Data Maturity Model by the Open Data Institute – relies on a lot of useful but interpretative analysis by humans. We will consider this Model in future planning.


@TheDataStarter (not verified) 18 December 2015

Hi Pia,
Very informative post, I somehow missed it when it first came out. Some thoughts on the above:
- I don't necessarily agree that XLS is as good as CSV - XLS is a proprietary (albeit widely-used) format. Under the 5-star linked data schema, proprietary formatting is the limiting factor that prevents data scoring above 2 stars; in fact, it's the exact example they give to differentiate between 2-star and 3-star data. I note your point that a machine readable XLS is a better outcome than non-machine readable CSV, but to someone who cannot open or manipulate the data, I think that's a moot point. I would also be cautious about subverting the intention of the 5-star schema, especially when it's only one factor in measuring the quality of open data (as you wonderfully demonstrate above).

- Have you considered using an existing global standard for measuring metadata quality, such as the Open Data Institute's Open Data Certificates? Again, if the goal is comparable, useful and interoperable data, I would be in favour of adopting an existing and widely-used standard before creating something proprietary. We're even currently localising these certificates to Australia, working with the Open Data Institute Queensland.

- With regards to the Open Data Maturity Model, it's more focused on the maturity of Open Data publishers, rather than the specific data being published. It does, however, address a number of aspects of maturity that have a direct correlation to the quality and trustworthiness of the data being published. While I agree that applying the model in its raw form can be a subjective exercise, I cannot recommend the Open Data Pathway assessment tool enough. It is the maturity model interpreted by way of a simple, interactive survey, complete with (mostly) excellent guidance text and examples. There are a number of Queensland Government Departments that have been through this process to create a maturity baseline (and have developed a maturity action plan of activities to improve for the next, annual assessment). We are seeing great value in having targeted, apples-to-apples discussions between organisations and at a Whole of Government level, guided by our maturity in this model. Personally, I would love to see it more widely adopted throughout Australia as a way of encouraging open discussions, better collaboration and working together to address the common issues we all encounter.

- I would also like to see adoption of open standards recognised as an important element of interoperability for end users - how great would a national adoption of the Open Council Data approach be?

Thanks again for engaging!
Dave in Queensland

Pia Waugh 21 December 2015

Thanks for the comments Dave, much appreciated! I've addressed each of your points below:

1) We are not advocating XLS over CSV. Obviously an open standard is preferable to a closed standard. The point made in the blog post is that a machine readable spreadsheet of any format is better than a non-machine readable spreadsheet. Unfortunately the default 5 star approach by TBL places value of format over usability, and we are coming from a data user needs perspective specifically. So the order of value might be non-machine readable xls, then non-machine readable csv (open over closed) then machine readable xls and finally machine readable csv (machine readable over non-machine readable). Hope that makes sense.

2) We have considered a number of different standards already. The ODI Open Data Certificates was a model we looked at for a while, but the problem with that model is that it requires heavy human intervention and validation for each dataset. We are trying in the first instance to infer quality from automatable, technical components we can test across the entire catalogue in a consistent way, and we can add additional quality features secondarily. Anything that requires human intervention will never be applied consistently and completely across a collection, especially a collection as diverse and large as ours. We are not creating anything proprietary; indeed, our entire methodology will be public and is mapped to best practices from several of the models above. The problem is that there is no existing quality methodology that is fully automatable, and several other governments are following our work closely to solve the same problem for them.

3) The Open Data Maturity model and survey is great, but again a human intervention system. Perhaps the sustainable approach would be to have the basic technical quality indicators (as we are developing), then to get data publishers to assess their maturity as organisations to create quality indicators we can infer for individual datasets, and to adopt additional quality indicators from existing systems such as the ODI certificates for high value datasets where additional quality assurance is desired and useful to data users. Thoughts?

4) We also support the adoption of open standards, as do most governments in Australia (including the Commonwealth) where possible. The challenge of standards is always trying to get traction in adoption, and we find agencies will adopt standards either in the system design phase (which many try to do) or in legacy systems where there is a business benefit to do so. Many agencies are publishing data in machine readable open formats on not just because it is a good thing to do, but because then they get automatically generated APIs which they can use for other purposes. We try to follow a path of encouraging technical excellence by making the path of technical excellence the path of least resistance. It is working, but it takes time. Please keep us in the loop for the open council data standard, and we will help promote it to Councils we work with.

Pia and the Data Infrastructure team
Public Data Branch

@TheDataStarter (not verified) 24 December 2015

Thanks Pia, really appreciate the response!

Completely agree with your reasoning for ranking machine-readable over non-machine-readable, but I'd still be cautious about altering an existing 5-star standard away from what it was established to achieve to accommodate this aspect. If we take its goal as measuring progress towards linked open data, but only use it to measure dataset format, why are we looking to adopt it? Do we achieve the same outcome with a simple, binary "machine-readable - yes or no" score?

If the emphasis is on scoring the format (rather than progress towards linked open data), I think it would make sense to employ something targeted, rather than risk someone from the UK, the US, NZ and the like seeing a 5-star schema rating and making poor assumptions about what that score represents. Not to mention data publishers in State Government, Local Government and academia who are adopting the 5-star schema as published; are we introducing barriers to interoperability for remixing Commonwealth data? For publishers using automated methods to award stars to datasets under this schema, do we want them to have to manually adjust scores up for 2-star XLS or down for 3-star CSV?

Not that I'm saying it's a slippery slope towards adopting PDF, but... =p

I'm also not sure what you mean about the ODI Certificates being a heavily human-dependent model. Of the 151,470 ODI certificates currently awarded (as of 24/12/2015), 150,854 were awarded automatically through an assessment of the metadata with no human interaction or intervention. That's roughly 99.6% of all awarded certificates. There are extensions for both CKAN and Socrata portals to bulk-assess datasets and award certificates, assuming the metadata is structured correctly. It also includes support for independent auditing of any dataset at any time, which we've found is a great way of sparking in-depth discussions about datasets that otherwise aren't generating a lot of interest.

With regards to the Open Data Maturity Model, we haven't been looking to employ it on a per-dataset level; we have been using ODI Certificates for that. Instead, we use it to measure the maturity of the organisation itself and to plan action plans for the organisation to adopt. It provides an easily consumed baseline measure that engages executives and leaders within the organisation and provides a structured way to have detailed conversations about what they are investing in with their open data programs. I completely agree that this is a very human-dependent system - from our experiences in deploying it across ten Government agencies so far, that is actually its biggest strength, as an education and awareness tool within our Departments.

With regards to open standards - completely agree, but the sense I get is more that there are people enthusiastic and excited about discussing and agreeing standards to give certainty, rather than viewing them reluctantly as an additional overhead. If we could connect even those Departments (across all levels of Government, not to mention other publishers such as Universities and research organisations) passionate about this sort of thing together, I would be surprised if we don't rapidly start building a critical mass of adoption that makes it easier for outliers and stragglers to see the benefits.

Happy to share examples of our maturity model assessments or the action plans they have spawned, how we are structuring our metadata to support automated awarding of ODI Certificates or our experiences on any of the above.

Dave in Queensland

Pia Waugh 24 December 2015

Thanks TheDataStarter for the comments. Our understanding of the ODI Certificates may be out of date, we'll review in the new year and will consider your suggestions moving forward.

Our primary goal is to make it easier for data users to quickly identify data they can use and rely upon. Data that gets used is data that gets prioritised, and that drives agency change. You'll see that we do prioritise open standards above proprietary formats in the current draft below, but unstructured open standards would get the same rating as unstructured proprietary formats, as both are not very useful. We are starting from the bare essentials that make the data usable (which are technical attributes) and then moving on to less automatable, less technical metrics down the track. A lot of government data is good quality in many respects, but if the APIs don’t work, if it isn't kept up to date, if it isn't machine readable, then it can't be used. The data is only as good as the weakest link in the supply chain.

If we have to differentiate between our 5 star model and others, then we'll do that. Below is some more detail we've been playing with for comment:

Format quality out of 5 stars – starting with a localised version of Tim Berners-Lee’s (TBL) 5 star plugin used by the UK to give a basic understanding of data formats. We will tweak the model slightly to reflect:
* Specific stars are associated with each format type as per the published list
* Small tweaks are broadly:
 * No resources working = 0 stars
 * Anything posted = 1 star (anything)
 * Structured but proprietary formats = 2 stars (XLS, XLSX, SPSS, etc)
 * Structured open formats = 3 stars (Database, KML, SHP, CSV, TXT)
 * API available = 4 stars (any type of API; we already count API-enabled resources on the front page, so we should be able to count datasets with at least one API-enabled resource. This includes CSVs and XLSs that are machine readable and thus API enabled)
 * Linked data available = 5 stars (this will likely need to be checked manually rather than tested automatically, which is fine; very few datasets are linked data)
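As a rough sketch of how the tweaked format scoring could be automated (the format sets and the `has_api`/`is_linked_data` flags are illustrative assumptions, not the actual implementation):

```python
# Hedged sketch of the tweaked format scoring described above.
# The format sets and boolean flags are illustrative assumptions.
PROPRIETARY_STRUCTURED = {"XLS", "XLSX", "SPSS"}
OPEN_STRUCTURED = {"KML", "SHP", "CSV", "TXT"}

def format_stars(formats, any_resource_working=True,
                 has_api=False, is_linked_data=False):
    """Return a 0-5 star format-quality score for one dataset."""
    if not any_resource_working:
        return 0                      # no resources working
    if is_linked_data:
        return 5                      # linked data (manually checked)
    if has_api:
        return 4                      # at least one API-enabled resource
    fmts = {f.upper() for f in formats}
    if fmts & OPEN_STRUCTURED:
        return 3                      # structured open format
    if fmts & PROPRIETARY_STRUCTURED:
        return 2                      # structured but proprietary
    return 1                          # anything posted at all
```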

Metadata quality out of 5 stars - A star for meeting each of the below to give cumulative score:
* whether the last actual update aligns to the update schedule indicated in the metadata = 1 star
* whether a valid spatial context is indicated = 1 star
* whether any data models, vocabularies, ontologies or other documentation is provided anywhere in the dataset = 1 star (perhaps identify any external links or documents in the dataset as a starting point?)
* whether the licence is one recognised by the Open Definition as an open licence = 1 star (should be automatable)
* Last star to be confirmed. Any suggestions?
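The cumulative metadata score then reduces to counting the checks met; a trivial sketch (the four boolean inputs stand in for whatever automated checks are ultimately implemented):

```python
def metadata_stars(update_on_schedule, has_spatial_context,
                   has_documentation, licence_is_open):
    """Award one star per metadata check met (max 4 until a fifth
    criterion is confirmed)."""
    checks = [update_on_schedule, has_spatial_context,
              has_documentation, licence_is_open]
    return sum(1 for met in checks if met)
```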

API quality out of 5 stars – We are looking at a 5 star quality ranking based on latency and uptime.
* API down = 0 stars
* Latency more than 8 seconds AND Uptime less than 70% = 1 star
* Latency 6-8 seconds AND Uptime 70%-80% = 2 star
* Latency 3-5 seconds AND Uptime 80%-90% = 3 star
* Latency 2-3 seconds AND Uptime 90%-95% = 4 star
* Latency less than 1 second AND Uptime over 95% = 5 star
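The bands above leave some latency/uptime combinations undefined (for example, latency between 1 and 2 seconds, or a fast API with poor uptime). One conservative reading, sketched here as an assumption rather than the actual method, is to band latency and uptime separately and take the lower of the two, so a dataset only earns a band when it meets both thresholds:

```python
def api_stars(latency_s, uptime_pct, api_up=True):
    """0-5 star API quality. One conservative interpretation of the
    draft bands above: band latency and uptime separately, then
    take the lower of the two scores."""
    if not api_up:
        return 0                      # API down
    if latency_s < 1:
        lat = 5
    elif latency_s <= 3:
        lat = 4
    elif latency_s <= 5:
        lat = 3
    elif latency_s <= 8:
        lat = 2
    else:
        lat = 1
    if uptime_pct > 95:
        up = 5
    elif uptime_pct >= 90:
        up = 4
    elif uptime_pct >= 80:
        up = 3
    elif uptime_pct >= 70:
        up = 2
    else:
        up = 1
    return min(lat, up)
```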

Public quality score out of 5 stars – The final quality indicator will give you, the data user, a 1-5 star ranking.
* Obviously this is just a public rating out of 5.

Pia and the Data Infrastructure team

Stephen Gates (not verified) 24 December 2015

Great to see the measurement of open data progress shifting from quantity to quality.

Any automated measurement effort will be limited by the machine-readable metadata provided.

Open Data Certificates can be automatically awarded based on the metadata exposed via your CKAN API. Over 150,000 certificates have been automatically awarded, but these are mostly at the lowest of the four certificate levels due to the lack of metadata or the lack of standards for certain types of metadata.

The Open Data Monitor takes a similar approach, reading metadata from open data portals across Europe to report on open data quality and more.

Perhaps there's an opportunity for a two stage data quality check:
1. Automatic assessment of all datasets based on metadata and perhaps some defaults set for the publishing portal where metadata isn't available.
2. Manual assessment for high value datasets to demonstrate higher quality levels and provide increased confidence for data re-users.

This is the approach taken by open data certificates with a facility for data re-users to verify or flag the manual assessment. This provides further confidence either by the community agreeing that the assessment is correct or seeing the publisher correcting flagged assessment errors.

The W3C Data on the Web Best Practices is a useful resource to consider in improving data quality. The Data on the Web Best Practices Working Group are also creating missing vocabularies that may further assist in the automation of data quality measurement.

My personal preference would be to adopt a standard and help improve it rather than invent a new one. That said, any effort to improve data quality is very welcome.

Stephen Gates

@TheDataStarter (not verified) 24 December 2015

Thanks again Pia, enjoying the discourse! =)

Does the format hierarchy imply API is preferred to bulk data in all cases? I would challenge that assumption as well, based on the reasoning quoted below:

"A data API must do everything that bulk data does, plus much more. Data APIs alone also typically do not meet the principles of open government data. Data APIs often require registration first (violating principle 6), and because APIs are live services, “rate limiting” is usually employed to ensure that a consumer does not overly tax the underlying system. But rate limiting can also make it impossible for any single consumer to retrieve the complete dataset (violating principle 4: access in bulk).

Therefore government agencies should walk before they run: build good bulk open data first, validate that it meets the needs of users, learn how to do that well, and only after validation and learning invest in building a data API to address additional use cases."

I strongly believe that good APIs are incredibly valuable to getting the most out of our publishing efforts (and in building the Government-as-a-Platform approach, which you covered extremely well at the Open Data and Digital Services – Foundations for a New Information Economy event), but I don't think an API is going to be the right answer in all cases, such as slowly-changing, large volume data. I think the infrastructure required to support a data API servicing calls for LIDAR data, for example, would be prohibitively expensive, difficult to use and not offer a meaningful return on investment. It would also put a lot of strain on the infrastructure, which is likely to be supporting other APIs. I agree that the data is only as good as its weakest link, but APIs inherently introduce complexity into the chain (which, again, will absolutely be worth the trade-off in a huge number of cases, but not all).

I'd also note that the 4-star format score is even further divorced from the requirements for 4-star linked data under the TBL model - no problems if the intention is to create a Federal-specific assessment, but I would remove references to the TBL schema to avoid confusion. I would note that some of the primary benefits of using URIs as column-level identifiers are about combining data easily and discovering related datasets, and an API doesn't necessarily serve either goal (especially if the definition of API includes machine-readable CSV files interpreted through the platform software). Again, no problems if that's not what the proposed measurement is about, but it's diverged pretty significantly from the base schema in both spirit and practice.

Will all datasets have an update schedule? What about data published as a one-off or as a point-in-time, which are considered complete at the time of publication? Would they automatically get the star for refresh, or be forever unable to achieve one?

I'm also not sure that having an open license count for a single star under metadata makes sense - if I were to publish a dataset that is otherwise perfect in every form, great for consumers, available with negligible latency and 99% uptime, but gave it a Copyright - all Rights Reserved license, it could get 14/15 stars under this assessment?

To answer your specific question about suggestions for another star, what about provenance metadata? It seems to be the easiest thing to miss entirely, but its absence can make it impossible to do the due diligence that should precede using this data in anger.

Hope everyone there in PM&C has a great Xmas, very much looking forward to discussing more in the new year.

Dave in Queensland
