Improving data quality on data.gov.au

One of the greatest challenges facing data users trying to use government data is the extraordinary diversity in data types, formats, quality, currency and other attributes. Just finding data that suits a user's need can be hard; what they find is often not machine readable or up to date. Even the most committed data users can give up at some point.

data.gov.au has been on a journey over the past two years: not just to publish more and higher quality data and make public data more easily discoverable across jurisdictions, but to improve data literacy, internal publishing practices and the broader public sector culture around data. We are proud of what we have achieved to date. More agencies are now publishing more data more often, we are seeing higher levels of data literacy outside the traditional data specialists, and this is contributing to a more data-driven public service. There remains much more to be done, and with the new Public Data agenda being driven from the Department of the Prime Minister and Cabinet we intend to take data.gov.au to the next level.

We want to help data users quickly identify which datasets are of high quality and raise private sector confidence in building commercial products on government data. At the same time we'll help agencies identify specifically how to improve the quality of the data and APIs they publish. Below is a draft methodology for measuring some data quality aspects that data users care about. With your feedback, we will implement this over the coming months and then iterate as required.

This Data Quality Framework will apply to all Federal Government datasets. We would also be happy to work with any of our State and Territory colleagues should they wish to be involved. Our intention is to implement an almost fully automated approach, keeping the burden on Data Custodians light. Quality systems that rely on human input are generally not applied consistently across large catalogues, whereas an automated approach ensures consistency across the entire collection.

Data Quality Framework

Below is our draft methodology for measuring the most basic aspects of data quality that data users care about. With your help this foundation can be built upon over time, including possibly adding specialist quality metrics for particular data types (spatial, health, statistics) later on. The following four criteria would each be rated out of 5 stars and clearly visible on each dataset:

  • Format quality out of 5 stars – starting with a localised version of Tim Berners-Lee's (TBL) 5 star plugin used by the UK to give a basic understanding of data formats. We will tweak the model slightly to take into account that, from a data user's perspective, a machine readable XLS is just as good as a machine readable CSV. Under the default model a non-machine readable CSV would get a higher score than a machine readable XLS, which is suboptimal from a data user's perspective. We will iterate where useful.

  • Metadata quality out of 5 stars – we intend to check the metadata that matters most to data users.  As a starting point we intend to award a star for meeting each of the below:

    • whether the last actual update aligns to the update schedule indicated in the metadata;

    • whether a valid spatial context is indicated;

    • whether any data models, vocabularies, ontologies or other documentation is provided anywhere in the dataset; and

    • whether the licence is one recognised by the Open Definition as an open licence (http://opendefinition.org/licenses/).

  • API quality out of 5 stars – APIs are critical for serious data users who want to build persistent analysis, applications or visualisations on government data, so to raise public confidence in using government data we need to start looking at API quality. We actively host tabular, spatial and some relational data on the data.gov.au platform. We are looking at a 5 star quality ranking based on latency and uptime.

  • Public quality score out of 5 stars – the final quality indicator is simply a public rating: you, the data user, can rate each dataset from 1 to 5 stars. A sketch of how these four scores could sit together follows below.
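To make the shape of this concrete, here is a minimal sketch of how the four scores might sit together on a dataset record. The class and field names are illustrative only, not the actual data.gov.au implementation.

```python
from dataclasses import dataclass

@dataclass
class DatasetQuality:
    """Illustrative container for the four draft quality scores."""
    format_stars: int     # localised TBL 5 star model (0-5)
    metadata_stars: int   # one star per metadata check passed (0-5)
    api_stars: int        # based on measured latency and uptime (0-5)
    public_stars: float   # average of public 1-5 star ratings

    def summary(self) -> str:
        return (f"Format {self.format_stars}/5, Metadata {self.metadata_stars}/5, "
                f"API {self.api_stars}/5, Public {self.public_stars:.1f}/5")

# Example: machine readable and well documented, but with a flaky API.
print(DatasetQuality(4, 3, 2, 3.4).summary())
```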

We welcome your feedback to this rating methodology and are keen to read your views in the comments below.

Please note, our metadata standard for data.gov.au is a DCAT profile mapped to ISO19115 (spatial) and a local metadata standard called AGLS. The schema is now used by most Australian Government portals at the Federal and Regional levels and has been mapped to the Australian National Data Service.
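For readers unfamiliar with DCAT, here is a hypothetical, heavily simplified record showing the kinds of fields such a profile carries. The field names and values below are indicative only, not the exact data.gov.au schema.

```python
# Hypothetical, simplified dataset record with DCAT-style fields.
# Names are indicative only; the real profile maps fields like these
# to ISO19115 and AGLS elements.
example_dataset = {
    "title": "Sample air quality readings",
    "description": "Hourly PM2.5 readings from monitoring stations.",
    "issued": "2015-01-01",
    "modified": "2015-12-01",
    "accrualPeriodicity": "monthly",  # the stated update schedule
    "spatial": '{"type": "Point", "coordinates": [149.13, -35.28]}',
    "license": "http://creativecommons.org/licenses/by/3.0/au/",
    "distribution": [
        {"format": "CSV", "downloadURL": "https://data.example.gov.au/air.csv"}
    ],
}
```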

Background reading

We have considered the following quality frameworks in preparing the above:

  • The Common Assessment Framework for Open Data by the Web Foundation – useful indicators across a broad range of areas. The quality and performance metrics above align nicely with this work, but much of the current version would require human intervention or interpretation, so we will look to draw on future iterations of the work as it matures.

  • The ABS Data Quality Framework – largely focused on metadata quality and statistics-specific quality indicators. We have considered the metadata aspects in the framework above, and we will consider this Framework for statistics-specific datasets in future iterations.

  • Draft quality framework for spatial data being developed by the CRC for Spatial Information – we will consider this Framework for spatial-specific datasets in future iterations.

  • The Open Data Maturity Model by the Open Data Institute – relies on a lot of useful but interpretative analysis by humans. We will consider this Model in future planning.


Comments

@TheDataStarter (not verified) 18 December 2015

Hi Pia,
Very informative post, I somehow missed it when it first came out. Some thoughts on the above:
- I don't necessarily agree that XLS is as good as CSV - XLS is a proprietary (albeit widely-used) format. Under the 5-star linked data schema, proprietary formatting is the limiting factor that prevents data scoring above 2 stars; in fact, it's the exact example they give to differentiate between 2-star and 3-star data. I note your point that a machine readable XLS is a better outcome than non-machine readable CSV, but to someone who cannot open or manipulate the data, I think that's a moot point. I would also be cautious about subverting the intention of the 5-star schema, especially when it's only one factor in measuring the quality of open data (as you wonderfully demonstrate above).

- Have you considered using an existing global standard for measuring metadata quality, such as the Open Data Institute's Open Data Certificates? (https://certificates.theodi.org/). Again, if the goal is comparable, useful and interoperable data, I would be in favour of adopting an existing and widely-used standard before creating something proprietary. We're even currently localising these certificates to Australia, working with the Open Data Institute Queensland (http://www.odiqueensland.org.au/certificates)

- With regards to the Open Data Maturity model, it's more focused on the maturity of Open Data publishers, rather than the specific data being published. It does, however, address a number of aspects of maturity that have a direct correlation to the quality and trustworthiness of the data being published. While I agree that applying the model in its raw form can be a subjective exercise, I cannot recommend the Open Data Pathway assessment tool enough (http://pathway.theodi.org/). It is the maturity model interpreted by way of a simple, interactive survey, complete with (mostly-)excellent guidance text and examples. There are a number of Queensland Government Departments that have been through this process to create a maturity baseline (and have developed a maturity action plan on activities to improve for the next, annual assessment). We are seeing great value in having targeted, apples-to-apples discussions between organisations and at a Whole of Government level, guided by our maturity in this model. Personally, I would love to see it more widely adopted throughout Australia as a way of encouraging open discussions, better collaboration and working together to address the common issues we all encounter.

- I would also like to see adoption of open standards being recognised as an important element of interoperability for end users - how great would a national adoption of the Open Council Data approach be? (http://opencouncildata.org/)

Thanks again for engaging!
Dave in Queensland
@TheDataStarter


Pia Waugh 21 December 2015

Thanks for the comments Dave, much appreciated! I've addressed each of your points below:

1) We are not advocating XLS over CSV. Obviously an open standard is preferable to a closed one. The point made in the blog post is that a machine readable spreadsheet of any format is better than a non-machine readable spreadsheet. Unfortunately the default 5 star approach by TBL places format above usability, and we are coming specifically from a data user needs perspective. So the order of value might be non-machine readable XLS, then non-machine readable CSV (open over closed), then machine readable XLS and finally machine readable CSV (machine readable over non-machine readable). Hope that makes sense.

2) We have considered a number of different standards already. The ODI Open Data Certificates is a model we looked at for a while, but the problem with that model is that it requires heavy human intervention and validation for each dataset. We are trying in the first instance to infer quality from some automatable and technical components we can test across the entire catalogue in a consistent way, and we can add additional quality features secondarily. Anything that requires human intervention will never be applied consistently and completely across a collection, especially one as diverse and large as data.gov.au's. We are not creating anything proprietary; indeed, our entire methodology will be public and is mapped to best practices from several of the models above. The problem is there is no existing quality methodology that is fully automatable, and several other governments are following our work closely to solve the same problem for them.

3) The Open Data Maturity Model and survey are great, but again a human intervention system. Perhaps the sustainable approach would be to have the basic technical quality indicators (as we are developing), then get data publishers to assess their maturity as organisations to create quality indicators we can infer for individual datasets, and adopt additional quality indicators from existing systems such as the ODI certificates for high value datasets where additional quality is desired and useful to data users. Thoughts?

4) We also support the adoption of open standards, as do most governments in Australia (including the Commonwealth) where possible. The challenge of standards is always trying to get traction in adoption, and we find agencies will adopt standards either in the system design phase (which many try to do) or in legacy systems where there is a business benefit to do so. Many agencies are publishing data in machine readable open formats on data.gov.au not just because it is a good thing to do, but because then they get automatically generated APIs which they can use for other purposes. We try to follow a path of encouraging technical excellence by making the path of technical excellence the path of least resistance. It is working, but it takes time. Please keep us in the loop for the open council data standard, and we will help promote it to Councils we work with.

Cheers,
Pia and the Data Infrastructure team
Public Data Branch


@TheDataStarter (not verified) 24 December 2015

Thanks Pia, really appreciate the response!

Completely agree with your reasoning for the hierarchy of machine-readable over not, but I'd still be cautious about altering an existing 5-star standard away from what it was established to achieve to accommodate this aspect. If we take its goal as measuring progress towards linked open data, but only use it to measure dataset format, why are we looking to adopt it? Do we achieve the same outcome with a simple, binary "machine-readable - yes or no" score?

If the emphasis is on scoring the format (rather than progress towards linked open data), I think it would make sense to employ something targeted, rather than risk someone from the UK, the US, NZ and the like seeing a 5-star schema rating and making poor assumptions about what that score represents. Not to mention data publishers in State Government, Local Government and academia who are adopting the 5-star schema as published; are we introducing barriers to interoperability for remixing Commonwealth data? For publishers using automated methods to award stars to datasets under this schema, do we want them to have to manually adjust scores up for 2-star XLS or down for 3-star CSV?

Not that I'm saying it's a slippery slope towards adopting PDF, but... =p

I'm also not sure what you mean about the ODI Certificates being a heavily human dependent model? Of the 151,470 ODI certificates currently awarded (as of 24/12/2015), 150,854 were awarded automatically through an assessment of the metadata with no human interaction or intervention. That's roughly 99.6% of all awarded certificates. There are extensions for both CKAN and Socrata portals to bulk-assess datasets and award certificates, assuming the metadata is structured correctly. It also includes support for independent auditing of any dataset at any time, which we've found is a great way of sparking in-depth discussions about datasets where they otherwise aren't generating a lot of interest.

With regards to the Open Data Maturity Model, we haven't been looking to employ it at a per-dataset level; we have been using ODI Certificates for that. Instead, we use it to measure the maturity of the organisation itself and to plan action plans for the organisation to adopt. It provides an easily consumed baseline measure that engages executives and leaders within the organisation and provides a structured way to have detailed conversations about what they are investing in with their open data programs. I completely agree that this is a very human-dependent system - from our experience deploying it across ten Government agencies so far, that is actually its biggest strength, as an education and awareness tool within our Departments.

With regards to open standards - completely agree, but the sense I get is more that there are people enthusiastic and excited about discussing and agreeing standards to give certainty, rather than viewing them reluctantly as an additional overhead. If we could connect even those Departments (across all levels of Government, not to mention other publishers such as Universities and research organisations) passionate about this sort of thing together, I would be surprised if we don't rapidly start building a critical mass of adoption that makes it easier for outliers and stragglers to see the benefits.

Happy to share examples of our maturity model assessments or the action plans they have spawned, how we are structuring our metadata to support automated awarding of ODI Certificates or our experiences on any of the above.

Dave in Queensland
@TheDataStarter


Pia Waugh 24 December 2015

Thanks TheDataStarter for the comments. Our understanding of the ODI Certificates may be out of date, we'll review in the new year and will consider your suggestions moving forward.

Our primary goal is to make it easier for data users to quickly identify data they can use and rely upon. Data that gets used is data that gets prioritised, and that drives agency change. You'll see in the current draft below that we do prioritise open standards above proprietary ones, but unstructured open formats would get the same rating as unstructured proprietary formats, as both are not very useful. We are starting from the bare essentials that make the data usable (which are technical attributes) and then moving on to less automatable, less technical metrics down the track. A lot of government data is good quality in many respects, but if the APIs don't work, if it isn't kept up to date, if it isn't machine readable, then it can't be used. The data is only as good as the weakest link in the supply chain.

If we have to differentiate between our 5 star model and others, then we'll do that. Below is some more detail we've been playing with for comment:

Format quality out of 5 stars – starting with a localised version of Tim Berners-Lee's (TBL) 5 star plugin used by the UK to give a basic understanding of data formats. We would tweak the model slightly to reflect the following (a rough sketch of the mapping is below the list):
* Specific stars are associated with format type as per the list at https://github.com/okfn/ckan-barnet/wiki/Data-quality
* Small tweaks are broadly:
 - No resources working = 0 stars
 - Anything posted = 1 star (anything)
 - Structured but proprietary formats = 2 stars (XLS, XLSX, SPSS, etc.)
 - Structured open formats = 3 stars (Database, KML, SHP, CSV, TXT)
 - API available = 4 stars (any type of API; we already count API enabled resources on the front page, so we should be able to count datasets with at least one API enabled resource, including CSVs and XLSs that are machine readable and thus API enabled)
 - Linked data available = 5 stars (this will probably need to be checked manually rather than tested automatically, which is fine; very few datasets are linked data)
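A rough sketch of how that mapping could be automated across the catalogue, assuming we already know whether a dataset has a working resource and an API enabled resource. The format lists are abbreviated, not the full table from the ckan-barnet wiki, and the function names are illustrative:

```python
# Sketch only: abbreviated format lists, illustrative function names.
PROPRIETARY_STRUCTURED = {"XLS", "XLSX", "SPSS"}
OPEN_STRUCTURED = {"CSV", "TXT", "KML", "SHP"}

def format_stars(formats, has_working_resource, has_api, is_linked_data=False):
    """Map a dataset's resource formats to a 0-5 star format score."""
    if not has_working_resource:
        return 0                       # no resources working
    if is_linked_data:
        return 5                       # manually flagged, as noted above
    if has_api:
        return 4                       # at least one API enabled resource
    fmts = {f.upper() for f in formats}
    if fmts & OPEN_STRUCTURED:
        return 3                       # structured open format
    if fmts & PROPRIETARY_STRUCTURED:
        return 2                       # structured but proprietary
    return 1                           # anything posted at all

# Example: an XLSX-only dataset with no API scores 2 stars.
print(format_stars(["xlsx"], has_working_resource=True, has_api=False))
```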

Metadata quality out of 5 stars – a star for meeting each of the below, giving a cumulative score (a rough sketch is below the list):
* whether the last actual update aligns to the update schedule indicated in the metadata = 1 star
* whether a valid spatial context is indicated = 1 star
* whether any data models, vocabularies, ontologies or other documentation is provided anywhere in the dataset = 1 star (perhaps identify any external links or documents in the dataset as a starting point?)
* whether the licence is one recognised by the Open Definition as an open licence (http://opendefinition.org/licenses/) = 1 star (should be automatable)
* Last star to be confirmed. Any suggestions?
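A rough sketch of the four automatable checks, run against a CKAN-style package dict. The field names, the date format and the abbreviated licence list are assumptions for illustration, not our actual schema:

```python
from datetime import datetime, timedelta

OPEN_LICENCES = {"cc-by", "cc-zero", "odc-pddl"}  # abbreviated Open Definition list
FREQ_DAYS = {"daily": 1, "weekly": 7, "monthly": 31, "annually": 366}

def metadata_stars(pkg):
    """Award up to four stars from automatable metadata checks."""
    stars = 0
    # 1. Last actual update aligns with the stated update schedule.
    freq = FREQ_DAYS.get(pkg.get("update_freq", ""))
    modified = pkg.get("metadata_modified")  # assumed "YYYY-MM-DD"
    if freq and modified:
        age = datetime.utcnow() - datetime.strptime(modified, "%Y-%m-%d")
        if age <= timedelta(days=freq):
            stars += 1
    # 2. A valid spatial context is indicated.
    if pkg.get("spatial"):
        stars += 1
    # 3. Documentation (models, vocabularies, etc.) is linked somewhere.
    if any(r.get("format", "").upper() in {"PDF", "DOC", "HTML"}
           for r in pkg.get("resources", [])):
        stars += 1
    # 4. The licence is recognised by the Open Definition as open.
    if pkg.get("license_id") in OPEN_LICENCES:
        stars += 1
    return stars  # the fifth star is still to be confirmed
```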

API quality out of 5 stars – we are looking at a 5 star quality ranking based on latency and uptime (a rough sketch is below the list):
* API down = 0 stars
* Latency more than 8 seconds AND uptime less than 70% = 1 star
* Latency 6-8 seconds AND uptime 70%-80% = 2 stars
* Latency 3-5 seconds AND uptime 80%-90% = 3 stars
* Latency 2-3 seconds AND uptime 90%-95% = 4 stars
* Latency less than 1 second AND uptime over 95% = 5 stars
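A rough sketch of that banding. The draft bands above leave small gaps (e.g. latency between 1 and 2 seconds), so this sketch assumes contiguous thresholds until we settle the cut-offs:

```python
def api_stars(latency_s, uptime_pct, api_up=True):
    """Map measured latency (seconds) and uptime (%) to 0-5 stars."""
    if not api_up:
        return 0                      # API down
    bands = [                         # (max latency s, min uptime %, stars)
        (1, 95, 5),
        (3, 90, 4),
        (5, 80, 3),
        (8, 70, 2),
    ]
    for max_latency, min_uptime, stars in bands:
        if latency_s <= max_latency and uptime_pct >= min_uptime:
            return stars
    return 1                          # responds, but slowly or unreliably

# Example: 2.5 s latency at 97% uptime scores 4 stars.
print(api_stars(2.5, 97.0))
```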

Public quality score out of 5 stars – the final quality indicator will give you, the data user, a 1-5 star ranking.
* Obviously this is just a public rating out of 5.

Cheers,
Pia and the Data Infrastructure team


Stephen Gates (not verified) 24 December 2015

Great to see the measurement of open data progress shifting from quantity to quality.

Any automated measurement effort will be limited by the machine-readable metadata provided.

Open Data Certificates can be automatically awarded based on the metadata exposed via your CKAN API. Over 150,000 certificates have been automatically awarded, but these are mostly at the lowest of the four certificate levels due to the lack of metadata or the lack of standards for certain types of metadata.

The Open Data Monitor (http://www.opendatamonitor.eu) takes a similar approach reading metadata from open data portals across Europe to report on open data quality and more.

Perhaps there's an opportunity for a two stage data quality check:
1. Automatic assessment of all datasets based on metadata and perhaps some defaults set for the publishing portal where metadata isn't available.
2. Manual assessment for high value datasets to demonstrate higher quality levels and provide increased confidence for data re-users.

This is the approach taken by Open Data Certificates, with a facility for data re-users to verify or flag the manual assessment. This provides further confidence, either through the community agreeing that the assessment is correct or by seeing the publisher correct flagged assessment errors.
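A minimal sketch of that two-stage idea, with all names and checks illustrative:

```python
PORTAL_DEFAULTS = {"license_id": "notspecified"}  # hypothetical portal defaults

def automatic_assessment(metadata):
    """Stage 1: score every dataset from available metadata, falling back
    to portal-level defaults where fields are missing."""
    merged = {**PORTAL_DEFAULTS, **metadata}
    checks = ("title", "license_id", "spatial", "update_freq")
    return sum(1 for f in checks if merged.get(f) not in (None, "", "notspecified"))

def assess(metadata, manual_score=None):
    """Stage 2: a manual assessment of a high value dataset, where one
    exists, overrides the automatic score."""
    return manual_score if manual_score is not None else automatic_assessment(metadata)

print(assess({"title": "Rainfall", "license_id": "cc-by"}))  # automatic: 2
print(assess({"title": "Rainfall"}, manual_score=5))         # manual override
```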

The W3C Data on the Web Best Practices (http://www.w3.org/TR/dwbp/) is a useful resource to consider in improving data quality. The Data on the Web Best Practices Working Group (http://www.w3.org/2013/dwbp/wiki/Main_Page) are also creating missing vocabularies that may further assist in the automation of data quality measurement.

My personal preference would be to adopt a standard and help improve it rather than invent a new one. That said, any effort to improve data quality is very welcome.

Thanks
Stephen Gates


@TheDataStarter (not verified) 24 December 2015

Thanks again Pia, enjoying the discourse! =)

Does the format hierarchy imply API is preferred to bulk data in all cases? I would challenge that assumption as well, based on the reasoning employed here - https://opengovdata.io/2014/bulk-data-an-api/

"A data API must do everything that bulk data does, plus much more. Data APIs alone also typically do not meet the principles of open government data. Data APIs often require registration first (violating principle 6), and because APIs are live services, “rate limiting” is usually employed to ensure that a consumer does not overly tax the underlying system. But rate limiting can also make it impossible for any single consumer to retrieve the complete dataset (violating principle 4: access in bulk).

Therefore government agencies should walk before they run: build good bulk open data first, validate that it meets the needs of users, learn how to do that well, and only after validation and learning invest in building a data API to address additional use cases."

I strongly believe that good APIs are incredibly valuable to getting the most out of our publishing efforts (and in building the Government-as-a-Platform approach, which you covered extremely well at the Open Data and Digital Services – Foundations for a New Information Economy event), but I don't think they're going to be the right answer in all cases, such as slowly-changing, large volume data. I think the infrastructure required to support a data API servicing calls for LIDAR data (https://en.wikipedia.org/wiki/Lidar), for example, would be prohibitively expensive, difficult to use and not offer a meaningful return on investment. It would also put a lot of strain on the infrastructure, which is likely to be supporting other APIs. I agree that the data is only as good as its weakest link, but APIs inherently introduce complexity into the chain (which, again, will absolutely be worth the trade-off in a huge number of cases, but not all).

I'd also note that the 4-star format score is even further divorced from the requirements for 4-star linked data under the TBL model - no problems if the intention is to create a Federal-specific assessment, but I would remove references to the TBL schema to avoid confusion. I would note that some of the primary benefits of using URIs as column-level identifiers are about combining data easily and discovering related datasets, and an API doesn't necessarily serve either goal (especially if the definition of API includes machine-readable CSV files interpreted through the platform software). Again, no problems if that's not what the proposed measurement is about, but it's diverged pretty significantly from the base schema in both spirit and practice.

Will all datasets have an update schedule? What about data published as a one-off or as a point-in-time, which are considered complete at the time of publication? Would they automatically get the star for refresh, or be forever unable to achieve one?

I'm also not sure that having an open license count for a single star under metadata makes sense - if I were to publish a dataset that is otherwise perfect in every form, great for consumers, available with negligible latency and 99% uptime, but gave it a Copyright - all Rights Reserved license, it could get 14/15 stars under this assessment?

To answer your specific question about suggestions for another star, what about provenance metadata? It seems to be the easiest one to miss entirely, but its absence can make it impossible to do the due diligence that should precede using this data in anger.

Hope everyone there in PM&C has a great Xmas, very much looking forward to discussing more in the new year.
Cheers,

Dave in Queensland
@TheDataStarter

