WofG Web Reporting Service - Crawl Data

The Digital Transformation Agency (DTA) works to deliver better, faster, simpler digital services. As part of our charter, we’ve undertaken work to examine how agencies have adopted DTA’s guidance, products and services on public-facing Australian Government websites.

This service is in an early beta stage, and we’ve recently gathered over 5 million URLs as part of this exercise.

The dataset

The DTA has published the Whole-of-Australian Government Web Crawl dataset on data.gov.au as Web ARChive (WARC) files, in parts and whole.

The dataset is large - the largest so far on data.gov.au. We’ve made it available both as a single 66GB WARC file and as a series of 57 smaller WARC files. We suggest downloading one of the smaller WARC files (approx. 1.1GB) as a sample first.
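As a rough illustration of what a sample file contains, the sketch below walks the records in an uncompressed WARC stream using only the Python standard library. It assumes the basic record layout (a version line, named headers, a blank line, then Content-Length bytes of payload); the helper names and the example.gov.au URLs are hypothetical, and the in-memory sample stands in for a downloaded file.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def response_urls(stream):
    """Return the target URI of every 'response' record, in file order."""
    return [
        headers["warc-target-uri"]
        for headers, _ in iter_warc_records(stream)
        if headers.get("warc-type") == "response" and "warc-target-uri" in headers
    ]


def make_record(warc_type, uri, body):
    """Build one synthetic record (illustration only, not real crawl data)."""
    lines = ["WARC/1.0", f"WARC-Type: {warc_type}"]
    if uri:
        lines.append(f"WARC-Target-URI: {uri}")
    lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return head + body + b"\r\n\r\n"


SAMPLE = b"".join([
    make_record("warcinfo", None, b"software: example"),
    make_record("response", "https://www.example.gov.au/", b"HTTP/1.1 200 OK\r\n\r\nhome"),
    make_record("request", "https://www.example.gov.au/", b"GET / HTTP/1.1\r\n\r\n"),
    make_record("response", "https://www.example.gov.au/about", b"HTTP/1.1 200 OK\r\n\r\nabout"),
])

if __name__ == "__main__":
    for url in response_urls(io.BytesIO(SAMPLE)):
        print(url)
```

For a real download, the stream would come from open(path, "rb"), or gzip.open(path, "rb") if the file turns out to be gzip-compressed. A dedicated library such as warcio is likely a better fit for anything beyond a quick look.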

WARC is a recognised ISO standard (ISO 28500) for web archiving. We’re planning to filter these WARC files further to examine metrics like:

  • Size (number of domains, websites and URLs published by a given agency or portfolio, seeded by the Australian Government Organisations Register)
  • Technology (examining whether agencies may be leveraging whole-of-government platforms, services and products, like GovCMS, GA360 and Design System)
  • Quality (examining whether agencies are applying guidance from the Content Guide or the Style manual)
  • Accessibility (examining readability levels, machine-checkable portions of WCAG 2.1, use of CAPTCHA, and non-English usage)
  • Usage (to ensure, where feasible, that any effort spent on improvements is focused on the most frequently used content)
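The Size metric above could start from something as simple as counting crawled URLs per hostname. The sketch below is one possible approach, not the DTA’s actual pipeline: the function name and the example.gov.au URLs are hypothetical, and mapping hostnames to agencies or portfolios (which would need the Australian Government Organisations Register) is left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_per_host(urls):
    """Count URLs per hostname; strings without a hostname are skipped."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
        if host:
            counts[host] += 1
    return counts


# Hypothetical input, standing in for target URIs pulled from a WARC file.
EXAMPLE_URLS = [
    "https://www.example.gov.au/a",
    "https://WWW.example.gov.au/b",
    "https://other.example.gov.au/",
    "not a url",
]

if __name__ == "__main__":
    for host, count in urls_per_host(EXAMPLE_URLS).most_common():
        print(host, count)
```

Counting by hostname keeps the sketch small. Grouping hostnames into registered domains or agencies would need an extra lookup table.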

Once filtered, each URL’s metadata will be ingested as JSON into an Elasticsearch stack and visualised with Kibana. This will allow DTA’s policy and product owners to explore a large, complex time series of reporting metrics in a browser-based reporting environment, ensuring changes to products and policies are backed by sound, reproducible evidence.
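To make the JSON step concrete, the sketch below serialises per-URL metadata into the newline-delimited format that Elasticsearch’s bulk API accepts: an action line followed by the document itself, each terminated by a newline. The index name "wofg-crawl" and the document fields are assumptions made for illustration only; the service’s real schema is not described here.

```python
import json


def to_bulk_ndjson(docs, index_name):
    """Serialise documents as a bulk-API payload (action line + source line each)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical per-URL metadata; the field names are illustrative only.
EXAMPLE_DOCS = [
    {"url": "https://www.example.gov.au/", "crawl_date": "2019-01-01", "word_count": 120},
    {"url": "https://www.example.gov.au/about", "crawl_date": "2019-01-01", "word_count": 85},
]

if __name__ == "__main__":
    print(to_bulk_ndjson(EXAMPLE_DOCS, "wofg-crawl"), end="")
```

The resulting text would be sent to the cluster’s _bulk endpoint with the content type application/x-ndjson. Keeping a crawl date on each document is what would let Kibana chart changes across successive crawls.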

Figure: Flow diagram indicating current state of service (1-4) and next steps (5-9)

Future plans

We’re planning to conduct three more crawls - one every 90 days - examining environmental changes over time.

There are likely to be several other uses for this raw WARC store. In discussions with research organisations and universities, we’re anticipating that this snapshot will be used for reporting on Linked Data usage, generating government ontologies, and informing corpus work for Australian dictionaries.

As always, we’d love to hear your comments and feedback on this dataset - feel free to join the discussion at Open Data’s communities of practice.

Gordon Grace is the product owner for the Whole-of-Australian Government Web Reporting Service at the Digital Transformation Agency.