Gordon Grace | 07 Sep 2018
The Digital Transformation Agency (DTA) works to deliver better, faster, simpler digital services. As part of our charter, we’ve undertaken work to examine how agencies have adopted the DTA’s guidance, products and services on public-facing Australian Government websites.
This service is in an early beta stage, and we’ve recently gathered over 5 million URLs as part of this exercise.
The DTA has published the Whole-of-Australian Government Web Crawl dataset on data.gov.au as Web ARChive (WARC) files, in parts and whole.
The dataset is large - the largest so far on data.gov.au. We’ve made it available both as a single 66GB WARC file, and as a series, split into 57 smaller WARC files. We’d suggest you download a smaller WARC file (approx. 1.1GB) as a sample first.
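For readers who want to look inside a sample file, here is a minimal sketch of walking a WARC file record by record using only the Python standard library. It assumes the file is an uncompressed .warc (a gzipped .warc.gz would need to be opened with gzip.open instead), and the function names iter_warc_records and response_uris are hypothetical helpers written for this illustration, not part of any DTA tooling. A dedicated WARC library is likely a better choice for serious work.

```python
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    Header names are lower-cased so lookups do not depend on capitalisation.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The body is exactly Content-Length bytes, whatever it contains.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def response_uris(stream: BinaryIO) -> List[str]:
    """Collect the target URI of every 'response' record in the stream."""
    return [
        headers["warc-target-uri"]
        for headers, _body in iter_warc_records(stream)
        if headers.get("warc-type") == "response" and "warc-target-uri" in headers
    ]
```

A call such as response_uris(open("sample.warc", "rb")) would then list the crawled URLs in one of the smaller files, where sample.warc is a placeholder for whichever part you downloaded.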
WARC is a recognised ISO standard (ISO 28500) for web archiving. We’re planning to filter these WARC files further to examine metrics like:
Once filtered, each URL’s metadata will be loaded as JSON into an Elasticsearch stack and visualised in Kibana. This will allow DTA’s policy and product owners to explore a large, complex time series of reporting metrics in a browser-based reporting environment, ensuring changes to products and policies are backed by sound, reproducible evidence.
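As a rough illustration of the JSON loading step, the sketch below turns per-URL metadata into the newline-delimited body that Elasticsearch's bulk endpoint accepts: an action line followed by a document line for each record. The index name wog-web-crawl and the fields url, crawl_date and status are assumptions made up for this example; the actual schema has not been published here.

```python
import json
from typing import Dict, Iterable


def to_bulk_ndjson(docs: Iterable[Dict[str, object]], index: str) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch _bulk API."""
    lines = []
    for doc in docs:
        # Action line: index the following document into the given index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc, sort_keys=True))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical per-URL metadata; field names are illustrative only.
example_docs = [
    {"url": "https://www.example.gov.au/", "crawl_date": "2018-09-07", "status": 200},
]
payload = to_bulk_ndjson(example_docs, index="wog-web-crawl")
```

The resulting payload would be sent with a POST to the cluster's /_bulk endpoint using the Content-Type application/x-ndjson. Tagging each document with its crawl date is one way the repeated crawls could later be compared as a time series in Kibana.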
We’re planning to conduct three more crawls - one every 90 days - examining environmental changes over time.
There are likely to be several other uses for this raw WARC store. In discussions with research organisations and universities, we’re anticipating that this snapshot will be used for reporting on Linked Data usage, generating government ontologies, and informing corpus work for Australian dictionaries.
As always, we’d love to hear your comments and feedback on this dataset - feel free to join the discussion at Open Data’s communities of practice.
Gordon Grace is the product owner for the Whole-of-Australian Government Web Reporting Service at the Digital Transformation Agency.