All posts by angusv

Free as in trams: using text analytics to analyse public submissions

The opportunity

As documented elsewhere on this blog, I recently spent four years of my life playing with computational methods for analysing text, hoping to advance, in some small way, the use of such methods within social science. Along the way, I became interested in using topic models and related techniques to assist the development of public policy. Governments regularly invite public comment on things like policy proposals, impact assessments, and inquiries into controversial issues. Sometimes, the public’s response can be overwhelming, flooding a government department or parliamentary office with hundreds or thousands of submissions, all of which the government is obliged to somehow ‘consider’.

Not having been directly at the receiving end of this process, I’m not entirely sure how the teams responsible go about ‘considering’ thousands of public submissions. But this task strikes me as an excellent use-case for computational techniques that, with minimal supervision, can reveal thematic structures within large collections of texts. I’m not suggesting that we can delegate to computers the task of reading public submissions: that would be wrong even if it were possible. What we can do, however, is use computers to assist the process of navigating, interpreting and organising an overwhelming number of submissions.

A few years back, I helped a panellist on the Northern Territory’s Scientific Inquiry into Hydraulic Fracturing to analyse concerns about social impacts expressed in more than 600 public submissions. Rather than manually reading every submission to see which ones were relevant, I used a computational technique called probabilistic topic modelling to automatically index the submissions according to the topics they discussed. I was then able to focus my attention on those submissions that discussed social impacts, making the job a whole lot easier than it otherwise would have been. In addition, the topic model helped me to categorise the submissions according to the types of social impacts they discussed, and provided a direct measurement of how much attention each type of impact had received.
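
To make the indexing idea concrete, here is a minimal sketch in Python. It is not the workflow I used for the inquiry. It assumes a topic model has already been fitted and has produced, for each submission, a set of topic proportions. The submission ids, topic names, numbers and the 0.2 threshold are all invented for illustration.

```python
# Hypothetical document-topic proportions, of the kind a fitted topic model
# might output. Every id, topic label and number here is made up.
doc_topics = {
    "submission_001": {"water": 0.70, "social_impacts": 0.05, "economy": 0.25},
    "submission_002": {"water": 0.10, "social_impacts": 0.60, "economy": 0.30},
    "submission_003": {"water": 0.20, "social_impacts": 0.30, "economy": 0.50},
}


def relevant_docs(doc_topics, topic, threshold=0.2):
    """Return ids of documents whose weight on `topic` is at least
    `threshold`, ordered from highest weight to lowest."""
    hits = [(doc_id, weights.get(topic, 0.0)) for doc_id, weights in doc_topics.items()]
    hits = [(doc_id, w) for doc_id, w in hits if w >= threshold]
    return [doc_id for doc_id, _ in sorted(hits, key=lambda pair: pair[1], reverse=True)]


def topic_share(doc_topics, topic):
    """Average weight of `topic` across all documents: a rough measure of
    how much attention the topic received (ignores document length)."""
    if not doc_topics:
        return 0.0
    return sum(w.get(topic, 0.0) for w in doc_topics.values()) / len(doc_topics)
```

With this toy data, `relevant_docs(doc_topics, "social_impacts")` returns `["submission_002", "submission_003"]`, which is the reading list for someone interested in social impacts, and `topic_share(doc_topics, "social_impacts")` gives the average share of attention that topic received.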

This experience proved that computational text analysis methods can indeed be useful for assessing public input to policy processes. However, it was far from a perfect case study, as I was operating only on the periphery of the assessment process. The value of computational methods could be even greater if they were incorporated into the process from the outset. In that case, for example, I could have indexed the submissions against topics besides social impacts. As well as making life easier for the panellists responsible for other topics, a more complete topical index would have enabled an easy analysis of which issues were of most interest to each category of stakeholder, or to all submitters taken together.

In this post, I want to illustrate how topic modelling and other computational text analysis methods can contribute to the assessment of public submissions on policy issues. I do this by performing a high-level analysis of submissions to the Victorian parliament about a proposal to expand Melbourne’s ‘free tram zone’. I chose this particular inquiry because it has not yet concluded (submissions have closed, but the report is not due until December) and because it received more than 400 submissions, which, although perhaps not an overwhelming number, is surely enough to create a sense of foreboding in the person who has to read them all.

This analysis is meant to be demonstrative rather than definitive. The methods I’ve used are experimental and could be refined. More importantly, these methods are not supposed to stand on their own, but rather should be integrated into the rest of the analytical process, which obviously I am not doing, since I do not work for the Victorian Government. In other words, my aim here is not to provide an authoritative take on the content of the submissions, but to demonstrate how certain computational methods could assist the task of analysing these submissions.

HeatTraKR – A Knime workflow for exploring Australian climate data

Recently, I decided to crunch some data from the Australian Bureau of Meteorology (which I’ll just call BoM) to assess some of my own perceptions about how the climate in my home city of Brisbane had changed throughout my lifetime. As always, I performed the analysis in Knime, a free and open software platform that allows you to do highly sophisticated and repeatable data analyses without having to learn how to code. Along the way, I also took the opportunity to sharpen my skills at using R as a platform for making data visualisations, which is something that Knime doesn’t do quite as well.

The result of this process is HeatTraKR, a Knime workflow for analysing and visualising climate data from the Australian Bureau of Meteorology, principally the Australian Climate Observations Reference Network – Surface Air Temperature (ACORN-SAT) dataset, which has been developed specifically to monitor climate variability and change in Australia. The workflow uses Knime’s native functionality to download, prepare and manipulate the data, but calls upon R to create the visual outputs. (The workflow does allow you to create the plots with Knime’s native nodes, but they are not as nice as the R versions.)

I’ve already used the HeatTraKR to produce this post about how the climate in Brisbane and Melbourne (my new home city) is changing. But the workflow has some capabilities that are not showcased in that post, and I will take the opportunity to demonstrate these a little later in the present post.

Below I explain how to install and use the HeatTraKR, and take a closer look at some of its outputs that I have not already discussed in my other post.

Confessions of a climate deserter

For so long, climate change has been discussed in Australia (and indeed elsewhere) as if it were an abstract concept, a threat that looms somewhere in the future. Not anymore. In 2019, climate change became a living nightmare from which Australia may never awake.

While I prepared this post in the dying weeks of 2019 and the beginning of 2020, there was not a day when some part of the country was not on fire. As at 24 January, more than 7.7 million hectares — that’s an area about the size of the Czech Republic — have burned. Thirty-three people have died. Towns have been destroyed. Old-growth forests have burned. Around a billion animals have been killed. Whole species have probably been lost.

The effects were not only felt in the bush. Capital cities such as Sydney, Melbourne and Canberra endured scorching temperatures while choking in smoke. Newspaper front pages (except those of the Murdoch press) became a constant variation on the theme of red. The country entered a state of collective trauma, as if at war with an unseen and invincible enemy.

The connection between the bushfires and climate change has been accepted by nearly everyone, with the notable exception of certain denialists who happen to be running the country, and even they are starting to change their tune (albeit to one of ‘adaptation and resilience’). One thing that is undeniable is that 2019 was both the hottest and driest year Australia has experienced since records began, and by no small margin. In December, the record for the country’s hottest day was smashed twice in a single week. And this year was not an aberration. Eight of the ten hottest years on record occurred in the last ten years. Environmentally, politically, and culturally, the country is in uncharted territory.

Climate deserters

I watched this nightmare unfold from my newly adopted city of Melbourne, to which I moved from Brisbane with my then-fiancée-now-wife in January 2019. As far as I can tell, Melbourne has been one of the better places in the country to have been in the past few months. The summer here has been pleasantly mild so far, save for a few horrific days when northerly winds baked the city and flames lapped at the northern suburbs. It seems that relief from the heat is never far away in Melbourne: the cool change always comes, tonight or tomorrow if not this afternoon. During the final week of 2019, as other parts of Victoria remained an inferno, Melbourne reverted to temperatures in the low 20s. We even got some rain. It was almost embarrassing.

Finding relief from the heat is one of the reasons my wife and I moved to Melbourne. Having lived in Brisbane all of our lives, we were used to its subtropical summers, but the last few pushed us over the edge. To be sure, Brisbane rarely sees extreme heat. In summer, the maximums hover around 30 degrees, and rarely get beyond the mid-30s. But as Brisbanites are fond of saying (especially to southerners), it’s not the heat, it’s the humidity that gets you. The temperature doesn’t have to be much above 30 degrees in Brisbane before comfort levels become thoroughly unreasonable.

TroveKleaner: a Knime workflow for correcting OCR errors

In a nutshell:

  • I created a Knime workflow — the TroveKleaner — that uses a combination of topic modelling, string matching and other methods to correct OCR errors in large collections of texts. You can download it from GitHub.
  • It works, but does not correct all errors. It doesn’t even attempt to do so. Instead of examining every word in the text, it builds a dictionary of high-confidence errors and corrections, and uses the dictionary to make substitutions in the text.
  • It’s worth a look if you plan to perform computational analyses on a large collection of error-ridden digitised texts. It may also be of interest if you want to learn about topic modelling, string matching, ngrams, semantic similarity measures, and how all these things can be used in combination.
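
The dictionary-and-substitute idea in the second point can be sketched in a few lines of Python. This is only an illustration of the general technique, not the TroveKleaner itself, which is a Knime workflow. The error and correction pairs below are invented, and the sketch assumes the dictionary has already been built by some earlier step.

```python
import re

# Hypothetical high-confidence error -> correction pairs. In practice these
# would come from an earlier dictionary-building step; these are made up.
corrections = {"tbe": "the", "aud": "and", "Melbourno": "Melbourne"}


def apply_corrections(text, corrections):
    """Replace whole-word occurrences of known errors with their corrections.

    Matching is case-sensitive, and words not in the dictionary are left
    untouched, so this does not attempt to fix every error in the text.
    """
    if not corrections:
        return text
    # Longest keys first so that a longer error is preferred over a shorter
    # one that happens to be its prefix.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: corrections[match.group(1)], text)
```

For example, `apply_corrections("tbe tram aud tbe train left Melbourno", corrections)` returns `"the tram and the train left Melbourne"`. The word boundaries matter: a word such as "audit" is left alone even though it begins with the error "aud".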


This post discusses the second in a series of Knime workflows that I plan to release for the purpose of mining newspaper texts from Trove, that most marvellous collection of historical newspapers and much more maintained by the National Library of Australia. The end-game is to release the whole process for geo-parsing and geovisualisation that I presented in this post on my other blog. But revising those workflows and making them fit for public consumption will be a big job (and not one I get paid for), so I’ll work towards it one step at a time.

Already, I have released the Trove KnewsGetter, which interfaces with the Trove API to allow you to download newspaper texts in bulk. But what do you do with 20,000 newspaper articles from Trove?

Before you even think about how to analyse this data, the first thing you will probably do is cast your eyes over it, just to see what it looks like.

Cue horror.

A typical reaction upon seeing Trove’s OCR-derived text for the first time.