Tag Archives: KNIME

TroveKleaner: a Knime workflow for correcting OCR errors

In a nutshell:

  • I created a Knime workflow — the TroveKleaner — that uses a combination of topic modelling, string matching and other methods to correct OCR errors in large collections of texts. You can download it from GitHub.
  • It works, but does not correct all errors. It doesn’t even attempt to do so. Instead of examining every word in the text, it builds a dictionary of high-confidence errors and corrections, and uses the dictionary to make substitutions in the text.
  • It’s worth a look if you plan to perform computational analyses on a large collection of error-ridden digitised texts. It may also be of interest if you want to learn about topic modelling, string matching, ngrams, semantic similarity measures, and how all these things can be used in combination.

O-C-aarghh!

This post discusses the second in a series of Knime workflows that I plan to release for the purpose of mining newspaper texts from Trove, that most marvellous collection of historical newspapers and much more maintained by the National Library of Australia. The end-game is to release the whole process for geo-parsing and geovisualisation that I presented in this post on my other blog. But revising those workflows and making them fit for public consumption will be a big job (and not one I get paid for), so I’ll work towards it one step at a time.

Already, I have released the Trove KnewsGetter, which interfaces with the Trove API to allow you to download newspaper texts in bulk. But what do you do with 20,000 newspaper articles from Trove?

Before you even think about how to analyse this data, the first thing you will probably do is cast your eyes over it, just to see what it looks like.

Cue horror.

A typical reaction upon seeing Trove’s OCR-derived text for the first time. Continue reading

KnewsGetter: a Knime workflow for downloading newspaper texts from Trove

NOTE: This post discusses the most recent version (v2.0) of the Trove KnewsGetter. You can obtain the latest version from the GitHub page.

Around about this time last year, I hatched a side-project to keep me amused while finishing my PhD thesis (which is still being examined, thanks for asking). Keen to apply my new skills in text analytics to something other than my PhD case study (a corpus of news texts about coal seam gas), I decided to try my hand at analysing historical newspapers. In the process, I finally brought my PhD back into contact with the project that led me to commence a PhD in the first place.

I’m talking here about my other blog, which explores (albeit very rarely, these days) the natural history of the part of Brisbane in which I grew up. Pivotal to the inception of that blog was the publicly available collection of historical newspapers on Trove, a wondrous online resource maintained by the National Library of Australia. Having never studied history before, I became an instant deskchair historian when I discovered how easily I could search 100 years of newspapers for the names of streets, waterways, parks — and yes, even people. I trawled Trove for everything I could find about Western Creek and its surrounds, so that I could tell the story how this waterway and its catchment had been transformed by urbanisation.

The wonder that is Trove. This is the search that started me on the slippery slope towards creating a local history blog.

How anyone found the time and patience to study history before there were digitised resources like Trove is beyond me. I cannot even imagine how many person-hours would be needed to replicate the work performed by a single keyword search of Trove’s collection. The act of digitising and indexing textual archives has revolutionised the way in which historical study can be done.

But keyword searches, as powerful as they are, barely scratch the surface of what can be done nowadays with digitised texts. In the age of algorithms, it is possible to not merely index keywords, but to mine textual collections in increasingly sophisticated ways. For example, there are algorithms that can tell the difference between ordinary words and different kinds of named entities, like places or people. Another class of algorithms goes beyond counting individual keywords and instead detect topics — collections of related words that correspond with recurring themes in a collection of texts.

My PhD thesis was largely a meditation on these latter types of algorithms, known as topic models. Along the way, I also used named entity recognition techniques to identify place names and relate them to topics, ultimately enabling me to map the geographic reach of topics in the text.

These were the sorts of techniques that I wanted to bring to apply to Trove’s historical newspapers through my side-project last year. The outcome of this project was a paper that I presented at the Australian Digital Humanities conference in Adelaide in September 2018. To this day, it remains a ‘paper’ in name only, existing only as a slideshow and a lengthy post on my other blog. Releasing some more tangible outputs from this project is on my to-do list for 2019.

An output from last year’s side-project. The map shows associations between words and places mentioned in the Brisbane Courier between 1880 and 1885.

In this post, I am going to share the first in what will hopefully be a series of such outputs. This output is a workflow that performs the foundational step in any data analysis — namely, acquiring the data. I hereby introduce the KnewsGrabber — a Knime workflow for harvesting newspaper articles from Trove. Continue reading

Facteaser: a Knime workflow for parsing Factiva outputs

Most of the cool kids in communication and cultural studies these days are studying social media. Fake news on Facebook, Russian bots on Twitter, maladjusted manboys on Reddit — these are the kinds of research topics that are likely to score you a spot in one of the popular sessions at that big conference that everyone will be going to this year. And for the most part, rightly so, since these platforms have become an integral component of the networked public sphere in which popular culture and political discourse now unfold.

But lurking at the back of the conference programme, in the Friday afternoon sessions when the cool kids have already left for the pub or the airport, you might find some old-timers and young misfits who, for one reason or another, continue to study more traditional, less sexy forms of media. Like newspapers, for example. Or television news. Not so long ago, these were the go-to sources of data if you wanted to make claims about the state of public discourse or the public sphere.

If you don’t member member berries, you need to track down episode 268 of South Park.

Never one to follow the cool kids, I structured my whole PhD around a dataset comprising around 24,000 newspaper articles supplemented with texts from similarly uncool sources like media releases and web pages. One reason for choosing this kind of data is that it enabled me to construct a rich timeline of an issue (coal seam gas development in Australia) that reached back to a time before Twitter and Facebook even existed (member?). Another reason is that long-form texts provided good fodder for the computational methods I was interested in exploring. Topic models tends to work best when applied to texts that are much longer than 140 characters, or even the 280 that Twitter now allows. And even if you are interested primarily in social media, mainstream media can be hard to ignore, because it provides so much of the content that people share and react to on social media anyway.

So there are in fact plenty of reasons why you might still want to study texts from newspapers or news websites in the age of social media. But if you want to keep up with your trending colleagues who boast about their datasets of millions of tweets or Facebook posts assembled through the use of official platform APIs (member?), you might be in for some disappointment. Because while news texts also exist in their millions, sometimes even within single consolidated databases, you will rarely find them offered for download in large quantities or in formats that are amenable to computational analyses. The data is all there, but it is effectively just out of reach. Continue reading