Tag Archives: topic modelling

TroveKleaner: a Knime workflow for correcting OCR errors

In a nutshell:

  • I created a Knime workflow — the TroveKleaner — that uses a combination of topic modelling, string matching and other methods to correct OCR errors in large collections of texts. You can download it from GitHub.
  • It works, but does not correct all errors. It doesn’t even attempt to do so. Instead of examining every word in the text, it builds a dictionary of high-confidence errors and corrections, and uses the dictionary to make substitutions in the text.
  • It’s worth a look if you plan to perform computational analyses on a large collection of error-ridden digitised texts. It may also be of interest if you want to learn about topic modelling, string matching, ngrams, semantic similarity measures, and how all these things can be used in combination.
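To make the dictionary idea concrete, here is a minimal Python sketch of the substitution step only. It is not how the Knime workflow is implemented, and the error/correction pairs and the `apply_corrections` function are made up for illustration. In the TroveKleaner, the dictionary would be built from the texts themselves.

```python
import re

# Hypothetical dictionary of high-confidence OCR errors and their corrections.
# These pairs are invented for illustration only.
corrections = {"tbe": "the", "aud": "and", "goverument": "government"}


def apply_corrections(text, corrections):
    """Replace whole words found in the dictionary; leave all other words untouched."""

    def fix(match):
        word = match.group(0)
        replacement = corrections.get(word.lower())
        if replacement is None:
            return word  # not a known error, so do not attempt a correction
        if word[0].isupper():
            # Keep a leading capital if the erroneous word had one
            return replacement[0].upper() + replacement[1:]
        return replacement

    # Match runs of letters so that only whole words are looked up
    return re.sub(r"[A-Za-z]+", fix, text)


sample = "Tbe goverument aud tbe peoplc"
fixed = apply_corrections(sample, corrections)
print(fixed)
```

Note that "peoplc" survives in the output: because it is not in the dictionary, nothing tries to fix it. That is the trade-off described above.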

O-C-aarghh!

This post discusses the second in a series of Knime workflows that I plan to release for the purpose of mining newspaper texts from Trove, that most marvellous collection of historical newspapers and much more maintained by the National Library of Australia. The end-game is to release the whole process for geo-parsing and geovisualisation that I presented in this post on my other blog. But revising those workflows and making them fit for public consumption will be a big job (and not one I get paid for), so I’ll work towards it one step at a time.

Already, I have released the Trove KnewsGetter, which interfaces with the Trove API to allow you to download newspaper texts in bulk. But what do you do with 20,000 newspaper articles from Trove?

Before you even think about how to analyse this data, the first thing you will probably do is cast your eyes over it, just to see what it looks like.

Cue horror.

A typical reaction upon seeing Trove’s OCR-derived text for the first time.

A thesis relived: using text analytics to map a PhD journey

 

Your thesis has been deposited.

Is this how four years of toil was supposed to end? Not with a bang, but with a weird sentence from my university’s electronic submission system? In any case, this confirmation message gave me a chuckle and taught me one new thing that could be done to a thesis. A PhD is full of surprises, right till the end.

But to speak of the end could be premature, because more than two months after submission, one thing that my thesis hasn’t been yet is examined. Or if it has been, the examination reports are yet to be deposited back into the collective consciousness of my grad school.

The lack of any news about my thesis is hardly keeping me up at night, but it does make what I am about to do in this post a little awkward. Following Socrates, some people would argue that an unexamined thesis is not worth reliving. At the very least, Socrates might have cautioned against saying too much about a PhD experience that might not yet be over. Well, too bad: I’m throwing that caution to the wind, because what follows is a detailed retrospective of my PhD candidature.

Before anyone starts salivating at the prospect of reading sordid details about existential crises, cruel supervisors or laboratory disasters, let me be clear that what follows is not a psychodrama or a cautionary tale. Rather, I plan to retrace the scholastic journey that I took through my PhD candidature, primarily by examining what I read, and when.

I know, I know: that sounds really boring. But bear with me, because this post is anything but a literature review. This is a data-driven, animated-GIF-laden deep dive into the PhD Experience.

Tracking and comparing regional coverage of coal seam gas

In the last post, I started looking at how the level of coverage of specific regions changed over time — an intersection of the Where and When dimensions of the public discourse on coal seam gas. In this post I’ll continue along this line of analysis while also incorporating something from the Who dimension. Specifically, I’ll compare how news and community groups cover specific regions over time.

Regional coverage by news organisations

One of the graphs in my last post compared the ratio of coverage of locations in Queensland to that of locations in New South Wales. Figure 1 below takes this a step further, breaking down the data by region as well. What this graph shows is the level of attention given to each region by the news sources in my database (filtered to ensure complete coverage for the period — see the last post) over time. In this case, I have calculated the “level of attention” for a given region by counting the number of times a location within that region appears in the news coverage, and then aggregating these counts within a moving 90-day window. Stacking the tallies to fill a fixed height, as I have done in Figure 1, reveals the relative importance of each region, regardless of how much news is generated overall (to see how the overall volume of coverage changes over time, see the previous post). The geographic boundaries that I am using are (with a few minor changes) the SA4 level boundaries defined by the Australian Bureau of Statistics. You can see these boundaries by poking around on this page of the ABS website.
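For anyone who wants to see the calculation spelled out, here is a rough Python sketch of the "level of attention" measure for a single day. It assumes a trailing window (one that ends on the day in question), which is a guess on my part for the purposes of illustration, and the region names, dates and the `windowed_shares` function are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def windowed_shares(mentions, day, window_days=90):
    """Return each region's share of location mentions in the window ending on `day`.

    `mentions` is a list of (date, region) pairs, one per location mention.
    """
    start = day - timedelta(days=window_days - 1)
    # Tally the mentions of each region that fall inside the window
    counts = Counter(region for d, region in mentions if start <= d <= day)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Dividing by the total is what "stacking to fill a fixed height" amounts to
    return {region: n / total for region, n in counts.items()}


# Made-up mentions: two regions, one mention outside the window
mentions = [
    (date(2013, 1, 10), "Region A"),
    (date(2013, 2, 1), "Region A"),
    (date(2013, 3, 1), "Region B"),
    (date(2012, 6, 1), "Region B"),  # too old to count
]

shares = windowed_shares(mentions, date(2013, 3, 31))
print(shares)
```

Repeating this for every day in the period and stacking the shares gives a chart in which the bands always sum to the same height, whatever the overall volume of coverage.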

The regions in Figure 1 are shaded so that you can see the division at the state level. The darker band of blue across the lower half of the graph corresponds with regions in Queensland. The large lighter band above that corresponds with regions in New South Wales. Above that, you can see smaller bands representing Victoria and Western Australia. (The remaining states are there too, but they have received so little coverage that I haven’t bothered to label them.) I have added labels for as many regions as I can without cluttering up the chart.

Figure 1. Coverage of geographic regions in news stories about coal seam gas, measured by the number of times locations from each region are mentioned in news stories within a moving 90-day window. The blue shadings group the regions by state. Hovering over the image shows a colour scheme suited to identifying individual regions. You can see larger versions of these images by clicking here and here.
