Tag Archives: digital humanities

TroveKleaner: a Knime workflow for correcting OCR errors

June 3, 2019 | Uncategorised | digital humanities, KNIME, lda, newspapers, text analysis, topic modelling | angusv

In a nutshell:

I created a Knime workflow — the TroveKleaner — that uses a combination of topic modelling, string matching and other methods to correct OCR errors in large collections of texts. You can download it from the KNIME Hub.
It works, but does not correct all errors. It doesn’t even attempt to do so. Instead of examining every word in the text, it builds a dictionary of high-confidence errors and corrections, and uses the dictionary to make substitutions in the text.
It’s worth a look if you plan to perform computational analyses on a large collection of error-ridden digitised texts. It may also be of interest if you want to learn about topic modelling, string matching, ngrams, semantic similarity measures, and how all these things can be used in combination.
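To make the dictionary idea concrete, here is a minimal Python sketch of the substitution step only. It is not taken from the TroveKleaner itself, which is a Knime workflow; the function name `apply_corrections` and the hand-written `corrections` dictionary are hypothetical stand-ins for the dictionary that the workflow builds from the texts.

```python
import re

# Hypothetical error -> correction pairs. In the workflow this dictionary is
# built from the texts themselves; here it is written by hand for illustration.
corrections = {"tbe": "the", "aud": "and", "Brisbaue": "Brisbane"}

def apply_corrections(text, corrections):
    """Swap any whole word found in the dictionary; leave all other words alone."""
    def swap(match):
        word = match.group(0)
        # Exact, case-sensitive lookup: words not in the dictionary pass through.
        return corrections.get(word, word)
    # Match whole runs of letters so that 'tbe' inside a longer word is not touched.
    return re.sub(r"[A-Za-z]+", swap, text)

sample = "The flood reached tbe bridge aud Brisbaue was cut off."
print(apply_corrections(sample, corrections))
```

Because only words listed in the dictionary are changed, any error that never made it into the dictionary stays in the text, which is the trade-off described above.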

O-C-aarghh!

This post discusses the second in a series of Knime workflows that I plan to release for the purpose of mining newspaper texts from Trove, that most marvellous collection of historical newspapers and much more maintained by the National Library of Australia. The end-game is to release the whole process for geo-parsing and geovisualisation that I presented in this post on my other blog. But revising those workflows and making them fit for public consumption will be a big job (and not one I get paid for), so I’ll work towards it one step at a time.

Already, I have released the Trove KnewsGetter, which interfaces with the Trove API to allow you to download newspaper texts in bulk. But what do you do with 20,000 newspaper articles from Trove?

Before you even think about how to analyse this data, the first thing you will probably do is cast your eyes over it, just to see what it looks like.

Cue horror.

A typical reaction upon seeing Trove’s OCR-derived text for the first time.

KnewsGetter: a Knime workflow for downloading newspaper texts from Trove

March 21, 2019 | Uncategorised | digital humanities, KNIME, workflow | angusv

NOTE: This post discusses the most recent version (v2.0) of the Trove KnewsGetter. You can obtain the latest version from the GitHub page.

Around about this time last year, I hatched a side-project to keep me amused while finishing my PhD thesis (which is still being examined, thanks for asking). Keen to apply my new skills in text analytics to something other than my PhD case study (a corpus of news texts about coal seam gas), I decided to try my hand at analysing historical newspapers. In the process, I finally brought my PhD back into contact with the project that led me to commence a PhD in the first place.

I’m talking here about my other blog, which explores (albeit very rarely, these days) the natural history of the part of Brisbane in which I grew up. Pivotal to the inception of that blog was the publicly available collection of historical newspapers on Trove, a wondrous online resource maintained by the National Library of Australia. Having never studied history before, I became an instant deskchair historian when I discovered how easily I could search 100 years of newspapers for the names of streets, waterways, parks, and yes, even people. I trawled Trove for everything I could find about Western Creek and its surrounds, so that I could tell the story of how this waterway and its catchment had been transformed by urbanisation.

The wonder that is Trove. This is the search that started me on the slippery slope towards creating a local history blog.

How anyone found the time and patience to study history before there were digitised resources like Trove is beyond me. I cannot even imagine how many person-hours would be needed to replicate the work performed by a single keyword search of Trove’s collection. The act of digitising and indexing textual archives has revolutionised the way in which historical study can be done.

But keyword searches, as powerful as they are, barely scratch the surface of what can be done nowadays with digitised texts. In the age of algorithms, it is possible not merely to index keywords, but to mine textual collections in increasingly sophisticated ways. For example, there are algorithms that can tell the difference between ordinary words and different kinds of named entities, like places or people. Another class of algorithms goes beyond counting individual keywords and instead detects topics: collections of related words that correspond with recurring themes in a collection of texts.

My PhD thesis was largely a meditation on these latter types of algorithms, known as topic models. Along the way, I also used named entity recognition techniques to identify place names and relate them to topics, ultimately enabling me to map the geographic reach of topics in the text.

These were the sorts of techniques that I wanted to apply to Trove’s historical newspapers through my side-project last year. The outcome of this project was a paper that I presented at the Australian Digital Humanities conference in Adelaide in September 2018. To this day, it remains a ‘paper’ in name only, existing only as a slideshow and a lengthy post on my other blog. Releasing some more tangible outputs from this project is on my to-do list for 2019.

An output from last year’s side-project. The map shows associations between words and places mentioned in the Brisbane Courier between 1880 and 1885.

In this post, I am going to share the first in what will hopefully be a series of such outputs. This output is a workflow that performs the foundational step in any data analysis, namely acquiring the data. I hereby introduce the KnewsGetter, a Knime workflow for harvesting newspaper articles from Trove.

Coding without code: Knime as a tool for digital humanities and computational social science

February 7, 2019 | General pontifications | coding, computational social science, digital humanities, KNIME, networks, text analysis, topic modelling | angusv

The rise of the machines

Over the past several years, I’ve stumbled into two fields of research which, although differing in their subject matter, are united by the recent entrance of computational methods into domains that were previously dominated by manual modes of analysis.

One of these two fields is computational social science. This is the broad category into which my all-but-examined PhD thesis, and most of the posts on this blog, could be placed. Of course, my work, which focuses on the application of text analytics to communication studies, occupies a tiny niche within this larger field, which encompasses the application of all kinds of computational methods to the study of social phenomena, including network science and social simulations using agent-based models.

The other field to which I can claim some degree of membership is the digital humanities. This tag is one way to describe what I do on my other blog, in which I have drawn on digitised maps and newspapers to produce visualisations that explore the history of a suburban environment. History is just one field of the humanities that has been opened up to new possibilities through the extensive digitisation of source materials. Scholars of literature have joined the party as well, while those operating in the GLAM sector, consisting of galleries, libraries, archives, and museums, can now use digital platforms to manage and share their collections.

Though often lumped together in the same faculty (or indeed the same school at my last university), the humanities and social sciences don’t generally have a lot to do with one another. They do, however, have commonalities that lend themselves to similar computational methods. One is the study of text, often in large quantities. Within the social sciences, communication scholars analyse (st)reams of news texts and social media feeds; qualitative sociologists often amass hours of interview transcripts; political scientists analyse speeches and parliamentary transcripts. In the humanities, scholars of history and literature now have at their disposal huge archives of digitised novels, journals, newspapers, and government records. Of course, there is nothing compelling scholars in these fields to analyse all of the available data at once. They are free to sample or cherry-pick as they please. But any scholar wishing to work with large textual datasets in their totality can hardly afford to ignore computational tools like topic models, named entity extraction, sentiment analysis, and domain-specific dictionaries such as LIWC.

Another analytical method that is applicable within both the social sciences and humanities is the study of networks. Sociologists and historians have long used networks to represent and analyse social ties. More recently, social media platforms have hard-wired the logic of networks into social relationships, explicitly representing all of us as nodes tied to one another by virtue of our acquaintances, interactions, and browsing histories. Accordingly, it is becoming increasingly difficult to study social phenomena without invoking the concept of networks, if not also the science of network analysis. By no means restricted to real-world phenomena, network analysis techniques are equally applicable to fictitious societies, whether those in Game of Thrones or the plays of Shakespeare. As with the analysis of text, network analysis doesn’t have to be done with a computer; but as the size of a network grows, so too does the need for computational tools.

Besides an interest in certain computational methods, something else that the digital humanities and computational social sciences have in common is that many of the scholars trying to break into these fields are doing so without any training in data science or computer programming. This is especially the case in the humanities, where traditionally, quantitative methods have been virtually taboo. In some branches of the social sciences, quantitative methods are well established, but these are more likely to take the form of spreadsheets and statistical tests than of machine learning and network analyses.

(Of course, the story is very different for those researchers entering these fields from the other direction, such as the many physicists, mathematicians and computer scientists who have somehow claimed the territory of computational social science. What these researchers lack, although many of them seem not to realise it, is any training in understanding the actual substance of what they are studying.)

The bottom line is that for many researchers, embracing the new opportunities presented by the digital humanities and computational social sciences means learning skills that are totally outside of their disciplinary training. At a minimum, researchers hoping to use computational methods in these fields will need to learn how to calculate statistics and use spreadsheets, and perhaps also how to use network visualisation software like Gephi or text-cleaning software like OpenRefine. Increasingly though, getting your hands dirty in computational social science or the computational end of the digital humanities¹ means learning something far worse: something that is commonly denoted by a certain four-letter word.

Notes:

¹ There really is a difference between the digital in digital humanities and the computational in computational social science. Much of what falls under the digital humanities pertains only to the use and curation of digitised resources, rather than the use of computational methods to analyse these resources. Computational social science, on the other hand, seems to be driven primarily by the desire to analyse social data through computational methods.