All posts by angusv

The day after the week before: mapping the Twitter discourse about Australia Day

An ode to the 27th of January

The 27th of January is an important day in the Australian calendar. As the fog rises from the Christmas break and the last public holiday for at least eight weeks, this date marks the resumption of business as usual, the start of the new year proper. It is also the moment when millions of Australians breathe a sigh of relief, knowing that the divisive and tiresome debate about the date of Australia Day will now subside for another 358 days, give or take.

The 27th of January is the day after the day when, in 1788, Captain Arthur Phillip sailed into Sydney Cove and planted a British flag. Trailing behind him was a fleet of 11 ships carrying an assortment of convicts, civil officers and free settlers, the first members of a new colonial outpost that would ultimately become the nation state of Australia. Watching the arrival from the shore were the land’s indigenous human inhabitants, custodians of more than 40,000 years of continuous culture and occupation.

Long recognised as the anniversary of the colony’s foundation, the day before the 27th of January was in 1935 adopted by all states and territories as Australia’s national day of celebration. For many years, most Australians were happy with this arrangement. Australia Day was a day of national pride and innocent celebration, a day to have a barbecue, drink some beer, listen to the Hottest 100 countdown and play some beach or backyard cricket, often all at the same time. But now, as the country has finally begun to confront some of the darker chapters of its colonial past, the 26th of January is losing its lustre as a day when such simple pleasures can be enjoyed, let alone pursued in the name of national pride. It turns out that it is rather difficult to drink beer, play cricket and enjoy the past year’s top songs while at the same time contemplating the country’s legacy of dispossession and genocide against its first peoples. (Indeed, Triple J moved its Hottest 100 countdown from Australia Day to the fourth weekend of January in 2018, and this year Cricket Australia chose not to mention Australia Day in its promotion of matches held on 26 January.)

And so, in the third decade of the 21st century, the 27th of January is now the day after tens of thousands of people partake in Invasion Day rallies to plead for meaningful reconciliation and to advocate changing the date of the national day, or to abolish it altogether. The 27th is the day after a day of exasperated commentary about the recipients of Australia Day honours, which in 2015 included Prince Philip, inexplicably knighted by the then prime minister, Tony Abbott; and which this year included Margaret Court, whose legacy as one of the greatest ever tennis players has in recent times been overshadowed by her outspoken and controversial views about homosexuality, gay marriage and transgender people. In short, the 27th of January is the day after a wave of difficult, awkward, and at times ugly public debate peaks and subsides. Until next year.

TextKleaner – a Knime workflow for preparing large text datasets for analysis

This post describes the motivation for the TextKleaner workflow and provides instructions on how to use it. You can obtain the TextKleaner workflow from the KNIME Hub.

Confronting the first law of text analytics

The computational analysis of text, or text analytics for short, is a field that has come into its own in recent years. While computational tools for analysing text have been around for decades (the first notable example being The General Inquirer, developed in the 1960s), the need for such tools has become greater as the amount of textual data that permeates everyday life has increased. Websites, social media and other digital communication technologies have created vast and ever-expanding repositories of text, recording all kinds of human interactions. Meanwhile, more and more texts from previous eras are finding their way into digital form. While all kinds of scholarly, commercial and creative rewards await those who can make sense of this wealth of data, its sheer volume means that it cannot be comprehended in the old-fashioned way (otherwise known as reading). Just as computers are largely responsible for generating and transmitting this data, they are indispensable for managing and understanding it.

Thankfully, computers and the people who program them have both risen to the challenge of grappling with Big Text. As computers have become more and more powerful, the ways in which we use them have become more and more sophisticated. No longer a synonym for glorified word counting, computational text analysis now includes or intersects with fields such as natural language processing (NLP), machine learning and artificial intelligence. Simple formulas that compare word frequencies now work alongside complex algorithms that parse sentences, recognise names, detect sentiments, classify topics, and even compose original texts.

Such technologies are not the sole domain of tech giants and elite scientists, even if they do sit at the core of digital mega-infrastructures such as Google’s search engine or Amazon’s hivemind of personal assistants. Many of them are available as free or open-source software libraries, accessible to anyone with a well-specced laptop and a basic competence in computer science.

And yet, no matter how powerful and accessible these tools are, they have not altered a fact that, in my opinion, could be enshrined as the first law of text analytics: namely, that analysing text with computers is really hard.

Qualitative evaluation of topic models: a methodological offering

Topic models: a Pandora’s Black Box for social scientists

Probabilistic topic modelling is an improbable gift from the field of machine learning to the social sciences and humanities. Just as social scientists began to confront the avalanche of textual data erupting from the internet, and historians and literary scholars started to wonder what they might do with newly digitised archives of books and newspapers, data scientists unveiled a family of algorithms that could distil huge collections of texts into insightful lists of words, each indexed precisely back to the individual texts, all in less time than it takes to write a job ad for a research assistant. Since David Blei and colleagues published their seminal paper on latent Dirichlet allocation (the most basic and still the most widely used topic modelling technique) in 2003, topic models have been put to use in the analysis of everything from news and social media through to political speeches and 19th century fiction.

Grateful for receiving such a thoughtful gift from a field that had previously expressed little interest or affection, social scientists have returned the favour by uncovering all the ways in which machine learning algorithms can reproduce and reinforce existing biases and inequalities in social systems. While these two fields have remained on speaking terms, it’s fair to say that their relationship status is complicated.

Even topic models turned out to be as much a Pandora’s Box as a silver bullet for social scientists hoping to tame Big Text. In helping to solve one problem, topic models created another. This problem, in a word, is choice. Rather than providing a single, authoritative way in which to interpret and code a given textual dataset, topic models present the user with a landscape of possibilities from which to choose. This landscape is defined in part by the model parameters that the user must set. As well as the number of topics to include in the model, these parameters include values that reflect prior assumptions about how documents and topics are composed (these parameters are known as alpha and beta in LDA). 1 Each unique combination of these parameters will result in a different (even if subtly different) set of topics, which in turn could lead to different analytical pathways and conclusions. To make matters worse, merely varying the ‘random seed’ value that initiates a topic modelling algorithm can lead to substantively different results.
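To get a feel for how quickly this landscape of possibilities grows, here is a minimal sketch in Python that simply enumerates candidate configurations. The specific values for the number of topics, alpha, beta and the random seed are invented for illustration and are not recommendations; the point is only that a handful of options per parameter multiplies into a large number of models to compare.

```python
from itertools import product

# Hypothetical values, chosen only to illustrate the combinatorics.
topic_counts = [20, 40, 60]      # number of topics in the model
alphas = [0.01, 0.1, 1.0]        # document-topic prior (alpha)
betas = [0.01, 0.1]              # topic-word prior (beta)
seeds = [1, 2, 3, 4, 5]          # random seeds that initiate the algorithm

# Every unique combination is a distinct candidate model to fit and interpret.
candidates = [
    {"num_topics": k, "alpha": a, "beta": b, "seed": s}
    for k, a, b, s in product(topic_counts, alphas, betas, seeds)
]

print(len(candidates))  # 3 * 3 * 2 * 5 = 90
```

Even this modest grid yields 90 candidate models, each with its own set of topics to make sense of.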

Far from narrowing down the number of possible schemas with which to code and analyse a text, topic models can therefore present the user with a bewildering array of possibilities from which to choose. Rather than lending a stamp of authority or objectivity to a textual analysis, topic models leave social scientists in the familiar position of having to justify the selection of one model of reality over another. But whereas a social scientist would ordinarily be able to explain in detail the logic and assumptions that led them to choose their analytical framework, the average user of a topic model will have only a vague understanding of how their model came into being. Even if the mathematics of topic models is well understood by their creators, topic models will always remain something of a ‘black box’ to many end-users.

This state of affairs is incompatible with any research setting that demands a high degree of rigour, transparency and repeatability in textual analyses. 2 If social scientists are to use topic models in such settings, they need some way to justify their selection of one possible classification scheme over the many others that a topic modelling algorithm could produce, 3 and to account for the analytical opportunities forgone in doing so.

If you’ve ever tried to interpret even a single set of topic model outputs, you’ll know that this is a big ask. Each run of a topic modelling algorithm produces maybe dozens of topics (the exact number is set by the user), each of which in turn consists of dozens (or maybe even hundreds) of relevant words whose collective interpretation constitutes the ‘meaning’ of the topic. Some topics present an obvious interpretation. Some can be interpreted only with the benefit of domain expertise, cross-referencing with original texts, and perhaps even some creative licence. Some topics are distinct in their meaning, while others overlap with each other, or vary only in subtle or mysterious ways. Some topics are just junk.

If making sense of a single topic model 4 is a complex task, comparing one model with another is doubly so. Comparing many models at a time is positively Herculean. How, then, is anyone supposed to compare and evaluate dozens of candidate models sampled from all over the configuration space?

Notes:

  1. The generative model of LDA assumes that each document in a collection is generated from a mixture of hidden variables (topics) from which words are selected to populate the document. The number of topics in the model is a parameter that must be set by the user. The proportions by which topics are mixed to create documents, and by which words are mixed to define topics, are presumed to conform to specific distributions which are sampled from the Dirichlet distribution, which is essentially a distribution of distributions. The shape of these two prior distributions is determined by two parameters—often referred to as hyperparameters to distinguish them from the internal components of the model—which are usually denoted as alpha (α) and beta (β). Whereas alpha controls the presumed specificity of documents (a smaller value means that fewer topics are prominent within a document), beta controls the presumed specificity of topics (a smaller value means that fewer words within a topic are strongly weighted). Like the number of topics, these hyperparameters are set by the user, ideally with some regard for the style and composition of the texts being analysed.
  2. It’s important to recognise that criteria such as transparency and repeatability are not applicable to all textual analysis traditions. Some traditions assume a degree of interpretation and subjectivity that render such criteria all but irrelevant. The probabilistic nature of topic models presents a very different set of challenges and opportunities to such traditions, at least insofar as practitioners are inclined to use them.
  3. That is, assuming that only one fitted topic model is used in the analysis. Conceivably, an analysis could use and compare several models.
  4. In this post, as in much of the literature on topic modelling, the term ‘topic model’ may describe one of two things. The more general sense of the term refers to a particular generative model of text, which may or may not be paired with a specific inference algorithm. In this sense, LDA is one example of a topic model, and the structural topic model is another. The second sense of the term refers to the outputs, in the form of term distributions and document allocations, obtained by applying a topic model in the first sense to a particular collection of texts. (These outputs may also be referred to as a ‘fitted topic model’.) The relevant sense of the term will usually be evident from the context in which it is used.
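The effect of the alpha hyperparameter described in note 1 can be illustrated with a small sketch in Python. It is not an LDA implementation: it only draws topic mixtures from a symmetric Dirichlet distribution (using the standard construction of normalised gamma draws) and measures how dominant the largest topic is on average. The function names, the choice of 10 topics and the two alpha values are assumptions made for illustration.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalising gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=10, trials=500, seed=0):
    """Average proportion taken by the most prominent topic across many draws."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(concentration, size, rng)) for _ in range(trials)]
    return sum(shares) / trials

# Hypothetical values: a small alpha versus a large alpha, with 10 topics.
sparse = mean_largest_share(0.1)    # few topics prominent in each document
diffuse = mean_largest_share(10.0)  # topics mixed more evenly

print(f"alpha=0.1:  largest topic share is about {sparse:.2f}")
print(f"alpha=10.0: largest topic share is about {diffuse:.2f}")
```

With the smaller alpha, a single topic tends to take up most of each simulated document, whereas the larger alpha spreads the proportions more evenly. The same logic applies to beta and the weighting of words within topics.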

TweetKollidR – A Knime workflow for creating text-rich visualisations of Twitter data

Several weeks ago, I posted an analysis of tweets about the restrictions imposed on Melbourne residents in an effort to control an outbreak of Covid-19. That analysis was essentially a road-test of a Knime workflow that I had been piecing together for some time, but that was not quite ready to share. Since writing that post, I have revised and tidied up the workflow so that anyone can use it, and I have made it available on the Knime Hub.

In the present post, I provide a thorough description of the workflow, which I have named the TweetKollidR, and demonstrate its use through a case study of yet another dataset of tweets about Melbourne’s lockdown (which, as I write this, still has not ended, although it has been eased). 1

Notes:

  1. As you will see from the search queries in Figure 3, this dataset includes some keywords that relate to Victoria more generally, rather than just Melbourne. However, since most of the content concerns the Melbourne lockdown, I will continue to refer to it as such.