Tag Archives: topic modelling

Qualitative evaluation of topic models: a methodological offering

October 22, 2020PhD related, Workflowscovid-19, data visualisation, KNIME, lda, lockdown, PhD, R, text analysis, topic modelling, Twitterangusv

Topic models: a Pandora’s Black Box for social scientists

Probabilistic topic modelling is an improbable gift from the field of machine learning to the social sciences and humanities. Just as social scientists began to confront the avalanche of textual data erupting from the internet, and historians and literary scholars started to wonder what they might do with newly digitised archives of books and newspapers, data scientists unveiled a family of algorithms that could distil huge collections of texts into insightful lists of words, each indexed precisely back to the individual texts, all in less time than it takes to write a job ad for a research assistant. Since David Blei and colleagues published their seminal paper on latent Dirichlet allocation (the most basic and still the most widely used topic modelling technique) in 2003, topic models have been put to use in the analysis of everything from news and social media through to political speeches and 19th century fiction.

Grateful for receiving such a thoughtful gift from a field that had previously expressed little interest or affection, social scientists have returned the favour by uncovering all the ways in which machine learning algorithms can reproduce and reinforce existing biases and inequalities in social systems. While these two fields have remained on speaking terms, it’s fair to say that their relationships status is complicated.

Even topic models turned out to be as much a Pandora’s Box as a silver bullet for social scientists hoping to tame Big Text. In helping to solve one problem, topic models created another. This problem, in a word, is choice. Rather than providing a single, authoritative way in which to interpret and code a given textual dataset, topic models present the user with a landscape of possibilities from which to choose. This landscape is defined in part by the model parameters that the user must set. As well as the number of topics to include in the model, these parameters include values that reflect prior assumptions about how documents and topics are composed (these parameters are known as alpha and beta in LDA). ¹ Each unique combination of these parameters will result in a different (even if subtly different) set of topics, which in turn could lead to different analytical pathways and conclusions. To make matters worse, merely varying the ‘random seed’ value that initiates a topic modelling algorithm can lead to substantively different results.

Far from narrowing down the number of possible schemas with which to code and analyse a text, topic models can therefore present the user with a bewildering array of possibilities from which to choose. Rather than lending a stamp of authority or objectivity to a textual analysis, topic models leave social scientists in the familiar position of having to justify the selection of one model of reality over another. But whereas a social scientist would ordinarily be able to explain in detail the logic and assumptions that led them to choose their analytical framework, the average user of a topic model will have only a vague understanding of how their model came into being. Even if the mathematics of topics models are well understood by their creators, topic models will always remain something of a ‘black box’ to many end-users.

This state of affairs is incompatible with any research setting that demands a high degree of rigour, transparency and repeatability in textual analyses. ² If social scientists are to use topic models in such settings, they need some way to justify their selection of one possible classification scheme over the many others that a topic modelling algorithm could produce, ³ and to account for the analytical opportunities foregone in doing so.

If you’ve ever tried to interpret even a single set of topic model outputs, you’ll know that this is a big ask. Each run of a topic modelling algorithm produces maybe dozens of topics (the exact number is set by the user), each of which in turn consists of dozens (or maybe even hundreds) of relevant words whose collective interpretation constitutes the ‘meaning’ of the topic. Some topics present an obvious interpretation. Some can be interpreted only with the benefit of domain expertise, cross-referencing with original texts, and perhaps even some creative licence. Some topics are distinct in their meaning, while others overlap with each other, or vary only in subtle or mysterious ways. Some topics are just junk.

If making sense of a single topic model ⁴ is a complex task, comparing one model with another is doubly so. Comparing many models at a time is positively Herculean. How, then, is anyone supposed to compare and evaluate dozens of candidate models sampled from all over the configuration space? Continue reading Qualitative evaluation of topic models: a methodological offering →

Notes:

The generative model of LDA assumes that each document in a collection is generated from a mixture of hidden variables (topics) from which words are selected to populate the document. The number of topics in the model is a parameter that must be set by the user. The proportions by which topics are mixed to create documents, and by which words are mixed to define topics, are presumed to conform to specific distributions which are sampled from the Dirichlet distribution, which is essentially a distribution of distributions. The shape of these two prior distributions is determined by two parameters—often referred to as hyperparameters to distinguish them from the internal components of the model—which are usually denoted as alpha (α) and beta (β). Whereas alpha controls the presumed specificity of documents (a smaller value means that fewer topics are prominent within a document), beta controls the presumed specificity of topics (a smaller value means that fewer words within a topic are strongly weighted). Like the number of topics, these hyperparameters are set by the user, ideally with some regard for the style and composition of the texts being analysed. ↩
It’s important to recognise that criteria such as transparency and repeatability are not applicable to all textual analysis traditions. Some traditions assume a degree of interpretation and subjectivity that render such criteria all but irrelevant. The probabilistic nature of topic models presents a very different set of challenges and opportunities to such traditions, at least insofar as practitioners are inclined to use them. ↩
That is, assuming that only one fitted topic model is used in the analysis. Conceivably, an analysis could use and compare several models. ↩
In this post, as in much of the literature on topic modelling, the term ‘topic model’ may describe one of two things. The more general sense of the term refers to a particular generative model of text, which may or may not be paired with a specific inference algorithm. In this sense, LDA is one example of a topic model, and the structural topic model is another. The second sense of the term refers to the outputs, in the form of term distributions and document allocations, obtained by applying a topic model in the first sense to a particular collection of texts. (These outputs may also be referred to as a ‘fitted topic model’.) The relevant sense of the term will usually be evident from the context in which it is used. ↩

Free as in trams: using text analytics to analyse public submissions

July 29, 2020Uncategorisedgovernment, lda, public submissions, text analysis, topic modellingangusv

The opportunity

As documented elsewhere on this blog, I recently spent four years of my life playing with computational methods for analysing text, hoping to advance, in some small way, the use of such methods within social science. Along the way, I became interested in using topic models and related techniques to assist the development of public policy. Governments regularly invite public comment on things like policy proposals, impact assessments, and inquiries into controversial issues. Sometimes, the public’s response can be overwhelming, flooding a government department or parliamentary office with hundreds or thousands of submissions, all of which the government is obliged to somehow ‘consider’.

Not having been directly at the receiving end of this process, I’m not entirely sure how the teams responsible go about ‘considering’ thousands of public submissions. But this task strikes me as an excellent use-case for computational techniques that, with minimal supervision, can reveal thematic structures within large collections of texts. I’m not suggesting that we can delegate to computers the task of reading public submissions: that would be wrong even if it were possible. What we can do, however, is use computers to assist the process of navigating, interpreting and organising an overwhelming number of submissions.

A few years back, I helped a panellist on the Northern Territory’s Scientific Inquiry into Hydraulic Fracturing to analyse concerns about social impacts expressed in more than 600 public submissions. Rather than manually reading every submission to see which ones were relevant, I used a computational technique called probabilistic topic modelling to automatically index the submissions according to the topics they discussed. I was then able to focus my attention on those submissions that discussed social impacts, making the job a whole lot easier than it otherwise would have been. In addition, the topic model helped me to categorise the submissions according to the types of social impacts they discussed, and provided a direct measurement of how much attention each type of impact had received.

This experience proved that computational text analysis methods can indeed be useful for assessing public input to policy processes. However, it was far from perfect case study, as I was operating only on the periphery of the assessment process. The value of computational methods could be even greater if they were incorporated into the process from the outset. In that case, for example, I could have indexed the submissions against topics besides social impacts. As well as making life easier for the panellists responsible for other topics, a more complete topical index would have enabled an easy analysis of which issues were of most interest to each category of stakeholder, or to all submitters taken together.

In this post, I want to illustrate how topic modelling and other computational text analysis methods can contribute to the assessment of public submissions to policy issues. I do this by performing a high-level analysis of submissions to the Victorian parliament about a proposal to expand Melbourne’s ‘free tram zone’. I chose this particular inquiry because it has not yet concluded (submissions have closed, but the report is not due until December) and because it received more than 400 hundred submissions, which although perhaps not an overwhelming number, is surely enough to create a sense of foreboding in the person who has to read them all.

This analysis is meant to be demonstrative rather than definitive. The methods I’ve used are experimental and could be refined. More importantly, these methods are not supposed to stand on their own, but rather should be integrated into the rest of the analytical process, which obviously I am not doing, since I do not work for the Victorian Government. In other words, my aim here is not to provide an authoritative take on the content of the submissions, but to demonstrate how certain computational methods could assist the task of analysing these submissions. Continue reading Free as in trams: using text analytics to analyse public submissions →

TroveKleaner: a Knime workflow for correcting OCR errors

June 3, 2019Uncategoriseddigital humanities, KNIME, lda, newspapers, text analysis, topic modellingangusv

In a nutshell:

I created a Knime workflow — the TroveKleaner — that uses a combination of topic modelling, string matching and other methods to correct OCR errors in large collections of texts. You can download it from the KNIME Hub.
It works, but does not correct all errors. It doesn’t even attempt to do so. Instead of examining every word in the text, it builds a dictionary of high-confidence errors and corrections, and uses the dictionary to make substitutions in the text.
It’s worth a look if you plan to perform computational analyses on a large collection of error-ridden digitised texts. It may also be of interest if you want to learn about topic modelling, string matching, ngrams, semantic similarity measures, and how all these things can be used in combination.

O-C-aarghh!

This post discusses the second in a series of Knime workflows that I plan to release for the purpose of mining newspaper texts from Trove, that most marvellous collection of historical newspapers and much more maintained by the National Library of Australia. The end-game is to release the whole process for geo-parsing and geovisualisation that I presented in this post on my other blog. But revising those workflows and making them fit for public consumption will be a big job (and not one I get paid for), so I’ll work towards it one step at a time.

Already, I have released the Trove KnewsGetter, which interfaces with the Trove API to allow you to download newspaper texts in bulk. But what do you do with 20,000 newspaper articles from Trove?

Before you even think about how to analyse this data, the first thing you will probably do is cast your eyes over it, just to see what it looks like.

Cue horror.

A typical reaction upon seeing Trove’s OCR-derived text for the first time. Continue reading TroveKleaner: a Knime workflow for correcting OCR errors →

Coding without code: Knime as a tool for digital humanities and computational social science

February 7, 2019General pontificationscoding, computational social science, digital humanities, KNIME, networks, text analysis, topic modellingangusv

The rise of the machines

Over the past several years, I’ve stumbled into two fields of research which, although differing in their subject matter, are united by the recent entrance of computational methods into domains that were previously dominated by manual modes of analysis.

One of these two fields is computational social science. This is the broad category into which my all-but-examined PhD thesis, and most of the posts on this blog, could be placed. Of course, my work, which focuses on the application of text analytics to communication studies, occupies a tiny niche within this larger field, which encompasses the use of all kinds of computational methods to the study of social phenomena, including network science and social simulations using agent-based models.

The other field to which I can claim some degree of membership is the digital humanities. This tag is one way to describe what I do on my other blog, in which I have drawn on digitised maps and newspapers to produce visualisations that explore the history of a suburban environment. History is just one field of the humanities that has been opened up to new possibilities through the extensive digitisation of source materials. Scholars of literature have joined the party as well, while those operating in the GLAM sector, consisting of galleries, libraries, and museums, can now use digital platforms to manage and share their collections.

Though often lumped together in the same faculty (or indeed the same school at my last university), the humanities and social sciences don’t generally have a lot to do with one another. They do, however, have commonalities that lend themselves to similar computational methods. One is the study of text, often in large quantities. Within the social sciences, communication scholars analyse (st)reams of news texts and social media feeds; qualitative sociologists often amass hours of interview transcripts; political scientists analyse speeches and parliamentary transcripts. In the humanities, scholars of history and literature now have at their disposal huge archives of digitised novels, journals, newspapers, and government records. Of course, there is nothing compelling scholars in these fields to analyse all of the available data at once. They are free to sample or cherry-pick as they please. But any scholar wishing to work with large textual datasets in their totality can hardly afford to ignore computational tools like topic models, named entity extraction, sentiment analysis, and domain-specific dictionaries such as LIWC.

Another analytical method that is applicable within both the social sciences and humanities is the study of networks. Sociologists and historians have long used networks to represent and analyse social ties. More recently, social media platforms have hard-wired the logic of networks into social relationships, explicitly representing all of us as nodes tied to one another by virtue of our acquaintances, interactions, and browsing histories. Accordingly, it is becoming increasingly difficult to study social phenomena without invoking the concept of networks, if not also the science of network analysis. By no means restricted to real-world phenomena, network analysis techniques are equally applicable to fictitious societies, whether those in Game of Thrones or the plays of Shakespeare. As with the analysis of text, network analysis doesn’t have to be done with a computer; but as the size of a network grows, so too does the need for computational tools.

Besides an interest in certain computational methods, something else that the digital humanities and computational social sciences have in common is that many of the scholars trying to break into these fields are doing so without any training in data science or computer programming. This is especially the case in the humanities, where traditionally, quantitative methods have been virtually taboo. In some branches of the social sciences, quantitative methods are well established, but these are more likely to take the form of spreadsheets and statistical tests rather than machine learning and network analyses.

(Of course, the story is very different for those researchers entering these fields from the other direction, such as the many physicists, mathematicians and computer scientists who have somehow claimed the territory of computational social science. What these researchers lack, although many of them seem not to realise it, is any training in understanding the actual substance of what they are studying.)

The bottom line is that for many researchers, embracing the new opportunities presented by the digital humanities and computational social sciences means learning skills that are totally outside of their disciplinary training. At a minimum, researchers hoping to use computational methods in these fields will need to learn how to calculate statistics and use spreadsheets, and perhaps also how to use network visualisation software like Gephi or text-cleaning software like OpenRefine. Increasingly though, getting your hands dirty in computational social science or the computational end of the digital humanities ¹ means learning something far worse — something that is commonly denoted by a certain four-letter word. Continue reading Coding without code: Knime as a tool for digital humanities and computational social science →

Notes:

There really is a difference between the digital in digital humanities and the computational in computational social science. Much of what falls under the digital humanities pertains only to the use and curation of digitised resources, rather than the use of computational methods to analyse these resources. Computational social science, on the other hand, seems to be driven primarily by the desire to analyse social data through computational methods. ↩