Category Archives: PhD related

Qualitative evaluation of topic models: a methodological offering

Topic models: a Pandora’s Black Box for social scientists

Probabilistic topic modelling is an improbable gift from the field of machine learning to the social sciences and humanities. Just as social scientists began to confront the avalanche of textual data erupting from the internet, and historians and literary scholars started to wonder what they might do with newly digitised archives of books and newspapers, data scientists unveiled a family of algorithms that could distil huge collections of texts into insightful lists of words, each indexed precisely back to the individual texts, all in less time than it takes to write a job ad for a research assistant. Since David Blei and colleagues published their seminal paper on latent Dirichlet allocation (the most basic and still the most widely used topic modelling technique) in 2003, topic models have been put to use in the analysis of everything from news and social media through to political speeches and 19th century fiction.

Grateful for receiving such a thoughtful gift from a field that had previously expressed little interest or affection, social scientists have returned the favour by uncovering all the ways in which machine learning algorithms can reproduce and reinforce existing biases and inequalities in social systems. While these two fields have remained on speaking terms, it’s fair to say that their relationships status is complicated.

Even topic models turned out to be as much a Pandora’s Box as a silver bullet for social scientists hoping to tame Big Text. In helping to solve one problem, topic models created another. This problem, in a word, is choice. Rather than providing a single, authoritative way in which to interpret and code a given textual dataset, topic models present the user with a landscape of possibilities from which to choose. This landscape is defined in part by the model parameters that the user must set. As well as the number of topics to include in the model, these parameters include values that reflect prior assumptions about how documents and topics are composed (these parameters are known as alpha and beta in LDA). 1 Each unique combination of these parameters will result in a different (even if subtly different) set of topics, which in turn could lead to different analytical pathways and conclusions. To make matters worse, merely varying the ‘random seed’ value that initiates a topic modelling algorithm can lead to substantively different results.

Far from narrowing down the number of possible schemas with which to code and analyse a text, topic models can therefore present the user with a bewildering array of possibilities from which to choose. Rather than lending a stamp of authority or objectivity to a textual analysis, topic models leave social scientists in the familiar position of having to justify the selection of one model of reality over another. But whereas a social scientist would ordinarily be able to explain in detail the logic and assumptions that led them to choose their analytical framework, the average user of a topic model will have only a vague understanding of how their model came into being. Even if the mathematics of topics models are well understood by their creators, topic models will always remain something of a ‘black box’ to many end-users.

This state of affairs is incompatible with any research setting that demands a high degree of rigour, transparency and repeatability in textual analyses. 2 If social scientists are to use topic models in such settings, they need some way to justify their selection of one possible classification scheme over the many others that a topic modelling algorithm could produce, 3 and to account for the analytical opportunities foregone in doing so.

If you’ve ever tried to interpret even a single set of topic model outputs, you’ll know that this is a big ask. Each run of a topic modelling algorithm produces maybe dozens of topics (the exact number is set by the user), each of which in turn consists of dozens (or maybe even hundreds) of relevant words whose collective interpretation constitutes the ‘meaning’ of the topic. Some topics present an obvious interpretation. Some can be interpreted only with the benefit of domain expertise, cross-referencing with original texts, and perhaps even some creative licence. Some topics are distinct in their meaning, while others overlap with each other, or vary only in subtle or mysterious ways. Some topics are just junk.

If making sense of a single topic model 4 is a complex task, comparing one model with another is doubly so. Comparing many models at a time is positively Herculean. How, then, is anyone supposed to compare and evaluate dozens of candidate models sampled from all over the configuration space? Continue reading

Notes:

  1. The generative model of LDA assumes that each document in a collection is generated from a mixture of hidden variables (topics) from which words are selected to populate the document. The number of topics in the model is a parameter that must be set by the user. The proportions by which topics are mixed to create documents, and by which words are mixed to define topics, are presumed to conform to specific distributions which are sampled from the Dirichlet distribution, which is essentially a distribution of distributions. The shape of these two prior distributions is determined by two parameters—often referred to as hyperparameters to distinguish them from the internal components of the model—which are usually denoted as alpha (α) and beta (β). Whereas alpha controls the presumed specificity of documents (a smaller value means that fewer topics are prominent within a document), beta controls the presumed specificity of topics (a smaller value means that fewer words within a topic are strongly weighted). Like the number of topics, these hyperparameters are set by the user, ideally with some regard for the style and composition of the texts being analysed.
  2. It’s important to recognise that criteria such as transparency and repeatability are not applicable to all textual analysis traditions. Some traditions assume a degree of interpretation and subjectivity that render such criteria all but irrelevant. The probabilistic nature of topic models presents a very different set of challenges and opportunities to such traditions, at least insofar as practitioners are inclined to use them.
  3. That is, assuming that only one fitted topic model is used in the analysis. Conceivably, an analysis could use and compare several models.
  4. In this post, as in much of the literature on topic modelling, the term ‘topic model’ may describe one of two things. The more general sense of the term refers to a particular generative model of text, which may or may not be paired with a specific inference algorithm. In this sense, LDA is one example of a topic model, and the structural topic model is another. The second sense of the term refers to the outputs, in the form of term distributions and document allocations, obtained by applying a topic model in the first sense to a particular collection of texts. (These outputs may also be referred to as a ‘fitted topic model’.) The relevant sense of the term will usually be evident from the context in which it is used.

A thesis relived: using text analytics to map a PhD journey

 

Your thesis has been deposited.

Is this how four years of toil was supposed to end? Not with a bang, but with a weird sentence from my university’s electronic submission system? In any case, this confirmation message gave me a chuckle and taught me one new thing that could be done to a thesis. A PhD is full of surprises, right till the end.

But to speak of the end could be premature, because more than two months after submission, one thing that my thesis hasn’t been yet is examined. Or if it has been, the examination reports are yet to be deposited back into the collective consciousness of my grad school.

The lack of any news about my thesis is hardly keeping me up at night, but it does make what I am about to do in this post a little awkward. Following Socrates, some people would argue that an unexamined thesis is not worth reliving. At the very least, Socrates might have cautioned against saying too much about a PhD experience that might not yet be over. Well, too bad: I’m throwing that caution to the wind, because what follows is a detailed retrospective of my PhD candidature.

Before anyone starts salivating at the prospect of reading sordid details about about existential crises, cruel supervisors or laboratory disasters, let me be clear that what follows is not a psychodrama or a cautionary tale. Rather, I plan to retrace the scholastic journey that I took through my PhD candidature, primarily by examining what I read, and when.

I know, I know: that sounds really boring. But bear with me, because this post is anything but a literature review. This is a data-driven, animated-GIF-laden, deep-dive into the PhD Experience. Continue reading

Tracking and comparing regional coverage of coal seam gas

In the last post, I started looking at how the level of coverage of specific regions changed over time — an intersection of the Where and When dimensions of the public discourse on coal seam gas. In this post I’ll continue along this line of analysis while also incorporating something from the Who dimension. Specifically, I’ll compare how news and community groups cover specific regions over time.

Regional coverage by news organisations

One of the graphs in my last post compared the ratio of coverage of locations in Queensland to that of locations in New South Wales. Figure 1 below takes this a step further, breaking down the data by region as well. What this graph shows is the level of attention given to each region by the news sources in my database (filtered to ensure complete coverage for the period — see the last post) over time. In this case, I have calculated the “level of attention” for a given region by counting the number of times a location within that region appears in the news coverage, and then aggregating these counts within a moving 90-day window. Stacking the tallies to fill a fixed height, as I have done in Figure 1, reveals the relative importance of each region, regardless of how much news is generated overall (to see how the overall volume of coverage changes over time, see the previous post). The geographic boundaries that I am using are (with a few minor changes) the SA4 level boundaries defined by the Australian Bureau of Statistics. You can see these boundaries by poking around on this page of the ABS website.

The regions in Figure 1 are shaded so that you can see the division at the state level. The darker band of blue across the lower half of the graph corresponds with regions in Queensland. The large lighter band above that corresponds with regions in New South Wales. Above that, you can see smaller bands representing Victoria and Western Australia. (The remaining states are there too, but they have received so little coverage that I haven’t bothered to label them.) I have added labels for as many regions as I can without cluttering up the chart.

Figure 1. Coverage of geographic regions in news stories about coal seam gas, measured by the number of times locations from each region are mentioned in news stories within a moving 90-day window. The blue shadings group the regions by state. Hovering over the image shows a colour scheme suited to identifying individual regions. You can see larger versions of these images by clicking here and here.

Continue reading

It’s time

The last two posts have updated my progress in understanding the Where and the Who of public discourse on coal seam gas, but didn’t say much about the When. Analysing the temporal dynamics of public discourse — in other words, how things change — has been one of my driving interests all along in this project, so to complete this series of stock-taking articles, I will now review where I’m up to in analysing the temporal dimension.

At least, I had hoped to complete the stock-taking process with this post. But in the course of putting this post together, I made some somewhat embarrassing discoveries about the temporal composition of my data — discoveries that have significant implications for all of my analyses. This post is dedicated mostly to dealing with this new development. I’ll present the remainder of what I planned to talk about in a second installment.

The experience I describe here contains important lessons for anyone planning to analyse data obtained from news aggregation services as Factiva.

The moving window of time

The first thing to mention — and this is untainted by the embarrassment that I will discuss shortly — is that I’ve changed the way I’m making temporal graphs. Whereas previously I was simply aggregating data into monthly or quarterly chunks, I am now using KNIME’s ‘Moving Aggregation’ node to calculate moving averages over a specified window of time. This way, I can tailor the level of aggregation to the density of the data and the purpose of the graph. And regardless of the size of the time window, the time increments by which the graph is plotted can be as short as a week or a day, so the curve is smoother than a simple monthly or quarterly plot.

One reason why this feature is so useful is that the volume of news coverage on coal seam gas over time is very peaky, as shown in Figure 1 (and even the 30-day window hides a considerable degree of peakiness). Smoothing out the peaks to see long-term trends is all well and good, but it’s important never to lose touch with the fact that the data doesn’t really look that way.

Figure 1. The number of articles in my corpus over time, aggregated to a 30-day moving window. Hovering over the image shows the same data aggregated to a 90-day window. Continue reading