Tag Archives: topic modelling

Looking for letters

In the posts I’ve written to date, I’ve learned some interesting things about my corpus of 40,000 news articles. I’ve seen how the articles are distributed over time and space. I’ve seen the locations they talk about, and how this shifts over time. And I’ve created a thematic index to see what it’s all about. But I’ve barely said anything about the articles themselves. I’ve written nothing, for example, about how they vary in their format, style, and purpose.

To some extent, such concerns are of secondary importance to me, since they are not very accessible to the methods I am employing, and (not coincidentally) are not central to the questions I will be investigating, which relate more to the thematic and conceptual aspects of the text. But even if these things are not the objects of my analysis, they are still important because they define what my corpus actually is. To ignore these things would be like surveying a large sample of people without recording what population or cohort those people represent. As with a survey, the conclusions I draw from my textual analysis will have no real-world validity unless I know what kinds of things in the real world my data represent.

In this post, I’m going to start paying attention to such things. But I’m not about to provide a comprehensive survey of the types of articles in my corpus. Instead I will focus on just one categorical distinction: that between in-house content generated by journalists and staff writers, and contributed or curated content in the form of readers’ letters and comments. Months ago, when I first started looking at the articles in my corpus, I realised that many of the articles are not news stories at all, but are collections of letters, text messages or Facebook posts submitted by readers. I wondered if perhaps this reader-submitted content should be kept separate from the in-house content, since it represents a different ‘voice’ to that of the newspapers themselves. Or then again, maybe readers’ views can be considered just as much a part of a newspaper’s voice as the rest of the content, since ultimately it is all vetted and curated by the newspaper’s editors.

As usual, the relevance of this distinction will depend on what questions I want to ask, and what theoretical frameworks I employ to answer them. But there is also a practical consideration: can I even separate these types of content without sacrificing too much of my time or sanity? 40,000 documents is a large haystack in which to search for needles. Although there is some metadata in my corpus inherited from the Factiva search (source publication, author, etc.), none of it is very useful for distinguishing letters from other articles. To identify the letters, then, I was going to have to use information within the text itself.

What’s it all about? Indexing my corpus using LDA.

Months ago, I assembled a dataset containing around 40,000 Australian news articles discussing coal seam gas. My ultimate aim is to analyse these articles, along with other text data from the web, so as to learn something about the structure and dynamics of the public discourse about coal seam gas in Australia. I’m interested in dissecting how different parties talk about this topic, and how this ‘configuration’ of the public discourse changes over time.

Although I didn’t originally plan to, I’ve focussed much of my energy so far on exploring the geographic dimension of the news articles. I’ve looked at where the news has come from and what places it talks about. This is all important stuff to know when studying such a geographically defined issue as coal seam gas development. But I also need to know what is being talked about, not just where. Now, finally, I am ready to turn my attention to exploring the thematic content of the articles.

Well, almost. I’m ready, but the data isn’t. The dataset that I have been playing with all this time is stuffed with articles that I don’t want, and is missing many that I do. This is because the search parameters that I used to retrieve the articles from Factiva were very broad — I obtained every article that mentioned coal seam gas or CSG anywhere even just once — and because I applied a rather rudimentary method — keyword counts — for filtering out the less relevant articles. The dataset has served its purpose as a testing ground, but if I am to use it to actually say things about the world, I need to know what it contains. And more than that, I need the ability to customise what it contains to suit the specific questions that I decide to explore.
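A keyword-count filter of the kind mentioned above might look something like the following sketch. The pattern, the `min_mentions` threshold and the sample articles are all assumptions made for illustration; the post does not say what values were actually used.

```python
import re

# Hypothetical pattern: matches "coal seam gas" or "CSG", ignoring case.
KEYWORD_PATTERN = re.compile(r"\b(?:coal seam gas|csg)\b", re.IGNORECASE)

def keyword_count(text):
    """Count how many times the keywords appear in a text."""
    return len(KEYWORD_PATTERN.findall(text))

def filter_by_keyword_count(articles, min_mentions=2):
    """Keep only the articles with at least min_mentions keyword hits.

    min_mentions is an assumed threshold, not a value from the post.
    """
    return [a for a in articles if keyword_count(a) >= min_mentions]

# Made-up sample articles.
articles = [
    "Coal seam gas drilling resumes. Farmers oppose CSG wells near town.",
    "The council met to discuss roads, rates and, briefly, CSG.",
    "Local football results for the weekend.",
]

kept = filter_by_keyword_count(articles, min_mentions=2)
```

The weakness of such a filter is visible even in this toy case: the second article is dropped although it does mention the issue, and a long article could pass the threshold while being mostly about something else.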

In other words, I need an index to my corpus. I need to know what every article is about, so I can include or exclude it at my discretion. In this post I’ll describe how I have created that index using a method of topic modelling called Latent Dirichlet Allocation, or LDA. Happily, this is the very same method that I was planning to use to analyse the thematic content of my corpus. So by creating an index for my corpus, I am already starting on the process of understanding what it’s all about.
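To make the idea of an index concrete, here is a minimal sketch of how per-article topic proportions could be used to include or exclude articles. The article ids, topic labels, proportions and the `min_share` cut-off are all hypothetical; they stand in for whatever a fitted topic model would actually produce.

```python
# Hypothetical index: article id -> share of the article assigned to each topic.
index = {
    "a1": {"csg_regulation": 0.7, "finance": 0.3, "sport": 0.0},
    "a2": {"csg_regulation": 0.1, "finance": 0.8, "sport": 0.1},
    "a3": {"csg_regulation": 0.5, "finance": 0.0, "sport": 0.5},
}

def select(index, topic, min_share):
    """Return the ids of articles whose share of `topic` is at least min_share."""
    return sorted(
        doc_id
        for doc_id, shares in index.items()
        if shares.get(topic, 0.0) >= min_share
    )

relevant = select(index, "csg_regulation", min_share=0.5)
```

Because the selection is just a threshold on a stored number, the same index can be reused with different topics and cut-offs as the research questions change.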

What do you do with a thousand place names?

My previous post was all about turning place names in news articles into dots on a map. Using a fairly straightforward method, I matched the place names in a collection of 26,863 news articles against the names and geographic coordinates in the Australian Gazetteer 2012, which lists and locates virtually every named place in Australia. Using such a comprehensive list created a fair amount of extra work, but resulted in a very rich and satisfying visualisation of how the news coverage about coal seam gas has moved over time. Ultimately, however, I want to translate these rich visualisations into simpler narratives and numerical descriptions. And to do this, individual statistics for every one of the 1,448 places on my list will not be of much help. I will need some way of aggregating the locations into relevant regions or locales.
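Matching place names against a gazetteer can be sketched roughly as follows. This is an illustration under stated assumptions, not the method used in the previous post: the three-entry `GAZETTEER` is a hypothetical extract with approximate coordinates, and the matching is a plain whole-word search.

```python
import re
from collections import Counter

# Hypothetical gazetteer extract: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Chinchilla": (-26.74, 150.63),
    "Narrabri": (-30.33, 149.78),
    "Gloucester": (-32.01, 151.96),
}

def find_places(text, gazetteer):
    """Count whole-word occurrences of each gazetteer name in the text."""
    counts = Counter()
    for name in gazetteer:
        hits = len(re.findall(r"\b" + re.escape(name) + r"\b", text))
        if hits:
            counts[name] = hits
    return counts

# Made-up article text.
article = (
    "Protesters in Narrabri joined farmers from Chinchilla. "
    "Narrabri council declined to comment."
)

places = find_places(article, GAZETTEER)
# Each matched place becomes a (name, coordinates, mentions) record for mapping.
points = [(name, GAZETTEER[name], n) for name, n in places.items()]
```

A real gazetteer contains many ambiguous names (places that share a name, or names that are also ordinary words), which is one reason a comprehensive list creates extra work.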

To achieve this, one could perhaps use some technique to group the locations based on spatial proximity — something akin to drawing fences around the places that form discrete clusters on the map. But there might be reasons besides proximity to group places together. Spatially distinct places might be united by common issues or events, just as proximate places might be subject to separate laws and controversies. Given that my ultimate object of study is public discourse, such non-geographical unifying factors may prove to be as important as geographical ones.

Latent Deary What?

Only some of these thoughts had crossed my mind when the idea hit me to use a topic modelling technique called Latent Dirichlet Allocation (LDA) to bring some order to my large list of locations. LDA is a technique that automatically identifies topics in large collections of documents, with a ‘topic’ in this context being defined as a set of words that tend to occur together in the documents that you are analysing. LDA uses some clever assumptions and iterative processes to find sets of words that, in most cases at least, correspond remarkably well with meaningful topics in the text. It is widely used for automated document categorisation and indexing, and more recently it has been applied to fields such as history and literary studies under the banner of the digital humanities. If you’re fluent in hieroglyphics, the Wikipedia page might be a good place to start if you want to know more about LDA. If you’re a mere mortal, pages like this one and this one offer a softer introduction.
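For readers who find code easier than hieroglyphics, the "iterative process" can be sketched with collapsed Gibbs sampling, one common way of fitting LDA. This is a toy implementation for illustration only: the function names, the hyperparameters `alpha` and `beta`, the iteration count and the sample documents are all assumptions, and a real analysis would use an established library rather than this.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (vocab, topic_word, doc_topic),
    where the last two are tables of token counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Repeatedly resample each token's topic given all the other tokens.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token from the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight = (topic's share of this document) x (word's share of this topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, topic_word, doc_topic

def top_words(vocab, topic_word, n=3):
    """List the n most frequent words assigned to each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]

# Made-up documents with two loosely separated themes.
docs = [
    "gas well drilling gas water".split(),
    "drilling well gas licence".split(),
    "farm land water farm crops".split(),
    "crops land farm drought".split(),
]
vocab, topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
topics = top_words(vocab, topic_word)
```

Words that keep turning up in the same documents tend to drift into the same topic as the sampler runs, which is all that "topic" means here. With a corpus this small the result depends heavily on the random seed.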

Like many computational text analysis methods, LDA views each document as an unordered ‘bag of words’. (This might sound like the surest way to render a document meaningless, but the payoff is that it makes the text amenable to all kinds of statistical techniques.) So I figured, why not feed the LDA algorithm bags of places instead? That is exactly what I had created from my collection of news articles when preparing my last post. I saw no reason why LDA couldn’t turn this data into groups of locations that were both spatially and discursively meaningful. Places that are mentioned together in articles are likely to be physically close to one another, linked by social context, or, most likely, both. Meaningful groupings of these places could be called geographic topics, or geotopics for short.
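The signal that makes this plausible is co-occurrence: which places turn up in the same articles. The sketch below is not LDA; it is a much simpler pair count that shows the raw pattern a topic model would have to work with. The bags of places are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bags of places, one per article (repeats allowed, order ignored).
bags = [
    ["Chinchilla", "Dalby", "Chinchilla"],
    ["Dalby", "Chinchilla", "Tara"],
    ["Narrabri", "Gunnedah"],
    ["Gunnedah", "Narrabri", "Narrabri"],
]

def cooccurrence(bags):
    """Count, for each pair of places, the number of articles mentioning both."""
    pairs = Counter()
    for bag in bags:
        # Use the set of distinct places so repeats within an article count once.
        for a, b in combinations(sorted(set(bag)), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cooccurrence(bags)
strongest = pairs.most_common(2)
```

In this toy data the pairs fall into two groups that never mix, which is the kind of structure one would hope to see emerge as separate geotopics.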