What do you do with a thousand place names?


My previous post was all about turning place names in news articles into dots on a map. Using a fairly straightforward method, I matched the place names in a collection of 26,863 news articles against the names and geographic coordinates in the Australian Gazetteer 2012, which lists and locates virtually every named place in Australia. Using such a comprehensive list created a fair amount of extra work, but resulted in a very rich and satisfying visualisation of how the news coverage about coal seam gas has moved over time. Ultimately, however, I want to translate these rich visualisations into simpler narratives and numerical descriptions. For that purpose, individual statistics for every one of the 1,448 places on my list will not be of much help. I will need some way of aggregating the locations into relevant regions or locales.

To achieve this, one could perhaps use some technique to group the locations based on spatial proximity — something akin to drawing fences around the places that form discrete clusters on the map. But there might be reasons besides proximity to group places together. Spatially distinct places might be united by common issues or events, just as proximate places might be subject to separate laws and controversies. Given that my ultimate object of study is public discourse, such non-geographical unifying factors may prove to be as important as geographical ones.
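For comparison, the purely spatial option of "drawing fences" could be sketched roughly as below. This is only an illustration, not the method used in this post: the coordinates are rough values for a handful of towns, the 100 km threshold is arbitrary, and the function names are invented for the example. Places are linked whenever they lie within the threshold of each other, and each connected group becomes a cluster.

```python
from math import asin, cos, radians, sin, sqrt

# Rough (latitude, longitude) pairs, for illustration only.
PLACES = {
    "Chinchilla": (-26.74, 150.63),
    "Dalby": (-27.18, 151.26),
    "Gloucester": (-32.01, 151.96),
    "Narrabri": (-30.33, 149.78),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def cluster_by_distance(places, max_km):
    """Group places so that any two within max_km end up in the same cluster."""
    names = list(places)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if haversine_km(places[a], places[b]) <= max_km:
                parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


clusters = cluster_by_distance(PLACES, max_km=100)
print(clusters)
```

A grouping like this knows nothing about which places are discussed together, which is exactly the limitation raised above.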

Latent Deary What?

Only some of these thoughts had crossed my mind when the idea hit me to use a topic modelling technique called Latent Dirichlet Allocation (LDA) to bring some order to my large list of locations. LDA is a technique that automatically identifies topics in large collections of documents, with a ‘topic’ in this context being defined as a set of words that tend to occur together in the documents that you are analysing. LDA uses some clever assumptions and iterative processes to find sets of words that, in most cases at least, correspond remarkably well with meaningful topics in the text. It is widely used for automated document categorisation and indexing, and more recently it has been applied to fields such as history and literary studies under the banner of the digital humanities. If you’re fluent in hieroglyphics, the Wikipedia page might be a good place to start if you want to know more about LDA. If you’re a mere mortal, pages like this one and this one offer a softer introduction.

Like many computational text analysis methods, LDA views each document as an unordered ‘bag of words’. (This might sound like the surest way to render a document meaningless, but the payoff is that it makes the text amenable to all kinds of statistical techniques.) So I figured, why not instead feed the LDA algorithm bags of places, which is exactly what I had created from my collection of news articles when preparing my last post? I saw no reason why LDA couldn’t turn this data into groups of locations that were both spatially and discursively meaningful. Places that are mentioned together in articles are likely to be physically close to one another, linked by social context, or, most likely, both. Meaningful groupings of these places could be called geographic topics, or geotopics for short.
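To make the "bags of places" idea concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler in plain Python. It is not the implementation behind this post: the four sample bags, the number of topics, the hyperparameters `alpha` and `beta`, and the function name `fit_geotopics` are all assumptions made for the example. On real data, an established library would be the more sensible choice.

```python
import random
from collections import Counter


def fit_geotopics(bags, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over bags of place names.

    bags: one list of place names per article (repeats allowed).
    Returns (theta, topic_place): per-article topic proportions and,
    for each topic, a Counter of how often each place was assigned to it.
    """
    rng = random.Random(seed)
    n_vocab = len({place for bag in bags for place in bag})

    doc_topic = [[0] * n_topics for _ in bags]          # mentions per article and topic
    topic_place = [Counter() for _ in range(n_topics)]  # mentions per topic and place
    topic_total = [0] * n_topics                        # mentions per topic
    assignments = []

    # Start from a random topic for every place mention.
    for d, bag in enumerate(bags):
        zs = []
        for place in bag:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_place[k][place] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, bag in enumerate(bags):
            for i, place in enumerate(bag):
                # Remove the mention's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_place[k][place] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_place[t][place] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_place[k][place] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each article.
    theta = [
        [(c + alpha) / (len(bag) + n_topics * alpha) for c in counts]
        for counts, bag in zip(doc_topic, bags)
    ]
    return theta, topic_place


# Made-up bags of places, one per imaginary article.
bags = [
    ["Chinchilla", "Dalby", "Tara", "Chinchilla"],
    ["Dalby", "Tara", "Roma"],
    ["Gloucester", "Camden", "Gloucester"],
    ["Camden", "Gloucester", "Narrabri"],
]
theta, topics = fit_geotopics(bags, n_topics=2)
for k, counts in enumerate(topics):
    print(k, [place for place, c in counts.most_common(3) if c > 0])
```

Each topic's most frequently assigned places are the candidate geotopic. With a corpus this tiny the result is only indicative, and the sampler's output depends on the seed.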

How the news moves


Don’t feel like reading? Fine, skip to the pictures!

My last post explored the spatial and temporal dynamics of news production, looking at how the intensity of news coverage about coal seam gas varied over time across regional newspapers. In this post, I will look instead at the geographic content of news coverage: which places do news articles about coal seam gas discuss, and how has the geographic focus changed over time?

Coal seam gas development in Australia has become a matter of national interest, at least insofar as it has a place (albeit a shrinking one) on the federal political agenda, and has featured (albeit to varying degrees) in news coverage and public debate across the country. But it’s hard to talk sensibly about coal seam gas — whether you are talking about the industry itself, its social and environmental impacts, or how the community has responded to it —  without grounding the discussion in specific locations. From one gas field to another, the structures and dynamics of underground systems vary just as much as the social systems on the surface. I am convinced that any meaningful analysis of CSG-related matters must be highly sensitive to geographic context. (My very first PhD-related post on this blog, an analysis of hyperlinks on CSG-related web pages, pointed to the same conclusion.)

Most news stories about coal seam gas are ultimately about some place or another (or several), whether it be the field where the gas is produced, the power plant where it is used, the port from which it is exported, the environment or community affected, or the place where people gather to protest or blockade. Keeping track of which places are mentioned in the news could provide one way of tracking how the public discourse about coal seam gas develops. And the most logical way to present and explore this kind of information is with a map. In theory, every place mentioned in an article could be translated to a dot on a map. Mapping all of the dots from all of the articles should reveal the geographical extent and focus of news about coal seam gas.
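The step from place mentions to dots can be sketched as a simple lookup. The example below is a naive illustration under stated assumptions, not the matching method actually used: the three-entry gazetteer, its rough coordinates, the sample sentences and the function names are all invented. A real gazetteer also contains many places that share a name, which this sketch makes no attempt to disambiguate.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: name -> rough (latitude, longitude).
GAZETTEER = {
    "Chinchilla": (-26.74, 150.63),
    "Dalby": (-27.18, 151.26),
    "Narrabri": (-30.33, 149.78),
}


def find_places(text, gazetteer):
    """Count whole-word mentions of each gazetteer name in one article."""
    counts = Counter()
    for name in gazetteer:
        hits = re.findall(r"\b" + re.escape(name) + r"\b", text)
        if hits:
            counts[name] = len(hits)
    return counts


def to_dots(articles, gazetteer):
    """Return one (name, lat, lon, mentions) tuple per place found."""
    totals = Counter()
    for text in articles:
        totals.update(find_places(text, gazetteer))
    return [(name, *gazetteer[name], n) for name, n in totals.items()]


# Invented sentences standing in for news articles.
articles = [
    "Protesters gathered in Chinchilla before moving on to Dalby.",
    "A new well near Chinchilla was approved.",
]
dots = to_dots(articles, GAZETTEER)
print(dots)
```

Each tuple carries what a map needs: where to put the dot and how large to draw it.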

Why do this? (Other than because I can, and it might be fun?) Firstly, because I’m still a little sketchy about how coal seam gas development and its attendant controversies have moved around the country over the last decade or two. I’m reasonably familiar with what has transpired in Queensland, but much less so with the situation in New South Wales. As for the other states, where there has been much less industry activity, I know virtually nothing about where and when coal seam gas has been discussed. So a map (especially one that can show time as well) of CSG-related news would provide a handy reference for understanding both the national and local geographic dimensions of the issue.

The other reason to map the news in this manner is that it may provide a way to both generate and answer interesting questions about the news landscape (or the public discourse more broadly) around coal seam gas, and this is, after all, what my PhD needs to do.

Mapping the news

Where did the last 12 months go? All I can really remember is something about being confirmed as a PhD candidate. I read a lot, and wrote a lot, but did very little of what I originally set out to do — namely, visualising and analysing text data. Now, finally, I am back in the sandpit. I’ve amassed a truckload of data in the form of news articles and blogs about coal seam gas development in Australia, and I intend to spend the next short while sifting through it and seeing what sort of sandcastles I can build before the tide of my next PhD milestone forces me to construct something more substantial.

The ultimate aim of my PhD is to explore how computational text analysis techniques such as topic modelling can assist in the analysis of public discourse. But for now, my objective is to get acquainted with my data. This data is divided into two piles, each representing a part of the discursive landscape around coal seam gas (or CSG) in Australia (if you’re American, think coalbed methane). One pile of data consists of texts published on the web by a range of actors (the sociology kind, not the Hollywood kind) including community groups, activists, lobbyists and politicians. I’ve siphoned these texts from a variety of websites using a data-crawling tool called import.io. The second, much larger, pile of data consists of news articles from hundreds of Australian mainstream media publications, from the national broadsheet right down to the local rags. I gathered these articles from the online news database Factiva, with the help of a script, available at the website for the conversation analysis tool Discursis, which converts Factiva’s HTML outputs into tabular format in the form of CSV files.

This post is devoted to exploring the second pile of data: the many thousands of news articles that I gathered from Factiva. Without attempting any fancy text analysis, I aim to get a first look at the overall volume, scope and diversity of the content. The focus in this post is on the overall volume and the geographic distribution of the content. In a future post, I plan to explore the specific news sources in more detail.

Adventures in harmonic space

Long, long ago, I studied music. In fact, when I finished high school, music was all I wanted to study. To be sure, I didn’t just want to study it: I wanted to compose it as well.[1] But I soon discovered that music theory was something worthy of study in itself, quite apart from the grounding it provided for composition. Music theory, especially the analysis of harmonies and harmonic progressions, provided a way to pop the hood on a piece of music (or even a whole genre) and learn what makes it tick. As if that weren’t exciting enough, I sensed that there were more profound truths waiting to be teased out of these harmonic structures. For if they offered clues about what makes music tick, then surely they said something about what makes us tick as well.

I never did pursue my vision of a grand unified theory of tonal harmony and psychoacoustics. I soon found that there were also other things worth studying, many of which came with the bonus incentive of career prospects. One thing led to another, and for better or worse, I ended up working for the government. And not as a music theorist. But to this day, I can’t help hearing a piece of music and thinking about what makes it tick. The theorist within me is always plugging away, even while the rest of me is just enjoying the tune.

Unsurprisingly, then, when I started playing with network graphs about 18 months ago, one of the first things I asked myself was what application they might have for music theory. The beauty of network graphs is that they can be used to represent just about anything. Any system or community of inter-related parts can be turned into a network of nodes and connections. So far on this blog I’ve used network graphs to explore the linkages among websites related to coal seam gas, and to identify clusters of documents containing duplicated text. On my other blog, I used network graphs to see how the names of different people and places featured across a collection of my posts.

In this post, I will use network graphs to visualise the relationships among chords within a piece of music. You could examine melodies in much the same way, by breaking them down to their individual notes and tracking which notes pair up and cluster together most often. But I suspect that there is more to be gained from visualising the harmonic relationships.
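One simple way to turn a piece into such a network, sketched here under assumptions, is to treat each chord as a node and each move from one chord to the next as a weighted, directed edge. The chord progression below is made up, the function name `chord_graph` is invented for the example, and counting successive pairs is only one possible definition of a "relationship" between chords.

```python
from collections import Counter


def chord_graph(progression):
    """Build node and edge weights from a sequence of chord names.

    Nodes are weighted by how often each chord occurs; directed edges are
    weighted by how often one chord is immediately followed by another.
    """
    nodes = Counter(progression)
    edges = Counter(zip(progression, progression[1:]))
    return nodes, edges


# A made-up progression, not taken from any particular piece.
progression = ["C", "Am", "F", "G", "C", "F", "G", "C"]
nodes, edges = chord_graph(progression)

for (src, dst), weight in edges.most_common():
    print(f"{src} -> {dst}: {weight}")
```

The resulting edge list is in a form that most network graph tools can import directly.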

Notes:

  1. Eventually, years later, I did get around to writing some music. And I have finally published some of the results on YouTube.