Where did the last 12 months go? All I can really remember is something about being confirmed as a PhD candidate. I read a lot, and wrote a lot, but did very little of what I originally set out to do — namely, visualising and analysing text data. Now, finally, I am back in the sandpit. I’ve amassed a truckload of data in the form of news articles and blogs about coal seam gas development in Australia, and I intend to spend the next short while sifting through it and seeing what sort of sandcastles I can build before the tide of my next PhD milestone forces me to construct something more substantial.
The ultimate aim of my PhD is to explore how computational text analysis techniques such as topic modelling can assist in the analysis of public discourse. But for now, my objective is to get acquainted with my data. This data is divided into two piles, each representing a part of the discursive landscape around coal seam gas (or CSG) in Australia (if you’re American, think coalbed methane). One pile of data consists of texts published on the web by a range of actors (the sociology kind, not the Hollywood kind) including community groups, activists, lobbyists and politicians. I’ve siphoned these texts from a variety of websites using a data-crawling tool called import.io. The second, much larger, pile of data consists of news articles from hundreds of Australian mainstream media publications, from the national broadsheet right down to the local rags. I gathered these articles from the online news database Factiva, then used a script (available from the website of the conversation analysis tool Discursis) to convert Factiva’s HTML outputs into tabular CSV files.
This post is devoted to exploring the second pile of data: the many thousands of news articles that I gathered from Factiva. Without attempting any fancy text analysis, I aim to get a first look at the volume, scope and diversity of the content. The focus in this post is on the overall volume and the geographic distribution of the content. In a future post, I plan to explore the specific news sources in more detail.