
Tracking and comparing regional coverage of coal seam gas

In the last post, I started looking at how the level of coverage of specific regions changed over time — an intersection of the Where and When dimensions of the public discourse on coal seam gas. In this post I’ll continue along this line of analysis while also incorporating something from the Who dimension. Specifically, I’ll compare how news and community groups cover specific regions over time.

Regional coverage by news organisations

One of the graphs in my last post compared the ratio of coverage of locations in Queensland to that of locations in New South Wales. Figure 1 below takes this a step further, breaking down the data by region as well. What this graph shows is the level of attention given to each region by the news sources in my database (filtered to ensure complete coverage for the period — see the last post) over time. In this case, I have calculated the “level of attention” for a given region by counting the number of times a location within that region appears in the news coverage, and then aggregating these counts within a moving 90-day window. Stacking the tallies to fill a fixed height, as I have done in Figure 1, reveals the relative importance of each region, regardless of how much news is generated overall (to see how the overall volume of coverage changes over time, see the previous post). The geographic boundaries that I am using are (with a few minor changes) the SA4 level boundaries defined by the Australian Bureau of Statistics. You can see these boundaries by poking around on this page of the ABS website.
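
The "level of attention" calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind Figure 1: the function name `windowed_shares`, the sample mentions, and the choice of a trailing window (ending on the given day, rather than centred on it) are all hypothetical. The post does not say how the window is aligned.

```python
from collections import Counter
from datetime import date, timedelta


def windowed_shares(mentions, day, window_days=90):
    """Return each region's share of location mentions in the window ending on `day`.

    `mentions` is a list of (publication_date, region) pairs, one pair per
    location mention. A trailing window is an assumption made for this sketch.
    """
    start = day - timedelta(days=window_days - 1)
    counts = Counter(region for when, region in mentions if start <= when <= day)
    total = sum(counts.values())
    # Dividing by the total is what "stacking to fill a fixed height" amounts to:
    # the shares always sum to 1, however much news there is overall.
    return {region: n / total for region, n in counts.items()} if total else {}


# Hypothetical mentions, for illustration only.
mentions = [
    (date(2012, 1, 5), "Darling Downs"),
    (date(2012, 2, 10), "Darling Downs"),
    (date(2012, 3, 1), "Richmond - Tweed"),
    (date(2012, 6, 20), "Richmond - Tweed"),
]

shares = windowed_shares(mentions, date(2012, 3, 31))
```

Calling the function once for each day in the study period would give the series of shares that a stacked chart like Figure 1 plots.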

The regions in Figure 1 are shaded so that you can see the division at the state level. The darker band of blue across the lower half of the graph corresponds with regions in Queensland. The large lighter band above that corresponds with regions in New South Wales. Above that, you can see smaller bands representing Victoria and Western Australia. (The remaining states are there too, but they have received so little coverage that I haven’t bothered to label them.) I have added labels for as many regions as I can without cluttering up the chart.

Figure 1. Coverage of geographic regions in news stories about coal seam gas, measured by the number of times locations from each region are mentioned in news stories within a moving 90-day window. The blue shadings group the regions by state.


Playing with page numbers

When was the last time you read a newspaper? I mean an actual, physical newspaper? Can you look at your fingertips and picture them smudged with ink, or remember trying to turn and fold those large and unwieldy pages? These are fading memories for me, and are probably totally foreign to many younger people today. Like many people, I consume virtually all of my news these days via the internet or, on rare occasion, the television. As far as I am concerned, newspapers are fast becoming nothing more than historical artifacts.

And yet, newspaper articles account for the bulk of the news data that I am analysing in my PhD project. To be sure, most of these newspaper articles were also published online, and would have been consumed that way by a lot of people. But I feel I can’t ignore the fact that these articles were also produced and consumed in a physical format. Unfortunately, there’s not much I can do to account for the physical presentation of the articles. My database doesn’t include the accompanying images or captions. Nor does it record how the articles were laid out on the page, or what other content surrounded them. But the metadata provided by Factiva does include one piece of information about each article’s physical manifestation: the page number of the newspaper in which it appeared.

From the very beginning of the explorations documented on this blog, I have completely ignored the page number field in my dataset. I figured that I was analysing text, not newspapers, and in any case I couldn’t see how I would incorporate page numbers into the kind of analysis that I was planning to do. But after hearing a colleague remark that ‘article-counting studies’ like mine are often unsatisfactory precisely because they fail to account for this information, I decided to give it some more thought.

What’s it all about? Indexing my corpus using LDA.

Months ago, I assembled a dataset containing around 40,000 Australian news articles discussing coal seam gas. My ultimate aim is to analyse these articles, along with other text data from the web, so as to learn something about the structure and dynamics of the public discourse about coal seam gas in Australia. I’m interested in dissecting how different parties talk about this topic, and how this ‘configuration’ of the public discourse changes over time.

Although I didn’t originally plan to, I’ve focussed much of my energy so far on exploring the geographic dimension of the news articles. I’ve looked at where the news has come from and what places it talks about. This is all important stuff to know when studying such a geographically defined issue as coal seam gas development. But I also need to know what is being talked about, not just where. Now, finally, I am ready to turn my attention to exploring the thematic content of the articles.

Well, almost. I’m ready, but the data isn’t. The dataset that I have been playing with all this time is stuffed with articles that I don’t want, and is missing many that I do. This is because the search parameters that I used to retrieve the articles from Factiva were very broad (I obtained every article that mentioned coal seam gas or CSG anywhere, even just once) and because I applied a rather rudimentary method, keyword counts, for filtering out the less relevant articles. The dataset has served its purpose as a testing ground, but if I am to use it to actually say things about the world, I need to know what it contains. And more than that, I need the ability to customise what it contains to suit the specific questions that I decide to explore.
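
To make the limitation concrete, here is a minimal sketch of what a keyword-count filter might look like. It is not the filter actually applied to the dataset: the pattern, the function names and the threshold of two mentions are assumptions chosen purely for illustration.

```python
import re

# Hypothetical pattern: the two search terms mentioned in the post.
KEYWORD_PATTERN = re.compile(r"\bcoal seam gas\b|\bCSG\b", re.IGNORECASE)


def keyword_count(text):
    """Count how many times either keyword appears in an article's text."""
    return len(KEYWORD_PATTERN.findall(text))


def filter_articles(articles, min_count=2):
    """Keep only articles mentioning the keywords at least `min_count` times.

    The threshold of 2 is an arbitrary value chosen for this sketch.
    """
    return [text for text in articles if keyword_count(text) >= min_count]


# Hypothetical article texts, for illustration only.
articles = [
    "Coal seam gas (CSG) wells were the main item on the agenda.",
    "The minister also mentioned CSG in passing.",
]

relevant = filter_articles(articles)
```

A filter like this knows nothing about what an article is actually about, which is why it both lets irrelevant articles through and discards relevant ones.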

In other words, I need an index to my corpus. I need to know what every article is about, so I can include or exclude it at my discretion. In this post I’ll describe how I have created that index using a method of topic modelling called Latent Dirichlet Allocation, or LDA. Happily, this is the very same method that I was planning to use to analyse the thematic content of my corpus. So by creating an index for my corpus, I am already starting on the process of understanding what it’s all about.

How the news moves

Don’t feel like reading? Fine, skip to the pictures!

My last post explored the spatial and temporal dynamics of news production, looking at how the intensity of news coverage about coal seam gas varied over time across regional newspapers. In this post, I will look instead at the geographic content of news coverage: which places do news articles about coal seam gas discuss, and how has the geographic focus changed over time?

Coal seam gas development in Australia has become a matter of national interest, at least insofar as it has a place (albeit a shrinking one) on the federal political agenda, and has featured (albeit to varying degrees) in news coverage and public debate across the country. But it’s hard to talk sensibly about coal seam gas (whether you are talking about the industry itself, its social and environmental impacts, or how the community has responded to it) without grounding the discussion in specific locations. From one gas field to another, the structures and dynamics of underground systems vary just as much as the social systems on the surface. I am convinced that any meaningful analysis of CSG-related matters must be highly sensitive to geographic context. (My very first PhD-related post on this blog, an analysis of hyperlinks on CSG-related web pages, pointed to the same conclusion.)

Most news stories about coal seam gas are ultimately about some place or another (or several), whether it be the field where the gas is produced, the power plant where it is used, the port from which it is exported, the environment or community affected, or the place where people gather to protest or blockade. Keeping track of which places are mentioned in the news could provide one way of tracking how the public discourse about coal seam gas develops. And the most logical way to present and explore this kind of information is with a map. In theory, every place mentioned in an article could be translated to a dot on a map. Mapping all of the dots from all of the articles should reveal the geographical extent and focus of news about coal seam gas.

Why do this? (Other than because I can, and it might be fun?) Firstly, because I’m still a little sketchy about how coal seam gas development and its attendant controversies have moved around the country over the last decade or two. I’m reasonably familiar with what has transpired in Queensland, but much less so with the situation in New South Wales. As for the other states, where there has been much less industry activity, I know virtually nothing about where and when coal seam gas has been discussed. So a map (especially one that can show time as well) of CSG-related news would provide a handy reference for understanding both the national and local geographic dimensions of the issue.

The other reason to map the news in this manner is that it may provide a way to both generate and answer interesting questions about the news landscape (or the public discourse more broadly) around coal seam gas, and this is, after all, what my PhD needs to do.