Tag Archives: data visualisation

Playing with page numbers

When was the last time you read a newspaper? I mean an actual, physical newspaper? Can you look at your fingertips and picture them smudged with ink, or remember trying to turn and fold those large and unwieldy pages? These are fading memories for me, and are probably totally foreign to many younger people today. Like many people, I consume virtually all of my news these days via the internet or, on rare occasions, the television. As far as I am concerned, newspapers are fast becoming nothing more than historical artefacts.

And yet, newspaper articles account for the bulk of the news data that I am analysing in my PhD project. To be sure, most of these newspaper articles were also published online, and would have been consumed that way by a lot of people. But I feel I can’t ignore the fact that these articles were also produced and consumed in a physical format. Unfortunately, there’s not much I can do to account for the physical presentation of the articles. My database doesn’t include the accompanying images or captions. Nor does it record how the articles were laid out on the page, or what other content surrounded them. But the metadata provided by Factiva does include one piece of information about each article’s physical manifestation: the page number of the newspaper in which it appeared.

From the very beginning of the explorations documented on this blog, I have completely ignored the page number field in my dataset. I figured that I was analysing text, not newspapers, and in any case I couldn’t see how I would incorporate page numbers into the kind of analysis that I was planning to do. But after hearing a colleague remark that ‘article-counting studies’ like mine are often unsatisfactory precisely because they fail to account for this information, I decided to give it some more thought. Continue reading

Looking for letters

In the posts I’ve written to date, I’ve learned some interesting things about my corpus of 40,000 news articles. I’ve seen how the articles are distributed over time and space. I’ve seen the locations they talk about, and how these shift over time. And I’ve created a thematic index to see what it’s all about. But I’ve barely said anything about the articles themselves. I’ve written nothing, for example, about how they vary in their format, style, and purpose.

To some extent, such concerns are of secondary importance to me, since they are not very accessible to the methods I am employing, and (not coincidentally) are not central to the questions I will be investigating, which relate more to the thematic and conceptual aspects of the text. But even if these things are not the objects of my analysis, they are still important because they define what my corpus actually is. To ignore these things would be like surveying a large sample of people without recording what population or cohort those people represent. As with a survey, the conclusions I draw from my textual analysis will have no real-world validity unless I know what kinds of things in the real world my data represent.

In this post, I’m going to start paying attention to such things. But I’m not about to provide a comprehensive survey of the types of articles in my corpus. Instead I will focus on just one categorical distinction — that between in-house content generated by journalists and staff writers, and contributed or curated content in the form of readers’ letters and comments. Months ago, when I first started looking at the articles in my corpus, I realised that many of the articles are not news stories at all, but are collections of letters, text messages or Facebook posts submitted by readers. I wondered if perhaps this reader-submitted content should be kept separate from the in-house content, since it represents a different ‘voice’ to that of the newspapers themselves. Or then again, maybe readers’ views can be considered just as much a part of a newspaper’s voice as the rest of the content, since ultimately it is all vetted and curated by the newspaper’s editors.

As usual, the relevance of this distinction will depend on what questions I want to ask, and what theoretical frameworks I employ to answer them. But there is also a practical consideration — namely, can I even separate these types of content without sacrificing too much of my time or sanity? 40,000 documents is a large haystack in which to search for needles. Although there is some metadata in my corpus inherited from the Factiva search (source publication, author, etc.), none of it is very useful for distinguishing letters from other articles. To identify the letters, then, I was going to have to use information within the text itself. Continue reading
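As a rough illustration of the kind of text-based filtering this involves, here is a minimal sketch in Python: it flags an article as probable reader-submitted content if it contains any of a few marker phrases. The marker list and the sample articles are hypothetical stand-ins, not drawn from the actual corpus, and a real pass over 40,000 documents would need a much richer (and tested) vocabulary.

```python
import re

# Phrases that often signal reader-submitted content. This is an
# illustrative, hypothetical list -- a real classifier would need a
# vocabulary tuned against the corpus itself.
LETTER_MARKERS = [
    r"\bletters to the editor\b",
    r"\byour say\b",
    r"\btext us on\b",
    r"\bhave your say\b",
]

def looks_like_letters(text):
    """Flag an article as probable reader-submitted content if it
    contains any of the marker phrases (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in LETTER_MARKERS)

# Hypothetical article snippets standing in for real corpus entries.
articles = [
    "YOUR SAY: readers respond to the council's decision...",
    "The state government yesterday announced new regulations...",
]
flags = [looks_like_letters(a) for a in articles]
```

A heuristic like this will inevitably produce false negatives (letters pages with unusual headings) and false positives (news stories quoting those phrases), which is why the full exercise took more than a few lines of code.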

Adventures in harmonic space

Long, long ago, I studied music. In fact, when I finished high school, music was all I wanted to study. To be sure, I didn’t just want to study it: I wanted to compose it as well.[1] But I soon discovered that music theory was something worthy of study in itself, quite apart from the grounding it provided for composition. Music theory, especially the analysis of harmonies and harmonic progressions, provided a way to pop the hood on a piece of music (or even a whole genre) and learn what makes it tick. As if that weren’t exciting enough, I sensed that there were more profound truths waiting to be teased out of these harmonic structures. For if they offered clues about what makes music tick, then surely they said something about what makes us tick as well.

I never did pursue my vision of a grand unified theory of tonal harmony and psychoacoustics. I soon found that there were also other things worth studying, many of which came with the bonus incentive of career prospects. One thing led to another, and for better or worse, I ended up working for the government. And not as a music theorist. But to this day, I can’t help hearing a piece of music and thinking about what makes it tick. The theorist within me is always plugging away, even while the rest of me is just enjoying the tune.

Unsurprisingly then, when I started playing with network graphs about 18 months ago, one of the first things I asked myself was what application they might have for music theory. The beauty of network graphs is that they can be used to represent just about anything. Any system or community of inter-related parts can be turned into a network of nodes and connections. So far on this blog I’ve used network graphs to explore the linkages among websites related to coal seam gas, and to identify clusters of documents containing duplicated text. On my other blog, I used network graphs to see how the names of different people and places featured across a collection of my posts.

In this post, I will use network graphs to visualise the relationships among chords within a piece of music. You could examine melodies in much the same way, by breaking them down to their individual notes and tracking which notes pair up and cluster together most often. But I suspect that there is more to be gained from visualising the harmonic relationships. Continue reading

Notes:

  1. Eventually, years later, I did get around to writing some music. And I have finally published some of the results on YouTube.
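The data behind a harmonic network graph is easy to sketch: reduce a piece to a sequence of chords, treat each consecutive pair of chords as a directed edge, and let repeated transitions accumulate weight. A minimal Python sketch, using a made-up progression rather than a real piece:

```python
from collections import Counter

# A hypothetical chord progression (a simple loop in C major), standing
# in for the chord sequence extracted from a real piece of music.
progression = ["C", "G", "Am", "F", "C", "G", "F", "C"]

# Each consecutive pair of chords becomes a directed edge, and repeated
# transitions accumulate weight -- exactly the node/edge data that a
# network graph of the harmony would be drawn from.
edges = Counter(zip(progression, progression[1:]))
```

Feeding these weighted edges into a graph layout tool then lets frequently paired chords cluster together visually, in the same way that frequently co-occurring names clustered in the earlier network-graph posts.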

Mapping concepts, comparing texts

In the previous post, I explored the use of function words — that is, words without semantic content, like it and the — as a way of fingerprinting documents and identifying sets that are composed largely of the same text. I was inspired to do this when I realised that the dataset that I was exploring — a collection of nearly 900 public submissions to an inquiry by the New South Wales parliament into coal seam gas — contained several sets of documents that were nearly identical. The function-word fingerprinting technique that I used was far from perfect, but it did assist in the process of fishing out these recycled submissions.
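The fingerprinting idea can be sketched in a few lines: represent each document by the relative frequencies of a fixed list of function words, then compare documents by cosine similarity, with near-identical texts scoring close to 1. The word list and sample texts below are illustrative only, not the ones used in the actual analysis.

```python
import math
from collections import Counter

# A tiny, illustrative function-word list; the real exercise used a
# much longer vocabulary of semantically empty words.
FUNCTION_WORDS = ["the", "and", "of", "to", "it", "that", "in", "is"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    """Cosine similarity between two frequency vectors (0.0 if either
    vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Hypothetical submission snippets: the first two share boilerplate.
doc1 = "the impacts of gas in the region and the risks to water"
doc2 = "the impacts of gas in the region and the risks to health"
doc3 = "please stop all drilling now"

sim_same = cosine(fingerprint(doc1), fingerprint(doc2))
sim_diff = cosine(fingerprint(doc1), fingerprint(doc3))
```

Because the fingerprint ignores content words entirely, two submissions that recycle the same template but swap in different topic words still score as near-duplicates, which is exactly the behaviour that made the technique useful.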

That exercise was really a diversion from the objective of analysing the semantic content of these submissions — or in other words, what they are actually talking about. Of course, at a broad level, what the submissions are talking about is obvious, since they are all responses to an inquiry into the environmental, health, economic and social impacts of coal seam gas activities. But each submission (or at least each unique one) is bound to address the terms of reference differently, focussing on particular topics and making different arguments for or against coal seam gas development. Without reading and making notes about every individual submission, I wanted to know the scope of topics that the submissions discuss. And further to that, I wanted to see how the coverage of topics varied across the submissions.
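One simple way to get at topic coverage without reading every submission is to build a document-term matrix: a table of content-word counts with one row per document, which topic models and other semantic analyses then work from. A minimal sketch, with hypothetical snippets standing in for the real submissions:

```python
from collections import Counter

# Hypothetical submission snippets (stand-ins for the ~900 real ones).
submissions = [
    "groundwater contamination threatens farm water supplies",
    "gas royalties could fund regional jobs and local jobs",
]

# A tiny illustrative stop-word list; a real analysis would use a
# standard one.
STOPWORDS = {"the", "a", "and", "could"}

def term_counts(text):
    """Content-word counts for one document -- a single row of a
    minimal document-term matrix."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

matrix = [term_counts(s) for s in submissions]
```

Even this crude matrix makes the contrast between the two rows visible (water-related terms versus economic terms); scaling the same structure up across hundreds of submissions is what makes it possible to map topics and compare their coverage.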

Why did I want to do this? I’ll admit that my primary motivation was not to learn about the submissions themselves, but to try my hand at some analytical techniques. Ultimately, I want to use computational methods like text analytics to answer real questions about the social world. But first I need some practice at actually doing some text analytics, and some exposure to the mechanics of how it works. That, more than anything else, was the purpose of the exercise documented below. Continue reading