Tag Archives: data visualisation

2019: My year as a fence

January 26, 2025 | Workflows | climate change, data visualisation, KNIME, Melbourne | angusv

2019 was quite a year. I proposed to my wife and got married. We relocated from Brisbane to Melbourne. We got a dog. China got Covid. And Australia got burned to a crisp.

I wrote about two of those things exactly five years ago in a post published on 27 January 2020, a data-driven reflection on the climatic contrasts between Brisbane, where I spent the first four decades of my life, and Melbourne, which I now call home. That post was written on the first anniversary of our two-day drive from the humid heat of south-east Queensland into the meteorological madness of Melbourne. In the last few hours of that drive, as we raced to reach the real estate office in time to pick up our key before the Australia Day weekend, the temperature was over 40°C. By the time we opened the door to our rental in Fitzroy, the temperature had dropped to the mid 20s. A cool change indeed.

When I started writing that post in the dying weeks of 2019, much of the country was on fire. Melbourne, meanwhile, was veering from day to day between baking in smoke and 40-degree heat and basking in a glorious combination of clear sunshine and cool, crisp southerly breezes. It was really hard not to talk about the weather.

Since 2020, I’ve only published four posts, and since 2022 I’ve posted none. I was nearly ready to declare this blog deceased until I realised that it still has a faint trace of a pulse. That pulse happens to be Australia Day. This is not an expression of nationalism. As my post on January 28 2021 explained, I’m as ambivalent as anyone about what our national day does, can or should mean. But for whatever reason, since my January 2020 post on the anniversary of the previous Australia Day, I’ve managed to post on two successive Australia Day weekends. In keeping with the tone set by the first one, the last of those posts was yet another meditation on Melbourne’s weather.

So this post is partly an effort to keep the tradition (and indeed the blog) alive. But I also happen to have something worth sharing, something that loops all the way back to that first Australia Day post. As before, it revolves around the visualisation of climatic data, but this time the final product is not a graph on a computer screen.

This time, I painted a fence.

The concepts of a plan

The reason I stopped posting on this blog is pretty simple: I got a job. Not only did that mean I had less time for playing with data and writing, it also meant that I had less inclination to switch on a computer and play with data even when I did have time, because that’s exactly what my job entails. Instead, I’ve filled my spare time with pursuits that are more hands-on, like brewing beer and improving the home that we bought a couple of years ago.

Like many homes in suburban Melbourne, ours is separated from the property behind us by a wooden fence that has seen better days. Being generous, you could say that it perfectly complemented the bleak aesthetic that the yard had when we first moved in. But after we replaced the lawn of quartz pebbles and concrete with a garden and artificial grass (I’d love to plant real grass here, but am not sure it would thrive in shade most of the day), there was no denying that the grey, weathered fence was bringing down the vibe.

The fence originally complemented the bleak aesthetic of the yard…

…but didn’t hold up as well against a garden and lawn.

I got as far as preparing a letter to the neighbours about replacing the fence. But then I wondered if I was letting an opportunity pass me by. For what is the point of owning your own home if you don’t do things to it that a landlord wouldn’t allow? And if the fence’s days were numbered anyway, why not do something creative with it in the meantime?

After months of gazing at the fence and wondering how to improve it, I began to imagine it as a giant data visualisation, where each of the 112 available palings could represent … something. It was a very small leap for me to settle on climatic data as that something. If each paling represented a day (or three or four), the whole fence could depict a year’s worth of temperature data. But which year? That question had one obvious answer.

FenceSim 1.0

How does one go about turning a fence into a bar chart? I’m not sure how many other people have grappled with this question, but for me the obvious place to start was with a computer simulation. There were too many variables to consider any other way. How to divide 112 fence palings across 365 days? How to scale the temperatures? What other data might I be able to include? What colour combinations would work?

So I fired up KNIME, my data science tool of choice, and set about building a fence simulator app. A neat thing about KNIME is that building visualisations and interactivity on top of a bunch of calculations is really easy — easy, at least, compared with doing it in a traditional programming language.

Here’s what the app looks like:

The fence painting simulator I built in KNIME Analytics Platform to find out just how crazy this idea was.

Here’s all you need to know to understand the graph:

Each of the 112 simulated fence palings represents the first of three successive days in 2019, except between 1 May and 1 August, where each paling is the first of five days.
The bottom two data series are minimum and maximum daily temperatures.
The top two series are cloud cover (inferred from solar exposure) and rainfall.
The light blue portion does not reflect any data, and is just there to provide background to the other series.
All data were downloaded from the Bureau of Meteorology and pertain to the Melbourne (Olympic Park) observation station.

The chart shown here is the one that ended up on the fence. But it was one of many alternatives that I first explored with the app. Among the parameters that I experimented with are:

The size of the rainfall and cloud cover series relative to the temperature (the rainfall and cloud factors), since these are in completely different units of measurement.
The minimum padding to allow between the rainfall and maximum temperature series.
The method for aggregating three days into one paling. I toyed with averages or min/max values, but the results were either too boring or too artificial. I chose to show data for individual days so that each paling represented something concrete in reality.
The points at which the number of days per paling changes. And yes, I know that this completely corrupts the integrity of the graph. But remember, we are painting a fence here, not doing science! I’ll explain further below.
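
To make the role of these parameters concrete, here is a minimal Python sketch of how one paling might be laid out. It is not the KNIME workflow: the names `Day`, `chart_height` and `layout`, the idea of hanging cloud and then rainfall from the top of the paling, and the sample values in the usage note are all assumptions made for illustration. Temperatures are measured upward from zero, and the chart is made just tall enough that the minimum padding holds on every paling.

```python
from dataclasses import dataclass


@dataclass
class Day:
    t_min: float    # minimum temperature, degrees C
    t_max: float    # maximum temperature, degrees C
    rain_mm: float  # daily rainfall, mm
    cloud: float    # cloud cover as a fraction from 0.0 to 1.0


def top_depth(day, rain_factor, cloud_factor):
    """Depth of the cloud and rainfall bands, converted to temperature units."""
    return day.cloud * cloud_factor + day.rain_mm * rain_factor


def chart_height(days, rain_factor, cloud_factor, min_padding):
    """Smallest chart height that keeps min_padding free on every paling."""
    return max(
        d.t_max + min_padding + top_depth(d, rain_factor, cloud_factor)
        for d in days
    )


def layout(day, height, rain_factor, cloud_factor):
    """Boundaries of each painted band, measured up from the bottom."""
    cloud_depth = day.cloud * cloud_factor
    rain_depth = day.rain_mm * rain_factor
    return {
        "min_temp_top": day.t_min,
        "max_temp_top": day.t_max,
        "rain_bottom": height - cloud_depth - rain_depth,
        "cloud_bottom": height - cloud_depth,
    }
```

For example, with invented values `days = [Day(20, 42.8, 0, 0.1), Day(10, 15, 30, 1.0)]`, calling `chart_height(days, rain_factor=0.5, cloud_factor=4, min_padding=3)` sizes the chart around the hot, dry day, and `layout` then gives the band boundaries for each paling. Raising either factor makes the top series more prominent at the cost of a taller chart.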

You’ll notice that I even loaded the app with data from different years, even though my heart was already set on 2019. The capture below shows the app in action:

At the end of the day, I was chasing an output that looked interesting and told the story I wanted to tell. This is why, among other things, I divided the graph into sections with different timescales (three days per paling, then five, then three again). My desire to fill all 112 available palings conflicted with the mathematical reality that 365 when divided by 112 does not yield anything close to a whole number. And I chose to compress the second timescale into winter (rather than have a longer period of four days per paling) because there just wasn’t as much day-to-day variation to see in winter. And besides, if any Melburnian could make winter shorter, they surely would.
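
The arithmetic behind that compromise can be sketched in a few lines of Python. This is only an illustration, not the KNIME workflow: the function name and the exact switch rule (a paling that starts on or after 1 May and before 1 August covers five days, any other paling covers three) are assumptions based on the description above. Because the exact switch points used in the app aren't given here, the number of palings this rule produces need not match the 112 on the real fence.

```python
from datetime import date, timedelta


def paling_dates(year=2019, slow_start=(5, 1), slow_end=(8, 1),
                 normal_step=3, slow_step=5):
    """Return the date shown on each paling, starting from 1 January.

    Assumed rule: a paling that starts inside the 'slow' window covers
    slow_step days; any other paling covers normal_step days.
    """
    window_start = date(year, *slow_start)
    window_end = date(year, *slow_end)
    current = date(year, 1, 1)
    dates = []
    while current.year == year:
        dates.append(current)
        in_window = window_start <= current < window_end
        current += timedelta(days=slow_step if in_window else normal_step)
    return dates


dates = paling_dates()
ninth = dates[8]  # palings are counted from the left, starting at 1
```

Under these assumptions the ninth paling lands on 25 January 2019, the day of the cool change described earlier. Shifting `slow_start` and `slow_end` is the kind of adjustment that lets the total number of palings be tuned to the fence.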

The other reason for simulating the fence on the computer first was to gain my wife’s approval before transforming the fence — which, amazingly, she granted.

Making it real

Needless to say, simulating a fence is a very different exercise from painting one. The latter required first converting the graph into a spreadsheet with measurements scaled to the size of the fence, then marking up these measurements on the fence, and then finally painting to these measurements. Given that I had never painted a fence (or anything, really) before this one, you can imagine that the last of these steps presented the biggest challenge of them all, but I’m not going to recount that experience here. I’ll just show some pictures.
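
For the curious, the conversion step amounts to a linear rescaling. The Python sketch below is an illustration under stated assumptions rather than the spreadsheet I actually used: the function names, the 180 cm paling height, the chart height of 50 units and the rounding to the nearest half centimetre are all hypothetical.

```python
def to_fence_cm(value, chart_height, fence_height_cm, precision_cm=0.5):
    """Scale a chart value to a height on the paling, rounded for marking up."""
    raw = value / chart_height * fence_height_cm
    return round(raw / precision_cm) * precision_cm


def marking_sheet(palings, chart_height, fence_height_cm):
    """Build one row of measurements per paling from (t_min, t_max) pairs."""
    rows = []
    for number, (t_min, t_max) in enumerate(palings, start=1):
        rows.append({
            "paling": number,
            "min_cm": to_fence_cm(t_min, chart_height, fence_height_cm),
            "max_cm": to_fence_cm(t_max, chart_height, fence_height_cm),
        })
    return rows


# Hypothetical dimensions: a chart 50 units tall on a 180 cm paling.
peak_cm = to_fence_cm(42.8, chart_height=50, fence_height_cm=180)
```

Each row of the resulting sheet is one set of pencil marks, measured up from the bottom of a paling.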

I’d already painted a few ‘clouds’ before I thought to take a Before picture.

Summer 2019 was the bushfire season from Hell. And Covid starts in there somewhere too.

To fit 365 days into 112 palings, I ‘sped up’ winter, because why wouldn’t you?

That ninth paling is 25 January 2019, when we arrived in Melbourne to see the temperature drop 20 degrees in a few hours.

How could I not use the leftover paint to improve this table? Each plank represents the first of 12 successive days in 2024.

The final result has far exceeded my expectations. Not so much in terms of its execution, which is merely ok if you don’t look too closely, but in terms of its impact on the yard and the house as a whole. The colours, which were chosen to evoke the natural surroundings, make the outdoor space feel calmer and more inviting. They transform the indoor space too, as they are visible from many points of the house, and reflect colour onto the inside walls when bathed in the afternoon sun.

To be sure, there are still some things missing. Some markings to indicate scale, for one. I know that the hottest temperature depicted, on the ninth paling from the left (which happens to be the day we arrived in Melbourne six years ago), is 42.8°C, but visitors don’t. Nor does anyone else know which of these days aligns with our wedding, or when we got our dog, or when Covid-19 was first detected in Wuhan, or when our then-Prime Minister Scott Morrison said he didn’t hold a hose, mate while the bushfires raged among those towering maximum temperatures in December. In time, I plan to add some subtle annotations to mark these moments.

Oh, and you might have noticed something else in that last picture: I also painted our old wooden table to match the fence, because how could I not? Finished on New Year’s Eve just passed, it represents 2024, with each plank showing the first of 12 successive days.

Now, should I paint the house?

GameStop on Twitter: a quick squiz at the short squeeze

January 31, 2021 | Uncategorised, Workflows | data visualisation, KNIME, networks, R, text analysis, Twitter | angusv

GameStop the press!

Remember GameStop? You know, the video game retailer whose decaying share price exploded after a bunch of Reddit users bought its stock and succeeded in bankrupting a hedge fund that was trying to short it? Yeah, that was nearly a week ago now, so my memory of it is getting hazy. I mostly remember all the explainers about how the share market works and what a short squeeze is. And the thought pieces about how this kind of coordinated market behaviour is nothing criminal, just ordinary folk playing the big boys at their own game and finally winning. And the memes: who can forget the memes? Well, me, for a start.

Somewhere amid the madness, I decided that I should harvest some Twitter data about this so-called GameStop saga (can something really be a saga after only three days?) to capture the moment, and to see whose hot takes and snide remarks were winning the day in this thriving online marketplace of shotposts and brainfarts.

I confess that I had another motive for doing this as well, which was to provide some fodder for my TweetKollidR workflow, which turns Twitter datasets into pretty and informative pictures. The TweetKollidR is a workflow for the KNIME Analytics Platform that I developed while locked down for three months in the latter half of 2020. I’ve made the workflow publicly available on the KNIME Hub, but it is still in need of road-testing, having been used (by me, at least) to analyse only two issues: the Covid-19 lockdown that spurred its genesis, and the wearisome public discourse about Australia Day. I felt that it was time to test the workflow on an issue that was not so close to home.

So, using the TweetKollidR workflow to connect to Twitter’s Search API, ¹ I collected just over 50,000 tweets containing the terms gamestop or game stop. Because I am not paying for premium access to the API, I was only able to grab tweets that were made within about 24 hours of the search (usually you can go back in time up to a week, but the sheer volume of activity around this topic might have shortened the window offered by the API). The 50,000 tweets in the dataset therefore cover just two days, namely 28 and 29 January 2021.

Let’s take a squiz! (By which, for the non-Australians among you, I mean a look or glance, esp an inquisitive one.)

Notes:

¹ API stands for application programming interface, which is essentially a protocol by which content can be requested and supplied in a machine-readable format, rather than as eye candy.

The day after the week before: mapping the Twitter discourse about Australia day

January 28, 2021 | Workflows | data visualisation, KNIME, networks, text analysis, Twitter | angusv

An ode to the 27th of January

The 27th of January is an important day in the Australian calendar. As the fog rises from the Christmas break and the last public holiday for at least eight weeks, this date marks the resumption of business as usual, the start of the new year proper. It is also the moment when millions of Australians breathe a sigh of relief, knowing that the divisive and tiresome debate about the date of Australia Day will now subside for another 358 days, give or take.

The 27th of January is the day after the day when, in 1788, Captain Arthur Phillip sailed into Sydney Cove and planted a British flag. Trailing behind him was a fleet of 11 ships carrying an assortment of convicts, civil officers and free settlers, the first members of a new colonial outpost that would ultimately become the nation state of Australia. Watching the arrival from the shore were the land’s indigenous human inhabitants, custodians of more than 40,000 years of continuous culture and occupation.

Long recognised as the anniversary of the colony’s foundation, the day before the 27th of January was in 1935 adopted by all states and territories as Australia’s national day of celebration. For many years, most Australians were happy with this arrangement. Australia Day was a day of national pride and innocent celebration, a day to have a barbecue, drink some beer, listen to the Hottest 100 countdown and play some beach or backyard cricket, often all at the same time. But now, as the country has finally begun to confront some of the darker chapters of its colonial past, the 26th of January is losing its lustre as a day when such simple pleasures can be enjoyed, let alone pursued in the name of national pride. It turns out that it is rather difficult to drink beer, play cricket and enjoy last year’s top songs while at the same time contemplating the country’s legacy of dispossession and genocide against its first peoples. (Indeed, Triple J moved its Hottest 100 countdown from Australia Day to the fourth weekend of January in 2018, and this year Cricket Australia chose not to mention Australia Day in its promotion of matches held on 26 January.)

And so, in the third decade of the 21st century, the 27th of January is now the day after tens of thousands of people partake in Invasion Day rallies to plead for meaningful reconciliation and to advocate changing the date of the national day, or abolishing it altogether. The 27th is the day after a day of exasperated commentary about the recipients of Australia Day honours, which in 2015 included Prince Philip, inexplicably knighted by the then prime minister, Tony Abbott; and which this year included Margaret Court, whose legacy as one of the greatest ever tennis players has in recent times been overshadowed by her outspoken and controversial views about homosexuality, gay marriage and transgender people. In short, the 27th of January is the day after a wave of difficult, awkward, and at times ugly public debate peaks and subsides. Until next year.

Qualitative evaluation of topic models: a methodological offering

October 22, 2020 | PhD related, Workflows | covid-19, data visualisation, KNIME, lda, lockdown, PhD, R, text analysis, topic modelling, Twitter | angusv

Topic models: a Pandora’s Black Box for social scientists

Probabilistic topic modelling is an improbable gift from the field of machine learning to the social sciences and humanities. Just as social scientists began to confront the avalanche of textual data erupting from the internet, and historians and literary scholars started to wonder what they might do with newly digitised archives of books and newspapers, data scientists unveiled a family of algorithms that could distil huge collections of texts into insightful lists of words, each indexed precisely back to the individual texts, all in less time than it takes to write a job ad for a research assistant. Since David Blei and colleagues published their seminal paper on latent Dirichlet allocation (the most basic and still the most widely used topic modelling technique) in 2003, topic models have been put to use in the analysis of everything from news and social media through to political speeches and 19th century fiction.

Grateful for receiving such a thoughtful gift from a field that had previously expressed little interest or affection, social scientists have returned the favour by uncovering all the ways in which machine learning algorithms can reproduce and reinforce existing biases and inequalities in social systems. While these two fields have remained on speaking terms, it’s fair to say that their relationship status is complicated.

Even topic models turned out to be as much a Pandora’s Box as a silver bullet for social scientists hoping to tame Big Text. In helping to solve one problem, topic models created another. This problem, in a word, is choice. Rather than providing a single, authoritative way in which to interpret and code a given textual dataset, topic models present the user with a landscape of possibilities from which to choose. This landscape is defined in part by the model parameters that the user must set. As well as the number of topics to include in the model, these parameters include values that reflect prior assumptions about how documents and topics are composed (these parameters are known as alpha and beta in LDA). ¹ Each unique combination of these parameters will result in a different (even if subtly different) set of topics, which in turn could lead to different analytical pathways and conclusions. To make matters worse, merely varying the ‘random seed’ value that initiates a topic modelling algorithm can lead to substantively different results.
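
To give a feel for how quickly these choices multiply, the Python sketch below enumerates a small configuration space. The function name and the specific values for the number of topics, alpha, beta and the random seed are hypothetical, chosen only for illustration; each resulting configuration would correspond to one candidate model to be fitted and interpreted.

```python
from itertools import product


def candidate_configs(topic_counts, alphas, betas, seeds):
    """Enumerate every combination of the user-set parameters."""
    return [
        {"n_topics": k, "alpha": a, "beta": b, "seed": s}
        for k, a, b, s in product(topic_counts, alphas, betas, seeds)
    ]


# Hypothetical values: even a modest grid yields many candidate models.
configs = candidate_configs(
    topic_counts=[20, 40, 60],
    alphas=[0.05, 0.1, 0.5],
    betas=[0.01, 0.1],
    seeds=[1, 2, 3],
)
n_models = len(configs)  # 3 * 3 * 2 * 3 combinations
```

Even this modest grid produces 54 configurations, each of which could yield a different set of topics.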

Far from narrowing down the number of possible schemas with which to code and analyse a text, topic models can therefore present the user with a bewildering array of possibilities from which to choose. Rather than lending a stamp of authority or objectivity to a textual analysis, topic models leave social scientists in the familiar position of having to justify the selection of one model of reality over another. But whereas a social scientist would ordinarily be able to explain in detail the logic and assumptions that led them to choose their analytical framework, the average user of a topic model will have only a vague understanding of how their model came into being. Even if the mathematics of topic models are well understood by their creators, topic models will always remain something of a ‘black box’ to many end-users.

This state of affairs is incompatible with any research setting that demands a high degree of rigour, transparency and repeatability in textual analyses. ² If social scientists are to use topic models in such settings, they need some way to justify their selection of one possible classification scheme over the many others that a topic modelling algorithm could produce, ³ and to account for the analytical opportunities forgone in doing so.

If you’ve ever tried to interpret even a single set of topic model outputs, you’ll know that this is a big ask. Each run of a topic modelling algorithm produces maybe dozens of topics (the exact number is set by the user), each of which in turn consists of dozens (or maybe even hundreds) of relevant words whose collective interpretation constitutes the ‘meaning’ of the topic. Some topics present an obvious interpretation. Some can be interpreted only with the benefit of domain expertise, cross-referencing with original texts, and perhaps even some creative licence. Some topics are distinct in their meaning, while others overlap with each other, or vary only in subtle or mysterious ways. Some topics are just junk.

If making sense of a single topic model ⁴ is a complex task, comparing one model with another is doubly so. Comparing many models at a time is positively Herculean. How, then, is anyone supposed to compare and evaluate dozens of candidate models sampled from all over the configuration space?

Notes:

¹ The generative model of LDA assumes that each document in a collection is generated from a mixture of hidden variables (topics) from which words are selected to populate the document. The number of topics in the model is a parameter that must be set by the user. The proportions by which topics are mixed to create documents, and by which words are mixed to define topics, are presumed to conform to specific distributions which are sampled from the Dirichlet distribution, which is essentially a distribution of distributions. The shape of these two prior distributions is determined by two parameters (often referred to as hyperparameters to distinguish them from the internal components of the model), which are usually denoted as alpha (α) and beta (β). Whereas alpha controls the presumed specificity of documents (a smaller value means that fewer topics are prominent within a document), beta controls the presumed specificity of topics (a smaller value means that fewer words within a topic are strongly weighted). Like the number of topics, these hyperparameters are set by the user, ideally with some regard for the style and composition of the texts being analysed.
² It’s important to recognise that criteria such as transparency and repeatability are not applicable to all textual analysis traditions. Some traditions assume a degree of interpretation and subjectivity that renders such criteria all but irrelevant. The probabilistic nature of topic models presents a very different set of challenges and opportunities to such traditions, at least insofar as practitioners are inclined to use them.
³ That is, assuming that only one fitted topic model is used in the analysis. Conceivably, an analysis could use and compare several models.
⁴ In this post, as in much of the literature on topic modelling, the term ‘topic model’ may describe one of two things. The more general sense of the term refers to a particular generative model of text, which may or may not be paired with a specific inference algorithm. In this sense, LDA is one example of a topic model, and the structural topic model is another. The second sense of the term refers to the outputs, in the form of term distributions and document allocations, obtained by applying a topic model in the first sense to a particular collection of texts. (These outputs may also be referred to as a ‘fitted topic model’.) The relevant sense of the term will usually be evident from the context in which it is used.
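
The generative story described in the first note can be sketched in plain Python. This is a toy illustration of the assumed data-generating process only, not an inference algorithm and not the code behind any fitted model discussed in the post: the function names, the tiny vocabulary and the parameter values are hypothetical, and symmetric Dirichlet draws are obtained by normalising gamma draws from the standard library.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one point from a symmetric Dirichlet via normalised gamma draws."""
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0:  # guard against all draws underflowing to zero
            return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_length, alpha, beta, seed):
    """Generate documents by mixing topics, as the LDA model assumes."""
    rng = random.Random(seed)
    # Each topic is a distribution over words, shaped by beta.
    topics = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document is a mixture of topics, shaped by alpha.
        mixture = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(doc_length):
            topic = rng.choices(range(n_topics), weights=mixture)[0]
            words.append(rng.choices(vocab, weights=topics[topic])[0])
        docs.append(words)
    return topics, docs


# Hypothetical toy inputs for illustration only.
vocab = ["fence", "paint", "rain", "heat", "tweet", "stock"]
topics, docs = generate_corpus(vocab, n_topics=2, n_docs=3, doc_length=10,
                               alpha=0.5, beta=0.5, seed=1)
```

Lowering `alpha` tends to concentrate each document on fewer topics, and lowering `beta` tends to concentrate each topic on fewer words, which is the sense in which the two hyperparameters control specificity. Changing only `seed` changes the generated corpus.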