Tag Archives: KNIME

Coding without code: Knime as a tool for digital humanities and computational social science

February 7, 2019General pontificationscoding, computational social science, digital humanities, KNIME, networks, text analysis, topic modellingangusv

The rise of the machines

Over the past several years, I’ve stumbled into two fields of research which, although differing in their subject matter, are united by the recent entrance of computational methods into domains that were previously dominated by manual modes of analysis.

One of these two fields is computational social science. This is the broad category into which my all-but-examined PhD thesis, and most of the posts on this blog, could be placed. Of course, my work, which focuses on the application of text analytics to communication studies, occupies a tiny niche within this larger field, which encompasses the use of all kinds of computational methods to the study of social phenomena, including network science and social simulations using agent-based models.

The other field to which I can claim some degree of membership is the digital humanities. This tag is one way to describe what I do on my other blog, in which I have drawn on digitised maps and newspapers to produce visualisations that explore the history of a suburban environment. History is just one field of the humanities that has been opened up to new possibilities through the extensive digitisation of source materials. Scholars of literature have joined the party as well, while those operating in the GLAM sector, consisting of galleries, libraries, and museums, can now use digital platforms to manage and share their collections.

Though often lumped together in the same faculty (or indeed the same school at my last university), the humanities and social sciences don’t generally have a lot to do with one another. They do, however, have commonalities that lend themselves to similar computational methods. One is the study of text, often in large quantities. Within the social sciences, communication scholars analyse (st)reams of news texts and social media feeds; qualitative sociologists often amass hours of interview transcripts; political scientists analyse speeches and parliamentary transcripts. In the humanities, scholars of history and literature now have at their disposal huge archives of digitised novels, journals, newspapers, and government records. Of course, there is nothing compelling scholars in these fields to analyse all of the available data at once. They are free to sample or cherry-pick as they please. But any scholar wishing to work with large textual datasets in their totality can hardly afford to ignore computational tools like topic models, named entity extraction, sentiment analysis, and domain-specific dictionaries such as LIWC.

Another analytical method that is applicable within both the social sciences and humanities is the study of networks. Sociologists and historians have long used networks to represent and analyse social ties. More recently, social media platforms have hard-wired the logic of networks into social relationships, explicitly representing all of us as nodes tied to one another by virtue of our acquaintances, interactions, and browsing histories. Accordingly, it is becoming increasingly difficult to study social phenomena without invoking the concept of networks, if not also the science of network analysis. By no means restricted to real-world phenomena, network analysis techniques are equally applicable to fictitious societies, whether those in Game of Thrones or the plays of Shakespeare. As with the analysis of text, network analysis doesn’t have to be done with a computer; but as the size of a network grows, so too does the need for computational tools.

Besides an interest in certain computational methods, something else that the digital humanities and computational social sciences have in common is that many of the scholars trying to break into these fields are doing so without any training in data science or computer programming. This is especially the case in the humanities, where traditionally, quantitative methods have been virtually taboo. In some branches of the social sciences, quantitative methods are well established, but these are more likely to take the form of spreadsheets and statistical tests rather than machine learning and network analyses.

(Of course, the story is very different for those researchers entering these fields from the other direction, such as the many physicists, mathematicians and computer scientists who have somehow claimed the territory of computational social science. What these researchers lack, although many of them seem not to realise it, is any training in understanding the actual substance of what they are studying.)

The bottom line is that for many researchers, embracing the new opportunities presented by the digital humanities and computational social sciences means learning skills that are totally outside of their disciplinary training. At a minimum, researchers hoping to use computational methods in these fields will need to learn how to calculate statistics and use spreadsheets, and perhaps also how to use network visualisation software like Gephi or text-cleaning software like OpenRefine. Increasingly though, getting your hands dirty in computational social science or the computational end of the digital humanities ¹ means learning something far worse — something that is commonly denoted by a certain four-letter word. Continue reading Coding without code: Knime as a tool for digital humanities and computational social science →

Notes:

There really is a difference between the digital in digital humanities and the computational in computational social science. Much of what falls under the digital humanities pertains only to the use and curation of digitised resources, rather than the use of computational methods to analyse these resources. Computational social science, on the other hand, seems to be driven primarily by the desire to analyse social data through computational methods. ↩

Looking for letters

June 15, 2016PhD relatedcoal seam gas, csg, data visualisation, KNIME, lda, PhD, R, SVM, topic modellingangusv

In the posts I’ve written to date, I’ve learned some interesting things about my corpus of 40,000 news articles. I’ve seen how the articles are distributed over time and space. I’ve seen the locations they talk about, and how this shifts over time. And I’ve created a thematic index to see what it’s all about. But I’ve barely said anything about the articles themselves. I’ve written nothing, for example, about how they vary in their format, style, and purpose.

To some extent, such concerns are of secondary importance to me, since they are not very accessible to the methods I am employing, and (not coincidentally) are not central to the questions I will be investigating, which relate more to the thematic and conceptual aspects of the text. But even if these things are not the objects of my analysis, they are still important because they define what my corpus actually is. To ignore these things would be like surveying a large sample of people without recording what population or cohort those people represent. As with a survey, the conclusions I draw from my textual analysis will have no real-world validity unless I know what kinds of things in the real world my data represent.

In this post, I’m going to start paying attention to such things. But I’m not about to provide a comprehensive survey of the types of articles in my corpus. Instead I will focus on just one categorical distinction — that between in-house content generated by journalists and staff writers, and contributed or curated content in the form of readers’ letters and comments. Months ago, when I first started looking at the articles in my corpus, I realised that many of the articles are not news stories at all, but are collections of letters, text messages or Facebook posts submitted by readers. I wondered if perhaps this reader-submitted content should be kept separate from the in-house content, since it represents a different ‘voice’ to that of the newspapers themselves. Or then again, maybe reader’s views can be considered just as much a part of a newspaper’s voice as the rest of the content, since ultimately it is all vetted and curated by the newspaper’s editors.

As usual, the relevance of this distinction will depend on what questions I want to ask, and what theoretical frameworks I employ to answer them. But there is also a practical consideration — namely, can I even separate these types of content without sacrificing too much of my time or sanity? 40,000 documents is a large haystack in which to search for needles. Although there is some metadata in my corpus inherited from the Factiva search (source publication, author, etc.), none of it is very useful for distinguishing letters from other articles. To identify the letters, then, I was going to have to use information within the text itself. Continue reading Looking for letters →

What’s it all about? Indexing my corpus using LDA.

June 1, 2016PhD relatedcoal seam gas, KNIME, lda, text analysis, topic modellingangusv

Months ago, I assembled a dataset containing around 40,000 Australian news articles discussing coal seam gas. My ultimate aim is to analyse these articles, along with other text data from the web, so as to learn something about the structure and dynamics of the public discourse about coal seam gas in Australia. I’m interested in dissecting how different parties talk about this topic, and how this ‘configuration’ of the public discourse changes over time.

Although I didn’t originally plan to, I’ve focussed much of my energy so far on exploring the geographic dimension of the news articles. I’ve looked at where the news has come from and what places it talks about. This is all important stuff to know when studying such a geographically defined issue as coal seam gas development. But I also need to know what is being talked about, not just where. Now, finally, I am ready to turn my attention to exploring the thematic content of the articles.

Well, almost. I’m ready, but the data isn’t. The dataset that I have been playing with all this time is stuffed with articles that I don’t want, and is missing many that I do. This is because the search parameters that I used to retrieve the articles from Factiva were very broad — I obtained every article that mentioned coal seam gas or CSG anywhere even just once — and because I applied a rather rudimentary method — keyword counts — for filtering out the less relevant articles. The dataset has served its purpose as a testing ground, but if I am to use it to actually say things about the world, I need to know what it contains. And more than that, I need the ability to customise what it contains to suit the specific questions that I decide to explore.

In other words, I need an index to my corpus. I need to know what every article is about, so I can include or exclude it at my discretion. In this post I’ll describe how I have created that index using a method of topic modelling called Latent Dirichlet Allocation, or LDA. Happily, this is the very same method that I was planning to use to analyse the thematic content of my corpus. So by creating an index for my corpus, I am already starting on the process of understanding what it’s all about. Continue reading What’s it all about? Indexing my corpus using LDA. →

Mapping concepts, comparing texts

November 25, 2014PhD relatedcoal seam gas, data visualisation, heatmaps, KNIME, Leximancer, PhD, text analysisangusv

In the previous post, I explored the use of function words — that is, words without semantic content, like it and the — as a way of fingerprinting documents and identifying sets that are composed largely of the same text. I was inspired to do this when I realised that the dataset that I was exploring — a collection of nearly 900 public submissions to an inquiry by the New South Wales parliament into coal seam gas — contained several sets of documents that were nearly identical. The function-word fingerprinting technique that I used was far from perfect, but it did assist in the process of fishing out these recycled submissions.

That exercise was really a diversion from the objective of analysing the semantic content of these submissions — or in other words, what they are actually talking about. Of course, at a broad level, what the submissions are talking about is obvious, since they are all responses to an inquiry into the environmental, health, economic and social impacts of coal seam gas activities. But each submission (or at least each unique one) is bound to address the terms of reference differently, focussing on particular topics and making different arguments for or against coal seam gas development. Without reading and making notes about every individual submission, I wanted to know the scope of topics that the submissions discuss. And further to that, I wanted to see how the coverage of topics varied across the submissions.

Why did I want to do this? I’ll admit that my primary motivation was not to learn about the submissions themselves, but to try my hand at some analytical techniques. Ultimately, I want to use computational methods like text analytics to answer real questions about the social world. But first I need some practice at actually doing some text analytics, and some exposure to the mechanics of how it works. That, more than anything else, was the purpose of the exercise documented below. Continue reading Mapping concepts, comparing texts →