Orange Project Step 4 – Bigrams

So far I’ve managed to do some very simple keyword identification; nothing too dramatic, and it’s taken a while to get here, what with all the collecting and data cleaning scripts and processes I’ve had to write.

Last night, Mrs Mediaczar asked me why I was doing this. “Surely”, she pointed out, “you’re reinventing the wheel.” This is true, of course, and I’m not even a particularly good wheelwright when you come to it. I muttered a bit, but I do have my reasons. Most of these have to do with flexibility: the freedom to create and tweak ad hoc workflows that suit individual routes of enquiry. I also think that it’s important to have a feel for one’s research data; a feel that one can’t get when divorced from the nitty gritty.

But the reality is that I’m not really writing anything. I’m just stringing together a set of Unix tools that are intended for more or less exactly the purpose I’m using them for. The Unix command line has a wonderfully powerful tool set for playing with text data, and it’s a pleasure to be able to wield things like grep, sort and wc. (I don’t feel the same way about sed and awk, but that’s what perl is for in my world.)

Moving along

Anyway, the basic stuff is out of the way, and now’s my chance to do something a little more interesting. In the last article, I moaned about word clouds a little, and I’m going to take the opportunity to do it again with a slightly contrived example. Here’s the word cloud visualisation of ~2.7k tweets I just pulled.

Orange good

And here it is in a Wordle

Orange good wordle

Now both of those could probably be used by an enthusiastic social media guru to tell a story (although I’d like to see how they explain the double appearance of the keyword “orange” in the Wordle version).

But what I know (and they couldn’t) is that the search I ran was for orange AND "not good" (did I mention that this was contrived?)

What has happened is obvious: the word “not” is a common stopword, so the visualisation application has removed it, notably changing the story in the process.
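To make the effect concrete, here is a minimal Python sketch (not the tooling used for the visualisations above). The stopword list and the sample tweet are made up for illustration; real stopword lists are much longer, but most of them do include “not”.

```python
# Hypothetical stopword list and tweet, for illustration only.
STOPWORDS = {"is", "the", "not", "at", "all"}


def strip_stopwords(text):
    """Lowercase, split on whitespace, and drop anything in STOPWORDS."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


tweet = "Orange is not good at all"
print(strip_stopwords(tweet))  # ['orange', 'good']
```

The negation disappears along with the filler words, so what is left reads as the opposite of what the tweet said.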


“Not good” is a bigram — two tokens (words in this case) that appear together. I’d like to be able to analyse these alongside the simple keyword list I produced last time.

The process I’ve got mapped out is simplicity itself:

  1. identify the bigrams, and
  2. remove the stopwords, being careful to retain the ones I consider important.
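The first step can be sketched in a few lines of Python. This is a stand-in for illustration, not Thapelo’s perl script: the function names, the tokenising regex (which keeps apostrophes inside words) and the sample tweets are all my own assumptions. Bigrams are built one tweet at a time, so the last word of one tweet is never paired with the first word of the next.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters, digits and apostrophes (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_bigrams(tweets):
    """Count adjacent token pairs, never pairing words across two tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up sample data
tweets = [
    "Orange is not good",
    "RT orange signal not good today",
]
counts = count_bigrams(tweets)
print(counts[("not", "good")])  # 2
```

Stopword removal (step 2) then operates on the pairs rather than on single words, which is what lets “not good” survive as a unit.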

The first job is made even easier by a quick Google. Thapelo Otlogetswe has already published the perl code to do this, which I’ve modified only very slightly here.

Here’s what it does to the orange AND "not good" content:

[Screenshot: bigram counts for the orange AND "not good" tweets, before stopword filtering]

Already I can see the difference, but there are still lots of useless stopwords here. For this exercise, I’ve decided that I want to retain the stopwords “not” and “rt”. I’ve just added these to a hash called %ignore, and told the stopword filter to ignore any bigram beginning with those words. I’m sure that this will come back to bite me on the arse later.
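The filter rule might look something like this in Python. Again this is a sketch, not the perl I’m actually running: KEEP plays the part of %ignore, the stopword list is a made-up short one, and I’m assuming that a bigram is otherwise dropped if either of its words is a stopword.

```python
STOPWORDS = {"is", "the", "a", "not", "rt", "at", "all"}  # hypothetical list
KEEP = {"not", "rt"}  # plays the role of the %ignore hash


def keep_bigram(first, second):
    """Return True if the bigram should survive the stopword filter."""
    if first in KEEP:
        # Exempt from filtering, whatever the second word is
        return True
    # Assumption: otherwise drop the bigram if either word is a stopword
    return first not in STOPWORDS and second not in STOPWORDS


bigrams = [("is", "not"), ("not", "good"), ("rt", "the"), ("orange", "signal")]
print([b for b in bigrams if keep_bigram(*b)])
# [('not', 'good'), ('rt', 'the'), ('orange', 'signal')]
```

Note that (“rt”, “the”) gets through: because the exemption only looks at the first word, any junk that follows “not” or “rt” comes along for the ride.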

So now we see this output (which is more or less where I wanted to get to):

[Screenshot: the bigram counts after stopword filtering, with “not” and “rt” retained]

Running this process on the data I collected in the first of this series of posts, I get the following list:

Friday, 11 Jan 2013 23:44:33 Oh no! I’ve just discovered that, for some reason or another, the code I borrowed from Thapelo is overcounting. I’ll come back and fix this. I also need to check whether it’s necessary to lose the apostrophes in the bigram builder, or to create apostrophe-less versions in the stopword list. But it’s late now, and the mini-mediaczars wake early. Bed time.

Please tell me what you think.