So far I’ve managed to do some very simple keyword identification; nothing too dramatic, and it’s taken a while to get here, what with all the collection and data-cleaning scripts and processes I’ve had to write.
Last night, Mrs Mediaczar asked me why I was doing this. “Surely”, she pointed out, “you’re reinventing the wheel.” This is true, of course — and I’m not even a particularly good wheelwright, when it comes down to it. I muttered a bit, but I do have my reasons. Most of them have to do with flexibility: the freedom to create and tweak ad hoc workflows that suit individual routes of enquiry. I also think it’s important to have a feel for one’s research data; a feel you can’t get if you’re divorced from the nitty-gritty.
But the reality is that I’m not really writing anything. I’m just stringing together a set of Unix tools that are intended for more or less exactly the purpose I’m using them for. The Unix command line has a wonderfully powerful tool set for playing with text data, and it’s a pleasure to be able to wield things like wc (I don’t feel the same way about awk, but that’s what perl is for in my world).
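That sort of stringing-together looks something like this — a toy word-frequency pipeline, with made-up sample tweets and a made-up /tmp file name standing in for my real data:

```shell
# A toy version of the pipeline: count word frequencies in tweet text.
# The sample tweets and the file path are illustrative only.
printf 'Orange not good\norange is fine\nnot good at all\n' > /tmp/tweets.txt

tr -cs '[:alpha:]' '\n' < /tmp/tweets.txt |  # one word per line
  tr '[:upper:]' '[:lower:]' |               # lower-case everything
  sort | uniq -c | sort -rn                  # frequency-ranked word list
```

Nothing here that wc, tr, sort and uniq weren’t built for.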
Anyway, the basic stuff is out of the way, and now’s my chance to do something a little more interesting. In the last article, I moaned about word clouds a little, and I’m going to take the opportunity to do it again with a slightly contrived example. Here’s the word cloud visualisation of ~2.7k tweets I just pulled.
And here it is in a Wordle:
Now both of those could probably be used by an enthusiastic social media guru to tell a story (although I’d like to see how they’d explain the double appearance of the keyword “orange” in the Wordle version).
But what I know (and they couldn’t) is that the search I ran was for orange AND "not good" (did I mention that this was contrived?)
What has happened is obvious: the word “not” is a common stopword, so the visualisation application has removed it, notably changing the story in the process.
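The effect is easy to reproduce at the command line. Here’s a sketch with a tiny made-up stopword list (real lists also carry “not”) and a made-up /tmp path:

```shell
# Strip stopwords from the phrase the way a word-cloud tool might.
# The stopword list here is a tiny invented one for illustration.
printf 'not\nis\na\nthe\n' > /tmp/stop.txt

echo "orange is not good" |
  tr ' ' '\n' |                   # one word per line
  grep -vxF -f /tmp/stop.txt |    # -x: drop exact matches from the list
  tr '\n' ' '                     # the "cloud" now reads: orange good
```

The sentiment has been neatly inverted, and nobody looking at the cloud would know.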
The process I’ve got mapped out is simplicity itself:
- identify the bigrams, and
- remove the stopwords, being careful to retain the ones I consider important.
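Step one in miniature might look like this in shell (the real job is done in perl; sample text and /tmp paths are made up, and this toy version cheerfully ignores tweet boundaries):

```shell
# Step one in miniature: turn text into overlapping bigrams.
# Sample text and file paths are illustrative only; note that this
# naive version pairs words across tweet boundaries too.
printf 'orange not good\nnot good at all\n' |
  tr -cs '[:alpha:]' '\n' |        # one word per line
  tr '[:upper:]' '[:lower:]' > /tmp/words.txt

# Pair each word with the one that follows it.
tail -n +2 /tmp/words.txt | paste -d' ' /tmp/words.txt - |
  grep ' .' |                      # drop the dangling final "bigram"
  sort | uniq -c | sort -rn
```

“not good” floats straight to the top of the count, which is exactly the signal the word cloud threw away.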
The first job is made even easier by a quick Google: Thapelo Otlogetswe has already published perl code to do this, which I’ve modified only very slightly here.
Here’s what it does to the orange AND "not good" content:
Already I can see the difference, but there are still lots of useless stopwords here. For this exercise, I’ve decided that I want to retain the stopwords “not” and “rt” (I’ve just added these to a hash called %ignore, and told the stopword filter to ignore any bigram beginning with those words). I’m sure this will come back to bite me on the arse later.
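The filtering idea, sketched in shell rather than the perl it actually lives in (the %ignore hash plays the role of the grep exclusion below; the stopword list and file names are invented):

```shell
# Shell sketch of the filter: the real version is perl with an %ignore
# hash. Bigram data, stopword list and /tmp paths are made up.
printf 'not good\nis not\nthe orange\nrt orange\norange juice\n' > /tmp/bigrams.txt
printf 'is\nthe\nnot\nrt\n' > /tmp/stopwords.txt

# Keep "not" and "rt" out of the kill list, then drop any bigram
# whose first word is still a stopword.
grep -vx -e not -e rt /tmp/stopwords.txt |
  sed 's/^/^/; s/$/ /' > /tmp/stoppat.txt   # anchored patterns: "^is ", "^the "
grep -v -f /tmp/stoppat.txt /tmp/bigrams.txt
```

“not good” and “rt orange” survive the cull; “is not” and “the orange” do not.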
So now we see this output (which is more or less where I wanted to get to):
Running this process on the data I collected in the first of this series of posts, I get the following list:
Friday, 11 Jan 2013 23:44:33 Oh no! I’ve just discovered that, for some reason or another, the code I borrowed from Thapelo is over-counting. I’ll come back and fix this. I also need to check whether it’s necessary to lose the apostrophes in the bigram builder, or to create apostrophe-less versions in the stopword list. But it’s late now, and the mini-mediaczars wake early. Bed time.