Orange Project Step 3 – Tokenise and Stopword Removal

Which words are most often associated with the term “orange” in tweets? How might I improve my search construction so that I can focus in on only those mentions that are relevant to my interests, either by inclusion or exclusion? How might I use social signals to improve my keyword planning for SEO or PPC?

Keyword analysis is an interesting, useful, productive practice – but too often we’re at the mercy of third-party tools and black-box processes. Part of this series of experiments is to help me understand those processes better.

A Sysomos Example

When I’m using Sysomos for ad hoc research, I’ll often go through several iterations of a labour-intensive manual process that goes something like this:

  1. Make a “best guess” keyword search based on experience and nous.
  2. Eyeball the word cloud that illustrates the relative frequency of “associated keywords” as a safety check to make sure I’m on the right track.
  3. Look at the chart that shows volume of mentions over time, identifying notable spikes in volume.
  4. Zoom in on one spike, narrowing the search to the period of the spike.
  5. Look at the word cloud for the new sub-period: identify possible reasons for the spike, and modify the keyword search if necessary. Quite often the spikes will be caused by something unpredictable – a much re-tweeted joke, for example; a news story that generates heavy coverage in the trade or finance blogs but has little currency with everyday Twitter folk; or a “RT to win” promotion from the brand-owner or a third party. Make a note of the date and the reason for the spike in the research log (which is really just a fancy name for the text file of notes and copy-pasted material that I keep open in BBEdit while I’m running the search).
  6. Zoom back out to the original period of the search.
  7. Repeat steps 4–6 for each spike, bearing in mind that – as we begin to reset the y-axis – new spikes may reveal themselves.

Here’s a super-quick demonstration – a search for English-language tweets mentioning the keyword “orange”. There’s a clear spike around December 2–3.

[Figure: Sysomos chart of mention volume over time, showing a spike around December 2–3]

Zooming in on those days, and looking at the word cloud, I can see that there’s clearly something happening around the Orange Bowl:

[Figure: Sysomos word cloud for the December 2–3 sub-period]

It doesn’t take too long to identify the cause:


However, once I’ve adapted my search (“orange AND NOT bowl”), I see that there’s still some noise:

[Figure: Sysomos word cloud after excluding “bowl”]

Oh dear. That’ll be an episode of TOWIE, then. Several thousand UK TV viewers shared a version of this joke:

Once I’ve adjusted my search again (“orange AND NOT (bowl OR wankers)”), the December 2–3 spike is flattened.

[Figure: Sysomos volume chart with the December 2–3 spike flattened]

Some problems with this approach

It takes forever, and there’s no clear place to stop. The process I’ve described is horribly open-ended. It’s easy (if time-consuming) to flatten out the obvious spikes, but the landscape is more or less fractal – the more I flatten one set of spikes, the more obvious new spikes become. It’s a bit of an art deciding what exclusions to build into your search, and it’s likely that no two practitioners would create identical searches.

Also (and this is more a mismatch between the tools we use and what we want to do with them) we can’t easily see how stories (as identified by keywords like “bowl” and “wankers” in this case) ebb and flow and reappear without lots of manual intervention. It’s hard to create time series.

Sysomos does let me download the data behind the word clouds as a CSV, but expresses those data as percentages, not as absolute numbers, making it harder to see patterns emerging between periods.

But most importantly, word clouds suck.

Jeffrey Zeldman notably described tag clouds (the forebear of word clouds) as the mullet of the internet, and I’m afraid that I have to agree. Word clouds suck both as a means of exploring and communicating information, and as user experience. If they form a major part of any of your presentations, they really shouldn’t. I don’t want to go into it too much here, but I’d like hard numbers, not font sizes.

The do-it-yourself approach

Finally we’re at the meat of this post. It’s taken a while, and if you’re still here, I’d like to commend your stamina.

If I’m going to do any interesting text mining of the tweets I collected and cleaned, I’m going to need to be able to do something as basic as identifying and counting keywords. So this is a good place to start.

Tokenise

It sounds obvious, but the first job I need to do is to break the text down into words. It seems that defining what constitutes a word is more complicated than I might first have believed – so the method I choose will have an effect on the results. Perl (my scripting language of choice) has a nice split() function that would let me split text using spaces as a delimiter – but that can leave a lot of cleaning up to do around things like punctuation and numbers.
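For instance, here’s the kind of mess a plain whitespace split leaves behind – a minimal sketch, with a made-up sample tweet:

    use strict;
    use warnings;

    # Splitting on whitespace alone leaves punctuation glued to the tokens.
    my $tweet = q{RT @example: Orange you glad? http://t.co/abc123};
    print "$_\n" for split /\s+/, $tweet;
    # prints tokens like "@example:" and "glad?" - punctuation and all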

After a little experimentation, the regular expression /([a-z][a-z'-]*)/gi seems a pretty good way to identify words (and a great example of why normal people freak out when they see perl and regexes).

It looks for groups of “alphabet characters” ([a-z]) and will allow hyphens and apostrophes anywhere but the first character ([a-z'-]).
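Here’s a minimal sketch of the tokeniser as a standalone script – the tokenise.pl filename and the lower-casing are my own choices (folding case means “Orange” and “orange” get counted as the same term later):

    #!/usr/bin/perl
    # tokenise.pl - print one word-like token per line
    use strict;
    use warnings;

    while (my $line = <>) {
        # a letter, then any run of letters, apostrophes or hyphens
        while ($line =~ /([a-z][a-z'-]*)/gi) {
            print lc($1), "\n";
        }
    }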

[Screenshot: tokeniser output]

That works more or less as expected. So let’s do some counting. I’m still using a mix of shell and perl to get this done (uniq’s -c flag handily prefixes each output line with the number of times the term occurs).
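Assuming the tokeniser above is saved as tokenise.pl, the pipeline looks something like this – uniq -c needs sorted input, and the final sort -rn puts the biggest counts first:

    perl tokenise.pl tweets.txt | sort | uniq -c | sort -rn | head -20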

[Screenshot: token counts from the tokenise-and-count pipeline]

Which works very nicely indeed. I can modify this a little to do the same thing on a file-by-file basis (this will give me the keyword counts per day).
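A shell loop along these lines would do it – the tweets-*.txt naming is just an assumption about how the daily files are stored:

    for f in tweets-*.txt; do
        perl tokenise.pl "$f" | sort | uniq -c | sort -rn > "$f.counts"
    done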

Remove Stopwords

Looking at the top keywords in the list, we can immediately see a problem. The most common keywords are a bit – well – pointless.

“The” is the most common word in the English language. “RT”, “t”, “co” and “http” are pretty common on Twitter. For my purposes, these words appear so often as to lose their usefulness as keywords; they are “stopwords” which – as Wikipedia helpfully points out – are not to be confused with “safe words” (for the record, my safe word is “whose turn is it to look after the children?”). Removing stopwords is the next step in the process. Then we’ll have got somewhere.

There are all sorts of libraries and modules in perl, python and R for doing this – however, I believe that Twitter’s language is sufficiently idiosyncratic to warrant its own custom stopword list, and I’ve started compiling one for my own use (feel free to grab it). I’ve also had to knock together a simple script to process the output of the tokenisation step discussed above.
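Here’s a rough sketch of that filter – stopwords.pl and the stopword filename are my own labels. It loads one stopword per line into a hash, then drops any “count term” line whose term is on the list:

    #!/usr/bin/perl
    # stopwords.pl - strip stopword lines from "count term" output
    use strict;
    use warnings;

    # first argument: the stopword list, one word per line
    my $stopfile = shift @ARGV or die "usage: stopwords.pl stopword-list < counts\n";
    open my $fh, '<', $stopfile or die "Can't open $stopfile: $!";
    my %stop;
    while (<$fh>) {
        chomp;
        $stop{lc $_} = 1 if length;
    }
    close $fh;

    # pass through the output of sort | uniq -c, skipping stopword terms
    while (<STDIN>) {
        my ($term) = /^\s*\d+\s+(\S+)/;
        print unless defined $term && $stop{lc $term};
    }

Bolted onto the end of the earlier pipeline:

    perl tokenise.pl tweets.txt | sort | uniq -c | sort -rn | perl stopwords.pl twitter-stopwords.txt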

So here’s the new output.

That’s looking a lot better. Time to call it a night.

Please tell me what you think.