Orange Project – The problem with retweets

In my last post, I looked at how to count bigrams, and touched in passing on their value to the keyword researcher.

It’s notable when looking at Twitter data how many of those bigrams are in the form “rt @{username}”, and how they’re distributed. In the 7 days of tweets that I’m using as my sample corpus, one even makes it into the top 10:

If I plot out their occurrence versus their rank, it seems that they follow a Zipf-like distribution.

Zipfplot
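For anyone wanting to reproduce the numbers behind a plot like this, here is a minimal sketch of how the "rt @{username}" bigrams might be counted and ranked with standard Unix tools. It assumes a file with one tweet per line; the function name rank_rt_bigrams is made up for illustration and isn't taken from my original scripts.

```shell
# Rank "rt @username" bigrams by how often they occur.
# Assumption: the file named in $1 holds one tweet per line.
rank_rt_bigrams() {
  # -o prints only the match, -i ignores case, -w rejects matches inside
  # longer words (so "start @bob" does not count as "rt @bob").
  grep -oiwE 'rt @[A-Za-z0-9_]+' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print NR "\t" $1 "\t" $2 " " $3 }'
}
```

The output is three tab-separated columns (rank, count, bigram), which can be plotted on log-log axes to see whether the distribution looks Zipf-like.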

The random amplification problem

When we’re looking at social data from Twitter, most tools that I’ve used will take all the retweets into account (there may be a few exceptions, and I’d be grateful if you’d let me know.)

But it seems to me that these retweets should be seen as “Just One (Wo)Man’s Opinion”; and that (in most cases) we should de-dupe them mercilessly.
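One rough way to do that de-duping at the command line is sketched below. It's only an illustration under some loud assumptions: one tweet per line, retweets marked with a leading "RT @user" prefix, and case and spacing differences treated as noise. The function name dedupe_retweets is hypothetical.

```shell
# Collapse retweets and their originals into a single line each.
# Assumptions: one tweet per line in $1; retweets start with "RT @user"
# (optionally followed by a colon); case and extra spaces are ignored.
dedupe_retweets() {
  sed -E 's/^[[:space:]]*[Rr][Tt] @[A-Za-z0-9_]+[[:space:]]*:?[[:space:]]*//' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//' \
    | sort -u
}
```

Note that this lowercases the text and throws away who did the retweeting, so it suits counting mentions rather than studying amplification.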

This tweet from ageing minipop Ariana Grande turns up more than 400 times in my sample corpus (and was retweeted almost 3k times that week.)

Now, it may be that lots of people agree with her about Channel Orange, but her retweets account for around a third of all mentions of the album in my sample set. Does that reflect on the popularity of the album or that of @ArianaGrande herself? How might that affect your predictions of the album’s success? This may be a bad example; after all, it peaked at #2 in the Billboard 200. But you probably know what I mean.

On the other hand, the earned media effect of a 4m follower Twitter account holder like Ariana would have to have some positive effect on sales. So under other circumstances you’d want to know both who was tweeting and how often they were being retweeted (incidentally, Ocean’s label Def Jam is owned by Grande’s label Universal. I’m fairly naive about the record industry, so I’m plumping for “coincidence”. After all, there are kinda sorta only three majors these days, so coincidences will provide a satisfactory explanation.)

The Zayn Malik example

I thought I might do a little further digging into the relationship between fame and retweets. Here’s the plot.

Mentions by follower

It seems fairly inconclusive. Sure, there’s a bit of an uptick as users enter into the realms of the super-Twitter-famous, but equally there are some stinkers.

Take a look at the pink dot at the lower right. I’ve singled that out for special mention. It marks the 5 retweets (over the 7 day period) of a tweet by popular beat combo One Direction’s Zayn Malik. Zayn may have more than 7m followers, but when he says vacuous things like,

RT @zaynmalik : So , if cheese is orange does that mean lemons are green?

then even he must be doomed to obscurity. Clearly some tweets are going to be less retweetable than others, even if you’re cute and famous. So I began to make mental notes for some kind of more complex traction model that took into account both fame and retweet worthiness.

Luckily I checked. Zayn tweeted this two years ago.

Since then, this epigrammatic masterpiece has been shared and reshared more than 5k times, spiking regularly as it touches the souls of new audiences.

Zaynmalik

So here’s a problem that I hadn’t really considered. If a single meaningless tweet from a One Direction band member can live for two years, how’s that going to affect relevance?

Orange Project Step 4 – Bigrams

So far I’ve managed to do some very simple keyword identification; nothing too dramatic, and it’s taken a while to get here, what with all the collecting and data cleaning scripts and processes I’ve had to write.

Last night, Mrs Mediaczar asked me why I was doing this. “Surely”, she pointed out, “you’re reinventing the wheel.” This is true, of course — and I’m not even a particularly good wheelwright when you come to it. I muttered a bit but I do have my reasons. Most of these have to do with flexibility; the freedom to create and tweak ad hoc workflows that suit individual routes of enquiry. I also think that it’s important to have a feel for one’s research data; a feel that one can’t get if you’re divorced from the nitty gritty.

But the reality is that I’m not really writing anything. I’m just stringing together a set of Unix tools that are intended for more or less exactly the purpose I’m using them. The Unix command line has a wonderfully powerful tool set for playing with text data, and it’s a pleasure to be able to wield things like grep, sort and wc (I don’t feel the same way about sed and awk, but that’s what perl is for in my world.)

Orange Project Step 3 – Tokenise and Stopword Removal

Which words are most often associated with the term “orange” in tweets? How might I improve my search construction so that I can focus in on only those mentions that are relevant to my interests, either by inclusion or exclusion? How might I use social signals to improve my keyword planning for SEO or PPC?

Keyword analysis is an interesting, useful, productive practice, but too often we’re at the mercy of third party tools and black box processes. Part of this series of experiments is to help me understand those processes better.

Orange Project Step 2 – Munging & Cleaning

This is the second in a series of posts tracing the evolution of a project. In the first post I downloaded 35k English-language tweets from Sysomos containing the keyword “orange”. Here’s a quick glance at what the data look like:

Sysomos data

A lot of the really interesting data is off screen. I could, of course, load the CSV into Excel and work on it there, but I find that it’s faster to bypass that wherever I can. Not saying it doesn’t work, mind you, only that Excel can introduce all sorts of new problems. Instead, I’ll use the csvcut tool from the csvkit package to peer at the column headings (you should be able to click all the images in this post to embiggen them). Calling csvcut -n tells it that I just want to see a numbered list of the column headings:

using csvcut to look at columns

I downloaded an awful lot of data about each tweet that I don’t really need. In fact, for my purposes, all I really want is the body of the tweet, as contained in the Content column (number 27).

A first glance at the data

I’m using csvkit’s csvlook tool to look at the data (the -c 27 tells csvcut to select the 27th column):

csvlook output for the Content column

Almost immediately I spot a problem; there’s a tweet here with no mention of the word ‘orange’ (highlighted):

no mention of keyword

Sysomos’s search (while fast) isn’t particularly nuanced – after all, it’s a general-purpose platform that has to work with data from multiple sources, including blogs, forums, news sites, Facebook and Twitter. So my search didn’t pull up tweets containing the keyword ‘orange’, but rather looked for it across the whole record. Because this tweet came from an account whose username contained the word, it was pulled into the pot along with everything else.

So I’m presented with at least one problem: I need to remove tweets that don’t contain the keyword from the data set. Just grepping will do that for me (plus, it has the side benefit of removing the column header.)

I’m using the -i flag so that it’s case insensitive (“orange”, “Orange”, “ORANGE” and “OrAnGe” will all be accepted).

grep -i
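The screenshot hasn't survived, so here is a small sketch of the kind of filter described above. It assumes the tweet bodies have already been extracted to a file with one tweet per line; the wrapper function keep_keyword is a made-up name for illustration.

```shell
# Keep only the lines that mention the keyword, ignoring case.
# $1 = keyword, $2 = file with one tweet body per line (assumed layout).
keep_keyword() {
  # -i: case insensitive; -F: treat the keyword as a literal string.
  grep -iF -- "$1" "$2"
}
```

Called as `keep_keyword orange tweets.txt`, this also drops a "Content" header line, since the header doesn't contain the keyword.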

So that works nicely. Time to drop it into a loop. In the following code, I’m using a trick I picked up for grabbing the filename part of a file. It’s a bit gratuitous here, if I’m honest — I could have done it with perl or sed just as easily.

I’m saving the data to simple text files; I’ll be loading these into R later.
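The loop itself was in an image that's no longer here, so what follows is a reconstruction in spirit only. It assumes the filename trick was shell parameter expansion (that's a guess on my part), that the input files already hold one tweet body per line, and that they end in .csv. The function name filter_all and the directory arguments are hypothetical.

```shell
# Filter every .csv file in a directory and write a .txt file per input.
# $1 = input directory, $2 = output directory (both hypothetical).
filter_all() {
  mkdir -p "$2"
  for f in "$1"/*.csv; do
    [ -e "$f" ] || continue      # skip if the glob matched nothing
    name=$(basename "$f")        # drop the directory part
    stem=${name%.*}              # drop the extension: 2012-12-20.csv -> 2012-12-20
    # grep exits non-zero when nothing matches; that is not an error here.
    grep -i 'orange' "$f" > "$2/$stem.txt" || true
  done
}
```

Writing to a separate output directory keeps the downloaded files untouched, so the filter can be rerun if the keyword changes.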

Looking at one of the files with less I can see that there are some nasty unicode characters:

unicode characters

For some reason cat displays them for what they probably are, emoji from iPhone clients:

emoji

So I need to clean these off. I’ve used iconv in the past, so I’ll just copy and paste that code into the original loop.
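As a rough sketch of that cleaning step: iconv can convert UTF-8 to plain ASCII, and its -c flag asks it to discard characters that can't be converted. The exact invocation I used isn't shown here, so treat the flags below as one plausible choice, and the function name strip_non_ascii as invented for the example.

```shell
# Convert a UTF-8 file to ASCII, discarding characters (such as emoji)
# that have no ASCII equivalent. $1 = input file.
strip_non_ascii() {
  # Some iconv builds exit non-zero when -c discards characters,
  # so the exit status is deliberately ignored in this sketch.
  iconv -c -f UTF-8 -t ASCII "$1" || true
}
```

This is a blunt instrument: it also removes accented letters and curly quotes, which may or may not matter for the analysis.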

Now all the files are nice and clean, and ready for the next stage. That’s two blog posts and I’ve not begun to do anything interesting. I’ve found (like so many others before me) that it pays to get things in order. In fact I’m probably moving at an unhealthily rushed pace here.

Orange Project Step 1 – Data Collection

This is the first in a series of planned posts that track how my workflows evolve and develop around a project. They’re a bit edited and idealised (I’ll only include my errors and dead ends if they might be enlightening, for example.)

I’m collecting together a corpus of tweets to run some experiments over the Christmas holidays. I’ve used Sysomos to collect English-language tweets containing the word “orange” day by day over the past week. I don’t know how Sysomos identifies English (and I really should) – but what I’m planning to do is hard enough without having to involve other languages.

Sysomos has 432,880 such tweets in its database, but Twitter’s firehose ToS mean that I can’t download them all; instead, Sysomos lets me pull a random sample of 5,000 per search. So I’ve been through 7 days, and downloaded 5k per day, giving me 35k tweets which should (I hope) be a good place to start my analysis.

Sysomos's downloaded files don't have useful filenames

The first thing to note is that Sysomos file names give no clue as to their contents, only the day on which they were downloaded (the sequential numbering is added by my OS). This means I have to open each file to see what’s in them. The relevant metadata are on the description line (line 5), and the real data start on line 7.

Useful metadata are on the description line (line 5), and the real data start on line 7

Now it’s fairly simple to get those data at the command line by looping through each file and extracting only the fifth line:

Extract the description line from each file
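Since the screenshot is missing, here is a minimal sketch of that loop. It prints each filename next to its fifth line; the function name description_lines is hypothetical, and the tab-separated output is just a convenience I've chosen for the example.

```shell
# Print "filename<TAB>line 5" for each file given as an argument.
description_lines() {
  for f in "$@"; do
    # sed -n '5p' prints only the fifth line of the file.
    printf '%s\t%s\n' "$f" "$(sed -n '5p' "$f")"
  done
}
```

Called as `description_lines *.csv`, it gives a quick index of what each download contains without opening any of them.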

From here I should be able to construct new file names. For my purposes, I really just want to grab the first half of the second field and rename the file accordingly so that the dates are first. I could do this all with sed or perl but have come to rely on csvkit — a toolbox maintained by Christopher Groskopf. So I’m being a bit lazy here…

Extract only the second column of the description line

Then use perl to do the last bit, cutting-and-pasting to get the final file names.

Create new filename

Now I’ve got the filenames sorted, I can create new files with the new names (experience tells me that renaming with mv is rarely a good idea in these workflows). I’m using tail -n +7 to trim off the first 6 lines of each file. This gets me into a good position to begin munging and cleaning the data.

Create new files with the new names
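The copy-and-trim step might look something like the sketch below. It handles one file at a time and leaves the original in place; the function name strip_preamble and its two arguments are assumptions made for the example.

```shell
# Copy a download to a new name, dropping its first six lines.
# $1 = original file, $2 = new file name (both supplied by the caller).
strip_preamble() {
  # tail -n +7 starts output at line 7, where the real data begin.
  tail -n +7 "$1" > "$2"
}
```

Because the original file is never modified, a mistake in the new name costs nothing: delete the copy and run it again.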

So that’s the final line of code that I need to process Sysomos downloads.

This may all look like a lot of work for nothing, particularly as I’m only dealing with a week’s worth of files here. Imagine, though, if I were running this analysis across multiple keywords; or looking at a month’s worth of data.

Seasonal Chocolate

I’m doing some jiggery-buggery at the moment around the general theme of “Social Listening 2.0”. I think we’re all more or less agreed that the promise of the early Social Listening platforms (“Funded by Homeland Security grants! Now available to marketers!”) hasn’t really been borne out in practice. But there are interesting and exciting things happening in academia and the start-ups and I wanted to get my head around the basics of machine learning so that I can make better decisions about how we approach the problem.

Anyway, I’ve taken as my test project Twitter mentions of the word “Orange.” I’ve made a (fairly educated) guess that this will be a good way to explore and illustrate the problems and to investigate the potential range of solutions. Yesterday, there were over 100K tweets that included the word. Some of these used the word as an adjective, others as a noun.

orange

Or even:

terrys orange

I’d forgotten about Terry’s Chocolate Oranges. And a quick glance at Google and Twitter trends suggests I’m not the only one. Never having considered it, I hadn’t realised quite how seasonal interest was when it came to certain chocolate products.

Search Seasonality in Chocolate

Tweets mentioning "Terry's Chocolate Orange"

Wonder which tracks better to sales figures, Twitter or Google? Anyway — season’s greetings to you all, hope you have happy hols, and see you in the new year.