Orange Project – The problem with retweets

In my last post, I looked at how to count bigrams, and touched in passing on their value to the keyword researcher.

It’s notable when looking at Twitter data how many of those bigrams are in the form “rt @{username}”, and how they’re distributed. In the 7 days of tweets that I’m using as my sample corpus, one even makes it into the top 10:

If I plot out their occurrence versus their rank, it seems that they follow a Zipf-like distribution.
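A minimal sketch of how that rank-versus-occurrence table could be built, assuming the tweets are available as plain strings and that a retweet is any tweet starting with "rt @{username}". The function names and the regular expression are illustrative, not taken from any particular tool. The second function fits a straight line to log(count) against log(rank); a slope somewhere near -1 would be consistent with a Zipf-like distribution, though it is only a rough check.

```python
import math
import re
from collections import Counter

# Assumption: a retweet starts with "rt @username" (case-insensitive).
RT_PATTERN = re.compile(r"^rt\s+@(\w+)", re.IGNORECASE)


def rt_rank_frequency(tweets):
    """Return (rank, username, count) tuples, most retweeted first."""
    counts = Counter()
    for text in tweets:
        match = RT_PATTERN.match(text.strip())
        if match:
            counts[match.group(1).lower()] += 1
    return [
        (rank, user, n)
        for rank, (user, n) in enumerate(counts.most_common(), start=1)
    ]


def loglog_slope(ranked):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two ranked entries.
    """
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(n) for _, _, n in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Plotting the rank and count columns on log-log axes gives the same picture visually; the slope is just a one-number summary of it.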


The random amplification problem

When we’re looking at social data from Twitter, most tools that I’ve used take all the retweets into account (there may be a few exceptions, and I’d be grateful if you’d let me know of any).

But it seems to me that these retweets should be seen as “Just One (Wo)Man’s Opinion”; and that (in most cases) we should de-dupe them mercilessly.
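One way the de-duping might look, as a rough sketch: strip the "RT @{username}:" prefix, normalise case and whitespace, and keep only the first tweet seen for each resulting text. This assumes the tweets are plain strings and that retweets repeat the original text verbatim; edited or truncated retweets would slip through, and the function name is made up for illustration.

```python
import re

# Assumption: retweets carry an "RT @username" prefix, optionally
# followed by a colon, before the original text.
RT_PREFIX = re.compile(r"^rt\s+@\w+\s*:?\s*", re.IGNORECASE)


def dedupe_retweets(tweets):
    """Keep one tweet per distinct underlying text, in first-seen order."""
    seen = set()
    unique = []
    for text in tweets:
        body = RT_PREFIX.sub("", text.strip())
        # Lowercase and collapse runs of whitespace so trivial
        # formatting differences don't defeat the de-dupe.
        key = " ".join(body.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Keying on the text alone means the original tweet and all of its retweets collapse into a single entry, which is the "Just One (Wo)Man’s Opinion" view of the data.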

This tweet from ageing minipop Ariana Grande turns up more than 400 times in my sample corpus (and was retweeted almost 3k times that week).

Now, it may be that lots of people agree with her about Channel Orange, but her retweets account for around a third of all mentions of the album in my sample set. Does that reflect on the popularity of the album or that of @ArianaGrande herself? How might that affect your predictions of the album’s success? This may be a bad example; after all, it peaked at #2 in the Billboard 200. But you probably know what I mean.
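The "around a third" figure is just a ratio: retweets of one account that mention the album, divided by all tweets that mention the album. A small sketch of that arithmetic, assuming plain-string tweets and a simple substring match for the mention (the function and parameter names are hypothetical):

```python
import re


def retweet_share(tweets, keyword, username):
    """Fraction of tweets mentioning `keyword` that are retweets of `username`."""
    keyword = keyword.lower()
    # Assumption: a retweet of the account starts with "RT @username".
    rt_of_user = re.compile(
        r"^rt\s+@" + re.escape(username) + r"\b", re.IGNORECASE
    )
    mentions = [t for t in tweets if keyword in t.lower()]
    if not mentions:
        return 0.0
    from_user = sum(1 for t in mentions if rt_of_user.match(t.strip()))
    return from_user / len(mentions)
```

Running it before and after de-duping shows how much of a topic's apparent volume rests on a single account.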

On the other hand, the earned media effect of a Twitter account holder with 4m followers, like Ariana, would have to have some positive effect on sales. So under other circumstances you’d want to know both who was tweeting and how often they were being retweeted. (Incidentally, Ocean’s label Def Jam is owned by Grande’s label Universal. I’m fairly naive about the record industry, so I’m plumping for “coincidence”. After all, there are kinda sorta only three majors these days, so coincidence will do as an explanation.)

The Zayn Malik example

I thought I might do a little further digging into the relationship between fame and retweets. Here’s the plot.

Mentions by follower

It seems fairly inconclusive. Sure, there’s a bit of an uptick as users enter into the realms of the super-Twitter-famous, but equally there are some stinkers.

Take a look at the pink dot at the lower right. I’ve singled that out for special mention. It marks the 5 retweets (over the 7-day period) of a tweet by popular beat combo One Direction’s Zayn Malik. Zayn may have more than 7m followers, but when he says vacuous things like,

RT @zaynmalik : So , if cheese is orange does that mean lemons are green?

then even he must be doomed to obscurity. Clearly some tweets are going to be less retweetable than others, even if you’re cute and famous. So I began to make mental notes for some kind of more complex traction model that took into account both fame and retweet worthiness.

Luckily I checked. Zayn tweeted this two years ago.

Since then, this epigrammatic masterpiece has been shared and reshared more than 5k times, spiking regularly as it touches the souls of new audiences.


So here’s a problem that I hadn’t really considered. If a single meaningless tweet from a One Direction band member can live for two years, how’s that going to affect relevance?

Please tell me what you think.