A perl script to create Twitter friend/follow matrices

Geek alert: if the title of this post isn’t a dead giveaway I should tell you — unless you’re interested in APIs and badly-put-together bits of code — this probably isn’t for you.

I’ve recently found myself using a service provided by Damon Clinkscale called DoesFollow. All it does is answer the simple question “does twitter user A follow twitter user B?” Apart from a frill which lets you reverse the order of your question (“does twitter user B follow twitter user A?”) that’s all it does. You can even interrogate it from the address bar like this: http://doesfollow.com/barackobama/mediaczar

does barack obama follow mediaczar?

While I was thinking about how useful a service this is, I was suddenly struck by a moment of clarity. A lot of the research I’ve been doing could be simplified by something like this.

Quite often I want to find out whether MPs or congressmen or PR people follow each other on Twitter.

The way that I’ve been doing this until now is

  1. make a list of the people who I’m interested in researching
  2. for each person on that list, grab the list of all the Twitter people whom they follow
  3. process the list so that only relationships between the people on the list show up

If all I’m doing is checking to see who follows whom, then this is a horribly wasteful way of doing things. The Twitter API limits the number of calls one can make on it — so this wastage leads to things taking much longer.

If only I could cycle all the names I want to check through something like DoesFollow!

Well – it turns out that I can. And in theory it’s not much harder than using DoesFollow. The Twitter API (which is what DoesFollow uses, after all) has a method called friendship/exists. All we have to do is send Twitter the following request:

http://twitter.com/friendships/exists.xml?user_a=barackobama&user_b=mediaczar

and it will come back with the answer:

<friends>true</friends>

or

<friends>false</friends>

Kludge-y perl code

kludge

(This fabulous picture courtesy of There, I Fixed It)

So I tried to do this using Yahoo! Pipes, but there are too many nested loops. You need to do something like this:

get list of names
for each user_a in list {
	for each user_b in list {
		  does connection exist?
     }
}

There’s no easy way to get Pipes to do this, as far as I can see (I’ll keep trying, but if someone else can help, I’d be v. grateful.)

So I’ve pulled together a badly-written perl script to do the work for me.

The script

Where next?

Most of my thinking is included above in the code comments. An obvious mistake I’m making is checking to see whether, say, @mediaczar follows @mediaczar. That wastes n API calls per search. But a more serious mistake is not to be using the new friendships/show method. Because it tells you whether user A follows user B and whether user B follows user A at the same time, it would save me lots of API calls. How many lots? Well take a look at this.

This is what I’m doing at the moment — checking each and every cell in the matrix:

clumsy API call matrix

This is what I’d be doing if I removed the diagonals:

Matrix with diagonals removed

And this is what I’d be doing if I used the newer API call:

Matrix using the new API call

I had to look up the formula for working this out without colouring in little boxes. With a little tweaking (to prevent the diagonals from creeping back in), here it is: \frac{(n-1)^{2}+n-1}{2}

So — for a list of congress people (156 on twitter as at Tuesday July 14, 2009) that’d be \frac{(156-1)^{2}+156-1}{2} API calls. Which is still a lot and will require some careful throttling, but (literally) not half as many as the (156-1)^{2} = 24,336 API calls that I’d need to run it as the script currently stands.

So – back to the drawing board for a while. I really can’t work out a programmatic way of doing this. Hmph.

Comments

  1. says

    Hey Mat

    Glad you like DoesFollow. Have you checked out http://tweepdiff.com ? It allows you to easily compare followers versus from friends for two accounts. It also has a screen that allows you to do multiple at one time. It was created by Brian Deterling (a software developer here in Austin). It may be exactly what you need.

  2. says

    In your original algo, for a list with N names you only make at most N API calls?

    “- make a list of the people who I’m interested in researching [N people]
    -for each person on that list, grab the list of all the Twitter people whom they follow”

    How about you make those N calls, but rather than process the relationships yourself the hard way, just turn all the X is a friend of Y relationships into triples, and upload them to something like 4store [http://4store.org/] or a graph app like NetworkX [http://networkx.lanl.gov/ ] and then run your queries on those graph native representations?

    (I have no idea if this will work – it’s something that’s on my to do list as a ‘maybe try this?’ hunch)

  3. says

    Damon — the TweepDiff thing is interesting, but it doesn’t do what I need!

    I’ve already got tools to dig into people’s publicly-listed friend networks. As you know, when you send Twitter an API request to do this:

    a) it must be authenticated
    b) you receive up to 100 results per call
    c) authenticated API calls are restricted to 100 per hour.

    Let’s say that I have in my list 69 people (#UK MPs currently on Twitter) who mostly have over 900 friends (and some of whom have over 9,000) — I can use up my authenticated calls really very quickly.

    It’s great getting all that data. But when all I’m looking for is directed relationships between that set of 69, it makes more sense to use the friendships/show method.

    Once I have the matrix, it’s pretty easy to drop it into SNA tools like UCINet and do fun network analysis stuff. TweepDiff — with its A/B comparisons — isn’t up to games like this.

  4. says

    Tony — see above. If only it was a matter of N calls. Instead it’s 10 calls for everyone who has more than 900 friends.

    For small value networks (like – say – the network of Porter Novelli twitter users I did back in January) this is quite feasible. But given my interest in “influence” networks, my research tends to gravitate towards people with lots of followers, which makes the original method really wasteful for most exercises.

    As it happens, writing this post cleared my mind, and I’ve now fixed the script in accordance w/ the ‘Where next?’ notes.

    I’d never seen either 4store or NetworkX before – thanks! Fascinating. Must have a play with them…

  5. says

    To save API calls, if you suspect that certain groups of people in the network are highly interconnected, (ie they each follow and are followed by each other), can you get a saving by pulling the friends and follows lists for those people, and processing the connections yourself, rather than making API calls to check each connection between them?

    SO if m people are interconnected, and they aren’t followd by so many people that you need to do more than eg 2*M API calls to get the lists, you have a saving if:

    2m 5?

  6. says

    Arghh:

    ((m-1)^2)+m-1)/2 GT 2m,
    so:
    m GT 5

    Requires hybrid code your end though, which complicates matters?

    But as a heuristic I think it works?! ;-)

Please tell me what you think.