Datanoodling for Fun & Profit (Brighton SEO talk)


Back in January, Kelvin asked me,
“so reckon you could do something like a marketer’s guide to R?”

This was the most exciting invitation to speak that I’ve ever had, and I’ve had some exciting invitations.

“Ooh. That could be fun,” I said. “I could do an overview of grabbing data from the various (say Facebook) APIs and analysing that in R.”

What is R?


R is an amazing free tool for data mining and performing statistical analysis.

It’s free to download and it’s well supported.

There’s a huge and active community of developers building tools for you to use.

In fact, by most estimates, R is the 2nd most popular data and analytics tool out there. The first is Python. Python is also great for datanoodling.

What’s it for? Well, it replaces the sort of hard work that you perform in Excel every day, and frees up your time to do more — and more interesting — things.


Here’s an example of the sort of thing I mean. Let’s say I’m looking at how long my Facebook Page’s Posts remain active in users’ newsfeeds.

This could be quite a useful thing to know when it comes to thinking about posting frequency, for example, or when in the cycle to promote a post.

So I’m looking at comment times as a proxy for that. I’m guessing that when the comments start to tail off, then the post is beginning to disappear from users’ newsfeeds.

Once I’ve got the data into Excel, there are another 22 steps I need to go through, each involving multiple mouse clicks and key presses. All modesty aside, I’m pretty good at Excel; but I still don’t want to have to do this all the time. So I write out recipes that I can give to the team, so that they can perform this analysis for themselves.

Performing the first analysis, writing out the recipe, walking someone through it: this all takes time. An hour sounds about right.

And every time one of us wants to create a comment delay histogram like this, it takes more time. And if I want to change the recipe, more time.


On the other hand, in the same time or less, I can write a short function in R that will do the same thing.

Once it’s written, it takes no time at all to run.

Instead of writing recipes, I’m creating tools.

And I can modify the code very easily and quickly to make new tools.
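To make that concrete, here’s a minimal sketch of the sort of tool I mean — in Python rather than the talk’s R, and with invented function and field names. It takes each comment’s timestamp, subtracts the parent post’s timestamp, and bins the delays into hourly buckets: the raw material for a comment delay histogram.

```python
from collections import Counter
from datetime import datetime

def comment_delay_histogram(posts, comments):
    """Bin comment delays (hours after the parent post) into a histogram.

    posts    -- dict mapping post_id -> post publication datetime
    comments -- list of (post_id, comment datetime) pairs
    (Both structures are invented for this sketch.)
    """
    delays = Counter()
    for post_id, commented_at in comments:
        hours = (commented_at - posts[post_id]).total_seconds() / 3600
        delays[int(hours)] += 1          # bucket by whole hours
    return dict(delays)

# Usage: one post, three comments at 10 min, 50 min and 3 h after posting.
posts = {"p1": datetime(2013, 9, 1, 12, 0)}
comments = [
    ("p1", datetime(2013, 9, 1, 12, 10)),
    ("p1", datetime(2013, 9, 1, 12, 50)),
    ("p1", datetime(2013, 9, 1, 15, 0)),
]
print(comment_delay_histogram(posts, comments))  # {0: 2, 3: 1}
```

Once a function like this exists, the 22-step Excel recipe collapses into a single call.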

So R is very useful, time-saving and powerful. But we won’t just be talking about R. We’ll touch on Python and SQL as well…


How many of you have ever written any code? Let’s be generous here, and include HTML and Excel formulae.

Keep your hands up if you write code or do data analysis for a living.

Here’s the thing. I’m not a programmer. I’m not even a trained data analyst.

That’s not good, I thought. I’m going to upset those of you who write code for a living, and I’m going to scare those of you who don’t without edifying you.

But then I realised, “that’s the whole point! That’s why I’m qualified to talk about this”

Because I’m not a developer. I’m a marketer — with a degree in English Literature. And we know that people with English Literature degrees are barely numerate!

If I can do this, so can you! So can anyone! Better than I can!

So listen up; I’m channeling the righteous stuff.


Of course, I have a head start on many of you. This was my first computer.

My father bought it from WHSmith in 1980. It cost about £70 – the equivalent of around £250 today. It had 1 kilobyte of memory and it plugged into our television.

It was almost completely useless. And I loved it.

So men (and it mostly was men) of my age grew up knowing the feeling of typing badly formed code directly into a computer.

We’re not scared of text.


Hands up those of you who’re using a Mac today.

Keep your hands up if you’ve ever done anything in the terminal.

You’re all holding amazing machines.[1] In recent years, Apple has done its very best to hide this from its customers, but you’re sitting on some of the best free data analytics tools in the world; the sorts of tools that are used in universities, scientific research labs and hedge funds all over the world. These are the tools that data scientists use. The tools that helped discover the Higgs Boson. And they’re on your laptop, or free to download.

And you’re free to noodle with them.


Talking to you today is a bit of a step outside my comfort zone.

Normally I try to share fully-formed insights that illuminate the dark corners of Social Media and digital marketing.


But today, I’m going to talk about the process that we go through to get there.

The reality is that a good insight looks perfectly formed and common-sensical, but it takes an awful lot of trial and error (and mostly error) and messiness and wrong turns to get there.

Well, it does me, anyway. So why would I do this? Why would I expose you to the seamy underside?


The world is changing faster than marketers can change.

You know everything I’m about to say as well as I do.

  1. There’s more data than ever before. And — judging by some infographics I’ve seen – not all of us know how to process those data to find the signal among the noise.
  2. And marketers are relying on data more than ever, so it’s reasonable to say that this is part of the job we’re being paid to do. Yet we’re trusting tools and algorithms to optimise our campaigns for us, or leaving our campaigns at the mercy of black-box algorithms we barely understand.
  3. Of course, there are plenty of tools to help us sort through the guff. Nearly all of them share this quality: they’re using readily — even freely — available data, adding a layer of UX, and selling it back to us at a grossly inflated rate. Another quality: they’re created by engineers, not marketers. The engineers are guessing at what we want to do with the data, and then building tools. And we’re letting those tools define how we do our business.
  4. What we need is freedom to play. To make mistakes. To feel the data flow through our fingers; to develop an understanding. And — I’d suggest — we can only do this by writing code to collect and manipulate data.


The reality is that I can only really write code because of Google and the communities it enables.

Here’s a schematic view of what happens when I write code.

  1. I write something that I think should work.
  2. If it works, great. If not, I Google the error. I don’t need to remember how to do things any more. I used to look at my notes and my old code to see how I’d solved a problem in the past. Now I just Google it.

The downside of this, of course, is that I can barely write code without an internet connection.

But using Google to write code works for me, and it will work for you.


There’s kind of a process to datanoodling.

A lot of it involves taking the data and turning it into charts. Charts are an excellent way to explore data. They’re also a good way to communicate your insights.

So I’m going to take you through a worked example, showing you how that works.


Let’s go back to that original example. How long does a Facebook Post remain active on Facebook? The reality is, of course, different for different Pages. And it changes over time.


Here’s something that those of you who have a Mac or a Linux machine can do. You can pop over to Facebook; get a temporary access token.

Then you can open the Terminal, and type this in.

Bingo! You’ve pulled your first data out of the Facebook API!
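The terminal one-liner itself isn’t reproduced here, but the shape of the request is simple: it’s a plain HTTPS GET against the Graph API. Here’s a hedged Python sketch that builds such a URL — the page name and token are placeholders, and the exact endpoint and fields may differ for your API version.

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com"

def feed_url(page, token, limit=25):
    """Build a Graph API URL for a Page's posts.

    The endpoint shape may vary by API version; 'page' and 'token'
    here are placeholders for your own values."""
    query = urlencode({"access_token": token, "limit": limit})
    return f"{GRAPH}/{page}/posts?{query}"

url = feed_url("odeon", "YOUR_TEMPORARY_TOKEN")
print(url)
# Fetching it is one more line, e.g.:
#   import urllib.request; data = urllib.request.urlopen(url).read()
```

The terminal version is the same idea: the URL, pasted after `curl`.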


From there, it’s only a matter of codenoodling to write a little tool to do it automagically.

I write these tools in Python, principally because it’s the world’s easiest language to learn. I used to write everything in Perl, but I had to look everything up all the time.

I’m pulling lots of data out of Odeon Cinemas’ Facebook Page; all the Posts, and all the Comments on those Posts.

I use a lovely site called ScraperWiki to do this. For all sorts of reasons, it’s nicer and easier to use ScraperWiki than it is to run the code from my own machine.


One of those reasons is that it makes it easy to look at the data I’ve collected, lets me query those data, and gives me all sorts of ways to download them.


But it also stores my data in an SQLite database that I can access. For those of you with Macs, SQLite is the easiest way I’ve found to run SQL queries. It’s already on your machine.

SQL is a big part of datanoodling.

I use an app called Base to connect to the database and create a download of the data I’m looking for. It’s a lovely tool, although you do have to pay for it. There are other (free-er) tools out there; but this is the one that (as far as I’m concerned) just works.
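SQLite needs no server at all: Python’s standard library can create a database (even an in-memory one) and run SQL against it. Here’s a small sketch in that spirit — the table layout is invented, and ScraperWiki’s actual schema will differ — counting comments per post.

```python
import sqlite3

# In-memory database standing in for the ScraperWiki SQLite file.
# The schema below is invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (post_id TEXT, created_time TEXT)")
db.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("p1", "2013-09-01T12:10"), ("p1", "2013-09-01T12:50"),
     ("p2", "2013-09-02T09:00")],
)

# Comments per post, busiest first.
rows = db.execute(
    "SELECT post_id, COUNT(*) AS n FROM comments "
    "GROUP BY post_id ORDER BY n DESC"
).fetchall()
print(rows)  # [('p1', 2), ('p2', 1)]
```

Point the `connect()` call at a real `.sqlite` file instead of `:memory:` and the same queries work on downloaded data.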


Anyway, I take this dataset and drop it into RStudio.

RStudio is the tool used by almost everyone who uses R. It’s free. I urge you to download it. I can’t imagine how I’d do anything without it.

You’ve already seen this chart: what it shows is that most of the comments on posts come in the first hour, and the rate quickly tails off. By this point, 5 hours after the posts have been published, the comments have dwindled to nearly nothing. Then there’s this really long tail.


But when I was looking at the chart, I saw this bump. I’ve plotted the comments on about 6,000 posts here. Instinctively, I’d expect it to be a bit smoother.

So that got me thinking. Clearly different posts would have different charts. What if the Page is boosting its Posts?

Now this is of great interest to everyone in Social these days; we know we have to combine paid and organic to make the most of our content and audience.

But we’ve NO idea what other people are doing. Wouldn’t it be interesting if I could see what Odeon Cinemas were up to?


So I changed the code a tiny bit and turned the histogram on its head.

Here we’re looking at cumulative comments. And I’ve changed the y-axis to show percentages, not counts. Actually, it’s a bit easier to read. Here’s the 50% mark, and I can see that we’ve got a half-life of about a couple of hours.
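That “half-life” is just the median comment delay: the point by which half of all comments have arrived. As a sketch (plain Python standing in for the talk’s R):

```python
def half_life(delays_hours):
    """Median comment delay: the time by which 50% of comments arrived.
    A simple middle-element median -- a sketch, not the talk's actual code."""
    ordered = sorted(delays_hours)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Most comments early, a long tail afterwards (invented delays, in hours):
delays = [0.2, 0.5, 0.8, 1.5, 2.0, 3.0, 8.0, 24.0, 48.0]
print(half_life(delays))  # 2.0
```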

Changing the code slightly again, I can plot ALL the posts in one big mess.

So I’m beginning to think that there are maybe two patterns superimposed on each other. Perhaps this one for Organic Posts, and this one for Posts with some spend behind them.


I’m trying to think of a good way to split these out. Some more code, and we’ve got a histogram of how many comments posts receive. Most of them are getting fewer than 10 or 20 comments. But some of them are getting as many as 200.

I’m guessing — just guessing — that the Paid Posts will get more comments. So I can select the top posts, take a look at those.
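Selecting the most-commented posts is a one-liner once the counts exist. A hedged sketch with invented numbers:

```python
from collections import Counter

# post_id -> comment count (made-up numbers for illustration)
counts = Counter({"p1": 212, "p2": 7, "p3": 180, "p4": 15, "p5": 3})

# The handful of outliers -- the candidate Paid Posts.
top = counts.most_common(2)
print(top)  # [('p1', 212), ('p3', 180)]
```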


And that’s what I do. I plot the curves for the top 10 or so posts.

This looks promising, right? Doesn’t look organic, at any rate. There’s the nice 2-hour half-life curve. And then bingo! It all kicks off again. And again a day later.


Here are a few more.

Looks like we may have established a pattern for identifying artificially boosted posts!

So the next step in datanoodling is to have a bit of a ponder. Pondering is underrated.

And I decide that the best thing to do is to test it on some known data from one of our clients. We know which posts are paid, and which aren’t.


So, back into ScraperWiki.

Even though this is a Page that we control, I need to do this to get the data on the Comments. Facebook only lets you get at that through the API.


Then into Facebook Insights to download the spreadsheets. This isn’t the best way of doing things, but I haven’t yet written a tool to download data directly from the Insights API.

I’d like to say that this is because I’ve not spent much time trying. The reality is that I’ve spent AGES trying and mostly failing. I’ll get there though.


The data doesn’t come in the shape I want it, so I have to sit in Excel reformatting, manipulating and copy-pasting the data — this process is generally called “munging.” The recipe I wrote for this is much longer than the 22-step one I showed you earlier; and it comes with pictures.


Then into SQL again… to combine the ScraperWiki data with the Insights data, and…
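Combining the two datasets is a straightforward SQL join on the post ID. A sketch with invented tables — the real ScraperWiki and Insights schemas will differ:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Invented stand-ins for the scraped comments and the Insights export.
db.execute("CREATE TABLE comments (post_id TEXT, delay_hours REAL)")
db.execute("CREATE TABLE insights (post_id TEXT, paid INTEGER)")
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [("p1", 0.5), ("p1", 26.0), ("p2", 0.3)])
db.executemany("INSERT INTO insights VALUES (?, ?)",
               [("p1", 1), ("p2", 0)])

# Comment count per post, alongside the known paid/organic flag.
rows = db.execute(
    "SELECT c.post_id, COUNT(*) AS n, i.paid "
    "FROM comments c JOIN insights i ON c.post_id = i.post_id "
    "GROUP BY c.post_id ORDER BY c.post_id"
).fetchall()
print(rows)  # [('p1', 2, 1), ('p2', 1, 0)]
```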


Back into R. Here’s the curve on a known Paid Post…

It looks the same! I’ve hit on something.


So now I’m wondering whether we can identify Paid Posts another way. Here I’m plotting the median comment delay (which is the same as the half-life) against the number of comments.

The red posts are the ones that have been promoted. You can see a nice enough pattern here.
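One crude way to turn that pattern into a rule — and it is a guess, with invented thresholds, not a validated model — is to flag posts whose comment count and median delay both sit above a cutoff:

```python
def looks_promoted(n_comments, median_delay_hours,
                   min_comments=50, min_delay=4.0):
    """A crude, guessed heuristic (thresholds are invented): Paid Posts
    seemed to attract more comments AND keep attracting them for longer."""
    return n_comments >= min_comments and median_delay_hours >= min_delay

# (n_comments, median delay in hours) per post -- invented numbers
posts = {"p1": (212, 9.5), "p2": (7, 0.8), "p3": (60, 1.2)}
flags = {pid: looks_promoted(n, d) for pid, (n, d) in posts.items()}
print(flags)  # {'p1': True, 'p2': False, 'p3': False}
```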


And here’s the same plot for Odeon Cinemas, where I don’t know which ones are promoted. But I’m beginning to think I might guess.


But this isn’t what I’m paid to do for a living. I write — as I’m sure I’ve told you — bad code.

Sometimes you need to take a step back and acknowledge that you’re going down a rabbit hole. Or pay someone else to go down it for you.

All of this started when I saw this little bump. But there’s no reason to believe that this bump is actually connected to the paid posts.

Nonetheless, I think (I’m not sure) that I’ve identified a potentially reliable way to identify Promoted Posts on 3rd party Pages. Which could be interesting.

But that wasn’t the question I set out to answer. I’ve been completely sidetracked.

And that is the beauty and danger of datanoodling. You need to know when to stop. I don’t.


With luck, I’ve persuaded one or two of you to start datanoodling. Hands up those of you who want to give it a go!?

I’d love to give you a recipe to help you on your way. But I really don’t think that there’s one way.

You just have to give it a go. Read some stuff. Try it out on your own data, your own questions.

All you have to do — as runners say — is put the miles in, and it gets easier, and new horizons open to you.

There are so many good free courses and articles and tools out there; just start somewhere.


If you’re going to buy one book, though, buy this one: Data Smart: Using Data Science to Transform Information into Insight. It will blow your mind without you ever having to leave Excel.


Feel free to come up to me and ask questions at lunch. Or ping me on @mediaczar.

Thank you very much. You’ve been a lovely audience!

[1] The same is clearly true if you’re using Linux, but if you’re using Linux, you don’t need me to tell you this.


Please tell me what you think.