I’m probably missing something really obvious here, but I’ve just been looking at the clicks on bit.ly shortlinks embedded in posts on two Facebook Pages, Asda and TOMS. It seems that each can be plotted on really nice-looking curves:
Is that so obvious as not to merit mentioning? Why should it be like this?
Which words are most often associated with the term “orange” in tweets? How might I improve my search construction so that I can focus in on only those mentions that are relevant to my interests, either by inclusion or exclusion? How might I use social signals to improve my keyword planning for SEO or PPC?
Keyword analysis is an interesting, useful, productive practice, but too often we’re at the mercy of third-party tools and black-box processes. Part of this series of experiments is to help me understand those processes better.
This is the second in a series of posts tracing the evolution of a project. In the first post I downloaded 35k English-language tweets from Sysomos containing the keyword “orange”. Here’s a quick glance at what the data look like:
A lot of the really interesting data is off screen. I could, of course, load the CSV into Excel and work on it there, but I find that it’s faster to bypass that wherever I can. Not saying it doesn’t work, mind you, only that Excel can introduce all sorts of new problems. Instead, I’ll use the csvcut tool from the csvkit package to peer at the column headings (you should be able to click all the images in this post to embiggen them). Calling csvcut -n tells it that I just want to see a numbered list of the column headings:
I downloaded an awful lot of data about each tweet that I don’t really need. In fact, for my purposes, all I really want is the body of the tweet, as contained in the Content column (number 27).
A first glance at the data
I’m using csvkit’s csvlook tool to look at the data (the -c 27 tells csvcut to select the 27th column):
Almost immediately I spot a problem; there’s a tweet here with no mention of the word ‘orange’ (highlighted):
Sysomos’s search (while fast) isn’t particularly nuanced – after all, it’s a general-purpose platform that has to work with data from multiple sources, including blogs, forums, news sites, Facebook and Twitter. So my search didn’t pull up Tweets containing the keyword ‘orange’, but rather looked for it across the whole record. Because this tweet came from an account whose username contained the word, it was pulled into the pot along with everything else:
@ryangodier lhfh hahah no not at all ima laugh at you haha
So I’m presented with at least one problem: I need to remove tweets that don’t contain the keyword from the data set. Just grepping will do that for me (plus, it has the side benefit of removing the column header.)
I’m using the -i flag so that it’s case insensitive (“orange”, “Orange”, “ORANGE” and “OrAnGe” will all be accepted).
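A minimal sketch of that kind of filter, not necessarily the exact command used here. The file names and sample lines are made up for illustration, and it assumes each tweet body sits on a single line.

```shell
# Hypothetical sample standing in for the extracted Content column:
# a header line followed by one tweet body per line.
workdir=$(mktemp -d)
printf '%s\n' \
  'Content' \
  'I love Orange juice' \
  '@ryangodier lhfh hahah no not at all ima laugh at you haha' \
  'ORANGE is the new black' \
  > "$workdir/content.txt"

# -i makes the match case-insensitive. Lines without the keyword are dropped,
# and that includes the "Content" header line.
grep -i 'orange' "$workdir/content.txt" > "$workdir/orange_only.txt"

cat "$workdir/orange_only.txt"
```

One caveat: grep works line by line, so a tweet body that contains a line break would be split up by a filter like this.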
So that works nicely. Time to drop it into a loop. In the following code, I’m using a trick I picked up for grabbing the filename part of a file. It’s a bit gratuitous here, if I’m honest — I could have done it with perl or sed just as easily.
I’m saving the data to simple text files; I’ll be loading these into R later.
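The loop might look something like the sketch below. It assumes the filename trick is shell parameter expansion (${f%.csv}), which is only a guess at what was used, and the file names and contents are hypothetical stand-ins for the daily downloads.

```shell
# Hypothetical inputs: two small single-column files standing in for the
# per-day CSV downloads (the real files have many more columns).
workdir=$(mktemp -d)
printf '%s\n' 'Content' 'Orange you glad' 'no keyword here' > "$workdir/day-1.csv"
printf '%s\n' 'Content' 'nothing to see' 'blood ORANGE season' > "$workdir/day-2.csv"

for f in "$workdir"/*.csv; do
  # ${f%.csv} strips the extension, leaving the path plus the filename stem.
  stem=${f%.csv}
  # With the real multi-column files, the tweet body would be selected first
  # (for example with csvcut -c 27 "$f") before filtering.
  grep -i 'orange' "$f" > "$stem.txt"
done

ls "$workdir"
```

Each input file ends up with a matching .txt file alongside it, ready to be read into R.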
Looking at one of the files with less I can see that there are some nasty Unicode characters:
For some reason cat displays them for what they probably are, emoji from iPhone clients:
So I need to clean these off. I’ve used iconv in the past, so I’ll just copy and paste that code into the original loop.
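A hedged sketch of the clean-up step, assuming the goal is simply to drop anything that isn’t plain ASCII. The sample tweet and file names are invented, and the exact iconv options used in the original loop aren’t known.

```shell
workdir=$(mktemp -d)
# Hypothetical tweet containing an emoji (U+1F34A), written as raw UTF-8 bytes.
printf 'orange juice \360\237\215\212 yum\n' > "$workdir/tweets.txt"

# -c asks iconv to drop characters that cannot be represented in the target
# encoding. Some iconv builds still exit non-zero after dropping characters,
# hence the "|| true".
iconv -c -f UTF-8 -t ASCII "$workdir/tweets.txt" > "$workdir/clean.txt" || true

cat "$workdir/clean.txt"
```

Dropping characters outright is blunt: it also removes accented letters and curly quotes, which may or may not matter for the analysis that follows.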
Now all the files are nice and clean, and ready for the next stage. That’s two blog posts and I’ve not begun to do anything interesting. I’ve found (like so many others before me) that it pays to get things in order. In fact I’m probably moving at an unhealthily rushed pace here.
This is the first in a series of planned posts that track how my workflows evolve and develop around a project. They’re a bit edited and idealised (I’ll only include my errors and dead ends if they might be enlightening, for example.)
I’m collecting together a corpus of tweets to run some experiments over the Christmas holidays. I’ve used Sysomos to collect English-language tweets containing the word “orange” day by day over the past week. I don’t know how Sysomos identifies English (and I really should) – but what I’m planning to do is hard enough without having to involve other languages.
Sysomos has 432,880 such tweets in its database, but Twitter’s firehose ToS mean that I can’t download them all; instead, Sysomos lets me pull a random sample of 5,000 per search. So I’ve been through 7 days, and downloaded 5k per day, giving me 35k tweets which should (I hope) be a good place to start my analysis.
The first thing to note is that Sysomos file names give no clue as to their contents, only the day on which they were downloaded (the sequential numbering is added by my OS). This means I have to open each file to see what’s in it. The relevant metadata are on the description line (line 5), and the real data start on line 7.
Now it’s fairly simple to get those data at the command line by looping through each file and extracting only the fifth line:
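One way to sketch that loop is with sed. The file names and line contents below are placeholders; only the layout (description on line 5, data from line 7) comes from the text.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-ins for two downloads: line 5 holds a description,
# line 7 onwards holds the data.
for n in 1 2; do
  printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' \
    "description for file $n" 'line 6' 'header' 'row' > "$workdir/export-$n.csv"
done

for f in "$workdir"/*.csv; do
  # -n suppresses sed's normal output; 5p prints only the fifth line.
  sed -n '5p' "$f"
done > "$workdir/descriptions.txt"

cat "$workdir/descriptions.txt"
```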
From here I should be able to construct new file names. For my purposes, I really just want to grab the first half of the second field and rename the file accordingly so that the dates are first. I could do this all with sed or perl but have come to rely on csvkit — a toolbox maintained by Christopher Groskopf. So I’m being a bit lazy here…
Then use perl to do the last bit, cutting-and-pasting to get the final file names.
Now I’ve got the filenames sorted, I can create new files with the new names (experience tells me that renaming with mv is rarely a good idea in these workflows). I’m using tail -n +7 to trim off the first 6 lines of each file. This gets me into a good position to begin munging and cleaning the data.
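A small sketch of the trimming step on its own, with a made-up download and a placeholder for the new file name (in the workflow above, that name is built from the description line).

```shell
workdir=$(mktemp -d)
# Hypothetical download: six lines of preamble, then the header and the data.
printf '%s\n' 'p1' 'p2' 'p3' 'p4' 'p5' 'p6' 'Content' 'an orange tweet' \
  > "$workdir/download-1.csv"

# Placeholder for the date-first name constructed earlier in the workflow.
new_name="$workdir/renamed-download.csv"

# tail -n +7 starts output at line 7, so lines 1 to 6 are dropped.
# Writing to a new file leaves the original download untouched.
tail -n +7 "$workdir/download-1.csv" > "$new_name"

cat "$new_name"
```

Because the output goes to a new file, a mistake in the name-building step can be fixed by simply re-running the command against the untouched originals.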
So that’s the final line of code that I need to process Sysomos downloads.
This may all look like a lot of work for nothing, particularly as I’m only dealing with a week’s worth of files here. Imagine, though, if I were running this analysis across multiple keywords; or looking at a month’s worth of data.
I’m doing some jiggery-buggery at the moment around the general theme of “Social Listening 2.0”. I think we’re all more or less agreed that the promise of the early Social Listening platforms (“Funded by Homeland Security grants! Now available to marketers!”) hasn’t really been borne out in practice. But there are interesting and exciting things happening in academia and the start-ups and I wanted to get my head around the basics of machine learning so that I can make better decisions about how we approach the problem.
Anyway, I’ve taken as my test project Twitter mentions of the word “Orange.” I’ve made a (fairly educated) guess that this will be a good way to explore and illustrate the problems and to investigate the potential range of solutions. Yesterday, there were over 100K tweets that included the word. Some of these used the word as an adjective, others as a noun.
I’d forgotten about Terry’s Chocolate Oranges. And a quick glance at Google and Twitter trends suggests I’m not the only one. Never having considered it, I hadn’t realised quite how seasonal interest was when it came to certain chocolate products.
Wonder which tracks better to sales figures, Twitter or Google? Anyway — season’s greetings to you all, hope you have happy hols, and see you in the new year.
Kia Motors UK seems to have been putting a lot of money behind sponsored posts recently. Unfortunately, they haven’t really been paying attention to what happens to mobile users. It seems that someone forgot to mention that Page apps don’t work on mobile (yes, there are ways around this). This trifling mobile issue would be fine, were it not for the fact that 39% of their traffic is coming from mobile, according to their own stats (see below).
Kia Motors are running sponsored posts to promote their app on mobile
Mobile users can’t see the app
39% of Kia’s traffic comes from mobile (~5k to date.)
Brands are still far too focussed on building apps inside Facebook, when in reality:
An iOS app can be a Facebook app. A mobile website can be a Facebook app. A console game can be a Facebook app. Your car, your shoes, your credit card or your toothbrush can be Facebook apps.
I’ve droned on about this in the past and several of my conference talks touch on this. Most of my smart colleagues in the industry know about this, Facebook knows, and the various tech vendors know. Our failure to persuade the market to mend their ways continues to irritate me, though.
A few years ago, I was fortunate enough to see Tom Whitwell, Editorial Director at Times Digital, give his “How To Write Awesome Headlines” presentation. Tracing the development of headline writing, he claims that the patterns of web consumption and sharing mean that headline writing has left behind the terrible (by which I mean “fantastic”) puns beloved of sub-editors.
In a world of Twitter, Reddit, news aggregators and curators, Whitwell says, the headline has become separated from the story, putting more pressure on sub-editors to make the headline sell harder.
He notes that:
The difference between a good headline and a weak headline isn’t 5% or 10%, it’s 10x, 20x or more.
…then lists his rules for click-able headlines:
Be specific. Why exactly should I read your story, not that other one?
Tell the whole story in the headline
Don’t try to be clever
Don’t try to be funny
Play to your niche. Don’t over simplify or patronise in the headline
Include lists, quotes, numbers and names
Don’t worry about ‘being boring’
Write the headline first. Really. Always.
Great story which you can’t explain in the headline = crap story
Don’t give it all away in the headline
The next presentation comes from Upworthy (who are a bit like a BuzzFeed with a social conscience). Like BuzzFeed, they are content curators, and like all successful curators, they find content, then
Improve the framing and put it on our site so more people will see it.
What constitutes “improving the framing”? There are some excellent points, but a good third of the presentation is given over to the importance of a good headline. By the very nature of what they’re doing (curating and re-framing stories, rather than creating them) they can’t, as Whitwell demands, write the headline first. Instead their practice is to “write 25 headlines for each story” before selecting the best. It’s a compelling presentation, and it stands out for me because their first and last rules directly contradict Whitwell’s rules.
Don’t give it all away in the headline.
Also, don’t give it all away in the excerpt, share image, or share text.
Don’t be shrill.
Don’t form an opinion for the end user. Let them do that.
Don’t bum people out.
Don’t sexualize your headlines in a way your mom wouldn’t approve.
And don’t over-think it. Some of your headlines will suck. Accept it and keep writing.
Which reminds me, my mom doesn’t like it when you put the word “sucks” in headlines.
Lastly, be clever. But not TOO clever.
I’ve never been good at headline writing, but as I come to understand the relationship between content, social and SEO, I’m getting a clearer idea of which skills we need to hire and develop in our organisations.
I’m (finally) thinking more about content marketing. It’s taken me a while to get here, and I’m still not wholly sure what the triggers have been. I’d like to believe, though, that it’s a combination of a few things:
I’m beginning to hang around SEO people again. The SEO types I meet are smart, and they seem to be getting excited about the whole content thing. Clearly there’s something to see here.
A chat with Neil Perkin. He’s also a properly smart chap, and his publishing background gives him real insight into the area.
A growing sense that there are better data about this market out there, and better tools to handle those data.
Each brief will bring its own challenges of course; but the following seem to me to be some of the more common challenges. I think that they can all be addressed and overcome, but it’s worth being aware of them.
Confusion: As so often, a common marketing buzzword conceals a multitude of meanings. One person’s content marketing might be blogging- or curation-led. Another might be focused on encouraging social review content. Yet another’s might be SEO-led. Or blogger-outreach led. Or sponsorship-led. Or they might be heavily invested in making advergames or video content, and looking for new ways to dress up an old dog. Everyone brings their own expectations and biases; and the common "Content Marketing" terminology doesn’t help make those clear. Like ‘engagement’, and ‘social’ before it, it seems that ‘content’ is already well on its way to being a meaningless marketing term.
I suspect that it probably helps to see these things in terms of what we’re trying to achieve, the problems we’re trying to solve. For example, which of these would you prioritise?
Assist in-bound marketing
Provide incentive for data-capture
Give marketing teams and client teams opportunities & reasons to email clients, tweet, post stuff to their LinkedIn profiles…
Improve SEO metrics
Deliver in-bound links through distributed content
Increase advocacy through shareable content
Become the de facto charts used in 3rd party presentations (Kleiner Perkins’ Mary Meeker’s mobile charts end up everywhere, as do eMarketer’s charts)
Compliance: At least a portion of a strong content strategy will rely on news hooking, opportunism & fast response. Yet any half-way-decent in-house legal department will place restrictions on what we can say, and how fast we can say it. Publishers’ workflows are set up to speed the publication process; but that’s rarely true of brands. Brands’ IT services and infrastructure probably won’t help either.
Content calendaring should help a bit; but we’re going to need to thrash things out with Legal and I.T. somewhere along the line; and we’d better have some good answers for their tricky questions.
Competition: There are many other publishers competing for the audience’s finite appetite for content, information and entertainment. Some of them may be our traditional media partners — who, it should be pointed out — aren’t always having the easiest time monetizing their own content strategies. Of course brands want to become publishers, but they’re going to have to recognise that it’s already a crowded market, and the publishers have a lot of experience and talent.
Context: What works offline isn’t always what works online. While the people may be the same, the context has changed; the audience’s attention may be divided, shorter, focused on different goals and activities. For example, long-form content generally falls foul of the tl;dr problem ("too long; didn’t read"). Everything I just said goes double for mobile. Spooging traditional content into a mobile context is more or less doomed to fail.
The audience is also a kind of context: certain kinds of content theme have repeatedly been proven to work well online. We need to understand how to spin our content for different kinds of audience, and to learn to love the niche. This isn’t always easy for marketers who are used to thinking in big, broad demographic terms. And it may be hard for more conservative clients to step outside the television mass market and into the internet mass market.
Cost: There will be finite internal resource, and (given the way this works) we’ll probably want to produce a lot of relevant content, fairly regularly. We need to find ways to get more bang for our content buck. We’ll need to think in terms of re-using, translating and recycling content (a single piece of research becomes 10 blog posts, 2 guest blog posts, a SlideShare presentation, 5 infographics, 2 podcast interviews and one video.)
Every so often we find ourselves negotiating the digital side of media partnerships and sponsorships as part of a larger deal. What is it that we want out of them? My rough thoughts and notes are below: I’d be grateful if you’d add your own ideas in the comments.
Requirement: Guaranteed performance on mobile devices. This means not only that all content must be visible on mobile devices, but that a smooth mobile U/X must be provided (e.g. no tiny buttons, no graphical text, navigation elements placed so that there’s no danger of accidental mis-keying).
Note: This is just a straightforward requirement these days, and we shouldn’t have to negotiate on this point at all. It’s not just about ruling out Flash, it’s about ensuring that audiences have as good a mobile experience as possible. For more on this, Luke Wroblewski’s "Mobile First" presentation and book are essential reading.
Requirement: The index page of the content partnership must be linked from the appropriate section heading page; and both the section heading page and the link to the Digital Content Hub must be search engine crawl-able (i.e. no robots.txt exclusions, no nofollow links). The section heading page itself must be accessible through the global navigation.
Note: This is more about making sure that the content can be crawled by a search engine. We’re not asking for the link to the content partnership to be above the fold, just that it be readily crawl-able.