Orange Project Step 3 – Tokenise and Stopword Removal

Which words are most often associated with the term “orange” in tweets? How might I improve my search construction so that I can focus in on only those mentions that are relevant to my interests, either by inclusion or exclusion? How might I use social signals to improve my keyword planning for SEO or PPC?

Keyword analysis is an interesting, useful, productive practice, but too often we’re at the mercy of third party tools and black box processes. Part of this series of experiments is to help me understand those processes better.

Orange Project Step 2 – Munging & Cleaning

This is the second in a series of posts tracing the evolution of a project. In the first post I downloaded 35k English-language tweets from Sysomos containing the keyword “orange”. Here’s a quick glance at what the data look like:

Sysomos data

A lot of the really interesting data is off screen. I could, of course, load the CSV into Excel and work on it there, but I find that it’s faster to bypass that wherever I can. Not saying it doesn’t work, mind you, only that Excel can introduce all sorts of new problems. Instead, I’ll use the csvcut tool from the csvkit package to peer at the column headings. (You should be able to click all the images in this post to embiggen them.) Calling csvcut -n tells it that I just want to see a numbered list of the column headings:

using csvcut to look at columns

I downloaded an awful lot of data about each tweet that I don’t really need. In fact, for my purposes, all I really want is the body of the tweet, as contained in the Content column (number 27).

A first glance at the data

I’m using csvkit’s csvlook tool to look at the data (the -c 27 tells csvcut to select the 27th column):

Screen Shot 2012-12-24 at 20.38.00

Almost immediately I spot a problem; there’s a tweet here with no mention of the word ‘orange’ (highlighted):

no mention of keyword

Sysomos’s search (while fast) isn’t particularly nuanced – after all, it’s a general-purpose platform that has to work with data from multiple sources, including blogs, forums, news sites, Facebook and Twitter. So my search didn’t pull up only tweets containing the keyword ‘orange’, but rather looked for it across the whole record. Because this tweet came from an account whose username contained the word, it was pulled into the pot along with everything else.

So I’m presented with at least one problem: I need to remove tweets that don’t contain the keyword from the data set. Just grepping will do that for me (plus, it has the side benefit of removing the column header.)

I’m using the -i flag so that it’s case insensitive (“orange”, “Orange”, “ORANGE” and “OrAnGe” will all be accepted).

grep -i
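The screenshot of that command isn’t reproduced here, so what follows is only a minimal sketch of the filtering step, not the exact command from the post. It assumes the Content column has already been extracted as plain text, one tweet per line; the sample lines and the `sample` and `matches` names are made up for illustration.

```shell
# Made-up sample standing in for the extracted Content column (header first).
sample=$(mktemp)
printf '%s\n' \
  'Content' \
  'I love Orange juice' \
  'nothing to see here' \
  'ORANGE is the new black' \
  'OrAnGe you glad I asked?' > "$sample"

# -i ignores case; the header line and the tweet without the keyword are dropped.
matches=$(grep -i 'orange' "$sample")
printf '%s\n' "$matches"
```

Because the header line doesn’t contain the keyword, it disappears along with the irrelevant tweet, which is the side benefit mentioned above.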

So that works nicely. Time to drop it into a loop. In the following code, I’m using a trick I picked up for grabbing the filename part of a file. It’s a bit gratuitous here, if I’m honest — I could have done it with perl or sed just as easily.

I’m saving the data to simple text files; I’ll be loading these into R later.
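The loop itself was shown as an image that isn’t reproduced here. As a rough sketch only, and assuming the filename trick is shell parameter expansion (the post doesn’t say which trick it was), the loop might look something like this. The directory, file names and sample tweets are hypothetical, and the input files are assumed to hold one tweet per line already.

```shell
# Made-up one-tweet-per-line files standing in for the extracted Content column.
workdir=$(mktemp -d)
printf '%s\n' 'Orange juice for breakfast' 'no keyword here' > "$workdir/2012-12-17.csv"
printf '%s\n' 'ORANGE sky tonight' 'still nothing' > "$workdir/2012-12-18.csv"

for f in "$workdir"/*.csv; do
  base=${f##*/}     # drop the directory part, leaving e.g. 2012-12-17.csv
  stem=${base%.*}   # drop the extension, leaving e.g. 2012-12-17
  # grep exits non-zero when nothing matches; "|| true" stops that from
  # aborting a script that runs with "set -e".
  grep -i 'orange' "$f" > "$workdir/$stem.txt" || true
done

ls "$workdir"
```

Writing to a new .txt file per input keeps the original downloads untouched, so the step can be re-run safely.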

Looking at one of the files with less I can see that there are some nasty unicode characters:

unicode characters

For some reason cat displays them for what they probably are, emoji from iPhone clients:

emoji

So I need to clean these off. I’ve used iconv in the past, so I’ll just copy and paste that code into the original loop.
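The iconv code isn’t shown in the post, so this is a hedged sketch of one common way to do it, not necessarily the invocation used here. It assumes the input is UTF-8 and that plain ASCII is an acceptable target; the sample tweet and variable names are made up.

```shell
# A made-up tweet containing an emoji
# (the octal escapes are the UTF-8 bytes of U+1F34A).
tweet=$(printf 'fresh orange \360\237\215\212 juice')

# -c asks iconv to discard characters it cannot represent in the target encoding.
# Some iconv builds exit non-zero after discarding characters, hence "|| true".
clean=$(printf '%s\n' "$tweet" | iconv -c -f UTF-8 -t ASCII || true)
printf '%s\n' "$clean"
```

Inside the loop, the same iconv call would sit in the pipeline just before the output is written to the text file.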

Now all the files are nice and clean, and ready for the next stage. That’s two blog posts and I’ve not begun to do anything interesting. I’ve found (like so many others before me) that it pays to get things in order. In fact I’m probably moving at an unhealthily rushed pace here.

Orange Project Step 1 – Data Collection

This is the first in a series of planned posts that track how my workflows evolve and develop around a project. They’re a bit edited and idealised (I’ll only include my errors and dead ends if they might be enlightening, for example.)

I’m collecting together a corpus of tweets to run some experiments over the Christmas holidays. I’ve used Sysomos to collect English-language tweets containing the word “orange” day by day over the past week. I don’t know how Sysomos identifies English (and I really should) – but what I’m planning to do is hard enough without having to involve other languages.

Sysomos has 432,880 such tweets in its database, but Twitter’s firehose ToS mean that I can’t download them all; instead, Sysomos lets me pull a random sample of 5,000 per search. So I’ve been through 7 days, and downloaded 5k per day, giving me 35k tweets which should (I hope) be a good place to start my analysis.

Sysomos's downloaded files don't have useful filenames

The first thing to note is that Sysomos file names give no clue as to their contents, only the day on which they were downloaded (the sequential numbering is added by my OS). This means I have to open each file to see what’s in it. The relevant metadata are on the description line (line 5), and the real data start on line 7.

Useful metadata are on the description line (line 5), and the real data start on line 7

Now it’s fairly simple to get those data at the command line by looping through each file and extracting only the fifth line:

Extract the description line from each file
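The command was shown as a screenshot that isn’t reproduced here. A minimal sketch of the idea, using sed to print only line 5 of each file, could look like this. The directory, file names and file contents are invented to mimic the layout described above, and the original may well have used a different tool.

```shell
# Made-up files mimicking the layout described above:
# description on line 5, real data from line 7.
dl=$(mktemp -d)
for n in 1 2; do
  printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' \
    "description for file $n" 'line 6' 'first data row' > "$dl/download-$n.csv"
done

# sed -n suppresses normal output; '5p' prints only the fifth line.
descriptions=$(for f in "$dl"/*.csv; do sed -n '5p' "$f"; done)
printf '%s\n' "$descriptions"
```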

From here I should be able to construct new file names. For my purposes, I really just want to grab the first half of the second field and rename the file accordingly so that the dates are first. I could do this all with sed or perl but have come to rely on csvkit — a toolbox maintained by Christopher Groskopf. So I’m being a bit lazy here…

Extract only the second column of the description line

Then use perl to do the last bit, cutting-and-pasting to get the final file names.

Create new filename

Now I’ve got the filenames sorted, I can create new files with the new names (experience tells me that renaming with mv is rarely a good idea in these workflows). I’m using tail -n +7 to trim off the first 6 lines of each file. This gets me into a good position to begin munging and cleaning the data.

Create new files with the new names
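Since the screenshot is missing, here is a small sketch of the trimming step only. The source file, its contents and the new file name are all hypothetical; in the real workflow the new name would be built from the description line as described above.

```shell
# Made-up download: six lines of preamble, then the data.
src=$(mktemp -d)
printf '%s\n' 'p1' 'p2' 'p3' 'p4' 'description' 'p6' \
  'header row' 'tweet row' > "$src/download-1.csv"

# Hypothetical new name with the date first.
newname="$src/2012-12-17-orange.csv"

# tail -n +7 starts output at line 7; writing to a new file
# leaves the original download untouched.
tail -n +7 "$src/download-1.csv" > "$newname"
cat "$newname"
```

Keeping the original file around, instead of renaming it with mv, means a mistake in the new name or the trim can be fixed by simply running the step again.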

So that’s the final line of code that I need to process Sysomos downloads.

This may all look like a lot of work for nothing, particularly as I’m only dealing with a week’s worth of files here. Imagine, though, if I were running this analysis across multiple keywords; or looking at a month’s worth of data.

Seasonal Chocolate

I’m doing some jiggery-buggery at the moment around the general theme of “Social Listening 2.0”. I think we’re all more or less agreed that the promise of the early Social Listening platforms (“Funded by Homeland Security grants! Now available to marketers!”) hasn’t really been borne out in practice. But there are interesting and exciting things happening in academia and the start-ups and I wanted to get my head around the basics of machine learning so that I can make better decisions about how we approach the problem.

Anyway, I’ve taken as my test project Twitter mentions of the word “Orange”. I’ve made a (fairly educated) guess that this will be a good way to explore and illustrate the problems and to investigate the potential range of solutions. Yesterday, there were over 100K tweets that included the word. Some of these used the word as an adjective, others as a noun.

orange

Or even:

terrys orange

I’d forgotten about Terry’s Chocolate Oranges. And a quick glance at Google and Twitter trends suggests I’m not the only one. Never having considered it, I hadn’t realised quite how seasonal interest was when it came to certain chocolate products.

Search Seasonality in Chocolate

Tweets mentioning "Terry's Chocolate Orange"

Wonder which tracks better to sales figures, Twitter or Google? Anyway — season’s greetings to you all, hope you have happy hols, and see you in the new year.

Facebook gets mobile. Advertisers fail to follow

Kia Motors UK seems to have been putting a lot of money behind sponsored posts recently. Unfortunately, they haven’t really been paying attention to what happens to mobile users. It seems that someone forgot to mention that Page apps don’t work on mobile (yes, there are ways around this). This trifling mobile issue would be fine, were it not for the fact that 39% of their traffic is coming from mobile, according to their own stats (see below).


Brands are still far too focussed on building apps inside Facebook, when in reality:

An iOS app can be a Facebook app. A mobile website can be a Facebook app. A console game can be a Facebook app. Your car, your shoes, your credit card or your toothbrush can be Facebook apps.

I’ve droned on about this in the past and several of my conference talks touch on this. Most of my smart colleagues in the industry know about this, Facebook knows, and the various tech vendors know. Our failure to persuade the market to mend their ways continues to irritate me, though.

How important are awesome headlines?

In early February 2011, a YouTube user posted a video with the title, “Zach Wahls Speaks About Family”. Almost ten months later, the video was reposted on the progressive campaigning site MoveOn with a new title, “Two Lesbians Raised A Baby And This Is What They Got”. Here’s what happened to the views:

Screen grab of video statistics from http://www.youtube.com/watch?v=yMLZO-sObzQ

MoveOn.org re-titled a YouTube video, massively increasing its distribution

It’s not a straightforward correlation — after all, MoveOn.org commonly receives ~1.5m monthly UVs, so the additional exposure must have helped a bit. But the video had been posted on Reddit back in February 2011 with the uninspiring-if-informative title “Zach Wahls, a 19-year-old University of Iowa student spoke about the strength of his family during a public forum on House Joint Resolution 6 in Iowa”, so I think it’s fair to assume that the title played a big part.

Headlines have become separated from stories

A few years ago, I was fortunate enough to see Tom Whitwell, Editorial Director at Times Digital, give his “How To Write Awesome Headlines” presentation. Tracing the development of headline writing, he claims that the patterns of web consumption and sharing mean that headline writing has left behind the terrible (by which I mean “fantastic”) puns beloved of sub-editors.

The Sun headline so awesome that they ran it twice (/ht Who Was the Super Caley Sub?, Guardian)

In a world of Twitter, Reddit, news aggregators and curators, Whitwell says, the headline has become separated from the story; putting more pressure on sub-editors to make the headline sell harder.

He notes that:

The difference between a good headline and a weak headline isn’t 5% or 10%, it’s 10x, 20x or more.

…then lists his rules for click-able headlines:

  1. Be specific. Why exactly should I read your story, not that other one?
  2. Tell the whole story in the headline
  3. Don’t try to be clever
  4. Don’t try to be funny
  5. Play to your niche. Don’t over simplify or patronise in the headline
  6. Include lists, quotes, numbers and names
  7. Don’t worry about ‘being boring’
  8. Write the headline first. Really. Always.
  9. Great story which you can’t explain in the headline = crap story

Don’t give it all away in the headline

The next presentation comes from Upworthy (who are a bit like a BuzzFeed with a social conscience). Like BuzzFeed, they are content curators, and like all successful curators, they find content, then

Improve the framing and put it on our site so more people will see it.

What constitutes “improving the framing”? There are some excellent points, but a good third of the presentation is given over to the importance of a good headline. By the very nature of what they’re doing (curating and re-framing stories, rather than creating them) they can’t, as Whitwell demands, write the headline first. Instead their practice is to “write 25 headlines for each story” before selecting the best. It’s a compelling presentation, and it stands out for me because their first and last rules directly contradict Whitwell’s second and third rules.

  1. Don’t give it all away in the headline.
  2. Also, don’t give it all away in the excerpt, share image, or share text.
  3. Don’t be shrill.
  4. Don’t form an opinion for the end user. Let them do that.
  5. Don’t bum people out.
  6. Don’t sexualize your headlines in a way your mom wouldn’t approve.
  7. And don’t over-think it. Some of your headlines will suck. Accept it and keep writing.
  8. Which reminds me, my mom doesn’t like it when you put the word “sucks” in headlines.
  9. Lastly, be clever. But not TOO clever.

I’ve never been good at headline writing, but as I come to understand the relationship between content, social and SEO, I’m getting a clearer idea of the skills we need to hire and develop in our organisations.

The Challenges of Content Marketing

I’m (finally) thinking more about content marketing. It’s taken me a while to get here, and I’m still not wholly sure what the triggers have been. I’d like to believe, though, that it’s a combination of a few things:

  1. I’m beginning to hang around SEO people again. The SEO types I meet are smart, and they seem to be getting excited about the whole content thing. Clearly there’s something to see here.
  2. A chat with Neil Perkin. He’s also a properly smart chap, and his publishing background gives him real insight into the area.
  3. A growing sense that there are better data about this market out there, and better tools to handle those data.

CHALLENGES

Each brief will bring its own challenges, of course, but the following seem to me to be some of the more common ones. I think that they can all be addressed and overcome, but it’s worth being aware of them.

Confusion: As so often, a common marketing buzzword conceals a multitude of meanings. One person’s content marketing might be blogging- or curation-led. Another might be focused on encouraging social review content. Yet another’s might be SEO-led. Or blogger-outreach led. Or sponsorship-led. Or they might be heavily invested in making advergames or video content, and looking for new ways to dress up an old dog. Everyone brings their own expectations and biases; and the common "Content Marketing" terminology doesn’t help make those clear. Like ‘engagement’, and ‘social’ before it, it seems that ‘content’ is already well on its way to being a meaningless marketing term.

I suspect that it probably helps to see these things in terms of what we’re trying to achieve, the problems we’re trying to solve. For example, which of these would you prioritise?

  • Assist in-bound marketing
  • Provide incentive for data-capture
  • Give marketing teams and client teams opportunities & reasons to email clients, tweet, post stuff to their LinkedIn profiles…
  • Improve SEO metrics
  • Deliver in-bound links through distributed content
  • Increase advocacy through shareable content
  • Become the de facto charts used in 3rd party presentations (Kleiner Perkins’ Mary Meeker’s mobile charts end up everywhere, as do eMarketer’s charts)

Compliance: At least a portion of a strong content strategy will rely on news hooking, opportunism & fast response. Yet any half-way-decent in-house legal department will place restrictions on what we can say, and how fast we can say it. Publishers’ workflows are set up to speed the publication process; but that’s rarely true of brands. Brands’ IT services and infrastructure probably won’t help either.

Content calendaring should help a bit; but we’re going to need to thrash things out with Legal and I.T. somewhere along the line; and we’d better have some good answers for their tricky questions.

Competition: There are many other publishers competing for the audience’s finite appetite for content, information and entertainment. Some of them may be our traditional media partners, who (it should be pointed out) aren’t always having the easiest time monetizing their own content strategies. Of course brands want to become publishers, but they’re going to have to recognise that it’s already a crowded market, and the publishers have a lot of experience and talent.

Context: What works offline isn’t always what works online. While the people may be the same, the context has changed; the audience’s attention may be divided, shorter, focused on different goals and activities. For example, long-form content generally falls foul of the tl;dr problem ("too long; didn’t read"). Everything I just said goes double for mobile. Spooging traditional content into a mobile context is more or less doomed to fail.

The audience is also a kind of context: certain kinds of content theme have repeatedly been proven to work well online. We need to understand how to spin our content for different kinds of audience, and to learn to love the niche. This isn’t always easy for marketers who are used to thinking in big, broad demographic terms. And it may be hard for more conservative clients to step outside the television mass market and into the internet mass market.

Cost: There will be finite internal resource, and (given the way this works) we’ll probably want to produce a lot of relevant content, fairly regularly. We need to find ways to get more bang for our content buck. We’ll need to think in terms of re-using, translating and recycling content (a single piece of research becomes 10 blog posts, 2 guest blog posts, a SlideShare presentation, 5 infographics, 2 podcast interviews and one video.)

Do these ring true? What have I missed?

What might a social media planner want from a media content partnership?

Every so often we find ourselves negotiating the digital side of media partnerships and sponsorships as part of a larger deal. What is it that we want out of them? My rough thoughts and notes are below: I’d be grateful if you’d add your own ideas in the comments.

Mobile first

Requirement: Guaranteed performance on mobile devices. This means not only that all content must be visible on mobile devices, but that a smooth mobile U/X must be provided (e.g. no tiny buttons, no graphical text, navigation elements placed so that there’s no danger of accidental mis-keying).

Note: This is just a straightforward requirement these days, and we shouldn’t have to negotiate on this point at all. It’s not just about ruling out Flash, it’s about ensuring that audiences have as good a mobile experience as possible. For more on this, Luke Wroblewski’s "Mobile First" presentation and book are essential reading.

Search engines

Requirement: The index page of the content partnership must be linked from the appropriate section heading page; and both the section heading page and the link to the Digital Content Hub must be search engine crawl-able (i.e. no robots.txt exclusions, no nofollow links). The section heading page itself must be accessible through the global navigation.

Note: This is more about making sure that the content can be crawled by a search engine. We’re not asking for the link to the content partnership to be above the fold, just that it be readily crawl-able.

Decomposing Time Series Data with R

In an earlier post I started looking at how I might use R to forecast Google search volume.

Now I find the useful `decompose` function, which decomposes a time series into seasonal, trend and irregular components using moving averages.

Which produces the following:

Search trend for “cold remedies” – showing the original time series and beneath that, the decomposed trend, seasonal fluctuation and noise.