Orange Project Step 2 – Munging & Cleaning

This is the second in a series of posts tracing the evolution of a project. In the first post I downloaded 35k English-language tweets from Sysomos containing the keyword “orange”. Here’s a quick glance at what the data look like:

Sysomos data

A lot of the really interesting data is off screen. I could, of course, load the CSV into Excel and work on it there, but I find it's faster to bypass that wherever I can. Not saying it doesn't work, mind you, only that Excel can introduce all sorts of new problems. Instead, I'll use the csvcut tool from the csvkit package to peer at the column headings (you should be able to click all the images in this post to embiggen them). Calling csvcut -n tells it that I just want to see a numbered list of the column headings:

using csvcut to look at columns

I downloaded an awful lot of data about each tweet that I don’t really need. In fact, for my purposes, all I really want is the body of the tweet, as contained in the Content column (number 27)

A first glance at the data

I’m using csvkit’s csvlook tool to look at the data (the -c 27 tells csvcut to select the 27th column):

csvlook output
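The pipeline looks something like this. I'm using a small stand-in file here, so the Content column is number 4 rather than 27; against the full export it would be -c 27:

```shell
# Tiny stand-in for the export; in the real Sysomos file, Content is column 27.
printf 'Host,Link,Date,Content\nTwitter,http://t.co/x,2012-12-20,An orange a day\n' > sample.csv

# Select just the Content column and render it as a table with csvlook
csvcut -c 4 sample.csv | csvlook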

Almost immediately I spot a problem; there’s a tweet here with no mention of the word ‘orange’ (highlighted):

no mention of keyword

Sysomos’s search (while fast) isn’t particularly nuanced – after all, it’s a general-purpose platform that has to work with data from multiple sources, including blogs, forums, news sites, Facebook and Twitter. So my search didn’t pull up tweets containing the keyword ‘orange’, but rather looked for it across the whole record. Because this tweet came from an account whose username contained the word, it was pulled into the pot along with everything else.

So I’m presented with at least one problem: I need to remove tweets that don’t contain the keyword from the data set. Just grepping will do that for me (plus, it has the side benefit of removing the column header).

I’m using the -i flag so that it’s case insensitive (“orange”, “Orange”, “ORANGE” and “OrAnGe” will all be accepted).

grep -i
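A minimal demonstration of the filter, using a few made-up tweets (the middle one stands in for the row that only matched via its username):

```shell
# Three rows plus a header; the middle tweet has no 'orange' in its text.
printf 'Content\nLoving this orange juice\nno keyword here at all\nORANGE is the new black\n' > tweets.csv

# -i matches regardless of case; rows without the keyword (and the
# 'Content' header line itself) are dropped from the output.
grep -i orange tweets.csv
```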

So that works nicely. Time to drop it into a loop. In the following code, I’m using a trick I picked up for grabbing the filename part of a path (stripping the directory and extension with shell parameter expansion). It’s a bit gratuitous here, if I’m honest; I could have done it with perl or sed just as easily.

I’m saving the data to simple text files; I’ll be loading these into R later.

Looking at one of the files with less I can see that there are some nasty unicode characters:

unicode characters

For some reason cat displays them for what they probably are, emoji from iPhone clients:


So I need to clean these off. I’ve used iconv in the past, so I’ll just copy and paste that code into the original loop.
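The iconv incantation is along these lines; the -c flag tells iconv to silently drop any character that can't be represented in the target encoding, which is what sweeps away the emoji:

```shell
# A line of text with a stray non-ASCII character (a ☀) in the middle
printf 'orange \342\230\200 sunshine\n' > dirty.txt

# -c drops unconvertible characters rather than aborting, so transcoding
# UTF-8 to plain ASCII strips the emoji and similar unicode debris
iconv -c -f utf-8 -t ascii dirty.txt > clean.txt
cat clean.txt
```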

Now all the files are nice and clean, and ready for the next stage. That’s two blog posts and I’ve not yet begun to do anything interesting. But I’ve found (like so many others before me) that it pays to get things in order first. If anything, I’m probably moving at an unhealthily rushed pace here.


Please tell me what you think.