This is the second in a series of posts tracing the evolution of a project. In the first post I downloaded 35k English-language tweets from Sysomos containing the keyword “orange”. Here’s a quick glance at what the data look like:
A lot of the really interesting data is off screen. I could, of course, load the CSV into Excel and work on it there, but I find that it's faster to bypass that wherever I can. Not saying it doesn't work, mind you, only that Excel can introduce all sorts of new problems. Instead, I'll use the csvcut tool from the csvkit package to peer at the column headings (you should be able to click all the images in this post to embiggen them). Calling csvcut with the -n flag tells it that I just want to see a numbered list of the column headings:
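To make the step concrete, here's a sketch with a tiny made-up export (the file name and columns are hypothetical; the real file has dozens of columns):

```shell
# Fabricate a tiny stand-in for the Sysomos export (hypothetical columns).
printf 'Host,Date,Content\nTwitter,2012-12-17,an orange tweet\n' > tweets.csv

# With csvkit installed this is simply:  csvcut -n tweets.csv
# The same numbered-headings view with plain coreutils:
head -n 1 tweets.csv | tr ',' '\n' | cat -n
```

Each heading comes out on its own numbered line, which makes it easy to find the index of the column you're after.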
I downloaded an awful lot of data about each tweet that I don't really need. In fact, for my purposes, all I really want is the body of the tweet, as contained in the Content column (number 27).
A first glance at the data
Now I can pipe the selected column into the csvlook tool to look at the data (the -c 27 tells csvcut to select the 27th column):
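A sketch of the extraction, using a made-up three-column file so the last column stands in for column 27 (plain cut breaks on quoted commas, which is exactly why csvcut is the better tool for the real file):

```shell
# Hypothetical miniature of the export; the real one has 27+ columns.
printf 'Host,Date,Content\nTwitter,2012-12-17,an orange tweet\n' > tweets.csv

# With csvkit this would be:  csvcut -c 27 tweets.csv | csvlook
# A coreutils stand-in, safe only because this sample has no quoted commas:
cut -d',' -f3 tweets.csv
```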
Almost immediately I spot a problem; there’s a tweet here with no mention of the word ‘orange’ (highlighted):
Sysomos’s search (while fast) isn’t particularly nuanced – after all, it’s a general-purpose platform that has to work with data from multiple sources, including blogs, forums, news sites, Facebook and Twitter. So my search didn’t restrict itself to tweets containing the keyword ‘orange’ in the body; it looked for the term across the whole record. Because this tweet came from an account whose username contained the word, it was pulled into the pot along with everything else:
@ryangodier lhfh hahah no not at all ima laugh at you haha
— MEME ©®™ (@luh_orange) December 17, 2012
So I’m presented with at least one problem: I need to remove tweets that don’t contain the keyword from the data set. Just grepping will do that for me (plus, it has the side benefit of removing the column header). I’m using the -i flag so that the match is case-insensitive (“orange”, “Orange”, “ORANGE” and “OrAnGe” will all be accepted).
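A sketch of the filter, on a made-up extract of the Content column (note that the header line goes too, since “Content” doesn’t contain the keyword):

```shell
# Hypothetical sample of the extracted Content column, header included.
cat > content.txt <<'EOF'
Content
I love Orange juice
@ryangodier lhfh hahah no not at all ima laugh at you haha
ORANGE you glad
EOF

# -i makes the match case-insensitive; lines without the keyword
# (including the header and the stray tweet) are dropped.
grep -i 'orange' content.txt
```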
So that works nicely. Time to drop it into a loop. In the following code, I’m using a trick I picked up for grabbing the filename part of a file. It’s a bit gratuitous here, if I’m honest; I could have done it with sed just as easily.
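The loop looks roughly like this (the file names are invented, the filename-grabbing trick shown here is plain shell parameter expansion, and the real version would pipe through csvcut -c 27 before the grep):

```shell
# Hypothetical input files standing in for the Sysomos exports.
mkdir -p csv out
printf 'Content\nfresh orange juice\nno keyword here\n' > csv/batch1.csv
printf 'Content\nOrange is a colour\n' > csv/batch2.csv

for f in csv/*.csv; do
    name=${f##*/}      # strip the directory part, leaving batch1.csv
    name=${name%.csv}  # strip the extension, leaving batch1
    grep -i 'orange' "$f" > "out/$name.txt"
done
```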
I’m saving the data to simple text files; I’ll be loading these into
Looking at one of the files with less, I can see that there are some nasty Unicode characters:
For some reason, cat displays them for what they probably are, emoji from iPhone clients:
So I need to clean these off. I’ve used iconv in the past, so I’ll just copy and paste that code into the original loop.
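I won't reconstruct that earlier code exactly, but the core of an iconv clean-up looks like this (-c tells iconv to silently drop anything that can't be represented in the target encoding, which takes the emoji with it):

```shell
# A line with a four-byte emoji (U+1F34A, a tangerine) mixed into the text.
printf 'I love orange juice \360\237\215\212 so much\n' > dirty.txt

# -c drops characters that can't be converted to the target encoding.
iconv -c -f UTF-8 -t ASCII dirty.txt > clean.txt
cat clean.txt   # the emoji is gone; only plain ASCII remains
```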
Now all the files are nice and clean, and ready for the next stage. That’s two blog posts and I’ve not begun to do anything interesting. I’ve found (like so many others before me) that it pays to get things in order. In fact I’m probably moving at an unhealthily cautious pace here.