Orange Project Step 1 – Data Collection

This is the first in a series of planned posts tracking how my workflows evolve around a project. They’re a bit edited and idealised (I’ll only include my errors and dead ends where they might be enlightening, for example).

I’m putting together a corpus of tweets to run some experiments over the Christmas holidays. I’ve used Sysomos to collect English-language tweets containing the word “orange” day by day over the past week. I don’t know how Sysomos identifies English (and I really should) – but what I’m planning to do is hard enough without having to involve other languages.

Sysomos has 432,880 such tweets in its database, but Twitter’s firehose ToS mean that I can’t download them all; instead, Sysomos lets me pull a random sample of 5,000 per search. So I’ve been through seven days, downloading 5k per day, giving me 35k tweets, which should (I hope) be a good place to start my analysis.

Sysomos's downloaded files don't have useful filenames

The first thing to note is that Sysomos file names give no clue as to their contents, only the day on which they were downloaded (the sequential numbering is added by my OS). This means I have to open each file to see what it contains. The relevant metadata are on the description line (line 5), and the real data start on line 7.

Useful metadata are on the description line (line 5), and the real data start on line 7

Now it’s fairly simple to get those data at the command line by looping through each file and extracting only the fifth line:

Extract the description line from each file
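A minimal sketch of that loop. The file name and the contents of the description line are stand-ins here, not the real Sysomos format:

```shell
# A stand-in for one Sysomos download: six lines of metadata, with
# the description on line 5 and the real data starting at line 7.
# (The field contents are invented for illustration.)
printf 'meta\nmeta\nmeta\nmeta\nQuery,2014-12-15 orange,EN\nmeta\nid,author,tweet\n' > demo-1.csv

# Loop over the downloads and print only the fifth line of each.
for f in demo-*.csv; do
  sed -n '5p' "$f"
done
# prints: Query,2014-12-15 orange,EN
```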

From here I should be able to construct new file names. For my purposes, I really just want to grab the first half of the second field and rename the file accordingly so that the dates are first. I could do this all with sed or perl but have come to rely on csvkit — a toolbox maintained by Christopher Groskopf. So I’m being a bit lazy here…

Extract only the second column of the description line

Then I use perl to do the last bit, cutting and pasting the pieces to get the final file names.

Create new filename

Now I’ve got the filenames sorted, I can create new files with the new names (experience tells me that renaming in place with mv is rarely a good idea in these workflows). I’m using tail -n +7 to trim the first six lines off each file. This gets me into a good position to begin munging and cleaning the data.

Create new files with the new names
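In essence (the download’s contents and the date-first name are stand-ins):

```shell
# Stand-in download: six metadata lines, then two data rows.
printf 'm1\nm2\nm3\nm4\nm5\nm6\nrow1\nrow2\n' > download-1.csv

# Write a fresh copy under the date-first name, keeping only the
# real data (line 7 onwards); the original download stays intact.
tail -n +7 download-1.csv > 2014-12-15-orange.csv

cat 2014-12-15-orange.csv
# prints:
# row1
# row2
```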

So that’s the final line of code that I need to process Sysomos downloads.

This may all look like a lot of work for nothing, particularly as I’m only dealing with a week’s worth of files here. Imagine, though, if I were running this analysis across multiple keywords; or looking at a month’s worth of data.


Please tell me what you think.