This is the first in a series of planned posts tracking how my workflows evolve around a project. They’re somewhat edited and idealised (I’ll only include my errors and dead ends if they might be enlightening, for example).
I’m collecting a corpus of tweets to run some experiments over the Christmas holidays. I’ve used Sysomos to collect English-language tweets containing the word “orange” day by day over the past week. I don’t know how Sysomos identifies English (and I really should), but what I’m planning to do is hard enough without having to involve other languages.
Sysomos has 432,880 such tweets in its database, but Twitter’s firehose ToS mean that I can’t download them all; instead, Sysomos lets me pull a random sample of 5,000 per search. So I’ve been through 7 days, and downloaded 5k per day, giving me 35k tweets which should (I hope) be a good place to start my analysis.
The first thing to note is that Sysomos file names give no clue as to their contents, only the day on which they were downloaded (the sequential numbering is added by my OS). This means I have to open each file to see what’s in it. The relevant metadata are on the description line (line 5), and the real data start on line 7.
Now it’s fairly simple to get those data at the command line by looping through each file and extracting only the fifth line:
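The original command wasn’t preserved here, but the loop can be sketched with sed. The stand-in file below is made up so the loop has something to read; the real downloads are the Sysomos CSVs.

```shell
# A tiny stand-in for one Sysomos download, with metadata on line 5.
printf 'line1\nline2\nline3\nline4\nQuery: orange, 2012-12-01\nline6\n' > demo.csv

# Print the fifth line (the description/metadata line) of every CSV.
for f in *.csv; do
  sed -n '5p' "$f"
done
```

`sed -n '5p'` suppresses normal output (`-n`) and prints only line 5, so each file contributes exactly its metadata line.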
From here I should be able to construct new file names. For my purposes, I really just want to grab the first half of the second field and rename each file accordingly, so that the dates come first. I could do this all with perl, but I’ve come to rely on csvkit, a toolbox maintained by Christopher Groskopf. So I’m being a bit lazy here, using csvkit to pull out the field and perl to do the last bit, cutting and pasting to get the final file names.
Now that I’ve got the filenames sorted, I can create new files under the new names (experience tells me that renaming in place with mv is rarely a good idea in these workflows). I’m using tail -n +7 to trim the first 6 lines off each file, which gets me into a good position to begin munging and cleaning the data.
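The trimming step might look something like this. The date-first output name is a stand-in, since in the real workflow it comes from the metadata extracted above.

```shell
# Build a stand-in download with a 6-line preamble, then copy everything
# from line 7 onward into a new file. The output name is hypothetical;
# in practice it is constructed from the date in the metadata line.
printf 'x\nx\nx\nx\nmeta\nx\ndata1\ndata2\n' > download.csv
tail -n +7 download.csv > 2012-12-01_orange.csv
cat 2012-12-01_orange.csv   # only the two data lines survive
```

Because tail writes to a new file rather than overwriting the download, the original stays intact if anything goes wrong.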
So that’s the final line of code that I need to process Sysomos downloads.
This may all look like a lot of work for nothing, particularly as I’m only dealing with a week’s worth of files here. Imagine, though, if I were running this analysis across multiple keywords, or looking at a month’s worth of data.