One of our media partners kindly shared a couple of slide decks today with lots of information about some broad audience segments. There were a couple of graphs that looked interesting, and I wanted to grab the raw data. As so often happens, though, the charts had been copied and pasted from Excel, so those data weren’t available. Instead, PowerPoint responded with the familiar error message, “The linked file is not available.” This has probably happened to you. It can be particularly frustrating when it’s your own presentation that you were trying to update, and you have deleted the original Excel file that you were using as a scratchpad.
But this time, I noticed (possibly for the first time) that — no matter what PowerPoint was telling me — all the data were still in place. Just mousing over the chart data points told me what I needed to know – but (short of manually noting down 96 data points per line) how could I get them out?
It turns out that you can open the PowerPoint document using my text editor of choice, BBEdit (a little checking shows you can also use the free version, TextWrangler). This shows you a nice directory tree, and if you drop into the
/ppt/charts/ directory, all the charts will be there in a readily-parseable XML format.
I’ve been looking for another way to get at these data, and thanks to Pat Parslow I can share an even easier method. It transpires that Microsoft Office’s Open XML formats are simple zipped collections of XML files, so here’s how you can get at the data:
- Unzip the
.pptxfile, either by typing
unzip filename.xmlin the terminal, or by changing the extension from
.zip, then double clicking on it (this has the advantage of working on either OS X or Windows machines.
- That’s more or less it. Open the folder, go to
/ppt/charts/directory and fill your boots.
The BBEdit route in pictures
Then you can do what you like to clean the data. Here’s the (ugly, incomplete, but near-enough)
sed command I used.
In retrospect though, I should probably have used Google Refine. It took Refine a few seconds to do the work of that
sed command, and it took me more than 10 minutes to write the regular expression (
sed has its own weird regexes, and it seems that it’s even more clunky on OS X)
Here’s the Google Refine approach (I need to play with Refine more):