Interpreting a G+ JSON “takeout”
I think I want¹ to convert my Google+ posts into a static (probably
Jekyll) site. I’m looking at a JSON takeout of my G+ profile,
and the resources in the Posts directory don’t seem to connect
to one another cleanly, which is going to make this hard.
Archival content follows; note that Google changed the format of G+ JSON takeouts after this was written.
As far as I can tell, some of the
Photos/Photos from posts/*/*metadata.csv
files contain URLs, mapping each photo (the metadata file
name without the trailing
metadata.csv suffix) to a URL listed in the
file. The actual image files seem to be duplicated between the
Photos/Photos from posts/*/ directories in at least some
cases, but this is not consistently true. Mostly, the media objects
in the Posts/*.json files have
resourceName values that are strings
appearing nowhere else in the archive, either inside a file
or in a file name. (Duplication is clearly substantial: the .tgz
export is 2.6G, but when I import the whole thing into a fresh git
repository and commit it, the .git directory contains only 1G.)
Between the text I wrote and the comments I and others wrote, there’s quite a bit of text to process here.
$ ls *.json | wc -l
976
$ for i in *.json ; do cat "$i" | jq '.content' ; done | wc
    976   52729  362412
$ for i in *.json ; do cat "$i" | jq '[.content, .comments?.content]' ; done | wc
   5111  163182 1117401
The content is HTML inside JSON, but it probably won’t be terrible
to turn into markdown, because it started out as the G+ weak
markup. It’s easier to read (that is, it’s clearer to the naked
eye that it is HTML) after running it through
jq, which turns escape sequences
like \u003c back into literal characters.
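A minimal extraction helper makes the same point: the content key comes from the jq queries above, but the function name and shape are my own scaffolding.

```python
# Sketch: pull the HTML body out of one post's JSON text. The "content"
# key matches the jq queries used above; the rest is illustrative.
import json

def post_html(json_text):
    """Return the post's HTML content, or None if the post has none."""
    return json.loads(json_text).get("content")

# json.loads, like jq, decodes \uXXXX escapes, so a stored
# "\u003cb\u003e" comes back as a readable "<b>".
```

From there, an HTML-to-markdown pass (with a library or by hand, given how simple the markup is) would run over each extracted string.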
¹ But do I care enough to do the work? That remains to be seen…²
² Yes, I did care enough, you are reading this work as a result!