Interpreting a G+ JSON “takeout”
I think I want¹ to convert my Google+ posts into a static (probably
Jekyll) site. I’m looking at a JSON takeout of my G+ profile,
and the resources in the Posts directory don’t seem to connect
to one another cleanly, which is going to make this hard.
Archival content follows; note that Google changed the format of G+ JSON takeouts after this was written.
As far as I can tell, some of the
Photos/Photos from posts/*/*metadata.csv
files contain URLs, mapping each photo (the metadata file
name without the trailing
metadata.csv suffix) to a URL listed in the
file. The actual image files seem to be duplicated between the
Photos/Photos from posts/*/ directories in at least some
cases, but this is not consistently true. Mostly, the media objects
in the Posts/*.json files have
resourceName values that are strings
appearing nowhere else in the archive, either inside a file
or in a file name. (Duplication is clearly substantial: the .tgz
export is 2.6G, but when I import the whole thing into a fresh git
repository and commit it, the .git directory contains only 1G.)
Between the text I wrote and the comments I and others wrote, there’s quite a bit of text to process here.
$ ls *.json | wc -l
976
$ for i in *.json ; do cat "$i" | jq '.content' ; done | wc
    976   52729  362412
$ for i in *.json ; do cat "$i" | jq '[.content, .comments?.content]' ; done | wc
   5111  163182 1117401
The content is HTML inside JSON, but it probably won’t be terrible
to turn into markdown, because it started out as the G+ weak
markup. It’s easier to read (that is, it’s clearer to the naked
eye that it is HTML) after running it through
jq, which turns escape sequences
like \u003c back into literal characters.
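A minimal extraction helper makes the same point: the content key comes from the jq queries above, but the function name and shape are my own scaffolding.

```python
# Sketch: pull the HTML body out of one post's JSON text. The "content"
# key matches the jq queries used above; the rest is illustrative.
import json

def post_html(json_text):
    """Return the post's HTML content, or None if the post has none."""
    return json.loads(json_text).get("content")

# json.loads, like jq, decodes \uXXXX escapes, so a stored
# "\u003cb\u003e" comes back as a readable "<b>".
```

From there, an HTML-to-markdown pass (with a library or by hand, given how simple the markup is) would run over each extracted string.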
¹ But do I care enough to do the work? That remains to be seen…²
² Yes, I did care enough, you are reading this work as a result!