Interpreting a G+ JSON “takeout”
I think I want¹ to convert my Google+ posts into a static (probably Jekyll) site. I’m looking at a JSON takeout of my G+ profile, and the resources in Posts don’t seem to connect, which is going to make this hard.
Archival content follows; note that Google changed the format of G+ JSON takeouts after this was written.
As far as I can tell, some of the Photos/Photos from posts/*/*metadata.csv files contain URLs that map the photos (the file names without the trailing metadata.csv) to the URLs in the Posts/*.json files. The actual image files seem to be duplicated between the Posts and Photos/Photos from posts/*/ directories in at least some cases, but this is not consistently true. Mostly the media objects in the Posts/*.json files have url and resourceName values that are strings not appearing anywhere else in the archive, either inside a file or in a file name. (Duplication is clearly substantial; the .tgz export is 2.6G, but when I import the whole thing into a fresh git repository and commit it, the .git directory contains only 1G.)
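That claim about url and resourceName is easy to check mechanically. A minimal sketch, assuming it is run from the top of the unpacked archive with the layout described above; /tmp/post-refs is just a scratch file of my own invention:

# Collect every url and resourceName string mentioned in the post JSON…
jq -r '.. | objects | .url?, .resourceName? | strings' Posts/*.json |
    sort -u > /tmp/post-refs
# …then count how many lines of photo metadata mention any of them.
grep -rFf /tmp/post-refs 'Photos/Photos from posts' | wc -l

A low count from the second command is exactly the disconnect the paragraph above complains about: most of those strings never show up again anywhere in the archive.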
Between the text I wrote and the comments I and others wrote, there’s quite a bit of text to process here.
$ ls *.json | wc -l
976
$ for i in *.json ; do cat "$i" | jq '.content' ; done | wc
976 52729 362412
$ for i in *.json ; do cat "$i" | jq '[.content, .comments[]?.content]' ; done | wc
5111 163182 1117401
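Once the text is accounted for, a plausible next step is to pull it out where a converter can reach it. A sketch under the same assumptions about the .content and .comments shape, run from inside Posts like the session above; the /tmp/content destination and the .html suffix are my choices for illustration, not anything the takeout prescribes:

mkdir -p /tmp/content
for i in *.json ; do
    # Gather the post body plus any comment bodies (skipping absent
    # ones) and write one HTML fragment per post.
    jq -r '[.content // empty, (.comments[]? | .content // empty)]
           | join("\n\n")' "$i" > "/tmp/content/$(basename "$i" .json).html"
done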
The content is HTML inside JSON, but probably won’t be terrible to turn into markdown because it started out as the G+ weak markup. It’s easier to read (that is, it’s clearer to the naked eye that it is HTML) after running it through jq, which turns things like \u003cbr\u003e\u003cbr\u003e into <br><br> and \u0026 into &.
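In other words, the \uXXXX sequences are ordinary JSON string escaping, and jq -r emits the decoded HTML directly. From there, one candidate for the HTML-to-markdown step is pandoc; a sketch, where somepost.json is a stand-in for any one of the Posts files:

$ jq -r '.content' somepost.json | pandoc -f html -t markdown

Whether pandoc’s markdown round-trips the G+ weak markup cleanly is the thing to test on a few posts first.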
¹ But do I care enough to do the work? That remains to be seen…²
² Yes, I did care enough; you are reading the result of that work!