Apparently more people than I had previously realized are interested in using my Friends+Me Google+ Exporter Discourse script, and some of them haven’t started doing imports yet. So I’m writing up the process I used for importing content from Google+ into MakerForums.


Preparing for import

I recommend that you set up Google authentication as at least one login option before importing, in order to attribute work appropriately. The data produced by the exporter includes the same ID that Google authentication uses, so when users log in with Google authentication they will own the content they created and, within the settings you have configured, will be able to modify the imported content that they wrote.
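
If you configure from the console rather than the admin UI, a minimal sketch looks like this; the setting names are the current ones, the client ID and secret are placeholders from your own Google API console project, and on a standard Docker install you would run this inside the app container:

# enable Google login so imported authors can claim their content
bundle exec rails runner '
  SiteSetting.enable_google_oauth2_logins = true
  SiteSetting.google_oauth2_client_id = "YOUR_CLIENT_ID"
  SiteSetting.google_oauth2_client_secret = "YOUR_CLIENT_SECRET"
'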

If you have not already set up S3/CDN (S3, Digital Ocean Spaces, or other similar content storage supported by Discourse) but intend to, do it before importing. That way, the imported images and videos will live in the content storage rather than on the filesystem. However, you will want to turn that off in your development instance, so that you aren’t uploading while doing test imports.
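
For reference, the relevant site settings can also be set from the console; this is only a sketch, with placeholder bucket, region, credentials, and CDN values (s3_endpoint is only needed for non-AWS providers such as Spaces):

# point uploads at S3-compatible storage fronted by a CDN
bundle exec rails runner '
  SiteSetting.s3_upload_bucket = "your-bucket"
  SiteSetting.s3_region = "us-east-1"
  SiteSetting.s3_access_key_id = "YOUR_ACCESS_KEY"
  SiteSetting.s3_secret_access_key = "YOUR_SECRET_KEY"
  SiteSetting.s3_cdn_url = "https://cdn.example.com"
  # SiteSetting.s3_endpoint = "https://nyc3.digitaloceanspaces.com"   # non-AWS providers only
  SiteSetting.enable_s3_uploads = true
'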

Do note that the Discourse experience is substantially faster with a CDN, and if you have a lot of images (and especially if you have videos), storage costs are typically substantially lower if you are using an S3-like system for content. Except for small, private, or predominantly private Discourse instances, I recommend it.

Also, if you have a substantial amount of image content to import, I recommend setting include_thumbnails_in_backups — we didn’t, and when we had to restore a backup, we set ourselves up for about a week of rebaking the whole site.
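
Enabling it is a one-line setting, and if you do get caught out and have to rebake after a restore, the standard rake task is the one shown here (a sketch, assuming console access):

# include thumbnails in backups so a later restore does not force a full rebake
bundle exec rails runner 'SiteSetting.include_thumbnails_in_backups = true'

# the rebake-everything task, if you need it anyway; on a large site this runs for days
bundle exec rake posts:rebake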

Until imported users log in with Google, no email will be sent to them, because their email record is :googleId@gplus.invalid, which is part of the .invalid domain explicitly reserved for these kinds of purposes. As a result of experience with this script, Discourse doesn’t even try to deliver email to these addresses, as a feature.
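
If you want to check how many imported accounts are still in that state, a quick console query (a sketch, assuming the current schema in which addresses live in user_emails) is:

# count imported accounts that have not yet logged in with Google
bundle exec rails runner 'puts UserEmail.where("email LIKE ?", "%@gplus.invalid").count'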

The Easy Way

The easiest approach, if you have the luxury of importing in a non-production environment, is to set up the non-production environment on the system where you ran the importer: take a backup in production, restore it in the development environment, do the import, back up the development environment, and then restore that into production. If you do that, you’ll want to put the production site in read-only mode during the process. Or, of course, do all of this before bringing the production site up in the first place.
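
On a standard Docker install, each backup/restore hop can be driven with the discourse command inside the app container (a sketch; the file name is a placeholder, and a development checkout can just use the admin Backups UI instead):

cd /var/discourse
./launcher enter app

# inside the container: take a backup (it lands in the shared backups directory)
discourse backup

# on the target instance, after copying the backup file into its backups directory:
discourse enable_restore
discourse restore example-backup.tar.gz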

Also, if you are using Discourse’s own hosting, the easy way, with its limitations, is the only way.

The Hard Way

I didn’t have the luxury of doing it the easy way, because I was importing far too much to take the site offline for the imports (days of clock time), and the site was already functional before I showed up with an offer to import. So the rest of this is about the (relatively unusual, by the norm for Discourse imports) process of importing into a live site.

My drill after importing a backup from production into development was as follows (a console sketch of these steps appears after the list):

  • Create admin user on command line
  • Set https not required on command line
  • Disable sending emails in settings
  • Disable S3 for uploads
  • Disable S3 for backups
  • Turn off scheduled backups
  • Take a database-only backup so I could retry easily with these new settings until I was satisfied with them.
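
Here’s a sketch of that drill from the command line in the development copy; the setting names assume a reasonably current Discourse, and older versions spell a couple of them differently (noted in the comments):

cd /var/www/discourse                         # or wherever your development checkout lives
bundle exec rake admin:create                 # interactive admin creation
bundle exec rails runner '
  SiteSetting.force_https = false
  SiteSetting.disable_emails = "yes"          # a boolean on older versions
  SiteSetting.enable_s3_uploads = false
  SiteSetting.backup_location = "local"       # older versions: enable_s3_backups = false
  SiteSetting.automatic_backups_enabled = false
'
# then take the database-only backup (uncheck "include uploads" on the admin
# Backups page) so that retries are quick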

I had more data to import than disk space available on the production server, so I could not copy all the data to the production server at once. I wrote a quick script that copied only referenced images. I processed 1000 posts at a time, by choosing the JSON format in the exporter and asking it to export 1000 posts per file. That was conservative, in order to avoid consuming too much memory or disk space; 5000 posts at a time would probably have been fine.

I ran imports from a discourse checkout, based on the same git hash as the currently-running production system, checked out on the system on which I ran the Friends+Me Google+ Exporter. The exporter saves absolute paths in its data mapping CSV files, which makes it tricky to move the data between systems, and is incompatible with the very common method of using the official Docker images to deploy Discourse.
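
Finding the production hash and matching it locally looks something like this (a sketch; app is the standard container name, and the checkout path is a placeholder):

# on the production host: which commit is the container running?
docker exec app bash -c "cd /var/www/discourse && git rev-parse HEAD"

# on the machine running the exporter: put the local checkout on the same commit
cd ~/discourse
git fetch
git checkout <hash-from-above>
bundle install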

I created a b subdirectory under the discourse checkout (I no longer remember what I meant that to be an abbreviation for) in which I stored things like whitelists, blacklists, categories.json, etc. I reserved the directory u (ditto) for creating sets of files to import, and the directory incs for storing tarball packages to import into production.
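
Concretely, the layout the script below expects looks roughly like this (the checkout path is a placeholder, and u is rebuilt on every run):

cd ~/discourse      # the checkout matching production
mkdir -p b incs     # b: usermap.json, categories.json, blacklist.json, optional whitelists
                    # incs: finished import packs; u is recreated by the script each time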

Here’s a version of the script I use to create the tarballs that are import packs:

#!/bin/bash

# first argument: the name fragment in the exported feed file names;
# remaining arguments are passed through to the import script
name=$1
shift

# build one import pack per exported feed chunk matching the name
for feed in ~/.config/Google+\ Exporter/google-plus-exports/g+-feed*$name*.json ; do

  echo processing $feed
  counter=$(echo $feed | sed -E 's/.*(-[[:digit:]]+of[[:digit:]]+).json/\1/')
  echo $counter
  increment=$name$counter
  echo $increment

  # rebuild the staging directory from scratch for this chunk
  rm -rf u
  mkdir u
  LWL=
  RWL=
  cp "$feed" u/feed.json
  cp b/usermap.json u/usermap.json
  cp b/categories.json u/categories.json
  cp b/blacklist.json u/blacklist.json
  if [ -f b/$name-whitelist.json ] ; then
    cp b/$name-whitelist.json u/whitelist.json
    LWL=u/whitelist.json
    RWL=/shared/tmp/$LWL
  fi

  # import this chunk into the local development database; this run also produces
  # u/$increment-upload-paths.txt, listing the uploads the chunk references
  time bundle exec ruby script/import_scripts/friendsmegplus.rb u/categories.json u/feed.json u/usermap.json "${HOME}/.config/Google+ Exporter/google-plus-exports/google-plus-image-list.csv" "${HOME}/.config/Google+ Exporter/google-plus-exports/google-plus-video-list.csv" u/blacklist.json u/$increment-upload-paths.txt $LWL "$@" || {
    echo importer failed
    exit 2
  }

  echo "#!/bin/bash" > u/import.sh
  echo ruby script/import_scripts/friendsmegplus.rb /shared/tmp/u/categories.json /shared/tmp/u/google-plus-image-list.csv /shared/tmp/u/google-plus-video-list.csv /shared/tmp/u/blacklist.json /shared/tmp/u/feed.json /shared/tmp/u/usermap.json $RWL "$@" >> u/import.sh
  chmod +x u/import.sh

  # copy only the uploads this chunk actually references into the staging directory
  cat u/$increment-upload-paths.txt | while read path; do cp "$path" u/ ; done
  sed "s^${HOME}/.config/Google+ Exporter/google-plus-images/../../../^/shared/tmp/u/^" ~/.config/Google+\ Exporter/google-plus-exports/google-plus-image-list.csv > u/google-plus-image-list.csv
  sed "s^${HOME}/.config/Google+ Exporter/google-plus-videos/../../../^/shared/tmp/u/^" ~/.config/Google+\ Exporter/google-plus-exports/google-plus-video-list.csv > u/google-plus-video-list.csv
  tar czf incs/$increment.tar.gz u
  ls -lh incs/$increment.tar.gz
done
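
If you saved that as, say, make-import-pack.sh (the name is only illustrative), you would invoke it with the community-name fragment that appears in the exporter’s feed file names:

# packages every matching g+-feed*SomeCommunity*.json chunk into incs/
./make-import-pack.sh SomeCommunity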

I then copied the resulting tarballs to the production system.
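
Any transfer method works; as a sketch, with placeholder file, user, host, and destination:

rsync -avP incs/SomeCommunity-1of12.tar.gz user@production-host:imports/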

I used the import.sh written by the first script to run an identical import in production. In my case, on the production system (using Digital Ocean and their Docker-based Discourse image), I ran imports using a script like this from outside the Docker container:

#!/bin/bash

[ -f "$1" ] || {
  echo "import file not found"
  exit 1
}

set -e

# copy the pack into the container's shared volume and unpack it there;
# $1 must be a bare tarball file name in the current directory
docker cp $1 app:/shared/tmp
docker exec app /bin/tar -C /shared/tmp -x -z -f /shared/tmp/$1
# run the import.sh generated by the packing script, as the discourse user
time docker exec -u discourse app /bin/bash -c "cd /var/www/discourse; bash /shared/tmp/u/import.sh"
# remove the unpacked staging directory inside the container
docker exec app rm -rf /shared/tmp/u

# clean up the local tarball and the copy inside the container
rm $1
docker exec app rm /shared/tmp/$1
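
Run it from the directory holding the tarball, with the bare file name (again, the script name is only illustrative):

./run-import-pack.sh SomeCommunity-1of12.tar.gz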

Before I start importing in production, I pin a warning explaining that email delivery is disabled while the import is underway, so that people understand not to try to register as a new user during that time.

Note that after your import is underway, Discourse will start an asynchronous process of optimizing images and making copies in different sizes. These processes are run at low priority in an attempt to reduce the impact on interactive usage of your Discourse instance. You’ll see lots of convert, optipng, and similar processes running on the system after the import is otherwise complete. You can see the queue by going to https://yourdiscourse/sidekiq after logging in as an administrator to https://yourdiscourse/. This is normal.
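
If you prefer to watch from a shell rather than the Sidekiq web UI, a rough sketch (again assuming the standard container name app):

# total number of Sidekiq jobs still queued
docker exec -u discourse app bash -c \
  "cd /var/www/discourse && RAILS_ENV=production bundle exec rails runner 'puts Sidekiq::Stats.new.enqueued'"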

Fixing up duplicate users

If you have users who were already using the site before your import, and they used Google authentication for login, they should have been recognized as the authors of their content at import time. However, if a user started with only some other authentication before your import, the import scripts will have created a new Discourse handle for them. You may want to merge them together.

This is possible only from the command line, and so is not available directly in hosted Discourse. The rest of this assumes you have command-line access as an administrator. Danger: This is a one-way operation. You cannot undo it. Take a backup first.

Let’s call the users @localuser and @Imported_User, because most Google+ users will have been imported with a full name. Let’s assume that the user wants to be referred to as @localuser, since that’s the name they chose in the first place.

In the discourse directory (in the official Discourse Docker images, that’s /var/www/discourse), run these commands:

time bin/rake 'users:merge[Imported_User,localuser]'
time bin/rake 'posts:remap[@Imported_User,@localuser]'

In theory, only the first is required; in practice, I’ve almost always needed to run both. Note that while these are both one-way operations, only the second (currently) requires interactive confirmation. Note also that the first command does not use the initial @ and the second command does include it. Make sure that you do these steps in the correct order.

The time part is not strictly needed, but it’s information I like to keep track of.

Keep a record of all the renames you have done, bearing in mind that if you keep that record in Discourse, the second command will edit your record, making it useless. If you do additional imports after the rename, the import script may not have the information required to recognize the rename. (I believe, but have not validated, that it depends on whether the user logs in with Google authentication after the rename process and before the subsequent import.)
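
If you accumulate more than a handful of these, one way to keep the record outside Discourse and replay it is a small wrapper around the two rake tasks. This is only a sketch, with a made-up merges.txt containing one Imported_User:localuser pair per line; it reads the list on a separate file descriptor so that posts:remap can still ask for its interactive confirmation:

#!/bin/bash
# replay user merges recorded in merges.txt (one "Imported_User:localuser" pair per line)
# run from /var/www/discourse inside the container; these are one-way, so back up first
while IFS=: read -r imported local <&3 ; do
  bin/rake "users:merge[$imported,$local]"
  bin/rake "posts:remap[@$imported,@$local]"
done 3< merges.txt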

More Information

I’ll plan to update this post with any more clarifications that I’m asked for. You can ask in Pluspora for the foreseeable future.