Importing a remote database archive
Over at College Confidential, we have a lot of data that I’d like to analyze. I can use the Discourse Data Explorer to query the bulk of our data, but there’s another data set that’s currently sitting on an Amazon EC2 instance as a PostgreSQL data dump. So I’d like to populate the PostgreSQL database on my laptop with that data.
My first thought was to just copy the archive and restore it with psql. But then I counted the cost and noticed the dump would take nearly all of my disk space. Presumably I’d need some more space for PostgreSQL to store that data too. So it looked like I’d need to clear out some unused files first.
After some searching around, I discovered SSHFS which mounts a remote directory to behave like a local directory over SSH. The instructions for macOS are fairly straightforward:
- Download and install macFUSE.
- Download and install SSHFS.
- Restart your machine.
- Create an empty directory. (
mkdir /path/to/mount
) - Mount the remote directory via SSHFS.
The last step is the only tricky bit because it depends on how you access the remote machine. If you use an SSH key (you should), you’ll need to tell the command where it may be found.
sshfs username@example.com:/path/on/remote /path/to/remote \
-o IdentityFile=~/.ssh/key_for_server
If that works, you be able to see the remote files using ls
/path/to/mount
. If you see the data archive, you can start importing
it:
psql database_name < /path/to/remote/dump.sql
Setting up the database (prerequisite I should have mentioned before)
If you are like me, you probably didn’t think to create a database first:
createdb database_name
Or, horror of horrors, you might not have PostgreSQL running yet. If you have a Mac, you’ll need to use these commands:
brew install postgresql@14
Next start the server with:
brew services start postgresql@14
When the restore fails
Unfortunately, this SSH file system setup is a bit fragile. If you lose your network connection, you’ll get a message something like:
ERROR: COPY from stdin failed: aborted because of read failure
CONTEXT: COPY user_profiles, line 479387
could not read from input file: Device not configured
The CONTEXT
line is the key. It tells you what table was being
copied and the line it was working on when the restore failed. For
better or worse, the COPY command is atomic. So when the command
failed, none of the data was committed to the database. It doesn’t
really matter what line number it ended on. Only the table it was in
the middle of copying matters.
You can re-run the command, if you like, but there is no reason to
expect it will make progress. Fortunately, it’s possible to trim the
archive to just data that hasn’t been restored. One way to do that is
using sed
, if you know the line of the command that failed. I use
less
to search for the COPY
command of the table that failed:
/^COPY
Most likely the first result won’t be what you are looking for. You
can search forward with n
until you find the right table. Or you can
add the table name to the search command. Then I use =
to tell me
the line number. Finally, I copy from that line on into another file:
sed -n '1248450,$p' dump.sql > shorter-dump.sql
This is best done on the remote machine and not via SSHFS on the local
machine so that sed doesn’t have to wait for the data to be pull over
SSH. Once the sed
command has finished, you can run the import again
from your local machine:
psql database_name < /path/to/remote/shorter-dump.sql
In any case, there’s a good chance you’ll need to remount the directory:
diskutil umount /path/to/remote
If that doesn’t work, there’s a good chance you’ll need to restart your machine.
Conclusion
This is not the best way to move data around (see the problems caused by a dropped connection), but it helped me get the job done without using too much disk space.