Creating Discourse permalinks with shell commands

The other day, I deleted a bunch of categories in Discourse. In the process, I broke a bunch of links. I hate breaking links, but it just never occurred to me that deleting categories would break links too. But, of course, Discourse can’t know that I moved all these topics to school-specific tags that should replace the school-specific categories.

The pattern for redirection is extremely simple:

c/musical-theater-schools/american-university-mt/658
=> tag/american-university-mt

It’s not all that hard to do this by hand, but I have 58 permalinks I need to create! This sort of work is pretty mindless. I could probably do it while watching a single sitcom episode. But it’s also extremely easy to mess up if you aren’t paying enough attention. Writing a quick script can be nearly as easy and a lot more enjoyable. So that’s what I’m going to do in this case. (And make it take even longer by writing this post at the same time.)

Discourse supports permalinks and it’s not hard to reverse engineer the API for inserting them. The first thing I did was to insert a sample permalink using the admin interface:

(Screenshot: the interface to create a new redirection on Discourse.)

From the developer interface, I see the request is to /admin/permalinks.json with the following payload:

url=c%2Fmusical-theater-schools%2Famerican-university-mt%2F658&permalink_type=tag_name&permalink_type_value=american-university-mt

Converting that into a more readable format:

url = "c/musical-theater-schools/american-university-mt/658"
permalink_type = "tag_name"
permalink_type_value = "american-university-mt"

So those are the three parameters I need. url is the URL I want to redirect, so it’s my input. permalink_type is static. I always want the link to point to a tag. permalink_type_value is the place I want to redirect to. In this case, it’s the tag name I want to point to instead of the category. Thankfully, I created the tags using the category slug, so I just need to extract that substring from the url.
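The readable form above can also be produced from the raw payload on the command line. This is a minimal sketch: `decode_payload` is a hypothetical helper name, and it only decodes `%2F` because that is the only escape in this particular payload. It is not a general URL decoder.

```shell
# Split a form-encoded payload into one key=value pair per line.
# Assumption: %2F (an encoded slash) is the only escape present.
decode_payload() {
  printf '%s\n' "$1" | tr '&' '\n' | sed -e 's|%2F|/|g'
}

decode_payload 'url=c%2Fmusical-theater-schools%2Famerican-university-mt%2F658&permalink_type=tag_name&permalink_type_value=american-university-mt'
```

Each parameter lands on its own line, which makes it easier to see which parts are fixed and which vary from one permalink to the next.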

The steps I need to do are:

Get a list of category URLs that I removed.
Extract the tag name from each URL.
Post a request to /admin/permalinks.json with the correct parameters.

When writing a shell script, it’s often easiest to start with the last step and work backward. If I can post one request with hardcoded values, I know I can build a request from one URL. And if I have a command that takes one URL as a parameter, I can iterate over a list of URLs.

To put it another way, shell scripts often take the form of a pipeline:

$ first_command | middle_command | last_command

Since the last command produces the output you want, it’s important to give it the input it needs. So there’s no point in starting the middle command (much less the first command) before you know what the last command is going to need.

The final step, in this case, is a call to the Discourse API. Whenever you hear “API” in the context of shell scripting, you should think “curl”. I’m going to skip over a few steps I took reminding myself how curl works and jump straight to the sample command I executed:

curl $DISCOURSE_HOST/admin/permalinks.json \
-H "Api-Username: $DISCOURSE_USER" \
-H "Api-Key: $DISCOURSE_API" \
-d "url=/c/musical-theater-schools/american-university-mt/658&permalink_type=tag_name&permalink_type_value=american-university-mt"

I’ve put a few bits of data behind environment variables to make life easier when I move from testing commands to executing them on production servers. $DISCOURSE_HOST is the host I’m working on. In my case it’s set to https://talk.collegeconfidential.com when I’m ready to try commands on production. $DISCOURSE_USER is my username on the site. (I’m CCadmin_Jon on College Confidential.) $DISCOURSE_API is an API key I generated by visiting /admin/api/keys on a Discourse site where I’m an admin. Only admins can create API keys (or permalinks).

The part I need to change is the -d parameter. It should look somewhat familiar since it’s the same payload I got from the dev tools when using the site interface. It has two parts that are variable:

/c/musical-theater-schools/american-university-mt/658
american-university-mt

Extracting the category slug can be done several different ways. sed is probably the right choice, but I’m going to use something less obvious:

$ dirname /c/musical-theater-schools/american-university-mt/658 | xargs basename
american-university-mt

dirname selects everything before the last / and basename selects everything after the last slash. xargs is my go-to tool for chaining commands. If basename accepted arguments from standard input, it wouldn’t be necessary. (Spoiler alert: I’m going to use xargs again later on.)
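For comparison, here is a sketch of two other ways to get the same slug: POSIX parameter expansion (which ksh supports) and sed. `tag_from_path` is a hypothetical helper name used only for illustration.

```shell
# Print the second-to-last component of a path.
tag_from_path() {
  parent=${1%/*}                   # remove the shortest trailing "/..." (the category id)
  printf '%s\n' "${parent##*/}"    # remove the longest leading ".../" (everything before the slug)
}

tag_from_path /c/musical-theater-schools/american-university-mt/658

# The sed version: delete the last component, then everything up to the last slash.
printf '%s\n' /c/musical-theater-schools/american-university-mt/658 \
| sed -e 's|/[^/]*$||' -e 's|^.*/||'
```

Both commands print american-university-mt. The parameter expansion version avoids starting any extra processes, which hardly matters for 58 URLs but is a handy idiom to know.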

At this point, I’m going to start putting my script into a file so that it’s easier to edit. I’m going to call it create_permalink.ksh and it looks like this:

#!/usr/bin/env ksh

u=$1
t=`dirname $u| xargs basename`

curl $DISCOURSE_HOST/admin/permalinks.json \
-H "Api-Key: $DISCOURSE_API" \
-H "Api-Username: $DISCOURSE_USER" \
-d "url=$u&permalink_type=tag_name&permalink_type_value=$t"

Before I can use it the first time, I need to make it executable:

chmod +x create_permalink.ksh

Then I can test it:

$ ./create_permalink.ksh /c/musical-theater-schools/american-university-mt/658
{"status":500,"error":"Internal Server Error"}%

I’m getting an internal server error because I’ve already created this permalink. Actually I’ve created it several times for testing and removed it using the admin interface on the site.

Now there are some shell oddities in the script that you might not be familiar with. So I’ll go over it one command at a time:

#!/usr/bin/env ksh

This tells the system which interpreter to use. I’m a fan of ksh, but it should work with any Bourne shell descendant. I could have used #!/usr/bin/ksh or even just #!ksh, but the more complicated command is slightly more portable.

u=$1

I’m putting the URL in a variable called $u. This command sets the variable using the value of the first input parameter. It’s a bit unnecessary since I could just use $1 everywhere in the script. But I got into the habit of assigning arguments to variables so that I could use "$@" for lists of parameters. It is also slightly more readable if you name your variables instead of numbering them.

t=`dirname $u| xargs basename`

The tag name goes in $t. I’m using backticks (`) to capture the output of the command I mentioned above for extracting the category slug. The more modern method would be to use $(...) instead.

curl $DISCOURSE_HOST/admin/permalinks.json \
-H "Api-Key: $DISCOURSE_API" \
-H "Api-Username: $DISCOURSE_USER" \
-d "url=$u&permalink_type=tag_name&permalink_type_value=$t"

This is a single curl command put on separate lines. The only thing I changed is making the URL and the tag string variables.
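Since every test run against a real server creates a real permalink, a dry-run switch can be useful. The following is a sketch, not the script from this post: it wraps the same steps in a function, `DRY_RUN` is a hypothetical variable introduced for the example, and it swaps -d for curl's --data-urlencode so that curl escapes each value itself.

```shell
# Sketch of create_permalink.ksh as a function with a dry-run switch.
# DRY_RUN is a hypothetical variable; the DISCOURSE_* variables are the
# same environment variables used by the script above.
create_permalink() {
  url=$1
  tag=$(dirname "$url" | xargs basename)   # category slug, which doubles as the tag name

  if [ -n "${DRY_RUN:-}" ]; then
    # Show what would be sent without contacting the server.
    printf 'would redirect %s to tag %s\n' "$url" "$tag"
    return 0
  fi

  # --data-urlencode makes curl escape each value (slashes become %2F).
  curl -sS "$DISCOURSE_HOST/admin/permalinks.json" \
    -H "Api-Key: $DISCOURSE_API" \
    -H "Api-Username: $DISCOURSE_USER" \
    --data-urlencode "url=$url" \
    --data-urlencode "permalink_type=tag_name" \
    --data-urlencode "permalink_type_value=$tag"
}

# Preview one permalink; the subshell keeps DRY_RUN from leaking out.
( DRY_RUN=1; create_permalink /c/musical-theater-schools/american-university-mt/658 )
```

With DRY_RUN unset, the function sends the same three parameters as the script, just encoded by curl instead of typed into one string.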

Now I can do a bunch of permalinks by running this command several times. I could set up a for loop in my script:

for u in "$@"
do
  t=`dirname $u| xargs basename`

  curl $DISCOURSE_HOST/admin/permalinks.json \
  -H "Api-Key: $DISCOURSE_API" \
  -H "Api-Username: $DISCOURSE_USER" \
  -d "url=$u&permalink_type=tag_name&permalink_type_value=$t"
done

Or I can pipe the list into xargs:

$ echo $urls | xargs -n 1 ./create_permalink.ksh
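To see what xargs -n 1 will actually run, one option is to put echo in front of the command. In this sketch, `preview` is a hypothetical helper name and the two paths stand in for the real list.

```shell
# Print each command line that xargs would run, one per input path,
# instead of running it.
preview() {
  xargs -n 1 echo ./create_permalink.ksh
}

printf '%s\n' \
  /c/musical-theater-schools/american-university-mt/658 \
  /c/musical-theater-schools/baldwin-wallace-college-mt/662 \
| preview
```

This prints one ./create_permalink.ksh line per path. Removing the echo turns the preview into the real thing.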

But how do I get a list of the categories I’ve already deleted? I figured out that there was a problem because we noticed 404s in Google’s Search Console, so I could pull the list from there. But I can also get a list from the Discourse logs, which is handy given that future mistakes might not be so easily tracked by a third party.

If you are an admin on a Discourse site, you can visit /admin/logs/staff_action_logs to get a complete list of actions staff have performed, including deleting categories. There’s also a button to export the logs as a CSV file. After downloading the file, I used less to browse it. The first two lines helped me understand the format:

staff_user,action,subject,created_at,details,context
CCadmin_Jon,entity_export,staff_action,2021-03-09 18:40:41 UTC,,

Typically the first line explains the columns, as you can see in this case. And the second line shows the very last thing that was logged, which was me requesting the log to be exported. So I know this log is reverse chronological with the most recent events listed first. That’s a bit atypical because normally log entries are appended to the CSV file as they happen. Discourse logs are stored in database tables, so they can be exported in whatever order makes the most sense.

Using / to search through the logs, I found one of my delete_category entries.

CCadmin_Jon,delete_category,,2021-02-21 00:37:52 UTC,"created_at: 2020-11-26 11:45:12 UTC

Unfortunately, that line doesn’t say which category was deleted. For that, I needed to read down a few more lines:

CCadmin_Jon,delete_category,,2021-02-21 00:37:52 UTC,"created_at: 2020-11-26 11:45:12 UTC
name: American University MT
permissions: {}
parent_category: Musical Theater Schools",/c/musical-theater-schools/american-university-mt/658

It took me a bit to figure out how this entry spanned multiple lines. The details entry includes a double quote, which signals the start of a string. Since there is no close quote before the end of the line, the next line is also part of the string. So the details column ends on line 4 with the closing quote. I’m most interested in the final column, context, which includes the path of the category I deleted. That’s going to be my input to create_permalink.ksh.

Allowing newlines embedded in columns is pretty handy because it allows the log to show, for instance, the full body of a post that was deleted. For deleting a category, it shows some metadata about the category. But this throws a wrench in my plans to use grep to find the row where I deleted each category and extract the path. I didn’t sign up to parse complicated CSV files.
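For the curious, here is a sketch of what a slightly more careful approach could look like: an awk filter that glues the physical lines of each record back together, so that grep sees one record per line. `join_csv_records` is a hypothetical helper, and it assumes a record is complete once it contains an even number of double quotes, which holds for standard CSV where embedded quotes are doubled.

```shell
# Join physical lines into logical CSV records, one record per output line.
# Assumption: an odd running count of double quotes means we are still
# inside a quoted field, so the next line belongs to the same record.
join_csv_records() {
  awk '
    {
      # Continuation lines are appended with " | " in place of the newline.
      record = (in_quotes ? (record " | " $0) : $0)
      quotes += gsub(/"/, "&")        # gsub returns the number of quotes on this line
      in_quotes = (quotes % 2 == 1)
      if (!in_quotes) { print record; quotes = 0 }
    }
  '
}

# Try it on the first lines of the export shown above.
join_csv_records <<'EOF' | grep ',delete_category,' | sed -e 's/^.*,//'
staff_user,action,subject,created_at,details,context
CCadmin_Jon,entity_export,staff_action,2021-03-09 18:40:41 UTC,,
CCadmin_Jon,delete_category,,2021-02-21 00:37:52 UTC,"created_at: 2020-11-26 11:45:12 UTC
name: American University MT
permissions: {}
parent_category: Musical Theater Schools",/c/musical-theater-schools/american-university-mt/658
EOF
```

On this sample the pipeline prints /c/musical-theater-schools/american-university-mt/658. It is still not a real CSV parser, but it no longer depends on the exact wording of the details column.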

I could switch to the Database Explorer API. But why should I when I already have the data I need in the log export? I just need to be a little creative in how I get it. Look again at the line that contains the path I’m looking for:

parent_category: Musical Theater Schools",/c/musical-theater-schools/american-university-mt/658

Looking at other deleted categories, I noticed they all look the same right up to the slug. Just as importantly, only the lines for deleted Musical Theater School categories start with this string. That means I can find what I’m looking for by using this grep command:

grep 'parent_category: Musical Theater Schools",/c/musical-theater-schools/' staff-action-210309-184041.csv

This isn’t technically right. There is a universe in which this command will cause hard-to-debug problems. Which brings me to the first rule of script programming: All is fair in love and shell. Alternatively: There’s no wrong way to get the right answer. An awful lot of learning to be a programmer centers on the concept of correctness. That’s because you never know how your code will be used in the future. You have to make sure it’s robust against bad input that could make everything go very wrong.

But if you are writing a script for yourself and for a specific purpose, you don’t have to worry about it being used by someone else for some purpose where it might, I don’t know, delete their hard drive or something. But that puts the onus on you to make sure you feed your script good input. So the next thing I did was sort the list and eliminate any duplicates to make sure I had the right lines:

$ grep [big long string] [file] | sort -u

Finally, I need to pull out the path. This time I will use sed:

$ grep [big long string] [file] | sort -u | sed -e 's/^.*,//' 
/c/musical-theater-schools/american-university-mt/658
/c/musical-theater-schools/baldwin-wallace-college-mt/662
/c/musical-theater-schools/ball-state-university-mt/661
...
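Before feeding a list like this to a script that writes to a production site, a quick shape check can catch stray lines. This sketch assumes every path has the form /c/parent-slug/slug/id and that slugs contain only lowercase letters, digits, and hyphens; `unexpected_paths` is a hypothetical helper name.

```shell
# Print every input line that does NOT look like /c/<parent-slug>/<slug>/<id>.
# No output means every line has the expected shape.
unexpected_paths() {
  # grep exits non-zero when it prints nothing, which is the good case here.
  grep -Ev '^/c/[a-z0-9-]+/[a-z0-9-]+/[0-9]+$' || true
}

# The second line is deliberately wrong and is the only one printed.
printf '%s\n' \
  /c/musical-theater-schools/american-university-mt/658 \
  'permissions: {}' \
| unexpected_paths
```

Appending | wc -l to the real pipeline is another cheap check that the list has the expected number of entries.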

The key bit is s/^.*,//. To work out what that means, you need to know regular expressions, which is a lot to learn. They are incredibly useful, however, so it’s worth the effort. This command substitutes (s/) the part of the string that matches a pattern (^.*,) with an empty string (//). The pattern (^.*,) begins with the start of the line (^), continues with any number of characters (.*) until it comes across a comma (,). Because .* matches as much as it can, that comma is the last one on the line. We are left with everything following the last comma, which is exactly what we want to find.
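A tiny experiment makes the behavior of the pattern concrete. Because .* is greedy, the match runs through the last comma on the line, not the first. `after_last_comma` is a hypothetical name for the same sed command.

```shell
# Keep only what follows the last comma on each line.
after_last_comma() {
  sed -e 's/^.*,//'
}

# Three fields: only the final one survives.
printf '%s\n' 'a,b,c' | after_last_comma

# The log line from the export: the path survives even though the
# details column comes before it.
printf '%s\n' 'parent_category: Musical Theater Schools",/c/musical-theater-schools/american-university-mt/658' \
| after_last_comma
```

The first command prints c and the second prints the category path. This works here because the context column is last and contains no commas of its own.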

After double- and triple-checking the output, I put the whole thing together and created my permalinks:

grep 'parent_category: Musical Theater Schools",/c/musical-theater-schools/' staff-action-210309-184041.csv \
| sort -u \
| sed -e 's/^.*,//' \
| xargs -n 1 ./create_permalink.ksh

There’s xargs again! -n 1 is only necessary if I don’t include the for u in "$@" loop in create_permalink.ksh. It tells xargs to repeatedly call the command with just one parameter each time it’s called.

At long last, I have 58 new permalinks. I can check there were no errors by looking at the output of each curl command:

{"permalink":{"id":406987,"url":"c/musical-theater-schools/american-university-mt/658","topic_id":null,"topic_title":null,"topic_url":null,"post_id":null,"post_url":null,"post_number":null,"post_topic_title":null,"category_id":null,"category_name":null,"category_url":null,"external_url":null,"tag_id":572,"tag_name":"american-university-mt","tag_url":"https://talk.collegeconfidential.com/tag/american-university-mt"}}%

I can also go to /admin/customize/permalinks on my Discourse host. Finally, I can check the formerly broken link and make sure it ends up in the right place.
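Reading 58 JSON responses by eye is tedious, so here is a sketch of one way to count the successes instead. It assumes the output of the pipeline was saved to a file or piped onward, and that every successful response starts with "permalink":{"id": as in the sample above. `count_created` is a hypothetical helper, the sample input is abbreviated, and grep -o is a widely supported extension rather than strict POSIX.

```shell
# Count successful responses in a stream of concatenated curl output.
# Each success contains exactly one "permalink":{"id":<number> prefix.
count_created() {
  grep -o '"permalink":{"id":[0-9]*' | wc -l | tr -d ' '
}

# Abbreviated sample: two successes followed by one error response.
printf '%s' \
  '{"permalink":{"id":406987,"url":"c/a/b/1"}}' \
  '{"permalink":{"id":406988,"url":"c/a/c/2"}}' \
  '{"status":500,"error":"Internal Server Error"}' \
| count_created
```

If the count matches the number of paths that went in, every request produced a permalink.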

If you followed me this far, I hope you enjoyed reading this and aren’t some sort of compulsive reader who somehow couldn’t change to another page to read something more enjoyable. I enjoy writing shell scripts to get work done and I certainly spent more time than needed on this one.