Finding Errors

Filed under bug-hunting on May 20, 2021

My day-to-day grind at the moment is configuring new environments as new work comes in. I’ve done it enough by now that I know exactly where the pain points are and what’s next in line to automate, so I think by the end of next month I’ll be able to spin up a whole new environment in under 20 minutes, before customisations.

The fun begins

I spun up a new environment and got all the Amazon configuration sorted. Connect was happy, I could make phone calls and I could see everything streaming out to the database properly.

I then proceeded to start configuring Asterisk, connecting it to the SIP trunk and so on. I could see the instance registering and taking calls just fine, but when I tried to bring up the dialler it would tell me that there was no such Asterisk user.

Huh.

The Hunt

I started walking through our stack, making sure containers were talking to each other, to the database, etc. I could see the Asterisk users in the Postgres RDS instance, so my confusion intensified.

It was around then that I found the following log in CloudWatch:

res_config_pgsql.c:1606 pgsql_reconnect: PostgreSQL RealTime: Failed to connect database asterisk on
aaa-production-aaaaaa.aaaaaaaaaaaa.ap-southeast-2.rds.amazonaws:

Well, OK, that’s a weird log; I’d appreciate a better reason. There were no networking issues and the security groups were fine, so I decided to have a look at the res_config_pgsql.c file in Asterisk.

The code block in question is this one:

if (pgsqlConn && PQstatus(pgsqlConn) == CONNECTION_OK) {
  ast_debug(1, "PostgreSQL RealTime: Successfully connected to database.\n");
  connect_time = time(NULL);
  version = PQserverVersion(pgsqlConn);
  return 1;
} else {
  ast_log(LOG_ERROR,
      "PostgreSQL RealTime: Failed to connect database %s on %s: %s\n",
      my_database, dbhost, PQresultErrorMessage(NULL));
  return 0;
}

It started to make sense as I wandered through the code. You’ll recall the hostname we saw in the log:

aaa-production-aaaaaa.aaaaaaaaaaaa.ap-southeast-2.rds.amazonaws

I was thinking that maybe the .com was being truncated as a “nice” thing to make the logs less verbose or something, when I saw this:

static char dbhost[MAX_DB_OPTION_SIZE] = "";

So… What’s that constant there?

#define MAX_DB_OPTION_SIZE 64

So we have a buffer of 64 bytes to hold the hostname, which leaves room for 63 characters plus the terminating NUL. Guess how long the hostname was?
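To make the truncation concrete, here is a minimal sketch of what happens when a name that is too long lands in a 64-byte buffer. It is not Asterisk's code: store_option is a hypothetical helper, and snprintf stands in for whatever copy routine Asterisk actually uses. The hostname is the anonymised one from the log with the .com put back, which I'm assuming is how the real endpoint ended.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DB_OPTION_SIZE 64 /* same value as the constant in res_config_pgsql.c */

/* Hypothetical stand-in for the config loader: copy value into a
 * fixed-size buffer, silently dropping whatever does not fit. */
static void store_option(char *dst, size_t dst_size, const char *value)
{
    assert(dst != NULL && value != NULL && dst_size > 0);
    /* snprintf writes at most dst_size - 1 characters plus the NUL. */
    snprintf(dst, dst_size, "%s", value);
}
```

Storing the 67-character name aaa-production-aaaaaa.aaaaaaaaaaaa.ap-southeast-2.rds.amazonaws.com in a char[MAX_DB_OPTION_SIZE] this way keeps only the first 63 characters, so the buffer ends at rds.amazonaws, exactly as in the log line.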

So what to do?

My short-term solution was to create a CNAME record in Route 53, a shorter name pointing at the RDS endpoint, and go from there.

That was easier than patching it; my C knowledge is kind of limited to what I learned in uni. Without a full analysis of that file I can’t know whether upping that buffer size would cause other issues, so I’ll have to decide what to do in the future.
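If I ever do look at a patch, one cautious direction might be to refuse values that don't fit rather than grow the buffer, so the failure is loud instead of silent. The sketch below only illustrates that idea; set_db_option is a hypothetical name and none of this comes from the Asterisk source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DB_OPTION_SIZE 64 /* same value as the constant in res_config_pgsql.c */

/* Hypothetical setter: returns 0 on success, or -1 if value would be
 * truncated, in which case dst is left untouched. */
static int set_db_option(char *dst, size_t dst_size, const char *value)
{
    assert(dst != NULL && value != NULL);
    size_t len = strlen(value);
    if (len >= dst_size) {
        return -1; /* no room for the value plus its terminating NUL */
    }
    memcpy(dst, value, len + 1); /* copy the NUL as well */
    return 0;
}
```

A caller could log the offending option name when this returns -1, which would have pointed straight at the hostname instead of at an unexplained connection failure.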