Building a reliable fax pipeline with Twilio and AWS to expand access to the ballot box

Written by Ben Weissmann (VoteAmerica) and originally published on the Twilio Blog.

At VoteAmerica, we work to help Americans register to vote, and request vote-by-mail (also known as absentee) ballots. We use technology to remove barriers to voting and make sure that everyone can vote safely, easily, and securely.

The 2020 election will be like no other: with a global pandemic, we anticipate a massive increase in the number of Americans who will vote by mail this November, and we’re making sure that requesting a mail-in ballot is as accessible as possible. Requesting a ballot can be difficult: in many states, you need to print a form, sign it, and mail it — and very few households have access to a printer, especially with so many people working from home. In 19 states, voters can request a ballot online via an official state system. And 11 more states — covering over 50 million voters — allow you to submit a ballot application via email or fax.

And it’s those 11 states that we’ll focus on in this post: states where you can submit your application electronically… if you have the technical know-how to scan your signature onto the paperwork and email it or fax it. And in a few of these states (like Kansas and New Hampshire), email isn’t an option: you have to use a fax machine (and who has access to a fax machine at home any more?). Even in states that typically allow you to email your application, some counties have been so overwhelmed by the volume of email that they’ve temporarily stopped accepting emailed applications.

So at VoteAmerica, we set out to build a system that makes it easy for voters to fax their ballot applications, to help voters in states that don’t accept email or counties where the email systems aren’t working. And that’s where we ran into a tough technical problem: how to reliably send faxes. We have to be very, very sure that our faxes are going through: a lost application means that a voter could be waiting for a ballot that never comes. Twilio offers a simple, easy API for sending faxes — so what’s the tough part?

How Faxes Work (And Don’t Work)

Fax machines work via the telephone network. When you send a fax, you’re actually making a phone call — but rather than connecting two people to each other to talk, you’re connecting two fax machines. The sending fax machine scans the document and encodes it for transmission over the phone line, and the receiving fax machine decodes the document and prints it.

So unlike email, there’s a whole lot of things that can go wrong. Most commonly, the phone line is busy! If you’re using a web-based fax receiver (like Twilio), this isn’t a problem: Twilio handles the complex phone network magic to be able to receive multiple faxes simultaneously at the same number. But we’re not sending faxes to Twilio — we’re sending them to election offices, most of which are still using an old-fashioned fax machine connected to a landline. And just like phone calls, only one fax can be using the line at a time. So if someone else is trying to send a fax to that election office, we’ll get a busy signal when we try to send the voter’s application.

Even if the line isn’t busy, there’s still other things that can go wrong. Fax machines typically have limited memory (they’re just storing documents long enough to print them), so if the machine is out of paper or ink and runs out of space to store incoming documents, it won’t accept any more incoming faxes. Even simpler, the fax machine could be off or unplugged.

All this to say: we need to be really careful about sending faxes, and anticipate that a lot of our attempts to send a document won’t work. While Twilio’s API doesn’t completely solve this out-of-the-box — sending a document via the Twilio Fax API just tries to send once and fails if the line is busy or disconnected — it does give us all the building blocks we need to build a reliable fax system.

The Fax Gateway

To ensure our faxes get sent successfully every time, VoteAmerica built an open-source Fax Gateway: a system that sits on top of the Twilio API to handle queuing and retrying faxes. To make this system cheap to operate and easy to maintain, we built it using Amazon SQS, AWS Lambda, and the Serverless Framework.

The fax gateway handles a couple of problems:

  • If a fax doesn’t go through, it automatically queues it up to try again after a few minutes. It tracks how many times we’ve tried to send a particular fax, so if it fails too many times we can escalate that problem to an engineer to investigate and let the voter know their application wasn’t sent successfully.
  • It queues the faxes so we’re never sending more than one fax at a time to a particular number. Because most election offices are using a traditional physical fax machine that can only receive one document at a time, we don’t want to try sending multiple documents at the same time — we’d end up competing with ourselves and getting a lot more busy signals.

The basic flow is that a pending fax is written to an SQS queue. This is a FIFO (first-in, first-out) queue. SQS FIFO queues have a number of extra features that help us out:

  • They can deduplicate messages. This means that if the API call to write a message to the queue fails, we can safely retry it because the queue will remove the duplicate if it’s written twice.
  • We can group messages via a “message group ID”. This partitions the queue — basically, we get a separate logical queue for each message group ID. Messages within a group are processed strictly in order, and only one message from each group is processed at a time.
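
To make the first point concrete, here is a minimal sketch of a retry-safe send that passes an explicit MessageDeduplicationId. The helper name build_send_kwargs and the choice of the fax ID as the deduplication key are assumptions for illustration, not necessarily how the Fax Gateway does it; a FIFO queue can also be configured for content-based deduplication instead.

```python
import json
from typing import Any, Dict


def build_send_kwargs(queue_url: str, fax: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for boto3's sqs_client.send_message().

    Hypothetical helper: it uses the fax ID as the deduplication key, which
    is an assumption made for this sketch.
    """
    return {
        "QueueUrl": queue_url,
        # sort_keys makes the serialized body identical on every retry
        "MessageBody": json.dumps(fax, sort_keys=True),
        # One message group per destination fax number
        "MessageGroupId": fax["to"],
        # Sending the same ID again within the SQS deduplication window
        # does not create a second message
        "MessageDeduplicationId": fax["fax_id"],
    }
```

With a real client this would be used as sqs_client.send_message(**build_send_kwargs(queue_url, fax)). Because the arguments are identical on every call, retrying after a network error should be safe as long as the retry happens within the deduplication window.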

This second feature — the message group IDs — is what lets us make sure we don’t jam up a particular election office by sending multiple faxes at once. We use the destination fax number as a message group ID, and SQS will make sure that only one message from each group (meaning one fax for each destination number) is being processed at a time.
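
The locking behavior is easier to see in a toy model. The class below is an in-memory sketch of how message groups behave, not SQS itself; the class and method names are made up for illustration, and real SQS adds visibility timeouts and makes no ordering promise across groups.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple


class GroupedFifo:
    """Toy in-memory model of message-group behavior in a FIFO queue."""

    def __init__(self) -> None:
        self._groups: Dict[str, Deque[str]] = {}
        self._in_flight: Set[str] = set()

    def send(self, group_id: str, body: str) -> None:
        self._groups.setdefault(group_id, deque()).append(body)

    def receive(self) -> Optional[Tuple[str, str]]:
        """Return the oldest message of the first group with nothing in flight."""
        for group_id, messages in self._groups.items():
            if messages and group_id not in self._in_flight:
                self._in_flight.add(group_id)
                return group_id, messages[0]
        return None

    def delete(self, group_id: str) -> None:
        """Acknowledge the in-flight message, which unlocks its group."""
        self._groups[group_id].popleft()
        self._in_flight.discard(group_id)
```

If two faxes are queued for one number and one fax for another, the first two calls to receive() hand out one fax per number, and the third returns None until delete() is called for one of them. That is the property the Fax Gateway relies on: a second fax to a busy number waits its turn.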

The Life Of A Fax In The Fax Gateway

When we want to send a fax, an application writes it to the Fax Queue. From there, it’s delivered to the Fax Processor. The Fax Processor sends the fax (via the Twilio API) and checks if it went through. If it did, the Fax Gateway writes a message to the Webhook Queue, which sends a notification back to the application. If the fax didn’t go through successfully, we write the message to a Retry Queue, which holds on to it for a few minutes, and then sends it back to the Fax Queue to try again.

When an application wants to send a fax, it just writes it to the Fax Queue. Here is how this can be done from Python, using the boto3 library:

    import json

    import boto3

    sqs_client = boto3.client('sqs')

    payload = json.dumps(
        {
            "fax_id": str(fax_uuid),  # A unique ID for the fax
            "to": str(fax_to),  # An E.164-formatted phone number
            "pdf_url": fax_pdf_url,  # We use a presigned S3 GET URL
            # A URL in the application that receives webhooks
            "callback_url": callback_url,
        }
    )

    sqs_client.send_message(
        QueueUrl=FAX_GATEWAY_SQS_QUEUE,
        MessageBody=payload,
        MessageGroupId=str(fax_to),
    )

And then the Fax Gateway’s queues and Lambda functions take over. Let’s take a deeper look at how one of these functions — the Fax Processor — works.

The Fax Processor

The Fax Processor is the heart of the Fax Gateway. It reads the faxes from the Fax Queue and sends them via the Twilio Programmable Fax API:

    # Send the fax to Twilio
    twilio_fax = client.fax.faxes.create(
        from_=PHONE_NUMBER,  # A phone number from our Twilio account
        # The destination number (the "to" field of the queued message)
        to=fax_record.to,
        media_url=fax_record.pdf_url,
        # Our faxes can contain PII -- instruct Twilio to not retain
        # a copy of the PDF.
        store_media=False,
        # Fail fast (after 5 minutes) rather than leaving the fax
        # queued -- we have our own retry logic, and this lambda
        # function times out after 15 minutes so we'd rather
        # gracefully handle the failure ourselves than have
        # Twilio hold it in their queue for a long time.
        ttl=5,
    )

One important design decision was how to handle failed faxes. When you send a fax via Twilio, you generally provide a status callback. Twilio will try to send your fax, and let you know via the status callback whether it went through. So we could have the Fax Processor fire off the fax via Twilio, set the status callback to invoke another Lambda function, and then return (so Twilio would send the fax in the background, after the Fax Processor returns). However, this would mess up all the work we’ve done with message groups: we’re trying to only send one fax at a time to a destination number. So it’s important that the Lambda function doesn’t return until the fax is done sending.

So instead of using the Twilio status callback, we have our Lambda function stick around, and poll the Twilio API. This keeps the Lambda function running until the fax is done sending (successfully or not), so we maintain this “lock” on the destination number:

    import logging
    import time
    from typing import Any

    # Twilio fax status codes indicating still-in-progress -- any other
    # status means the fax is finished (successfully or not).
    TWILIO_STATUS_PENDING = ("queued", "processing", "sending")

    # How frequently to poll for fax status, in seconds
    TWILIO_POLL_INTERVAL = 15


    def poll_until_fax_delivered(fax_sid: str) -> Any:
        while True:
            try:
                fax = client.fax.faxes(fax_sid).fetch()
            except Exception:
                # If there was an error getting the fax status, just
                # log it and keep polling -- we don't want to let a
                # transient error cause the whole lambda function to
                # fail.
                logging.exception("Error while polling for fax status")
                time.sleep(TWILIO_POLL_INTERVAL)
                continue

            if fax.status not in TWILIO_STATUS_PENDING:
                return fax.status

            print(f"Fax has pending status: {fax.status}, waiting")
            time.sleep(TWILIO_POLL_INTERVAL)

We also distinguish between a failure to send the fax due to a busy signal or other expected problem — the stuff that we expect to happen every once in a while even if everything is working correctly on our end — and a failure to send the fax due to a problem with AWS, Lambda, or Twilio. If we encounter an unexpected error (for example, if the Twilio API returns a 500 error code), we use the normal Lambda error handling mechanism: we throw an error from the Lambda function, and the message will end up back in the Fax Queue to try again after the SQS visibility timeout. If we get an expected error — Twilio tells us the fax line was busy, for example — we don’t return an error from the Lambda function. Instead, we write the message to the Retry Queue and return a success.
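
The split between expected and unexpected failures can be sketched as a small routing function. This is an illustration rather than the Fax Gateway's actual code: the function name, the exception class, and the exact set of statuses treated as expected failures are assumptions.

```python
# Final statuses that we treat as routine delivery failures.
# Assumption for this sketch: the real project may group statuses differently.
EXPECTED_FAILURE_STATUSES = ("busy", "no-answer", "failed", "canceled")


class UnexpectedFaxError(Exception):
    """Raised so the Lambda invocation fails and SQS redelivers the message."""


def queue_for_status(status: str) -> str:
    """Decide what happens to a fax once polling returns a final status."""
    if status == "delivered":
        return "webhook"  # success: notify the calling application
    if status in EXPECTED_FAILURE_STATUSES:
        return "retry"  # routine failure: park the fax in the Retry Queue
    # Anything else is a surprise: raise instead of returning, so the normal
    # Lambda error handling puts the message back on the Fax Queue.
    raise UnexpectedFaxError(f"Unrecognized fax status: {status!r}")
```

The key design point is the difference between returning and raising. A return value means the Lambda function succeeded and the message leaves the Fax Queue; an exception leaves the message there to reappear after the visibility timeout.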

The Retry Processor

Messages sit in the Retry Queue if we’ve tried to send the fax, but it didn’t go through. Usually, this is because the fax line is busy, so we want to wait a bit for the fax line to clear up before we try again. The Retry Queue is a delay queue, which means that we’ve configured it to hold onto messages for about 15 minutes before we deliver them. This lets us space out our retries, so the receiving fax machine has time to stop being busy, or have its paper or ink replaced, before we try again.
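
A delay queue is an ordinary SQS queue with the DelaySeconds attribute set. The Fax Gateway configures its queues through the Serverless Framework; the sketch below shows roughly the same setting through the boto3 API instead, with hypothetical helper names and a client passed in as an argument.

```python
from typing import Any, Dict

# 15 minutes; 900 seconds is also the largest delay SQS allows on a queue
RETRY_DELAY_SECONDS = 15 * 60


def retry_queue_attributes(delay_seconds: int = RETRY_DELAY_SECONDS) -> Dict[str, str]:
    """Build the queue attributes for a delay queue (values must be strings)."""
    if not 0 <= delay_seconds <= 900:
        raise ValueError("SQS DelaySeconds must be between 0 and 900")
    return {"DelaySeconds": str(delay_seconds)}


def create_retry_queue(sqs_client: Any, queue_name: str) -> str:
    """Create the queue with a boto3 SQS client and return its URL."""
    response = sqs_client.create_queue(
        QueueName=queue_name,
        Attributes=retry_queue_attributes(),
    )
    return response["QueueUrl"]
```

Since 15 minutes is the ceiling for a queue-level delay, spacing retries further apart would need a different mechanism than this attribute alone.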

We keep the Retry Queue separate from the Fax Queue so that while we’re waiting to re-send that fax, we can try other faxes to that number (so if the fax is unprocessable for some reason — say, for example, the PDF is corrupt or invalid so Twilio can’t send it — it doesn’t clog up all messages being sent to that number).

The Retry Processor itself is very simple: it just reads the messages, and then writes them back to the Fax Queue:

    # Takes a fax record from the retry queue and sends it back to the
    # fax queue to be retried
    def handler(event: Any, context: Any) -> Any:
        # We set batchSize to 1 so there should always be 1 record
        assert len(event["Records"]) == 1
        record = event["Records"][0]

        fax_record = Fax.json_loads(record["body"])

        enqueue_fax(fax_record)

The Webhook Processor

If we do successfully send the fax, or we’ve failed to send the fax a whole lot of times (we typically attempt to send a fax 20 times before giving up), we write a message to the Webhook Queue, which delivers a notification (via a POST HTTP request) back to the application that was trying to send the fax.
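
The give-up rule amounts to a counter check after each failed attempt. Here is a minimal sketch, assuming the attempt count travels with the fax record; the function and constant names are hypothetical.

```python
# We typically attempt to send a fax 20 times before giving up
MAX_FAX_ATTEMPTS = 20


def next_queue_after_failure(
    attempts_so_far: int, max_attempts: int = MAX_FAX_ATTEMPTS
) -> str:
    """Pick the queue for a fax whose latest attempt just failed.

    attempts_so_far includes the attempt that just failed.
    """
    if attempts_so_far >= max_attempts:
        return "webhook"  # give up and tell the application it failed
    return "retry"  # wait in the Retry Queue, then try again
```

Either way the application eventually hears back exactly once per fax: on success, or after the final failed attempt.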

The code for the Webhook Processor is also pretty simple — we just load up the message from the Webhook Queue and deliver it via an HTTP request:

    import requests

    # Take a webhook record from the queue and try to send it to the
    # calling application
    def handler(event: Any, context: Any) -> Any:
        # We set batchSize to 1 so there should always be 1 record
        assert len(event["Records"]) == 1
        record = event["Records"][0]

        # Webhook is a Python dataclass that corresponds with the
        # message that we write to the Webhook Queue
        webhook = Webhook.json_loads(record["body"])

        response = requests.post(
            webhook.callback_url,
            data=webhook.payload.json_dumps(),
            headers={"Content-Type": "application/json"},
        )

        # Raise on a 4xx/5xx response so the message is retried
        response.raise_for_status()

One important distinction here between the Webhook Processor and the Fax Processor is that the Webhook Processor uses the typical SQS/Lambda error handling: rather than have a separate retry queue for failed messages, we simply raise a Python exception. The SQS/Lambda integration uses this as a signal that the processor failed, and the message should be put back into the Webhook Queue to be sent again after the SQS visibility timeout. So if the application that’s supposed to handle this webhook fails and returns a 500, we’ll just end up processing this message again and re-send the webhook. We configure an SQS dead-letter queue so if we’ve tried to deliver the webhook a whole bunch of times, we’ll eventually give up and retain the message in the dead-letter queue for us to inspect and debug.
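
A dead-letter queue is attached to a source queue through the RedrivePolicy queue attribute. As a rough sketch, the attribute can be built like this; the helper name is hypothetical and the default of 5 receives is an assumed value, not the Fax Gateway's actual setting.

```python
import json
from typing import Dict


def webhook_queue_attributes(
    dead_letter_queue_arn: str, max_receive_count: int = 5
) -> Dict[str, str]:
    """Build SQS queue attributes that attach a dead-letter queue.

    max_receive_count is how many times a message may be received (and fail)
    before SQS moves it to the dead-letter queue. The default is an assumption.
    """
    redrive_policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself as a JSON-encoded string
    return {"RedrivePolicy": json.dumps(redrive_policy)}
```

The resulting dictionary can be passed as the Attributes argument when creating or updating the queue with boto3, or the same two fields can be expressed in the Serverless configuration.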

Using the Fax Gateway

We’ve released the Fax Gateway as an open-source project so anyone can use it to send faxes reliably. It’s easy to deploy in your own AWS account using the Serverless Framework.

To deploy the Fax Gateway, you’ll first want to fork the repository so you can make your own changes to the Serverless configuration. You’ll want to read through the configuration file to make sure it’s set up correctly for your AWS environment. In particular you’ll need to customize a few things:

  • We deploy our fax gateway in the us-west-2 region. If you use a different region, you’ll want to change the line that says region: us-west-2 to refer to your AWS region.
  • We deploy the Fax Gateway in our VPC, so we can use AWS security groups to ensure that only the Fax Gateway can send requests to our webhook endpoint. You’ll want to make sure that the vpc section of the configuration has the right configuration for your VPC, or remove that section altogether if you don’t want to deploy the Fax Gateway into a VPC.
  • We use Datadog to collect logs and metrics, so we’ve set up the Fax Gateway to deliver logs via the Datadog Forwarder. If you don’t use Datadog or the Datadog Forwarder, you’ll want to remove the serverless-plugin-datadog plugin.
  • We use Sentry to track errors. If you don’t use Sentry, remove the SENTRY_DSN and SENTRY_ENVIRONMENT environment variables.

We use AWS SSM to store the credentials for the Fax Gateway. You’ll need to configure the following SSM Parameters:

  • fax_gateway.common.sentry_dsn: If you use Sentry, set this to your Sentry DSN to send Fax Gateway errors to Sentry.
  • fax_gateway.common.twilio_sid and fax_gateway.common.twilio_auth_token: Set these to your Twilio credentials so the Fax Gateway can authenticate with Twilio to send faxes.
  • fax_gateway.<environment>.twilio_phone_number: Set this to the phone number in your Twilio account that the faxes should come from. If you plan to use multiple environments (we use local, staging, and prod environments) for your Serverless function, set one parameter per environment, such as fax_gateway.staging.twilio_phone_number and fax_gateway.local.twilio_phone_number, so you use a different fax number for each environment.
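
Before deploying, it can be handy to confirm that the parameters exist and are readable. The following is a small sketch for that check, not part of the Fax Gateway itself; the helper name is hypothetical, and the client is expected to be a boto3 SSM client passed in by the caller.

```python
from typing import Any, Dict, List

# Parameter names taken from the list above; add the phone number
# parameters for the environments you use.
REQUIRED_PARAMETERS = [
    "fax_gateway.common.twilio_sid",
    "fax_gateway.common.twilio_auth_token",
]


def load_parameters(ssm_client: Any, names: List[str]) -> Dict[str, str]:
    """Fetch each named SSM parameter and return a name -> value mapping.

    A missing parameter makes get_parameter raise, so a clean return
    means every name was found.
    """
    values = {}
    for name in names:
        response = ssm_client.get_parameter(Name=name, WithDecryption=True)
        values[name] = response["Parameter"]["Value"]
    return values
```

With real credentials this would be called as load_parameters(boto3.client("ssm"), REQUIRED_PARAMETERS). Avoid printing the returned values, since they are secrets.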

Check That You’re Registered, And Get A Mail-In Ballot!

You don’t need to be an expert in sending faxes to register to vote and request your ballot. You can use VoteAmerica to double-check that you’re registered to vote, and to request a mail-in ballot.
