Basic topic: Adjusting DNS records based on monitoring alerts to give a poor man’s high availability.
Project requirements:
- Low cost, like under $5/year.
- Reliable DNS service with low latency (like 30ms worldwide)
- GeoDNS, i.e. users directed to the closest server
- Failover, i.e. if closest server is down, direct users to the next-closest
- Low failover time, e.g. 5 minutes
- No Cloudflare
Background/motivation: High-quality managed DNS services often charge for, or limit, the number of health checks on their free or low-price tiers. Conversely, monitoring services offer a reasonable number of health checks for free. The goal is to marry the two, giving a poor man’s high availability via DNS. Such a service with a reputable managed DNS provider is otherwise not easy to find.
Overview:
The approach here works by translating Uptime Robot’s webhook alert contact into an NS1 API call which signals to NS1 whether a label (i.e. host) is up or down. The translation is done by an AWS Lambda function (since we want the translation itself to be highly available and not hosted on one of the servers we’re monitoring). The amount of AWS Lambda usage will be far under their ‘always-free’ level unless something extremely unusual happens.
The example in this wiki uses NS1’s managed DNS service. NS1 has a free tier that includes 50 resource records, 500k lookups per month (which is plenty for low-traffic web sites), and most usefully, one filter chain. Importantly, their free tier uses Anycast servers and the latency is very good. However, NS1 have dropped the number of health checks in the free tier from 2 to 1, meaning that failover among a pool of geo-targeted servers is that much more difficult than it was before, unless external monitors are used.
The example also uses Uptime Robot, which has a free tier of 50 monitors with 5-minute monitoring intervals and supports webhook alerts. It is pretty easy to adapt the example to other providers, though.
Steps to follow:
We’ll start by building a Python script that converts the Uptime Robot webhook into an NS1 API call. It is probably better to do this on your own Linux system, but it is possible to do directly in AWS. Here are the steps on CentOS 7, which should be similar for other distros:
cat <<'EOT' > lambda_function.py
import os
import json

import requests

def lambda_handler(event, context):
    # Only act if the caller knows our private key (see MYPRVKEY below)
    pskey = event['queryStringParameters']['key']
    if pskey == os.environ['MYPRVKEY']:
        url = "https://api.nsone.net/v1/feed/" + os.environ['NSONEFEED']
        headers = {'X-NSONE-Key': os.environ['NSONEKEY']}
        # Uptime Robot sends alertType=1 for down, 2 for up; subtracting 1
        # gives the 0/1 value NS1's "up" metadata expects
        updown = int(event['queryStringParameters']['alertType']) - 1
        label = event['queryStringParameters']['monitorFriendlyName']
        data = {label: {"up": updown}}
        requests.post(url, data=json.dumps(data), headers=headers)
    # API Gateway expects a JSON-serializable response
    return {'statusCode': 200, 'body': 'OK'}
EOT
mkdir awspkg
pip3 install --target ./awspkg requests
cd awspkg
zip -r9 ../awspkg.zip .
cd ..
zip -g awspkg.zip lambda_function.py
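Before uploading, you can sanity-check the translation logic locally. The sketch below uses a hypothetical helper (not part of lambda_function.py) that mirrors the alertType-to-up/down mapping without calling the NS1 API:

```python
import json

# Mirrors the translation in lambda_function.py without any network call.
# Assumption: Uptime Robot sends alertType=1 for "down" and 2 for "up".
def build_feed_payload(params):
    updown = int(params['alertType']) - 1   # 1 -> 0 (down), 2 -> 1 (up)
    label = params['monitorFriendlyName']   # must match the NS1 feed label
    return {label: {"up": updown}}

# A mock of the query-string parameters API Gateway would hand to the Lambda.
down_alert = {'key': 'secret', 'alertType': '1', 'monitorFriendlyName': 'web1'}
print(json.dumps(build_feed_payload(down_alert)))  # {"web1": {"up": 0}}
```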
At NS1:
- Click on your NS1 username in the top-right, choose “Account settings” then “Users and teams”. Click the “API Keys” tab and add a key. It only needs the “push to datafeeds” permission. Make a note of the key.
- Under the “Integrations” tab, go to “Data Sources”, click “add a data source” on the left, and set the source type to “NS1 API”.
- Click “Incoming feeds”, and add a feed to the data source you just set up. In the “label” field, enter the friendly name of the Uptime Robot monitor. Repeat this step for as many monitors as you need.
- Make a note of the feed URL, particularly the source string, which is the hex string after the last “/”.
- Set up the DNS for the example.com domain as usual. Go to the A record for example.com and click “Create filter chain”. Set up filter, geotarget and first N rules (beyond the scope of this wiki, see NS1’s help pages).
- Add an answer (i.e. IP address) for each server in your pool, and click the “up” filter. Choose the incoming feed that corresponds to the Uptime Robot monitor for that server.
At AWS:
- Go to the AWS console’s Lambda dashboard, click “Create function”, choose “author from scratch”, give it a name like URtoNS1, and choose the Python 3.6 runtime (note: this is selected to match what is on the Linux system used to create awspkg.zip; CentOS 7 comes with Python 3.6).
- In the “Function code” section under “Code entry type”, choose “Upload a .zip file” and upload the awspkg.zip file you created earlier.
- In the “Environment variables” section, make three entries: (1) set NSONEFEED to the hex string taken from the NS1 feed URL, (2) set NSONEKEY to the NS1 API key you generated, (3) set MYPRVKEY to a random string - this is simply checked so that your NS1 API key is not exposed if someone learns the AWS URL. In reality, you might want to modify the Python script to check that the caller is indeed Uptime Robot.
- Click “Add trigger”, choose “API Gateway”, choose “Create a new API”, then click “Add”. “Save” your changes (button in the top-right).
- In the Designer, click the “API Gateway” that you just created and scroll to the bottom of the page. Note the endpoint URL, which will be something like https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1
At Uptime Robot:
- Go to “My settings” and add an alert contact. The type should be “webhook”, notifications should be enabled for up & down, and the URL should be what you noted in the previous step, with your own key added, e.g.
https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1?key=your_MYPRVKEY&
- Add this contact to the monitor for each server in your pool. Remember that the translation assumes the “friendly name” for the monitor matches the label in the corresponding NS1 data feed.
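Because the webhook URL ends in “&”, Uptime Robot appends its standard parameters to it when an alert fires. The snippet below illustrates roughly what the Lambda then sees as query-string parameters for a “down” alert (the hostname and values are illustrative):

```python
from urllib.parse import urlsplit, parse_qs

# What an Uptime Robot "down" alert call might look like once the standard
# parameters (monitorFriendlyName, alertType, ...) have been appended.
url = ("https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1"
       "?key=your_MYPRVKEY&monitorFriendlyName=web1&alertType=1")
# Flatten parse_qs's lists to single values, as API Gateway does for the Lambda.
params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(params)
# {'key': 'your_MYPRVKEY', 'monitorFriendlyName': 'web1', 'alertType': '1'}
```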
Note that calls to the AWS HTTP API are not free (they cost $1 per million requests), but they are only made when Uptime Robot generates an up/down alert, so in practice your monthly charge is likely to be $0.
Concluding remarks
If you’re OK with Cloudflare, then use it; this wiki mightn’t be for you. If you’re OK with hosting your own DNS or distributed monitoring service, then likewise, this wiki mightn’t be for you. If your site is static, you can always use a CDN, but that doesn’t solve the issue for dynamic sites (or for non-website servers).
There are a number of ways you can adapt this wiki. First, you could host the monitoring itself on AWS (e.g. cloudping), but this will increase your Lambda usage considerably. NS1 also has a native AWS SNS data source, so you could publish up/down status to SNS and then subscribe to that in NS1 rather than using the NS1 API. Also note that the Python script has been trimmed for simplicity; in reality you might want to queue a failed API call for retry.
Finally, note that the TTL of the DNS record is a balance between failover time and how many lookups you’ll generate. Maybe start with a TTL of 1200 (20 minutes) and adjust it upwards if you’re getting close to your 500k lookup limit (for the NS1 free tier) or downwards if not.
At the end of the day, hopefully this helps you to geo-target a pool of low-end servers with failover capability using a premium DNS service, at a total cost of zero or close to it.