Basic topic: Adjusting DNS records based on monitoring alerts to give a poor man’s high availability.
Project requirements:
- Low cost, like under $5/year.
- Reliable DNS service with low latency (like 30ms worldwide)
- GeoDNS, i.e. users directed to the closest server
- Failover, i.e. if closest server is down, direct users to the next-closest
- Low failover time, e.g. 5 minutes
- No Cloudflare
Background/motivation: High-quality managed DNS services often charge for, or limit, the number of health checks on their free or low-price tiers. Conversely, monitoring services offer a reasonable number of health checks for free. The goal is to marry the two, giving a poor man’s high availability via DNS. Such a service with a reputable managed DNS provider is otherwise not easy to find.
Overview:
The approach here works by translating Uptime Robot’s webhook alert contact into an NS1 API call which signals to NS1 whether a label (i.e. host) is up or down. The translation is done by an AWS Lambda function (since we want the translation itself to be highly available and not hosted on one of the servers we’re monitoring). The amount of AWS Lambda usage will be far under their ‘always-free’ level unless something extremely unusual happens.
The example in this wiki uses NS1’s managed DNS service. NS1 has a free tier that includes 50 resource records, 500k lookups per month (which is plenty for low-traffic web sites), and most usefully, one filter chain. Importantly, their free tier uses Anycast servers and the latency is very good. However, NS1 have dropped the number of health checks in the free tier from 2 to 1, meaning that failover among a pool of geo-targeted servers is that much more difficult than it was before, unless external monitors are used.
The example also uses Uptime Robot, which has a free tier of 50 monitors with 5-minute monitoring intervals and supports webhook alerts. It is pretty easy to adapt the example to other providers, though.
Steps to follow:
We’ll start by building a Python script that converts the Uptime Robot webhook into an NS1 API call. It is probably better to do this on your own Linux system, but it is possible to do directly in AWS. Here are the steps on CentOS 7, which should be similar for other distros:
cat <<'EOT' > lambda_function.py
import os
import json

import requests

def lambda_handler(event, context):
    # Only act if the caller knows our private key (see MYPRVKEY below)
    pskey = event['queryStringParameters']['key']
    if pskey == os.environ['MYPRVKEY']:
        url = "https://api.nsone.net/v1/feed/" + os.environ['NSONEFEED']
        headers = {'X-NSONE-Key': os.environ['NSONEKEY']}
        # Uptime Robot sends alertType=1 for down, 2 for up; subtracting 1
        # gives the 0/1 value NS1's "up" metadata expects
        updown = int(event['queryStringParameters']['alertType']) - 1
        label = event['queryStringParameters']['monitorFriendlyName']
        data = {label: {"up": updown}}
        requests.post(url, data=json.dumps(data), headers=headers)
    # API Gateway expects a JSON-serializable response
    return {'statusCode': 200, 'body': 'OK'}
EOT
mkdir awspkg
pip3 install --target ./awspkg requests
cd awspkg
zip -r9 ../awspkg.zip .
cd ..
zip -g awspkg.zip lambda_function.py
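Before uploading, you can sanity-check the translation logic locally. The sketch below uses a hypothetical helper (not part of lambda_function.py) that mirrors the alertType-to-up/down mapping without calling the NS1 API:

```python
import json

# Mirrors the translation in lambda_function.py without any network call.
# Assumption: Uptime Robot sends alertType=1 for "down" and 2 for "up".
def build_feed_payload(params):
    updown = int(params['alertType']) - 1   # 1 -> 0 (down), 2 -> 1 (up)
    label = params['monitorFriendlyName']   # must match the NS1 feed label
    return {label: {"up": updown}}

# A mock of the query-string parameters API Gateway would hand to the Lambda.
down_alert = {'key': 'secret', 'alertType': '1', 'monitorFriendlyName': 'web1'}
print(json.dumps(build_feed_payload(down_alert)))  # {"web1": {"up": 0}}
```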
At NS1:
- Click on your NS1 username in the top-right, choose “Account settings” then “Users and teams”. Click the “API Keys” tab and add a key. It only needs the “push to datafeeds” permission. Make a note of the key.
- Under the “Integrations” tab, go to “Data Sources”, click “add a data source” on the left, and set the source type to “NS1 API”.
- Click “Incoming feeds”, and add a feed to the data source you just set up. In the “label” field, enter the friendly name of the Uptime Robot monitor. Repeat this step for as many monitors as you need.
- Make a note of the feed URL, particularly the source string, which is the hex string after the last “/”.
- Set up the DNS for the example.com domain as usual. Go to the A record for example.com and click “Create filter chain”. Set up filter, geotarget and first N rules (beyond the scope of this wiki, see NS1’s help pages).
- Add an answer (i.e. IP address) for each server in your pool, and click the “up” filter. Choose the incoming feed that corresponds to the Uptime Robot monitor for that server.
At AWS:
- Go to the AWS console’s Lambda dashboard, click “Create function”, choose “author from scratch”, give it a name like URtoNS1, and choose the Python 3.6 runtime (note: this is selected to match what is on the Linux system used to create awspkg.zip; CentOS 7 comes with Python 3.6).
- In the “Function code” section under “Code entry type”, choose “Upload a .zip file” and upload the awspkg.zip file you created earlier.
- In the “Environment variables” section, make three entries: (1) set NSONEFEED to the hex string taken from the NS1 feed URL, (2) set NSONEKEY to the NS1 API key you generated, (3) set MYPRVKEY to a random string - this is simply checked so that your NS1 API key is not exposed if someone learns the AWS URL. In reality, you might want to modify the Python script to check that the caller is indeed Uptime Robot.
- Click “Add trigger”, choose “API Gateway”, choose “Create a new API”, then click “Add”. “Save” your changes (button in the top-right).
- In the Designer, click the “API Gateway” that you just created and scroll to the bottom of the page. Note the endpoint URL, which will be something like https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1
At Uptime Robot:
- Go to “My settings” and add an alert contact. The type should be “webhook”, notifications should be enabled for up & down, and the URL should be what you noted in the previous step, with your own key added, e.g.
https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1?key=your_MYPRVKEY&
- Add this contact to the monitor for each server in your pool. Remember that the translation assumes the “friendly name” for the monitor matches the label in the corresponding NS1 data feed.
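Because the webhook URL ends in “&”, Uptime Robot appends its standard parameters to it when an alert fires. The snippet below illustrates roughly what the Lambda then sees as query-string parameters for a “down” alert (the hostname and values are illustrative):

```python
from urllib.parse import urlsplit, parse_qs

# What an Uptime Robot "down" alert call might look like once the standard
# parameters (monitorFriendlyName, alertType, ...) have been appended.
url = ("https://abcd1234.execute-api.us-east-2.amazonaws.com/default/URtoNS1"
       "?key=your_MYPRVKEY&monitorFriendlyName=web1&alertType=1")
# Flatten parse_qs's lists to single values, as API Gateway does for the Lambda.
params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(params)
# {'key': 'your_MYPRVKEY', 'monitorFriendlyName': 'web1', 'alertType': '1'}
```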
Note that calls to the AWS HTTP API are not free (they cost $1 per million requests), but they are only made when Uptime Robot generates an up/down alert, so in practice your monthly charge is likely to be $0.
Concluding remarks
If you’re OK with Cloudflare, then use it; this wiki mightn’t be for you. If you’re OK with hosting your own DNS or distributed monitoring service, then likewise, this wiki mightn’t be for you. If your site is static, you can always use a CDN, but that doesn’t solve the issue for dynamic sites (or for non-website servers).
There are a number of ways you can adapt this wiki. First, you could host the monitoring itself on AWS (e.g. cloudping), but this will increase your Lambda usage considerably. NS1 also has a native AWS SNS data source, so you could publish up/down status to SNS and then subscribe to that in NS1 rather than using the NS1 API. Also note that the Python script has been trimmed for simplicity; in reality you might want to queue a failed API call for retry.
Finally, note that the TTL of the DNS record is a balance between failover time and how many lookups you’ll generate. Maybe start with a TTL of 1200 (20 minutes) and adjust it upwards if you’re getting close to your 500k lookup limit (for the NS1 free tier) or downwards if not.
At the end of the day, hopefully this helps you to geo-target a pool of low-end servers with failover capability using a premium DNS service, at a total cost of zero or close to it.