A Simple and Cheap but Powerful Heartbeat Fault Detector for Intranet

October 08, 2020

For simple environments such as a slightly complicated home network or a small office, monitoring (of system resources, CPU load etc.) may not be particularly needed, but fault detection quickly becomes a must. For example, you have a name server in the intranet, and you want to know when the host is down. Most monitoring or uptime detection systems rely on polling: for a website, the uptime detector requests a page or pings a service. To have a reasonable detection system, the detector has to sit outside of the intranet, but this creates a problem, since the intranet systems are not accessible from outside. One (only?) reasonable way around this is to reverse the mechanism, from polling to pushing: the monitored system, not the monitoring system, periodically sends something, a heartbeat signal, and a missing heartbeat indicates there is a problem with the system or the service being monitored.
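
The push model boils down to very little logic on the detector side. Here is a standalone sketch (not part of the system described below; the names and data are made up): the detector remembers the last heartbeat time per system and flags anything that has been silent longer than a threshold.

```python
import time

def find_faults(last_seen, threshold=120):
    """Return the systems whose last heartbeat is older than threshold seconds."""
    now = time.time()
    return sorted(name for name, ts in last_seen.items() if now - ts > threshold)

# hypothetical data: last heartbeat time (epoch seconds) per monitored system
last_seen = {'nameserver': time.time() - 30, 'backup-host': time.time() - 600}
print(find_faults(last_seen))  # ['backup-host'] has been silent for ~10 minutes
```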

Such heartbeat-based services are available (although not as common as polling-based ones), but they probably require a non-free subscription. It is not difficult to build a very simple version of such a system; in fact, it is not difficult to build a very powerful one (in the sense of performance, availability and scalability). Here is an example.

I am using Google Cloud. To send emails, I am using SendGrid, which provides 100 emails/day for free, which I think is normally enough for a small system. The notification part of this system is not my focus, since it can easily be extended to individual needs (SMS/Telegram/Slack messages etc.).

I initially started to build the system on AppEngine, but then I realized it is not needed.

The system I describe below uses: Cloud Functions, Cloud Datastore, Cloud PubSub and Cloud Scheduler.

I represent each thing (a system, a service etc.) that sends a heartbeat by an entity in Cloud Datastore, which I call a definition.

Because I do not need the history of uptimes, faults or heartbeats, this system does not keep any data other than the definitions.

I have 3 cloud functions: ping, check1 and check2.

ping is the endpoint to which the monitored system regularly sends its heartbeat (so it is open to the public, Allow unauthenticated in Cloud Functions). The call is a simple GET or HEAD request. It is very easy to add this to a Linux host, just with a wget/curl command in crontab. ping is triggered by HTTP.
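
For example, a crontab entry like the following sends a heartbeat every minute (the function URL and the definition name n=myhost are placeholders, substitute your own):

```
# send a heartbeat every minute; -fsS keeps cron quiet unless curl fails
* * * * * curl -fsS 'https://REGION-PROJECT.cloudfunctions.net/ping?n=myhost' > /dev/null
```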

check1 is the first stage of fault detection. It is triggered by a PubSub message on a topic (check1). This message is created regularly (every minute) by Cloud Scheduler. check1's only job is to look at the datastore, create a message for each definition (N messages for N definitions), and publish them on another PubSub topic (check2).

The definitions are stored in Cloud Datastore. They have a very basic schema: name (as key), which is also used when calling ping (ping?n=), pinged (last ping time in epoch seconds) and status (1: all good, 0: first call, -1: not good/alarm). On a status change from 0 or 1 to -1, and from -1 to 1, a notification (email) is triggered. Since Google Cloud Console already includes an editor for Cloud Datastore, there is no need for a UI to add/remove/modify the definitions.
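
These status transitions can be written down as a small pure function. This is a sketch for illustration only (the real logic lives in check2 below): given the current status and whether the last heartbeat is fresh, it returns the new status and which notification, if any, to send.

```python
def transition(status, fresh):
    """Return (new_status, notification) for a definition.

    status: 0 (first call), 1 (all good) or -1 (alarm)
    fresh: True if the last ping is within the threshold
    """
    if not fresh:
        # 0 -> -1 and 1 -> -1 trigger a DOWN notification
        return (-1, 'DOWN') if status in (0, 1) else (-1, None)
    if status == 0:
        return (1, 'FIRST')   # first heartbeat seen
    if status == -1:
        return (1, 'UP')      # recovered
    return (1, None)          # still good, nothing to do
```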

check2 is the second and last stage of fault detection. It is triggered by a PubSub message on a topic (check2). This message contains the name of the definition to be checked. check2 basically checks the pinged value of the definition against the current time: if it is more than X seconds behind (I use 120 seconds), there is a fault, since the system did not send a heartbeat within the last X seconds. If this triggers a status change as described above, a notification (email) is sent.

The code for ping, check1 and check2 that I am using at the moment is below. Although it is a very simple solution, because of the nature of the services used on Google Cloud, it is highly available and, if needed, highly scalable without any extra effort.

from google.cloud import datastore
from google.cloud import pubsub_v1
from flask import abort
import time
import base64
import os
import sendgrid
import json
import datetime
from sendgrid.helpers.mail import *

TOPIC_CHECK2 = os.environ.get('TOPIC_CHECK2')
SENDGRID_APIKEY = os.environ.get('SENDGRID_APIKEY')

datastore_client = datastore.Client()
publisher = pubsub_v1.PublisherClient()

def ping(request):

    if request.method == 'GET' or request.method == 'HEAD':

        name = request.args.get('n')

        if name is None:
            return abort(400)

        print('name: %s' % name)

        k = datastore_client.key('Definition', name)
        e = datastore_client.get(k)
        now = int(time.time())
        if e is not None:
            if 'status' not in e:
                e['status'] = 0
            e['pinged'] = now
            datastore_client.put(e, timeout=30)
            return 'ok: %s' % now
        else:
            # no definition with this name
            return 'nok: %s' % now

    return abort(405)

def is_msg_too_old(ctx):
    # context.timestamp is in UTC, so compare against utcnow
    ts = datetime.datetime.strptime(ctx.timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    to_check = datetime.datetime.utcnow() - datetime.timedelta(seconds=120)
    return ts < to_check

def check1(event, context):
    if is_msg_too_old(context):
        print('omitting too old msg')
        return
    q = datastore_client.query(kind='Definition')
    r = list(q.fetch())
    for e in r:
        if 'status' in e:
            name = e.key.name
            doc = json.dumps({
                'name': name,
                'pinged': e['pinged'],
                'status': e['status']
            })
            print('check1: %s' % name)
            # do not b64 encode this, publish already does it
            # however the receiver cloud func has to base64 decode explicitly
            msg = doc.encode('utf-8')
            publisher.publish(TOPIC_CHECK2, msg)

def sendmail(msg):
    sg = sendgrid.SendGridAPIClient(SENDGRID_APIKEY)
    from_email = Email('<from_address_here>')
    to_email = To('<to_address_here>')
    content = Content('text/plain', msg)
    # msg is used both as the subject and as the body
    mail = Mail(from_email, to_email, msg, content)
    response = sg.client.mail.send.post(request_body=mail.get())

def recovered(name):
    sendmail('UP: %s' % name)

def failed(name):
    sendmail('DOWN: %s' % name)

def first(name):
    sendmail('FIRST: %s' % name)

def check2(event, context):
    if is_msg_too_old(context):
        print('omitting too old msg')
        return
    msg = event['data']
    # msg is always b64 encoded in cloud func, so decode it
    doc = json.loads(base64.b64decode(msg).decode('utf-8'))
    name = doc['name']
    pinged = doc['pinged']
    status = doc['status']
    print('check2: %s' % name)
    to_check = int(time.time()) - 120
    if pinged < to_check:
        # no heartbeat within the last 120 seconds
        if status == 0 or status == 1:
            print('changing status of %s to -1' % name)
            e = datastore.Entity(datastore_client.key('Definition', name))
            e.update({'status': -1, 'pinged': pinged})
            datastore_client.put(e, timeout=30)
            failed(name)
        elif status == -1:
            # do nothing
            print('status of %s was already -1, so omitting' % name)
        else:
            print('ERROR, invalid status: %s' % status)
    else:
        # heartbeat is recent
        if status == 0:
            print('changing status of %s from 0 to 1' % name)
            e = datastore.Entity(datastore_client.key('Definition', name))
            e.update({'status': 1, 'pinged': pinged})
            datastore_client.put(e, timeout=30)
            first(name)
        elif status == -1:
            print('changing status of %s from -1 to 1' % name)
            e = datastore.Entity(datastore_client.key('Definition', name))
            e.update({'status': 1, 'pinged': pinged})
            datastore_client.put(e, timeout=30)
            recovered(name)
        elif status == 1:
            # do nothing
            print('status of %s was already 1, so omitting' % name)
        else:
            print('ERROR, invalid status: %s' % status)

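The base64 comments in check1 and check2 are easy to get wrong, so here is a quick local sanity check of the round trip (standalone, no Google Cloud needed): check1 passes raw UTF-8 bytes to publish(), Cloud Functions delivers the PubSub payload base64-encoded in event['data'], and check2 decodes it explicitly.

```python
import base64
import json

# what check1 builds and passes to publisher.publish()
doc = json.dumps({'name': 'myhost', 'pinged': 1602144000, 'status': 1})
wire = doc.encode('utf-8')

# what a background Cloud Function receives (b64-encoded by the platform)
event = {'data': base64.b64encode(wire)}

# what check2 does to recover the document
received = json.loads(base64.b64decode(event['data']).decode('utf-8'))
assert received == {'name': 'myhost', 'pinged': 1602144000, 'status': 1}
```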
Let's try to calculate the cost of using Google Cloud. For N definitions that send a heartbeat every minute, and with check1 also running every minute, there will be, per minute:

  • N calls to ping (N datastore gets, N datastore puts)
  • 1 pubsub message (on check1) -> 1 call to check1 (N datastore gets) -> N pubsub messages (on check2)
  • N calls to check2 (at most N datastore puts, at least 0, 1 put per status change)

Let's take N=100 as an example.

  • The stored data on Datastore is obviously less than 1 GB, which is the free quota. Ingress and egress within the same region are also free. There are 1440*(N+N) document reads and 1440*N document writes per day at minimum. So document reads cost 1440*200/100000*0.042 ≈ 0.1 and document writes cost 1440*100/100000*0.126 ≈ 0.12, in total 0.22 USD per day. I did discount the free quota.

  • Total function invocations are 43200*(N+1+N) per month, so they cost 43200*201/2000000*0.4 ≈ 1.73. I am using a 128 MB instance, and it looks like the calls usually complete under 100 ms, so each call can be counted as one 100 ms unit: 43200*201*0.000000231 ≈ 2. I did not discount the free quota here. Also, inbound data (the ping HTTP call) is free, and outbound data within the same region is free. So in total, this is 3.73 USD per month.

  • Cloud Scheduler gives 3 free jobs, so there is no cost of that.

  • For Cloud PubSub, there are 43200*(1+N) messages per month, and each message is less than 1000 bytes, so 1 KB can be used as the size. This is 43200*101*1K, which is around 4.3 GB per month, and Cloud PubSub gives 10 GiB for free. So there is no cost for this as well.

So if my calculation above is correct, for N=100 you need to pay 0.22 USD per day plus 3.73 USD per month, in total approx. 10.3 USD per month. This is the minimum; every status change costs a bit extra because of the datastore write operation, but that is going to be very small compared to everything happening every minute. You can monitor or limit the actual cost by defining a Budget in Google Cloud.
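
As a sanity check, the per-month arithmetic above can be reproduced in a few lines. The unit prices and free quotas are the ones quoted in this post; treat them as assumptions, since cloud pricing changes over time (and note that the Datastore figures above additionally discount the daily free quota, which this sketch does not):

```python
N = 100
MIN_PER_MONTH = 1440 * 30  # 43200 scheduler ticks per month

# Cloud Functions: N pings + 1 check1 + N check2 invocations per minute
invocations = MIN_PER_MONTH * (2 * N + 1)
invocation_cost = invocations / 2000000 * 0.4    # invocation pricing as in the post
compute_cost = invocations * 0.000000231         # one 100 ms / 128 MB unit per call
print(round(invocation_cost + compute_cost, 2))  # 3.74

# Datastore, per day: 2N reads and N writes per minute (before free-quota discount)
reads_daily = 1440 * 2 * N / 100000 * 0.042
writes_daily = 1440 * N / 100000 * 0.126
print(round(reads_daily + writes_daily, 2))      # 0.3

# PubSub volume: (N+1) messages per minute, counted as 1 KB each
pubsub_gb = MIN_PER_MONTH * (N + 1) * 1024 / 1024**3
print(round(pubsub_gb, 1))                       # 4.2, under the 10 GiB free tier
```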

I have been using this system for a week or so, for only a few hosts, and it works without any problem. I do not know the monthly cost yet, but I can see the daily cost is around 0.01 USD (without the discount).