
Rebuilding my monitoring infrastructure

Monitoring, Networking, IPv6

I was recently bored and decided to finally fix my homelab monitoring infrastructure. This post outlines its architecture, and involves using IPv6 to solve an actual issue!


Overview #

My homelab is heavily segmented (due to reasons): more or less every service/container runs in its own subnet, and all communication is routed through my central firewall. I keep an out-of-band management network for my physical nodes as a backup in case I really mess something up. I also run a few VPSes which I want to monitor, plus some remote machines behind CGNAT.

Initially I kept adding forwarding rules to my firewall for internal east-west traffic, but at some point I realized I could just use Tailscale for both internal and external communication with all my devices. I also liked using Tailscale ACLs after spending ~5 minutes reading up on them.

The problem #

In the example image below, when the Prometheus node (scraper) scraped container-1, the traffic went through the firewall. While this was fine for internal east-west traffic, the number of forwarding rules kept growing, which was tedious.

East-west traffic example

Once I introduced Tailscale to the mix this got worse – now my scraper node had to use a public relay for east-west communication! While I think I could have set up an internal STUN server, that would have increased the overall complexity, which I wasn't interested in.

Luckily I have an IPv6 tunnel through Tunnelbroker.net, giving me a routed /48 [1] to use in my homelab. Unfortunately these addresses are basically worthless for humans, as spammers and scammers have abused the prefix so I can't use Netflix etc. over IPv6… but that's not an issue for machine-to-machine communication within the same /48, as it's routed 'internally' through my router while still being globally reachable!

The solution #

In short, this is how my setup works:

  1. I’ve assigned each relevant internal subnet its own globally routable IPv6 /64 prefix (using CoreRAD; see the sketch after this list)
  2. I’ve set up strict ACLs for my tailnet, and assign them to hosts using tags
  3. I’ve allowed incoming and outgoing traffic on UDP :41641 for a subset of my IPv6 allocation [2]
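
A rough sketch of the CoreRAD configuration from step 1 (the interface name and prefix here are placeholders rather than my real values):

[[interfaces]]
name = "eth0"
advertise = true

  [[interfaces.prefix]]
  # "::/64" tells CoreRAD to advertise every /64 assigned to the interface;
  # a specific prefix like 2001:db8:100:1::/64 also works
  prefix = "::/64"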

So now my nodes have globally routable addresses, yet all of them are behind my firewall, nearly all my tailnet traffic [3] is routed internally, and Tailscale ACLs limit what each machine can reach over the tailnet. I’m pretty happy with this solution, especially as it works identically for both internal and remote nodes.

Tailscale ACLs #

This is a partial ACL configuration that allows hosts tagged scraper to talk to machines tagged monitor on 9100/tcp (node-exporter) and 161/udp (SNMP). Machines tagged monitor can also talk to scraper hosts on 162/udp (SNMP traps). All other traffic over the tailnet is refused.

{
    "tagOwners": {
        "tag:monitor": [],
        "tag:scraper": []
    },

    "acls": [
        {
            "action": "accept",
            "src":    ["tag:scraper"],
            "dst":    ["tag:monitor:9100"],
            "proto":  "tcp"
        },

        {
            "action": "accept",
            "src":    ["tag:scraper"],
            "dst":    ["tag:monitor:161"],
            "proto":  "udp"
        },

        {
            "action": "accept",
            "src":    ["tag:monitor"],
            "dst":    ["tag:scraper:162"],
            "proto":  "udp"
        }
    ]
}

I assign ACL tags either when enrolling hosts or through the admin console.
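
For example, when enrolling a host I can request its tag directly (it still has to be allowed by tagOwners, or assigned by an admin):

sudo tailscale up --advertise-tags=tag:monitor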

Firewall setup #

My nftables setup is basically:

  1. I keep a few named sets containing my relevant subnets
  2. I refer to those sets in my forward chain to allow the relevant traffic

Below is not a complete configuration:

set net6_svc {
    type ipv6_addr; flags interval;
    elements = { 2001:db8:100::/56 }
}
# [...]
chain forward {
    type filter hook forward priority 0; policy drop;
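    # Allow new outbound Tailscale (WireGuard, UDP source port 41641) flows
    # from my service prefixes; the matching inbound rule is left out here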
    meta l4proto udp th sport 41641 ip6 saddr @net6_svc ct state new accept
}

You get the idea…

Metrics collection #

This is just an Incus container running at home. I’ve provisioned it using Ansible and the excellent prometheus-community/ansible collection. I’ve not yet used the Tailscale API as a data source for Ansible, but I think it should be feasible to use that instead of manually adding scrape targets.
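
As a sketch, the provisioning boils down to a playbook along these lines (the group names are made up, not my actual inventory):

# provision-monitoring.yml
- hosts: scraper
  become: true
  roles:
    - prometheus.prometheus.prometheus

- hosts: monitored
  become: true
  roles:
    - prometheus.prometheus.node_exporter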

Obviously not a complete example:

# /etc/prometheus/prometheus.yaml
scrape_configs:
  - job_name: prometheus
    metrics_path: /prometheus/metrics
    static_configs:
    - targets:
      - scraper:9090
  - job_name: node
    file_sd_configs:
    - files:
      - /etc/prometheus/file_sd/node.yml

Since I use MagicDNS I can just write hostnames instead of IP addresses, which is nice.

# /etc/prometheus/file_sd/node.yml
- targets:
  - scraper:9100
  - host1:9100
  - host2:9100

Bonus: Caddy #

I like using Caddy and I liked it even more when I discovered that it has an integration for provisioning valid certificates using Tailscale! This integrates well with using ACLs/tags, so I can expose an internal service with a browser-valid certificate over my tailnet, while using ACLs to limit who can reach said service.

A basic Caddyfile for my scraper node might look like this:

{
  email oscar@example.com
}

# Remember to change your tailnet domain alias!
scraper.foo-bar.ts.net {
  reverse_proxy http://localhost:9090

  # Or, to expose multiple services using subdirs
  # reverse_proxy /prometheus* http://localhost:9090
  # reverse_proxy /grafana* http://localhost:3000
}

Then, from a machine that has the necessary ACLs, I can reach https://scraper.foo-bar.ts.net, which will present a browser-valid certificate (and manage its lifecycle).
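
One caveat: tailscaled only hands out certificates to root by default, so Caddy has to be explicitly permitted to request them. If I remember the docs correctly, on a systemd-based distro that means adding the following to tailscaled's environment file and restarting the service:

# /etc/default/tailscaled (path may differ between distros)
TS_PERMIT_CERT_UID=caddy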

Conclusion #

While I guess I could have set up a similar infrastructure myself, using Tailscale was much easier and a lot of fun.


  1. That’s 65536 subnets, each with 2^64 (18446744073709551616) available addresses. Should be enough for me!

  2. I’ve allocated a few /56s (each containing 256 /64 subnets) for different purposes – one for management networks, one for IoT, one (unused) for clients and one for services.

  3. My remote devices are VPSes or SBCs in remote locations; they can be reached using the normal NAT traversal techniques Tailscale uses. If I had another remote location with a complex networking setup requiring something similar, I think I would get myself another hobby…