Proxmox private SDN network between two clusters
The story of setting up a private network between two distinct Proxmox clusters in different locations, based on WireGuard and an SDN EVPN zone.
I started with an already working Proxmox homelab setup (diagram below). It's quite typical, nothing special besides the paranoid usage of the bpg Proxmox OpenTofu provider (which I really love and also contribute to) for everything that was possible, and Ansible for node/VM setup. I think that's a strong requirement: a simple "fire and forget" approach to setting up what this article describes is probably a very bad idea, to the point where you'll likely end up with a messed up, broken environment and a non-working SDN (for one of dozens of non-trivial reasons).
Enjoy!
Public Internet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
roz wro
================ ================
+---------------------------+ +---------------------------+
| ISP Router (roz) | | ISP Router (wro) |
| WAN: public IP STATIC | | WAN: public IP DYNAMIC |
| LAN: 192.168.1.0/24 | | LAN: 192.168.1.0/24 |
+-------------+-------------+ +-------------+-------------+
| |
| 192.168.1.35 | 192.168.1.5
+-------------v-------------+ +-------------v-------------+
| My Router (roz) | | Proxmox (wro) |
| WAN: 192.168.1.35/24 | | mgmt: 192.168.1.5 |
| LAN: 192.168.2.0/16 | | |
+-------------+-------------+ | VMs: |
| | - ddns_updater VM |
| 192.168.2.3 | 192.168.1.7 |
| | - ... |
+-------------v-------------+ +---------------------------+
| Proxmox (roz) |
| mgmt: 192.168.2.3 |
| |
| VMs: |
| - haproxy VM |
| 192.168.3.1 |
| - ... |
+---------------------------+
The goals were the following:
- Waste a lot of time.
- Have fun learning SDN.
- (If you think I needed it for anything real, then you are wrong.)
- Be able to connect from VMs in roz to VMs in wro and the reverse. For example, in the diagram above, the haproxy VM cannot reach the ddns_updater VM other than through a NAT-forwarded port. With the cross-site network, it's as simple as connecting between VMs within a single cluster in the initial setup.
- Possibility to extend it with another cluster. In theory, this scales to new nodes within each cluster with minimal effort, and with moderate effort to span a new 10.3.0.0/16 zone in a new location (ideally, I can try that some day and write a follow-up; if it can be done in under 2 hours I'd call it a success, whereas something like 2 days would be a fail).
- Have the firewall enabled. A lot of things are much simpler and just work with the PVE firewall disabled.
This is more or less the result:
Public Internet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
roz wro
================== ==================
+----------------------------+ +----------------------------+
| Proxmox (roz) | | Proxmox (wro) |
| mgmt: 192.168.2.3 | | mgmt: 192.168.1.5 |
| WG: 10.1.0.1/16 | | WG: 10.2.0.1/16 |
| FRR + evpnctl | | FRR + evpnctl |
+--------------+-------------+ +--------------+-------------+
| |
| WireGuard (UDP/51820) |
+=========== tunnel ================+
| |
| EVPN / VXLAN overlay |
+==== BGP TCP/179 | VXLAN UDP/4789 =+
| (over WireGuard) |
| |
Overlay VM subnets (EVPN/VXLAN)
--------------------------------
roz: 10.1.1.0/24 wro: 10.2.1.0/24
+----------------------------+ +----------------------------+
| VMs: | | VMs: |
| - haproxy VM | | - ddns_updater VM |
| 10.1.1.31 (main, eth0) | | 10.2.1.7 (main, eth0) |
| 192.168.3.1 (eth1) | | 192.168.1.7 (eth1) |
| - ... | | - ... |
+----------------------------+ +----------------------------+
Final working shape
- Early traps: trying to use a VXLAN zone and/or fabrics instead of an EVPN zone over WireGuard.
- Underlay: WireGuard between the Proxmox hosts, set up with Ansible.
- Overlay: a Proxmox SDN EVPN zone controlled by evpnctl (FRR). VXLAN encapsulation runs over the WireGuard underlay.
- Per-site SDN subnets and VMs:
  - Rozalin: 10.1.1.0/24 (gateway 10.1.1.1, sdntest VM at 10.1.1.11)
  - Wro: 10.2.1.0/24 (gateway 10.2.1.1, sdntest VM at 10.2.1.9)
- Critical fix: snat=true (enabling SNAT for the subnet is needed).
- Critical fix: exit_nodes_local_routing=false (enabling "local routing" was the trap).
- Firewall: the Proxmox cluster firewall must explicitly allow FORWARD and IN rules for the SDN subnets, between nodes at the public-network (WireGuard) level, and some IN rules for the WireGuard network interface.
Early traps
Early on I tried to use cross-site VXLAN zone.
This looks like exactly what we want:
You have to configure the underlay network yourself to enable UDP connectivity between all peers. You can, for example, create a VXLAN overlay network on top of public internet, appearing to the VMs as if they share the same local Layer 2 network.
Perfect. We can have both clusters' nodes expose a UDP port over NAT. If a new node is added, a new port is forwarded and a new peer is connected. There is even a built-in way to get IPsec encryption using strongSwan, because there is no encryption by default.
But a quick look at the provider resource proxmox_virtual_environment_sdn_zone_vxlan, confirmed in the PVE API viewer, shows that you must use a static IP for every peer. That's not going to work with a dynamic public IP: I have a dyndns domain, not a static IP, in the wro region. Besides dyndns, hardcoded IPs mean limited flexibility, and for a network like this you don't want breaking changes after setup.
Then I read about and tried using Fabrics with an EVPN zone.
Fabrics in Proxmox VE SDN provide automated routing between nodes in a cluster. They simplify the configuration of underlay networks between nodes to form the foundation for SDN deployments
Another dead end. They are indeed a nice foundation for an intra-cluster network (for things like Ceph, which makes sense), but they won't help with a multi-cluster network.
After another spike and some searching for solutions, it turned out that a WireGuard VPN between the nodes, with EVPN SDN on top of it, should work as expected.
Underlay WireGuard
There is an Ansible role – ansible-role-wireguard – which gives a seamless setup based on the assumption that you have all nodes in a single inventory (and hosts group). There is a great README on GitHub, so I am not going into the details. Thankfully, I already have that: a proxmox group and a dedicated playbook_proxmox.yml playbook for Proxmox nodes (whether roz or wro). The only nitpick is the linear strategy (free doesn't work well).
# playbook_proxmox.yml
# Ignore the meaningless details; the point is that every node gets the same setup.
# I keep the "extra" data on purpose to be precise, this is not a light read anyway.
---
- hosts: proxmox
become: yes
strategy: linear
tasks:
- import_role:
name: proxmox_host
- import_role:
name: apt_upgrade_and_common
- import_role:
name: authorized_keys
vars:
ssh_keys:
- "2025_01_27_rafsaf_rycerski_servers.txt"
- "2025_01_28_rafsaf_szczery_servers.txt"
Host vars for the wro and roz nodes. This is all documented in the Ansible role. In short, let's take roz: the WireGuard address will be 10.1.0.1, and I plan to reserve 10.1.0.0/24 for new nodes, so a new node roz2 would get "10.1.0.2/16" as wireguard_addresses and the same wireguard_allowed_ips. wireguard_endpoint is a DNS name pointing at the public IP (whether dynamic or static does not matter). It is required so that wireguard_port is reachable over UDP across the public internet (a quick tunnel check is sketched after the host vars below).
Please ignore evpn related configuration for now (described later on).
# roz host_vars
wireguard_addresses:
- "10.1.0.1/16"
wireguard_endpoint: "roz.rafsaf.pl"
wireguard_allowed_ips: "10.1.0.0/16"
wireguard_persistent_keepalive: "25"
wireguard_port: "51820"
wireguard_interface_restart: true
evpn_controller_name: "evpnctl"
evpn_controller_asn: 65000
evpn_controller_peers:
- "10.2.0.1" # wro Wireguard IP
- "10.1.0.1" # roz Wireguard IP
# wro host_vars
wireguard_addresses:
- "10.2.0.1/16"
wireguard_endpoint: "wro.rafsaf.pl"
wireguard_allowed_ips: "10.2.0.0/16"
wireguard_persistent_keepalive: "25"
wireguard_port: "51820"
wireguard_interface_restart: true
evpn_controller_name: "evpnctl"
evpn_controller_asn: 65000
evpn_controller_peers:
- "10.2.0.1" # wro Wireguard IP
- "10.1.0.1" # roz Wireguard IP
With the configuration in place, let's go over the proxmox_host role. I've split it into three parts and describe each below.
# proxmox_host role - part 1/3
- name: Install required apt libs
apt:
update_cache: yes
name:
- sudo
- lsb-release
- frr
- frr-pythontools
state: present
- name: Enable FRR service
service:
name: frr
enabled: yes
state: started
- name: Configure WireGuard for cross-site VXLAN
ansible.builtin.import_role:
name: githubixx.ansible_role_wireguard
The role uses lsb_release, which is why it's added; same for sudo, so that become works without issues. Other than that, frr and frr-pythontools are there simply because the Proxmox wiki says so:
To enable the EVPN controller, you need to enable FRR on every node, see install FRRouting.
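Before moving on, a quick sanity check that FRR is really installed and running on each node (plain systemd/FRR commands, nothing Proxmox-specific):
systemctl is-active frr          # should print "active"
vtysh -c "show version"          # confirms the FRR shell works on the node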
# proxmox_host role - part 2/3
# =============================================================================
# EVPN Controller
# =============================================================================
- name: Check if EVPN controller exists
shell: pvesh get /cluster/sdn/controllers/{{ evpn_controller_name }} --output-format json 2>/dev/null || echo "NOT_FOUND"
register: evpn_controller_check
changed_when: false
failed_when: false
- name: Create EVPN controller
shell: |
pvesh create /cluster/sdn/controllers \
  --controller {{ evpn_controller_name }} \
  --type evpn \
  --asn {{ evpn_controller_asn }} \
  --peers "{{ evpn_controller_peers | join(',') }}"
when:
- evpn_controller_check.stdout == "NOT_FOUND"
register: evpn_controller_created
- name: Apply SDN configuration after controller creation
shell: pvesh set /cluster/sdn
when:
- evpn_controller_created is changed
Why do I create the EVPN controller via Ansible instead of using the already mentioned OpenTofu Proxmox provider? Because in v0.93.0, as of Jan 2026, there is no controller resource support yet. As it's a rather simple API, this will probably be added in the near future. Another option is to just use the GUI.
The EVPN controller is a requirement for the SDN EVPN zone. It takes simple arguments: a unique (basically random) ASN and a list of peers, where:
Peers - An IP list of all nodes that are part of the EVPN zone. (could also be external nodes or route reflector servers)
Of course, we use our internal WireGuard IPs for that: 10.2.0.1 for wro and 10.1.0.1 for roz. What if, again, we add a new node 10.1.0.2 to the roz cluster? Then we need to update the peers list on the existing controllers. There is no such logic in the simplified role above; doing it via OpenTofu would be much more convenient.
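Until then, updating the peers by hand is a two-liner; a sketch assuming the controller id evpnctl and a hypothetical new roz node at 10.1.0.2:
# extend the peer list on the existing controller, then re-apply the SDN configuration
pvesh set /cluster/sdn/controllers/evpnctl --peers "10.1.0.1,10.1.0.2,10.2.0.1"
pvesh set /cluster/sdn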
# proxmox_host role - part 3/3
# =============================================================================
# EVPN Exit Node - Required sysctl settings
# =============================================================================
- name: Enable IP forwarding for EVPN exit node routing
ansible.posix.sysctl:
name: ""
value: "1"
sysctl_set: yes
state: present
reload: yes
loop:
- net.ipv4.ip_forward
- net.bridge.bridge-nf-call-ip6tables
- net.bridge.bridge-nf-call-iptables
- name: Disable rp_filter (required for EVPN exit nodes)
ansible.posix.sysctl:
name: ""
value: "0"
sysctl_set: yes
state: present
reload: yes
loop:
- net.ipv4.conf.default.rp_filter
- net.ipv4.conf.all.rp_filter
Last but not least, sysctl modifications that will help later. All of them are required, except maybe the rp_filter settings, but see Multiple EVPN Exit Nodes in the Proxmox docs; they encourage doing it right away to allow packet flow between nodes.
- net.ipv4.ip_forward=1 – controls whether Linux can forward IPv4 packets between network interfaces.
- net.bridge.bridge-nf-call-iptables=1 – see the libvirt wiki; without it, packets were stuck on the PVE firewall in my case (from VM1 in wro to VM2 in roz to be precise, not at the WireGuard level). net.bridge.bridge-nf-call-ip6tables is not really needed, but it won't hurt to already fix IPv6.
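A quick way to confirm the values actually took effect on a node (plain sysctl reads):
sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables net.ipv4.conf.all.rp_filter
# expected: ip_forward = 1, bridge-nf-call-iptables = 1, rp_filter = 0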
With all of that in place and a successful run of this Ansible playbook on the Proxmox nodes in both regions:
- WireGuard is working between the nodes, i.e. encrypted traffic flows between 10.2.0.1 and 10.1.0.1 (test e.g. with ping or telnet to a known open port)
- the EVPN controller is in place as the SDN EVPN foundation (this can be moved to an OpenTofu resource once it exists)
- the sysctl variables are prepared
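Once the SDN configuration is applied (the zone comes in the next section), the BGP sessions between the WireGuard IPs can be checked from FRR on either node; a sketch using standard vtysh commands:
vtysh -c "show bgp summary"              # the 10.x.0.1 peer from the other site should be Established
vtysh -c "show bgp l2vpn evpn summary"   # same, for the EVPN address family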
Overlay SDN
Below is the Proxmox SDN setup in the roz region. In wro, the only difference is that the subnet is obviously part of 10.2.x.x rather than 10.1.x.x; besides that, the same setup is used.
resource "proxmox_virtual_environment_sdn_zone_evpn" "crosssite" {
id = "crossite"
controller = "evpnctl"
vrf_vxlan = 99999
mtu = 1370
advertise_subnets = true
disable_arp_nd_suppression = false
exit_nodes = [local.proxmox_dell_r740_node_name]
exit_nodes_local_routing = false
ipam = "pve"
depends_on = [
proxmox_virtual_environment_sdn_applier.evpn_finalizer
]
}
resource "proxmox_virtual_environment_sdn_vnet" "crosssite_vnet" {
id = "crossvnt"
zone = proxmox_virtual_environment_sdn_zone_evpn.crosssite.id
tag = 100000
alias = "Cross-site EVPN network - Rozalin subnet"
depends_on = [
proxmox_virtual_environment_sdn_applier.evpn_finalizer
]
}
resource "proxmox_virtual_environment_sdn_subnet" "crosssite_subnet" {
cidr = "10.1.1.0/24"
vnet = proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id
gateway = "10.1.1.1"
snat = true
depends_on = [
proxmox_virtual_environment_sdn_applier.evpn_finalizer
]
}
resource "proxmox_virtual_environment_sdn_applier" "evpn_applier" {
lifecycle {
replace_triggered_by = [
proxmox_virtual_environment_sdn_zone_evpn.crosssite,
proxmox_virtual_environment_sdn_vnet.crosssite_vnet,
proxmox_virtual_environment_sdn_subnet.crosssite_subnet,
]
}
depends_on = [
proxmox_virtual_environment_sdn_zone_evpn.crosssite,
proxmox_virtual_environment_sdn_vnet.crosssite_vnet,
proxmox_virtual_environment_sdn_subnet.crosssite_subnet,
]
}
resource "proxmox_virtual_environment_sdn_applier" "evpn_finalizer" {
}
Clarifications:
proxmox_virtual_environment_sdn_zone_evpn:
- arbitrary id
- controller – the id of the EVPN controller created via Ansible, in my case "evpnctl"
- vrf_vxlan 99999 – arbitrary value:
A VXLAN-ID used for dedicated routing interconnect between VNets. It must be different than the VXLAN-ID of the VNets
- MTU reduced for VXLAN (50 bytes) + WireGuard (~60-80 bytes): 1500 - 50 - 80 = 1370. In fact, it's possible that 1370 is not the perfect value and packet fragmentation may still occur, but I didn't dig that deep for now (a quick check is sketched after this list). Whatever works, I guess, unless you're building a corporate-grade network, but then you're probably not reading this article.
- advertise_subnets true (Announce the full subnet in the EVPN network) seems useful, but I didn't check the result without it
- disable_arp_nd_suppression false (again, at your discretion)
- exit_nodes – all of them; in this case, only one
- exit_nodes_local_routing false (!!!) – a few hours of debugging were spent here:
This is a special option if you need to reach a VM/CT service from an exit node. (By default, the exit nodes only allow forwarding traffic between real network and EVPN network). Optional.
This looks innocent and "useful", but in my case it broke packets between VMs on different clusters (vm1 in wro to vm1 in roz), as described in the section below.
- ipam pve – the default one
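A minimal MTU check from a VM in the overlay, assuming the guest interface really uses 1370 and that 10.2.1.9 (the wro sdntest VM) answers ICMP; the 28 bytes are the usual IP + ICMP headers:
ping -M do -s 1342 -c 3 10.2.1.9    # 1342 + 28 = 1370, should pass without fragmentation
ping -M do -s 1343 -c 3 10.2.1.9    # one byte more, should fail if 1370 is really the path MTU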
proxmox_virtual_environment_sdn_vnet:
- arbitrary id
- zone id from above
- arbitrary tag
- some alias
proxmox_virtual_environment_sdn_subnet:
- cidr matching the roz zone 10.1.0.0/16. Why 10.1.1.0/24? A better question is: why not. It ensures there is no overlap with 10.1.0.0/24, which is reserved for roz nodes.
- vnet id from above
- arbitrary 10.1.1.1 gateway – the standard choice for 10.1.1.0/24
- snat true – required with this setup; the section below describes why.
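After applying this with OpenTofu, the resulting SDN state can be sanity-checked from a node shell; a sketch (the zone id crossite and vnet id crossvnt come from the resources above, interface names may differ slightly between PVE versions):
pvesh get /cluster/sdn/zones --output-format json    # should list the crossite EVPN zone
pvesh get /cluster/sdn/vnets --output-format json    # should list crossvnt
ip -br link | grep -iE 'crossvnt|vrf'                # bridge and VRF devices created by the SDN layer
vtysh -c "show evpn vni"                             # VNIs 100000 (vnet tag) and 99999 (vrf_vxlan) should appear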
Per-site SDN subnets and VMs
# vm_sdntest.tf in roz
locals {
proxmox_dell_r740_node_name = "proxmox-dell-r740"
sdntest_ip = "192.168.3.12"
sdntest_evpn_ip = "10.1.1.11"
sdntest_vm_id = 123
sdntest_tag = "sdntest"
sdntest_memory = 512
sdntest_cores = 1
sdntest_size = 6
}
resource "proxmox_virtual_environment_vm" "sdntest" {
name = local.sdntest_tag
description = "Managed by Terraform - EVPN test VM"
tags = [local.sdntest_ip, local.sdntest_tag]
node_name = local.proxmox_dell_r740_node_name
vm_id = local.sdntest_vm_id
protection = false
startup {
order = 200
up_delay = 30
}
agent {
enabled = true
timeout = "10s"
}
disk {
datastore_id = local.proxmox_dell_r740_datastore_ssd
file_id = proxmox_virtual_environment_download_file.dell_r740__debian_13_trixie_qcow2_img.id
interface = "scsi0"
size = local.sdntest_size
file_format = "raw"
}
initialization {
datastore_id = local.proxmox_dell_r740_datastore_ssd
user_data_file_id = proxmox_virtual_environment_file.cloud_config_debian.id
ip_config {
ipv4 {
address = "${local.sdntest_evpn_ip}/24"
gateway = proxmox_virtual_environment_sdn_subnet.crosssite_subnet.gateway
}
}
ip_config {
ipv4 {
address = "${local.sdntest_ip}/16"
}
}
}
network_device {
bridge = proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id
enabled = true
firewall = true
}
network_device {
bridge = "vmbr0"
enabled = true
firewall = true
}
memory {
dedicated = local.sdntest_memory
}
cpu {
architecture = "x86_64"
cores = local.sdntest_cores
flags = ["+md-clear", "+spec-ctrl", "+ssbd", "+pcid"]
sockets = 1
type = "x86-64-v4"
units = 100
}
operating_system {
type = "l26"
}
keyboard_layout = "pl"
started = true
migrate = true
}
resource "proxmox_virtual_environment_firewall_options" "tf-sdntest" {
depends_on = [proxmox_virtual_environment_vm.sdntest]
node_name = local.proxmox_dell_r740_node_name
vm_id = local.sdntest_vm_id
enabled = true
dhcp = false
ipfilter = false
log_level_in = "err"
log_level_out = "err"
macfilter = false
ndp = true
input_policy = "REJECT"
output_policy = "ACCEPT"
radv = false
}
resource "proxmox_virtual_environment_firewall_rules" "tf-sdntest" {
depends_on = [
proxmox_virtual_environment_cluster_firewall_security_group.tf-sdntest,
proxmox_virtual_environment_vm.sdntest
]
node_name = local.proxmox_dell_r740_node_name
vm_id = local.sdntest_vm_id
rule {
security_group = proxmox_virtual_environment_cluster_firewall_security_group.tf-sdntest.name
comment = "TF managed"
}
}
resource "proxmox_virtual_environment_cluster_firewall_security_group" "tf-sdntest" {
name = "tf-${local.sdntest_tag}"
comment = "Managed by Terraform"
rule {
type = "in"
action = "ACCEPT"
comment = "Allow ssh to vm from sdntest vm in wro"
source = "10.2.1.9"
dport = "22"
proto = "tcp"
log = "err"
}
rule {
type = "in"
action = "ACCEPT"
comment = "Allow ssh to vm from bastion"
source = "192.168.100.1"
dport = "22"
proto = "tcp"
log = "err"
}
}
resource "proxmox_virtual_environment_file" "cloud_config_debian" {
content_type = "snippets"
datastore_id = "local"
node_name = local.proxmox_dell_r740_node_name
source_raw {
data = <<-EOF
#cloud-config
package_upgrade: false
users:
- name: debian
groups: sudo
shell: /bin/bash
ssh-authorized-keys:
- ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFcSbrQuaj60B0gOKXm8biPsF3xQxYvzx0rQTacGOOYX servers rafsaf@RYCERSKI 2025
- ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFj7D5PqyotQupnbUaL8/1Jc/20nuWIBwv7PBfa3rTxY servers rafsaf@szczery 2025
sudo: ALL=(ALL) NOPASSWD:ALL
packages:
- qemu-guest-agent
runcmd:
- systemctl enable qemu-guest-agent
- systemctl start qemu-guest-agent
EOF
file_name = "cloud-config-debian.yaml"
}
}
resource "proxmox_virtual_environment_download_file" "dell_r740__debian_13_trixie_qcow2_img" {
content_type = "import"
datastore_id = "local"
node_name = local.proxmox_dell_r740_node_name
url = "https://cloud.debian.org/images/cloud/trixie/20251117-2299/debian-13-generic-amd64-20251117-2299.qcow2"
overwrite = false
}
This is essentially the proxmox_virtual_environment_vm I use, with the firewall rules alongside.
Let's focus on how the network interfaces change before and after SDN. For older VMs, there is only an eth0 definition on vmbr0:
initialization {
ip_config {
ipv4 {
address = "${local.sdntest_ip}/16"
gateway = local.proxmox_cluster_network_gateway
}
}
}
network_device {
bridge = "vmbr0"
enabled = true
firewall = true
}
That would give the following interfaces, generated by cloud-init:
# /etc/netplan/50-cloud-init.yaml
network:
version: 2
ethernets:
eth0:
addresses:
- 192.168.3.12/16
gateway4: 192.168.1.1
match:
macaddress: bc:24:11:f6:35:20
nameservers:
addresses:
- 192.168.3.12
search:
- proxmox-dell-r740.internal
set-name: eth0
whereas in the SDN VM it's:
initialization {
ip_config {
ipv4 {
address = "${local.sdntest_evpn_ip}/24"
gateway = proxmox_virtual_environment_sdn_subnet.crosssite_subnet.gateway
}
}
ip_config {
ipv4 {
address = "${local.sdntest_ip}/16"
}
}
}
network_device {
bridge = proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id
enabled = true
firewall = true
}
network_device {
bridge = "vmbr0"
enabled = true
firewall = true
}
This means eth0 is now the SDN network and the default route to the internet goes via the 10.1.1.1 gateway. We keep the (optional) vmbr0 management interface, but it's not used for internet access anymore.
network:
version: 2
ethernets:
eth0:
match:
macaddress: "bc:24:11:5d:83:33"
addresses:
- "10.1.1.11/24"
nameservers:
addresses:
- 192.168.3.8
search:
- proxmox-dell-r740.internal
set-name: "eth0"
routes:
- to: "default"
via: "10.1.1.1"
eth1:
match:
macaddress: "bc:24:11:30:b0:cb"
addresses:
- "192.168.3.12/16"
nameservers:
addresses:
- 192.168.3.8
search:
- proxmox-dell-r740.internal
set-name: "eth1"
(Please ignore the nameserver 192.168.3.8; I use a dnscrypt VM for DNS resolution, hence that address.)
In fact, this is exactly what we need. All traffic must go via the subnet gateway and the SDN, because that is the place where we can decide whether traffic should go through the router to the public internet or to the other data center (yes, that also travels over the public internet, but hidden and encrypted by WireGuard):
For example:
VM 10.1.1.5 from roz wants to reach 10.2.1.5 from wro
├── 10.2.1.5 is NOT in local subnet 10.1.1.0/24
├── VM sends packet to default gateway 10.1.1.1 (standard IP routing)
├── Gateway (Proxmox EVPN router) has route to 10.2.1.0/24 via BGP-EVPN
├── Packet encapsulated in VXLAN, sent to 10.2.0.1 via Wireguard
└── WRO decapsulates, delivers to VM 10.2.1.5
This scenario is not possible when vmbr0 is the main interface.
So for each VM, it must be:
+--------------------+
| eth0: vnet (SDN) | 10.x.1.x/24
| - default gw = | 10.x.1.1
| SDN gateway |
| |
| eth1: vmbr0 (mgmt) | 192.168.x.x
+--------------------+
eth0: the overlay, with internet via the exit node + SNAT
eth1: fallback control access when the overlay is broken, plus port forwarding from the router
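From inside a VM this split is easy to confirm; a sketch using the addresses from the diagrams above:
ip route                     # default via 10.1.1.1 dev eth0; 192.168.0.0/16 only as a connected route on eth1
ip route get 10.2.1.9        # should resolve via 10.1.1.1 / eth0, not via eth1
ping -c 3 10.2.1.9           # sdntest VM on the other site, routed through the EVPN gateway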
Critical fixes
SNAT
snat=1 is enabled in proxmox_virtual_environment_sdn_subnet.
You could theoretically set up masquerade/NAT for 10.1.1.0/24 on your router, and with some hassle that would work, but this is exactly what we want to avoid – a hardcoded, non-trivial setup anywhere it isn't needed. With snat, the Proxmox node gets an iptables rule for this subnet so it can access the internet via the node's vmbr0 interface:
$ iptables -t nat -L -n -v | egrep '10\.1\.1\.0/24|MASQUERADE|SNAT'
64 4859 SNAT all -- * vmbr0 10.1.1.0/24 0.0.0.0/0 to:192.168.2.3
0 0 SNAT all -- * vmbr0 10.1.1.0/24 0.0.0.0/0 to:192.168.2.3
0 0 SNAT all -- * vmbr0 10.1.1.0/24 0.0.0.0/0 to:192.168.2.3
0 0 SNAT all -- * vmbr0 10.1.1.0/24 0.0.0.0/0 to:192.168.2.3
A small nitpick I found: after disabling it with snat=0, the iptables rules for 10.1.1.0/24 were still in place, so Proxmox doesn't clean them up. It's probably a bug, but maybe a feature.
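If you hit the same leftovers, they can be inspected and removed by hand; a sketch, hedged because the exact chain Proxmox puts the rule in may differ between versions:
iptables-save -t nat | grep '10\.1\.1\.0/24'
# delete the printed rule verbatim by swapping -A for -D, e.g. (chain and rule spec taken from the output above):
# iptables -t nat -D POSTROUTING -s 10.1.1.0/24 -o vmbr0 -j SNAT --to-source 192.168.2.3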
The killer setting: exit_nodes_local_routing
This was the bug-shaped hole.
I initially assumed "local routing" should be enabled because... why not.
Exit Nodes Local Routing
This is a special option if you need to reach a VM/CT service from an exit node. (By default, the exit nodes only allow forwarding traffic between real network and EVPN network). Optional.
In practice, for this design, enabling local routing broke cross-site behavior, causing hard-to-debug port mangling of packets.
Test scenario that uncovered the issue:
- tcpdump -n -vv -i vmbr0 host 104.18.27.120 in a 1st terminal on the roz node
- tcpdump -n -vv -i xvrf_crossite host 104.18.27.120 in a 2nd terminal on the roz node
- tcpdump -n -vv -i crossvnt host 104.18.27.120 in a 3rd terminal on the roz node
- tcpdump -n -v -i eth0 'host 104.18.27.120' in a 1st terminal in the roz VM 10.1.1.11
- curl --connect-timeout 5 http://example.com -v (104.18.27.120) in a 2nd terminal in the VM
On Proxmox (vmbr0)
- 104.18.26.120.80 > 10.1.1.11.59376 with full HTTP/1.1 200 OK body.
- SNAT on the exit node is correct (Cloudflare replies to 192.168.2.3:59376).
- De‑SNAT is correct (by the time it’s leaving vmbr0 towards EVPN, it’s 80 → 10.1.1.11:59376).
On Proxmox (crossvnt) and inside the VM (eth0)
- Handshake and GET are normal on port 80:
- 10.1.1.11.58044 > 104.18.27.120.80 [S]
- 104.18.27.120.80 > 10.1.1.11.58044 [S.]
- 10.1.1.11.58044 > 104.18.27.120.80 [.]
- 10.1.1.11.58044 > 104.18.27.120.80 [P.] GET / ...
Then all subsequent packets from the server side on the way to the guest:
- 104.18.27.120.397 > 10.1.1.11.58044 [.]
- 104.18.27.120.238 > 10.1.1.11.58044 [P.]
- 104.18.27.120.125 > 10.1.1.11.58044 [P.]
...and so on with ports 100, 49, 402, 440, 171, etc.
The VM immediately sends an RST to every one of these:
- 10.1.1.11.58044 > 104.18.27.120.397: Flags [R] ...
- 10.1.1.11.58044 > 104.18.27.120.238: Flags [R] ...
So the sequence looks like:
- Internet ↔ EdgeRouter ↔ vmbr0: fine (port 80).
- vmbr0 ↔ xvrf_crossite: fine (port 80).
- xvrf_crossite → crossvnt: the port is rewritten from 80 to random ephemeral values.
- crossvnt → VM: these corrupted ports reach the guest, which correctly resets.
With exit_nodes_local_routing=false, there is no xvrf_crossite interface at all and no problems.
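A simple way to see which mode the exit node is in is to just list its interfaces:
ip -br link | grep xvrf      # with exit_nodes_local_routing=false this should print nothing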
Firewall
First, for forward traffic rules you need the nftables-based proxmox-firewall.
- At the cluster level, ebtables=true (proxmox_virtual_environment_cluster_firewall ebtables=true).
- On each node, it must additionally be enabled (via the GUI for now, there is no OpenTofu resource yet): nftables (tech preview) = yes (a quick way to confirm which firewall flavour is active is sketched below).
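A quick way to confirm the nftables-based firewall is the one actually running on a node (service and command names as I understand the tech preview; treat as a sketch):
systemctl status proxmox-firewall    # the nftables-based daemon, as opposed to the legacy pve-firewall
nft list tables                      # should show the Proxmox firewall tables once it's active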
resource "proxmox_virtual_environment_cluster_firewall_security_group" "tf-cluster-security-group-sdn" {
name = "tf-cluster-sg-sdn"
comment = "Cluster level SDN firewall rules"
rule {
type = "in"
action = "ACCEPT"
comment = "Allow udp to host wireguard"
source = "0.0.0.0/0"
dport = "51820"
proto = "udp"
log = "err"
}
rule {
type = "in"
action = "ACCEPT"
comment = "Allow EVPN BGP over WireGuard"
source = "10.2.0.0/24"
dport = "179"
proto = "tcp"
log = "err"
}
rule {
type = "in"
action = "ACCEPT"
comment = "Allow VXLAN over WireGuard"
source = "10.2.0.0/24"
dport = "4789"
proto = "udp"
log = "err"
}
rule {
type = "forward"
action = "ACCEPT"
comment = "SDN crosssite_vnet allow cross-site subnet"
source = "10.0.0.0/8"
dest = "+sdn/${proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id}-all"
log = "err"
}
rule {
type = "in"
action = "ACCEPT"
comment = "SDN crosssite_vnet allow traffic from no-gateway to gateway"
dest = "+sdn/${proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id}-gateway"
log = "err"
source = "+sdn/${proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id}-no-gateway"
}
rule {
type = "forward"
action = "ACCEPT"
comment = "SDN crosssite_vnet allow forward traffic to 0.0.0.0/0"
dest = "0.0.0.0/0"
log = "err"
source = "+sdn/${proxmox_virtual_environment_sdn_vnet.crosssite_vnet.id}-all"
}
}
PVE automatically creates some IPSets for every SDN vnet, which is where "vnetid-all" and "vnetid-gateway" come from.
Clarifications:
- Allow udp to host wireguard – the UDP port from the Ansible setup must be allowed at the cluster level. This is the "public" port, with port forwarding on the public IP in my case.
- Allow EVPN BGP over WireGuard and Allow VXLAN over WireGuard – allow the BGP and VXLAN ports; without this, packets will be stuck between clusters (vm1 in wro to vm1 in roz and the reverse). For wro it's source = "10.1.0.0/24".
- SDN crosssite_vnet allow cross-site subnet – allow forward traffic from the full "VPN" 10.0.0.0/8 network to all IPs in the subnet. This is wide for simplicity and could be 10.2.0.0/16 for now. Without it, packets flow from cluster to cluster but get stuck trying to reach the SDN interface from the WireGuard interface.
- SDN crosssite_vnet allow traffic from no-gateway to gateway – no-gateway subnet IPs must have access to the gateway (check with ping gateway_ip).
- SDN crosssite_vnet allow forward traffic to 0.0.0.0/0 – this allows public internet access from VMs in the subnet.
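With all the rules in place, the end-to-end test is the thing the whole setup was for; a sketch from the roz sdntest VM (10.1.1.11), assuming the wro side has the mirror security-group rule allowing SSH from 10.1.1.11:
ping -c 3 10.2.1.9                                   # cross-site reachability over EVPN/VXLAN/WireGuard
ssh debian@10.2.1.9                                  # SSH to the wro sdntest VM through the overlay
curl --connect-timeout 5 -I https://example.com      # public internet via the exit node + SNAT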
Last updated: 2026-01-15
Content license: CC BY-NC 4.0 — share and adapt with attribution, no commercial use.