VPN Tunnels: how we used them to migrate our platform to AWS

Luca Mattivi
Treatwell Engineering
7 min read · Jun 8, 2023


Intro

Since the beginning of Uala, we have always focused on using the best and latest technology available. We’re proud to say that we used Docker when it was still in beta, adopted Rancher from its first version, and later transitioned to Kubernetes once we felt confident in our ability to manage it (v1.15).

Initially, we relied on bare-metal servers (at Hetzner), which meant handling every aspect of our on-premises infrastructure ourselves, with no off-the-shelf solutions to lean on.

However, as our company grew, including the significant merger with Treatwell, in recent years we reached the top of the market in our business domain.

Scalability and availability quickly became the two main priorities on the platform side.

Given the Treatwell platform team’s familiarity with AWS, at the end of 2022 we decided to migrate everything from bare-metal servers to AWS.

This transition presented us with a challenge: How could we move our infrastructure to AWS without any downtime or negative impact on productivity?

The Challenge with High Traffic

Our platform experiences high traffic 24/7 and consists of multiple terabyte-sized databases.

A big-bang approach to migration was not feasible, so we needed to find another way.

While several services enable communication between different clusters, most of them are complex and directly rely on the underlying network of the cluster.
Since we lacked confidence in those solutions, we had to explore other options.

We considered using a VPN as a potential solution. However, after attempting this approach, we quickly realized that managing a VPN at that scale was chaotic, especially given that containers are not VPN-friendly.

After extensive research, we discovered Tailscale, a user-friendly tool for VPN management.

VPN Tunnels

What distinguishes Tailscale from a standard VPN is its effortless setup. Everything works out of the box and network management is handled for you: it detects connection failures and reconnects automatically, and we found that it works flawlessly with containers.

Nevertheless, relying on a third-party service introduces an external dependency, which is something we usually avoid.
Fortunately, the Tailscale client is open source, and there is a great project called Headscale (supported by Tailscale) that provides an open-source implementation of the server side as well.
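For illustration only (our actual client image appears later in this post), this is roughly what pointing a stock Tailscale container at a self-hosted Headscale server looks like; the hostname and the pre-auth key are placeholders:

# Illustration only: a stock tailscale/tailscale container registering against a
# self-hosted Headscale coordination server. Hostname and key are placeholders.
containers:
  - name: tailscale
    image: tailscale/tailscale:stable
    env:
      - name: TS_AUTHKEY          # pre-auth key generated on the Headscale server
        value: "<headscale-preauth-key>"
      - name: TS_EXTRA_ARGS       # use Headscale instead of the Tailscale SaaS control plane
        value: "--login-server=https://headscale.example.com"
      - name: TS_STATE_DIR        # persist the node identity across restarts
        value: /var/lib/tailscale
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]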

With Tailscale, we found a way to establish communication between different components, but we were still far from achieving our goal.

We can use Tailscale to create multiple VPN Tunnels.

VPN performance

To ensure a seamless migration without impacting the platform, we needed to verify the performance and stability of the communication.

Two factors affected these aspects: relays and resources.

Initially, when we attempted to replicate our main Postgres database, we encountered a significant decrease in network speed, only achieving 20MB/s.

After investigating and debugging Tailscale, we discovered that we were using a Tailscale Relay, which introduced substantial latency between the two clients.

This was expected since the clients couldn’t communicate directly and required a relay to establish a connection.

Unfortunately, this was not an acceptable solution for us. However, after conducting tests, we realized that enabling the hostNetwork option on one of the vpn-client pods resolved the issue. It allowed for a direct connection between the two clients.

After making this adjustment, we tested again, but the bandwidth remained too low, reaching only about 60MB/s.

The solution was straightforward this time: to accommodate higher bandwidth, the vpn-clients required more resources.
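Both fixes are plain pod-spec settings. A minimal sketch, where the resource numbers are placeholders rather than our production values:

# Illustrative vpn-client pod-spec fragment: hostNetwork allows a direct,
# non-relayed connection, and extra CPU gives the tunnel the headroom it needs.
# The resource figures below are placeholders.
spec:
  hostNetwork: true
  containers:
    - name: vpn-client
      image: leen15/tailscale:1.38.4-dns-server
      resources:
        requests:
          cpu: "1"
          memory: 512Mi
        limits:
          cpu: "2"
          memory: 1Gi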

Through further testing and optimization, we achieved a stable bandwidth of 150MB/s with a latency of 5–6ms for each vpn-tunnel.

Bandwidth of a VPN-Tunnel

With the performance requirements met, we were ready to proceed.

Practice in Kubernetes

To simplify management, we created a custom Docker image that facilitated proxying one or more services in either direction through Tailscale.

For example, our main Postgres instance had to remain on the old cluster until the final promotion of a new replica (scheduled for the last day of the migration). Additionally, we had to expose it in the new cluster so that every service could write to it, and new replicas could be created and kept aligned with it.
The following schema illustrates our approach:

Example of the database communication with Tailscale
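In practice, a replica running in the new cluster simply connects to a local ClusterIP service in front of the vpn-client, which forwards the traffic to the primary in the Hetzner cluster. A hypothetical sketch of the replica’s connection settings (the service name follows the naming pattern shown later in this post, but is invented here):

# Hypothetical: connection env vars for a replica in the new cluster.
# The host is a ClusterIP service in front of the vpn-client; the name is illustrative.
env:
  - name: PGHOST
    value: vpn-client-postgres-5432-service.vpn-tunnel.svc.cluster.local
  - name: PGPORT
    value: "5432"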

While we knew how to utilize VPN tunnels in Kubernetes, we needed to integrate them into our Infrastructure as Code (IaC) to ensure easy usage during the migration.

Integrating VPN Tunnels with IaC

Our migration plan involved creating a new namespace where all the Tailscale clients would run. We then remapped the application services to the Tailscale clients within that namespace. Consequently, at the application level, the migration would be transparent. Each time we moved an application to the new cluster, the Kubernetes services would point to the migrated application without requiring any changes.

Here’s how we accomplished this using cdk8s, the IaC tool we employ.

Firstly, we created a new component for the Tailscale client:

// Imports omitted for brevity: Construct (from "constructs"), Names (from "cdk8s"),
// the cdk8s-generated k8s types (EnvVar, KubeService, IntOrString, Quantity,
// PersistentVolumeClaimTemplate), and our internal Core.StatefulSet wrapper.
export class TailscaleClient extends Core.StatefulSet {
  constructor(scope: Construct, id: string, options: TailscaleClientOptions) {
    const proxyEnvs: EnvVar[] = [];
    proxyEnvs.push(
      { name: "TAILSCALE_HOSTNAME", value: options.tailscaleHostname! }
    );

    if (options?.proxies) {
      options.proxies.forEach(proxy => {
        // In-cluster destinations (ending in .svc.cluster.local) are reached on their
        // service port; remote vpn-clients are reached on the tunnel port itself.
        const destinationPort = (proxy.destination.endsWith('.svc.cluster.local'))
          ? proxy.servicePort
          : proxy.port;

        proxyEnvs.push(
          {
            name: `PROXY_HOST_${proxy.port}`,
            value: `${proxy.destination}:${destinationPort}:${proxy.dnsServer}`
          }
        );

        // One ClusterIP service per proxied port, so applications can keep using
        // a plain Kubernetes service name.
        new KubeService(scope, `${id}-${proxy.port}-service`, {
          spec: {
            type: "ClusterIP",
            ports: [
              {
                port: proxy.servicePort,
                name: proxy.servicePort.toString(),
                targetPort: IntOrString.fromNumber(destinationPort)
              }
            ],
            selector: { app: Names.toLabelValue(scope, { extra: [id] }) }
          }
        });
      });
    }

    // Persist the Tailscale state so the node keeps its identity across pod restarts.
    const tailscaleStateVolume: PersistentVolumeClaimTemplate = {
      metadata: {
        name: "state",
      },
      spec: {
        storageClassName: options.storageClassName,
        accessModes: ["ReadWriteOnce"],
        resources: {
          requests: {
            storage: Quantity.fromString(options.volumeStorage ?? "1Gi")
          }
        }
      }
    };

    super(scope, id, {
      ...options,
      name: id,
      image: options.image ?? "leen15/tailscale:1.38.4-dns-server",
      serviceType: options.serviceType ?? "Headless",
      envs: [
        ...(options.envs || []),
        ...proxyEnvs
      ],
      securityContext: {
        capabilities: {
          add: ['NET_ADMIN']
        },
        privileged: false
      },
      hostNetwork: options.hostNetwork,
      volumeMounts: [
        {
          name: tailscaleStateVolume.metadata?.name || "",
          mountPath: "/var/lib/tailscale"
        },
      ],
      volumeClaimTemplates: [
        tailscaleStateVolume,
      ]
    });
  }
}

This allowed us to define a new VPN client using a YAML configuration:

vpn-tunnel:
  enabled: true
  namespace: vpn-tunnel-aws-staging-indianred
  workloadsProps:
    - name: vpn-client-be-admin
      enabled: true
      options:
        replicas: 1
        storageClassName: gp2
        rollingStrategy: StopStart # need to stop first to detach the volume
        volumeStorage: 1Gi
        labels:
          vpn: client
        tailscaleHostname: aws-staging-indianred-client-be-admin
        proxies:
          - port: 801
            servicePort: 80
            # web-lb-internal-service.be-admin-env-staging:80
            destination: hetz-dev-envs-client-be-admin
            dnsServer: 100.100.100.100
          - port: 802
            servicePort: 80
            destination: web-lb-service.be-admin-env-staging.svc.cluster.local
            dnsServer: 172.20.0.10
          - port: 803
            servicePort: 80
            destination: web-lb-internal-service.be-admin-env-staging.svc.cluster.local
            dnsServer: 172.20.0.10

What happened here?

We defined a new vpn-client for each namespace and specified a list of services to be mapped by the internal proxy of the vpn-client.

This setup allowed every vpn-client to operate in both directions.

In the provided example, port 801 was used to expose port 801 of another vpn-client (running in the Hetzner cluster) within the current AWS cluster. A new Kubernetes service is created to expose it on port 80, eliminating the need for any application-level changes.
The dnsServer parameter was used to enable the internal proxy of the vpn-client to resolve the destination, in this case, the Headscale dns-server.
The other two ports (802 and 803) were used to expose two internal services to the other vpn-client, relying on the cluster dns-server for resolution.
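Tracing that configuration through the TailscaleClient component above, the client pod ends up with one env var per proxy (in the form destination:port:dns-server), plus one ClusterIP service per port (e.g. vpn-client-be-admin-801-service listening on port 80). Roughly:

# Reconstructed from the component code: env vars received by the vpn-client pod.
env:
  - name: TAILSCALE_HOSTNAME
    value: aws-staging-indianred-client-be-admin
  - name: PROXY_HOST_801   # remote vpn-client in the Hetzner cluster, reached on its tunnel port
    value: hetz-dev-envs-client-be-admin:801:100.100.100.100
  - name: PROXY_HOST_802   # local service exposed towards the other cluster
    value: web-lb-service.be-admin-env-staging.svc.cluster.local:80:172.20.0.10
  - name: PROXY_HOST_803
    value: web-lb-internal-service.be-admin-env-staging.svc.cluster.local:80:172.20.0.10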

How did we use this client?

For an application that had not yet been migrated, we pointed its service at the vpn-client:

beAdmin:
  namespace: be-admin-env-staging
  workloadsProps:
    - name: web
      enabled: false
    - name: web-lb
      enabled: false
    - name: web-internal
      enabled: false
    - name: web-lb-internal
      enabled: false
    - name: traffic-black-hole
      enabled: false
    - name: web-lb-internal-service
      enabled: true
      ingressExtHost: vpn-client-be-admin-801-service.vpn-tunnel-aws-staging-indianred.svc.cluster.local

The ingressExtHost parameter is a shorthand that creates the service as an ExternalName pointing at the given host.
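Rendered to a manifest, that amounts to something like the following (a sketch, not the literal output of our tooling):

# Sketch of the Service that ingressExtHost generates: an ExternalName alias, so
# callers of web-lb-internal-service transparently reach the vpn-client proxy.
apiVersion: v1
kind: Service
metadata:
  name: web-lb-internal-service
  namespace: be-admin-env-staging
spec:
  type: ExternalName
  externalName: vpn-client-be-admin-801-service.vpn-tunnel-aws-staging-indianred.svc.cluster.local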

Consequently, any other application that needed to communicate with the web-lb-internal application interacted directly with the vpn-client, which then forwarded the traffic to the other cluster.

Conclusion

We moved the services one by one, creating VPN tunnels to expose each service that needed cross-cluster communication in the relevant cluster. We also kept forwarding traffic from the old cluster through VPN tunnels until the DNS records were fully switched over.

This approach enabled us to migrate our entire platform to a different cluster in another cloud provider without any downtime or issues for end users.
Throughout the migration, our developers could seamlessly operate on the applications, ensuring the same development experience as always.
