Automatically scale your Rails application with HPA

Luca Mattivi
Treatwell Engineering
8 min read · Aug 23, 2023


Scalability is something we need to face as an application grows over time, and the easy way to achieve it is to add more replicas to our K8s deployments.

Nice and easy, but at the same time costs grow, and that’s not what we want.

During 2023, at Treatwell, we replaced our bare metal servers with AWS cloud instances (check VPN Tunnels: how we used them to migrate our platform to AWS to read more on how we managed to do it). Having a cloud provider allows high elasticity when it comes to spinning instances up and down. Of course, all of this comes at a different cost, but there’s a large margin for savings if the application scale always matches the traffic, without overpowering our system when it’s not needed.

Our SaaS application traffic depends on business hours, in particular on the beauty industry’s working days and hours: in most cases, there’s high traffic from Tuesday morning to Saturday evening (beauty salons are usually closed on Sunday and Monday).

1-week application web traffic, where 16 Jul 2023 is a Sunday

Current application structure

The Treatwell SaaS application we wanted to scale is a Ruby on Rails application using Puma as the web server, Sidekiq for asynchronous job execution, PostgreSQL as the database, and Redis for the cache.

We’re covering just the Rails application side of things in this article, but it’s strongly recommended to check and adjust PostgreSQL and Redis connection settings if needed to keep up with the load.

Everything is running in a Kubernetes cluster, managed by an Infrastructure-as-Code application we maintain internally.

Kubernetes HPA

Kubernetes provides the Horizontal Pod Autoscaler (HPA), a built-in mechanism that automatically scales deployments horizontally by acting on the replica count. Its configuration requires one or more metrics to watch and a target value to be met.
Once configured, the desired number of replicas is calculated with the following formula:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]
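
As a quick illustration of the formula (the numbers below are hypothetical, not taken from our cluster), a few lines of Ruby make the behaviour clear:

# Hypothetical example of the HPA formula above
current_replicas     = 4
current_metric_value = 150.0 # observed value of the metric
desired_metric_value = 100.0 # target value configured in the HPA

desired_replicas = (current_replicas * (current_metric_value / desired_metric_value)).ceil
# => 6, so the HPA would scale the deployment from 4 to 6 replicas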

Metric collector

We’re using Prometheus to collect our application base metrics (CPU, memory, disk, network), but for this task, we need to go further: we need custom metrics.

Why? Because, by default, HPA scales on CPU and memory metrics, and that unfortunately doesn’t fit our needs: for asynchronous jobs, for example, you may have workers that are designed to use lots of CPU and memory for intensive work, so those are not metrics we can rely on. Because of this, we needed not only a custom metric exporting system for our workloads, but also a way to let HPA consume those metrics, since that is not available out of the box.

After some research and trials with different tools (like prometheus-adapter, which required a very complex configuration), we went for the kube-metrics-adapter project by Zalando, which allows clean usage of custom metrics for HPA.

It’s easily deployable with a Helm chart, but be aware that it lacks the APIService objects required for it to run (one for custom and one for external metrics):

---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: kube-metrics-adapter
    namespace: monitoring
  version: v1beta1
  versionPriority: 100
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: kube-metrics-adapter
    namespace: monitoring
  version: v1beta1
  versionPriority: 100

At this point, the adapter for custom Prometheus metrics was available and ready to be used for our HPAs.

Metric exporter

To properly configure the HPA metrics’ queries to scale the Rails application, some Rails-specific metrics are needed. Luckily, the prometheus_exporter gem allows exporting Prometheus metrics and comes with a lot of prebuilt ones.
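
Adding the gem is the usual one-line Gemfile change (a minimal sketch, assuming a standard Bundler setup, followed by a bundle install):

# Gemfile
gem 'prometheus_exporter'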

After installing it, we followed the guide available for this gem:

# Middleware initialization in application.rb
require 'prometheus_exporter'
require 'prometheus_exporter/middleware'

# This reports stats per request like HTTP status and timings
app.middleware.unshift PrometheusExporter::Middleware

# Puma configuration
on_worker_boot do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::ActiveRecord.start(
    custom_labels: { type: 'puma_worker' },
    config_labels: %i[database host]
  )
end

after_worker_boot do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  # optional check, avoids spinning up and down threads per worker
  PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?
end

# Sidekiq initializer
require 'prometheus_exporter'
require 'prometheus_exporter/instrumentation'

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PrometheusExporter::Instrumentation::Sidekiq
  end
  config.death_handlers << PrometheusExporter::Instrumentation::Sidekiq.death_handler
  config.on(:startup) do
    PrometheusExporter::Instrumentation::ActiveRecord.start(
      custom_labels: { type: 'sidekiq' },
      config_labels: %i[database host]
    )
    PrometheusExporter::Instrumentation::Process.start type: 'sidekiq'
    PrometheusExporter::Instrumentation::SidekiqProcess.start
  end

  at_exit do
    PrometheusExporter::Client.default.stop(wait_timeout_seconds: 10)
  end
end

We also created a Rails task that runs in a dedicated workload and collects generic statistics about the Sidekiq queues; this avoids the many duplicated data points that would occur if each Sidekiq worker exported the same queue metrics. Every worker still exports its own metrics (job timings, memory used, etc.):

# frozen_string_literal: true

desc 'Export queues and stats from Sidekiq'
task sidekiq_metrics: :environment do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  require 'prometheus_exporter/server'

  server = PrometheusExporter::Server::WebServer.new(bind: '0.0.0.0', port: 9394, verbose: true)
  server.start
  PrometheusExporter::Client.default = PrometheusExporter::LocalClient.new(collector: server.collector)
  PrometheusExporter::Metric::Base.default_prefix = 'ruby_'

  PrometheusExporter::Instrumentation::SidekiqQueue.start(all_queues: true)
  PrometheusExporter::Instrumentation::SidekiqStats.start

  sleep
end

We configured our IaC application so that each pod exporting metrics has a sidecar container collecting and exposing them to Prometheus:

- name: metrics-container
  image: uala/prometheus_exporter
  resources:
    limits:
      cpu: 50m
      memory: 32Mi
    requests:
      cpu: 50m
      memory: 32Mi
  livenessProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 3
    periodSeconds: 10
    successThreshold: 1
    failureThreshold: 5
  readinessProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 3
    periodSeconds: 10
    successThreshold: 2
    failureThreshold: 5
  startupProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 1
    periodSeconds: 2
    successThreshold: 1
    failureThreshold: 5

And the corresponding Service with the Prometheus scraping configuration:

apiVersion: v1
kind: Service
metadata:
  name: deployment-metrics-service
  namespace: default
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: '9394'
    prometheus.io/scrape: 'true'
spec:
  ports:
    - name: '9394'
      protocol: TCP
      port: 9394
      targetPort: 9394
  selector:
    app: default-deployment-a1b2c3d4
  clusterIP: 172.20.xx.xx
  clusterIPs:
    - 172.20.xx.xx
  type: ClusterIP

Metrics collection is now in place; let’s move on and make use of it.

HPA configuration

Using the kube-metrics-adapter, an HPA object is composed as follows:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deployment-hpa
  namespace: default
  annotations:
    metric-config.external.deployment-custom-metric-1.prometheus/interval: 15s
    metric-config.external.deployment-custom-metric-1.prometheus/prometheus-server: http://prometheus.kube-system.svc.cluster.local:9090
    metric-config.external.deployment-custom-metric-1.prometheus/query: ...
spec:
  scaleTargetRef:
    kind: Deployment
    name: deployment
    apiVersion: apps/v1
  minReplicas: ...
  maxReplicas: ...
  metrics:
    - type: External
      external:
        metric:
          name: deployment-custom-metric-1
          selector:
            matchLabels:
              type: prometheus
        target:
          type: Value
          value: ...
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      selectPolicy: Min
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 120
      selectPolicy: Max
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60

In metadata we can configure the metric(s) used, as well as the Prometheus polling interval and the server address.
The spec contains the definition of the HPA itself: the referenced Deployment, the metric(s) used (by metric name) with the desired target value, and a scale behavior used to fine-tune the scaling window and size. When more than one metric is configured, the HPA computes a desired replica count for each of them and uses the highest.

Web traffic scalability

As we saw at the beginning of this article, our web traffic has a predictable curve during the week and is quite stable during the day after the initial growth when clients connect. To scale it, we leveraged the puma_max_threads and puma_thread_pool_capacity metrics (the latter reports the number of Puma threads currently available), setting an upper limit of 80, which corresponds to 80% usage of the available Puma threads, resulting in this query:

round(
  (
    sum by (kubernetes_name) (ruby_puma_max_threads{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"}) -
    sum by (kubernetes_name) (ruby_puma_thread_pool_capacity{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  ) /
  sum by (kubernetes_name) (ruby_puma_max_threads{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  * 100
)

In our experience with this Rails application, we noticed that response time quickly increases when the web server is overloaded, so we added a second metric based on http_request_duration_seconds, setting an upper limit of 100, which corresponds to 100ms of response time, resulting in this query:

round(
  avg by(kubernetes_name) (ruby_http_request_duration_seconds{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  * 1000
)

Asynchronous jobs scalability

While web traffic is quite stable during the day, asynchronous job queues are different: some of them depend on cron tasks or external events, and their traffic is unpredictable; others mimic the web traffic, meaning that more web traffic enqueues more jobs, so the load between the two is related.

For the first group of queues, we went for sidekiq_queue_latency_seconds (“The number of seconds between oldest job being pushed to the queue and current time.”), setting an upper limit of 20, which corresponds to 20 seconds of latency, resulting in this query:

round(
  avg by(queue) (ruby_sidekiq_queue_latency_seconds{kubernetes_namespace="default", queue="queue-name"}),
  1
)

Using latency works well for load spikes, but it produces a high number of scale-ups and scale-downs when the traffic is more stable over a larger time window. To scale these queues properly, in addition to the latency we added a query that monitors the capacity usage of the queue, using sidekiq_process_busy and sidekiq_process_concurrency, with an upper limit of 100, which corresponds to 100% usage of the queue workers’ capacity, resulting in this query:

round(
  sum(avg_over_time(ruby_sidekiq_process_busy{kubernetes_namespace="default", queues="queue"}[5m])) /
  sum(avg_over_time(ruby_sidekiq_process_concurrency{kubernetes_namespace="default", queues="queue"}[5m]))
  * 100,
  1
)

Behavior of HPA

The behavior node in the HPA allows fine-tuning of the scale-up and scale-down actions on the deployment. Multiple policies can be set, based either on a number of Pods or on a Percent of the current replicas; when more than one policy is set, selectPolicy determines which one is applied (Min or Max).
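
As a concrete illustration (with a hypothetical replica count, mirroring the scaleUp block from the HPA above), this is how selectPolicy: Min resolves the two policies:

# Hypothetical illustration of the scaleUp policies above
current_replicas = 10

percent_allowance = current_replicas * 50 / 100 # Percent: 50 -> up to 5 pods per period
pods_allowance    = 2                           # Pods: 2     -> up to 2 pods per period

# selectPolicy: Min applies the policy allowing the smallest change,
# so at most 2 pods can be added every 60 seconds
max_pods_added = [percent_allowance, pods_allowance].min
# => 2

With selectPolicy: Max the Percent policy would win instead, allowing a faster scale-up.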

A special mention must go to stabilizationWindowSeconds, which approximates a rolling maximum and avoids the scaling algorithm frequently removing pods only to recreate an equivalent pod moments later.

This has been really helpful in stabilising our workloads, avoiding too much fluctuation.

Results

We can see HPA in action for both Sidekiq and Puma workloads in the following screenshots.

The first one is the HPA enabled for Sidekiq workers, based only on queue latency:

Sidekiq HPA based on queue latency only

This one instead is an HPA enabled for Sidekiq workers that also uses a second metric based on RPM:

Sidekiq HPA based on queue latency and stabilized with RPM

Finally, the HPA running with our Puma servers: as you can see, the workers scale up based on the RPM, keeping the response time under the limit.

Puma HPA based on RPM

By utilising Horizontal Pod Autoscaling (HPA) in our workloads, we significantly reduced the number of required replicas during periods of low demand. This directly led to a reduction in the number of instances needed, effectively decreasing our monthly costs by 35–40%.
