Automatically scale your Rails application with HPA

Luca Mattivi
Treatwell Engineering
8 min read · Aug 23, 2023


Scalability is something we need to face as an application grows over time, and the easy way to achieve it is to add more replicas to our K8s deployments.

Nice and easy, but at the same time costs grow, and that’s not what we want.

During 2023, at Treatwell, we replaced our bare metal servers with AWS cloud instances (check VPN Tunnels: how we used them to migrate our platform to AWS to read more on how we managed to do it). Having a cloud provider allows high elasticity when it comes to spinning instances up and down. Of course, all of this comes at a different cost, but there’s a large margin for savings if the application scale always matches the traffic, without overpowering our system when it’s not needed.

Our SaaS application traffic depends on business hours, in particular on the beauty industry’s working days and hours: in most cases, there’s high traffic from Tuesday morning to Saturday evening (beauty salons are usually closed on Sunday and Monday).

1-week application web traffic, where 16 Jul 2023 is a Sunday

Current application structure

The Treatwell SaaS application we wanted to scale is a Ruby on Rails application using Puma as the web server, Sidekiq for asynchronous job execution, PostgreSQL as the database, and Redis for the cache.

We’re covering just the Rails application side of things in this article, but it’s strongly recommended to check and adjust PostgreSQL and Redis connection settings if needed to keep up with the load.

Everything is running in a Kubernetes cluster, managed by an Infrastructure-as-Code application we maintain internally.

Kubernetes HPA

Kubernetes provides the Horizontal Pod Autoscaler (HPA), a built-in mechanism that automatically scales deployments horizontally by acting on the replica count. Its configuration requires one or more metrics to watch and a target value to be met.
Once configured, the desired number of replicas is calculated with the following formula:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]
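
As a quick illustration of the formula (the numbers below are hypothetical, not taken from our cluster), a few lines of Ruby make the behaviour clear:

# Hypothetical example of the HPA formula above
current_replicas     = 4
current_metric_value = 150.0 # observed value of the metric
desired_metric_value = 100.0 # target value configured in the HPA

desired_replicas = (current_replicas * (current_metric_value / desired_metric_value)).ceil
# => 6, so the HPA would scale the deployment from 4 to 6 replicas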

Metric collector

We’re using Prometheus to collect our application base metrics (CPU, memory, disk, network), but for this task, we need to go further: we need custom metrics.

Why? Because, by default, HPA scales on CPU and memory metrics, and that unfortunately doesn’t fit our needs: for asynchronous jobs, for example, you may have workers that are designed to use lots of CPU and memory for intensive work, so those are not metrics we can rely on. Because of this, we needed not only a custom metric exporting system for our workloads, but also a way to let HPA consume those metrics, since that is not available out of the box.

After some research and trials with different tools (like prometheus-adapter, which required a very complex configuration), we went for the kube-metrics-adapter project by Zalando, which allows clean usage of custom metrics for HPA.

It’s easily deployable with a Helm chart, but be aware that it lacks the APIService objects required for it to run (one for custom and one for external metrics):

---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: kube-metrics-adapter
    namespace: monitoring
  version: v1beta1
  versionPriority: 100
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: kube-metrics-adapter
    namespace: monitoring
  version: v1beta1
  versionPriority: 100

At this point, the adapter for custom Prometheus metrics was available and ready to be used for our HPAs.

Metric exporter

To properly configure the HPA metrics’ queries to scale the Rails application, some Rails-specific metrics are needed. Luckily, the prometheus_exporter gem allows exporting Prometheus metrics and comes with a lot of prebuilt ones.
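
Adding the gem is the usual one-line Gemfile change (a minimal sketch, assuming a standard Bundler setup, followed by a bundle install):

# Gemfile
gem 'prometheus_exporter'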

After installing it, we followed the guide available for this gem:

# Middleware initialization in application.rb
require 'prometheus_exporter'
require 'prometheus_exporter/middleware'

# This reports stats per request like HTTP status and timings
app.middleware.unshift PrometheusExporter::Middleware

# Puma configuration
on_worker_boot do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::ActiveRecord.start(
    custom_labels: { type: 'puma_worker' },
    config_labels: %i[database host]
  )
end

after_worker_boot do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  # optional check, avoids spinning up and down threads per worker
  PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?
end

# Sidekiq initializer
require 'prometheus_exporter'
require 'prometheus_exporter/instrumentation'

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PrometheusExporter::Instrumentation::Sidekiq
  end
  config.death_handlers << PrometheusExporter::Instrumentation::Sidekiq.death_handler
  config.on(:startup) do
    PrometheusExporter::Instrumentation::ActiveRecord.start(
      custom_labels: { type: 'sidekiq' },
      config_labels: %i[database host]
    )
    PrometheusExporter::Instrumentation::Process.start type: 'sidekiq'
    PrometheusExporter::Instrumentation::SidekiqProcess.start
  end

  at_exit do
    PrometheusExporter::Client.default.stop(wait_timeout_seconds: 10)
  end
end

We also created a Rails task that runs in a dedicated workload and collects generic statistics about the Sidekiq queues; this avoids the many duplicated data points that would occur if each Sidekiq worker exported the same queue metrics. Every worker still exports its own metrics (job timings, memory used, etc.):

# frozen_string_literal: true

desc 'Export queues and stats from Sidekiq'
task sidekiq_metrics: :environment do
  require 'prometheus_exporter'
  require 'prometheus_exporter/instrumentation'
  require 'prometheus_exporter/server'

  server = PrometheusExporter::Server::WebServer.new(bind: '0.0.0.0', port: 9394, verbose: true)
  server.start
  PrometheusExporter::Client.default = PrometheusExporter::LocalClient.new(collector: server.collector)
  PrometheusExporter::Metric::Base.default_prefix = 'ruby_'

  PrometheusExporter::Instrumentation::SidekiqQueue.start(all_queues: true)
  PrometheusExporter::Instrumentation::SidekiqStats.start

  sleep
end

We configured our IaC application so that each pod exporting metrics has a sidecar container collecting and exposing them to Prometheus:

- name: metrics-container
  image: uala/prometheus_exporter
  resources:
    limits:
      cpu: 50m
      memory: 32Mi
    requests:
      cpu: 50m
      memory: 32Mi
  livenessProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 3
    periodSeconds: 10
    successThreshold: 1
    failureThreshold: 5
  readinessProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 3
    periodSeconds: 10
    successThreshold: 2
    failureThreshold: 5
  startupProbe:
    httpGet:
      path: /ping
      port: 9394
      scheme: HTTP
    timeoutSeconds: 1
    periodSeconds: 2
    successThreshold: 1
    failureThreshold: 5

And the corresponding Service with the Prometheus scraping configuration:

apiVersion: v1
kind: Service
metadata:
  name: deployment-metrics-service
  namespace: default
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: '9394'
    prometheus.io/scrape: 'true'
spec:
  ports:
    - name: '9394'
      protocol: TCP
      port: 9394
      targetPort: 9394
  selector:
    app: default-deployment-a1b2c3d4
  clusterIP: 172.20.xx.xx
  clusterIPs:
    - 172.20.xx.xx
  type: ClusterIP

Metrics collection is now in place; let’s move on and make use of it.

HPA configuration

Using the kube-metrics-adapter, an HPA object is composed as follows:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deployment-hpa
  namespace: default
  annotations:
    metric-config.external.deployment-custom-metric-1.prometheus/interval: 15s
    metric-config.external.deployment-custom-metric-1.prometheus/prometheus-server: http://prometheus.kube-system.svc.cluster.local:9090
    metric-config.external.deployment-custom-metric-1.prometheus/query: ...
spec:
  scaleTargetRef:
    kind: Deployment
    name: deployment
    apiVersion: apps/v1
  minReplicas: ...
  maxReplicas: ...
  metrics:
    - type: External
      external:
        metric:
          name: deployment-custom-metric-1
          selector:
            matchLabels:
              type: prometheus
        target:
          type: Value
          value: ...
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      selectPolicy: Min
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 120
      selectPolicy: Max
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60

In metadata we can configure the metric(s) used, as well as the Prometheus polling interval and the server address.
The spec contains the definition of the HPA itself: the referenced Deployment, the metric(s) used (by metric name) with the desired target value, and a scale behavior used to fine-tune the scaling window and size. When more than one metric is configured, the HPA computes a desired replica count for each of them and uses the highest.

Web traffic scalability

As we saw at the beginning of this article, our web traffic has a predictable curve during the week and is quite stable during the day after the initial growth when clients connect. To scale it, we leveraged the puma_max_threads and puma_thread_pool_capacity metrics (the latter reports the number of Puma threads currently available), setting an upper limit of 80, which corresponds to 80% usage of the available Puma threads, resulting in this query:

round(
  (
    sum by (kubernetes_name) (ruby_puma_max_threads{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"}) -
    sum by (kubernetes_name) (ruby_puma_thread_pool_capacity{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  ) /
  sum by (kubernetes_name) (ruby_puma_max_threads{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  * 100
)

In our experience with this Rails application, we noticed that response time quickly increases when the web server is overloaded, so we added a second metric based on http_request_duration_seconds, setting an upper limit of 100, which corresponds to 100ms of response time, resulting in this query:

round(
  avg by(kubernetes_name) (ruby_http_request_duration_seconds{kubernetes_namespace="default", kubernetes_name="deployment-metrics-service"})
  * 1000
)

Asynchronous jobs scalability

While web traffic is quite stable during the day, asynchronous job queues are different: some of them depend on cron tasks or external events, and their traffic is unpredictable; others mimic the web traffic, meaning that more web traffic enqueues more jobs, so the load between the two is related.

For the first group of queues, we went for sidekiq_queue_latency_seconds (“The number of seconds between oldest job being pushed to the queue and current time.”), setting an upper limit of 20, which corresponds to 20 seconds of latency, resulting in this query:

round(
  avg by(queue) (ruby_sidekiq_queue_latency_seconds{kubernetes_namespace="default", queue="queue-name"}),
  1
)

Using latency works well for load spikes, but it produces a high number of scale-ups and scale-downs when the traffic is more stable over a larger time window. To scale these queues properly, in addition to the latency we added a query that monitors the capacity usage of the queue, using sidekiq_process_busy and sidekiq_process_concurrency, with an upper limit of 100, which corresponds to 100% usage of the queue workers’ capacity, resulting in this query:

round(
  sum(avg_over_time(ruby_sidekiq_process_busy{kubernetes_namespace="default", queues="queue"}[5m])) /
  sum(avg_over_time(ruby_sidekiq_process_concurrency{kubernetes_namespace="default", queues="queue"}[5m]))
  * 100,
  1
)

Behavior of HPA

The behavior node in the HPA allows fine-tuning of the scale-up and scale-down actions on the deployment. Multiple policies can be set, based either on a number of Pods or on a Percent of the current replicas; when more than one policy is set, selectPolicy determines which one is applied (Min or Max).
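
As a concrete illustration (with a hypothetical replica count, mirroring the scaleUp block from the HPA above), this is how selectPolicy: Min resolves the two policies:

# Hypothetical illustration of the scaleUp policies above
current_replicas = 10

percent_allowance = current_replicas * 50 / 100 # Percent: 50 -> up to 5 pods per period
pods_allowance    = 2                           # Pods: 2     -> up to 2 pods per period

# selectPolicy: Min applies the policy allowing the smallest change,
# so at most 2 pods can be added every 60 seconds
max_pods_added = [percent_allowance, pods_allowance].min
# => 2

With selectPolicy: Max the Percent policy would win instead, allowing a faster scale-up.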

A special mention must go to stabilizationWindowSeconds, which approximates a rolling maximum and avoids the scaling algorithm frequently removing pods only to recreate an equivalent pod moments later.

This has been really helpful in stabilising our workloads, avoiding too much fluctuation.

Results

We can see HPA in action for both Sidekiq and Puma workloads in the following screenshots.

The first one is the HPA enabled for Sidekiq workers, based only on queue latency:

Sidekiq HPA based on queue latency only

This one instead is an HPA enabled for Sidekiq workers that also uses a second metric based on RPM:

Sidekiq HPA based on queue latency and stabilized with RPM

Finally, the HPA running with our Puma servers: as you can see, the workers scale up based on the RPM, keeping the response time under the limit.

Puma HPA based on RPM

By utilising Horizontal Pod Autoscaling (HPA) in our workloads, we significantly reduced the number of required replicas during periods of low demand. This directly led to a reduction in the number of instances needed, effectively decreasing our monthly costs by 35–40%.
