Running Terraform Enterprise Active-Active on OpenShift: Lessons Learned
Series: HashiCorp on OpenShift

Introduction
In my last post, I covered the four challenges I hit running Vault Enterprise on OpenShift. That same lab cluster also runs Terraform Enterprise (TFE) in active-active mode, using that Vault cluster as its secrets backend. TFE brought its own set of OpenShift-specific problems — some in TFE itself, some in the supporting PostgreSQL and object storage layers, and one that turned out to be the same VSO bug from the Vault post showing up again in a different place.
The deployment follows the same GitOps pattern as Vault: ArgoCD, a three-source Helm setup (upstream TFE chart, a values file from the repo, and raw manifests for VSO custom resources, the Route, and the ImageStream/BuildConfig that supporting pieces need). TFE runs with 2 replicas against a CloudNativePG-based PostgreSQL cluster, a Redis deployment for active-active coordination, and a NooBaa S3 bucket (via OpenShift Data Foundation) for object storage. Every secret TFE needs — license, encryption password, database credentials, Redis password, S3 keys, registry pull secret, TLS certificate — comes from the local Vault cluster through Vault Secrets Operator (VSO).
This post covers five things that caught me off guard: TFE’s hard refusal to run mixed versions during a rolling update, why one Postgres cluster ended up with two different Vault-managed role designs, a missing trust anchor for the in-cluster S3 endpoint, the VSO stuck-reconciliation bug recurring across two more secrets, and what it takes to get TFE’s job agent running under OpenShift’s restricted security model.
Challenge 1: Active-Active Means Lockstep Versions
The Problem
TFE’s active-active mode replicates its Rails processes across multiple pods
that all have to agree on the running version. I found this out during what
should have been a routine chart and image bump. The new pod came up as
CrashLoopBackOff while the two existing pods stayed healthy:
startup check failed: check=upgrade ... TFE version (2.0.4) is not
inter-compatible with system version (2.0.2), to upgrade to this version
active TFE versions must be stopped
The root cause is the Helm chart’s default RollingUpdate strategy with
maxSurge: 25% — it starts the new-version pod alongside the two
still-running old-version pods to keep capacity up during the rollout. TFE’s
own startup check treats that mixed-version state as invalid and refuses to
come up, so the new pod crash-loops forever while the old pods sit there
looking healthy.
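To see why exactly one extra pod appears, here is a minimal sketch of the rounding Kubernetes applies to percentage-based rollout settings: maxSurge rounds up and maxUnavailable rounds down. The function name is my own, and the 25% maxUnavailable default is the Kubernetes default rather than something I checked in the chart. The sketch also ignores the special case where both values resolve to zero.

```python
from typing import Tuple


def rolling_update_bounds(replicas: int,
                          max_surge_pct: int = 25,
                          max_unavailable_pct: int = 25) -> Tuple[int, int]:
    """Return (max_total_pods, min_available_pods) during a RollingUpdate.

    Hypothetical helper: integer arithmetic mirrors how Kubernetes resolves
    percentages (surge rounds up, unavailable rounds down).
    """
    surge = (replicas * max_surge_pct + 99) // 100      # ceiling division
    unavailable = (replicas * max_unavailable_pct) // 100  # floor division
    return replicas + surge, replicas - unavailable


max_total, min_available = rolling_update_bounds(2)
print(max_total, min_available)  # 3 2
```

With 2 replicas, the rollout may run 3 pods at once and must keep 2 available, so the old pods cannot be removed until the new one is ready, which it never is.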
The Solution
The fix is to pin the Deployment to Recreate:
# values.yaml
strategy:
  type: Recreate
Recreate tears down all pods before bringing any of them back up on the new
image, so there’s never a moment where two versions coexist. The trade-off is
real and worth stating plainly: every deploy that touches the pod template —
image bump or not — causes a brief full TFE outage while pods restart
(around 3 minutes, observed). For an active-active system, that’s a step back
from zero-downtime rollouts, but it’s the only option that matches TFE’s
lockstep version requirement.
If you ever hit the crash-loop mid-upgrade (say, the strategy setting got reverted by mistake), the recovery is a coordinated restart:
oc scale deployment terraform-enterprise -n tfe --replicas=0
oc scale deployment terraform-enterprise -n tfe --replicas=2
One subtlety if you’re running ArgoCD with selfHeal: true: it will fight the
scale-to-0 and start restoring replicas almost immediately. That’s fine here —
the new ReplicaSet on the new image brings both pods up together, which is
exactly what you want. Both pods reached 1/1 Running within a few minutes.
Challenge 2: One Postgres Cluster, Two Vault Role Designs
The Problem
Vault’s database secrets engine has an obvious, idiomatic pattern: a dynamic role that creates a brand-new Postgres user on every lease, with a short TTL. That’s what I set up first:
# db_name is the name of the Vault connection config (tfe-psql/config/tfe-psql),
# not the Postgres database name
vault write tfe-psql/roles/tfe-psql-role \
  db_name=tfe-psql \
  creation_statements="
    CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';
    GRANT ALL PRIVILEGES ON DATABASE tfe TO \"{{name}}\";
  " \
  default_ttl="1h" \
  max_ttl="24h"
It works — Vault happily mints a fresh, uniquely-named Postgres role every
hour. The problem is that TFE doesn’t want that. It expects to connect as one
stable database user across restarts, and its migrations need that user to
own the database and the public schema — not just have grants on them.
Dynamic per-lease roles with randomized names don’t fit that model at all.
The Solution
The fix was to layer Vault’s static role feature on top of the same database connection, rather than replacing the dynamic role. A static role manages the password for an existing, stable-named Postgres user on a fixed rotation schedule instead of creating new ones:
# Allow both role types on the same connection
vault write tfe-psql/config/tfe-psql \
  allowed_roles="tfe-psql-role,tfe-static"
# Register the static role
vault write tfe-psql/static-roles/tfe-static \
  db_name=tfe-psql \
  username=tfe_app \
  rotation_period=604800
But the tfe_app user has to exist and own the right objects before Vault can
manage its password. That’s a one-time manual bootstrap: create the role, then
hand over ownership as the Postgres superuser:
# Create the role using the cluster's admin user
psql "host=tfe-psql-primary.tfe.svc.cluster.local user=tfe dbname=tfe sslmode=require" \
  -c "CREATE ROLE tfe_app WITH LOGIN PASSWORD '${INIT_PW}';"
# As the postgres superuser, transfer ownership
psql -U postgres -d tfe -c "
  ALTER DATABASE tfe OWNER TO tfe_app;
  GRANT ALL ON SCHEMA public TO tfe_app;
  ALTER SCHEMA public OWNER TO tfe_app;
"
After that, Vault rotates tfe_app’s password every 7 days without ever
changing the username, which is exactly what TFE needs. The dynamic role
stays registered too — it’s a fine pattern for ad hoc admin access, just not
for the application’s own connection.
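The rotation_period in the static role is given in seconds, which is easy to get wrong by a factor of 60. A small sketch for deriving the value instead of hand-typing it (the helper name is hypothetical):

```python
from datetime import timedelta


def rotation_period_seconds(days: int = 0, hours: int = 0) -> int:
    """Convert a human-friendly rotation interval into whole seconds."""
    return int(timedelta(days=days, hours=hours).total_seconds())


print(rotation_period_seconds(days=7))  # 604800
```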
Challenge 3: NooBaa S3 and a Missing Trust Anchor
The Problem
TFE uses S3-compatible object storage for run logs and state, and this cluster provides that in-cluster via NooBaa (part of OpenShift Data Foundation) rather than a cloud S3 bucket. Pointing TFE at NooBaa’s internal endpoint seemed straightforward — until TFE crashed on startup with:
x509: certificate signed by unknown authority
The NooBaa S3 endpoint’s certificate is signed by OpenShift’s own internal service-CA, and TFE’s container doesn’t trust that CA by default. Nothing about the bucket configuration was wrong; TFE just couldn’t verify the TLS certificate on the connection.
The Solution
Every OpenShift namespace gets an auto-populated ConfigMap
(openshift-service-ca.crt) containing exactly that CA certificate. The fix
is to mount it into the path TFE’s chart already exposes for a custom CA
bundle:
# values.yaml
tls:
  caCertBaseDir: /etc/ssl/certs
  caCertFileName: custom_ca_certs.pem
extraVolumes:
  - name: openshift-service-ca
    configMap:
      name: openshift-service-ca.crt
      items:
        - key: service-ca.crt
          path: custom_ca_certs.pem
extraVolumeMounts:
  - name: openshift-service-ca
    mountPath: /etc/ssl/certs/custom_ca_certs.pem
    subPath: custom_ca_certs.pem
    readOnly: true
Once that ConfigMap is mounted and referenced by caCertBaseDir/
caCertFileName, TFE trusts the NooBaa endpoint’s certificate and connects
cleanly.
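The values above only work if several strings line up: the mount path has to equal caCertBaseDir joined with caCertFileName, and the subPath has to match the path the ConfigMap item is projected to. A minimal sketch of that consistency check, useful as a pre-commit sanity test; the function and its parameter names are my own, not part of the chart:

```python
import posixpath


def ca_mount_is_consistent(base_dir: str, file_name: str,
                           mount_path: str, sub_path: str,
                           item_path: str) -> bool:
    """True when the mounted file lands exactly where TFE is told to look."""
    expected = posixpath.join(base_dir, file_name)
    # The volume mount must target the configured bundle path, and the
    # subPath must pick the file the ConfigMap item was projected to.
    return mount_path == expected and sub_path == item_path


print(ca_mount_is_consistent(
    base_dir="/etc/ssl/certs",
    file_name="custom_ca_certs.pem",
    mount_path="/etc/ssl/certs/custom_ca_certs.pem",
    sub_path="custom_ca_certs.pem",
    item_path="custom_ca_certs.pem",
))  # True
```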
Worth noting on the bucket side: the ObjectBucketClaim that provisions the
bucket also produces a Secret with the S3 access keys, but TFE never reads
that Secret directly. Those keys get copied into Vault, and VSO syncs them
back out into the Secret TFE actually consumes — keeping VSO/Vault as the
single source of truth for every credential, not just the ones that
originate there.
Challenge 4: The VSO Stuck-Reconciliation Bug, Again
If you read the Vault post, you’ll remember VSO 1.4.0’s bug where a
VaultStaticSecret can stop reconciling after a transient Vault HA error —
and doesn’t recover on its own even if you restart the VSO controller. That
bug wasn’t a one-off. It hit twice more here, on two of TFE’s secrets, each
time triggered by an unrelated Vault rollout causing a brief Raft leader
election:
local node not active but active cluster node not found
Once on the VaultStaticSecret backing TFE’s S3 credentials, and once on the
one backing its Redis password — both during Vault maintenance that had
nothing to do with TFE. Same symptom both times: SecretSynced=False, a
lastTransitionTime stuck days in the past, and zero new controller log
lines for the resource.
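The symptom is mechanical enough to check automatically. Below is a rough sketch of a staleness test over a resource's condition list, assuming the conditions follow the usual Kubernetes shape (type, status, lastTransitionTime in RFC 3339 UTC) as you would get from the resource's JSON output. The function name and the one-hour threshold are assumptions of mine, not anything VSO defines.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(conditions, now, max_age=timedelta(hours=1)):
    """Return True if SecretSynced is not True and has not changed recently."""
    for cond in conditions:
        if cond.get("type") != "SecretSynced":
            continue
        if cond.get("status") == "True":
            return False
        # Kubernetes writes timestamps like 2024-05-01T10:00:00Z (UTC).
        last = datetime.strptime(
            cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        return now - last > max_age
    return False  # no SecretSynced condition reported: nothing to judge


conditions = [{
    "type": "SecretSynced",
    "status": "False",
    "lastTransitionTime": "2024-05-01T10:00:00Z",
}]
now = datetime(2024, 5, 4, 10, 0, 0, tzinfo=timezone.utc)
print(is_stuck(conditions, now))  # True
```

Passing the current time in explicitly keeps the check deterministic, which makes it easy to wire into an alert or a periodic job.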
The fix is the same spec-patch workaround from the Vault post — touch the resource to force VSO to requeue it, then revert:
oc -n tfe patch vaultstaticsecret tfe-s3 \
  --type=merge -p '{"spec":{"refreshAfter":"6m"}}'
# confirm SecretSynced=True and a fresh lastTransitionTime, then revert
oc -n tfe patch vaultstaticsecret tfe-s3 \
  --type=merge -p '{"spec":{"refreshAfter":"5m"}}'
There’s a related but distinct issue on TFE’s database credentials, which
come from a VaultDynamicSecret rather than a VaultStaticSecret. VSO 1.4.0
doesn’t treat a static-role database response as renewable unless told
otherwise, so without one extra field (allowStaticCreds: true) it syncs once
and never refetches. The next scheduled password rotation then leaves TFE
with a stale password and failing Postgres logins:
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: tfe-database
  namespace: tfe
spec:
  vaultAuthRef: tfe-auth
  mount: tfe-psql
  path: static-creds/tfe-static
  allowStaticCreds: true
  refreshAfter: 24h
  destination:
    name: tfe-database
    create: true
  rolloutRestartTargets:
    - kind: Deployment
      name: terraform-enterprise
Two different VSO resource types, two different failure modes, two different fixes. Both are now a standard part of the runbook rather than a surprise: between the Vault deployment and this one, the same bug family has hit three separate secrets.
Challenge 5: A Custom Job-Agent Image for OpenShift
The Problem
TFE runs each Terraform run in a short-lived “agent” container. The
stock hashicorp/tfc-agent image fails under OpenShift’s restricted security
model: it expects to write to a working directory owned by a fixed UID, which
doesn’t hold up once OpenShift assigns the container an arbitrary UID from
the namespace’s allowed range.
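The permission problem is plain Unix mode-bit logic. OpenShift runs the container with an arbitrary UID but always as a member of the root group (GID 0), so a directory is only writable if its group or other bits allow it. A simplified sketch of that check follows; the UIDs and modes are illustrative values I picked, and the function ignores supplementary groups and root's override.

```python
import stat


def can_write(mode: int, owner_uid: int, owner_gid: int,
              proc_uid: int, proc_gid: int = 0) -> bool:
    """Classic owner/group/other write check for a single path."""
    if proc_uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if proc_gid == owner_gid:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


arbitrary_uid = 1000660000  # example of a UID from a namespace range

# Directory owned by a fixed image user (hypothetical 1000:1000), mode 0755:
print(can_write(0o755, 1000, 1000, arbitrary_uid))  # False

# Root-owned directory opened up for group and other, mode 0777:
print(can_write(0o777, 0, 0, arbitrary_uid))  # True
```

The first case is the failure mode: the arbitrary UID matches neither owner nor group and falls through to the other bits, which lack write permission.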
The Solution
HashiCorp’s own OpenShift deployment guide covers this: build a thin wrapper
image that opens up the permissions the arbitrary UID needs. I wired it up as
an OpenShift BuildConfig that layers a couple of chmod calls onto the
upstream image:
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: tfc-agent
  namespace: tfe-agents
spec:
  output:
    to:
      kind: ImageStreamTag
      name: tfc-agent:latest
  source:
    type: Dockerfile
    dockerfile: |
      FROM docker.io/hashicorp/tfc-agent:latest
      USER root
      RUN mkdir /.tfc-agent && \
          chmod og+rw /.tfc-agent && \
          chmod o+rx /home/tfc-agent
      USER tfc-agent
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: DockerImage
        name: docker.io/hashicorp/tfc-agent:latest
TFE points at the built image via TFE_RUN_PIPELINE_IMAGE, referencing the
in-cluster image registry rather than Docker Hub directly. Re-running
oc start-build tfc-agent picks up new upstream releases of tfc-agent
without needing to restart TFE itself — the new image is used the next time
an agent pod spawns.
A Few Smaller Landmines
Two more things that didn’t warrant a full section but are worth knowing before you deploy TFE on OpenShift yourself:
- The chart hard-codes its own resource names. The Deployment and Service are both named the literal string terraform-enterprise, regardless of your Helm release name; nameOverride/fullnameOverride exist in the chart’s helpers but the main templates never use them. Anything that needs to reference TFE by name (VSO’s rolloutRestartTargets, a Route’s to.name) has to use terraform-enterprise, not your release name. The Service’s port is also named https-port, not the more conventional https; a Route with the wrong targetPort gives a silent 503 rather than an obvious error.
- The chart defaults to a LoadBalancer Service. On a bare-metal cluster with no LoadBalancer provider, that sits at EXTERNAL-IP: <pending> forever and keeps ArgoCD’s health check stuck on Progressing. Overriding to service.type: ClusterIP and fronting it with an OpenShift Route clears this up immediately.
Conclusion
Running TFE active-active on OpenShift is straightforward once you’ve hit each of these once, but most of them aren’t covered in the documentation:
- Active-active means lockstep versions: RollingUpdate will crash-loop a mixed-version rollout every time; the Recreate strategy is the only correct answer, at the cost of a brief outage on every deploy
- Dynamic Vault roles don’t fit every app: TFE’s stable-username, owns-the-schema requirement needed a static role layered onto the same database connection, with a one-time manual ownership handoff
- In-cluster S3 needs in-cluster trust: mount OpenShift’s own openshift-service-ca.crt ConfigMap if you’re pointing TFE at NooBaa instead of a cloud S3 endpoint
- The VSO stuck-reconciliation bug isn’t Vault-specific: it hit TFE’s S3 and Redis secrets independently, both times triggered by unrelated Vault HA events; the spec-patch workaround is now a standing runbook step
- OpenShift’s restricted SCC reaches into the job agent, too, not just the TFE pods themselves; budget time for the custom tfc-agent image build
Between this post and the Vault one, that’s the full HashiCorp-on-OpenShift stack this lab runs: Vault as the secrets backend, VSO as the sync layer, and TFE consuming both. The VSO bug aside, every failure mode above came from the same tension in different clothes: OpenShift’s stricter defaults (SCCs, arbitrary UIDs, in-cluster CAs) and TFE’s own strict operational assumptions (lockstep versions, stable database identity) both push back against defaults that work fine on vanilla Kubernetes.