
Introduction

In my last post, I covered the four challenges I hit running Vault Enterprise on OpenShift. That same lab cluster also runs Terraform Enterprise (TFE) in active-active mode, using that Vault cluster as its secrets backend. TFE brought its own set of OpenShift-specific problems — some in TFE itself, some in the supporting PostgreSQL and object storage layers, and one that turned out to be the same VSO bug from the Vault post showing up again in a different place.

The deployment follows the same GitOps pattern as Vault: ArgoCD with a three-source Helm setup (the upstream TFE chart, a values file from the repo, and raw manifests for the VSO custom resources, the Route, and the ImageStream/BuildConfig that the supporting pieces need). TFE runs with 2 replicas against a CloudNativePG-based PostgreSQL cluster, a Redis deployment for active-active coordination, and a NooBaa S3 bucket (via OpenShift Data Foundation) for object storage. Every secret TFE needs (license, encryption password, database credentials, Redis password, S3 keys, registry pull secret, TLS certificate) comes from the local Vault cluster through Vault Secrets Operator (VSO).

This post covers five things that caught me off guard: TFE’s hard refusal to run mixed versions during a rolling update, why one Postgres cluster ended up with two different Vault-managed role designs, a missing trust anchor for the in-cluster S3 endpoint, the VSO stuck-reconciliation bug recurring across two more secrets, and what it takes to get TFE’s job agent running under OpenShift’s restricted security model.


Challenge 1: Active-Active Means Lockstep Versions

The Problem

TFE’s active-active mode replicates its Rails processes across multiple pods that all have to agree on the running version. I found this out during what should have been a routine chart and image bump. The new pod came up as CrashLoopBackOff while the two existing pods stayed healthy:

startup check failed: check=upgrade ... TFE version (2.0.4) is not
inter-compatible with system version (2.0.2), to upgrade to this version
active TFE versions must be stopped

The root cause is the Helm chart’s default RollingUpdate strategy with maxSurge: 25% — it starts the new-version pod alongside the two still-running old-version pods to keep capacity up during the rollout. TFE’s own startup check treats that mixed-version state as invalid and refuses to come up, so the new pod crash-loops forever while the old pods sit there looking healthy.
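To see why a 25% surge bites even at two replicas, here is a minimal sketch of the rounding Kubernetes applies to percentage values: maxSurge rounds up, maxUnavailable rounds down. The helper names are hypothetical, and the maxUnavailable figure assumes the chart leaves it at the Kubernetes default of 25%.

```shell
# Kubernetes rounds a percentage maxSurge UP and a percentage maxUnavailable DOWN.
# Both helpers are illustrative names, not part of any tool.
surge_pods() {        # usage: surge_pods REPLICAS PERCENT
  echo $(( ($1 * $2 + 99) / 100 ))   # integer ceiling of REPLICAS * PERCENT / 100
}

unavailable_pods() {  # usage: unavailable_pods REPLICAS PERCENT
  echo $(( ($1 * $2) / 100 ))        # integer floor of REPLICAS * PERCENT / 100
}

echo "extra new pods allowed: $(surge_pods 2 25)"          # -> extra new pods allowed: 1
echo "old pods allowed to stop: $(unavailable_pods 2 25)"  # -> old pods allowed to stop: 0
```

With 2 replicas, the rollout has to start one new-version pod while both old pods are still serving, which is exactly the mixed-version state TFE rejects.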

The Solution

The fix is to pin the Deployment to Recreate:

# values.yaml
strategy:
  type: Recreate

Recreate tears down all pods before bringing any of them back up on the new image, so there’s never a moment where two versions coexist. The trade-off is real and worth stating plainly: every deploy that touches the pod template — image bump or not — causes a brief full TFE outage while pods restart (around 3 minutes, observed). For an active-active system, that’s a step back from zero-downtime rollouts, but it’s the only option that matches TFE’s lockstep version requirement.

If you ever hit the crash-loop mid-upgrade (say, the strategy setting got reverted by mistake), the recovery is a coordinated restart:

oc scale deployment terraform-enterprise -n tfe --replicas=0
oc scale deployment terraform-enterprise -n tfe --replicas=2

One subtlety if you’re running ArgoCD with selfHeal: true: it will fight the scale-to-0 and start restoring replicas almost immediately. That’s fine here — the new ReplicaSet on the new image brings both pods up together, which is exactly what you want. Both pods reached 1/1 Running within a few minutes.
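Since a reverted strategy setting is the likely way back into this crash-loop, a small guard in the deploy pipeline can catch it before the next image bump. This is a sketch only: the function name is hypothetical, and it assumes oc is already logged in to the cluster and that the Deployment is terraform-enterprise in the tfe namespace, as in the commands above.

```shell
# Hypothetical guard: succeeds only when the live Deployment uses the Recreate strategy.
# Returns 2 if the oc call itself fails, 1 if the strategy is anything else.
strategy_is_recreate() {   # usage: strategy_is_recreate NAMESPACE DEPLOYMENT
  live=$(oc get deployment "$2" -n "$1" -o jsonpath='{.spec.strategy.type}') || return 2
  [ "$live" = "Recreate" ]
}

# Example call (commented out so the snippet can be sourced without a cluster):
# strategy_is_recreate tfe terraform-enterprise || echo "strategy drifted; fix before upgrading" >&2
```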


Challenge 2: One Postgres Cluster, Two Vault Role Designs

The Problem

Vault’s database secrets engine has an obvious, idiomatic pattern: a dynamic role that creates a brand-new Postgres user on every lease, with a short TTL. That’s what I set up first:

vault write tfe-psql/roles/tfe-psql-role \
  db_name=tfe-psql \
  creation_statements="
    CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';
    GRANT ALL PRIVILEGES ON DATABASE tfe TO \"{{name}}\";
  " \
  default_ttl="1h" \
  max_ttl="24h"

It works — Vault happily mints a fresh, uniquely-named Postgres role every hour. The problem is that TFE doesn’t want that. It expects to connect as one stable database user across restarts, and its migrations need that user to own the database and the public schema — not just have grants on them. Dynamic per-lease roles with randomized names don’t fit that model at all.

The Solution

The fix was to layer Vault’s static role feature on top of the same database connection, rather than replacing the dynamic role. A static role manages the password for an existing, stable-named Postgres user on a fixed rotation schedule instead of creating new ones:

# Allow both role types on the same connection
vault write tfe-psql/config/tfe-psql \
  allowed_roles="tfe-psql-role,tfe-static"

# Register the static role
vault write tfe-psql/static-roles/tfe-static \
  db_name=tfe-psql \
  username=tfe_app \
  rotation_period=604800

But the tfe_app user has to exist and own the right objects before Vault can manage its password. That’s a one-time manual bootstrap: create the role, then hand over ownership as the Postgres superuser:

# Create the role using the cluster's admin user
psql "host=tfe-psql-primary.tfe.svc.cluster.local user=tfe dbname=tfe sslmode=require" \
  -c "CREATE ROLE tfe_app WITH LOGIN PASSWORD '${INIT_PW}';"

# As the postgres superuser, transfer ownership
psql -U postgres -d tfe -c "
  ALTER DATABASE tfe OWNER TO tfe_app;
  GRANT ALL ON SCHEMA public TO tfe_app;
  ALTER SCHEMA public OWNER TO tfe_app;
"

After that, Vault rotates tfe_app’s password every 7 days without ever changing the username, which is exactly what TFE needs. The dynamic role stays registered too — it’s a fine pattern for ad hoc admin access, just not for the application’s own connection.
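Before trusting the handoff, it can help to confirm that tfe_app really owns both objects. The sketch below is one way to check it, assuming psql is available and the connection string points at the tfe database; the function names are hypothetical.

```shell
# Hypothetical helper: prints "<database owner> <public schema owner>".
# Must connect to the tfe database itself, because pg_namespace is per-database.
tfe_owners() {   # usage: tfe_owners "<psql connection string>"
  psql -At -F ' ' -c "
    SELECT pg_catalog.pg_get_userbyid(d.datdba), n.nspowner::regrole
    FROM pg_database d, pg_namespace n
    WHERE d.datname = 'tfe' AND n.nspname = 'public';" "$1"
}

# Succeeds only when tfe_app owns both the database and the public schema.
owners_ok() {    # usage: owners_ok "<psql connection string>"
  [ "$(tfe_owners "$1")" = "tfe_app tfe_app" ]
}

# Example call (commented out so the snippet can be sourced without a database):
# owners_ok "host=tfe-psql-primary.tfe.svc.cluster.local user=tfe dbname=tfe sslmode=require"
```

Once ownership checks out, reading tfe-psql/static-creds/tfe-static from Vault should return the tfe_app username along with its current password.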


Challenge 3: NooBaa S3 and a Missing Trust Anchor

The Problem

TFE uses S3-compatible object storage for run logs and state, and this cluster provides that in-cluster via NooBaa (part of OpenShift Data Foundation) rather than a cloud S3 bucket. Pointing TFE at NooBaa’s internal endpoint seemed straightforward — until TFE crashed on startup with:

x509: certificate signed by unknown authority

The NooBaa S3 endpoint’s certificate is signed by OpenShift’s own internal service-CA, and TFE’s container doesn’t trust that CA by default. Nothing about the bucket configuration was wrong; TFE just couldn’t verify the TLS certificate on the connection.

The Solution

Every OpenShift namespace gets an auto-populated ConfigMap (openshift-service-ca.crt) containing exactly that CA certificate. The fix is to mount it into the path TFE’s chart already exposes for a custom CA bundle:

# values.yaml
tls:
  caCertBaseDir: /etc/ssl/certs
  caCertFileName: custom_ca_certs.pem

extraVolumes:
  - name: openshift-service-ca
    configMap:
      name: openshift-service-ca.crt
      items:
        - key: service-ca.crt
          path: custom_ca_certs.pem

extraVolumeMounts:
  - name: openshift-service-ca
    mountPath: /etc/ssl/certs/custom_ca_certs.pem
    subPath: custom_ca_certs.pem
    readOnly: true

Once that ConfigMap is mounted and referenced by caCertBaseDir/caCertFileName, TFE trusts the NooBaa endpoint’s certificate and connects cleanly.
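The same trust relationship can be checked outside TFE, which helps separate a CA problem from a bucket or credentials problem. This is a rough sketch with hypothetical function names; it assumes oc and openssl are available somewhere that can reach the endpoint, and the endpoint address in the example (s3.openshift-storage.svc:443) is an assumption about a default ODF install, not something taken from this cluster.

```shell
# Print the service CA from the namespace's auto-populated ConfigMap.
service_ca_pem() {     # usage: service_ca_pem NAMESPACE
  oc get configmap openshift-service-ca.crt -n "$1" -o jsonpath='{.data.service-ca\.crt}'
}

# Succeed only if the certificate chain served at HOST:PORT verifies against CA_FILE.
verifies_with_ca() {   # usage: verifies_with_ca HOST:PORT CA_FILE
  openssl s_client -connect "$1" -CAfile "$2" -verify_return_error \
    </dev/null >/dev/null 2>&1
}

# Example calls (commented out so the snippet can be sourced without a cluster):
# service_ca_pem tfe > /tmp/service-ca.pem
# verifies_with_ca s3.openshift-storage.svc:443 /tmp/service-ca.pem && echo "chain verifies"
```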

Worth noting on the bucket side: the ObjectBucketClaim that provisions the bucket also produces a Secret with the S3 access keys, but TFE never reads that Secret directly. Those keys get copied into Vault, and VSO syncs them back out into the Secret TFE actually consumes — keeping VSO/Vault as the single source of truth for every credential, not just the ones that originate there.


Challenge 4: The VSO Stuck-Reconciliation Bug, Again

If you read the Vault post, you’ll remember VSO 1.4.0’s bug where a VaultStaticSecret can stop reconciling after a transient Vault HA error — and doesn’t recover on its own even if you restart the VSO controller. That bug wasn’t a one-off. It hit twice more here, on two of TFE’s secrets, each time triggered by an unrelated Vault rollout causing a brief Raft leader election:

local node not active but active cluster node not found

Once on the VaultStaticSecret backing TFE’s S3 credentials, and once on the one backing its Redis password — both during Vault maintenance that had nothing to do with TFE. Same symptom both times: SecretSynced=False, a lastTransitionTime stuck days in the past, and zero new controller log lines for the resource.

The fix is the same spec-patch workaround from the Vault post — touch the resource to force VSO to requeue it, then revert:

oc -n tfe patch vaultstaticsecret tfe-s3 \
  --type=merge -p '{"spec":{"refreshAfter":"6m"}}'

# confirm SecretSynced=True and a fresh lastTransitionTime, then revert
oc -n tfe patch vaultstaticsecret tfe-s3 \
  --type=merge -p '{"spec":{"refreshAfter":"5m"}}'
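Because the symptom is silent, it is worth detecting mechanically rather than waiting for TFE to fail. The sketch below flags a secret whose SecretSynced condition is False and has not changed for longer than a threshold. The function names are hypothetical, the jsonpath assumes the usual Kubernetes conditions layout under .status.conditions, and the timestamp parsing assumes GNU date.

```shell
# Seconds between a timestamp and "now"; the second argument overrides "now" for testing.
age_seconds() {   # usage: age_seconds TIMESTAMP [NOW_TIMESTAMP]
  then_s=$(date -u -d "$1" +%s) || return 2
  now_s=$(date -u -d "${2:-now}" +%s) || return 2
  echo $(( now_s - then_s ))
}

# Hypothetical check: succeeds when the secret looks stuck (not synced, and stale).
looks_stuck() {   # usage: looks_stuck NAMESPACE NAME MAX_AGE_SECONDS
  cond='.status.conditions[?(@.type=="SecretSynced")]'
  status=$(oc -n "$1" get vaultstaticsecret "$2" -o jsonpath="{${cond}.status}") || return 2
  since=$(oc -n "$1" get vaultstaticsecret "$2" -o jsonpath="{${cond}.lastTransitionTime}") || return 2
  [ "$status" = "False" ] && [ "$(age_seconds "$since")" -gt "$3" ]
}

# Example call (commented out so the snippet can be sourced without a cluster):
# looks_stuck tfe tfe-s3 3600 && echo "tfe-s3 needs the spec-patch workaround" >&2
```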

There’s a related but distinct issue on TFE’s database credentials, which come from a VaultDynamicSecret rather than a VaultStaticSecret. VSO 1.4.0 doesn’t treat a static-role database response as renewable unless told otherwise, so without allowStaticCreds: true in the spec it syncs once and never refetches. The next scheduled password rotation then leaves TFE with a stale password and failing Postgres logins:

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: tfe-database
  namespace: tfe
spec:
  vaultAuthRef: tfe-auth
  mount: tfe-psql
  path: static-creds/tfe-static
  allowStaticCreds: true
  refreshAfter: 24h
  destination:
    name: tfe-database
    create: true
  rolloutRestartTargets:
    - kind: Deployment
      name: terraform-enterprise

Two different VSO resource types, two different failure modes, two different fixes. Both are now a standard part of the runbook rather than a surprise: between the Vault deployment and this one, the same bug family has hit three separate secrets.


Challenge 5: A Custom Job-Agent Image for OpenShift

The Problem

TFE runs each Terraform run in a short-lived “agent” container. The stock hashicorp/tfc-agent image fails under OpenShift’s restricted security model: it expects to write to a working directory owned by a fixed UID, which doesn’t hold up once OpenShift assigns the container an arbitrary UID from the namespace’s allowed range.

The Solution

HashiCorp’s own OpenShift deployment guide covers this: build a thin wrapper image that opens up the permissions the arbitrary UID needs. I wired it up as an OpenShift BuildConfig that layers a couple of chmod calls onto the upstream image:

apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: tfc-agent
  namespace: tfe-agents
spec:
  output:
    to:
      kind: ImageStreamTag
      name: tfc-agent:latest
  source:
    type: Dockerfile
    dockerfile: |
      FROM docker.io/hashicorp/tfc-agent:latest
      USER root
      RUN mkdir /.tfc-agent && \
          chmod og+rw /.tfc-agent && \
          chmod o+rx /home/tfc-agent
      USER tfc-agent
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: DockerImage
        name: docker.io/hashicorp/tfc-agent:latest

TFE points at the built image via TFE_RUN_PIPELINE_IMAGE, referencing the in-cluster image registry rather than Docker Hub directly. Re-running oc start-build tfc-agent picks up new upstream releases of tfc-agent without needing to restart TFE itself — the new image is used the next time an agent pod spawns.
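For reference, the value handed to TFE_RUN_PIPELINE_IMAGE can be derived from the BuildConfig’s output. The sketch below assumes the cluster uses OpenShift’s default internal registry service address; the function name is hypothetical.

```shell
# Build the in-cluster pull spec for an ImageStreamTag.
# Assumes the default internal registry service (image-registry.openshift-image-registry.svc:5000).
internal_image_ref() {   # usage: internal_image_ref NAMESPACE IMAGESTREAM TAG
  echo "image-registry.openshift-image-registry.svc:5000/$1/$2:$3"
}

internal_image_ref tfe-agents tfc-agent latest
# -> image-registry.openshift-image-registry.svc:5000/tfe-agents/tfc-agent:latest

# Rebuild against the newest upstream image and stream the build log
# (commented out so the snippet can be sourced without a cluster):
# oc start-build tfc-agent -n tfe-agents --follow
```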


A Few Smaller Landmines

Two more things that didn’t warrant a full section but are worth knowing before you deploy TFE on OpenShift yourself:

  • The chart hard-codes its own resource names. The Deployment and Service are both named the literal string terraform-enterprise, regardless of your Helm release name — nameOverride/fullnameOverride exist in the chart’s helpers but the main templates never use them. Anything that needs to reference TFE by name (VSO’s rolloutRestartTargets, a Route’s to.name) has to use terraform-enterprise, not your release name. The Service’s port is also named https-port, not the more conventional https — a Route with the wrong targetPort gives a silent 503 rather than an obvious error.
  • The chart defaults to a LoadBalancer Service. On a bare-metal cluster with no LoadBalancer provider, that sits at EXTERNAL-IP: <pending> forever and keeps ArgoCD’s health check stuck on Progressing. Overriding to service.type: ClusterIP and fronting it with an OpenShift Route clears this up immediately.
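Since a mismatched targetPort fails silently, a quick check of the Service’s actual port names before writing the Route can save some head-scratching. A minimal sketch, with a hypothetical function name and assuming oc is logged in:

```shell
# Succeeds when SERVICE exposes a port with the given name.
# Returns 2 if the oc call fails, 1 if no port has that name.
service_has_port() {   # usage: service_has_port NAMESPACE SERVICE PORT_NAME
  names=$(oc get service "$2" -n "$1" -o jsonpath='{.spec.ports[*].name}') || return 2
  for n in $names; do
    [ "$n" = "$3" ] && return 0
  done
  return 1
}

# Example call (commented out so the snippet can be sourced without a cluster):
# service_has_port tfe terraform-enterprise https-port || echo "Route targetPort will not match" >&2
```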

Conclusion

Running TFE active-active on OpenShift is straightforward once you’ve hit each of these once, but apart from the job-agent image, which HashiCorp’s OpenShift guide does cover, the documentation doesn’t mention them:

  • Active-active means lockstep versions: RollingUpdate will crash-loop a mixed-version rollout every time; strategy: Recreate is the only correct answer, at the cost of a brief outage on every deploy
  • Dynamic Vault roles don’t fit every app — TFE’s stable-username, owns-the-schema requirement needed a static role layered onto the same database connection, with a one-time manual ownership handoff
  • In-cluster S3 needs in-cluster trust — mount OpenShift’s own openshift-service-ca.crt ConfigMap if you’re pointing TFE at NooBaa instead of a cloud S3 endpoint
  • The VSO stuck-reconciliation bug isn’t Vault-specific — it hit TFE’s S3 and Redis secrets independently, both times triggered by unrelated Vault HA events; the spec-patch workaround is now a standing runbook step
  • OpenShift’s restricted SCC reaches into the job agent, too — not just the TFE pods themselves; budget time for the custom tfc-agent image build

Between this post and the Vault one, that’s the full HashiCorp-on-OpenShift stack this lab runs — Vault as the secrets backend, VSO as the sync layer, and TFE consuming both. Every failure mode above came from the same root cause in different clothes: OpenShift’s stricter defaults (SCCs, arbitrary UIDs, in-cluster CAs) and TFE’s own strict operational assumptions (lockstep versions, stable database identity) both push back against defaults that work fine on vanilla Kubernetes.