
Infrastructure Implementation Tasks/Plan

Detailed Infra Tasks (Redis / DB / Services / Domains / Networking / Security / Deployment / Operations)


Audience: DevOps engineers, platform engineers, backend leads, security engineers, SRE/operations, architecture leads
Status: execution-grade infrastructure specification
Goal: define, in practical detail, how to deploy, secure, connect, observe, scale, and operate the full Kisum platform infrastructure for:

  • Auth Backend
  • Platform Core Backend
  • Platform Admin Backend
  • Basic/Core Backend
  • Finance Backend
  • Market Backend
  • Touring Backend
  • Venue Backend
  • AI Backend
  • PostgreSQL databases
  • Redis
  • domains / DNS / TLS
  • networking / ingress / egress
  • secrets / credentials
  • backups / restore
  • logging / metrics / tracing
  • CI/CD / release strategy
  • incidents / runbooks / disaster recovery

This document is intentionally long and explicit.

It is not a product brief.
It is not a high-level architecture note.
It is not a slide deck.

It is meant to answer questions like:

  • where exactly should each service run?
  • what domains should point where?
  • what databases exist?
  • which services can talk to which others?
  • where does Redis live?
  • what must be private vs public?
  • how should environment variables be managed?
  • how should certificates be handled?
  • what gets backed up?
  • how do we deploy without breaking production?
  • what happens when Redis is down?
  • how do we rotate JWT signing keys safely?
  • how do we restore auth if the DB is corrupted?
  • what monitoring and alerts are mandatory before go-live?

Where there is still an application-level TBD, this document states that clearly rather than inventing fake certainty.


The infrastructure must support the agreed architecture:

  • Auth = identity, sessions, memberships, roles, permissions, delegation, access aggregation
  • Platform Core = packages, add-ons, modules, company entitlements
  • Business Backends = business data and business logic
  • Frontend = consumes backend APIs; does not own security truth
  • Redis = performance layer, never source of truth
  • PostgreSQL = source of truth for relational system state

The infrastructure must also support these business/technical rules:

  1. JWT remains small and identity-only.
  2. Access is recomputed from Auth + Core data, not embedded in JWT.
  3. Core App is only visible when Basic subscription is active.
  4. Add-on modules may be active even if Basic is inactive.
  5. Business backends must validate and enforce access consistently.
  6. Platform Core is internal for entitlements and should not become a public authorization engine.
  7. Frontend must call Auth for access bootstrap, not Core directly.

The platform should be thought of as these deployable groups:

These accept browser/client traffic from the public internet.

  • Frontend App
  • Platform Admin Frontend
  • Auth Backend
  • Platform Admin Backend
  • Basic Backend
  • Finance Backend
  • Market Backend
  • Touring Backend
  • Venue Backend
  • AI Backend

These should not be directly callable by browsers/public clients unless there is a very specific decision to expose them.

  • Platform Core Backend
  • internal supporting jobs/workers, if introduced
  • optional internal cache warming jobs
  • optional scheduled reconciliation jobs

These hold or accelerate platform state.

  • PostgreSQL for Auth
  • PostgreSQL for Platform Core
  • MongoDB for Main/Basic (current)
  • PostgreSQL for Finance
  • MongoDB for Market/Touring (current)
  • Venue DB (TBD)
  • AI DB / storage (TBD)
  • Redis

These support routing, delivery, and operations.

  • DNS provider
  • TLS certificate management
  • reverse proxy / ingress
  • CI/CD
  • secret manager or secret injection mechanism
  • logging sink
  • metrics backend
  • alerting system
  • backup jobs
  • restore tooling

At minimum, infrastructure should support:

  • local
  • dev
  • staging
  • production

Optional:

  • preview
  • qa
  • disaster-recovery / warm standby

Local

Purpose:

  • individual development
  • integration tests with minimal dependencies

Should support:

  • local Auth
  • local Core
  • optional local Redis
  • local PostgreSQL
  • mocked or dev versions of other backends

Dev

Purpose:

  • shared development environment
  • integration testing by multiple engineers

Should support:

  • all services deployable
  • test domains
  • real Redis
  • real Postgres instances
  • non-production secrets
  • isolated data

Staging

Purpose:

  • production-like validation
  • release candidate testing
  • smoke tests
  • load testing before prod

Should support:

  • same routing pattern as prod
  • same TLS pattern as prod
  • same secret layout pattern as prod
  • realistic DB/Redis services
  • monitoring and alerts enabled

Production

Purpose:

  • customer traffic
  • platform admin traffic
  • real business operations

Must have:

  • secure TLS
  • monitored DB/Redis
  • backups
  • restore runbooks
  • incident alerting
  • environment separation from staging
  • rollout and rollback plan

Planned domain → service mapping:

  • auth.kisum.io → Auth Backend
  • app.kisum.io → main frontend / Core App shell / module routes
  • admin.kisum.io → platform admin frontend
  • api-v2.kisum.dev (or production equivalent under kisum.io if desired) → Basic Backend
  • api-v2-finance.kisum.dev → Finance Backend
  • api-v2-market.kisum.dev → Market Backend
  • api-v2-touring.kisum.dev → Touring Backend (if separate)
  • api.kisum.dev/venue or dedicated subdomain → Venue Backend
  • api-v2-ai.kisum.dev → AI Backend
  • core.kisum.io or platform-core.kisum.io → Platform Core Backend
  • optional internal admin-to-core/internal-to-auth hostnames depending on network design

Option A — keep current mixed domain pattern temporarily


This reflects your current system reality and reduces migration friction.

Option B — normalize all production domains later


Examples:

  • api-basic.kisum.io
  • api-finance.kisum.io
  • api-market.kisum.io

This can be cleaner long term, but not required immediately.

For each public hostname, define:

  • A / AAAA or CNAME depending on hosting model
  • TTL appropriate for your infra
  • separate staging/dev DNS namespace
  • no shared records between prod and non-prod

Examples:

  • auth.kisum.io
  • auth-staging.kisum.io
  • core-staging.kisum.io
  • app-staging.kisum.io

Tasks:

  • document all DNS records in infra repo
  • avoid manual undocumented DNS edits
  • define ownership for DNS changes
  • use lower TTL before cutovers or migrations
  • record certificate dependencies for each hostname

TLS is mandatory for:

  • frontend
  • auth
  • admin
  • all public APIs

Recommended for internal service-to-service traffic:

  • still use TLS if routed through internal ingress/service mesh
  • otherwise restrict to private networking and authenticated service calls

Certificate tasks:

  • choose cert strategy:
    • managed certs via cloud LB
    • or Let’s Encrypt via ingress / reverse proxy
  • ensure auto-renewal
  • alert before expiration
  • test renewals in staging

For frontend and auth domains, configure:

  • HSTS
  • X-Content-Type-Options
  • X-Frame-Options or CSP equivalent as appropriate
  • Referrer-Policy
  • secure cookie policy if cookies are used anywhere
  • CSP for frontends if feasible

Network exposure must be decided for each component:

  • Auth Backend
  • Frontend(s)
  • Platform Admin Backend
  • Basic Backend
  • Module backends
  • Platform Core Backend
  • databases
  • Redis
  • internal job runners

Default rule:

  • if a service does not need public internet traffic, do not expose it publicly

Platform Core is the most important example:

  • it should be reachable by Auth and Admin backend
  • it should not be a public browser API

Recommended segmentation:

  • ingress/public subnet or layer
  • app/services private subnet or network
  • data subnet or private data layer

At minimum:

  • public reverse proxy or LB
  • private app-to-app connectivity
  • private DB and Redis connectivity

Auth Backend can talk to:

  • Auth DB
  • Redis
  • Platform Core Backend
  • optional email provider (SES etc.)
  • optional internal admin/backend tools

Should not need direct access to:

  • module DBs
  • module business backends for normal runtime

Platform Core Backend can talk to:

  • Platform Core DB
  • optional billing provider APIs
  • optional internal messaging bus
  • optional Redis if used for local acceleration

Should not need direct access to:

  • Auth DB
  • module DBs

Platform Admin Backend can talk to:

  • Auth Backend
  • Platform Core Backend
  • optional admin DB/config DB

Basic Backend can talk to:

  • Main DB
  • Auth/JWKS or Auth internal access route
  • optional Redis if using shared cache path

Finance Backend can talk to:

  • Finance DB
  • Auth/JWKS or Auth internal access route

Market/Touring Backend can talk to:

  • Market/Touring DB
  • Auth/JWKS or Auth internal access route

Venue Backend can talk to:

  • Venue DB
  • Auth/JWKS or Auth internal access route

AI Backend can talk to:

  • AI DB/storage
  • Auth/JWKS or Auth internal access route
  • optional model providers

Databases and Redis should never accept public internet traffic directly.


This section does not force one provider, but it defines what the hosting model must support.

Pattern A — VMs / processes

Services run in containers or processes on VMs. Good when:

  • you want direct control
  • you already use VPS infrastructure
  • you want simpler networking and lower moving parts early

Pattern B — container platform / orchestrator


Services run in managed containers / Kubernetes / Nomad / etc. Good when:

  • you want stronger scaling and orchestration
  • you have DevOps maturity for it

Pattern C — hybrid

  • frontends on managed frontend hosting/CDN
  • APIs on containers/VMs
  • DB/Redis managed

This is often the most practical for a growing product.

Given your current environment history and the service split, a pragmatic production layout is:

  • Frontends on managed static/app hosting or dedicated app servers behind CDN
  • Auth, Core, Admin, and module backends in containers on private app hosts or managed app platform
  • PostgreSQL managed where possible
  • Redis managed where possible
  • reverse proxy or load balancer in front

This gives:

  • cleaner deployments
  • safer DB operations
  • less ops burden on the most critical stateful services

Each of these should be independently deployable:

  • auth-backend
  • platform-core-backend
  • platform-admin-backend
  • basic-backend
  • finance-backend
  • market-backend
  • touring-backend (if separate)
  • venue-backend
  • ai-backend

Each needs:

  • its own config
  • its own runtime health checks
  • its own logs
  • its own release history

Type: PostgreSQL
Name: auth_db

Stores:

  • users
  • sessions
  • company memberships
  • module grants
  • permissions
  • delegation
  • token/versioning-related state
  • audit/security records if stored in DB

Tasks:

  • create managed or dedicated PostgreSQL instance/database
  • enforce TLS for DB connections where supported
  • create app role/user with least privilege
  • apply migrations through CI/CD
  • enable PITR if provider supports it
  • define backup schedule
  • test restore

Hot-path DB. Expect frequent reads for:

  • user/session checks
  • membership checks
  • access aggregation

Therefore:

  • proper indexes are mandatory
  • pooling must be tuned
  • read patterns must be measured

Type: PostgreSQL
Name: platform_core_db

Stores:

  • modules
  • packages
  • add-ons
  • package/add-on mappings
  • company subscriptions
  • company add-ons
  • entitlement versioning
  • entitlement history optionally

Tasks:

  • create separate DB or at minimum separate schema
  • migrations managed independently from Auth
  • backup/restore independently
  • secure private connectivity only

Type: current main business DB (MongoDB Main per current documentation)

Stores:

  • core/basic business data

Tasks:

  • document exact DB cluster/instance
  • confirm connection policy
  • confirm backup policy
  • define read/write credentials for Basic Backend only
  • review indexes for core business routes

Type: PostgreSQL (Finance)

Tasks:

  • verify production DB sizing
  • verify migrations
  • verify backup and restore policy
  • ensure Finance backend has only required role access

Type: MongoDB (Market/Touring, current direction)

Tasks:

  • confirm whether shared cluster or DB namespace
  • document collections and ownership
  • backup schedule
  • restore test
  • security policy for shared Market/Touring service

Type: TBD

Tasks:

  • finalize DB engine
  • define owner service
  • define backup plan
  • define access credentials

Type: TBD

May include:

  • relational DB
  • document store
  • object storage
  • vector storage depending on AI design

Tasks:

  • finalize data types
  • define retention rules
  • define cost controls
  • define model artifact storage if any

These apply to Auth, Platform Core, and Finance where relevant.

For each application DB:

  • create separate DB user per service
  • no superuser for application runtime
  • migrations may use a stronger controlled role
  • app user should only access its own schema/tables

Tasks:

  • configure max connections per service
  • decide whether pooling is done:
    • in-app
    • via PgBouncer
    • or managed DB proxy
  • prevent connection exhaustion during spikes

Tasks:

  • choose migration tool
  • maintain migration repo/folder
  • run in CI/CD before app rollout when appropriate
  • protect prod with migration review
  • support rollback strategy for reversible migrations
  • define non-reversible migration warnings

Tasks:

  • daily full backups at minimum
  • WAL/PITR if supported
  • encrypted storage for backups
  • retention policy
  • restore test at least periodically

Track:

  • CPU
  • memory
  • storage
  • connection count
  • slow queries
  • replication lag if any
  • lock contention
  • migration failures

These apply to Main and Market/Touring if still on Mongo.

Tasks:

  • document version
  • document replica set/sharding status
  • confirm backup method
  • confirm auth enabled
  • confirm TLS enabled if supported/used
  • confirm private network only

Tasks:

  • separate DB users per service
  • least privilege roles
  • no shared root creds in app runtime

Tasks:

  • automated backups
  • retention policy
  • restore drill
  • document RPO/RTO expectations

Tasks:

  • review indexes on hot collections
  • review unbounded collection growth
  • review heavy aggregation/query paths
  • monitor oplog/replica health if applicable

Redis is critical but is not source of truth.

Redis may be used for:

  • access-context cache
  • session hot cache
  • revocation markers
  • login rate limiting
  • refresh throttling
  • password-reset throttling
  • company resolution cache
  • optional internal access snapshot cache
  • any other hot-path cache that preserves correctness when lost

Preferred:

  • managed Redis or highly available Redis service

Alternative:

  • self-managed Redis with persistence and restart strategy

Mandatory rule:

  • if Redis is empty, wrong, or unavailable, the system must still be able to rebuild truth from DBs
  • correctness must not depend on Redis being the sole store of critical state

Suggested key patterns:

```
access:{companyId}:{membershipId}:{accessVersion}:{entitlementVersion}
session:{sessionId}
revoked:session:{sessionId}
ratelimit:login:ip:{ip}
ratelimit:login:email:{email}
company-map:{raw-x-org}
```
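The version components in the access key make invalidation implicit: bumping either version makes old entries unreachable, and they simply age out via TTL. A minimal sketch, with illustrative names:

```typescript
// Sketch: building versioned access-cache keys. Field names are illustrative.
// Because accessVersion and entitlementVersion are embedded in the key, a
// version bump orphans stale entries instead of requiring explicit deletes.

interface AccessKeyParts {
  companyId: string;
  membershipId: string;
  accessVersion: number;
  entitlementVersion: number;
}

function accessCacheKey(p: AccessKeyParts): string {
  return `access:${p.companyId}:${p.membershipId}:${p.accessVersion}:${p.entitlementVersion}`;
}

function sessionCacheKey(sessionId: string): string {
  return `session:${sessionId}`;
}

const key = accessCacheKey({
  companyId: "c1",
  membershipId: "m1",
  accessVersion: 3,
  entitlementVersion: 7,
});
// key is "access:c1:m1:3:7"
```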

Recommended starting points:

  • access cache: 5–15 minutes
  • session hot cache: aligned to short token/session needs
  • revocation marker: at least until all access tokens tied to that session are expired
  • rate limit keys: according to endpoint policy

Define expected behavior if Redis is unavailable:

  • login may temporarily lose rate limiting if no fallback exists
  • session/access lookups should fall back to DB
  • service should degrade, not catastrophically fail, unless a specific route is designed otherwise
  • access context should be refetched from Auth/DB path
  • performance degrades but correctness remains
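The degradation rule above can be sketched as a cache-aside helper where any Redis failure falls through to the DB path. The `Cache` interface, key, and TTL here are illustrative stand-ins, not a specific Redis client:

```typescript
// Sketch: Redis as a read-through accelerator only. If the cache errors,
// the DB loader is the source of truth; performance degrades, correctness
// does not.

interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getAccessContext(
  cache: Cache,
  key: string,
  loadFromDb: () => Promise<object>,
): Promise<object> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit);
  } catch {
    // Redis unreachable: ignore and fall through to the DB path.
  }
  const fresh = await loadFromDb();
  try {
    await cache.set(key, JSON.stringify(fresh), 600); // ~10 min, per TTL guidance
  } catch {
    // Failing to repopulate the cache is not an error.
  }
  return fresh;
}
```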

Track:

  • memory usage
  • CPU
  • evictions
  • latency
  • keyspace misses/hits
  • connection count
  • persistence failures if persistence enabled

Infra must manage these separately:

  • JWT private signing keys
  • JWT public key metadata/JWKS config
  • DB credentials per service
  • Redis credentials
  • internal API keys / service tokens
  • billing provider keys
  • email provider keys
  • AI provider keys
  • admin bootstrap credentials if any
  • TLS/cert-related secrets where applicable

Rules:

  • never hardcode secrets in repo
  • never share one DB password across all services
  • rotate secrets on schedule or after incident
  • use environment injection or secret manager
  • restrict who can read prod secrets

Tasks:

  • generate production RSA keypair
  • keep private key in secure secret store
  • expose public key via JWKS
  • support key rotation using kid
  • document rotation runbook
  • ensure old public keys remain available until old tokens expire
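A minimal sketch of the keypair/JWKS side using Node's built-in crypto; the `kid` value and endpoint path are illustrative assumptions:

```typescript
// Sketch: generate an RSA keypair and publish the public half as a JWKS
// document carrying a kid. The private key stays in the secret store and is
// only loaded by the signing service — it never appears in the JWKS.
import { generateKeyPairSync } from "node:crypto";

const { publicKey, privateKey } = generateKeyPairSync("rsa", {
  modulusLength: 2048,
});
// privateKey -> secret manager; publicKey -> JWKS below.

const kid = "2024-06-rotation-1"; // illustrative; often a date stamp or key hash

const jwk = publicKey.export({ format: "jwk" }) as Record<string, unknown>;
const jwks = {
  keys: [{ ...jwk, kid, use: "sig", alg: "RS256" }],
};

// Serve `jwks` at the JWKS endpoint. During rotation, keep the previous
// public key in `keys` until all tokens signed with it have expired.
```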

Tasks:

  • define how Auth ↔ Core and Admin ↔ Core authenticate internally
  • options:
    • internal API key
    • mTLS
    • signed service tokens
    • private allowlist + shared auth
  • choose one and document rollout

13. Environment variables and config management


Each service should have:

  • typed config
  • required vs optional env validation
  • startup failure if critical env missing
  • no hidden config defaults for security-sensitive values

Examples (Auth Backend):

  • AUTH_DB_URL
  • REDIS_URL
  • JWT_ISSUER
  • JWT_AUDIENCE
  • JWT_PRIVATE_KEY
  • JWT_PUBLIC_KID
  • ACCESS_TOKEN_TTL
  • REFRESH_TOKEN_TTL
  • INTERNAL_API_KEY (if used)
  • email provider config
  • rate limit settings
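A minimal sketch of fail-fast env validation, assuming a Node runtime; `loadAuthConfig` and the default TTL are illustrative, though the variable names match the examples above:

```typescript
// Sketch: crash at startup if a critical env var is missing. No hidden
// defaults for security-sensitive values.

function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

interface AuthConfig {
  authDbUrl: string;
  jwtIssuer: string;
  accessTokenTtlSeconds: number;
}

function loadAuthConfig(): AuthConfig {
  return {
    authDbUrl: requireEnv("AUTH_DB_URL"),
    jwtIssuer: requireEnv("JWT_ISSUER"),
    // Optional, non-security-sensitive values may have safe defaults:
    accessTokenTtlSeconds: Number(process.env.ACCESS_TOKEN_TTL ?? "900"),
  };
}
```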

Examples (Platform Core Backend):

  • CORE_DB_URL
  • internal auth mechanism config
  • billing provider config if introduced
  • feature toggles for self-service billing

Each backend needs:

  • its DB URL
  • Auth/JWKS config
  • service name
  • x-org policy config if applicable
  • optional Redis config if used
  • logging/metrics config

Even though this is infra, document that frontend deployments need:

  • auth base URL
  • backend base URLs per service
  • admin base URL
  • environment flags
  • public domain config

For each service:

  • lint
  • type/build step
  • test step
  • security/dependency scan if possible
  • artifact/container build
  • deployment step
  • post-deploy health check

For services with DB migrations:

  • migrations should run in controlled step
  • migration success must be checked before full rollout
  • rollback policy documented

Recommended:

  • rolling deploy or blue/green/canary where supported
  • no big-bang deploy for auth-critical changes
  • stage auth and core carefully before frontend consuming new contracts

Preferred order:

  1. Core DB changes
  2. Core backend changes
  3. Auth DB changes
  4. Auth backend changes
  5. Module backend changes
  6. frontend changes

Exact order may vary by compatibility, but do not deploy frontend assuming contracts that backend has not shipped yet.

Each deployment must define:

  • what can be rolled back immediately
  • what DB migrations make rollback harder
  • what feature flags can disable broken behavior quickly

Each service should expose:

  • /health
  • /ready

Liveness only answers:

  • process is alive
  • event loop/runtime not deadlocked

Readiness should answer:

  • service can handle traffic now
  • critical dependencies reachable

Examples:

Auth Backend readiness:

  • Auth DB reachable
  • Redis reachable if configured as required dependency
  • signing keys loaded
  • optional Core connectivity if a route requires it — usually not a hard dependency for process readiness, but it should be separately monitored

Platform Core readiness:

  • Core DB reachable
  • internal auth config loaded if needed

Module backend readiness:

  • own DB reachable
  • Auth/JWKS config loaded
  • optional Redis reachable if required
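The liveness/readiness split can be sketched as injected dependency checks, so each service declares its own hard dependencies; the check names are illustrative:

```typescript
// Sketch: liveness is trivial (the process answering is the signal);
// readiness runs every registered dependency check and reports failures.

type Check = () => Promise<void>; // throws if the dependency is unhealthy

async function readiness(
  checks: Record<string, Check>,
): Promise<{ ready: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
    } catch {
      failures.push(name);
    }
  }
  return { ready: failures.length === 0, failures };
}

function liveness(): { alive: boolean } {
  return { alive: true };
}
```

Wire `liveness` to GET /health and `readiness` to GET /ready, returning 503 with the failure list when not ready.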

16. Internal service-to-service authentication


Auth must call Core.
Admin backend must call Auth and Core.
Module backends may call Auth internal context APIs.

These are not browser calls and should not rely on public user JWT flows.

Internal API key — simplest early-stage pattern.

Pros:

  • easy
  • fast to implement

Cons:

  • weaker than mTLS/service identity
  • rotation discipline needed

Signed service tokens.

Pros:

  • stronger separation
  • auditable identity

Cons:

  • more implementation complexity

mTLS.

Pros:

  • strongest model

Cons:

  • most operational complexity

Start with:

  • private networking
  • internal API key or signed internal service token
  • explicit allowlist on internal services

Then evolve if needed.

Tasks:

  • choose internal auth mechanism
  • document headers/validation
  • ensure Core rejects public unauthorized traffic
  • ensure internal routes are not accidentally exposed
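A minimal sketch of the internal API key check, assuming a Node runtime; the header wiring in the comment is an illustrative assumption:

```typescript
// Sketch: validate an internal service key with a constant-time comparison,
// so timing differences don't leak key prefixes.
import { timingSafeEqual } from "node:crypto";

function isInternalRequestAuthorized(
  presentedKey: string | undefined,
  expectedKey: string,
): boolean {
  if (!presentedKey) return false;
  const a = Buffer.from(presentedKey);
  const b = Buffer.from(expectedKey);
  if (a.length !== b.length) return false; // timingSafeEqual requires equal lengths
  return timingSafeEqual(a, b);
}

// e.g. in Platform Core (header name illustrative):
//   if (!isInternalRequestAuthorized(req.headers["x-internal-key"], INTERNAL_API_KEY)) {
//     reject with 401;
//   }
```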

17. Ingress / reverse proxy / gateway tasks


The ingress layer must handle:

  • TLS termination
  • host-based routing
  • request size limits where needed
  • timeout configuration
  • rate limiting where appropriate
  • forwarding headers correctly
  • access logs

Examples:

  • auth.kisum.io/* → Auth
  • admin.kisum.io/* → Admin frontend
  • app.kisum.io/* → main frontend
  • api-v2-finance.kisum.dev/* → Finance backend
  • etc.

Ensure backends can trust and parse:

  • X-Forwarded-For
  • X-Forwarded-Proto
  • host headers if sitting behind reverse proxy/LB
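A minimal sketch of recovering the client IP from X-Forwarded-For; it assumes the header was appended only by your own trusted proxy — if clients can reach the backend directly, the header is spoofable and only the hops your proxies added are trustworthy:

```typescript
// Sketch: X-Forwarded-For is "client, proxy1, proxy2"; the left-most entry
// is the original client when every hop in the chain is trusted.

function clientIpFromForwardedFor(
  header: string | undefined,
  fallback: string,
): string {
  if (!header) return fallback;
  const first = header.split(",")[0]?.trim();
  return first || fallback;
}
```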

Define:

  • connect timeout
  • read timeout
  • idle timeout
  • upstream timeout per service type

AI endpoints may need longer timeout policies than auth.

For every service ask:

  • must this be public?
  • must this route be public?
  • should this be private/internal only?

Auth Backend tasks:

  • enforce strict issuer/audience verification
  • rate limit login
  • rotate refresh tokens
  • store refresh tokens hashed
  • revoke sessions cleanly
  • audit auth-sensitive actions
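A minimal sketch of hashed refresh-token storage; it assumes refresh tokens are high-entropy random values (so plain SHA-256 suffices, unlike passwords), and the lengths and encodings are illustrative:

```typescript
// Sketch: generate a random refresh token, store only its hash, and look
// tokens up by hash on use. A DB leak then exposes no usable refresh tokens.
import { createHash, randomBytes } from "node:crypto";

function hashRefreshToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

function newRefreshToken(): { token: string; storedHash: string } {
  const token = randomBytes(32).toString("base64url"); // sent to the client
  return { token, storedHash: hashRefreshToken(token) }; // hash goes to auth_db
}

// On refresh: hash the presented token and look that hash up in auth_db;
// never store or compare the raw token server-side.
```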

Platform Core tasks:

  • keep internal-only if possible
  • validate internal auth on all internal routes
  • audit entitlement changes
  • protect product catalog changes

Module backend tasks:

  • never trust frontend visibility
  • always validate JWT
  • always resolve access context
  • never assume one module implies another unless explicitly designed

Platform Admin tasks:

  • strong RBAC for platform admins
  • audit every subscription/catalog change
  • protect high-risk actions with confirmation UX/server checks
  • consider step-up auth later for highest-risk actions if needed

Every service should log in structured form with:

  • timestamp
  • service name
  • environment
  • request ID
  • route
  • user/session/company identifiers when safe
  • outcome status
  • error code/message

Auth events to log:

  • login success/failure
  • refresh success/failure
  • logout
  • logout-all
  • session revoked
  • access aggregation failure
  • access denied due to inactive/revoked state

Platform Core events to log:

  • subscription changes
  • add-on changes
  • catalog changes
  • entitlement version bumps

Module backend events to log:

  • auth verification failure
  • missing/invalid x-org
  • access denied by module
  • access denied by permission

Admin events to log:

  • product/catalog create/update/delete
  • organization approval/rejection
  • company subscription changes
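A minimal sketch of one JSON log line per event; the field set mirrors the structured-field list above, and the exact shape is an illustrative assumption:

```typescript
// Sketch: emit one JSON object per line so any log sink can parse events
// without custom grammar.

interface LogEvent {
  service: string;
  env: string;
  requestId: string;
  route: string;
  outcome: "ok" | "denied" | "error";
  event: string;
  userId?: string;    // include identifiers only when safe
  companyId?: string;
  errorCode?: string;
}

function logLine(e: LogEvent): string {
  return JSON.stringify({ timestamp: new Date().toISOString(), ...e });
}

// e.g. console.log(logLine({ service: "auth-backend", env: "prod",
//   requestId: "r-123", route: "/auth/login", outcome: "denied",
//   event: "login-failure" }));
```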

Per service:

  • request count
  • latency
  • error rate
  • 4xx/5xx rate
  • CPU
  • memory
  • restart count

Auth metrics:

  • login success rate
  • login failure rate
  • refresh success rate
  • revoked-session usage
  • access cache hit/miss
  • access merge latency
  • Core entitlement lookup latency

Platform Core metrics:

  • entitlement lookup latency
  • catalog mutation count
  • entitlement version bump count
  • subscription/add-on change rate

Module backend metrics:

  • access denied by module
  • access denied by permission
  • x-org resolution failures
  • internal access-context fetch latency

Create alerts for:

  • service down
  • readiness failing
  • DB unavailable
  • Redis unavailable
  • elevated 5xx
  • elevated login failures
  • unusual auth denial spikes
  • slow entitlement/access aggregation
  • certificate expiry nearing
  • backup failures

Recommended for:

  • frontend request → auth → core
  • frontend request → backend → auth access context
  • admin change → core/auth → cache invalidation

Use request IDs and tracing headers consistently across:

  • auth
  • core
  • admin backend
  • module backends

This is especially useful for debugging:

  • “why did user lose access?”
  • “why did module show but API returned 403?”
  • “why did new subscription not appear immediately?”

Every stateful DB must have:

  • automated backups
  • defined retention
  • restore testing
  • documented RPO/RTO expectations

Redis backup is optional depending on role, but:

  • if persistence enabled, define retention and recovery
  • if pure cache, losing it should only impact performance

At least periodically test:

  • Auth DB restore
  • Core DB restore
  • Finance DB restore
  • Mongo restore for Main / Market/Touring

Likely restore priority:

  1. Auth DB
  2. Core DB
  3. Redis (or allow rebuild)
  4. Basic/Main DB
  5. Finance DB
  6. other module DBs

Why:

  • without Auth and Core, platform access model is unusable

23. Disaster recovery and business continuity


Per critical service define:

  • RPO (acceptable data loss)
  • RTO (acceptable downtime)

Criticality:

  • Auth: highest
  • Core: highest
  • Finance: high
  • Basic/Main: high
  • Market/Touring: medium-high
  • Venue/AI: medium depending on business usage

Tasks:

  • decide if warm standby required for Auth/Core DBs
  • document failover strategy
  • document how DNS/LB routing would switch if region/service fails
  • ensure secrets available in DR environment
  • ensure restore scripts are not tribal knowledge

Before prod launch estimate:

  • daily active users
  • concurrent auth requests
  • access bootstrap frequency
  • module backend QPS
  • admin change frequency
  • report-heavy workloads
  • AI load patterns

Auth will be on the hot path for:

  • login
  • refresh
  • session restore
  • access aggregation
  • backend access checks (depending on chosen pattern)

So:

  • scale Auth horizontally if needed
  • ensure DB and Redis can support it
  • cache wisely without hiding truth in cache

Core traffic volume should be lower than Auth, but:

  • entitlement reads will happen often through Auth
  • writes are lower but highly important

Module backends scale based on business traffic. Finance and AI may have very different resource profiles:

  • Finance: transactional/reporting
  • AI: potentially heavy CPU/network calls

Recommended order:

  1. provision Auth DB + Core DB + Redis
  2. deploy Auth
  3. deploy Core
  4. validate internal Auth ↔ Core
  5. deploy Business backends with new auth validation path
  6. deploy Admin backend
  7. deploy frontends
  8. run staged smoke tests

For existing systems:

  • enable compatibility mode where needed
  • cut over one backend at a time if necessary
  • verify module access after each backend migration

At minimum test:

  • Auth health/readiness
  • login
  • /auth/me
  • /auth/me/access
  • Core entitlement read
  • one backend permission-allowed request
  • one backend permission-denied request
  • admin subscription change invalidates access
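The smoke checks above can be encoded as data plus a small runner; the paths mirror the list, while the expected statuses and the injected fetcher are illustrative assumptions:

```typescript
// Sketch: smoke tests as data. The fetcher is injected so the same list can
// run against staging, prod, or a stub.

interface SmokeCheck {
  name: string;
  path: string;
  expectStatus: number;
}

const smokeChecks: SmokeCheck[] = [
  { name: "auth health", path: "/health", expectStatus: 200 },
  // Assumption: /auth/me without a token should be rejected, not 200.
  { name: "auth me (no token)", path: "/auth/me", expectStatus: 401 },
];

type Fetcher = (path: string) => Promise<{ status: number }>;

async function runSmoke(fetcher: Fetcher): Promise<string[]> {
  const failures: string[] = [];
  for (const c of smokeChecks) {
    const res = await fetcher(c.path).catch(() => ({ status: 0 }));
    if (res.status !== c.expectStatus) failures.push(c.name);
  }
  return failures; // empty array means the smoke run passed
}
```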

26.1 “User cannot access module” runbook


Check:

  1. is Auth healthy?
  2. is Core healthy?
  3. is JWT valid?
  4. is session revoked?
  5. is x-org correct?
  6. does company own module in Core?
  7. does membership have module grant?
  8. does access cache need invalidation?
  9. does backend enforce wrong permission key?

26.2 “Login failures suddenly spike” runbook


Check:

  1. Auth DB health
  2. Redis/rate limit behavior
  3. signing key/config changes
  4. DNS/LB issues
  5. client deployment issues

26.3 “Subscription changed but UI still old” runbook


Check:

  1. Core wrote entitlement change successfully?
  2. entitlement version bumped?
  3. invalidation signal sent?
  4. Auth cache cleared?
  5. frontend refetched /auth/me/access?
  6. backend still using stale access snapshot?

26.4 “Redis is down” runbook

Expected behavior:

  • performance degrades
  • truth still rebuilds from DB

Tasks:

  • verify services fail open/closed correctly per route
  • restore Redis
  • confirm access cache repopulates

26.5 “JWT verification failing in backends” runbook


Check:

  1. JWKS reachable
  2. kid known
  3. issuer/audience config correct
  4. clock skew/time sync issues
  5. key rotation event incomplete

27. Security review checklist before production

  • all public domains use TLS
  • Core is internal-only or properly protected
  • DBs and Redis are private
  • JWT private keys stored securely
  • JWKS exposed correctly
  • refresh tokens hashed
  • login rate limits enabled
  • admin routes protected
  • service-to-service auth documented and enabled
  • backups enabled
  • restore tested
  • alerts configured
  • access logs enabled
  • secrets not stored in repo
  • prod/staging fully separated

28. Go-live checklist per component

Auth Backend:

  • service deployed
  • health/readiness working
  • auth_db provisioned
  • Redis connected
  • JWT keys configured
  • JWKS public
  • logs/metrics enabled
  • alerts enabled
  • backups enabled

Platform Core Backend:

  • service deployed
  • platform_core_db provisioned
  • internal-only exposure configured
  • service auth from Auth/Admin configured
  • logs/metrics enabled
  • backups enabled

Basic Backend:

  • public API routed
  • main DB connected
  • auth validation configured
  • x-org policy enabled
  • logs/metrics enabled

Finance Backend:

  • public API routed
  • finance DB connected
  • auth validation configured
  • logs/metrics enabled

Market/Touring Backend:

  • routing configured
  • DB connected
  • shared/separate service decision documented
  • auth validation configured
  • logs/metrics enabled

Venue Backend:

  • domain/path configured
  • DB finalized
  • auth validation configured
  • logs/metrics enabled

AI Backend:

  • domain configured
  • DB/storage finalized
  • upstream AI provider secrets configured
  • auth validation configured
  • logs/metrics enabled

Redis:

  • private access only
  • persistence decision documented
  • memory/TTL policy set
  • monitoring enabled

DNS/TLS:

  • all records created
  • staging/prod separated
  • certs valid
  • renewal verified
  • expiry alerts enabled

29. Recommended infra deliverables after this document


After this phase, the next concrete infra artifacts should be:

  1. Environment matrix

    • local/dev/staging/prod values and owners
  2. Service inventory sheet

    • each service, domain, runtime, repo, owner, DB, secrets, alerts
  3. Network matrix

    • who can call whom
  4. Secrets inventory

    • all required secrets by environment
  5. DB backup/restore runbook

    • step-by-step recovery
  6. Deployment runbook

    • release, smoke tests, rollback
  7. Incident playbook

    • auth outage
    • core outage
    • redis outage
    • db outage
    • cert expiry
    • stale access issue

Infrastructure must guarantee these truths:

Auth is the identity and access truth.
Core is the commercial entitlement truth.
Redis is only acceleration.
PostgreSQL and service-owned DBs are the real source of truth.
Backends enforce access consistently.
Private services stay private.
Backups and restore are mandatory.
Monitoring and alerting are not optional.

And the most important infra rule of all:

The system must remain correct even when cache is cold, stale, or unavailable.