Infrastructure Implementation Tasks/Plan
Detailed Infra Tasks (Redis / DB / Services / Domains / Networking / Security / Deployment / Operations)
Audience: DevOps engineers, platform engineers, backend leads, security engineers, SRE/operations, architecture leads
Status: execution-grade infrastructure specification
Goal: define, in practical detail, how to deploy, secure, connect, observe, scale, and operate the full Kisum platform infrastructure for:
- Auth Backend
- Platform Core Backend
- Platform Admin Backend
- Basic/Core Backend
- Finance Backend
- Market Backend
- Touring Backend
- Venue Backend
- AI Backend
- PostgreSQL databases
- Redis
- domains / DNS / TLS
- networking / ingress / egress
- secrets / credentials
- backups / restore
- logging / metrics / tracing
- CI/CD / release strategy
- incidents / runbooks / disaster recovery
0. How to read this document
This document is intentionally long and explicit.
It is not a product brief.
It is not a high-level architecture note.
It is not a slide deck.
It is meant to answer questions like:
- where exactly should each service run?
- what domains should point where?
- what databases exist?
- which services can talk to which others?
- where does Redis live?
- what must be private vs public?
- how should environment variables be managed?
- how should certificates be handled?
- what gets backed up?
- how do we deploy without breaking production?
- what happens when Redis is down?
- how do we rotate JWT signing keys safely?
- how do we restore auth if the DB is corrupted?
- what monitoring and alerts are mandatory before go-live?
Where there is still an application-level TBD, this document states that clearly rather than inventing fake certainty.
1. Infra objectives
The infrastructure must support the agreed architecture:
- Auth = identity, sessions, memberships, roles, permissions, delegation, access aggregation
- Platform Core = packages, add-ons, modules, company entitlements
- Business Backends = business data and business logic
- Frontend = consumes backend APIs; does not own security truth
- Redis = performance layer, never source of truth
- PostgreSQL = source of truth for relational system state
The infrastructure must also support these business/technical rules:
- JWT remains small and identity-only.
- Access is recomputed from Auth + Core data, not embedded in JWT.
- Core App is only visible when Basic subscription is active.
- Add-on modules may be active even if Basic is inactive.
- Business backends must validate and enforce access consistently.
- Platform Core is internal for entitlements and should not become a public authorization engine.
- Frontend must call Auth for access bootstrap, not Core directly.
2. Top-level infrastructure inventory
The platform should be thought of as these deployable groups:
2.1 Public-facing services
These accept browser/client traffic from the public internet.
- Frontend App
- Platform Admin Frontend
- Auth Backend
- Platform Admin Backend
- Basic Backend
- Finance Backend
- Market Backend
- Touring Backend
- Venue Backend
- AI Backend
2.2 Internal-only services
These should not be directly callable by browsers/public clients unless there is a very specific decision to expose them.
- Platform Core Backend
- internal supporting jobs/workers, if introduced
- optional internal cache warming jobs
- optional scheduled reconciliation jobs
2.3 Data services
These hold or accelerate platform state.
- PostgreSQL for Auth
- PostgreSQL for Platform Core
- MongoDB for Main/Basic (current)
- PostgreSQL for Finance
- MongoDB for Market/Touring (current)
- Venue DB (TBD)
- AI DB / storage (TBD)
- Redis
2.4 Platform services
These support routing, delivery, and operations.
- DNS provider
- TLS certificate management
- reverse proxy / ingress
- CI/CD
- secret manager or secret injection mechanism
- logging sink
- metrics backend
- alerting system
- backup jobs
- restore tooling
3. Recommended environment layout
At minimum, infrastructure should support:
- local
- dev
- staging
- production
Optional:
- preview
- qa
- disaster-recovery / warm standby
3.1 Local
Purpose:
- individual development
- integration tests with minimal dependencies
Should support:
- local Auth
- local Core
- optional local Redis
- local PostgreSQL
- mocked or dev versions of other backends
3.2 Dev
Purpose:
- shared development environment
- integration testing by multiple engineers
Should support:
- all services deployable
- test domains
- real Redis
- real Postgres instances
- non-production secrets
- isolated data
3.3 Staging
Purpose:
- production-like validation
- release candidate testing
- smoke tests
- load testing before prod
Should support:
- same routing pattern as prod
- same TLS pattern as prod
- same secret layout pattern as prod
- realistic DB/Redis services
- monitoring and alerts enabled
3.4 Production
Purpose:
- customer traffic
- platform admin traffic
- real business operations
Must have:
- secure TLS
- monitored DB/Redis
- backups
- restore runbooks
- incident alerting
- environment separation from staging
- rollout and rollback plan
4. Domain and DNS plan
4.1 Recommended domains
Public domains
- auth.kisum.io → Auth Backend
- app.kisum.io → main frontend / Core App shell / module routes
- admin.kisum.io → platform admin frontend
- api-v2.kisum.dev (or production equivalent under kisum.io if desired) → Basic Backend
- api-v2-finance.kisum.dev → Finance Backend
- api-v2-market.kisum.dev → Market Backend
- api-v2-touring.kisum.dev → Touring Backend (if separate)
- api.kisum.dev/venue or dedicated subdomain → Venue Backend
- api-v2-ai.kisum.dev → AI Backend
Internal domains
- core.kisum.io or platform-core.kisum.io → Platform Core Backend
- optional internal admin-to-core/internal-to-auth hostnames depending on network design
4.2 Domain strategy notes
Option A — keep current mixed domain pattern temporarily
This reflects your current system reality and reduces migration friction.
Option B — normalize all production domains later
Examples:
- api-basic.kisum.io
- api-finance.kisum.io
- api-market.kisum.io
This can be cleaner long term, but not required immediately.
4.3 DNS records
For each public hostname, define:
- A / AAAA or CNAME depending on hosting model
- TTL appropriate for your infra
- separate staging/dev DNS namespace
- no shared records between prod and non-prod
Examples:
- auth.kisum.io
- auth-staging.kisum.io
- core-staging.kisum.io
- app-staging.kisum.io
4.4 DNS operational tasks
- document all DNS records in infra repo
- avoid manual undocumented DNS edits
- define ownership for DNS changes
- use lower TTL before cutovers or migrations
- record certificate dependencies for each hostname
5. TLS / certificates
5.1 Every public domain must use TLS
Mandatory for:
- frontend
- auth
- admin
- all public APIs
5.2 Internal-only services
Recommended:
- still use TLS if routed through internal ingress/service mesh
- otherwise restrict to private networking and authenticated service calls
5.3 Certificate management tasks
- choose cert strategy:
  - managed certs via cloud LB
  - or Let’s Encrypt via ingress / reverse proxy
- ensure auto-renewal
- alert before expiration
- test renewals in staging
5.4 HSTS and secure headers
For frontend and auth domains, configure:
- HSTS
- X-Content-Type-Options
- X-Frame-Options or CSP equivalent as appropriate
- Referrer-Policy
- secure cookie policy if cookies are used anywhere
- CSP for frontends if feasible
6. Networking model
6.1 Public vs private services
Public
- Auth Backend
- Frontend(s)
- Platform Admin Backend
- Basic Backend
- Module backends
Private/internal
- Platform Core Backend
- databases
- Redis
- internal job runners
6.2 Principle of least network exposure
Default rule:
- if a service does not need public internet traffic, do not expose it publicly
Platform Core is the most important example:
- it should be reachable by Auth and Admin backend
- it should not be a public browser API
6.3 Network segmentation
Recommended segmentation:
- ingress/public subnet or layer
- app/services private subnet or network
- data subnet or private data layer
At minimum:
- public reverse proxy or LB
- private app-to-app connectivity
- private DB and Redis connectivity
6.4 Allowed traffic matrix
Auth Backend
Can talk to:
- Auth DB
- Redis
- Platform Core Backend
- optional email provider (SES etc.)
- optional internal admin/backend tools
Should not need direct access to:
- module DBs
- module business backends for normal runtime
Platform Core Backend
Can talk to:
- Platform Core DB
- optional billing provider APIs
- optional internal messaging bus
- optional Redis if used for local acceleration
Should not need direct access to:
- Auth DB
- module DBs
Platform Admin Backend
Can talk to:
- Auth Backend
- Platform Core Backend
- optional admin DB/config DB
Basic Backend
Can talk to:
- Main DB
- Auth/JWKS or Auth internal access route
- optional Redis if using shared cache path
Finance Backend
Can talk to:
- Finance DB
- Auth/JWKS or Auth internal access route
Market/Touring Backend
Can talk to:
- Market/Touring DB
- Auth/JWKS or Auth internal access route
Venue Backend
Can talk to:
- Venue DB
- Auth/JWKS or Auth internal access route
AI Backend
Can talk to:
- AI DB/storage
- Auth/JWKS or Auth internal access route
- optional model providers
Databases and Redis
Should never accept public internet traffic directly.
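The matrix above can be kept as data and checked mechanically, for example against firewall rules or observed connections. The following is a minimal sketch under stated assumptions: the service and datastore identifiers are illustrative labels rather than real hostnames, optional dependencies (email provider, billing APIs, model providers, optional Redis use) are left out, and "Auth/JWKS or Auth internal access route" is simplified to a single `auth-backend` edge.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Required edges only, transcribed from the matrix above.
# All identifiers are hypothetical labels chosen for this sketch.
ALLOWED_TRAFFIC: Dict[str, FrozenSet[str]] = {
    "auth-backend": frozenset({"auth-db", "redis", "platform-core-backend"}),
    "platform-core-backend": frozenset({"platform-core-db"}),
    "platform-admin-backend": frozenset({"auth-backend", "platform-core-backend"}),
    "basic-backend": frozenset({"main-db", "auth-backend"}),
    "finance-backend": frozenset({"finance-db", "auth-backend"}),
    "market-touring-backend": frozenset({"market-touring-db", "auth-backend"}),
    "venue-backend": frozenset({"venue-db", "auth-backend"}),
    "ai-backend": frozenset({"ai-storage", "auth-backend"}),
}


def is_allowed(source: str, destination: str) -> bool:
    """Default deny: an edge that is not listed is rejected."""
    return destination in ALLOWED_TRAFFIC.get(source, frozenset())


def find_violations(observed: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every observed (source, destination) pair outside the matrix."""
    return [(src, dst) for src, dst in observed if not is_allowed(src, dst)]
```

Keeping the matrix in the infra repo next to the network rules makes drift visible in review: a new edge has to be added to the data before it can pass the check.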
7. Service hosting and deployment model
This section does not force one provider, but it defines what the hosting model must support.
7.1 Supported hosting patterns
Pattern A — VM-based deployment
Services run in containers or processes on VMs. Good when:
- you want direct control
- you already use VPS infrastructure
- you want simpler networking and fewer moving parts early
Pattern B — container platform / orchestrator
Services run in managed containers / Kubernetes / Nomad / etc. Good when:
- you want stronger scaling and orchestration
- you have DevOps maturity for it
Pattern C — mixed model
- frontends on managed frontend hosting/CDN
- APIs on containers/VMs
- DB/Redis managed
This is often the most practical for a growing product.
7.2 Recommended pragmatic layout
Given your current environment history and the service split, a pragmatic production layout is:
- Frontends on managed static/app hosting or dedicated app servers behind CDN
- Auth, Core, Admin, and module backends in containers on private app hosts or managed app platform
- PostgreSQL managed where possible
- Redis managed where possible
- reverse proxy or load balancer in front
This gives:
- cleaner deployments
- safer DB operations
- less ops burden on the most critical stateful services
7.3 Deployment units
Each of these should be independently deployable:
- auth-backend
- platform-core-backend
- platform-admin-backend
- basic-backend
- finance-backend
- market-backend
- touring-backend (if separate)
- venue-backend
- ai-backend
Each needs:
- its own config
- its own runtime health checks
- its own logs
- its own release history
8. Database plan
8.1 Auth DB
Type: PostgreSQL
Name: auth_db
Stores:
- users
- sessions
- company memberships
- module grants
- permissions
- delegation
- token/versioning-related state
- audit/security records if stored in DB
Infra tasks
- create managed or dedicated PostgreSQL instance/database
- enforce TLS for DB connections where supported
- create app role/user with least privilege
- apply migrations through CI/CD
- enable PITR if provider supports it
- define backup schedule
- test restore
Sizing considerations
Hot-path DB. Expect frequent reads for:
- user/session checks
- membership checks
- access aggregation
Therefore:
- proper indexes are mandatory
- pooling must be tuned
- read patterns must be measured
8.2 Platform Core DB
Type: PostgreSQL
Name: platform_core_db
Stores:
- modules
- packages
- add-ons
- package/add-on mappings
- company subscriptions
- company add-ons
- entitlement versioning
- entitlement history optionally
Infra tasks
- create separate DB or at minimum separate schema
- migrations managed independently from Auth
- backup/restore independently
- secure private connectivity only
8.3 Main / Basic DB
Type: current main business DB (MongoDB Main per current documentation)
Stores:
- core/basic business data
Infra tasks
- document exact DB cluster/instance
- confirm connection policy
- confirm backup policy
- define read/write credentials for Basic Backend only
- review indexes for core business routes
8.4 Finance DB
Type: PostgreSQL Finance
Infra tasks
- verify production DB sizing
- verify migrations
- verify backup and restore policy
- ensure Finance backend has only required role access
8.5 Market/Touring DB
Type: MongoDB Market/Touring (current direction)
Infra tasks
- confirm whether shared cluster or DB namespace
- document collections and ownership
- backup schedule
- restore test
- security policy for shared Market/Touring service
8.6 Venue DB
Type: TBD
Infra tasks
- finalize DB engine
- define owner service
- define backup plan
- define access credentials
8.7 AI DB / storage
Type: TBD
May include:
- relational DB
- document store
- object storage
- vector storage depending on AI design
Infra tasks
- finalize data types
- define retention rules
- define cost controls
- define model artifact storage if any
9. PostgreSQL operational tasks
These apply to Auth, Platform Core, and Finance where relevant.
9.1 Roles and privileges
For each application DB:
- create separate DB user per service
- no superuser for application runtime
- migrations may use a stronger controlled role
- app user should only access its own schema/tables
9.2 Connection pooling
Tasks:
- configure max connections per service
- decide whether pooling is done:
  - in-app
  - via PgBouncer
  - or managed DB proxy
- prevent connection exhaustion during spikes
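Connection exhaustion is mostly an arithmetic problem: every replica of every service can open up to its pool size, and the sum has to stay below the database limit with some room left for operators and migrations. A minimal sketch of that budget check follows; the `PoolPlan` type, the `reserved_for_ops` default, and every number in the usage note are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PoolPlan:
    """Planned pool settings for one service (hypothetical planning record)."""
    service: str
    replicas: int
    pool_size_per_replica: int

    @property
    def max_connections(self) -> int:
        # Worst case: every replica has filled its pool.
        return self.replicas * self.pool_size_per_replica


def check_connection_budget(
    plans: Iterable[PoolPlan],
    db_max_connections: int,
    reserved_for_ops: int = 10,  # assumed headroom for admin sessions and tooling
) -> int:
    """Return the remaining headroom, or raise if the worst case does not fit."""
    worst_case = sum(plan.max_connections for plan in plans)
    headroom = db_max_connections - reserved_for_ops - worst_case
    if headroom < 0:
        raise ValueError(
            f"worst case needs {worst_case} connections, "
            f"only {db_max_connections - reserved_for_ops} are available"
        )
    return headroom
```

For example, four Auth replicas with a pool of 10 plus one migration job with a pool of 2 need 42 connections in the worst case; against a limit of 100 with 10 reserved, that leaves a headroom of 48. Re-running the check when replica counts change catches exhaustion before a traffic spike does.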
9.3 Migrations
Tasks:
- choose migration tool
- maintain migration repo/folder
- run in CI/CD before app rollout when appropriate
- protect prod with migration review
- support rollback strategy for reversible migrations
- define non-reversible migration warnings
9.4 Backups
Tasks:
- daily full backups at minimum
- WAL/PITR if supported
- encrypted storage for backups
- retention policy
- restore test at least periodically
9.5 Monitoring
Track:
- CPU
- memory
- storage
- connection count
- slow queries
- replication lag if any
- lock contention
- migration failures
10. MongoDB operational tasks
These apply to Main and Market/Touring if still on Mongo.
10.1 Cluster/config review
Tasks:
- document version
- document replica set/sharding status
- confirm backup method
- confirm auth enabled
- confirm TLS enabled if supported/used
- confirm private network only
10.2 Access control
Tasks:
- separate DB users per service
- least privilege roles
- no shared root creds in app runtime
10.3 Backup and restore
Tasks:
- automated backups
- retention policy
- restore drill
- document RPO/RTO expectations
10.4 Performance
Tasks:
- review indexes on hot collections
- review unbounded collection growth
- review heavy aggregation/query paths
- monitor oplog/replica health if applicable
11. Redis plan
Redis is critical but is not the source of truth.
11.1 Redis responsibilities
Redis may be used for:
- access-context cache
- session hot cache
- revocation markers
- login rate limiting
- refresh throttling
- password-reset throttling
- company resolution cache
- optional internal access snapshot cache
- any other hot-path cache that preserves correctness when lost
11.2 Redis deployment recommendations
Preferred:
- managed Redis or highly available Redis service
Alternative:
- self-managed Redis with persistence and restart strategy
11.3 Redis must not become truth
Mandatory rule:
- if Redis is empty, wrong, or unavailable, the system must still be able to rebuild truth from DBs
- correctness must not depend on Redis being the sole store of critical state
11.4 Redis key design
Access context cache
- access:{companyId}:{membershipId}:{accessVersion}:{entitlementVersion}
Session cache
- session:{sessionId}
Revocation marker
- revoked:session:{sessionId}
Login rate limits
- ratelimit:login:ip:{ip}
- ratelimit:login:email:{email}
Company resolution cache
- company-map:{raw-x-org}
11.5 TTL policy
Recommended starting points:
- access cache: 5–15 minutes
- session hot cache: aligned to short token/session needs
- revocation marker: at least until all access tokens tied to that session are expired
- rate limit keys: according to endpoint policy
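Centralising key construction in one module keeps the key patterns from 11.4 consistent across services. The sketch below builds those keys and derives two TTLs from the starting points above. The function names, the 10-minute access-cache value (picked from inside the 5–15 minute range), the clock-skew margin, and the lowercasing of emails are assumptions made for illustration.

```python
from typing import List

ACCESS_CACHE_TTL_SECONDS = 10 * 60  # assumed value within the 5-15 minute range


def access_key(company_id: str, membership_id: str,
               access_version: int, entitlement_version: int) -> str:
    # Both versions are part of the key, so a version bump makes old
    # entries unreachable without an explicit delete.
    return f"access:{company_id}:{membership_id}:{access_version}:{entitlement_version}"


def session_key(session_id: str) -> str:
    return f"session:{session_id}"


def revocation_key(session_id: str) -> str:
    return f"revoked:session:{session_id}"


def login_rate_limit_keys(ip: str, email: str) -> List[str]:
    # Normalising the email is an assumption; it stops case variants of the
    # same address from getting separate counters.
    return [f"ratelimit:login:ip:{ip}", f"ratelimit:login:email:{email.strip().lower()}"]


def company_map_key(raw_x_org: str) -> str:
    return f"company-map:{raw_x_org}"


def revocation_ttl_seconds(access_token_ttl_seconds: int, clock_skew_seconds: int = 60) -> int:
    # The marker must outlive every access token issued for the session,
    # so it lives for the token TTL plus an assumed skew margin.
    return access_token_ttl_seconds + clock_skew_seconds
```

Putting `accessVersion` and `entitlementVersion` into the access key means invalidation after a role or subscription change needs no delete call: the new versions produce a new key and the old entry simply expires.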
11.6 Redis failure behavior
Define expected behavior if Redis is unavailable:
- login may temporarily lose rate limiting if no fallback exists
- session/access lookups should fall back to DB
- service should degrade, not catastrophically fail, unless a specific route is designed otherwise
Backends
- access context should be refetched from Auth/DB path
- performance degrades but correctness remains
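The degrade-not-fail behavior described above is a cache-aside read that treats any cache problem as a miss. The following is a minimal sketch, assuming a small `Cache` interface and a `CacheUnavailable` exception that a thin wrapper around the real Redis client would raise; both names are hypothetical, and `load_from_db` stands in for the Auth/DB path.

```python
import json
from typing import Callable, Optional, Protocol


class CacheUnavailable(Exception):
    """Raised by the cache wrapper when the backing store cannot be reached."""


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str, ttl_seconds: int) -> None: ...


def load_access_context(
    cache: Cache,
    key: str,
    load_from_db: Callable[[], dict],
    ttl_seconds: int = 600,
) -> dict:
    """Read through the cache, falling back to the source of truth."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except (CacheUnavailable, ValueError):
        # Cache down or entry corrupted: treat it as a miss.
        pass

    context = load_from_db()

    try:
        cache.set(key, json.dumps(context), ttl_seconds)
    except CacheUnavailable:
        # Failing to repopulate must not fail the request.
        pass
    return context
```

With this shape, an empty, wrong, or unreachable cache costs latency only. The database path needs to be sized for the case where every request misses, since that is exactly what happens during a Redis outage.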
11.7 Redis monitoring
Section titled “11.7 Redis monitoring”Track:
- memory usage
- CPU
- evictions
- latency
- keyspace misses/hits
- connection count
- persistence failures if persistence enabled
12. Secrets and credential management
12.1 Secret categories
Infra must manage these separately:
- JWT private signing keys
- JWT public key metadata/JWKS config
- DB credentials per service
- Redis credentials
- internal API keys / service tokens
- billing provider keys
- email provider keys
- AI provider keys
- admin bootstrap credentials if any
- TLS/cert-related secrets where applicable
12.2 Rules
- never hardcode secrets in repo
- never share one DB password across all services
- rotate secrets on schedule or after incident
- use environment injection or secret manager
- restrict who can read prod secrets
12.3 JWT key management
Tasks:
- generate production RSA keypair
- keep private key in secure secret store
- expose public key via JWKS
- support key rotation using kid
- document rotation runbook
- ensure old public keys remain available until old tokens expire
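The rotation rule in the last task can be modelled as bookkeeping over published keys: the previous key stays in the JWKS until the last token it could have signed has expired. The sketch below covers only that bookkeeping, with no cryptography. `KeySet`, `PublishedKey`, and the rule "retire at rotation time plus access-token TTL" are assumptions for illustration; it also assumes only access tokens are verified through JWKS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class PublishedKey:
    kid: str
    public_jwk: dict
    retire_after: Optional[datetime] = None  # None while the key is the active signer


@dataclass
class KeySet:
    access_token_ttl: timedelta
    keys: Dict[str, PublishedKey] = field(default_factory=dict)
    active_kid: Optional[str] = None

    def rotate(self, new_kid: str, new_public_jwk: dict, now: datetime) -> None:
        """Make new_kid the signer and schedule the previous key for retirement."""
        if self.active_kid is not None:
            # A token signed just before rotation expires at now + TTL,
            # so the old public key must stay published until then.
            self.keys[self.active_kid].retire_after = now + self.access_token_ttl
        self.keys[new_kid] = PublishedKey(new_kid, new_public_jwk)
        self.active_kid = new_kid

    def jwks(self, now: datetime) -> dict:
        """Return the JWKS document body containing every key still needed."""
        live = [
            key for key in self.keys.values()
            if key.retire_after is None or now < key.retire_after
        ]
        return {"keys": [{**key.public_jwk, "kid": key.kid} for key in live]}
```

A real runbook would likely add a clock-skew margin to the retirement time and publish the new public key some time before signing with it, so that verifiers with cached JWKS documents do not reject fresh tokens.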
12.4 Internal service auth secrets
Tasks:
- define how Auth ↔ Core and Admin ↔ Core authenticate internally
- options:
  - internal API key
  - mTLS
  - signed service tokens
  - private allowlist + shared auth
- choose one and document rollout
13. Environment variables and config management
13.1 General principle
Each service should have:
- typed config
- required vs optional env validation
- startup failure if critical env missing
- no hidden config defaults for security-sensitive values
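One way to get typed config with fail-fast validation is to parse the environment once at startup into an immutable object. The sketch below does this for a subset of Auth settings. The variable names follow the examples in 13.2; treating `REDIS_URL` as optional and `ACCESS_TOKEN_TTL` as a number of seconds are assumptions, as are the `AuthConfig` and `ConfigError` names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


class ConfigError(RuntimeError):
    """Raised at startup when the environment is incomplete or malformed."""


@dataclass(frozen=True)
class AuthConfig:
    auth_db_url: str
    jwt_issuer: str
    jwt_audience: str
    jwt_private_key: str
    access_token_ttl_seconds: int
    redis_url: Optional[str]  # optional in this sketch only


# No defaults for security-sensitive values: absence is an error.
REQUIRED_VARS = (
    "AUTH_DB_URL", "JWT_ISSUER", "JWT_AUDIENCE", "JWT_PRIVATE_KEY", "ACCESS_TOKEN_TTL",
)


def load_auth_config(env: Mapping[str, str]) -> AuthConfig:
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigError("missing required env vars: " + ", ".join(missing))
    try:
        ttl = int(env["ACCESS_TOKEN_TTL"])
    except ValueError as exc:
        raise ConfigError("ACCESS_TOKEN_TTL must be an integer number of seconds") from exc
    if ttl <= 0:
        raise ConfigError("ACCESS_TOKEN_TTL must be positive")
    return AuthConfig(
        auth_db_url=env["AUTH_DB_URL"],
        jwt_issuer=env["JWT_ISSUER"],
        jwt_audience=env["JWT_AUDIENCE"],
        jwt_private_key=env["JWT_PRIVATE_KEY"],
        access_token_ttl_seconds=ttl,
        redis_url=env.get("REDIS_URL") or None,
    )
```

Calling `load_auth_config(os.environ)` before the server binds its port turns a missing secret into a failed deploy with a clear error message instead of a runtime failure on the first login.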
13.2 Auth config categories
Examples:
- AUTH_DB_URL
- REDIS_URL
- JWT_ISSUER
- JWT_AUDIENCE
- JWT_PRIVATE_KEY
- JWT_PUBLIC_KID
- ACCESS_TOKEN_TTL
- REFRESH_TOKEN_TTL
- INTERNAL_API_KEY (if used)
- email provider config
- rate limit settings
13.3 Platform Core config categories
Examples:
- CORE_DB_URL
- internal auth mechanism config
- billing provider config if introduced
- feature toggles for self-service billing
13.4 Module backend config categories
Each backend needs:
- its DB URL
- Auth/JWKS config
- service name
- x-org policy config if applicable
- optional Redis config if used
- logging/metrics config
13.5 Frontend config categories
Even though this is infra, document that frontend deployments need:
- auth base URL
- backend base URLs per service
- admin base URL
- environment flags
- public domain config
14. CI/CD tasks
14.1 Repo pipeline requirements
For each service:
- lint
- type/build step
- test step
- security/dependency scan if possible
- artifact/container build
- deployment step
- post-deploy health check
14.2 Migration handling
For services with DB migrations:
- migrations should run in controlled step
- migration success must be checked before full rollout
- rollback policy documented
14.3 Deployment strategy
Recommended:
- rolling deploy or blue/green/canary where supported
- no big-bang deploy for auth-critical changes
- stage auth and core carefully before frontend consuming new contracts
14.4 Release order
Preferred order:
- Core DB changes
- Core backend changes
- Auth DB changes
- Auth backend changes
- Module backend changes
- frontend changes
Exact order may vary by compatibility, but do not deploy frontend assuming contracts that backend has not shipped yet.
14.5 Rollback strategy
Each deployment must define:
- what can be rolled back immediately
- what DB migrations make rollback harder
- what feature flags can disable broken behavior quickly
15. Service startup and health checks
15.1 Every service should expose
- /health
- /ready
15.2 Liveness
Liveness only answers:
- process is alive
- event loop/runtime not deadlocked
15.3 Readiness
Readiness should answer:
- service can handle traffic now
- critical dependencies reachable
Examples:
Auth readiness checks
- Auth DB reachable
- Redis reachable if configured as required dependency
- signing keys loaded
- Core connectivity: usually not a hard dependency for process readiness, but it should be monitored separately
Core readiness checks
- Core DB reachable
- internal auth config loaded if needed
Module backend readiness
- own DB reachable
- Auth/JWKS config loaded
- optional Redis reachable if required
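The readiness rules above separate required dependencies from optional ones: only a failing required dependency should take the instance out of rotation. A minimal, framework-independent sketch of that evaluation follows. The `Probe` type, the probe names in the usage note, and the 200/503 mapping are assumptions; the check callables stand in for real DB, Redis, or key-loading checks.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class Probe(NamedTuple):
    check: Callable[[], bool]  # returns True when the dependency is usable
    required: bool             # False: report the result, but do not fail readiness


def evaluate_readiness(probes: Dict[str, Probe]) -> Tuple[int, dict]:
    """Run every probe and return (HTTP status, response body) for the readiness endpoint."""
    results = {}
    ready = True
    for name, probe in probes.items():
        try:
            ok = bool(probe.check())
        except Exception:
            # A probe that raises is treated as a failed dependency.
            ok = False
        results[name] = {"ok": ok, "required": probe.required}
        if probe.required and not ok:
            ready = False
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

For Auth, the probe set might be `auth_db` and `signing_keys` as required and `redis` as optional. Core connectivity would be left out of readiness and exported as a metric instead, in line with the note above. Liveness stays a separate, dependency-free endpoint so that a database outage does not trigger restarts.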
16. Internal service-to-service authentication
16.1 Problem
Auth must call Core.
Admin backend must call Auth and Core.
Module backends may call Auth internal context APIs.
These are not browser calls and should not rely on public user JWT flows.
16.2 Options
Option A — Internal API key
Simplest early-stage pattern.
Pros:
- easy
- fast to implement
Cons:
- weaker than mTLS/service identity
- rotation discipline needed
Option B — Signed service token
Pros:
- stronger separation
- auditable identity
Cons:
- more implementation complexity
Option C — mTLS/service mesh identity
Pros:
- strongest model
Cons:
- most operational complexity
16.3 Recommendation
Start with:
- private networking
- internal API key or signed internal service token
- explicit allowlist on internal services
Then evolve if needed.
16.4 Tasks
- choose internal auth mechanism
- document headers/validation
- ensure Core rejects public unauthorized traffic
- ensure internal routes are not accidentally exposed
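If the internal API key option is chosen, validation on the receiving service can combine the explicit caller allowlist with a constant-time key comparison. The sketch below is one possible shape, not a decided design: the per-caller key map, the caller names, and the idea of accepting several keys per caller during rotation are assumptions. How the caller name and key arrive (for example as request headers) is left to the documented header contract.

```python
import hmac
from typing import Mapping, Optional, Sequence


def authorize_internal_call(
    caller: str,
    presented_key: Optional[str],
    keys_by_caller: Mapping[str, Sequence[str]],
) -> bool:
    """Allow the call only for an allowlisted caller presenting one of its own keys."""
    accepted = keys_by_caller.get(caller)
    if accepted is None or not presented_key:
        return False  # unknown caller or missing key: default deny

    presented = presented_key.encode("utf-8")
    matched = False
    for candidate in accepted:
        # compare_digest avoids leaking key contents through timing;
        # every candidate is checked so the loop does not exit early.
        matched |= hmac.compare_digest(presented, candidate.encode("utf-8"))
    return matched
```

Accepting two keys per caller for a short window lets a rotation roll out without downtime: deploy the new key to the receiver, switch the caller, then remove the old key. Giving each caller its own key keeps a leaked Admin key from being usable as Auth.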
17. Ingress / reverse proxy / gateway tasks
17.1 Public ingress responsibilities
- TLS termination
- host-based routing
- request size limits where needed
- timeout configuration
- rate limiting where appropriate
- forwarding headers correctly
- access logs
17.2 Route ownership
Examples:
- auth.kisum.io/* → Auth
- admin.kisum.io/* → Admin frontend
- app.kisum.io/* → main frontend
- api-v2-finance.kisum.dev/* → Finance backend
- etc.
17.3 Proxy headers
Ensure backends can trust and parse:
- X-Forwarded-For
- X-Forwarded-Proto
- host headers if sitting behind reverse proxy/LB
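Trusting `X-Forwarded-For` safely means honoring it only when the direct peer is a known proxy, and reading it from the right, because a client can prepend arbitrary entries on the left. The sketch below shows that rule with the standard `ipaddress` module. The function name and the trusted-proxy CIDR list are assumptions; the real list depends on the ingress or load balancer in use.

```python
import ipaddress
from typing import Iterable, Optional


def _parse_ip(value: str):
    try:
        return ipaddress.ip_address(value.strip())
    except ValueError:
        return None


def resolve_client_ip(
    remote_addr: str,
    x_forwarded_for: Optional[str],
    trusted_proxies: Iterable[str],
) -> str:
    """Return the client IP, honoring X-Forwarded-For only behind trusted proxies."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(value: str) -> bool:
        ip = _parse_ip(value)
        return ip is not None and any(ip in net for net in networks)

    # A request that did not come through our proxy cannot vouch for the header.
    if not x_forwarded_for or not is_trusted(remote_addr):
        return remote_addr

    hops = [hop.strip() for hop in x_forwarded_for.split(",") if hop.strip()]
    # Walk from the nearest hop outward, skipping our own proxies.
    for hop in reversed(hops):
        if is_trusted(hop):
            continue
        return hop if _parse_ip(hop) is not None else remote_addr
    return remote_addr
```

This matters for the login rate-limit keys: if the backend takes the leftmost header value, an attacker can rotate a spoofed address on every request and never hit the per-IP limit.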
17.4 Timeouts
Define:
- connect timeout
- read timeout
- idle timeout
- upstream timeout per service type
AI endpoints may need longer timeouts than auth.
18. Security hardening tasks
18.1 Public exposure review
For every service ask:
- must this be public?
- must this route be public?
- should this be private/internal only?
18.2 Auth hardening
Tasks:
- enforce strict issuer/audience verification
- rate limit login
- rotate refresh tokens
- store refresh tokens hashed
- revoke sessions cleanly
- audit auth-sensitive actions
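Storing refresh tokens hashed and rotating them on use can be sketched with the standard library alone. The example assumes opaque, high-entropy random tokens, for which an unsalted SHA-256 digest is a reasonable storage form. The function names are hypothetical, and persistence and session revocation are out of scope here.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def hash_refresh_token(token: str) -> str:
    """Digest stored in the DB; the raw token is never persisted."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_refresh_token() -> Tuple[str, str]:
    """Return (raw token for the client, digest for storage)."""
    token = secrets.token_urlsafe(32)
    return token, hash_refresh_token(token)


def rotate_refresh_token(presented: str, stored_digest: str) -> Optional[Tuple[str, str]]:
    """Validate the presented token and, if it matches, issue its replacement.

    Returns None when the token does not match the stored digest. The caller
    is expected to replace the stored digest with the new one in the same
    transaction, so the old token stops working.
    """
    if not hmac.compare_digest(hash_refresh_token(presented), stored_digest):
        return None
    return issue_refresh_token()
```

Because each refresh invalidates the previous token, a later attempt to use an already rotated token is a useful signal of theft and a natural trigger for revoking the whole session.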
18.3 Core hardening
Tasks:
- keep internal-only if possible
- validate internal auth on all internal routes
- audit entitlement changes
- protect product catalog changes
18.4 Module backend hardening
Tasks:
- never trust frontend visibility
- always validate JWT
- always resolve access context
- never assume one module implies another unless explicitly designed
18.5 Admin hardening
Tasks:
- strong RBAC for platform admins
- audit every subscription/catalog change
- protect high-risk actions with confirmation UX/server checks
- consider step-up auth later for highest-risk actions if needed
19. Logging tasks
19.1 Structured logging standard
Every service should log in structured form with:
- timestamp
- service name
- environment
- request ID
- route
- user/session/company identifiers when safe
- outcome status
- error code/message
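The field list above maps directly onto a JSON log formatter. The sketch below uses the standard `logging` module; the `JsonFormatter` class, the JSON field names, and the set of context fields read from `extra=` are assumptions chosen to mirror the list, not an agreed schema.

```python
import json
import logging
from datetime import datetime, timezone

# Per-request fields a caller may attach through `extra=`; absent ones are omitted.
CONTEXT_FIELDS = (
    "request_id", "route", "user_id", "session_id", "company_id", "status", "error_code",
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str, environment: str) -> None:
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "environment": self.environment,
            "message": record.getMessage(),
        }
        for name in CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry)
```

A service would attach it once, for example `handler.setFormatter(JsonFormatter("auth-backend", "production"))`, and then log with `logger.info("login success", extra={"request_id": rid, "route": "/login"})`. Identifiers are only included when the caller passes them, which keeps the "when safe" decision at the call site.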
19.2 Must-log events
- login success/failure
- refresh success/failure
- logout
- logout-all
- session revoked
- access aggregation failure
- access denied due to inactive/revoked state
- subscription changes
- add-on changes
- catalog changes
- entitlement version bumps
Business backends
- auth verification failure
- missing/invalid x-org
- access denied by module
- access denied by permission
Admin backend
- product/catalog create/update/delete
- organization approval/rejection
- company subscription changes
20. Metrics and monitoring tasks
20.1 Service-level metrics
Per service:
- request count
- latency
- error rate
- 4xx/5xx rate
- CPU
- memory
- restart count
20.2 Auth-specific metrics
- login success rate
- login failure rate
- refresh success rate
- revoked-session usage
- access cache hit/miss
- access merge latency
- Core entitlement lookup latency
20.3 Core-specific metrics
- entitlement lookup latency
- catalog mutation count
- entitlement version bump count
- subscription/add-on change rate
20.4 Backend-specific metrics
- access denied by module
- access denied by permission
- x-org resolution failures
- internal access-context fetch latency
20.5 Alerting
Create alerts for:
- service down
- readiness failing
- DB unavailable
- Redis unavailable
- elevated 5xx
- elevated login failures
- unusual auth denial spikes
- slow entitlement/access aggregation
- certificate expiry nearing
- backup failures
21. Tracing tasks
21.1 Distributed tracing scope
Recommended for:
- frontend request → auth → core
- frontend request → backend → auth access context
- admin change → core/auth → cache invalidation
21.2 Trace propagation
Use request IDs and tracing headers consistently across:
- auth
- core
- admin backend
- module backends
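Consistent propagation comes down to two rules: reuse an incoming request ID if one is present, and attach it to every outbound internal call. A minimal sketch follows. The `X-Request-ID` header name is an assumption (a tracing system may prescribe its own headers), and both helper names are hypothetical.

```python
import uuid
from typing import Dict, Mapping, Optional

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name


def ensure_request_id(incoming_headers: Mapping[str, str]) -> str:
    """Reuse the caller's request ID, or mint one at the edge of the system."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return uuid.uuid4().hex


def outbound_headers(request_id: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers for a call to another internal service, carrying the same ID."""
    headers = dict(base or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

When the same ID also goes into each service's structured logs, a question like "why did the module show but the API returned 403?" becomes a single search across Auth, Core, and the module backend.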
21.3 Why it matters
This is especially useful for debugging:
- “why did user lose access?”
- “why did module show but API returned 403?”
- “why did new subscription not appear immediately?”
22. Backup and restore plan
22.1 Databases
Every stateful DB must have:
- automated backups
- defined retention
- restore testing
- documented RPO/RTO expectations
22.2 Redis
Redis backup is optional depending on role, but:
- if persistence enabled, define retention and recovery
- if pure cache, losing it should only impact performance
22.3 Restore drills
At least periodically test:
- Auth DB restore
- Core DB restore
- Finance DB restore
- Mongo restore for Main / Market/Touring
22.4 Restore order in major incident
Likely restore priority:
- Auth DB
- Core DB
- Redis (or allow rebuild)
- Basic/Main DB
- Finance DB
- other module DBs
Why:
- without Auth and Core, platform access model is unusable
23. Disaster recovery and business continuity
23.1 Define RPO / RTO
Per critical service define:
- RPO (acceptable data loss)
- RTO (acceptable downtime)
Suggested criticality
- Auth: highest
- Core: highest
- Finance: high
- Basic/Main: high
- Market/Touring: medium-high
- Venue/AI: medium depending on business usage
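Once RPO and RTO are written down per service, they can be checked against what the backup schedule and restore drills actually deliver. The sketch below assumes snapshot-style backups, where worst-case data loss is roughly one backup interval; the names `RecoveryTarget` and `unmet_targets`, and any numbers passed in, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    service: str
    rpo_minutes: int  # acceptable data loss
    rto_minutes: int  # acceptable downtime


def unmet_targets(
    target: RecoveryTarget,
    backup_interval_minutes: int,
    measured_restore_minutes: int,
) -> list[str]:
    """List the ways current practice misses the stated targets."""
    problems = []
    # With periodic snapshots, up to one full interval of writes can be lost.
    if backup_interval_minutes > target.rpo_minutes:
        problems.append(
            f"{target.service}: backup interval {backup_interval_minutes}m "
            f"exceeds RPO {target.rpo_minutes}m"
        )
    # Restore time should come from an actual drill, not an estimate.
    if measured_restore_minutes > target.rto_minutes:
        problems.append(
            f"{target.service}: restore took {measured_restore_minutes}m, "
            f"RTO is {target.rto_minutes}m"
        )
    return problems
```

An empty result means the measured numbers fit the targets; anything else is a concrete gap to close before go-live.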
23.2 DR tasks
- decide if warm standby required for Auth/Core DBs
- document failover strategy
- document how DNS/LB routing would switch if region/service fails
- ensure secrets available in DR environment
- ensure restore scripts are not tribal knowledge
24. Capacity planning tasks
24.1 Initial capacity questions
Before prod launch estimate:
- daily active users
- concurrent auth requests
- access bootstrap frequency
- module backend QPS
- admin change frequency
- report-heavy workloads
- AI load patterns
24.2 Auth capacity
Auth will be on the hot path for:
- login
- refresh
- session restore
- access aggregation
- backend access checks (depending on chosen pattern)
So:
- scale Auth horizontally if needed
- ensure DB and Redis can support it
- cache carefully; the cache must never become the only copy of the truth
24.3 Core capacity
Core traffic volume should be lower than Auth, but:
- entitlement reads will happen often through Auth
- writes are lower but highly important
24.4 Backend capacity
Module backends scale based on business traffic. Finance and AI may have very different resource profiles:
- Finance: transactional/reporting
- AI: potentially heavy CPU/network calls
25. Deployment sequencing tasks
25.1 Initial rollout sequence
Recommended order:
- provision Auth DB + Core DB + Redis
- deploy Auth
- deploy Core
- validate internal Auth ↔ Core
- deploy Business backends with new auth validation path
- deploy Admin backend
- deploy frontends
- run staged smoke tests
25.2 Incremental rollout
For existing systems:
- enable compatibility mode where needed
- cut over one backend at a time if necessary
- verify module access after each backend migration
25.3 Smoke tests after each deploy
At minimum test:
- Auth health/readiness
- login
- /auth/me
- /auth/me/access
- Core entitlement read
- one backend permission-allowed request
- one backend permission-denied request
- admin subscription change invalidates access
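The smoke tests above could be driven by a small table of expected responses. In the sketch below, `/auth/me` and `/auth/me/access` come from this document; the readiness, login, and backend paths are hypothetical placeholders, as is the `fetch` callable, which stands in for whatever HTTP client the pipeline uses.

```python
from typing import Callable

# (check name, HTTP method, path, expected status)
SMOKE_CHECKS = [
    ("auth readiness", "GET", "/ready", 200),                # hypothetical path
    ("login", "POST", "/auth/login", 200),                   # hypothetical path
    ("current user", "GET", "/auth/me", 200),
    ("access bootstrap", "GET", "/auth/me/access", 200),
    ("permission allowed", "GET", "/example/allowed", 200),  # hypothetical path
    ("permission denied", "GET", "/example/denied", 403),    # hypothetical path
]


def run_smoke(fetch: Callable[[str, str], int], checks=SMOKE_CHECKS) -> list[str]:
    """Run every check and return a description of each failure.

    `fetch(method, path)` must return the HTTP status code.
    """
    failures = []
    for name, method, path, expected in checks:
        status = fetch(method, path)
        if status != expected:
            failures.append(f"{name}: expected {expected}, got {status}")
    return failures
```

Note that the permission-denied check passes only when the backend returns 403; a 200 there is a failure, which is why the denied case belongs in the smoke suite alongside the allowed one.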
26. Runbooks
Section titled “26. Runbooks”26.1 “User cannot access module” runbook
Check:
- is Auth healthy?
- is Core healthy?
- is JWT valid?
- is session revoked?
- is x-org correct?
- does company own module in Core?
- does membership have module grant?
- does access cache need invalidation?
- does backend enforce wrong permission key?
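Several of these checks map onto a fixed evaluation order: session state first, then company ownership in Core, then the membership grant. The sketch below shows that order as a function that returns the first reason for denial. The type `AccessFacts` and its field names are hypothetical; they stand in for whatever Auth and Core actually return.

```python
from dataclasses import dataclass, field


@dataclass
class AccessFacts:
    """Facts gathered from Auth and Core for one user in one company."""
    session_revoked: bool
    company_modules: set = field(default_factory=set)    # modules the company owns in Core
    membership_grants: set = field(default_factory=set)  # modules granted to this membership


def explain_module_access(facts: AccessFacts, module: str) -> str:
    """Return 'allowed' or the first reason access would be denied."""
    if facts.session_revoked:
        return "denied: session revoked"
    if module not in facts.company_modules:
        return "denied: company does not own module in Core"
    if module not in facts.membership_grants:
        return "denied: membership has no module grant"
    return "allowed"
```

If this function says "allowed" for the live data but the user is still blocked, the remaining suspects are the last two items in the checklist: a stale access cache or a backend checking the wrong permission key.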
26.2 “Login failures suddenly spike” runbook
Check:
- Auth DB health
- Redis/rate limit behavior
- signing key/config changes
- DNS/LB issues
- client deployment issues
26.3 “Subscription changed but UI still old” runbook
Check:
- Core wrote entitlement change successfully?
- entitlement version bumped?
- invalidation signal sent?
- Auth cache cleared?
- frontend refetched /auth/me/access?
- backend still using stale access snapshot?
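One way to make stale snapshots detectable is to stamp each cached access snapshot with the entitlement version it was built from and compare that against Core's current version. This is a sketch of that idea only, not a description of how Auth is implemented; `AccessSnapshot`, `resolve_access`, and the `rebuild` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AccessSnapshot:
    entitlement_version: int  # version of Core entitlements this was built from
    modules: frozenset


def resolve_access(
    cached: Optional[AccessSnapshot],
    current_version: int,
    rebuild: Callable[[int], AccessSnapshot],
) -> AccessSnapshot:
    """Serve the cached snapshot only if it is at least as new as Core."""
    if cached is not None and cached.entitlement_version >= current_version:
        return cached
    # Cache is missing or was built before the latest entitlement bump.
    return rebuild(current_version)
```

With this shape, a missed invalidation signal degrades into one extra rebuild instead of a user seeing old entitlements indefinitely, at the cost of a cheap version lookup on each request.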
26.4 “Redis down” runbook
Expected behavior:
- performance degrades
- truth still rebuilds from DB
Tasks:
- verify services fail open/closed correctly per route
- restore Redis
- confirm access cache repopulates
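The expected behavior can be illustrated with a cache-aside read that treats cache errors as a miss. The sketch below uses an in-memory stand-in rather than a real Redis client; `InMemoryCache`, `CacheUnavailable`, `load_access`, and the `access:<user_id>` key format are all hypothetical.

```python
class CacheUnavailable(Exception):
    """Raised by the cache stand-in when it is 'down'."""


class InMemoryCache:
    """Stand-in for a Redis client, with a switch to simulate an outage."""

    def __init__(self):
        self.data = {}
        self.down = False

    def get(self, key):
        if self.down:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if self.down:
            raise CacheUnavailable(key)
        self.data[key] = value


def load_access(user_id, cache, load_from_db):
    """Cache-aside read: the cache is optional, the DB is authoritative."""
    key = f"access:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        pass  # treat an outage like a miss and fall through to the DB

    value = load_from_db(user_id)
    try:
        cache.set(key, value)  # best effort; repopulates once the cache is back
    except CacheUnavailable:
        pass
    return value
```

During an outage every call reaches the database, which is the performance degradation described above; results stay correct because nothing is ever read only from the cache.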
26.5 “JWT verification failing in backends” runbook
Check:
- JWKS reachable
- kid known
- issuer/audience config correct
- clock skew/time sync issues
- key rotation event incomplete
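Most of these checks can be run against a failing token without verifying its signature. The sketch below decodes the JWT header and claims with the standard library and reports which check fails. It is a diagnostic aid only and must not be used to accept tokens; the function name `diagnose_jwt` and the 60-second leeway are assumptions.

```python
import base64
import json
import time


def _b64url_json(segment: str) -> dict:
    """Decode one base64url JWT segment into a dict."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def diagnose_jwt(token, jwks, expected_iss, expected_aud, now=None, leeway_s=60):
    """Return a list of likely reasons verification fails.

    Does NOT verify the signature; use only for troubleshooting.
    """
    header_seg, payload_seg, _signature = token.split(".")
    header = _b64url_json(header_seg)
    claims = _b64url_json(payload_seg)
    now = time.time() if now is None else now
    problems = []

    # A kid missing from JWKS often means a key rotation is incomplete.
    known_kids = {key.get("kid") for key in jwks.get("keys", [])}
    if header.get("kid") not in known_kids:
        problems.append("kid not in JWKS")

    if claims.get("iss") != expected_iss:
        problems.append("issuer mismatch")

    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_aud not in audiences:
        problems.append("audience mismatch")

    # Time checks with leeway help separate real expiry from clock skew.
    if "exp" in claims and now > claims["exp"] + leeway_s:
        problems.append("token expired")
    if "nbf" in claims and now < claims["nbf"] - leeway_s:
        problems.append("token not yet valid (possible clock skew)")
    return problems
```

An empty list with verification still failing points toward the signature itself, for example a backend holding a cached copy of an old JWKS.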
27. Security review checklist before production
- all public domains use TLS
- Core is internal-only or properly protected
- DBs and Redis are private
- JWT private keys stored securely
- JWKS exposed correctly
- refresh tokens hashed
- login rate limits enabled
- admin routes protected
- service-to-service auth documented and enabled
- backups enabled
- restore tested
- alerts configured
- access logs enabled
- secrets not stored in repo
- prod/staging fully separated
28. Infra checklist by component
28.1 Auth infra checklist
- service deployed
- health/readiness working
- auth_db provisioned
- Redis connected
- JWT keys configured
- JWKS public
- logs/metrics enabled
- alerts enabled
- backups enabled
28.2 Core infra checklist
- service deployed
- platform_core_db provisioned
- internal-only exposure configured
- service auth from Auth/Admin configured
- logs/metrics enabled
- backups enabled
28.3 Basic backend infra checklist
- public API routed
- main DB connected
- auth validation configured
- x-org policy enabled
- logs/metrics enabled
28.4 Finance backend infra checklist
- public API routed
- finance DB connected
- auth validation configured
- logs/metrics enabled
28.5 Market/Touring infra checklist
- routing configured
- DB connected
- shared/separate service decision documented
- auth validation configured
- logs/metrics enabled
28.6 Venue infra checklist
- domain/path configured
- DB finalized
- auth validation configured
- logs/metrics enabled
28.7 AI infra checklist
- domain configured
- DB/storage finalized
- upstream AI provider secrets configured
- auth validation configured
- logs/metrics enabled
28.8 Redis infra checklist
- private access only
- persistence decision documented
- memory/TTL policy set
- monitoring enabled
28.9 DNS/TLS checklist
- all records created
- staging/prod separated
- certs valid
- renewal verified
- expiry alerts enabled
29. Recommended infra deliverables after this document
After this phase, the next concrete infra artifacts should be:
- Environment matrix
  - local/dev/staging/prod values and owners
- Service inventory sheet
  - each service, domain, runtime, repo, owner, DB, secrets, alerts
- Network matrix
  - who can call whom
- Secrets inventory
  - all required secrets by environment
- DB backup/restore runbook
  - step-by-step recovery
- Deployment runbook
  - release, smoke tests, rollback
- Incident playbook
  - auth outage
  - core outage
  - redis outage
  - db outage
  - cert expiry
  - stale access issue
30. Final summary
Infrastructure must guarantee these truths:
- Auth is the identity and access truth.
- Core is the commercial entitlement truth.
- Redis is only acceleration.
- PostgreSQL and service-owned DBs are the real source of truth.
- Backends enforce access consistently.
- Private services stay private.
- Backups and restore are mandatory.
- Monitoring and alerting are not optional.
And the most important infra rule of all:
The system must remain correct even when cache is cold, stale, or unavailable.