Compare commits


1 Commit

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Matthias Nannt | 208eb7ce2d | chore: infrastructure improvements to lower elb errors | 2025-06-01 20:27:24 +02:00 |
5 changed files with 694 additions and 1 deletion

View File

@@ -0,0 +1,429 @@
---
description: Infrastructure, Terraform, Kubernetes Cluster related
globs:
alwaysApply: false
---
# Formbricks Infrastructure Comprehensive Guide
## Infrastructure Overview
Formbricks uses a modern, cloud-native infrastructure built on AWS EKS with a focus on scalability, security, and operational excellence. The infrastructure follows Infrastructure as Code (IaC) principles using Terraform and GitOps patterns with Helm. The system has been specifically optimized to minimize ELB 502/504 errors through careful configuration of connection handling, health checks, and pod lifecycle management.
## Repository Structure & Organization
### Terraform File Organization
```
infra/terraform/
├── main.tf # Core infrastructure (VPC, EKS, Karpenter)
├── cloudwatch.tf # Monitoring, alerting, and CloudWatch alarms
├── rds.tf # Aurora PostgreSQL database configuration
├── elasticache.tf # Redis/Valkey caching layer
├── observability.tf # Loki, Grafana, and monitoring stack
├── iam.tf # GitHub OIDC, security roles
├── secrets.tf # AWS Secrets Manager integration
├── provider.tf # AWS, Kubernetes, Helm providers
├── versions.tf # Provider version constraints
└── data.tf # Data sources and external references
```
### Helm Configuration
- **Helmfile**: [infra/formbricks-cloud-helm/helmfile.yaml.gotmpl](mdc:infra/formbricks-cloud-helm/helmfile.yaml.gotmpl) - Multi-environment orchestration
- **Production**: [infra/formbricks-cloud-helm/values.yaml.gotmpl](mdc:infra/formbricks-cloud-helm/values.yaml.gotmpl) - Optimized ALB and pod configurations
- **Staging**: [infra/formbricks-cloud-helm/values-staging.yaml.gotmpl](mdc:infra/formbricks-cloud-helm/values-staging.yaml.gotmpl) - Staging with spot instances
### Key Infrastructure Files
- **Main Infrastructure**: [infra/terraform/main.tf](mdc:infra/terraform/main.tf) - EKS cluster, VPC, Karpenter, and core AWS resources
- **Monitoring**: [infra/terraform/cloudwatch.tf](mdc:infra/terraform/cloudwatch.tf) - CloudWatch alarms for 502/504 error tracking and alerting
- **Database**: [infra/terraform/rds.tf](mdc:infra/terraform/rds.tf) - Aurora PostgreSQL configuration
## Core Architecture Principles
### 1. Multi-Environment Strategy
```hcl
# Environment-aware resource creation
locals {
envs = {
prod = "${local.project}-prod"
stage = "${local.project}-stage"
}
}
# Resource duplication pattern
resource "aws_secretsmanager_secret" "formbricks_app_secrets" {
for_each = local.envs
name = "${each.key}/formbricks/secrets"
}
```
**Key Patterns:**
- **Environment isolation** through separate namespaces and resources
- **Consistent naming** conventions across environments
- **Resource sharing** where appropriate (VPC, EKS cluster)
- **Environment-specific** configurations and scaling parameters
### 2. Network Architecture
```hcl
# Strategic subnet allocation for different workload types
private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)] # /20 - Application workloads
public_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 48)] # /24 - Load balancers
intra_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 52)] # /24 - EKS control plane
database_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 56)] # /24 - RDS/ElastiCache
```
**Design Principles:**
- **Private EKS cluster** with no public endpoint access
- **Multi-AZ deployment** across 3 availability zones
- **VPC endpoints** for AWS services to reduce NAT costs
- **Single NAT Gateway** for cost optimization
### 3. Security Model
```hcl
# IRSA (IAM Roles for Service Accounts) pattern
module "formbricks_app_iam_role" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
oidc_providers = {
eks = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["formbricks:*"]
}
}
}
```
**Security Best Practices:**
- **GitHub OIDC** for CI/CD authentication (no long-lived credentials)
- **Pod Identity** for workload AWS access
- **AWS Secrets Manager** integration via External Secrets Operator (see the sketch after this list)
- **Least privilege** IAM policies for all roles
- **KMS encryption** for sensitive data at rest
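The External Secrets integration assumes a cluster-wide secret store that the Helm values reference by name. A minimal sketch of that `ClusterSecretStore` (the store name matches the Helm values; the region and service-account wiring are assumptions, not part of the repository):
```yaml
# Sketch of the ClusterSecretStore referenced by the Helm values.
# Region and service account are assumptions; adjust to the cluster.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1              # assumption: use the cluster's region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets      # assumption: SA with IRSA / Pod Identity access
            namespace: external-secrets
```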
## ALB Optimization & Error Reduction
### Connection Handling Optimizations
```yaml
# Key ALB annotations for reducing 502/504 errors
alb.ingress.kubernetes.io/load-balancer-attributes: |
idle_timeout.timeout_seconds=120,
connection_logs.s3.enabled=false,
access_logs.s3.enabled=false
alb.ingress.kubernetes.io/target-group-attributes: |
deregistration_delay.timeout_seconds=30,
stickiness.enabled=false,
load_balancing.algorithm.type=least_outstanding_requests,
target_group_health.dns_failover.minimum_healthy_targets.count=1
```
### Health Check Configuration
- **Interval**: 15 seconds for faster detection of unhealthy targets
- **Timeout**: 5 seconds to prevent false positives
- **Thresholds**: 2 healthy, 3 unhealthy for balanced responsiveness
- **Path**: `/health` endpoint optimized for < 100ms response time
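In the ingress configuration these settings map onto ALB annotations. A condensed sketch — the `healthcheck-path` annotation is shown here for completeness and is not set explicitly in the current values files; the remaining values match them:
```yaml
# Health-check annotations corresponding to the settings above.
alb.ingress.kubernetes.io/healthcheck-path: /health        # illustrative; not set explicitly in the values files
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
alb.ingress.kubernetes.io/success-codes: "200"
```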
### Expected Improvements
- **60-80% reduction** in ELB 502 errors
- **Faster recovery** during pod restarts
- **Better connection reuse** efficiency
- **Improved autoscaling** responsiveness
## Kubernetes Platform Configuration
### 1. EKS Cluster Setup
```hcl
# Modern EKS configuration
cluster_version = "1.32"
enable_cluster_creator_admin_permissions = false
cluster_endpoint_public_access = false
cluster_addons = {
coredns = { most_recent = true }
eks-pod-identity-agent = { most_recent = true }
aws-ebs-csi-driver = { most_recent = true }
kube-proxy = { most_recent = true }
vpc-cni = { most_recent = true }
}
```
### 2. Karpenter Autoscaling & Node Management
```hcl
# Intelligent node provisioning
requirements = [
{
key = "karpenter.k8s.aws/instance-family"
operator = "In"
values = ["c8g", "c7g", "m8g", "m7g", "r8g", "r7g"] # ARM64 Graviton
},
{
key = "karpenter.k8s.aws/instance-cpu"
operator = "In"
values = ["2", "4", "8"] # Cost-optimized sizes
}
]
```
**Node Lifecycle Optimization:**
- **Startup Taints**: Prevent traffic during node initialization (see the NodePool sketch below)
- **Graceful Shutdown**: 30s grace period for pod eviction
- **Consolidation Delay**: 60s to reduce unnecessary churn
- **Eviction Policies**: Configured for smooth pod migrations
**Instance Selection:**
- **Families**: c8g, c7g, m8g, m7g, r8g, r7g (ARM64 Graviton)
- **Sizes**: 2, 4, 8 vCPUs for cost optimization
- **Bottlerocket AMI**: Enhanced security and performance
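Rendered as a manifest, the `kubernetes_manifest` node-pool resource in [infra/terraform/main.tf](mdc:infra/terraform/main.tf) corresponds roughly to the following NodePool spec. This is an abridged sketch: the API version and pool name are assumptions, and exact field placement varies between Karpenter releases.
```yaml
# Abridged NodePool sketch mirroring the Terraform node_pool resource.
apiVersion: karpenter.sh/v1beta1   # assumption: depends on the installed Karpenter version
kind: NodePool
metadata:
  name: default                    # assumption: actual name comes from Terraform
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c8g", "c7g", "m8g", "m7g", "r8g", "r7g"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8"]
      # Keep traffic off nodes until initialization finishes
      startupTaints:
        - key: karpenter.sh/startup
          value: "true"
          effect: NoSchedule
      kubelet:
        shutdownGracePeriod: 30s
        shutdownGracePeriodCriticalPods: 10s
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
    expireAfter: 168h   # 7 days
```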
## Pod Lifecycle Management
### Graceful Shutdown Pattern
```yaml
# PreStop hook to allow connection draining
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
# Termination grace period for complete cleanup
terminationGracePeriodSeconds: 45
```
### Health Probe Strategy
- **Startup Probe**: 5s initial delay, 5s interval, max 60s startup time
- **Readiness Probe**: 10s delay, 10s interval for traffic readiness
- **Liveness Probe**: 30s delay, 30s interval for container health
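These settings come straight from the Helm values files; condensed, with the startup window spelled out:
```yaml
# Probe values from the Helm values files; the startup probe allows
# failureThreshold (12) x periodSeconds (5s) = 60s before liveness takes over.
probes:
  startup:
    httpGet: { path: /health, port: 3000 }
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 12
  readiness:
    httpGet: { path: /health, port: 3000 }
    initialDelaySeconds: 10
    periodSeconds: 10
  liveness:
    httpGet: { path: /health, port: 3000 }
    initialDelaySeconds: 30
    periodSeconds: 30
```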
### Rolling Update Configuration
```yaml
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25% # Maintain capacity during updates
maxSurge: 50% # Allow faster rollouts
```
## Application Deployment Patterns
### 1. External Helm Chart Pattern
```yaml
# Helmfile configuration for external charts
repositories:
- name: helm-charts
url: ghcr.io/formbricks/helm-charts
oci: true
releases:
- name: formbricks
chart: helm-charts/formbricks
version: ^3.0.0
values: [values.yaml.gotmpl]
```
**Advantages:**
- **Separation of concerns** (infrastructure vs application)
- **Version control** of application deployment
- **Reusable charts** across environments
- **OCI registry** for secure chart distribution
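The environment switch used in the deployment workflow below (`helmfile -e production sync`, `helmfile -e staging sync`) implies an environments block along these lines — a sketch of the mapping, not the literal contents of the helmfile:
```yaml
# Hypothetical environment -> values mapping assumed by `helmfile -e <env> sync`.
environments:
  production:
    values:
      - values.yaml.gotmpl
  staging:
    values:
      - values-staging.yaml.gotmpl
```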
### 2. Configuration Management
```yaml
# External Secrets pattern
externalSecret:
enabled: true
files:
app-env:
dataFrom:
key: prod/formbricks/environment
secretStore:
kind: ClusterSecretStore
name: aws-secrets-manager
```
### 3. Environment-Specific Configurations
- **Production**: On-demand instances, stricter resource limits
- **Staging**: Spot instances, rate limiting disabled, relaxed resources
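The concrete difference shows up in the values files, for example in the node selector:
```yaml
# values-staging.yaml.gotmpl — staging schedules onto spot capacity
deployment:
  nodeSelector:
    karpenter.sh/capacity-type: spot
---
# values.yaml.gotmpl — production pins to on-demand capacity
deployment:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
```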
## Monitoring & Observability Stack
### 1. Critical ALB Metrics & CloudWatch Alarms
```hcl
# Comprehensive ALB monitoring
alarms = {
ALB_HTTPCode_ELB_502_Count = {
alarm_description = "ALB 502 errors indicating backend connection issues"
threshold = 20
evaluation_periods = 3
period = 300
}
ALB_HTTPCode_ELB_504_Count = {
alarm_description = "ALB 504 timeout errors"
threshold = 15
evaluation_periods = 3
period = 300
}
}
```
**Monitoring Thresholds:**
1. **ELB 502 Errors**: > 20 per 5-minute period, 3 consecutive periods to alarm
2. **ELB 504 Errors**: > 15 per 5-minute period, 3 consecutive periods to alarm
3. **Target Connection Errors**: > 50 per 5-minute period, 3 consecutive periods to alarm
4. **4XX Errors**: > 100 per 10-minute period, 5 consecutive periods to alarm (usually client issues)
### 2. Log Aggregation & Analytics
```hcl
# Loki for centralized logging
module "loki_s3_bucket" {
source = "terraform-aws-modules/s3-bucket/aws"
# S3 backend for long-term log storage
}
module "observability_loki_iam_role" {
# IRSA role for Loki to access S3
}
```
### 3. Grafana Dashboards
```hcl
# Grafana with AWS CloudWatch integration
policy = jsonencode({
Statement = [
{
Sid = "AllowReadingMetricsFromCloudWatch"
Effect = "Allow"
Action = [
"cloudwatch:DescribeAlarms",
"cloudwatch:ListMetrics",
"cloudwatch:GetMetricData"
]
}
]
})
```
## Cost Optimization Strategies
### 1. Instance & Compute Optimization
- **ARM64 Graviton** processors (20% better price-performance)
- **Spot instances** for staging environments
- **Right-sizing** through Karpenter optimization
- **Reserved capacity** for predictable production workloads
### 2. Network & Storage Optimization
- **Single NAT Gateway** (vs. one per AZ)
- **VPC endpoints** to reduce NAT traffic
- **ELB cost optimization** through connection reuse
- **GP3 storage** for better IOPS/cost ratio
- **Lifecycle policies** for log retention
## Deployment Workflow & Best Practices
### 1. Infrastructure Updates
```bash
# Using the deployment script
./infra/deploy-improvements.sh
# Manual process:
cd infra/terraform
terraform plan -out=changes.tfplan
terraform apply changes.tfplan
```
### 2. Application Updates
```bash
# Helmfile deployment
cd infra/formbricks-cloud-helm
helmfile sync
# Environment-specific deployment
helmfile -e production sync
helmfile -e staging sync
```
### 3. Verification Steps
1. **Infrastructure health**: Check EKS cluster status
2. **Application readiness**: Verify pod status and health checks
3. **Network connectivity**: Test ALB target group health
4. **Monitoring**: Confirm CloudWatch metrics and alerts
### 4. Change Management Best Practices
**Testing Strategy:**
- **Staging first**: Test all changes in the staging environment with configurations identical to production
- **Gradual rollout**: Use blue-green or canary deployments
- **Monitoring window**: Observe metrics for 24-48 hours after changes
- **Rollback plan**: Always have a documented rollback strategy
**Performance Optimization:**
- **Health endpoint** should consistently respond in < 100ms
- **Connection pooling** aligned with ALB idle timeouts
- **Resource requests/limits** tuned for consistent performance
- **Graceful shutdown** implemented in application code
- **Maintain ALB timeout alignment** across all layers
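As a summary of the last point, the timeout chain that must stay aligned looks like this (values from the configuration above, except the application keep-alive, which is an assumed example that simply has to exceed the ALB idle timeout):
```yaml
# Timeout alignment summary — not a real config file; values annotated by layer.
alb:
  idle_timeout_seconds: 120            # load-balancer-attributes
  deregistration_delay_seconds: 30     # target-group-attributes
pod:
  preStop_sleep_seconds: 15            # lifecycle preStop hook
  terminationGracePeriodSeconds: 45    # covers preStop + request draining
app:
  keep_alive_timeout_seconds: 125      # assumption: must exceed the ALB idle timeout (120s)
```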
**Security Considerations:**
- **Least privilege**: Review IAM permissions regularly
- **Secret rotation**: Implement regular credential rotation
- **Vulnerability scanning**: Keep base images updated
- **Network policies**: Implement pod-to-pod communication controls
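Network policies are a recommendation here rather than something the current configuration provisions; a minimal default-deny ingress policy for the application namespace could look like the following sketch (namespace name taken from the IRSA configuration above, everything else hypothetical):
```yaml
# Hypothetical default-deny ingress policy; not part of the current configuration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: formbricks
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress
```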
## Troubleshooting Common Issues
### 1. ALB Error Investigation
**502 Error Analysis:**
1. Check pod readiness and health probe status
2. Verify ALB target group health
3. Review deregistration timing during deployments
4. Monitor connection pool utilization
**504 Error Analysis:**
1. Check application response times
2. Verify timeout configurations (ALB: 120s, App: aligned)
3. Review database query performance
4. Monitor resource utilization during traffic spikes
**Connection Error Patterns:**
1. Verify Karpenter node lifecycle timing
2. Check pod termination grace periods
3. Review ALB connection draining settings
4. Monitor cluster autoscaling events
### 2. Infrastructure Issues
**Pod Startup Issues:**
- Check **startup probes** and timing
- Verify **resource requests** vs. available capacity
- Review **image pull** policies and registry access
- Monitor **Karpenter** node provisioning logs
**Connectivity Problems:**
- Validate **security group** rules
- Check **DNS resolution** within cluster
- Verify **service mesh** configuration if applicable
- Review **network policies** for pod communication
**Performance Degradation:**
- Monitor **resource utilization** (CPU, memory, network)
- Check **database connection** pooling and query performance
- Review **cache hit ratios** for Redis/ElastiCache
- Analyze **ALB metrics** for traffic patterns
### 3. Monitoring Strategy
- **Real-time alerts** for error rate spikes
- **Trend analysis** for connection patterns
- **Capacity planning** based on LCU usage
- **4XX pattern analysis** for client behavior insights
## Critical Considerations When Making Infrastructure Changes
1. **Always test in staging first** with identical configurations
2. **Monitor ALB metrics** for 24-48 hours after changes
3. **Use gradual rollouts** with proper health checks and canary deployments
4. **Maintain timeout alignment** across ALB, application, and database layers
5. **Verify security configurations** don't introduce vulnerabilities
6. **Check cost impact** of infrastructure changes
7. **Update monitoring and alerting** to cover new components
8. **Document changes** and update runbooks accordingly
This infrastructure provides a robust, scalable, and cost-effective platform for running Formbricks at scale while maintaining high availability, strong security, and low error rates.

View File

@@ -93,6 +93,51 @@ deployment:
nodeSelector:
karpenter.sh/capacity-type: spot
reloadOnChange: true
# Pod lifecycle management for zero-downtime deployments
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
# Health probes configuration
probes:
readiness:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
liveness:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
startup:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 12
# Pod termination grace period
terminationGracePeriodSeconds: 45
# Rolling update strategy
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 50%
autoscaling:
enabled: true
maxReplicas: 95
@@ -139,6 +184,29 @@ ingress:
alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS13-1-2-Res-2021-06
alb.ingress.kubernetes.io/ssl-redirect: "443"
alb.ingress.kubernetes.io/target-type: ip
# Enhanced ALB configuration for connection handling
alb.ingress.kubernetes.io/load-balancer-attributes: |
idle_timeout.timeout_seconds=120,
connection_logs.s3.enabled=false,
access_logs.s3.enabled=false
# Target group health check optimizations
alb.ingress.kubernetes.io/target-group-attributes: |
deregistration_delay.timeout_seconds=30,
stickiness.enabled=false,
stickiness.type=lb_cookie,
stickiness.lb_cookie.duration_seconds=86400,
load_balancing.algorithm.type=least_outstanding_requests,
target_group_health.dns_failover.minimum_healthy_targets.count=1,
target_group_health.dns_failover.minimum_healthy_targets.percentage=off
# Health check configuration
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
alb.ingress.kubernetes.io/success-codes: "200"
# Backend protocol and port
alb.ingress.kubernetes.io/backend-protocol: HTTP
alb.ingress.kubernetes.io/backend-protocol-version: HTTP1
enabled: true
hosts:
- host: stage.app.formbricks.com
@@ -163,3 +231,16 @@ postgresql:
enabled: false
redis:
enabled: false
## Service Configuration
service:
type: ClusterIP
port: 80
targetPort: 3000
annotations:
# Service annotations for better ALB integration
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "120"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
# Session affinity disabled for better load distribution
sessionAffinity: None

View File

@@ -89,6 +89,51 @@ deployment:
nodeSelector:
karpenter.sh/capacity-type: on-demand
reloadOnChange: true
# Pod lifecycle management for zero-downtime deployments
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
# Health probes configuration
probes:
readiness:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
liveness:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
startup:
httpGet:
path: /health
port: 3000
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 12
# Pod termination grace period
terminationGracePeriodSeconds: 45
# Rolling update strategy
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 50%
autoscaling:
enabled: true
maxReplicas: 95
@@ -135,6 +180,39 @@ ingress:
alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
alb.ingress.kubernetes.io/ssl-redirect: "443"
alb.ingress.kubernetes.io/target-type: ip
# Enhanced ALB configuration for connection handling
alb.ingress.kubernetes.io/load-balancer-attributes: |
idle_timeout.timeout_seconds=120,
connection_logs.s3.enabled=false,
access_logs.s3.enabled=false
# Target group health check optimizations
alb.ingress.kubernetes.io/target-group-attributes: |
deregistration_delay.timeout_seconds=30,
stickiness.enabled=false,
stickiness.type=lb_cookie,
stickiness.lb_cookie.duration_seconds=86400,
load_balancing.algorithm.type=least_outstanding_requests,
target_group_health.dns_failover.minimum_healthy_targets.count=1,
target_group_health.dns_failover.minimum_healthy_targets.percentage=off
# Health check configuration
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
alb.ingress.kubernetes.io/success-codes: "200"
# Backend protocol and port
alb.ingress.kubernetes.io/backend-protocol: HTTP
alb.ingress.kubernetes.io/backend-protocol-version: HTTP1
# Connection draining
alb.ingress.kubernetes.io/actions.ssl-redirect: |
{
"Type": "redirect",
"RedirectConfig": {
"Protocol": "HTTPS",
"Port": "443",
"StatusCode": "HTTP_301"
}
}
enabled: true
hosts:
- host: app.k8s.formbricks.com
@@ -164,3 +242,16 @@ postgresql:
enabled: false
redis:
enabled: false
## Service Configuration
service:
type: ClusterIP
port: 80
targetPort: 3000
annotations:
# Service annotations for better ALB integration
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "120"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
# Session affinity disabled for better load distribution
sessionAffinity: None

View File

@@ -57,6 +57,62 @@ locals {
LoadBalancer = local.alb_id
}
}
ALB_HTTPCode_ELB_502_Count = {
alarm_description = "ALB 502 errors indicating backend connection issues"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
threshold = 20
period = 300
unit = "Count"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_ELB_502_Count"
statistic = "Sum"
dimensions = {
LoadBalancer = local.alb_id
}
}
ALB_HTTPCode_ELB_504_Count = {
alarm_description = "ALB 504 errors indicating timeout issues"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
threshold = 15
period = 300
unit = "Count"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_ELB_504_Count"
statistic = "Sum"
dimensions = {
LoadBalancer = local.alb_id
}
}
ALB_HTTPCode_Target_4XX_Count = {
alarm_description = "High 4XX error rate indicating client issues or misconfigurations"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 5
threshold = 100
period = 600
unit = "Count"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_Target_4XX_Count"
statistic = "Sum"
dimensions = {
LoadBalancer = local.alb_id
}
}
ALB_TargetConnectionErrorCount = {
alarm_description = "High target connection errors indicating backend connectivity issues"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
threshold = 50
period = 300
unit = "Count"
namespace = "AWS/ApplicationELB"
metric_name = "TargetConnectionErrorCount"
statistic = "Sum"
dimensions = {
LoadBalancer = local.alb_id
}
}
ALB_TargetResponseTime = {
alarm_description = format("Average API response time is greater than %s", 5)
comparison_operator = "GreaterThanThreshold"

View File

@@ -385,6 +385,38 @@ resource "kubernetes_manifest" "node_pool" {
values = ["nitro"]
}
]
# Add node startup and shutdown taints to prevent traffic during lifecycle events
startupTaints = [
{
key = "karpenter.sh/startup"
value = "true"
effect = "NoSchedule"
}
]
# Add kubelet configuration for better pod lifecycle management
kubelet = {
maxPods = 110
clusterDNS = ["169.254.20.10"]
# Graceful node shutdown configuration
shutdownGracePeriod = "30s"
shutdownGracePeriodCriticalPods = "10s"
# Pod eviction settings
evictionHard = {
"memory.available" = "100Mi"
"nodefs.available" = "10%"
"imagefs.available" = "10%"
}
evictionSoft = {
"memory.available" = "500Mi"
"nodefs.available" = "15%"
"imagefs.available" = "15%"
}
evictionSoftGracePeriod = {
"memory.available" = "2m"
"nodefs.available" = "2m"
"imagefs.available" = "2m"
}
}
}
}
limits = {
@@ -392,8 +424,12 @@ resource "kubernetes_manifest" "node_pool" {
}
disruption = {
consolidationPolicy = "WhenEmptyOrUnderutilized"
consolidateAfter = "30s"
consolidateAfter = "60s" # Increased from 30s to reduce frequent disruptions
# Expiration settings for better predictability
expireAfter = "168h" # 7 days
}
# Weight for prioritizing this NodePool
weight = 100
}
}
}