AWS Infrastructure: Beyond the Console
Clicking 'Launch Instance' is technical debt. A comprehensive guide to Infrastructure as Code (Terraform), VPC Networking, and avoiding the $20,000 NAT Gateway surprise.
There is a phase in every startup’s life called “ClickOps”.
You log into the AWS Console. You search for EC2. You click “Launch Instance”. You name it legacy-server-do-not-touch.
It works. It feels productive.
Two years later, that server crashes.
The engineer who built it left. Nobody knows what OS version it was running. Nobody knows which Security Groups were open. Nobody knows where the SSH keys are.
The business is offline.
At Maison Code Paris, we enforce a strict rule: The Infrastructure is the Code. If it isn’t in Terraform (or Pulumi/CDK), it doesn’t exist. This guide is a deep dive into building production-grade AWS infrastructure that is resilient, secure, and doesn’t bankrupt you.
Why Maison Code Discusses This
We manage infrastructure for brands that cannot fail. When a client launches a collaboration with a global superstar, traffic spikes 100x in 60 seconds. If the load balancer isn’t warmed up, if the database isn’t scalable, the site crashes. We architect for Elasticity. We use AWS not as a “Server Host” but as a “Programmable Utility”. We help CTOs migrate from brittle “Pet Servers” to resilient “Cattle Fleets”.
1. The Core Philosophy: Infrastructure as Code (IaC)
Why do we write infrastructure in HCL (HashiCorp Configuration Language)?
- Reproducibility: You can spin up a “Staging” environment that is an exact clone of “Production” in 10 minutes.
- Auditability: git blame tells you exactly who opened Port 22 to the public and when.
- Disaster Recovery: If us-east-1 goes down, you change one variable (region = "eu-west-1") and redeploy.
The Terraform Stack
We don’t manage state locally. We use a remote backend (S3 + DynamoDB for locking).
# main.tf
terraform {
  backend "s3" {
    bucket         = "maisoncode-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "Production"
      Project     = "E-Commerce"
      ManagedBy   = "Terraform"
    }
  }
}
2. Networking: The VPC Trap
The biggest mistake developers make is using the “Default VPC”. The Default VPC puts everything in Public Subnets. Your database has a public IP. This is reckless. We architect 3-Tier Networks.
graph TD
User -->|HTTPS| ALB[Application Load Balancer]
subgraph VPC
subgraph Public Subnet
ALB
NAT[NAT Gateway]
end
subgraph Private Subnet
App[App Server / Lambda]
end
subgraph Database Subnet
RDS[Postgres RDS]
end
end
App -->|SQL| RDS
App -->|Outbound| NAT
Tier 1: Public Subnet
- Contains: Load Balancers (ALB), NAT Gateways, Bastion Hosts.
- Routing: Has an Internet Gateway (IGW). 0.0.0.0/0 -> IGW.
Tier 2: Private Subnet (App Layer)
- Contains: EC2 Instances, Fargate Containers, Lambda ENIs.
- Routing: No Internet Gateway. 0.0.0.0/0 -> NAT Gateway (for outbound updates).
- Safety: Cannot be reached from the internet directly.
Tier 3: Internal Subnet (Data Layer)
- Contains: RDS, ElastiCache, Redshift.
- Routing: No Internet Access at all. No NAT.
- Safety: Total isolation.
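Here is what that shape looks like in Terraform, as a minimal single-AZ sketch. CIDRs and resource names are illustrative, and a real deployment repeats the subnets across Availability Zones:

# Minimal single-AZ sketch of the 3-tier VPC (CIDRs/names illustrative)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

# Tier 1: public subnet for the ALB and NAT Gateway
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
}

# Tier 2: private subnet for the app layer
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.2.0/24"
}

# Tier 3: database subnet (no default route at all)
resource "aws_subnet" "database" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.3.0/24"
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Public tier: 0.0.0.0/0 -> IGW
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

# Private tier: 0.0.0.0/0 -> NAT (outbound only)
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}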
3. The $20k NAT Gateway Surprise
AWS bandwidth pricing is predatory. A NAT Gateway bills per hour plus per GB processed (~$0.045/GB in us-east-1). If your application pulls 10TB of images from S3 and that traffic routes through the NAT Gateway, you pay ~$450 in processing fees for data that never needed to leave AWS; multiply that across chatty microservices, container image pulls, and analytics pipelines, and the line item reaches five figures. The fix: S3 Gateway Endpoints. A Gateway Endpoint is a virtual "wormhole" from your VPC to S3 that bypasses the NAT entirely. It is free.
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
}
Always provision Gateway Endpoints for S3 and DynamoDB.
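The DynamoDB endpoint follows the same pattern (the route table reference matches the VPC sketch above):

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}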
4. Compute: EC2 vs Fargate vs Lambda
Which compute engine fits luxury e-commerce?
EC2 (Virtual Machines)
- Use Case: Legacy apps, Stateful services (WebSockets), Databases (if self-hosted).
- Maison Code Verdict: Avoid. Too much maintenance (OS patching).
ECS Fargate (Serverless Containers)
- Use Case: Long-running Node.js/Docker apps.
- Pros: No OS patching. You define “CPU: 2, RAM: 4GB”. AWS runs it.
- Maison Code Verdict: The standard choice for Next.js servers rendering HTML (see the task definition sketch after this comparison).
Lambda (Functions)
- Use Case: Event-driven tasks (Image resizing, Order processing).
- Pros: Scale to zero.
- Maison Code Verdict: Outstanding for “Glue Code” and background workers.
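To make the Fargate verdict concrete, here is a minimal task definition sketch. The image URL and sizes are illustrative, and a real deployment still needs a cluster, a service, and an execution role around it:

resource "aws_ecs_task_definition" "web" {
  family                   = "nextjs-web"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048" # 2 vCPU
  memory                   = "4096" # 4 GB

  container_definitions = jsonencode([{
    name         = "web"
    image        = "registry.example.com/nextjs:latest" # illustrative image
    essential    = true
    portMappings = [{ containerPort = 3000 }]
  }])
}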
5. The Database: RDS Aurora Serverless v2
Traditional RDS requires you to pick an instance size (db.m5.large).
If traffic spikes, you crash. If traffic drops, you overpay.
Aurora Serverless v2 scales compute capacity up and down in place, in fractions of a second.
It is the perfect match for the “Flash Sale” model of luxury drops.
resource "aws_rds_cluster" "default" {
cluster_identifier = "aurora-cluster-demo"
engine = "aurora-postgresql"
engine_mode = "provisioned"
serverlessv2_scaling_configuration {
min_capacity = 0.5 # $40/mo
max_capacity = 64.0 # Heavy Load
}
}
When the drop ends, it scales back to 0.5 ACU (Aurora Capacity Units). You pay for what you use.
6. Content Delivery: CloudFront
You should not serve assets from S3 directly to users: every request travels to the bucket's single region, so Time to First Byte suffers for anyone far from it. CloudFront is mandatory. It is a global CDN that caches at the edge. Critical Configuration: Cache Policies. Don't just cache everything.
- Images: Cache for 1 year.
- API Responses: Cache for 0 seconds (or 60 seconds if public).
- HTML: Cache for 0 seconds (Server Side Rendered).
Security: Use OAC (Origin Access Control) to lock your S3 bucket so only CloudFront can read it. Users cannot bypass the CDN.
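A sketch of the OAC lockdown, assuming an existing bucket (aws_s3_bucket.assets) and distribution (aws_cloudfront_distribution.cdn):

resource "aws_cloudfront_origin_access_control" "assets" {
  name                              = "assets-oac"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}

# Bucket policy: only this CloudFront distribution may read objects
data "aws_iam_policy_document" "cdn_only" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.assets.arn}/*"]

    principals {
      type        = "Service"
      identifiers = ["cloudfront.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "AWS:SourceArn"
      values   = [aws_cloudfront_distribution.cdn.arn]
    }
  }
}

resource "aws_s3_bucket_policy" "assets" {
  bucket = aws_s3_bucket.assets.id
  policy = data.aws_iam_policy_document.cdn_only.json
}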
7. The Well-Architected Framework (6 Pillars)
AWS provides a checklist called the “Well-Architected Framework”. We audit every client against it.
- Operational Excellence: Automate everything. No manual changes.
- Security: Encrypt everything. Least Privilege.
- Reliability: Multi-AZ deployments. Self-healing systems.
- Performance Efficiency: Use Serverless and Spot to match capacity to demand.
- Cost Optimization: Analyze bills daily. Tag every resource.
- Sustainability: Use Graviton (ARM) processors; AWS reports up to 60% less energy for the same performance.
8. Disaster Recovery: RTO vs RPO
If AWS us-east-1 burns down, what happens?
You define 2 metrics:
- RTO (Recovery Time Objective): How long to get back online? (Goal: < 1 hour).
- RPO (Recovery Point Objective): How much data can we lose? (Goal: < 5 minutes).
Strategy:
- Database: Cross-Region Read Replicas (US -> EU). Promotion takes 5 minutes.
- Files: S3 Cross-Region Replication (CRR).
- Code: Multi-Region Terraform Apply.
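As a sketch, the files half of that strategy in Terraform. Both buckets must have versioning enabled, and the replication role and EU bucket are assumed to exist:

resource "aws_s3_bucket_replication_configuration" "crr" {
  bucket = aws_s3_bucket.assets.id      # source bucket (versioning enabled)
  role   = aws_iam_role.replication.arn # IAM role with S3 replication permissions

  rule {
    id     = "to-eu"
    status = "Enabled"

    filter {} # replicate all objects

    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket = aws_s3_bucket.assets_eu.arn # replica bucket in eu-west-1
    }
  }
}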
9. Security: The Principle of Least Privilege
IAM (Identity and Access Management) is the firewall for humans.
Never use the Root Account. Lock it in a safe.
Never give AdministratorAccess to a developer.
Create detailed policies strictly scoped to resources.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::maisoncode-uploads/images/*"
    }
  ]
}
This user can upload images. They cannot delete the bucket. They cannot read the financial reports.
10. Cost Optimization: Spot Instances
For Batch Data Processing (e.g., image optimization jobs that run at night), don't use On-Demand instances. Use Spot Instances. They are spare AWS capacity sold at a 70-90% discount. The catch: AWS can terminate them with two minutes' warning. If your workload is stateless (like a queue worker), this doesn't matter: the interrupted job simply restarts elsewhere. This saves thousands of dollars per month on background jobs.
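Since our standard compute is Fargate, the Spot discount arrives via a capacity provider strategy. A sketch, assuming the cluster has FARGATE_SPOT enabled (via aws_ecs_cluster_capacity_providers) and a worker task definition exists:

resource "aws_ecs_service" "worker" {
  name            = "image-worker"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.worker.arn
  desired_count   = 4

  network_configuration {
    subnets = [aws_subnet.private.id] # run workers in the private tier
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT" # spare capacity, interruptible
    weight            = 1
  }
}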
11. Development Workflow: Ephemeral Environments
How do you test infrastructure changes? We use "Ephemerals". When a PR is opened in GitHub:
- GitHub Actions triggers Terraform.
- It creates a new workspace, pr-123.
- It deploys a full stack (VPC + Fargate + RDS).
- It runs E2E tests.
- It runs terraform destroy.
This gives complete confidence that code changes won’t break production.
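The trick that makes this safe is namespacing every resource by workspace, so pr-123 can never collide with production. A minimal sketch (resource names are illustrative):

# Derive per-PR names from the Terraform workspace (e.g. "pr-123")
locals {
  env_name = terraform.workspace
}

resource "aws_ecs_cluster" "preview" {
  name = "app-${local.env_name}" # e.g. "app-pr-123"
}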
12. Conclusion
AWS is a chainsaw. It is powerful enough to cut down a forest, but dangerous enough to cut off your leg. Clicking in the console is playing with toys. Writing Terraform is Engineering. For 2026, we are moving towards “Infrastructure from Code” (using SST or Pulumi to infer infra from code), but the fundamentals of VPCs and IAM remain immutable.
Is your cloud a mess?
Do you have “Unknown Servers” running? Is your bill a mystery?