Half Life

Terraform - AWS

A brief tutorial on Terraform

By Naga Sai Rao | 37 min read

Terraform lets you describe your entire cloud setup as code: instead of clicking through the AWS console to create servers, load balancers, and permissions, you write text files that declare what you want, and Terraform makes reality match. This guide is a from-scratch tutorial and revision reference for using Terraform with AWS, covering how Terraform actually works (state, plan, apply, the dependency graph), the language itself (HCL, variables, modules, loops), and then the essential services you will provision constantly: IAM and roles, networking (VPC), S3, EC2, ALB, ACM certificates, ECS containers, CloudFront, and Lambda with API Gateway for serverless APIs. By the end you should be able to read and write real infrastructure for most common architectures. Every concept is explained plainly, then shown with working configuration you can adapt.

A note on Terraform vs OpenTofu. In 2023 HashiCorp changed Terraform's license to the Business Source License (BSL), and the community forked the last open-source version into OpenTofu, now governed by the Linux Foundation under the open MPL license. They share the same language (HCL), the same provider ecosystem, the same state file format, and nearly the same CLI, so everything in this guide applies to both; you just type terraform or tofu. OpenTofu has added a few features Terraform's open CLI lacks (notably built-in state encryption). For learning and for most teams, either works and the concepts are identical.

What Terraform Is and How It Thinks

Terraform is a declarative infrastructure-as-code tool. You do not write steps ("create a server, then attach a disk, then open a port"); you declare the desired end state ("a server with this disk and these ports"), and Terraform figures out the actions needed to reach it. Run it again with no changes and it does nothing, because reality already matches your declaration.

Three ideas make the whole thing work:

  • Providers are plugins that teach Terraform how to talk to a platform's API. The AWS provider knows how to create EC2 instances, S3 buckets, and so on. There are thousands of providers (AWS, Azure, GCP, Cloudflare, GitHub).
  • Resources are the things you declare: one aws_instance, one aws_s3_bucket. Each maps to a real object in the cloud.
  • State is Terraform's record of what it has created and how those resources map to real cloud objects. This is the concept people most need to understand, covered in depth below.

The core loop you will run constantly:

bash
terraform init      # download providers and set up the working directory
terraform plan      # preview: what will change to reach the desired state?
terraform apply     # make it happen (after showing the plan and asking to confirm)
terraform destroy   # tear it all down

plan is the safety feature that makes Terraform trustworthy: it shows you exactly what will be created, changed, or destroyed before anything happens, so there are no surprises.

State: How Terraform Remembers

State is the single most important concept, and the one that causes the most trouble when misunderstood. When Terraform creates a resource, it records in a state file (terraform.tfstate, a JSON file) the mapping between your configuration (aws_instance.web) and the real object in AWS (instance i-0abc123). On the next run, Terraform reads the state, checks reality, compares both to your configuration, and computes the difference.

Without state, Terraform would have no way to know that the aws_instance.web in your code is the same server it made last time, so it could not update it in place or know to delete it when you remove it from code. State is the memory that makes declarative management possible.

text
Your .tf config     Terraform state         Real AWS
(desired state)     (what I made)           (actual)
   web instance  <->  i-0abc123        <->  running instance i-0abc123

When you change your config and run plan, Terraform does a three-way comparison: desired (config) versus recorded (state) versus actual (a refresh against the AWS API). The plan is the set of actions to make actual match desired.

Gotcha: never edit the state file by hand, and never lose it. The state file is the source of truth for what Terraform manages. If you lose it, Terraform forgets everything it made and will try to recreate resources that already exist. If you hand-edit it and get it wrong, you corrupt Terraform's understanding of reality. Treat it as precious.

Remote state and locking

By default the state file sits on your local disk, which is fine for learning but wrong for teams: two people running apply at once would corrupt it, and a lost laptop loses the state. The standard solution on AWS is a remote backend: store the state in an S3 bucket, and use a lock (DynamoDB, or S3's native locking in newer versions) so only one person can apply at a time.

hcl
# backend.tf: store state remotely in S3, with locking
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/network.tfstate" # path within the bucket
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"             # prevents concurrent applies
  }
}

Gotcha: state can contain secrets. If a resource has a sensitive value (a database password, a generated key), it is stored in the state file in plain text. This is why the state bucket must be private and encrypted, and why OpenTofu's built-in state encryption is valued. Never commit a state file to git.
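
For the native S3 locking mentioned above, a minimal sketch follows. It assumes Terraform 1.10 or later (or a recent OpenTofu release), where the S3 backend accepts use_lockfile; the bucket name and key are the same placeholders as in the DynamoDB example.

hcl
# backend.tf: same remote state, locked with an S3 lock file instead of DynamoDB
terraform {
  backend "s3" {
    bucket       = "mycompany-terraform-state"
    key          = "production/network.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # assumption: needs a version with native S3 locking
  }
}

After changing a backend block, rerun terraform init so Terraform can reconfigure the backend and offer to migrate the existing state.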

The Language: HCL Basics

Terraform configuration is written in HCL (HashiCorp Configuration Language), a declarative language built around blocks. The building blocks you will use everywhere:

Provider block configures a provider (here, which AWS region and default tags to apply to everything):

hcl
terraform {
  required_version = ">= 1.7"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"     # allow 5.x, a good pinning practice
    }
  }
}

provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      Project   = "myapp"
      ManagedBy = "terraform"    # tag everything so you know what Terraform owns
    }
  }
}

Resource block declares a thing to create. The pattern is resource "<type>" "<local_name>", where the type comes from the provider and the local name is how you refer to it elsewhere in your code:

hcl
resource "aws_s3_bucket" "assets" {
  bucket = "myapp-assets-bucket"
}

Referencing resources is how Terraform builds its dependency graph. You refer to an attribute of one resource from another using type.local_name.attribute:

hcl
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id   # reference creates a dependency
  versioning_configuration { status = "Enabled" }
}

Because aws_s3_bucket_versioning.assets references aws_s3_bucket.assets.id, Terraform knows the bucket must exist first. You do not order things manually; references create the order.

Data source reads something that already exists (that Terraform did not create), so you can reference it. The pattern is data "<type>" "<name>":

hcl
# Look up the latest Amazon Linux 2023 AMI instead of hardcoding an ID
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}
# use it: ami = data.aws_ami.al2023.id

Outputs expose values after apply (an IP, a URL), for humans or for other Terraform configs to consume:

hcl
output "bucket_name" {
  value = aws_s3_bucket.assets.bucket
}

Reading HCL Syntax

Before the examples get denser, here are the syntax pieces that appear throughout this guide. Learn these once and no code block below will look cryptic.

String interpolation ${...} embeds an expression inside a string. Outside a string you reference values directly; inside one you wrap them:

hcl
bucket = var.name                    # direct reference, no ${}
bucket = "myapp-${var.name}-assets"  # inside a string, needs ${}

count.index is the current item's number (0, 1, 2, ...) inside a resource that uses count. It is how each copy gets a different value:

hcl
resource "aws_subnet" "public" {
  count      = 2
  vpc_id     = aws_vpc.main.id                                     # required: the VPC the subnet belongs to
  cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index) # 0, then 1
}
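
To make the cidrsubnet call above concrete: it takes a base range, a number of extra prefix bits, and a subnet number. Here is a sketch with the VPC range written out literally; the local names are illustrative only.

hcl
locals {
  # /16 plus 8 new bits gives /24 networks; the third argument picks which one
  first_subnet  = cidrsubnet("10.0.0.0/16", 8, 0)    # "10.0.0.0/24"
  second_subnet = cidrsubnet("10.0.0.0/16", 8, 1)    # "10.0.1.0/24"
  offset_subnet = cidrsubnet("10.0.0.0/16", 8, 10)   # "10.0.10.0/24"
}

You can try expressions like these interactively with terraform console before putting them in a resource.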

Indexing [N] accesses one specific item from a counted resource by position, because a counted resource is a list:

hcl
subnet_id = aws_subnet.public[0].id   # the first public subnet

The splat operator [*] takes an attribute from every item in a counted resource, producing a list. This is the answer to "what is aws_subnet.private[*].id": it is the id of every private subnet, as a list, which is exactly what things like ALBs and ECS services want (they take a list of subnets):

hcl
aws_subnet.private[*].id
# produces: ["subnet-0abc", "subnet-0def"]
# equivalent to the for expression:
[for s in aws_subnet.private : s.id]

each.key and each.value are the for_each equivalent of count.index. When you loop with for_each, each.key is the current key and each.value the current value:

hcl
resource "aws_s3_bucket" "buckets" {
  for_each = toset(["assets", "logs"])
  bucket   = "myapp-${each.key}"   # "myapp-assets", then "myapp-logs"
}

for expressions build a new list or map by transforming another. The list form uses [...], the map form uses {... => ...}:

hcl
[for s in var.names : upper(s)]        # list -> list
{for k, v in var.items : k => v.size}  # map -> map (the ACM section uses this form)

The ternary condition ? a : b is an inline if/else:

hcl
instance_type = var.environment == "prod" ? "t3.medium" : "t3.micro"

Heredoc <<-EOF ... EOF is a multi-line string, used for startup scripts and inline text. The - lets you indent the block and strips the leading whitespace:

hcl
user_data = <<-EOF
  #!/bin/bash
  yum install -y nginx
EOF

jsonencode({...}) converts an HCL object into a JSON string. IAM policies and ECS container definitions are JSON, so you write readable HCL and let jsonencode produce the JSON, instead of hand-writing it:

hcl
policy = jsonencode({
  Version   = "2012-10-17"
  Statement = [{ Effect = "Allow", Action = "s3:GetObject", Resource = "*" }]
})

Type conversions toset(), tolist(), tomap() change a value's collection type. The common one is toset(), because for_each requires a set or map, not a plain list, so you often see for_each = toset([...]).

With those in hand, every symbol in the configuration below has a name and a meaning, and the resource examples become straightforward to read.

Variables and Reuse

Hardcoding values makes config rigid. Input variables parameterize it. A variable lives in three separate places, and seeing all three at once is the key to not getting lost: you declare it (say it exists), assign it (give it a value), and use it (read it with var.).

1. Declare the variable, conventionally in a file named variables.tf. This defines that the input exists, its type, and an optional default:

hcl
# variables.tf
variable "environment" {
  type        = string
  description = "Deployment environment"
  default     = "dev"          # used only if no value is supplied
}

variable "instance_count" {
  type    = number
  default = 2
}

2. Assign a value, most commonly in a file named terraform.tfvars. A value here overrides the default:

hcl
# terraform.tfvars
environment    = "prod"
instance_count = 4

3. Use it anywhere in your resources with var.<name>. Note the prefix: you declare it as variable "environment" but read it as var.environment:

hcl
# main.tf
resource "aws_instance" "web" {
  count         = var.instance_count                                    # reads 4
  ami           = data.aws_ami.al2023.id                                # required; from the data source above
  instance_type = var.environment == "prod" ? "t3.medium" : "t3.micro"  # reads "prod"
}

So the value flows from variables.tf (declare) to terraform.tfvars (assign) to main.tf (use). Terraform loads all .tf files in a directory together, so splitting them is purely for humans; you could put everything in one file.

Besides terraform.tfvars, you can assign a value with a -var flag (terraform apply -var="environment=prod") or an environment variable (export TF_VAR_environment=prod). When more than one is set, the precedence is: CLI -var wins, then .tfvars files, then TF_VAR_ environment variables, then the default. If there is no default and no value is supplied anywhere, Terraform prompts you for one when run interactively and fails otherwise. This separation is what lets the same code deploy dev and prod: identical .tf files, a different .tfvars.
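
As a sketch of that dev/prod split, two variable files might look like this. The file names dev.tfvars and prod.tfvars are illustrative; unlike terraform.tfvars, files with these names are not loaded automatically, so you would pass one explicitly with the -var-file flag.

hcl
# dev.tfvars
environment    = "dev"
instance_count = 1

# prod.tfvars (a separate file)
environment    = "prod"
instance_count = 4

You would then run, for example, terraform apply -var-file=prod.tfvars against the same .tf files.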

Locals are computed values you reuse within a config (not inputs, so they cannot be set from outside), and outputs (above) expose results after apply:

hcl
locals {
  name_prefix = "${var.environment}-myapp"   # derived once, reused everywhere as local.name_prefix
}
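
A short sketch of the local in use, giving two resources a consistent name. The resource names reports and per_env are illustrative and do not appear elsewhere in this guide.

hcl
resource "aws_s3_bucket" "reports" {
  bucket = "${local.name_prefix}-reports"   # e.g. "prod-myapp-reports"
}

resource "aws_ecs_cluster" "per_env" {
  name = "${local.name_prefix}-cluster"     # e.g. "prod-myapp-cluster"
}

Changing the naming scheme later means editing the locals block once rather than every resource.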

Loops: count and for_each

To create many similar resources, you use count (by number) or for_each (by a map or set). for_each is usually better because adding or removing an item does not shift the others.

hcl
# count: three similar subnets, indexed 0,1,2
resource "aws_subnet" "public" {
  count      = 3
  vpc_id     = aws_vpc.main.id
  cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
}

# for_each: named buckets from a set (stable keys, safe to add/remove)
resource "aws_s3_bucket" "buckets" {
  for_each = toset(["assets", "logs", "backups"])
  bucket   = "myapp-${each.key}"
}

Gotcha: count and the shifting-index problem. With count, resources are tracked by position ([0], [1], [2]). If you remove the middle item from a list, everything after it shifts index, and Terraform will destroy and recreate those resources. for_each tracks by key (["assets"]), so removing one leaves the others untouched. Prefer for_each for anything that might change.
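
Because a for_each resource is a map of instances keyed by each.key, you address one instance by key, and you collect an attribute from all of them with a for expression or by converting to a list first (the [*] splat applies to lists). Here is a sketch against the aws_s3_bucket.buckets example above; the output names are illustrative.

hcl
output "logs_bucket_arn" {
  value = aws_s3_bucket.buckets["logs"].arn   # one instance, addressed by key
}

output "all_bucket_arns" {
  value = { for key, b in aws_s3_bucket.buckets : key => b.arn }   # map of key => ARN
}

output "all_bucket_names" {
  value = values(aws_s3_bucket.buckets)[*].bucket   # values() turns the map into a list
}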

Modules: Packaging Infrastructure

A module is a reusable, parameterized bundle of Terraform configuration, the equivalent of a function. You define inputs (variables), resources, and outputs once, then call it many times with different inputs. Every Terraform configuration is itself a module (the "root module"); you just add child modules for reuse.

hcl
# Calling a module (local or from the registry)
module "vpc" {
  source = "./modules/vpc"      # or a registry path like "terraform-aws-modules/vpc/aws"

  cidr_block  = "10.0.0.0/16"
  environment = var.environment
}

# Use the module's outputs
resource "aws_instance" "web" {
  subnet_id = module.vpc.public_subnet_ids[0]
}

Modules are how you avoid copy-pasting and how you share standardized building blocks across teams. The public Terraform Registry has well-maintained modules (like terraform-aws-modules/vpc/aws) that encode best practices, and using them is often smarter than hand-rolling networking.

Best practice: structure by environment. A common layout is a modules/ directory of reusable components, and separate environments/dev, environments/prod directories that call those modules with different variables and their own state. This keeps environments isolated so a change to dev cannot accidentally alter prod.
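
To show what sits behind a call like module "vpc" above, here is a sketch of what ./modules/vpc might contain. The file layout and contents are assumptions for illustration; the only fixed points are the two inputs and the public_subnet_ids output that the earlier example uses.

hcl
# modules/vpc/variables.tf: the module's inputs
variable "cidr_block" {
  type = string
}

variable "environment" {
  type = string
}

# modules/vpc/main.tf: the resources the module owns
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
  tags       = { Name = "${var.environment}-vpc" }
}

resource "aws_subnet" "public" {
  count      = 2
  vpc_id     = aws_vpc.this.id
  cidr_block = cidrsubnet(var.cidr_block, 8, count.index)
}

# modules/vpc/outputs.tf: what callers read as module.vpc.<name>
output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

An environments/prod directory would call this module with prod values and its own backend block, while environments/dev does the same with dev values.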

IAM and Roles: Permissions as Code

IAM (Identity and Access Management) controls who can do what in AWS, and it is where beginners get most confused, so it is worth going slowly. The pieces:

  • A policy is a JSON document listing permissions (allow or deny specific actions on specific resources).
  • A role is an identity that can be assumed by something (an EC2 instance, an ECS task, a Lambda) to gain the permissions its policies grant. Roles are how AWS services get permissions without hardcoded credentials.
  • A trust policy (assume-role policy) says who is allowed to assume the role.

The key mental model: you attach a role to a compute resource, and that resource then acts with the role's permissions. No access keys in your code.

hcl
# 1. A role that EC2 instances are allowed to assume (the trust policy)
resource "aws_iam_role" "app" {
  name = "myapp-instance-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" } # EC2 may assume this role
      Action    = "sts:AssumeRole"
    }]
  })
}

# 2. A permissions policy: what the role is allowed to do
resource "aws_iam_role_policy" "app_s3" {
  name = "read-assets"
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "${aws_s3_bucket.assets.arn}/*" # only this bucket's objects
    }]
  })
}

# 3. An instance profile: the wrapper that lets an EC2 instance use the role
resource "aws_iam_instance_profile" "app" {
  name = "myapp-instance-profile"
  role = aws_iam_role.app.name
}

Gotcha: the instance profile is a required wrapper. An EC2 instance cannot use a role directly; it uses an instance profile, which is a container for exactly one role. You attach the instance profile to the instance, not the role. ECS tasks and Lambda attach the role directly, so this extra step is EC2-specific and a common source of confusion.

Best practice: least privilege, and prefer roles over keys. Grant only the actions and resources actually needed, scope Resource to specific ARNs rather than "*", and never bake AWS access keys into code or environment variables on a server. Attach a role instead, so credentials are temporary and rotated automatically. For CI/CD (like GitHub Actions), use OIDC to assume a role rather than storing long-lived keys.
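
As a sketch of scoping Resource tightly, the policy below grants listing on the bucket and reads on its objects, which are two different ARNs. It uses the aws_iam_policy_document data source as an alternative to jsonencode; the names read_assets and app_s3_scoped are illustrative, while aws_s3_bucket.assets and aws_iam_role.app are the resources defined earlier.

hcl
data "aws_iam_policy_document" "read_assets" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.assets.arn]          # the bucket itself
  }
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.assets.arn}/*"]   # the objects inside it
  }
}

resource "aws_iam_role_policy" "app_s3_scoped" {
  name   = "read-and-list-assets"
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.read_assets.json
}

The data source renders the same JSON that jsonencode would; which form you use is a matter of taste.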

Networking: The VPC Everything Sits In

Almost every AWS resource lives inside a VPC (Virtual Private Cloud), your private network in AWS. You rarely provision compute without one, so understand the pieces:

  • VPC: the overall network, defined by an IP range (CIDR block like 10.0.0.0/16).
  • Subnets: subdivisions of the VPC, each in one availability zone. Public subnets can reach the internet directly; private subnets cannot (used for databases and app servers you want shielded).
  • Internet Gateway: lets public subnets reach the internet.
  • NAT Gateway: lets private subnets make outbound connections (like downloading updates) without being reachable from outside.
  • Route tables: rules that send traffic to the right gateway.
  • Security groups: virtual firewalls attached to resources, controlling inbound and outbound traffic.

hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
}

# Look up the region's availability zones so subnets spread across them
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

# Private subnets (for ECS tasks, databases): same idea, no public IP.
# Offset the CIDR index so they do not overlap the public subnets above.
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# A security group allowing inbound HTTP and all outbound
resource "aws_security_group" "web" {
  name   = "web-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # anyone can reach port 80
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"            # all protocols
    cidr_blocks = ["0.0.0.0/0"]   # can reach anywhere outbound
  }
}
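
One piece the configuration above leaves out is routing: a subnet only behaves as public once its route table sends internet-bound traffic to the Internet Gateway. Here is a minimal sketch; the resource name public is chosen for illustration.

hcl
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"                     # everything not local to the VPC
    gateway_id = aws_internet_gateway.main.id    # leaves through the Internet Gateway
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)     # one association per public subnet
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

Private subnets would get their own route table whose default route points at a NAT gateway (nat_gateway_id) instead.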

In practice most teams use the community terraform-aws-modules/vpc/aws module rather than writing all of this by hand, because getting subnets, route tables, and NAT gateways right across multiple availability zones is fiddly. But you should understand the pieces so the module's inputs make sense.

Gotcha: security groups vs NACLs. Security groups are stateful (if you allow inbound, the response is automatically allowed out) and attach to resources. Network ACLs are stateless and attach to subnets. For most work you use security groups; reach for NACLs only for subnet-wide coarse rules.

S3: Object Storage

S3 stores objects (files) in buckets. It is the simplest service to start with and appears in almost every architecture (static assets, backups, logs, data lakes). Modern S3 configuration is split across several resources rather than one giant block, which trips up people used to older examples.

hcl
resource "aws_s3_bucket" "assets" {
  bucket = "myapp-assets-unique-name"   # bucket names are globally unique
}

# Versioning is now its own resource (not a block inside the bucket)
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id
  versioning_configuration { status = "Enabled" }
}

# Block all public access unless you explicitly need a public bucket
resource "aws_s3_bucket_public_access_block" "assets" {
  bucket                  = aws_s3_bucket.assets.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Server-side encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "AES256" }
  }
}

Gotcha: the modern split-resource pattern. Older tutorials show versioning, encryption, and lifecycle rules as blocks inside aws_s3_bucket. Current AWS provider versions moved these into separate resources (aws_s3_bucket_versioning, aws_s3_bucket_server_side_encryption_configuration, and so on). If you copy an old example and it errors, this split is usually why.

Gotcha: bucket names are globally unique. Not unique to your account, unique across all of AWS. If myapp-assets is taken, your apply fails. Add a random suffix or your org name.
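
One low-effort way to get a unique name is to let the provider generate the suffix with bucket_prefix (used instead of bucket). This is a sketch with an illustrative resource name, uploads:

hcl
resource "aws_s3_bucket" "uploads" {
  bucket_prefix = "myapp-uploads-"        # the provider appends a unique suffix
}

output "uploads_bucket_name" {
  value = aws_s3_bucket.uploads.bucket    # the generated name, known after apply
}

The trade-off is that the final name is not known until apply, so anything that needs it should reference the resource attribute rather than a hardcoded string.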

EC2: Virtual Servers

EC2 gives you virtual machines. The core resource is aws_instance, and you typically combine it with the AMI lookup, a security group, and the instance profile from the IAM section.

hcl
resource "aws_instance" "web" {
  ami                    = data.aws_ami.al2023.id     # from the data source earlier
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name  # the role wrapper

  user_data = <<-EOF
    #!/bin/bash
    yum install -y nginx
    systemctl enable --now nginx
  EOF

  tags = { Name = "web-server" }
}

user_data is a startup script that runs when the instance first boots, commonly used to install and start software. For anything beyond a demo, though, you would bake software into a custom AMI or (better) run containers on ECS rather than configuring servers by hand.

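Gotcha: some changes replace the instance. Some attribute changes can be applied in place, but others force Terraform to destroy and recreate the instance. Changing the AMI always does. Changing user_data does so only when user_data_replace_on_change is set to true; otherwise, in current AWS provider versions, the instance is stopped and started, and a script that runs only on first boot is not rerun. plan always tells you when a change is a replacement (shown as -/+), so read it before applying.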
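
If you do want a changed startup script to produce a fresh instance, the AWS provider has a user_data_replace_on_change argument. This sketch is a variant of the instance above, with the illustrative name web_immutable:

hcl
resource "aws_instance" "web_immutable" {
  ami                         = data.aws_ami.al2023.id
  instance_type               = "t3.micro"
  subnet_id                   = aws_subnet.public[0].id
  user_data_replace_on_change = true   # a changed script forces a new instance

  user_data = <<-EOF
    #!/bin/bash
    yum install -y nginx
    systemctl enable --now nginx
  EOF

  lifecycle {
    create_before_destroy = true       # bring up the replacement before removing the old one
  }
}

With this set, plan shows the instance as a replacement whenever the script text changes.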

ALB: Load Balancing

An Application Load Balancer (ALB) distributes incoming HTTP/HTTPS traffic across multiple targets (instances or containers), which is how you run more than one server and survive one failing. It has three parts that confuse people until you see them together:

  • The load balancer itself (the public entry point, sitting in public subnets).
  • A target group: the pool of things to send traffic to, with a health check.
  • A listener: the rule that says "traffic arriving on port 443 goes to this target group."

hcl
resource "aws_lb" "main" {
  name               = "myapp-alb"
  load_balancer_type = "application"
  subnets            = aws_subnet.public[*].id      # spans public subnets
  security_groups    = [aws_security_group.web.id]  # must allow inbound 443; web-sg as defined earlier only opens port 80
}

resource "aws_lb_target_group" "app" {
  name        = "myapp-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"                                # "ip" for ECS Fargate; "instance" for EC2

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate.main.arn  # a TLS cert from ACM (see the next section)

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

Gotcha: the target group type must match your compute. Use target_type = "instance" when registering EC2 instances, and target_type = "ip" for ECS Fargate tasks (which get their own IPs). Mismatching this is a common reason targets never become healthy.

Gotcha: health checks decide everything. The ALB only sends traffic to targets that pass the health check. If your health check path returns anything other than success, the target is marked unhealthy and gets no traffic, which looks like "my app is down" even though the container is running. Make sure the health check path actually exists and returns 200.
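
Since the listener above takes traffic on 443, the load balancer's security group has to admit that port. Below is a sketch of a dedicated ALB security group plus a listener that redirects plain HTTP to HTTPS. The names alb and http_redirect are assumptions for illustration, not resources defined elsewhere in this guide.

hcl
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # HTTPS from anywhere
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # HTTP, accepted only so it can be redirected
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Send anything arriving on port 80 to the HTTPS listener
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

To use the group, the load balancer's security_groups would list aws_security_group.alb.id.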

ACM: TLS Certificates

ACM (AWS Certificate Manager) issues the TLS/SSL certificates that give you HTTPS, for free, and auto-renews them so they do not expire on you. The ALB listener above referenced a certificate ARN, and a CloudFront distribution serving a custom domain needs one too; this is where it comes from. The usual flow is: request a certificate for your domain, prove you own the domain via DNS validation (ACM gives you a CNAME record to add), and reference the validated certificate.

hcl
# 1. Request a certificate for your domain
resource "aws_acm_certificate" "main" {
  domain_name               = "myapp.com"
  subject_alternative_names = ["*.myapp.com"]   # also cover subdomains
  validation_method         = "DNS"

  lifecycle {
    create_before_destroy = true   # avoids downtime when the cert is replaced
  }
}

# 2. Create the DNS validation records ACM asks for (here, in Route 53)
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.main.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      type   = dvo.resource_record_type
      record = dvo.resource_record_value
    }
  }
  allow_overwrite = true   # myapp.com and *.myapp.com can share one validation record
  zone_id         = aws_route53_zone.main.zone_id
  name            = each.value.name
  type            = each.value.type
  records         = [each.value.record]
  ttl             = 60
}

# 3. Wait until validation completes, then the cert is usable
resource "aws_acm_certificate_validation" "main" {
  certificate_arn         = aws_acm_certificate.main.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}

The three-resource pattern (request, create the validation DNS records, wait for validation) is standard and worth memorizing, because a certificate is not usable until it is validated, and referencing an unvalidated certificate in a listener fails. This example assumes your domain's DNS is hosted in Route 53 as an aws_route53_zone.main (Route 53 is AWS's DNS service); if your DNS lives elsewhere, you add the validation CNAME record there instead. For strict correctness, resources that consume the certificate should reference aws_acm_certificate_validation.main.certificate_arn rather than the raw certificate, so Terraform waits for validation before using it; the ALB example above used the plain ARN for brevity.

Gotcha: DNS validation beats email validation, and the region rule. Use validation_method = "DNS" because it auto-renews silently forever, whereas email validation needs a human to click a link on every renewal. And as noted in the CloudFront section, a certificate used by CloudFront must be requested in us-east-1 regardless of your other resources' region; a certificate for an ALB must be in the ALB's own region. If a cert "cannot be found" by CloudFront, wrong region is almost always why.
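
The region rule is usually handled with a second, aliased provider configuration. The sketch below assumes your default provider points at some other region; the alias us_east_1 and the resource name cdn are illustrative.

hcl
# A second AWS provider configuration pinned to us-east-1
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

# A certificate intended for CloudFront, created through the aliased provider
resource "aws_acm_certificate" "cdn" {
  provider          = aws.us_east_1
  domain_name       = "myapp.com"
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

This certificate still needs the validation records and an aws_acm_certificate_validation resource (also using the aliased provider). A CloudFront viewer_certificate block would then set acm_certificate_arn and ssl_support_method = "sni-only" in place of cloudfront_default_certificate.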

ECS: Running Containers

ECS (Elastic Container Service) runs Docker containers for you. It is the most involved service here because containers have several coordinating pieces, so take it slowly. With the Fargate launch type (the simpler, serverless option), you do not manage any servers; AWS runs the containers.

The pieces:

  • A cluster: a logical grouping for your services.
  • A task definition: the blueprint for a container (image, CPU, memory, ports, environment, which roles to use).
  • A service: keeps a desired number of tasks running, replaces failed ones, and connects them to the load balancer.
  • Two roles: the execution role (lets ECS pull the image and write logs) and the task role (the permissions your app code itself needs, like reading S3).

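The examples below reference a few resources defined elsewhere in this guide (the aws_iam_role.app task role from the IAM section, and private subnets and a security group like those in the networking section). One caveat: the trust policy shown in the IAM section names ec2.amazonaws.com, so a role used as an ECS task role needs ecs-tasks.amazonaws.com as its principal instead. The one role not yet shown is the execution role, so here it is first: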

hcl
# The execution role: used by ECS itself to pull the image and write logs.
resource "aws_iam_role" "ecs_execution" {
  name = "myapp-ecs-execution-role"
  assume_role_policy = jsonencode({
    Version   = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# The AWS-managed policy that grants exactly the ECR-pull and logs permissions ECS needs.
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

hcl
resource "aws_ecs_cluster" "main" {
  name = "myapp-cluster"
}

# The blueprint for one container
resource "aws_ecs_task_definition" "app" {
  family                   = "myapp"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"        # required for Fargate
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_execution.arn  # pulls image, logs
  task_role_arn            = aws_iam_role.app.arn             # your app's permissions

  container_definitions = jsonencode([{
    name      = "app"
    image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest"  # placeholder 12-digit account ID
    portMappings = [{ containerPort = 80 }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/myapp"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "app"
      }
    }
  }])
}

# The service that keeps N copies running and wires them to the ALB
resource "aws_ecs_service" "app" {
  name            = "myapp-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = aws_subnet.private[*].id          # tasks run in private subnets
    security_groups = [aws_security_group.app.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn      # register tasks with the ALB
    container_name   = "app"
    container_port   = 80
  }
}
Gotcha: two different roles, and mixing them up. The execution role is used by the ECS agent to pull the container image from ECR and ship logs to CloudWatch; it needs ECR and logs permissions. The task role is assumed by your running application to call AWS APIs (read S3, write DynamoDB); it needs your app's permissions. Putting your app's S3 permissions on the execution role, or forgetting the execution role entirely (so the image will not pull), are the two classic ECS mistakes.
Gotcha: Fargate tasks belong in private subnets with the ALB in front. Put the tasks in private subnets (no direct internet exposure) and let the public ALB route to them. Tasks then need a path out for outbound calls such as pulling the image: either a NAT gateway, or VPC endpoints for ECR, S3, and CloudWatch Logs. Running tasks in public subnets works but is less secure.
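As a minimal sketch of the execution-role side of that split, assuming the role is not already defined elsewhere in your configuration: the resource name ecs_execution matches the reference in the task definition above, while the role name myapp-ecs-execution and the 14-day retention are illustrative choices. The log group is declared because the awslogs driver expects the group named in awslogs-group to exist unless you explicitly opt in to having it created.

```hcl
# Execution role: assumed by ECS itself (not your app) to start the task
resource "aws_iam_role" "ecs_execution" {
  name = "myapp-ecs-execution" # illustrative name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS-managed policy covering ECR image pulls and CloudWatch Logs writes
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# The log group referenced by "awslogs-group" in the container definition
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/myapp"
  retention_in_days = 14 # illustrative retention
}
```

Anything your application code needs (S3 reads, DynamoDB writes) goes on the separate task role, aws_iam_role.app, not here.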

CloudFront: The CDN in Front

CloudFront is AWS's content delivery network: it caches your content at edge locations worldwide so users get fast responses from a nearby server, and it sits in front of an origin (an S3 bucket for static sites, or an ALB for dynamic apps). It also gives you HTTPS and shields your origin.

hcl
# Origin Access Control: lets CloudFront read the private S3 bucket (see gotcha below)
resource "aws_cloudfront_origin_access_control" "assets" {
  name                              = "assets-oac"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}

resource "aws_cloudfront_distribution" "cdn" {
  enabled             = true
  default_root_object = "index.html"

  origin {
    domain_name              = aws_s3_bucket.assets.bucket_regional_domain_name
    origin_id                = "s3-assets"
    origin_access_control_id = aws_cloudfront_origin_access_control.assets.id
  }

  default_cache_behavior {
    target_origin_id       = "s3-assets"
    viewer_protocol_policy = "redirect-to-https"   # force HTTPS
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = "658327ea-f89d-4fab-a63d-7e88639e58f6" # AWS managed CachingOptimized
  }

  restrictions {
    geo_restriction { restriction_type = "none" }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
Gotcha: serving a private S3 bucket through CloudFront. The modern, correct way to let CloudFront read a private bucket (so users cannot bypass the CDN and hit S3 directly) is Origin Access Control (OAC), which replaced the older Origin Access Identity. You create an aws_cloudfront_origin_access_control, attach it to the origin, and add a bucket policy allowing that CloudFront distribution to read. This keeps the bucket fully private while CloudFront serves it.
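The bucket policy half of OAC is the part most often left out, so here is a sketch of it. It assumes the aws_s3_bucket.assets and aws_cloudfront_distribution.cdn resources from this guide and that no other bucket policy is already attached to the bucket; the names assets_cdn_read and AllowCloudFrontRead are made up for illustration.

```hcl
# Allow only this CloudFront distribution to read objects from the bucket
data "aws_iam_policy_document" "assets_cdn_read" {
  statement {
    sid       = "AllowCloudFrontRead" # illustrative statement id
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.assets.arn}/*"]

    principals {
      type        = "Service"
      identifiers = ["cloudfront.amazonaws.com"]
    }

    # Without this condition, any CloudFront distribution could read the bucket
    condition {
      test     = "StringEquals"
      variable = "AWS:SourceArn"
      values   = [aws_cloudfront_distribution.cdn.arn]
    }
  }
}

resource "aws_s3_bucket_policy" "assets" {
  bucket = aws_s3_bucket.assets.id
  policy = data.aws_iam_policy_document.assets_cdn_read.json
}
```

The reference to the distribution's ARN also gives Terraform the dependency it needs: the policy is created after the distribution exists.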
Gotcha: CloudFront certificates must be in us-east-1. An ACM certificate used by CloudFront must be created in the us-east-1 region regardless of where the rest of your infrastructure lives. This catches many people whose cert "does not show up" because it is in the wrong region.
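One common way to handle the region requirement is a second, aliased provider, sketched below. The alias us_east_1, the resource name cdn, and the domain cdn.example.com are placeholders, and the certificate still has to complete DNS validation before CloudFront will accept it.

```hcl
# A second AWS provider pinned to us-east-1, used only for the CloudFront cert
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

resource "aws_acm_certificate" "cdn" {
  provider          = aws.us_east_1     # create the cert in us-east-1
  domain_name       = "cdn.example.com" # hypothetical domain
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true # avoid a gap when the cert is replaced
  }
}

# In aws_cloudfront_distribution.cdn, the default certificate would then be
# swapped for something like:
#
#   aliases = ["cdn.example.com"]
#
#   viewer_certificate {
#     acm_certificate_arn      = aws_acm_certificate.cdn.arn
#     ssl_support_method       = "sni-only"
#     minimum_protocol_version = "TLSv1.2_2021"
#   }
```

Everything else in the configuration keeps using the default provider and its region; only the certificate opts in to the alias.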

Lambda and API Gateway: Serverless APIs

Not every API needs servers or containers. Lambda runs your code in response to events with no server to manage, and you are billed only for the time it runs. API Gateway puts an HTTP endpoint in front of it. Together they give you a serverless API that scales to zero when idle and up automatically under load. This is a completely different compute model from EC2 and ECS, and for many APIs it is simpler and cheaper.

The pieces for a basic HTTP API:

  • A Lambda function: your code, packaged as a zip (or container image), with a handler and a runtime.
  • An execution role: what the function is allowed to do (at minimum, write its logs to CloudWatch).
  • An API Gateway: the HTTP front door that routes requests to the function.
  • A permission: allowing API Gateway to invoke the function.
hcl
# 1. The role Lambda runs as (must at least allow writing logs)
resource "aws_iam_role" "lambda" {
  name = "myapp-lambda-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS-managed policy that grants CloudWatch Logs write access
resource "aws_iam_role_policy_attachment" "lambda_logs" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# 2. Package the code (Terraform can zip a local folder for you)
data "archive_file" "lambda" {
  type        = "zip"
  source_dir  = "${path.module}/src"
  output_path = "${path.module}/lambda.zip"
}

# 3. The function itself
resource "aws_lambda_function" "api" {
  function_name    = "myapp-api"
  role             = aws_iam_role.lambda.arn
  handler          = "index.handler"          # file.exportedFunction
  runtime          = "nodejs20.x"
  filename         = data.archive_file.lambda.output_path
  source_code_hash = data.archive_file.lambda.output_base64sha256  # redeploy on change
  timeout          = 10
  memory_size      = 256

  environment {
    # Environment variables your function reads at runtime. Wire in real resource
    # attributes here, e.g. TABLE_NAME = aws_dynamodb_table.items.name
    variables = { LOG_LEVEL = "info" }
  }
}

Now the HTTP front door. The modern HTTP API (aws_apigatewayv2_*) is simpler and cheaper than the older REST API and is the right default for most services:

hcl
resource "aws_apigatewayv2_api" "main" {
  name          = "myapp-http-api"
  protocol_type = "HTTP"
}

# Connect the API to the Lambda
resource "aws_apigatewayv2_integration" "lambda" {
  api_id                 = aws_apigatewayv2_api.main.id
  integration_type       = "AWS_PROXY"                 # pass the whole request to Lambda
  integration_uri        = aws_lambda_function.api.invoke_arn
  payload_format_version = "2.0"
}

# Route: which requests go to that integration
resource "aws_apigatewayv2_route" "get_items" {
  api_id    = aws_apigatewayv2_api.main.id
  route_key = "GET /items"
  target    = "integrations/${aws_apigatewayv2_integration.lambda.id}"
}

# A stage is a deployed version of the API (auto-deploy on change)
resource "aws_apigatewayv2_stage" "default" {
  api_id      = aws_apigatewayv2_api.main.id
  name        = "$default"
  auto_deploy = true
}

# 4. Allow API Gateway to invoke the Lambda (without this, calls get 500s)
resource "aws_lambda_permission" "apigw" {
  statement_id  = "AllowAPIGatewayInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.api.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_apigatewayv2_api.main.execution_arn}/*/*"
}
Gotcha: forgetting the invoke permission. The single most common serverless mistake is defining the function, the integration, and the route, then getting a 500 on every call because API Gateway is not allowed to invoke the Lambda. The aws_lambda_permission resource is what grants that; without it the wiring looks complete but fails at runtime.
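A quick way to check the wiring end to end, permission included, is to expose the API's URL as an output and request the route after apply. This is a small sketch; the output name api_url is an arbitrary choice.

```hcl
# Base URL of the deployed stage; request GET /items against it after apply
output "api_url" {
  description = "Base URL of the HTTP API's default stage"
  value       = aws_apigatewayv2_stage.default.invoke_url
}
```

If the route returns a 500 while the function works when invoked directly, the aws_lambda_permission resource is the first thing to check.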
Gotcha: source_code_hash and stale deploys. Without source_code_hash, Terraform tracks only the function's configuration, not the contents of the zip. Change your code, rebuild the zip at the same path, and apply reports "no changes", so the new code never deploys. Setting source_code_hash to the zip's hash makes Terraform redeploy whenever the code changes. This confuses people whose function "will not update."
Gotcha: HTTP API vs REST API. API Gateway has two flavors. The older REST API (aws_api_gateway_*) has more features (request validation, API keys, usage plans) but is more verbose and costs more. The newer HTTP API (aws_apigatewayv2_*, shown here) is cheaper, faster, and simpler, and is the right default unless you specifically need a REST-API-only feature. Do not mix the two resource families.
Lambda in a VPC, only when needed. By default a Lambda runs outside your VPC and can reach the public internet but not private resources like an RDS database. To reach private resources you attach it to your VPC subnets, but then it loses default internet access and needs a NAT gateway or VPC endpoints for outbound calls. Attach Lambda to a VPC only when it must reach private resources, because it adds networking complexity and cost, and historically it also added cold-start latency.
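For the cases where the function does need VPC access, the change is roughly the following sketch. It assumes the private subnets used elsewhere in this guide and a hypothetical security group named aws_security_group.lambda; the role also needs permission to manage network interfaces, which the AWS-managed policy below grants.

```hcl
# Lets the Lambda service create and delete the network interfaces it needs in your VPC
resource "aws_iam_role_policy_attachment" "lambda_vpc" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

# Added inside resource "aws_lambda_function" "api":
#
#   vpc_config {
#     subnet_ids         = aws_subnet.private[*].id
#     security_group_ids = [aws_security_group.lambda.id] # hypothetical security group
#   }
```

The security group on the database (or other private resource) then needs an inbound rule allowing traffic from the function's security group.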

A Full Picture: How the Services Connect

Putting the tour together, a typical containerized web application in Terraform looks like this, and seeing the flow clarifies why each service exists:

text
Users
  |
  v
CloudFront (CDN, HTTPS, caching)
  |
  +---> S3 bucket (static assets: JS, CSS, images) via OAC
  |
  v
Application Load Balancer (in public subnets)
  |  health-checked routing
  v
ECS Fargate tasks (in private subnets)
  |  task role
  +---> S3, databases, other AWS APIs
  |
  (image pulled via execution role from ECR)
  |
CloudWatch Logs  <--- logs from tasks

Traffic enters through CloudFront (fast, cached, HTTPS). Static files come straight from S3. Dynamic requests pass to the ALB, which distributes them across ECS tasks running your containers in private subnets. Those tasks use their task role to reach other AWS services, and stream logs to CloudWatch. IAM roles wire the permissions, and the VPC is the network it all sits in. Every service in this guide has a place in that flow.

The serverless variant swaps the middle: instead of ALB plus ECS tasks, requests go to API Gateway, which invokes Lambda functions that reach the same downstream services (S3, databases) via the function's execution role and stream logs to the same CloudWatch. Everything else (CloudFront, S3, ACM, IAM, the data stores) stays the same. Choosing between the two is the compute decision covered in the interview questions: containers for steady or long-running workloads, serverless for spiky or event-driven ones.

How Terraform Maintains All of This

Stepping back to the mechanics, here is what Terraform actually does when you run apply, now that you have seen real resources:

  1. Refresh: it reads the current real state of every resource from the AWS API and updates its understanding.
  2. Build the dependency graph: using your references (aws_lb_target_group.app.arn inside the service, and so on), it works out what depends on what and the order to act in. Independent resources are created in parallel; dependent ones wait.
  3. Compute the diff: it compares desired (your config) to actual, producing a plan of creates, in-place updates, replacements, and destroys.
  4. Apply: it executes the plan in dependency order, updating the state file as each resource succeeds.

This is why you never script the order yourself and why removing a resource from your code causes Terraform to destroy it: the code is the desired state, and Terraform's job is always to make reality match it, nothing more and nothing less.

Gotcha: drift. If someone changes a resource manually in the AWS console, reality no longer matches state. On the next plan, Terraform detects this "drift" and proposes to change it back to what your code says. This is a feature (your code is the source of truth), but it surprises people who made a quick manual fix and then had Terraform revert it. The rule on a Terraform-managed team: change things in code, not the console.

Best Practices and Common Pitfalls

A consolidated set of habits that keep Terraform projects healthy.

Use remote state with locking from day one on a team. Local state is fine solo, but the moment two people share infrastructure, you need an S3 backend with locking to avoid corruption.
Pin provider and module versions. Use version = "~> 5.0" so an upgrade cannot silently break you. Commit the .terraform.lock.hcl file so everyone uses the same provider versions.
Prefer for_each over count for anything that might grow or shrink, to avoid the index-shifting destroy-and-recreate problem.
Tag everything, ideally with default_tags on the provider, so every resource is traceable to a project and marked as Terraform-managed.
Run plan and read it before every apply. The plan is your safety net; the -/+ replacement lines especially deserve attention, because a replacement can mean downtime or data loss.
Never store secrets in plain .tf files or commit state. Use AWS Secrets Manager or SSM Parameter Store for secrets, reference them via data sources, and keep state in a private encrypted backend.
Structure code by environment with separate state. Isolated state per environment means a mistake in dev cannot touch prod.
Use terraform fmt and terraform validate. fmt keeps formatting consistent; validate catches syntax errors before a plan. Both are cheap and worth running constantly (and in CI).
Do not hardcode account-specific values. Use data sources (aws_caller_identity, aws_region, AMI lookups) so the same code works across accounts and regions.
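Several of these habits live in the same few blocks, so here is a consolidated sketch of them. The bucket name, state key, and tag values are placeholders, and it assumes a release recent enough to support S3 native locking via use_lockfile (1.10 or later); on older versions the DynamoDB lock table plays that role instead.

```hcl
terraform {
  required_version = ">= 1.10" # needed for use_lockfile below

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pinned: allows 5.x updates, blocks a 6.0 jump
    }
  }

  # Remote, encrypted state with locking; one key per environment
  backend "s3" {
    bucket       = "myorg-terraform-state"        # hypothetical bucket
    key          = "myapp/prod/terraform.tfstate" # hypothetical key
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # S3 native locking
  }
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = {
      Project   = "myapp"
      ManagedBy = "terraform"
    }
  }
}

# Look up account and region instead of hardcoding them
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
```

The account ID is then available as data.aws_caller_identity.current.account_id wherever an ARN or bucket name needs it.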

Interview Questions Worth Knowing

The questions that come up most, with the answers that show you understand the mechanics rather than just the commands.

What is the difference between terraform plan and terraform apply? plan computes and displays the set of changes needed to make reality match your configuration, without changing anything; it is a dry run. apply executes those changes. plan does a refresh (reads real state from the provider), compares desired versus actual, and prints the creates, in-place updates, replacements, and destroys it would perform. You can save a plan (plan -out) and apply that exact plan later, which is what CI/CD pipelines do so the applied changes are exactly the reviewed ones.
What is state and why does Terraform need it? State is Terraform's record mapping each resource in your config to the real object it created in the cloud. It needs it because the config only says what you want, not which existing objects those are; state is the memory linking aws_instance.web to i-0abc123, which is what lets Terraform update in place, detect drift, and know what to destroy when you remove code. Without state, Terraform could not tell "update this" from "create a new one."
How do you manage state for a team? Use a remote backend (S3) with locking (DynamoDB or S3 native locking), so state is shared, durable, and only one apply can run at a time. Local state does not work for teams because it is not shared and concurrent applies corrupt it. The state should be in a private, encrypted bucket because it can contain secrets in plain text.
What is the difference between count and for_each? Both create multiple resources. count indexes by number ([0], [1]), so removing a middle item shifts every later index and forces destroy-and-recreate. for_each keys by a map or set key (["assets"]), so items are tracked by stable identity and removing one does not disturb the others. Prefer for_each for anything that might change; use count for a fixed number of truly identical resources or for conditional creation (count = var.enabled ? 1 : 0).
How does Terraform know the order to create resources? It builds a dependency graph from the references in your code. When resource A references an attribute of resource B (subnet_id = aws_subnet.main.id), Terraform knows B must exist first. It creates independent resources in parallel and dependent ones in order. You almost never specify order manually; if you must express a dependency that is not a reference, depends_on forces it.
What is drift and how does Terraform handle it? Drift is when the real infrastructure no longer matches state because something changed outside Terraform (a manual console edit). On the next plan, the refresh detects the difference and Terraform proposes to bring reality back in line with your code, since the code is the source of truth. The team discipline is to make changes in code, not the console.
What is the difference between a resource and a data source? A resource is something Terraform creates and manages (its lifecycle is yours). A data source only reads something that already exists (created elsewhere, or by another Terraform config) so you can reference its attributes; Terraform never modifies or destroys it.
What is the difference between the ECS execution role and task role? The execution role is used by the ECS agent to pull the container image from ECR and write logs to CloudWatch. The task role is assumed by your running application to call AWS APIs it needs (read S3, write DynamoDB). They exist separately because "what ECS needs to start the container" and "what your code needs at runtime" are different permission sets, and least privilege means not merging them.
How do you handle secrets in Terraform? Do not put them in plain .tf files or commit state (which stores them in plain text). Store secrets in AWS Secrets Manager or SSM Parameter Store and reference them at apply time via data sources. Values read this way are still written to state, which is why the state backend must be restricted and encrypted. For provider credentials, use temporary credentials via roles or OIDC rather than static keys.
What does terraform import do? It brings an existing resource that was created outside Terraform under Terraform management, by writing it into state and matching it to a resource block you define. It is how you adopt pre-existing infrastructure without recreating it. Newer versions also support declarative import blocks.
Why prefer roles and OIDC over access keys? Static access keys are long-lived secrets that leak through code, logs, and environments, and must be rotated manually. A role provides temporary, automatically-rotated credentials, and OIDC (for CI systems like GitHub Actions) lets a pipeline assume a role with no stored secret at all. Fewer long-lived secrets means less to leak.
What happens if two people run apply at the same time? Without locking, they can corrupt the state file by writing over each other. With a locking backend (DynamoDB or S3 native locking), the second apply is blocked until the first releases the lock. This is the main reason remote state with locking is mandatory for teams.
When would you choose Lambda over ECS, or the reverse? Choose Lambda and API Gateway for event-driven or spiky workloads, short-lived requests, and anything where scaling to zero when idle saves money; there are no servers or containers to manage. Choose ECS (or EC2) for long-running processes, workloads needing more than Lambda's time or memory limits, heavy or steady traffic where always-on containers are cheaper than per-invocation billing, or when you need full control over the runtime. A rough rule: bursty and small favors Lambda; steady and large favors containers.
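For the import question above, the declarative form looks roughly like this. It is a sketch assuming Terraform 1.5 or later (OpenTofu supports the same block); the bucket name and the resource name legacy are placeholders for whatever already exists in your account.

```hcl
# Adopt an existing bucket into state on the next apply, without recreating it
import {
  to = aws_s3_bucket.legacy
  id = "my-existing-bucket" # hypothetical: the real bucket's name
}

# The resource block the imported object is matched to
resource "aws_s3_bucket" "legacy" {
  bucket = "my-existing-bucket"
}
```

Because the import is part of the configuration, it shows up in plan like any other change and can be reviewed before it touches state.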

The throughline across all of it: Terraform is a declarative engine that keeps a record (state) of what it manages, compares your code to reality on every run, and makes only the changes needed to close the gap. Learn to think in terms of desired state and references rather than steps, understand that state is the memory making it all work, and the individual services become straightforward, because each is just another resource block whose attributes you look up and wire together. With the mechanics and the core services here, you can read and write infrastructure for most common AWS architectures, and extend to any other service by reading its resource documentation and slotting it into the same patterns.
