AWS ETL Tools Cheat Sheet
1. When to Use What
The decision matrix. Pick the row that matches your problem.
2. AWS Glue
TL;DR
Managed Spark behind an API. You write PySpark, AWS runs the cluster. Includes a Hive-compatible Data Catalog (central metastore for your lake).
Architecture
Watch out
Cold start: 5-10 min. DPU billing rounds up. Crawlers misclassify schemas silently. Glue 4.0 still runs Spark 3.3. At scale, cost per DPU-hour ($0.44) adds up fast.
Cheat Snippet
```shell
aws glue start-job-run \
  --job-name my-etl \
  --arguments '--input=s3://raw/,--output=s3://clean/'
```
Pricing
$0.44/DPU-hour. 1 DPU = 4 vCPU + 16GB. Min 2 DPUs. Crawlers billed separately.
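Because DPU billing rounds up (per-second billing with a 1-minute minimum on Glue 2.0+), job cost is easy to estimate up front. A minimal sketch using the $0.44/DPU-hour rate quoted above; `glue_job_cost` is a hypothetical helper, not a Glue API:

```python
GLUE_DPU_HOUR_USD = 0.44      # us-east-1 rate quoted above
MIN_BILLED_SECONDS = 60       # Glue 2.0+ bills per second, 1-minute minimum

def glue_job_cost(dpus: int, runtime_seconds: int) -> float:
    """Estimate the cost of one Glue job run."""
    billed = max(runtime_seconds, MIN_BILLED_SECONDS)
    return round(dpus * (billed / 3600) * GLUE_DPU_HOUR_USD, 4)

# A 10-DPU job that runs 15 minutes: 10 DPUs * 0.25 h * $0.44/DPU-hr
print(glue_job_cost(10, 15 * 60))
# A tiny 2-DPU job that finishes in 30s still bills the full minute:
print(glue_job_cost(2, 30))
```

Note how the second call bills a full minute for a 30-second run: that rounding is where frequent small jobs quietly get expensive.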
3. Amazon EMR
TL;DR
Raw Hadoop/Spark clusters you control. Pick instance types, install custom JARs, tune spark.executor.memory. EMR Serverless exists but you lose fine control.
Use Case
- 50TB+ daily clickstream processing
- ML model training on Spark MLlib
- Interactive Presto analytics on massive datasets
Watch out
You ARE the ops team. Cluster sizing is black magic. Over-provision = burn cash. Under-provision = OOM. Spot interruptions kill long Spark stages. No auto-scaling by default.
Cheat Snippet
```shell
aws emr create-cluster \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 5
```
Pricing
EC2 cost + EMR surcharge (~20%). Use spot for task nodes (up to 90% savings, risk of interruption).
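To see why spot-for-task-nodes matters, compare on-demand vs. spot for the 5-node m5.xlarge cluster above. A rough sketch: the $0.192/hr EC2 and $0.048/hr EMR surcharge figures are illustrative us-east-1 numbers (check current pricing), and note the EMR fee still applies to spot nodes:

```python
def emr_cluster_cost(nodes, hours, ec2_hourly=0.192, emr_hourly=0.048,
                     spot_discount=0.0):
    """Estimate EMR cost: discounted EC2 plus EMR surcharge per node-hour.
    The EMR surcharge is NOT discounted by spot pricing."""
    ec2 = nodes * hours * ec2_hourly * (1 - spot_discount)
    emr = nodes * hours * emr_hourly
    return round(ec2 + emr, 2)

# 5 m5.xlarge for 3 hours, on-demand vs. a 70% spot discount:
print(emr_cluster_cost(5, 3))                     # all on-demand
print(emr_cluster_cost(5, 3, spot_discount=0.7))  # spot fleet
```

The spot run costs well under half as much, which is why the standard pattern is on-demand core nodes plus spot task nodes that can tolerate interruption.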
4. Amazon Athena
TL;DR
Trino-as-a-service (formerly Presto). Point at S3, write SQL, get results. No infrastructure. Technically ELT (query-time transform), but everyone searches "ETL" for it.
Cost Comparison
Watch out
$5 per TB scanned. Unpartitioned CSV can cost 10-100x more per query than partitioned Parquet. No indexes. Complex joins on huge datasets are slow. Always use CTAS to materialize hot query results.
Cheat Snippet
```sql
CREATE TABLE clean
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['dt']
) AS
SELECT user_id, event, dt
FROM raw
WHERE dt > '2026-01-01';
```
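The $5/TB math in plain numbers: a small helper translating scanned bytes into dollars. The 1% scan ratio for partitioned Parquet is an illustrative assumption (compression plus partition pruning plus columnar reads), not a guarantee:

```python
ATHENA_USD_PER_TB = 5.0

def athena_query_cost(tb_scanned: float) -> float:
    """Cost of one Athena query. (Athena also has a 10 MB per-query
    minimum, ignored here for simplicity.)"""
    return round(tb_scanned * ATHENA_USD_PER_TB, 2)

# 1 TB of raw CSV, fully scanned on every query:
print(athena_query_cost(1.0))
# Same data as partitioned Snappy Parquet, where a typical query
# touches ~1% of the bytes (illustrative ratio):
print(athena_query_cost(1.0 * 0.01))
```

Run a dashboard that fires that query hourly and the format choice alone is the difference between pocket change and a real bill.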
5. Kinesis (Streams + Firehose)
TL;DR
Data Streams: Managed Kafka-like shards. You read with consumers (Lambda, KCL).
Firehose: Zero-admin delivery pipe. Buffers → dumps to S3/Redshift/OpenSearch.
Data Analytics (since renamed Amazon Managed Service for Apache Flink): streaming SQL/Java.
Streaming Architecture
Watch out
Firehose min buffer = 60s (NOT real-time). Lambda transform timeout = 3 min, 6MB payload max. No deduplication. Shard splitting is manual and disruptive.
Cheat Snippet
```shell
aws firehose create-delivery-stream \
  --delivery-stream-name clicks \
  --s3-destination-configuration \
      RoleARN=$ROLE,BucketARN=arn:aws:s3:::lake,CompressionFormat=Snappy
```
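If you attach a transform Lambda to this stream, it must echo every incoming `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, or Firehose retries the whole batch. A minimal sketch of that record contract; the heartbeat-filtering rule is made up for illustration:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode, filter, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("event") == "heartbeat":   # hypothetical filter rule
            # Dropped records need no "data" field, only the recordId.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped"})
            continue
        payload["processed"] = True
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Trailing newline so S3 objects stay line-delimited JSON.
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": output}
```

Remember the limits above: the combined request/response payload caps at 6 MB, so keep transformed records lean.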
6. AWS DMS
TL;DR
Reads the source DB's transaction log (binlog/WAL/redo) and replays it on the target. Full-load plus ongoing CDC replication. NOT a transformation engine; it moves data.
Migration Flow
Watch out
Replication instance = a single EC2 instance (no auto-scaling). LOB columns are brutally slow. Monitor replication lag or your target silently falls behind. NOT an ETL tool; minimal transformation only.
Cheat Snippet
```shell
aws dms create-replication-task \
  --replication-task-id ora-to-pg \
  --source-endpoint-arn $SRC \
  --target-endpoint-arn $TGT \
  --replication-instance-arn $RI \
  --migration-type full-load-and-cdc \
  --table-mappings file://mappings.json
```
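`create-replication-task` also requires a `--table-mappings` JSON rule document telling DMS which schemas and tables to replicate. A minimal sketch that builds one in Python; the `HR` schema is a placeholder:

```python
import json

# Minimal DMS table mapping: replicate every table in one schema.
# rule-id / rule-name just need to be unique within the document.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-hr",
            "object-locator": {"schema-name": "HR", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# Write this string to mappings.json and pass it via
# `--table-mappings file://mappings.json`.
print(json.dumps(table_mappings, indent=2))
```

Transformation rules (rename schema, convert case, add prefixes) live in the same document, which is about the extent of "transformation" DMS offers.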
7. Service Comparison
| Service | Arch | Engine | Mode | Pricing |
|---|---|---|---|---|
| Glue | Serverless | Spark | Batch | $0.44/DPU-hr |
| EMR | Provisioned | Spark/Hadoop | Batch | EC2 + 20% |
| Athena | Serverless | Trino | Ad-hoc | $5/TB scan |
| Kinesis | Serverless | Custom | Stream | $0.029/GB |
| DMS | Provisioned | CDC | Migration | Instance-hr |
| Lambda | Serverless | Any runtime | Event | Per invoke |
| Step Fn | Serverless | State machine | Orchestr. | Per transition |
| Redshift | Provisioned | Columnar | Warehouse | Node-hr |
8. Glue: Data Lake ETL
What is AWS Glue? Managed Spark for batch ETL. Includes a Data Catalog shared by Athena, EMR, Redshift.
Hub-and-Spoke Architecture
9. EMR: Big Data at Scale
What is Amazon EMR? Full Hadoop/Spark/Presto clusters. You pick instances, install JARs, tune memory.
Cluster Processing (50TB/day)
10. Athena: SQL on S3
What is Amazon Athena? Serverless Presto/Trino. Point at S3, write SQL, get results. Pay per TB scanned.
Query → Scan → Results: an analyst submits a SELECT, Athena scans the data in S3 ($5/TB scanned), and returns results.
CTAS materialization
```sql
CREATE TABLE clean_events
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['dt']
) AS
SELECT user_id, action, ts, dt
FROM raw_events
WHERE dt > '2026-01-01'
  AND action IS NOT NULL;
```
11. Kinesis: Streaming Pipelines
What is Amazon Kinesis? Streaming family. Data Streams = managed shards. Firehose = delivery. Flink = streaming SQL.
Multi-shard Fan-out
12. DMS: Database Migration
What is AWS DMS? Reads source DB transaction logs and replays on target. Full-load + ongoing CDC.
Zero-Downtime Migration
13. Batch ETL Pattern
Standard nightly ETL: raw → transform → load
Crawler discovers schema → PySpark transforms → COPY loads → analysts query
14. Streaming ETL Pattern
Real-time clickstream / IoT ingestion
Buffer 60s → Snappy compress → partition by dt → SQL analytics
15. Key Limits & Quotas
16. Lake Formation & Governance
Lake Formation: fine-grained access control on your data lake. Column/row-level security on Glue Catalog tables.
- Manages permissions instead of S3 bucket policies
- Tag-based access control (LF-Tags)
- Cross-account data sharing without S3 replication
- Works with Glue, Athena, Redshift Spectrum, EMR
Glue Data Quality: DQDL rules on Glue jobs. Fail pipelines on bad data.
```
Rules = [
  ColumnValues "age" between 0 and 150,
  Completeness "email" > 0.95,
  IsUnique "user_id"
]
```
17. Cost Cheat Sheet
What you actually pay (us-east-1, 2026)
Pro tip: Glue + Athena on Parquet is the cheapest serverless stack. EMR on spot is cheapest at petabyte scale.
The Complete Cheat Sheet for AWS ETL Tools
This cheat sheet for AWS ETL tools is the fastest way to compare AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS. Each tool card includes the TL;DR (what it actually is under the hood), the exact use case, the hidden traps that will burn your budget, and a copy-paste CLI snippet.
Whether you are building a batch data pipeline, a real-time streaming architecture, or migrating databases, this AWS ETL tools cheat sheet gives you the decision matrix, architecture diagrams, and pricing comparison to pick the right tool in 30 seconds.
Unlike the AWS documentation, this reference is brutally honest about limitations. We tell you about Glue's cold start, Athena's scan costs, EMR's operational overhead, Firehose's 60-second buffer, and DMS's single-instance bottleneck.
AWS ETL Tools Cheat Sheet FAQ
What is a cheat sheet for AWS ETL tools?
A cheat sheet for AWS ETL tools is a quick-reference guide that compares AWS data processing services (AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS) by architecture, pricing, use cases, and limitations. It helps data engineers pick the right tool without reading 50 pages of documentation.
What is the difference between AWS Glue and Amazon EMR?
AWS Glue is serverless Spark: you submit a job and AWS manages the cluster. Amazon EMR gives you full control over Hadoop/Spark clusters (instance types, tuning, custom JARs). Use Glue for standard batch ETL under 10TB. Use EMR for petabyte-scale jobs where you need to tune executor memory and install custom libraries.
When should I use Amazon Athena instead of AWS Glue?
Use Athena for ad-hoc SQL queries on data already in S3: no infrastructure needed, pay per TB scanned. Use Glue when you need to transform and move data (ETL jobs, crawlers, data catalog). Athena is technically ELT (query-time transformation). The key cost trap: unpartitioned CSV data in Athena can 10x your bill vs. partitioned Parquet.
What are the hidden costs of AWS ETL tools?
AWS Glue: DPU-hours round up, 5-10 min cold start burns money. Athena: $5/TB scanned, CSV vs Parquet is a 10x cost difference. EMR: over-provisioned clusters and spot interruptions. Kinesis Firehose: not true real-time (60s buffer minimum). DMS: replication instance is a single EC2 with no auto-scaling.
What is the best AWS ETL tool for real-time streaming?
Kinesis Firehose for managed ingestion (buffer → S3/Redshift/OpenSearch). Kinesis Data Streams + Lambda for record-level processing. Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for complex streaming SQL/Java. Firehose is simplest but has a minimum 60-second buffer: it's near-real-time, not true real-time.
How do I choose between AWS DMS and AWS Glue for data migration?
DMS is for database-to-database migration with CDC (change data capture): it reads transaction logs and replays them. AWS Glue is for ETL (extract, transform, load) with Spark. If you're migrating Oracle to Aurora with zero downtime, use DMS. If you're transforming and loading data into a data lake, use Glue.
Ready to practice data engineering interviews?
Practice with an AI interviewer that tests your ability to design data pipelines and articulate trade-offs in real time.