A well-designed data lake makes queries fast and maintenance simple. A poorly designed one becomes a nightmare of scattered files and confusing schemas. In this chapter, we’ll design the structure that will hold your AIS data using the medallion architecture—an industry-standard pattern used by organizations from startups to enterprises.
But first, let’s understand the data we’ll be working with.
What is AIS Data?
The Automatic Identification System (AIS) is a maritime tracking system where ships broadcast their identity, position, course, and speed. Originally designed for collision avoidance, AIS has become a rich data source for:
- Maritime safety — Real-time vessel tracking and collision prevention
- Port management — Understanding traffic patterns and optimizing berth allocation
- Environmental monitoring — Tracking emissions and detecting illegal fishing
- Supply chain analytics — Predicting arrival times and optimizing logistics
- Machine learning — Anomaly detection, route prediction, predictive maintenance
AIS Message Types
Ships broadcast different message types at varying frequencies:
| Type | Name | Frequency | Contains |
|---|---|---|---|
| 1-3 | Position Report | 2-10 seconds | MMSI, lat/lon, speed, course, heading |
| 5 | Static and Voyage | 6 minutes | Vessel name, dimensions, destination, ETA |
| 18-19 | Class B Position | 30 seconds | Similar to 1-3 but for smaller vessels |
| 24 | Static Data | 6 minutes | Vessel details for Class B |
| 27 | Long-range | 3 minutes | Position for satellite AIS |
For this tutorial, we’ll focus on position reports (types 1-3, 18-19)—the most common and useful for analytics.
Why AIS is Ideal for Learning Data Engineering
AIS data has characteristics that make it perfect for building real-world data pipelines:
- High volume — Millions of messages per day globally
- Near real-time — Position updates every few seconds from moving vessels
- Geospatial — Every message includes lat/lon coordinates
- Publicly available — Free access from sources like DMA (Danish Maritime Authority)
- ML-ready — Ideal for trajectory prediction and anomaly detection
Now let’s design a data lake that can handle this data at scale.
The Medallion Architecture
The medallion architecture organizes data into three quality tiers, each with a distinct purpose:
| Layer | Quality | Purpose | Consumers |
|---|---|---|---|
| Bronze | Raw | Land data exactly as received | Data engineers (debugging) |
| Silver | Clean | Validated, typed, partitioned, queryable | Data scientists, analysts |
| Gold | Curated | Aggregated, business-logic applied | BI tools, ML models, APIs |
Data flows through these layers with increasing refinement:
- Bronze receives raw files exactly as downloaded—CSV, JSON, whatever the source provides
- Silver transforms raw data into clean, validated, efficiently stored Parquet files
- Gold aggregates silver data into business-ready datasets optimized for specific use cases
Why Three Layers?
This separation provides several benefits:
Debugging capability: When something looks wrong in a gold dataset, you can trace back through silver to bronze to understand exactly what the source provided.
Reprocessing flexibility: If you improve your transformation logic, you can regenerate silver and gold layers without re-downloading data.
Performance optimization: Gold datasets are pre-aggregated for fast queries, while silver remains flexible for ad-hoc analysis.
Clear responsibilities: Each layer has a single job, making the pipeline easier to understand and maintain.
Our Data Lake Structure
We’ll store everything in a single MinIO bucket called ais-lake, organized by layer and data type:
s3://ais-lake/
├── bronze/ # Temporary staging
│ └── dma/
│ ├── aisdk-2024-01-15.zip
│ ├── aisdk-2024-01-16.zip
│ └── ...
│
├── silver/ # Clean, partitioned Parquet
│ └── ais/
│ └── source=dma/
│ ├── dt=2024-01-15/
│ │ ├── part-0.parquet
│ │ └── part-1.parquet
│ └── ...
│
├── gold/ # Business-ready aggregations
│ ├── vessel_tracks/
│ │ └── dt=2024-01-15/
│ │ └── tracks.parquet
│ ├── port_density/
│ │ └── dt=2024-01-15/
│ │ └── density.parquet
│ └── daily_stats/
│ └── dt=2024-01-15/
│ └── stats.parquet
│
└── dagster-logs/ # Execution logs
└── runs/
Hive-Style Partitioning
Notice the key=value folder names like source=dma and dt=2024-01-15. This is Hive-style partitioning, and it enables powerful query optimization.
When you query with WHERE dt='2024-01-15', DuckDB only reads files in that specific partition directory—it doesn’t scan data from other dates. This partition pruning dramatically speeds up queries on large datasets.
Why dt=YYYY-MM-DD?
We use a single date column rather than separate year/month/day partitions:
# Our choice
dt=2024-01-15/
# Alternative (not recommended)
year=2024/month=01/day=15/
The single-column approach is better for our use case because:
- Queries usually filter by date range: WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
- Fewer directory levels mean faster file listing operations
- A single date comparison is more intuitive than filtering three separate columns
Bronze Layer: Raw Staging
The bronze layer is a temporary landing zone for raw files. Data stays here only long enough to be converted to Parquet.
- Format: Files exactly as downloaded (CSV, ZIP, etc.)
- Lifetime: Short-term (days), deleted once the data has been converted to silver
- Validation: Minimal—just verify the file is readable
Why Not Convert Directly?
You might wonder why we don’t convert directly to silver during download. The separate bronze stage provides:
- Failure isolation: If Parquet conversion fails, you don’t lose the download
- Debugging access: You can inspect raw data when validation fails
- Reprocessing ability: You can re-run conversion with different settings
Since bronze files are deleted after processing, storage impact is minimal.
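The bronze stage boils down to two small responsibilities: derive a deterministic object key, and check the download is readable before anything downstream touches it. A minimal stdlib sketch, with hypothetical helper names (`bronze_key`, `is_readable_zip` are not from the pipeline, just illustration):

```python
import io
import zipfile
from datetime import date

def bronze_key(source: str, day: date) -> str:
    # Mirrors the layout s3://ais-lake/bronze/<source>/aisdk-YYYY-MM-DD.zip
    return f"bronze/{source}/aisdk-{day.isoformat()}.zip"

def is_readable_zip(payload: bytes) -> bool:
    # Minimal bronze validation: the archive opens and its CRCs check out
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False

# Tiny in-memory archive standing in for a downloaded AIS file
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("aisdk-2024-01-15.csv", "mmsi,lat,lon\n219000001,55.0,12.0\n")

key = bronze_key("dma", date(2024, 1, 15))
ok = is_readable_zip(buf.getvalue())
print(key, ok)
```

A real downloader would stream the payload straight to MinIO under that key; the readability check is the only gate before landing.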
Silver Layer: Clean, Queryable Data
The silver layer is the source of truth for cleaned AIS data. This is where most analysis happens.
- Format: Parquet with Snappy compression
- Schema: Strongly typed and validated
- Partitioning: By source and date
- Lifetime: Long-term (years)
Standard Schema
All AIS data in silver follows this schema:
CREATE TABLE silver.ais (
-- Vessel identifier
mmsi BIGINT NOT NULL,
-- Timestamp (always UTC)
timestamp TIMESTAMP NOT NULL,
-- Position (WGS84)
latitude DOUBLE,
longitude DOUBLE,
-- Dynamics
speed_over_ground DOUBLE,
course_over_ground DOUBLE,
heading INTEGER,
-- Navigation status
navigational_status INTEGER,
-- Message metadata
message_type INTEGER,
-- Partitioning columns
source VARCHAR,
dt DATE
)
PARTITIONED BY (source, dt);
This schema captures the essential fields for maritime analytics while keeping the data lean. Additional fields like vessel name and destination are better stored in separate dimension tables.
Why Snappy Compression?
Parquet supports multiple compression codecs. We chose Snappy because:
| Codec | Compression | Speed | Best For |
|---|---|---|---|
| Snappy | 5-10x | Fast | Frequently queried data |
| GZIP | 10-20x | Slow | Archival storage |
| ZSTD | 10-15x | Medium | Balance of both |
Since silver data is queried often, Snappy’s fast decompression is worth the slightly larger file size.
Storage Estimates
Raw CSV compresses dramatically when converted to Parquet:
Daily CSV: ~500MB compressed
Daily Parquet (Snappy): ~100MB
Yearly: ~36GB
10 years: ~360GB
This fits comfortably on a 1TB disk with room to spare.
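A quick back-of-envelope check of those figures (treating the ~100MB/day Parquet estimate as the input):

```python
# Back-of-envelope storage arithmetic for the silver layer
daily_parquet_gb = 100 / 1024        # ~100 MB/day expressed in GB
yearly_gb = daily_parquet_gb * 365   # ~36 GB per year
decade_gb = yearly_gb * 10           # ~360 GB per decade
print(round(yearly_gb, 1), round(decade_gb))
```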
Gold Layer: Business-Ready Datasets
The gold layer contains pre-aggregated datasets optimized for specific use cases. These are the tables that power dashboards, feed ML models, and answer business questions.
Vessel Tracks
Use case: Analyze individual vessel movements over time
CREATE TABLE gold.vessel_tracks (
mmsi BIGINT,
track_date DATE,
points_count INTEGER,
total_distance_nm DOUBLE,
avg_speed DOUBLE,
max_speed DOUBLE,
start_lat DOUBLE,
start_lon DOUBLE,
end_lat DOUBLE,
end_lon DOUBLE,
dt DATE
);
This dataset groups position reports by vessel and date, calculating summary statistics that would be expensive to compute on demand.
Port Density
Use case: Heatmaps of vessel activity near ports
CREATE TABLE gold.port_density (
port_name VARCHAR,
grid_cell VARCHAR,
vessel_count INTEGER,
avg_speed DOUBLE,
dt DATE
);
Pre-computed grid cells make it possible to render density maps instantly instead of aggregating millions of points on each request.
Daily Statistics
Use case: Pipeline monitoring and trend analysis
CREATE TABLE gold.daily_stats (
total_messages BIGINT,
unique_vessels INTEGER,
avg_speed DOUBLE,
coverage_area_km2 DOUBLE,
dt DATE
);
Gold datasets are small—often just a few KB per day—so they can be kept indefinitely.
Partitioning Strategy
What to Partition By
Date: Essential for time-series data. Enables efficient range queries and incremental processing.
Source: Useful when you have multiple data providers with potentially different schemas or quality levels.
What NOT to Partition By
MMSI (vessel ID): Too high cardinality. Millions of unique values would create millions of tiny files, making directory listing painfully slow.
Hour: Too granular for daily batch processing. Use the timestamp column within the date partition instead.
Geographic region: Queries rarely filter by fixed regions. Better to use geospatial indexes or post-query filtering.
Schema Evolution
Data sources occasionally change their formats. Our strategy for handling this:
Bronze: Accept any schema. Store as-is without validation.
Silver: Map to standard schema. Add new columns as nullable. Use default values for missing columns. Log mismatches for investigation.
Gold: Use versioned datasets. If a gold schema changes significantly, create a new version (vessel_tracks_v2) rather than breaking existing consumers.
This approach keeps the pipeline resilient to upstream changes while maintaining stable interfaces for downstream users.
Retention Policies
With 1TB of storage, we need lifecycle policies:
| Layer | Retention | Reason |
|---|---|---|
| Bronze | 7 days | Only needed for reprocessing |
| Silver | 3+ years | Source of truth for analytics |
| Gold | Permanent | Small size, high value |
MinIO supports lifecycle policies that automatically delete old bronze files:
<LifecycleConfiguration>
<Rule>
<ID>delete-bronze-after-7-days</ID>
<Filter>
<Prefix>bronze/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>7</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
Naming Conventions
Consistent naming prevents confusion as your data lake grows:
- Lowercase everything: bronze/, not Bronze/
- No spaces: Use underscores (vessel_tracks) or hyphens
- Hive partitioning: key=value/ format
- ISO dates: YYYY-MM-DD, never MM-DD-YYYY
- Descriptive names: vessel_tracks, not vt
Good paths:
silver/ais/source=dma/dt=2024-01-15/part-0.parquet
gold/vessel_tracks/dt=2024-01-15/tracks.parquet
Avoid:
Silver/AIS/DMA/2024-01-15.parquet # Mixed case, no partitioning
gold/vt/20240115/data.parquet # Abbreviation, wrong date format
Creating the Structure
Use the MinIO client (mc) to create the bucket and initial structure:
# Configure mc with your MinIO credentials
mc alias set local http://<MINIO_IP>:9000 <access-key> <secret-key>
# Create the bucket
mc mb local/ais-lake
# Create layer placeholders
echo "placeholder" | mc pipe local/ais-lake/bronze/.keep
echo "placeholder" | mc pipe local/ais-lake/silver/.keep
echo "placeholder" | mc pipe local/ais-lake/gold/.keep
Note: S3/MinIO doesn’t have real folders—they’re virtual constructs based on object key prefixes. The .keep files ensure the “folders” appear in the console.
What You’ve Designed
Your data lake now has:
- Clear organization with bronze, silver, and gold layers
- Efficient partitioning by source and date for fast queries
- Standard schemas that make data predictable and queryable
- Retention policies that keep storage under control
- Naming conventions that prevent confusion
This structure scales from gigabytes to terabytes without changes. The same patterns work on a home server or a cloud data platform.
What’s Next
With the data lake structure defined, we need to understand where the data comes from. Chapter 4 covers connecting to AIS data feeds and building the download pipeline.
Designing a data lake for your organization? We help companies implement modern data architectures. Get in touch to discuss your project.