A well-designed data lake makes queries fast and maintenance simple. A poorly designed one becomes a nightmare of scattered files and confusing schemas. In this chapter, we’ll design the structure that will hold your AIS data using the medallion architecture—an industry-standard pattern used by organizations from startups to enterprises.
But first, let’s understand the data we’ll be working with.
What is AIS Data?
The Automatic Identification System (AIS) is a maritime tracking system where ships broadcast their identity, position, course, and speed. Originally designed for collision avoidance, AIS has become a rich data source for:
- Maritime safety — Real-time vessel tracking and collision prevention
- Port management — Understanding traffic patterns and optimizing berth allocation
- Environmental monitoring — Tracking emissions and detecting illegal fishing
- Supply chain analytics — Predicting arrival times and optimizing logistics
- Machine learning — Anomaly detection, route prediction, predictive maintenance
AIS Message Types
Ships broadcast different message types at varying frequencies:
| Type | Name | Frequency | Contains |
|---|---|---|---|
| 1-3 | Position Report | 2-10 seconds | MMSI, lat/lon, speed, course, heading |
| 5 | Static and Voyage | 6 minutes | Vessel name, dimensions, destination, ETA |
| 18-19 | Class B Position | 30 seconds | Similar to 1-3 but for smaller vessels |
| 24 | Static Data | 6 minutes | Vessel details for Class B |
| 27 | Long-range | 3 minutes | Position for satellite AIS |
For this tutorial, we’ll focus on position reports (types 1-3, 18-19)—the most common and useful for analytics.
Why AIS is Ideal for Learning Data Engineering
AIS data has characteristics that make it perfect for building real-world data pipelines:
- High volume — Millions of messages per day globally
- Near real-time — Position updates every few seconds from moving vessels
- Geospatial — Every message includes lat/lon coordinates
- Publicly available — Free access from sources like DMA (Danish Maritime Authority)
- ML-ready — Ideal for trajectory prediction and anomaly detection
Now let’s design a data lake that can handle this data at scale.
The Medallion Architecture
The medallion architecture organizes data into three quality tiers, each with a distinct purpose:
| Layer | Quality | Purpose | Consumers |
|---|---|---|---|
| Bronze | Raw | Land data exactly as received | Data engineers (debugging) |
| Silver | Clean | Validated, typed, partitioned, queryable | Data scientists, analysts |
| Gold | Curated | Aggregated, business-logic applied | BI tools, ML models, APIs |
Data flows through these layers with increasing refinement:
- Bronze receives raw files exactly as downloaded—CSV, JSON, whatever the source provides
- Silver transforms raw data into clean, validated, efficiently stored Parquet files
- Gold aggregates silver data into business-ready datasets optimized for specific use cases
Why Three Layers?
This separation provides several benefits:
Debugging capability: When something looks wrong in a gold dataset, you can trace back through silver to bronze to understand exactly what the source provided.
Reprocessing flexibility: If you improve your transformation logic, you can regenerate silver and gold layers without re-downloading data.
Performance optimization: Gold datasets are pre-aggregated for fast queries, while silver remains flexible for ad-hoc analysis.
Clear responsibilities: Each layer has a single job, making the pipeline easier to understand and maintain.
Our Data Lake Structure
We’ll store everything in a single MinIO bucket called ais-lake, organized by layer and data type:
s3://ais-lake/
├── bronze/ # Temporary staging
│ └── dma/
│ ├── aisdk-2024-01-15.zip
│ ├── aisdk-2024-01-16.zip
│ └── ...
│
├── silver/ # Clean, partitioned Parquet
│ └── ais/
│ └── source=dma/
│ ├── dt=2024-01-15/
│ │ ├── part-0.parquet
│ │ └── part-1.parquet
│ └── ...
│
├── gold/ # Business-ready aggregations
│ ├── vessel_tracks/
│ │ └── dt=2024-01-15/
│ │ └── tracks.parquet
│ ├── port_density/
│ │ └── dt=2024-01-15/
│ │ └── density.parquet
│ └── daily_stats/
│ └── dt=2024-01-15/
│ └── stats.parquet
│
└── dagster-logs/ # Execution logs
└── runs/
Hive-Style Partitioning
Notice the key=value folder names like source=dma and dt=2024-01-15. This is Hive-style partitioning, and it enables powerful query optimization.
When you query with WHERE dt='2024-01-15', DuckDB only reads files in that specific partition directory—it doesn’t scan data from other dates. This partition pruning dramatically speeds up queries on large datasets.
Why dt=YYYY-MM-DD?
We use a single date column rather than separate year/month/day partitions:
# Our choice
dt=2024-01-15/
# Alternative (not recommended)
year=2024/month=01/day=15/
The single-column approach is better for our use case because:
- Queries usually filter by date range: WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
- Fewer directory levels mean faster file listing operations
- A single date comparison is more intuitive than filtering three separate columns
Bronze Layer: Raw Staging
The bronze layer is a temporary landing zone for raw files. Data stays here only long enough to be converted to Parquet.
- Format: Files exactly as downloaded (CSV, ZIP, etc.)
- Lifetime: Short-term (days), deleted once the data has been converted to silver
- Validation: Minimal—just verify the file is readable
Why Not Convert Directly?
You might wonder why we don’t convert directly to silver during download. The separate bronze stage provides:
- Failure isolation: If Parquet conversion fails, you don’t lose the download
- Debugging access: You can inspect raw data when validation fails
- Reprocessing ability: You can re-run conversion with different settings
Since bronze files are deleted after processing, storage impact is minimal.
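The bronze stage boils down to two small responsibilities: derive a deterministic object key, and check the download is readable before anything downstream touches it. A minimal stdlib sketch, with hypothetical helper names (`bronze_key`, `is_readable_zip` are not from the pipeline, just illustration):

```python
import io
import zipfile
from datetime import date

def bronze_key(source: str, day: date) -> str:
    # Mirrors the layout s3://ais-lake/bronze/<source>/aisdk-YYYY-MM-DD.zip
    return f"bronze/{source}/aisdk-{day.isoformat()}.zip"

def is_readable_zip(payload: bytes) -> bool:
    # Minimal bronze validation: the archive opens and its CRCs check out
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False

# Tiny in-memory archive standing in for a downloaded AIS file
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("aisdk-2024-01-15.csv", "mmsi,lat,lon\n219000001,55.0,12.0\n")

key = bronze_key("dma", date(2024, 1, 15))
ok = is_readable_zip(buf.getvalue())
print(key, ok)
```

A real downloader would stream the payload straight to MinIO under that key; the readability check is the only gate before landing.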
Silver Layer: Clean, Queryable Data
The silver layer is the source of truth for cleaned AIS data. This is where most analysis happens.
- Format: Parquet with Snappy compression
- Schema: Strongly typed and validated
- Partitioning: By source and date
- Lifetime: Long-term (years)
Standard Schema
All AIS data in silver follows this schema:
CREATE TABLE silver.ais (
-- Vessel identifier
mmsi BIGINT NOT NULL,
-- Timestamp (always UTC)
timestamp TIMESTAMP NOT NULL,
-- Position (WGS84)
latitude DOUBLE,
longitude DOUBLE,
-- Dynamics
speed_over_ground DOUBLE,
course_over_ground DOUBLE,
heading INTEGER,
-- Navigation status
navigational_status INTEGER,
-- Message metadata
message_type INTEGER,
-- Partitioning columns
source VARCHAR,
dt DATE
)
PARTITIONED BY (source, dt);
This schema captures the essential fields for maritime analytics while keeping the data lean. Additional fields like vessel name and destination are better stored in separate dimension tables.
Why Snappy Compression?
Parquet supports multiple compression codecs. We chose Snappy because:
| Codec | Compression | Speed | Best For |
|---|---|---|---|
| Snappy | 5-10x | Fast | Frequently queried data |
| GZIP | 10-20x | Slow | Archival storage |
| ZSTD | 10-15x | Medium | Balance of both |
Since silver data is queried often, Snappy’s fast decompression is worth the slightly larger file size.
Storage Estimates
Raw CSV compresses dramatically when converted to Parquet:
Daily CSV: ~500MB compressed
Daily Parquet (Snappy): ~100MB
Yearly: ~36GB
10 years: ~360GB
This fits comfortably on a 1TB disk with room to spare.
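A quick back-of-envelope check of those figures (treating the ~100MB/day Parquet estimate as the input):

```python
# Back-of-envelope storage arithmetic for the silver layer
daily_parquet_gb = 100 / 1024        # ~100 MB/day expressed in GB
yearly_gb = daily_parquet_gb * 365   # ~36 GB per year
decade_gb = yearly_gb * 10           # ~360 GB per decade
print(round(yearly_gb, 1), round(decade_gb))
```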
Gold Layer: Business-Ready Datasets
The gold layer contains pre-aggregated datasets optimized for specific use cases. These are the tables that power dashboards, feed ML models, and answer business questions.
Vessel Tracks
Use case: Analyze individual vessel movements over time
CREATE TABLE gold.vessel_tracks (
mmsi BIGINT,
track_date DATE,
points_count INTEGER,
total_distance_nm DOUBLE,
avg_speed DOUBLE,
max_speed DOUBLE,
start_lat DOUBLE,
start_lon DOUBLE,
end_lat DOUBLE,
end_lon DOUBLE,
dt DATE
);
This dataset groups position reports by vessel and date, calculating summary statistics that would be expensive to compute on demand.
Port Density
Use case: Heatmaps of vessel activity near ports
CREATE TABLE gold.port_density (
port_name VARCHAR,
grid_cell VARCHAR,
vessel_count INTEGER,
avg_speed DOUBLE,
dt DATE
);
Pre-computed grid cells make it possible to render density maps instantly instead of aggregating millions of points on each request.
Daily Statistics
Use case: Pipeline monitoring and trend analysis
CREATE TABLE gold.daily_stats (
total_messages BIGINT,
unique_vessels INTEGER,
avg_speed DOUBLE,
coverage_area_km2 DOUBLE,
dt DATE
);
Gold datasets are small—often just a few KB per day—so they can be kept indefinitely.
Partitioning Strategy
What to Partition By
Date: Essential for time-series data. Enables efficient range queries and incremental processing.
Source: Useful when you have multiple data providers with potentially different schemas or quality levels.
What NOT to Partition By
MMSI (vessel ID): Too high cardinality. Millions of unique values would create millions of tiny files, making directory listing painfully slow.
Hour: Too granular for daily batch processing. Use the timestamp column within the date partition instead.
Geographic region: Queries rarely filter by fixed regions. Better to use geospatial indexes or post-query filtering.
Schema Evolution
Data sources occasionally change their formats. Our strategy for handling this:
Bronze: Accept any schema. Store as-is without validation.
Silver: Map to standard schema. Add new columns as nullable. Use default values for missing columns. Log mismatches for investigation.
Gold: Use versioned datasets. If a gold schema changes significantly, create a new version (vessel_tracks_v2) rather than breaking existing consumers.
This approach keeps the pipeline resilient to upstream changes while maintaining stable interfaces for downstream users.
Retention Policies
With 1TB of storage, we need lifecycle policies:
| Layer | Retention | Reason |
|---|---|---|
| Bronze | 7 days | Only needed for reprocessing |
| Silver | 3+ years | Source of truth for analytics |
| Gold | Permanent | Small size, high value |
MinIO supports lifecycle policies that automatically delete old bronze files:
<LifecycleConfiguration>
<Rule>
<ID>delete-bronze-after-7-days</ID>
<Filter>
<Prefix>bronze/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>7</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
Naming Conventions
Consistent naming prevents confusion as your data lake grows:
- Lowercase everything: bronze/, not Bronze/
- No spaces: Use underscores (vessel_tracks) or hyphens
- Hive partitioning: key=value/ format
- ISO dates: YYYY-MM-DD, never MM-DD-YYYY
- Descriptive names: vessel_tracks, not vt
Good paths:
silver/ais/source=dma/dt=2024-01-15/part-0.parquet
gold/vessel_tracks/dt=2024-01-15/tracks.parquet
Avoid:
Silver/AIS/DMA/2024-01-15.parquet # Mixed case, no partitioning
gold/vt/20240115/data.parquet # Abbreviation, wrong date format
Creating the Structure
Use the MinIO client (mc) to create the bucket and initial structure:
# Configure mc with your MinIO credentials
mc alias set local http://<MINIO_IP>:9000 <access-key> <secret-key>
# Create the bucket
mc mb local/ais-lake
# Create layer placeholders
echo "placeholder" | mc pipe local/ais-lake/bronze/.keep
echo "placeholder" | mc pipe local/ais-lake/silver/.keep
echo "placeholder" | mc pipe local/ais-lake/gold/.keep
Note: S3/MinIO doesn’t have real folders—they’re virtual constructs based on object key prefixes. The .keep files ensure the “folders” appear in the console.
What You’ve Designed
Your data lake now has:
- Clear organization with bronze, silver, and gold layers
- Efficient partitioning by source and date for fast queries
- Standard schemas that make data predictable and queryable
- Retention policies that keep storage under control
- Naming conventions that prevent confusion
This structure scales from gigabytes to terabytes without changes. The same patterns work on a home server or a cloud data platform.
What’s Next
With the data lake structure defined, we need to understand where the data comes from. Chapter 4 covers connecting to AIS data feeds and building the download pipeline.
Designing a data lake for your organization? We help companies implement modern data architectures. Get in touch to discuss your project.