Snowflake — Iceberg tables 101

Johan
3 min read · Jul 12, 2024


TL;DR: use this Snowflake Quickstart to explore Iceberg tables: https://quickstarts.snowflake.com/guide/getting_started_iceberg_tables

(Image: Winter Iceberg Season — free photo on Pixabay)

Let’s try to understand these Iceberg tables, as developments around data-lake table formats can take some effort to follow. For me, this means learning by doing.

All links are listed at the bottom for further reading. Some one-liners from those resources:

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

TLDR: The Iceberg format is a logical table with underlying data stored in columnar formats on cloud object storage. When working with Iceberg tables, you want to ensure best practices such as choosing the right partitioning scheme, compacting small files, managing data retention, and managing schema evolution.

Now that we understand what it is, let’s work with some Iceberg tables. I am using the Snowflake quickstart here.

First, we need to do the AWS setup for the external volume. For now, I just create the AWS resources (an S3 bucket and an IAM role) manually. Note that the S3 bucket must be in the same region as the Snowflake account.

Find the region for your Snowflake account and point your AWS account to the same region:

select current_region();
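
With the region known, the quickstart has you create an external volume pointing to an S3 bucket in that region. Below is a minimal sketch of that command; the location name, bucket name and IAM role ARN are placeholders you need to replace with your own values:

-- Sketch of the external volume definition (placeholders, not my actual values)
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'iceberg-s3-location'
      STORAGE_PROVIDER = 'S3'
      -- bucket must be in the same region as the Snowflake account
      STORAGE_BASE_URL = 's3://<your-bucket>/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<your-iceberg-role>'
    )
  );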

After going through the AWS and Snowflake setup, execute this command:

show external volumes;
desc external volume iceberg_external_volume;

Similar to storage integrations, we now need to go back and forth between Snowflake and AWS: the IAM user ARN and external ID shown in the DESC output have to be entered in the trust relationship of the IAM role on the AWS side.

Then we can create the table and insert data into it:

CREATE OR REPLACE ICEBERG TABLE customer_iceberg (
  c_custkey INTEGER,
  c_name STRING,
  c_address STRING,
  c_nationkey INTEGER,
  c_phone STRING,
  c_acctbal INTEGER,
  c_mktsegment STRING,
  c_comment STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_external_volume'
BASE_LOCATION = '';

INSERT INTO customer_iceberg
SELECT * FROM snowflake_sample_data.tpch_sf1.customer;
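
To verify the load, a quick check; the second statement uses Snowflake’s SYSTEM$GET_ICEBERG_TABLE_INFORMATION function to show where the current metadata file lives (my own addition, not part of the quickstart):

-- Row count should match the TPC-H SF1 customer table (150,000 rows)
SELECT COUNT(*) FROM customer_iceberg;

-- Returns the location of the current Iceberg metadata file on the external volume
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('customer_iceberg');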

When running a query, a Snowflake table and an Iceberg table can be combined seamlessly, as you can see in the query profile.
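
For example, a query along these lines joins the Iceberg table with the native sample data in a single statement (my own illustration, not taken from the quickstart):

-- Join the Iceberg customer table with the native TPC-H orders table
SELECT c.c_mktsegment,
       COUNT(o.o_orderkey) AS order_count
FROM customer_iceberg c
JOIN snowflake_sample_data.tpch_sf1.orders o
  ON o.o_custkey = c.c_custkey
GROUP BY c.c_mktsegment
ORDER BY order_count DESC;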

This is what is created in S3: two folders, data and metadata.

In the data folder, we see that the files are created as Parquet files.

The metadata folder contains the Iceberg metadata: Avro manifest files and JSON metadata files.

The tutorial further explains row-level and column-level security with code examples, using Snowpark with Iceberg tables, sharing Iceberg tables, and finally using Spark for processing.
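
To give a flavour of the row-level security part, here is a minimal sketch of a row access policy on the Iceberg table; the role name and the filter on c_mktsegment are my own assumptions, not the exact code from the tutorial:

-- Only ACCOUNTADMIN sees all rows; other roles only see the AUTOMOBILE segment (assumed filter)
CREATE OR REPLACE ROW ACCESS POLICY customer_segment_policy
  AS (segment STRING) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'ACCOUNTADMIN' OR segment = 'AUTOMOBILE';

ALTER ICEBERG TABLE customer_iceberg
  ADD ROW ACCESS POLICY customer_segment_policy ON (c_mktsegment);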

To conclude: there is a lot to discover on the Iceberg topic with regard to the Snowflake implementation, and the best way in is simply to start playing around with it. If you have any comments or learning resources, please let me know in the comments.

Links:

Iceberg 101: Working with Iceberg Tables — Upsolver

What Are Apache Iceberg Tables and How Are They Useful? | Snowflake

Apache Iceberg — Apache Iceberg

https://quickstarts.snowflake.com/guide/getting_started_iceberg_tables


Johan

I am a freelance data engineer. My main focus is Snowflake, but I am always eager to learn new things. Python and AWS are my main side interests.