Overview
This guide introduces the incremental materialization of models in Snowflake using SDF. We’ll build on the basic concepts of materializing tables and views, focusing on incremental updates to optimize performance and resource usage. We continue to use datasets from Cybersyn, available through Snowflake’s marketplace.
Prerequisites
This guide should be followed after completing the Getting Started with Snowflake and SDF guide and the Basic Materialization with Snowflake guide.
- A Snowflake account with this Cybersyn dataset installed.
- Valid Snowflake username/password credentials with write access to at least one database to which we can materialize tables.
- Credentials instantiated as described in the previous guide.
When installing the Cybersyn dataset, make sure to grant the Snowflake role you’ll use with SDF read access to
TECH__INNOVATION_ESSENTIALS (i.e. the Cybersyn database).
Guide
1
Create a New SDF Project from the Cybersyn Tech Innovation Sample
If you just completed the Basic Materialization with Snowflake guide, you can skip the next two steps and use the project you created there.
2
Compile to Test Credentials
To ensure your credentials are working and have read access to the new Cybersyn database, let’s try compiling one of the models.
Working set 3 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema)
Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)
Finished 2 models [2 succeeded] in 1.782 secs
Schema sdf_snowflake.cybersyn_tech_innovation.FUNDER_AGGREGATES
┌──────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞══════════════════════╪════════════════╪════════════╪═════════════╡
│ FUNDER_NAME ┆ varchar ┆ ┆ │
│ TOTAL_FUNDERS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_WORKS_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_CITATIONS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_GRANTS ┆ decimal(38, 0) ┆ ┆ │
│ UNIQUE_COUNTRY_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ FIRST_CREATED_DATE ┆ timestamp ┆ ┆ │
│ LAST_UPDATED_DATE ┆ timestamp ┆ ┆ │
└──────────────────────┴────────────────┴────────────┴─────────────┘
3
Materialize an Incremental Model
Now that we’ve confirmed our credentials are working, let’s materialize an incremental model. Cybersyn offers a table called github_events, which is updated frequently with new events across all repositories on GitHub. We’ll use this table to demonstrate incremental materialization with the append merge strategy.
Let’s say we want to track push events to the Apache DataFusion repo. We can create a new model that finds all push events to DataFusion. Since the github_events source data is updated frequently, we can use incremental materialization to query only new push events, thereby saving on compute and optimizing our pipeline.
Since we don’t care about updates to existing rows, we can use the append merge strategy. This strategy appends new rows to the target table without updating existing rows.
Let’s start by creating a file called datafusion_push_events.sql in the directory models/sdf_snowflake/staging/ and adding the following SQL to it:
models/sdf_snowflake/staging/datafusion_push_events.sql
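The original SQL block is not reproduced in this extract, so here is a minimal sketch of what the model might look like, based on the description that follows. The use of SELECT *, the repo_name value, the one-week lookback, and the bare datafusion_push_events self-reference are assumptions:

```sql
-- Sketch (assumed details): fetch push events for the DataFusion repo.
-- In incremental mode, only fetch rows newer than the newest row already
-- materialized; otherwise, only fetch the last week of rows.
SELECT *
FROM tech__innovation_essentials.cybersyn.github_events
WHERE type = 'PushEvent'
  AND repo_name = 'apache/arrow-datafusion'
{% if builtin.is_incremental_mode %}
  AND created_at_timestamp > (SELECT MAX(created_at_timestamp) FROM datafusion_push_events)
{% else %}
  AND created_at_timestamp > DATEADD('week', -1, CURRENT_TIMESTAMP())
{% endif %}
```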
Let’s unpack a few things here:
- This relatively simple query fetches events and filters by their type, repository name, and created_at_timestamp.
- The {% if builtin.is_incremental_mode %} block is a Jinja conditional that checks whether the model is being materialized incrementally. If it is, we only fetch rows that are newer than the newest row in the previous materialization of this table. If not, we fetch rows from the last week.
In a production scenario, you would likely want to fetch events from all time for a non-incremental run, since that would be a full refresh of the data. We add the one-week filter here to avoid any major compute costs in this guide.
You might be wondering: how does SDF know when to set builtin.is_incremental_mode to True? SDF sets this variable to True when the model has already been materialized in the cloud.
Before running this model, we’ll need to tell SDF to overwrite the default materialization for this table. We can do this by adding the following to the workspace.sdf.yml file:
workspace.sdf.yml
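The YML snippet itself is missing from this extract; a sketch of what the addition to workspace.sdf.yml might look like is below. The block structure and the name key are assumptions based on the materialization name mentioned later in the guide:

```yaml
---
table:
  name: sdf_snowflake.staging.datafusion_push_events
  materialization: incremental-table
```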
SDF defaults the incremental strategy to append, so we don’t need to specify it in this YML (we’ll explore this later in the guide).
Now, let’s first compile our workspace with this new model:
Working set 4 model files, 1 .sdf file
Downloading SDF_SNOWFLAKE.STAGING.DATAFUSION_PUSH_EVENTS (exists_remotely)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Compiling sdf_snowflake.staging.datafusion_push_events (./models/sdf_snowflake/staging/datafusion_push_events.sql)
Finished 2 models [2 succeeded] in 2.526 secs
Schema sdf_snowflake.staging.DATAFUSION_PUSH_EVENTS
┌────────────────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞════════════════════════════════╪════════════════╪════════════╪═════════════╡
│ ID ┆ varchar ┆ ┆ │
│ CREATED_AT_TIMESTAMP ┆ timestamp ┆ ┆ │
│ TYPE ┆ varchar ┆ ┆ │
│ ACTOR_AVATAR_URL ┆ varchar ┆ ┆ │
│ ACTOR_DISPLAY_LOGIN ┆ varchar ┆ ┆ │
│ ACTOR_GRAVATAR_ID ┆ varchar ┆ ┆ │
│ ACTOR_ID ┆ decimal(38, 0) ┆ ┆ │
│ ACTOR_LOGIN ┆ varchar ┆ ┆ │
│ ACTOR_URL ┆ varchar ┆ ┆ │
│ REPO_ID ┆ decimal(38, 0) ┆ ┆ │
│ REPO_NAME ┆ varchar ┆ ┆ │
│ REPO_URL ┆ varchar ┆ ┆ │
│ ORG_AVATAR_URL ┆ varchar ┆ ┆ │
│ ORG_GRAVATAR_ID ┆ varchar ┆ ┆ │
│ ORG_ID ┆ decimal(38, 0) ┆ ┆ │
│ ORG_LOGIN ┆ varchar ┆ ┆ │
│ ORG_URL ┆ varchar ┆ ┆ │
│ PAYLOAD ┆ variant ┆ ┆ │
│ PAYLOAD_ACTION ┆ varchar ┆ ┆ │
│ PAYLOAD_DESCRIPTION ┆ varchar ┆ ┆ │
│ PAYLOAD_COMMENT ┆ varchar ┆ ┆ │
│ PAYLOAD_MASTER_BRANCH ┆ varchar ┆ ┆ │
│ PAYLOAD_PULL_REQUEST ┆ varchar ┆ ┆ │
│ PAYLOAD_PUSHER_TYPE ┆ varchar ┆ ┆ │
│ PAYLOAD_PUSH_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_HEAD ┆ varchar ┆ ┆ │
│ PAYLOAD_REF ┆ varchar ┆ ┆ │
│ PAYLOAD_REF_TYPE ┆ varchar ┆ ┆ │
│ PAYLOAD_ISSUE_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_ISSUE ┆ variant ┆ ┆ │
│ PUBLIC ┆ boolean ┆ ┆ │
│ LOAD_DATE ┆ varchar ┆ ┆ │
│ CREATED_AT ┆ varchar ┆ ┆ │
│ PAYLOAD_BODY ┆ varchar ┆ ┆ │
│ PAYLOAD_COMMIT_ID ┆ varchar ┆ ┆ │
│ PAYLOAD_CREATED_AT ┆ varchar ┆ ┆ │
│ PAYLOAD_USER_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_USER_LOGIN ┆ varchar ┆ ┆ │
│ ISSUE ┆ variant ┆ ┆ │
│ ISSUE_ACTIVE_LOCK_REASON ┆ varchar ┆ ┆ │
│ ISSUE_ASSIGNEE ┆ varchar ┆ ┆ │
│ ISSUE_ASSIGNEES ┆ variant ┆ ┆ │
│ ISSUE_AUTHOR_ASSOCIATION ┆ varchar ┆ ┆ │
│ ISSUE_BODY ┆ varchar ┆ ┆ │
│ ISSUE_CLOSED_AT ┆ varchar ┆ ┆ │
│ ISSUE_COMMENTS ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_COMMENTS_URL ┆ varchar ┆ ┆ │
│ ISSUE_CREATED_AT ┆ varchar ┆ ┆ │
│ ISSUE_DRAFT ┆ boolean ┆ ┆ │
│ ISSUE_EVENTS_URL ┆ varchar ┆ ┆ │
│ ISSUE_HTML_URL ┆ varchar ┆ ┆ │
│ ISSUE_ID ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_LABELS ┆ variant ┆ ┆ │
│ ISSUE_LABELS_URL ┆ varchar ┆ ┆ │
│ ISSUE_LOCKED ┆ boolean ┆ ┆ │
│ ISSUE_MILESTONE ┆ varchar ┆ ┆ │
│ ISSUE_NODE_ID ┆ varchar ┆ ┆ │
│ ISSUE_NUMBER ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_PERFORMED_VIA_GITHUB_APP ┆ varchar ┆ ┆ │
│ ISSUE_PULL_REQUEST ┆ variant ┆ ┆ │
│ ISSUE_REACTIONS ┆ variant ┆ ┆ │
│ ISSUE_REPOSITORY_URL ┆ varchar ┆ ┆ │
│ ISSUE_STATE ┆ varchar ┆ ┆ │
│ ISSUE_TIMELINE_URL ┆ varchar ┆ ┆ │
│ ISSUE_TITLE ┆ varchar ┆ ┆ │
│ ISSUE_UPDATED_AT ┆ varchar ┆ ┆ │
│ ISSUE_URL ┆ varchar ┆ ┆ │
│ ISSUE_USER_ID ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_USER_LOGIN ┆ varchar ┆ ┆ │
│ ISSUE_USER_TYPE ┆ varchar ┆ ┆ │
│ LANGUAGE ┆ varchar ┆ ┆ │
└────────────────────────────────┴────────────────┴────────────┴─────────────┘
Before running our new incremental model, let’s inspect the compiled query output to see exactly what will be run against Snowflake. To do so, open up the file sdftarget/dbg/materialized/sdf_snowflake/staging/datafusion_push_events.sql and inspect the SQL query. It should show this:
sdftarget/dbg/materialized/sdf_snowflake/staging/datafusion_push_events.sql
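The compiled output is not reproduced in this extract. Roughly, with builtin.is_incremental_mode resolved to False, it should be a plain table creation along these lines (the exact DDL wording, repo_name value, and column list are assumptions):

```sql
-- Assumed shape of the compiled, non-incremental query: the Jinja
-- conditional has been resolved to the one-week filter branch.
CREATE OR REPLACE TABLE sdf_snowflake.staging.datafusion_push_events AS
SELECT *
FROM tech__innovation_essentials.cybersyn.github_events
WHERE type = 'PushEvent'
  AND repo_name = 'apache/arrow-datafusion'
  AND created_at_timestamp > DATEADD('week', -1, CURRENT_TIMESTAMP());
```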
As you can see, the query was compiled with builtin.is_incremental_mode set to False. This is because we haven’t materialized the table yet. Let’s run the model now to materialize the table in non-incremental mode.
Nice! The model should have successfully been materialized in Snowflake. Next, we’ll try running the model in incremental mode. All we need to do is run the model again; SDF will automatically detect that the model has already been materialized and set builtin.is_incremental_mode to True.
Notice how the model ran much faster this time? That’s because SDF only fetched new rows from the github_events table that were created after the last materialization.
Furthermore, you might notice something slightly different in the run output, specifically a line that starts with Preloading. This indicates SDF is preloading the schema and last-altered time of the table before running the query. As previously mentioned, this is used to set the is_incremental_mode builtin variable.
Lastly, if you inspect the compiled query output again, you should see builtin.is_incremental_mode set to True and the query’s SQL reflective of that.
sdftarget/dbg/materialized/sdf_snowflake/staging/datafusion_push_events.sql
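Again, the compiled output itself is missing from this extract; in incremental mode it should look roughly like an append of new rows (exact statement shape, repo_name value, and column list are assumptions):

```sql
-- Assumed shape of the compiled, incremental query: new rows are appended,
-- filtered to those newer than the latest row already in the target table.
INSERT INTO sdf_snowflake.staging.datafusion_push_events
SELECT *
FROM tech__innovation_essentials.cybersyn.github_events
WHERE type = 'PushEvent'
  AND repo_name = 'apache/arrow-datafusion'
  AND created_at_timestamp > (
    SELECT MAX(created_at_timestamp)
    FROM sdf_snowflake.staging.datafusion_push_events
  );
```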
4
Utilize the Merge Incremental Strategy
Now that we’ve successfully materialized an incremental model with the append strategy, let’s explore the merge strategy. The merge strategy is useful when you want to update existing rows in the target table with new data from the source table.
For a full reference of all supported merge strategies, see the SDF Reference section.
Let’s say we now want to track a running count of the total push events per unique contributor to the DataFusion repo. Since we want to update the row corresponding to each contributor, or create a new row if the contributor is new, the merge strategy is a perfect fit for this use case.
Let’s start by adding a new SQL file to our workspace called top_datafusion_contributors.sql in the directory models/sdf_snowflake/cybersyn_tech_innovation/ with the following SQL:
models/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql
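The SQL block is missing from this extract; a minimal sketch consistent with the compiled schema shown below (the grouping and column choices are assumptions):

```sql
-- Sketch (assumed details): count push events per unique contributor.
SELECT
    actor_id,
    actor_display_login,
    COUNT(*) AS contribution_count
FROM sdf_snowflake.staging.datafusion_push_events
GROUP BY actor_id, actor_display_login
```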
Next, let’s update the workspace.sdf.yml to set the materialization to incremental-table for this model:
workspace.sdf.yml
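The YML block itself is not reproduced here; a sketch of what it might look like, based on the options named in the next paragraph (the nesting and the strategy key name are assumptions):

```yaml
---
table:
  name: sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors
  materialization: incremental-table
  incremental-options:
    strategy: merge
    unique-key: actor_id
    merge-update-columns:
      - contribution_count
```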
In this YML, we’ve set the incremental-options to use the merge strategy. We’ve also specified the unique-key as actor_id and the merge-update-columns as contribution_count. This tells SDF to update the contribution_count column in the target table with the new count from the source table when it finds two rows that match on their actor_id.
Let’s compile the workspace with this new model:
Working set 5 model files, 1 .sdf file
Downloading SDF_SNOWFLAKE.STAGING.GITHUB_PUSH_EVENTS (exists_remotely)
Downloading SDF_SNOWFLAKE.CYBERSYN_TECH_INNOVATION.TOP_DATAFUSION_CONTRIBUTORS (exists_remotely)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Compiling sdf_snowflake.staging.datafusion_push_events (./models/sdf_snowflake/staging/datafusion_push_events.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors (./models/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql)
Finished 3 models [3 succeeded] in 2.466 secs
Schema sdf_snowflake.cybersyn_tech_innovation.TOP_DATAFUSION_CONTRIBUTORS
┌─────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═════════════════════╪════════════════╪════════════╪═════════════╡
│ ACTOR_ID ┆ decimal(38, 0) ┆ ┆ │
│ ACTOR_DISPLAY_LOGIN ┆ varchar ┆ ┆ │
│ CONTRIBUTION_COUNT ┆ decimal(38, 0) ┆ ┆ │
└─────────────────────┴────────────────┴────────────┴─────────────┘
Before running the model, let’s inspect the compiled query output to see exactly what will be run against Snowflake. To do so, open up the file sdftarget/dbg/materialized/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql and inspect the SQL query. It should show this:
sdftarget/dbg/materialized/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql
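The compiled output is not reproduced in this extract; on the first run it should be a plain table creation roughly like this (exact DDL wording and the aggregation details are assumptions):

```sql
-- Assumed shape of the compiled query on the first (non-incremental) run:
-- the table is simply created from the aggregation.
CREATE OR REPLACE TABLE sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors AS
SELECT
    actor_id,
    actor_display_login,
    COUNT(*) AS contribution_count
FROM sdf_snowflake.staging.datafusion_push_events
GROUP BY actor_id, actor_display_login;
```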
Since this table hasn’t been materialized yet, builtin.is_incremental_mode is set to False and the table is simply being created. Let’s run the model now to materialize the table in non-incremental mode.
The model should have successfully been materialized in Snowflake. Next, let’s try running the model in incremental mode. Like before, we’ll just run the model again and SDF will automatically detect that the model has already been materialized and set builtin.is_incremental_mode to True.
Let’s inspect the compiled query output again to see the SQL query that was run. You should see builtin.is_incremental_mode set to True and the query’s SQL reflective of that.
sdftarget/dbg/materialized/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql
If the file isn’t updated, try reopening it. Sometimes, because of VSCode’s cache, the file appears to be unchanged.
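The compiled merge statement is not reproduced in this extract; based on the description below, it should look roughly like the following (the exact statement shape and column handling are assumptions):

```sql
-- Assumed shape of the compiled incremental query: a MERGE keyed on actor_id
-- that updates contribution_count on matches and inserts new contributors.
MERGE INTO sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors AS target
USING (
    SELECT
        actor_id,
        actor_display_login,
        COUNT(*) AS contribution_count
    FROM sdf_snowflake.staging.datafusion_push_events
    GROUP BY actor_id, actor_display_login
) AS source
ON target.actor_id = source.actor_id
WHEN MATCHED THEN UPDATE SET
    target.contribution_count = source.contribution_count
WHEN NOT MATCHED THEN INSERT
    (actor_id, actor_display_login, contribution_count)
    VALUES (source.actor_id, source.actor_display_login, source.contribution_count);
```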
As you can see, the query was compiled with builtin.is_incremental_mode set to True and, as a result, the SQL query is a merge statement that updates the contribution_count column in the target table with the new count from the source table.
For a full list of our supported incremental options and strategies, see the SDF Reference section.