Seamless Data Synchronization from Transactional Database to Snowflake

Authored by Ameex Technologies on 05 May 2020

Snowflake data replication empowers enterprises to synchronize data hosted across multiple accounts, global locations, and multiple cloud service providers. The intent is to consolidate enterprise data updates into a single backup location or a suitable cloud service provider, ensuring that the enterprise's data is available 24x7.

This blog details an approach to synchronously stream row-level replication write-ahead logs from a transactional database (DB) to Snowflake for analytical/reporting consumption.

A typical use case for this approach is a warehousing solution that maintains a near real-time replica of the Master transactional database, which can then be transformed using ELT.

In our case, we will use Postgres as an example with synchronous WAL streaming replication. Since the process is synchronous, the Postgres Master DB will wait until the Snowflake daemon sends read feedback before committing a transaction. This provides a failsafe in case the Master server fails.


What are write-ahead logs and replication streaming?

A write-ahead log (WAL) is a standard method of ensuring database recoverability by recording changelogs for each transaction. In our case, we will use logical replication for Postgres, which means that only DML changelogs will be streamed over TCP/IP in the form of WAL messages.

This method requires a base backup of the Master DB in Snowflake and a sync-up of the target database schemas and definitions.

Advantages of using Write ahead logs vs Batch streaming:

  • Small data-transfer footprint and almost real-time replication from the Master side
  • Low processing overhead on the source database
  • Low latency of minutes in replicating changes
  • In the case of a service-driven architecture, changes from different services can be streamed to each other and also the Snowflake warehouse

Performance planning:

Since we are using a synchronous stream, the Snowflake daemon should be capable of responding with feedback to the Master DB's WAL messages at a fairly fast rate, so as not to degrade transactional DB performance through synchronous locks and wait times.

Steps for setting up replication streaming:

1. Verify whether the Postgres DB has logical replication enabled:

Run the following command in the Postgres GUI/CLI and ensure that wal_level is set to logical. A minimal check is sketched below.
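The sketch below assumes a psycopg2 connection to the Master DB; the hostname and credentials are placeholders:

import psycopg2

# Connect to the Master Postgres DB (placeholder credentials)
conn = psycopg2.connect(host="master-db.example.com", dbname="appdb",
                        user="replicator", password="secret")
cur = conn.cursor()

# wal_level must be 'logical' for logical replication streaming
cur.execute("SHOW wal_level;")
print(cur.fetchone()[0])  # expected output: logical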


In the case of an on-premises setup, ensure that postgresql.conf (or postgresql.auto.conf) sets wal_level = logical and an adequate max_replication_slots value. In the case of an AWS RDS Postgres instance, the database user accessing the replication slot should have replication permissions (for RDS, the rds_replication role).

Also ensure that the parameter group associated with the RDS instance has the replication parameters configured properly (for example, rds.logical_replication enabled).

2. Set up an EC2 machine for streaming WAL logs to Kinesis:

A Python script has to be set up to run as a continuous job on the EC2 machine, to consume the logical replication stream and push changelogs to Kinesis, as sketched below.
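The following is a minimal sketch of such a job, assuming the wal2json output plugin is available on the Master DB; the slot name, stream name, host, and credentials are placeholders:

import boto3
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Open a logical replication connection to the Master DB (placeholder credentials)
conn = psycopg2.connect(host="master-db.example.com", dbname="appdb",
                        user="replicator", password="secret",
                        connection_factory=LogicalReplicationConnection)
cur = conn.cursor()

# Create the replication slot once; comment this out on subsequent runs
cur.create_replication_slot("snowflake_slot", output_plugin="wal2json")
cur.start_replication(slot_name="snowflake_slot", decode=True)

def send_to_kinesis(msg):
    # Forward the decoded WAL change record to Kinesis
    kinesis.put_record(StreamName="wal-stream",
                       Data=msg.payload.encode("utf-8"),
                       PartitionKey="wal")
    # Acknowledge the LSN so the synchronous Master can commit
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(send_to_kinesis)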

3. Set up a Kinesis stream:

Kinesis is Amazon's fully managed streaming service that is fault-tolerant and real-time. In this use case, we will use Kinesis to collect WAL logs from the Postgres streaming daemon and deliver them into S3.


A shard is a unit of Kinesis capacity; in this case, we will be using a single shard, as in the sketch below.
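A minimal sketch of the stream setup using boto3; the stream name and region are placeholders:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A single shard is sufficient for a low-volume WAL changelog stream
kinesis.create_stream(StreamName="wal-stream", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="wal-stream")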

4. Set up Firehose output to S3:

Firehose reads the Kinesis stream and dumps the WAL JSON logs into an S3 bucket. In this example, we are not performing any transformations on the streamed logs, so we will use a plain S3 bucket as the destination.

Associate an appropriate IAM role with policies granting access to Kinesis, Firehose, and the destination S3 bucket, or use the inline role set up by Firehose. A sketch of the delivery stream setup follows.
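The sketch below uses boto3 to create the delivery stream; the stream ARN, bucket ARN, and role ARNs are placeholders:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="wal-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/wal-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    # No transformation: deliver the raw WAL JSON records straight to S3
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::wal-landing-bucket",
        "Prefix": "wal/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "UNCOMPRESSED",
    },
)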

5. Set up Snowpipe to pick up changes from S3:

Snowpipe is Snowflake's managed serverless ingestion service, which loads files from an external stage (S3) based on S3 event notifications from the bucket.

Set up a Snowpipe connection to the destination S3 bucket. Changes will then be streamed into Snowflake for replication.

This can be performed by creating an external stage pointing to the Firehose destination S3 bucket and adding a pipe pointing to this stage, as in the sketch below.
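A minimal sketch of the stage and pipe setup, executed through the Snowflake Python connector; the account details, bucket, credentials, and the wal_raw landing table (a single VARIANT column) are assumptions for illustration:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="loader",
                                   password="secret", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="RAW")
cur = conn.cursor()

# Landing table holding one WAL JSON record per row
cur.execute("CREATE TABLE IF NOT EXISTS wal_raw (record VARIANT)")

# External stage pointing at the Firehose destination bucket
cur.execute("""
    CREATE OR REPLACE STAGE wal_stage
      URL = 's3://wal-landing-bucket/wal/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = JSON)
""")

# Pipe that auto-ingests on S3 event notifications
cur.execute("""
    CREATE OR REPLACE PIPE wal_pipe AUTO_INGEST = TRUE AS
      COPY INTO wal_raw
      FROM @wal_stage
""")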

6. Merge changelogs into a table using Python:

To recreate the original table, we can either apply a separate transaction for each WAL log change, which gives finer-grained auditing, or perform a merge statement in batches.

Insert and Delete statements can be generated for each WAL log transaction; a sample Python sketch combining statement generation and execution appears after the transaction outline below.

Using a Python script, the Delete and Insert statements can be executed in a single transaction:

Begin transaction
<Delete statement>
<Insert statement>
End transaction
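A sketch of this apply loop, assuming wal2json-style change records and the pyformat parameter style of the Snowflake Python connector; the account details are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="loader",
                                   password="secret", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="RAW")
cur = conn.cursor()

def apply_change(change):
    # change is a wal2json record: kind, table, columnnames, columnvalues, oldkeys
    table = change["table"]

    cur.execute("BEGIN")
    if change["kind"] in ("update", "delete"):
        # Delete the old row identified by its replica identity keys
        keys = change["oldkeys"]
        where = " AND ".join(f"{k} = %s" for k in keys["keynames"])
        cur.execute(f"DELETE FROM {table} WHERE {where}", tuple(keys["keyvalues"]))
    if change["kind"] in ("insert", "update"):
        # Re-insert the new row image
        cols = change["columnnames"]
        placeholders = ", ".join(["%s"] * len(cols))
        cur.execute(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})",
                    tuple(change["columnvalues"]))
    cur.execute("COMMIT")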

This can also be accomplished by generating a staging table for a merge statement. A separate merge statement has to be created for each target table.

The merge statement, which applies inserts and deletes together, uses a single transaction for the entire query, leading to higher throughput and fewer DB requests. However, the tradeoff is that failures can no longer be audited and debugged per WAL log transaction. A sample merge is sketched below.
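A sketch of such a batch merge for a hypothetical customers target table fed from a customers_staging table; the op column marking deletes is also an assumption:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="loader",
                                   password="secret", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="RAW")
cur = conn.cursor()

# One MERGE per target table, applied as a single transaction
cur.execute("""
    MERGE INTO customers AS t
    USING customers_staging AS s
      ON t.id = s.id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.name = s.name, t.email = s.email
    WHEN NOT MATCHED AND s.op <> 'delete' THEN
      INSERT (id, name, email) VALUES (s.id, s.name, s.email)
""")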

Conclusion:

With this method, changelogs are shipped to Snowflake synchronously with a latency of minutes, depending on the Snowpipe load time. To circumvent Snowpipe, REST APIs can be used to trigger COPY statements and avoid lock conditions.

To know more about our data engineering expertise, write to us.
