Pull from Snowflake on Amazon AWS¶
Snowflake is an analytic data warehouse that is fast, easy to use, and flexible. Snowflake uses a SQL database engine that is designed for the cloud. Snowflake can provide tables as a data source to Amperity.
Snowflake can provide data to Amperity in the following ways:
Stages configured in Snowflake
Sharing directly with Amperity
This topic describes the steps that are required to pull databases and tables to Amperity from Snowflake:
Get details
Configure Snowflake objects, storage integration, and tables
Get a sample CSV file
Add feed
Add courier using table objects or using Snowflake query syntax
Run courier manually
Add to courier group
Get details¶
Snowflake requires the following configuration details:
The username and password. Tip: Most users are configured with an initial password that must be changed. Create the user, manually change the password, and then use the updated password to configure this data source.
The stage name, along with credentials that allow Amperity to pull data from a customer-managed Amazon S3 bucket, which is the storage location to which Snowflake stages data and from which Amperity pulls data. Use secrets and keys to pull Snowflake table output from the customer-managed Amazon S3 bucket.
The Snowflake account name.
The region in which Snowflake is located. Note: The Amperity instance of Snowflake should be located in the same region as the customer’s instance of Snowflake.
A list of table names in the Snowflake database to be pulled to Amperity. Tip: Download each table from Snowflake as a CSV file with UTF-8 encoding, and then use it as a sample file during feed creation.
Snowflake secure data sharing¶
Snowflake secure data sharing is not enabled by default. If your organization wants to use secure data sharing, your instance of Snowflake must be in the same region as the Amperity instance.
Amperity supports the following regions for Amazon AWS:
US West: us-west-2
US East: us-east-1
Canada: ca-central-1
Note
If your region is not listed, you can ask your Amperity representative to request that Amperity engineering add support for that region.
Configure objects¶
Amperity requires access to the customer’s instance of Snowflake. This requires narrow read permissions to a warehouse object in Snowflake via a role and a user.
The following objects must be created in the customer’s instance of Snowflake:
A role via CREATE ROLE
A warehouse via CREATE WAREHOUSE
A user via CREATE USER
The role must be granted permission to the warehouse via GRANT USAGE and the user must be added to the role via GRANT ROLE.
To configure Snowflake objects
To configure Snowflake objects use a Data Definition Language (DDL) command similar to:
CREATE ROLE AMPERITY;
CREATE WAREHOUSE AMPERITY_WH WITH
  WAREHOUSE_SIZE = 'XSMALL'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 600
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 2
  SCALING_POLICY = 'ECONOMY';
GRANT USAGE ON WAREHOUSE AMPERITY_WH TO ROLE AMPERITY;
CREATE USER amperity
  PASSWORD = '********************'
  DEFAULT_WAREHOUSE = AMPERITY_WH
  DEFAULT_ROLE = AMPERITY;
GRANT ROLE AMPERITY TO USER amperity;
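To verify that the objects were created and the grants are in place, you can run commands similar to the following. This is a verification sketch that assumes the object names used above:
SHOW GRANTS TO ROLE AMPERITY;
SHOW GRANTS TO USER amperity;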
Configure storage integration¶
Amperity can ingest any type of data structure or data format that can be output from Snowflake. This output is written by Snowflake to a customer-managed Amazon S3 bucket, from which Amperity pulls the data.
Use secrets and keys to pull Snowflake table output from the customer-managed Amazon S3 bucket.
Note
You may need to use placeholder (mock) values for the IAM role's external ID and IAM user name. After the named stage is created in Snowflake, update the IAM role's trust policy with the external ID and IAM user name for that stage.
The IAM role is unique to the Snowflake account. The external ID is unique to the Snowflake stage.
For each database and schema to be output to the Amazon S3 bucket, specify the database and schema via USE, create a stage via CREATE STAGE, and then grant usage on the stage via GRANT USAGE.
To configure Snowflake storage integration
To configure Snowflake storage integration use a Data Definition Language (DDL) command similar to:
USE DATABASE <database-name>;
USE SCHEMA <schema-name>;
CREATE STAGE <stage-name>
  URL = 's3://<bucket>/<tenant-specific-path>/'
  CREDENTIALS = (AWS_ROLE = 'AMPERITY')
  ENCRYPTION = (TYPE = 'AWS_SSE_S3');
GRANT USAGE ON STAGE <stage-name> TO ROLE AMPERITY;
DESC STAGE <stage-name>;
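The output of DESC STAGE typically includes the IAM user and external ID that Snowflake uses when assuming the customer-managed IAM role. That role's trust policy must allow the Snowflake IAM user to assume the role with that external ID. The following is a minimal sketch of such a trust policy, using placeholder values:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<snowflake-iam-user>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<aws-external-id>" }
      }
    }
  ]
}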
Configure tables¶
For each table to be sent to Amperity, the Amperity role in Snowflake must be given permission. This requires permissions on both the database and the schema that contain the table. Usage rights to the database and schema do not grant access to the entire database or schema. Additional table-specific grants are required via a combination of GRANT USAGE, USE, and GRANT SELECT.
To configure Snowflake tables
To configure Snowflake tables use a Data Definition Language (DDL) command similar to:
GRANT USAGE ON DATABASE <database-name> TO ROLE AMPERITY;
USE DATABASE <database-name>;
GRANT USAGE ON SCHEMA <schema-name> TO ROLE AMPERITY;
USE SCHEMA <schema-name>;
GRANT SELECT ON TABLE <table-name> TO ROLE AMPERITY;
Get sample CSV file¶
A sample CSV file may be output from Snowflake, and then uploaded to Amperity while adding a feed.
Get a sample CSV file from Snowflake
Open the Snowflake web UI.
Select the Worksheets tab.
From the object browser, select a table that is to be sent to Amperity.
In the SQL editor, run the following query:
SELECT * FROM <table-name> LIMIT 1000
Click the Download Results button.
In the Export Results dialog box, click the Export button.
Add feed¶
A feed defines how data should be loaded into a domain table, including which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transaction data.
Note
A feed must be added for each table that is pulled from Snowflake, including all tables that contain customer records and interaction records, along with any other tables that will be used to support downstream workflows. Use the CSV files that were pulled to Amperity to define the schema for each feed.
To add a feed
From the Sources tab, click Add Feed.
Add a data source named Snowflake.
Tip
Use the same data source for all Snowflake table objects that are to be sent to Amperity.
Enter the name of the Snowflake table object as the name of the feed.
Upload a CSV file.
Click Continue.
Choose the primary key, apply semantics, adjust field types, verify any fields that are marked required, and then click Activate.
Add courier¶
A courier brings data from an external system to Amperity.
Important
Consolidate all Snowflake feeds to a single courier as a fileset.
You have two options for how to pull data from Snowflake:
Pull table objects¶
You can pull entire tables from Snowflake to Amperity.
To pull Snowflake table objects
From the Sources page, click Add Courier. The Add Source page opens.
Find, and then click the icon for Snowflake. The Add Courier page opens.
This automatically selects aws-snowflake as the Credential Type.
Enter the username and password. This should be the Amperity username and password created in Snowflake for the user security object.
Add the IAM Access Key and IAM Secret Key for the IAM role with permission to access the Amazon S3 bucket used for Snowflake stages. Add the Amazon Resource Name (ARN) if the IAM role is provisioned for cross-account access.
Add Snowflake settings for the account name, the region identifier, and a square-bracketed list of Snowflake tables to be sent to Amperity:
[ "table.name.one", "table.name.two" "table.name.etc" ]
For each table to be sent to Amperity, define the load operations using the feed ID for the feed that is associated with that table. (A sketch of load operations follows these steps.)
Click Save.
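For example, load operations for the first two tables in the list above might look similar to the following. The feed IDs shown here are placeholders; replace each with the feed ID for the feed that is associated with that table:
{
  "FEED_ID_ONE": [
    { "type": "load", "file": "table.name.one" }
  ],
  "FEED_ID_TWO": [
    { "type": "load", "file": "table.name.two" }
  ]
}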
Pull columns using Snowflake query syntax¶
You can use Snowflake query syntax to build a query that runs against one (or more) Snowflake tables.
Important
A Snowflake query name must be unique. The name may contain alphanumeric characters (A-Z, a-z), underscores, hyphens, and/or periods. For example: “Query_name.12-345a”.
To pull columns using Snowflake query syntax
From the Sources page, click Add Courier. The Add Source page opens.
Find, and then click the icon for Snowflake. The Add Courier page opens.
This automatically selects aws-snowflake as the Credential Type.
Enter the username and password. This should be the Amperity username and password created in Snowflake for the user security object.
Add the IAM Access Key and IAM Secret Key for the IAM role with permission to access the Amazon S3 bucket used for Snowflake stages. Add the Amazon Resource Name (ARN) if the IAM role is provisioned for cross-account access.
Add Snowflake settings for the account name, the region identifier, and a square-bracketed list of one (or more) Snowflake tables against which a query that uses Snowflake query syntax will be run, the results of which are sent to Amperity:
[ "table.name.one", "table.name.two", "table.name.etc" ]
Caution
The courier will download both query results AND all listed tables.
Click “Add Snowflake query”. In the expanded box, provide a unique query name. A query name may contain alphanumeric characters (A-Z, a-z), underscores, hyphens, and/or periods. For example: “Query_name.12-345a”.
Caution
You must add a blank line to the end of the Snowflake query.
Use Snowflake query syntax to build a query to run against a table that is to be pulled to Amperity. (A sketch of such a query follows these steps.)
Important
The name of the query must be added to the file parameter within the load operations. For example:
"FEED_ID": [ { "type": "load", "file": "Query_name.12-345a" }
For each table to be sent to Amperity, define the load operations using the feed ID for the feed that is associated with that table.
Click Save.
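The following is a minimal sketch of a Snowflake query that could be added in the steps above. The table and column names are hypothetical placeholders; note the required blank line at the end of the query:
SELECT
  customer_id,
  email,
  updated_at
FROM table.name.one
WHERE updated_at >= DATEADD(day, -30, CURRENT_DATE)
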
Run courier manually¶
Run the courier again. This time, because the load operations are present and the feeds are configured, the courier will pull data from Snowflake.
To run the courier manually
From the Sources tab, open the menu for the courier with updated load operations that is configured for Snowflake, and then select Run. The Run Courier dialog box opens.
Select the load option, either for a specific time period or all available data. Actual data will be loaded to a domain table because the feed is configured.
Click Run.
This time the notification will return a message similar to:
Completed in 5 minutes 12 seconds
Add to courier group¶
A courier group is a list of one (or more) couriers that are run as a group, either ad hoc or as part of an automated schedule. A courier group can be configured to act as a constraint on downstream workflows.
After the Snowflake courier is configured correctly and you have verified that all tables are being collected successfully, you can add the courier to a courier group and automate it.
To add the courier to a courier group
From the Sources tab, click Add Courier Group. This opens the Create Courier Group dialog box.
Enter the name of the courier. For example: “Snowflake”.
Add a cron string to the Schedule field to define a schedule for the orchestration group.
A schedule defines the frequency at which a courier group runs. All couriers in the same courier group run as a unit and all tasks must complete before a downstream process can be started. The schedule is defined using cron.
Cron syntax specifies the fixed time, date, or interval at which cron will run. Each line represents a job, and is defined like this:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * * command to execute
For example, 30 8 * * * represents “run at 8:30 AM every day” and 30 8 * * 0 represents “run at 8:30 AM every Sunday”. Amperity validates your cron syntax and shows you the results. You may also use crontab guru to validate cron syntax.
Set Status to Enabled.
Specify a time zone.
A courier group schedule is associated with a time zone. The time zone determines the point at which a courier group’s scheduled start time begins. A time zone should be aligned with the time zone of the system from which the data is being pulled.
Use the Use this time zone for file date ranges checkbox to use the selected time zone to look for files. If unchecked, the courier group will use the current time in UTC to look for files to pick up.
Note
The time zone that is chosen for a courier group schedule should consider every downstream business process that requires the data, as well as the time zone(s) in which the consumers of that data will operate.
Add at least one courier to the courier group. Select the name of the courier from the Courier drop-down. Click + Add Courier to add more couriers.
Click Add a courier group constraint, and then select a courier group from the drop-down list.
A wait time is a constraint placed on a courier group that defines an extended time window for data to be made available at the source location.
Important
A wait time is not required for a bridge.
A courier group typically runs on an automated schedule that expects customer data to be available at the source location within a defined time window. However, in some cases, the customer data may be delayed and isn’t made available within that time window.
For each courier group constraint, apply any offsets.
A courier can be configured to look for files within range of time that is older than the scheduled time. The scheduled time is in Coordinated Universal Time (UTC), unless the “Use this time zone for file date ranges” checkbox is enabled for the courier group.
This range is typically 24 hours, but may be configured for longer ranges. For example, it’s possible for a data file to be generated with a correct file name and datestamp appended to it, but for that datestamp to represent the previous day because of how an upstream workflow is configured. A wait time helps ensure that the data at the source location is recognized correctly by the courier.
Warning
This range of time may affect couriers in a courier group whether or not they run on a schedule. A manually run courier group may not take its schedule into consideration when determining the date range; only the provided input day(s) to load data from are used as inputs.
Click Save.