Pull from Snowflake on Amazon AWS

Snowflake is an analytic data warehouse that is fast, easy to use, and flexible. Snowflake uses a SQL database engine that is designed for the cloud. Snowflake can provide tables as a data source to Amperity.

Snowflake can provide data to Amperity in the following ways:

  1. Stages configured in Snowflake

  2. Sharing directly with Amperity

This topic describes the steps that are required to pull databases and tables to Amperity from Snowflake:

  1. Get details

  2. Configure Snowflake objects

  3. Configure Snowflake storage integration

  4. Configure Snowflake tables

  5. Get sample CSV files

  6. Add feed

  7. Add courier

  8. Run courier

  9. Add to courier group

Get details

Snowflake requires the following configuration details:

  1. The username and password.

    Tip

    Most users are configured with an initial password that must be changed. Create the user, manually change the password, and then use the updated password to configure this data source.

  2. The IAM access key and secret key for the dedicated IAM role for use by Amperity.

  3. The Amazon Resource Name (ARN) for a role with cross-account access. (This is only required when such a role is in use.)

  4. The account name.

  5. The region in which Snowflake is located.

    Note

    The Amperity instance of Snowflake should be located in the same region as the customer’s instance of Snowflake.

  6. The stage name.

  7. A list of table names in the Snowflake database to be pulled to Amperity.

    Tip

    Download each table from Snowflake as a CSV file with UTF-8 encoding, and then use it as a sample file during feed creation.
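
Before using a downloaded sample during feed creation, you can confirm that the file is valid UTF-8 and has a readable header row. A minimal sketch using Python's standard library (the function name and file path are illustrative, not part of Amperity):

```python
import csv

def read_sample_header(path: str) -> list[str]:
    """Return the header row of a sample CSV, failing fast if the
    file is not valid UTF-8 (the encoding the sample should use)."""
    with open(path, encoding="utf-8") as f:  # raises UnicodeDecodeError on bad encoding
        return next(csv.reader(f))
```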

Snowflake secure data sharing

Snowflake secure data sharing is not enabled by default. If your organization wants to use secure data sharing, your instance of Snowflake must be in the same region as the Amperity instance.

Amperity supports the following regions:

Cloud platform     Regions

Amazon AWS         aws-us-west-2, aws-us-east-1

Microsoft Azure    az-us-east-2

Note

If your region is not listed, ask your Amperity representative to request that Amperity engineering add support for that region.

Configure objects

Amperity requires access to the customer’s instance of Snowflake. This requires narrow read permissions to a warehouse object in Snowflake via a role and a user.

The following objects must be created in the customer’s instance of Snowflake: a warehouse, a role, and a user.

The role must be granted permission to the warehouse via GRANT USAGE and the user must be added to the role via GRANT ROLE.

To configure Snowflake objects

To configure Snowflake objects use a Data Definition Language (DDL) command similar to:

CREATE ROLE AMPERITY;

CREATE WAREHOUSE AMPERITY_WH WITH
  WAREHOUSE_SIZE = 'XSMALL'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 600
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 2
  SCALING_POLICY = 'ECONOMY';

GRANT USAGE ON WAREHOUSE AMPERITY_WH TO ROLE AMPERITY;

CREATE USER amperity
  PASSWORD = '********************'
  DEFAULT_WAREHOUSE = AMPERITY_WH
  DEFAULT_ROLE = AMPERITY;

GRANT ROLE AMPERITY TO USER amperity;
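
Because the statements above vary only in the object names, they can be rendered from a template when provisioning more than one environment. A minimal sketch in Python (the helper is illustrative; the names and settings are the examples from the DDL above):

```python
def object_ddl(role: str, warehouse: str, user: str, password: str) -> str:
    """Render the role, warehouse, and user DDL, plus the grants
    that connect them, as described above."""
    return "\n".join([
        f"CREATE ROLE {role};",
        f"CREATE WAREHOUSE {warehouse} WITH"
        f" WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD'"
        f" AUTO_SUSPEND = 600 AUTO_RESUME = TRUE"
        f" MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 2"
        f" SCALING_POLICY = 'ECONOMY';",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role};",
        f"CREATE USER {user} PASSWORD = '{password}'"
        f" DEFAULT_WAREHOUSE = {warehouse} DEFAULT_ROLE = {role};",
        f"GRANT ROLE {role} TO USER {user};",
    ])
```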

Configure storage integration

Amperity can ingest any type of data structure or data format that can be output from Snowflake. This output should be written to an Amazon S3 bucket, which requires an IAM policy with permissions to a tenant-specific path on Amazon S3 and a role that is authorized to write to that bucket. This allows the previously configured Amperity role to write table output from Snowflake to the Amazon S3 bucket.

Note

You may need to mock the IAM role for the external ID and IAM user name. After the named stage is created in Snowflake, update the IAM role for the external ID and IAM user name.

The IAM role is unique to the Snowflake account. The external ID is unique to the Snowflake stage.

For each database and schema to be output to the Amazon S3 bucket, specify a database and schema via USE, create a stage via CREATE, and then GRANT USAGE.

To configure Snowflake storage integration

To configure Snowflake storage integration use a Data Definition Language (DDL) command similar to:

USE DATABASE <database-name>;
USE SCHEMA <schema-name>;
CREATE STAGE <stage-name>
  URL = 's3://<bucket>/<tenant-specific-path>/'
  CREDENTIALS = (AWS_ROLE = 'AMPERITY')
  ENCRYPTION = (TYPE = 'AWS_SSE_S3');
GRANT USAGE ON STAGE <stage-name> TO ROLE AMPERITY;
DESC STAGE <stage-name>;

Configure tables

For each table to be sent to Amperity, the Amperity role in Snowflake must be given permission. This requires permissions on both the database and the schema that contain the table. Usage rights to the database and schema do not grant access to the entire database or schema. Additional table-specific grants are required via a combination of GRANT USAGE, USE, and GRANT SELECT.

To configure Snowflake tables

To configure Snowflake tables use a Data Definition Language (DDL) command similar to:

GRANT USAGE ON DATABASE <database-name> TO ROLE AMPERITY;
USE DATABASE <database-name>;
GRANT USAGE ON SCHEMA <schema-name> TO ROLE AMPERITY;
USE SCHEMA <schema-name>;
GRANT SELECT ON TABLE <table-name> TO ROLE AMPERITY;
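
Because these grants must be repeated for every table that will be pulled, they can be generated from the list of table names gathered earlier. A minimal sketch in Python (the helper is illustrative; the role name AMPERITY matches the examples above):

```python
def table_grants(database: str, schema: str, tables: list[str],
                 role: str = "AMPERITY") -> list[str]:
    """Render the database/schema usage grants and the per-table
    SELECT grants described above, one statement per list entry."""
    statements = [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"USE DATABASE {database};",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role};",
        f"USE SCHEMA {schema};",
    ]
    statements += [f"GRANT SELECT ON TABLE {t} TO ROLE {role};" for t in tables]
    return statements
```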

Get sample CSV file

A sample CSV file may be output from Snowflake, and then uploaded to Amperity while adding a feed.

Get a sample CSV file from Snowflake

  1. Open the Snowflake web UI.

  2. Select the Worksheets tab.

  3. From the object browser, select a table that is to be sent to Amperity.

  4. In the SQL editor, run the following query:

    SELECT * FROM <table-name> LIMIT 1000
    
  5. Click the Download Results button.

  6. In the Export Results dialog box, click the Export button.

Add feed

A feed defines how data should be loaded into a domain table, including which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transactions data.

Note

A feed must be added for each table that is pulled from Snowflake, including all tables that contain customer records and interaction records, along with any other tables that will be used to support downstream workflows. Use the CSV files that were pulled to Amperity to define the schema for each feed.

To add a feed

  1. From the Sources tab, click Add Feed.

  2. Add a data source named Snowflake.

    Tip

    Use the same data source for all Snowflake table objects that are to be sent to Amperity.

  3. Enter the name of the Snowflake table object as the name of the feed.

  4. Upload a CSV file.

  5. Click Continue.

  6. Choose the primary key, apply semantics, adjust field types, verify any fields that are marked required, and then click Activate.

Add courier

A courier brings data from an external system to Amperity. A courier relies on a feed to know which fileset to bring to Amperity for processing.

Consolidate all Snowflake feeds into a single courier as a fileset.

To create a courier for all Snowflake table objects

  1. From the Sources tab, click Add Courier. The Add Source page opens.

  2. Find, and then click the icon for Snowflake. The Add Courier page opens.

    This automatically selects aws-snowflake as the Credential Type.

  3. Enter the username and password. This should be the Amperity username and password created in Snowflake for the user security object.

  4. Add the IAM Access Key and IAM Secret Key for the IAM role with permission to access the Amazon S3 bucket used for Snowflake stages. Add the Amazon Resource Name (ARN) if the IAM role is provisioned for cross-account access.

  5. Add Snowflake settings for the account name, the region identifier, and a square-bracketed list of Snowflake tables to be sent to Amperity:

    [
      "table.name.one",
      "table.name.two",
      "table.name.etc"
    ]
    
  6. Enter the name of the Snowflake stage.

  7. Optional. Use a query to select specific columns from a Snowflake table prior to pulling those results to Amperity. Enable the “Use queries with Snowflake tables” option, and then click “Add Snowflake query”. In the expanded box, provide a unique query name. A query name may contain alphanumeric characters (A-Z, a-z), underscores, hyphens, and/or periods. For example: “Query_name.12-345a”.

    Use Snowflake query syntax to build a query to run against a table that is to be pulled to Amperity.

    Important

    The name of the query must be added to the file parameter within the load operations. For example:

    "FEED_ID": [
      {
        "type": "load",
        "file": "Query_name.12-345a"
      }
    
  8. For each table to be sent to Amperity, define the load operations using the feed ID for the feed that is associated with that table.

  9. Click Save.
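
The table list and the load operations are both JSON fragments, so a quick syntax check before saving can catch a missing comma or bracket. A minimal sketch (the helper is illustrative; the feed ID and query name are the examples from above):

```python
import json

def check_courier_settings(tables_json: str, load_ops_json: str):
    """Parse the table list and load operations, verifying that
    every load operation names a file (or query) to load."""
    tables = json.loads(tables_json)
    assert all(isinstance(t, str) for t in tables)
    load_ops = json.loads(load_ops_json)
    for feed_id, ops in load_ops.items():
        for op in ops:
            assert op.get("type") == "load" and "file" in op
    return tables, load_ops
```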

Run courier manually

Run the courier again. This time, because the load operations are present and the feeds are configured, the courier will pull data from Snowflake.

To run the courier manually

  1. From the Sources tab, open the menu for the courier with updated load operations that is configured for Snowflake, and then select Run. The Run Courier dialog box opens.

  2. Select the load option, either for a specific time period or all available data. Actual data will be loaded to a domain table because the feed is configured.

  3. Click Run.

    This time the notification will return a message similar to:

    Completed in 5 minutes 12 seconds
    

Add to courier group

A courier group is a list of one (or more) couriers that are run as a group, either ad hoc or as part of an automated schedule. A courier group can be configured to act as a constraint on downstream workflows.

After the Snowflake courier is configured correctly and you have verified that all tables are being collected successfully, you can add the courier to a courier group and automate it.

To add the courier to a courier group

  1. From the Sources tab, click Add Courier Group. This opens the Create Courier Group dialog box.

  2. Enter the name of the courier group. For example: “Snowflake”.

  3. Add a cron string to the Schedule field to define a schedule for the courier group.

    A schedule defines the frequency at which a courier group runs. All couriers in the same courier group run as a unit and all tasks must complete before a downstream process can be started. The schedule is defined using cron.

    Cron syntax specifies the fixed time, date, or interval at which cron will run. Each line represents a job, and is defined like this:

    ┌───────── minute (0 - 59)
    │ ┌─────────── hour (0 - 23)
    │ │ ┌───────────── day of the month (1 - 31)
    │ │ │ ┌────────────── month (1 - 12)
    │ │ │ │ ┌─────────────── day of the week (0 - 6) (Sunday to Saturday)
    │ │ │ │ │
    * * * * * command to execute
    

    For example, 30 8 * * * represents “run at 8:30 AM every day” and 30 8 * * 0 represents “run at 8:30 AM every Sunday”. Amperity validates your cron syntax and shows you the results. You may also use crontab guru to validate cron syntax.

  4. Set Status to Enabled.

  5. Specify a time zone.

    A courier group schedule is associated with a time zone. The time zone determines the point at which a courier group’s scheduled start time begins. The time zone should be aligned with the time zone of the system from which the data is being pulled.

    Note

    The time zone that is chosen for a courier group schedule should consider every downstream business process that requires the data, along with the time zone(s) in which the consumers of that data operate.

  6. Set SLA? to False. (You can change this later after you have verified the end-to-end workflows.)

  7. Add at least one courier to the courier group. Select the name of the courier from the Courier drop-down. Click + Add Courier to add more couriers.

  8. Click Add a courier group constraint, and then select a courier group from the drop-down list.

    A wait time is a constraint placed on a courier group that defines an extended time window for data to be made available at the source location. A courier group typically runs on an automated schedule that expects customer data to be available at the source location within a defined time window. However, in some cases, the customer data may be delayed and isn’t made available within that time window.

  9. For each courier group constraint, apply any offsets.

    An offset is a constraint placed on a courier group that defines a range of time that is older than the scheduled time, within which a courier group will accept customer data as valid for the current job.

    A courier group offset is typically set to be 24 hours. For example, it’s possible for customer data to be generated with a correct file name and datestamp appended to it, but for that datestamp to represent the previous day because of the customer’s own workflow. An offset ensures that the data at the source location is recognized by the courier as the correct data source.

    Warning

    An offset affects couriers in a courier group whether or not they run on a schedule.

  10. Click Save.
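
The schedule and offset behavior described above can be sketched in code. The following Python sketch (illustrative, not part of Amperity) matches a datetime against a simple five-field cron expression, supporting only plain numbers and * (enough for the examples above), and computes the window of datestamps that an offset makes acceptable:

```python
from datetime import datetime, timedelta

def cron_matches(expr: str, when: datetime) -> bool:
    """True when a datetime matches a 5-field cron expression.
    Supports only '*' and plain numbers; cron counts Sunday as 0."""
    fields = expr.split()
    values = [when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

def accepted_window(scheduled: datetime, offset_hours: int = 24):
    """Range of datestamps a courier group accepts as valid for the
    current job, given an offset constraint (typically 24 hours)."""
    return scheduled - timedelta(hours=offset_hours), scheduled
```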