About Couriers

A courier brings data from an external system to Amperity. A courier relies on a feed to know which fileset to bring to Amperity for processing.

What a courier does:

  1. Checks if data is available at the source location.

  2. Collects data from the source location, and then pulls that data to Amperity.

What a courier needs:

  1. Access to the source location. Most data sources (Amazon S3, Azure Blob Storage, Google Cloud Storage, or any SFTP site) allow the use of many file formats, while others may use Snowflake or REST APIs.

  2. A location from which to copy data.

  3. An associated feed.

  4. A file format (CSV, TSV, Apache Parquet, etc.), along with additional details for compression, archive, and encryption.

  5. A combination of load settings and load operations. The exact combination of settings and operations depends on the data source and the types of files to be pulled to Amperity.

File couriers

A file data source can provide files to Amperity in just about any file format, such as CSV, JSON, Apache Parquet, Apache Avro, PSV, and TSV. Locations from which file data sources can be pulled include Amazon S3, Azure Blob Storage, Google Cloud Storage, and any SFTP site.

Load settings

File data sources define load settings in two parts:

  1. A list of files that should be pulled to Amperity.

  2. A list of load operations that associate each file with a feed.

The exact combination of files and load operations depends on the data source from which data is made available to Amperity.

Load settings define the location of a data source, its type, and how it should be processed by Amperity. The syntax for file load settings is similar to:

[
  {
    "object/file-pattern": "'CUSTOMER/ENV/FILENAME_'MM-dd-yyyy'.csv'",
    "object/type": "file",
    "object/land-as": {
      "file/tag": "FILE_TAG",
      "file/content-type": "text/csv",
      "file/header-rows": 1
    }
  },
  {
    "object/file-pattern": "'ARCHIVED/FILENAME_'MM-dd-yyyy'.zip'",
    "object/type": "archive",
    "archive/contents": {
      "FILENAME": {
        "subobject/land-as": {
          "file/tag": "FILENAME_TAG",
          "file/content-type": "text/csv"
        }
      }
    }
  },
  {
    "object/file-pattern": "'ARCHIVED/FILENAME_'MM-dd-yyyy'.zip'",
    "object/type": "archive",
    "object/land-as": {
      "file/tag": "FILENAME_TAG",
      "file/content-type": "text/csv",
      "file/header-rows": 1
    }
  }
 ]

Each filedrop load setting must specify the file pattern, which is the path to the file, its filename, a date stamp, and a file extension. The rest of the load settings block must match the file; for example, the content type for "some-file.csv" must be "text/csv". An archive must specify both the archive and the file contained within it. If the file is archived as "some-file.zip", then the "object/type" would be "archive" and the content type of the file within it would be "text/csv".

If an archive contains only a single file or if all the files within the archive have the same file tag, content type, and other settings, then "archive/contents" can be omitted and "object/land-as" can be specified instead. The files within the archive will all use the specified "object/land-as" settings.

File patterns

A courier looks for objects in a filedrop location using a combination of the path to a directory, the name of a file, and a date. These are defined by the "object/file-pattern" setting for each object. A courier runs based on a date or a date range, and then looks for files in the filedrop location for that date or date range.

A file pattern may use a combination of literal strings (wrapped in single quotes), wildcard characters (*) within literal strings, and date components, with path segments separated by forward slashes.

Wildcards

A wildcard matches zero or more characters, up to the next forward-slash character.

Note

When a file pattern with a wildcard matches more than one file for a given date or date range, the matched files are loaded in a way that guarantees per-day ordering. However, if your courier uses an ingest query, ascending lexicographical ordering by file is not guaranteed or preserved within a single day's files.

Caution

Wildcards may not be used for files contained within a zip archive.

Examples

The following example shows using a wildcard at the end of a file pattern:

'files/'yyyy'/'MM'/'dd'/customers-*.csv'

will match any of these files:

  • /customers-.csv

  • /customers-1.csv

  • /customers-hello-world.csv

and will not match any of these:

  • /customers-.csv.0

  • /customers-0.json

  • /customers-0/1/file.csv

  • /customers.csv
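The wildcard rule above can be sketched in Python by translating each * into a regex that matches any run of non-slash characters. The helper below is purely illustrative (it is not part of Amperity):

```python
import re

def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Convert a courier file pattern (after date substitution) to a regex.

    Each * matches zero or more characters but never crosses a
    forward slash, mirroring the wildcard rule described above.
    """
    parts = (re.escape(p) for p in pattern.split("*"))
    return re.compile("[^/]*".join(parts) + r"$")

rx = wildcard_to_regex("files/2020/04/10/customers-*.csv")

# Matches:
assert rx.match("files/2020/04/10/customers-1.csv")
assert rx.match("files/2020/04/10/customers-.csv")
assert rx.match("files/2020/04/10/customers-hello-world.csv")
# Does not match (wrong extension, extra path level, missing hyphen):
assert not rx.match("files/2020/04/10/customers-0.json")
assert not rx.match("files/2020/04/10/customers-0/1/file.csv")
assert not rx.match("files/2020/04/10/customers.csv")
```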

The following example shows using multiple wildcards:

{
  "object/type": "file",
  "object/file-pattern": "'responsys/outbound_files/*_STATE_'yyyyMMdd'_*.txt.gpg'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "launch",
    "file/content-type": "text/csv"
  }
},

Literal strings

A literal string must exactly match characters in the file path, except where wildcard characters appear within the literal string. Wrap literal strings in single quotes so that they are not interpreted as Joda-Time patterns. For example:

  • 'files/'

  • '/'

  • 'MM-dd-YYYY'

Date components

Date components act as placeholders for months, days, and years. Real values are applied when the courier runs on a given date or date range. Date components must match Joda-Time pattern-based formatting, but should generally be limited to the following patterns:

  Pattern   Meaning          Examples

  yyyy      4-digit year     2020, 2021, …
  MM        2-digit month    01, 02, … 12
  dd        2-digit day      01, 02, … 31

A courier that runs using this pattern:

'files/'yyyy'/'MM'/'dd'/customers-*.csv'

when run on April 10, 2020 will look for files at 'files/2020/04/10/customers-*.csv' and will return any files that match.
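As an illustration, that substitution can be sketched in Python by mapping the Joda-Time patterns above onto strftime codes. The render_pattern helper is hypothetical, not an Amperity API:

```python
from datetime import date

# Map the Joda-Time patterns used above onto Python strftime codes.
JODA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}

def render_pattern(pattern: str, run_date: date) -> str:
    """Substitute date components in a courier-style file pattern.

    Segments wrapped in single quotes are literals; everything outside
    the quotes is treated as a date component.
    """
    out = []
    for i, segment in enumerate(pattern.split("'")):
        if i % 2 == 1:          # odd segments are quoted literals
            out.append(segment)
        elif segment:           # even segments are date components
            out.append(run_date.strftime(JODA_TO_STRFTIME[segment]))
    return "".join(out)

assert render_pattern("'files/'yyyy'/'MM'/'dd'/customers-*.csv'",
                      date(2020, 4, 10)) == "files/2020/04/10/customers-*.csv"
```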

File compression / archive

Amperity supports the following compression and archive types:

GZIP

{
  "object/type": "file",
  "object/optional": false,
  "object/file-pattern": "'ARCHIVED/FILENAME_'MM-dd-yyyy'.csv.gz'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "FILENAME_TAG",
    "file/content-type": "text/csv"
  }
}

TAR

{
  "object/file-pattern": "'ARCHIVED/FILENAME_'MM-dd-yyyy'.tar'",
  "object/type": "archive",
  "archive/contents": {
    "FILENAME": {
      "subobject/land-as": {
        "file/tag": "FILENAME_TAG",
        "file/content-type": "text/csv"
      }
    }
  }
}

ZIP

{
  "object/file-pattern": "'ARCHIVED/FILENAME_'MM-dd-yyyy'.zip'",
  "object/type": "archive",
  "archive/contents": {
    "FILENAME": {
      "subobject/land-as": {
        "file/tag": "FILENAME_TAG",
        "file/content-type": "text/csv"
      }
    }
  }
}

Pretty Good Privacy (PGP)

Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.

PGP encryption may be applied to files sent to Amperity to improve data security and to help ensure file integrity and completeness. Amperity recommends:

  • 4096-bit keys

  • Keys protected by a strong passphrase

  • One PGP key per tenant (minimum); one PGP key per system (recommended)

Amperity Support will generate PGP keys (both public and private key-pairs) to use when generating PGP encrypted files to be sent to Amperity. Key pairs are created in the same cloud (Amazon AWS or Microsoft Azure) in which the customer's tenant is located.

Amperity will provide to the customer the public key using SnapPass. The customer must use that key to encrypt files prior to adding them to the filedrop location. Files that are encrypted using PGP should be compressed prior to encryption. (Compression applied after encryption does not reduce the size of the file.) Amperity will use the private key to decrypt files prior to loading them.

Important

There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”

Input examples

The following examples show how files input to Amperity are unpacked, depending on various combinations of encryption, compression type, and file format. All examples use yyyy_MM_dd for the date format.

For single files

PGP, TGZ, CSV

  1. Input to Amperity: table_name_yyyy_MM_dd.tgz.pgp

  2. After decryption: table_name_yyyy_MM_dd.tgz

  3. After decompression: table_name_yyyy_MM_dd.csv

PGP, GZip, TAR, CSV

  1. Input to Amperity: table_name_yyyy_MM_dd.csv.tar.gz.pgp

  2. After decryption: table_name_yyyy_MM_dd.csv.tar.gz

  3. After decompression: table_name_yyyy_MM_dd.csv.tar

  4. After the archive is opened: table_name_yyyy_MM_dd.csv

PGP, TAR, Apache Parquet

  1. Input to Amperity: table_name_yyyy_MM_dd.tar.pgp

  2. After decryption: table_name_yyyy_MM_dd.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.

For multiple files

PGP, TAR, Apache Parquet

  1. Input to Amperity: input_name_yyyy_MM_dd.parquet.tar.pgp

  2. After decryption: input_name_yyyy_MM_dd.parquet.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.

PGP, TGZ, CSV

  1. Input to Amperity: input_name_yyyy_MM_dd.csv.tgz.pgp

  2. After decryption: input_name_yyyy_MM_dd.csv.tgz

  3. After decompression: table_name.csv, where all tables that were input are located within a single directory.

API couriers

The configuration for an API data source varies, depending on the file format and other configuration details. API data sources include Campaign Monitor, Google Analytics, Salesforce Sales Cloud, and Zendesk.

Snowflake couriers

A Snowflake data source provides a list of tables that are consolidated into a fileset. Snowflake data sources include Snowflake itself, along with any Fivetran data source, such as Klaviyo, Shopify, Kustomer, and HubSpot.

Table lists

A table list defines the list of tables to be pulled to Amperity from Snowflake.

[
  "AMPERITY_A1BO987C.ACME.CAMPAIGN",
  "AMPERITY_A1BO987C.ACME.LIST",
  "AMPERITY_A1BO987C.ACME.CONTACT",
]

Stage names

A stage defines the location of objects that are available within Snowflake.

AMPERITY_A1BO987C.ACME.ACME_STAGE

Load operations

Load operations associate each table in the list of tables with a feed. (The initial setup for this courier will use an incorrect feed ID, such as df-xxxxxx.)

{
  "df-xxxxx": [
    {
      "type": "load",
      "file": "AMPERITY_A1BO987C.ACME.CAMPAIGN"
    }
  ],
  "df-xxxxx": [
    {
      "type": "load",
      "file": "AMPERITY_A1BO987C.ACME.LIST"
    }
  ],
  "df-xxxxx": [
    {
      "type": "load",
      "file": "AMPERITY_A1BO987C.ACME.CONTACT"
    }
  ]
}

Load operation types

A fileset is a group of files that are processed as a unit by a single courier. A fileset defines each file individually by name, date stamp, file format, and load operation. A courier expects all files in a fileset to be available for processing, unless a file is specified as optional.
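The expectation that every non-optional file is present before processing can be pictured with a small check. This helper is illustrative only, not part of Amperity:

```python
def missing_required(fileset, available):
    """Return the tags of required files that are not yet available.

    A courier expects every non-optional file in its fileset to be
    present before processing begins.
    """
    return [f["tag"] for f in fileset
            if not f.get("optional", False) and f["tag"] not in available]

# A fileset with one required and one optional file:
fileset = [{"tag": "customers"}, {"tag": "orders", "optional": True}]
assert missing_required(fileset, {"orders"}) == ["customers"]
assert missing_required(fileset, {"customers"}) == []
```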

Each file in a fileset must be associated with one of the following load operation types:

Empty

An empty load operation will bring files to Amperity, but not try to load those files into a feed. An empty load operation is ideal for bringing sample files to Amperity prior to configuring the feed that defines the schema within Amperity for a data source. Use the sample file while configuring the feed, and then update the load operation to match the configuration requirements for the associated file type.

{}

Tip

You cannot use an empty load operation for files that require the use of an ingest query to transform the data prior to it being made available to the feed.

For example, a JSON file with nested data must use an ingest query to flatten the file. A feed cannot use a JSON file with nested data as a sample file. And a courier cannot run to successful completion unless the courier is configured with a feed ID.

In this type of situation, create a file outside of this workflow to use as the sample file for the feed. For example, use Databricks to generate a zero-row sample file, and then upload that file during feed creation.

Another option is to define the schema without using a sample file. Select the Don’t use sample file option when adding the feed, and then use the Add field button to define each field in the schema.

Incorrect feed ID

Instead of using an empty load operation you can use an obviously incorrect feed ID to pull files to the Amperity landing area. This approach uses the default load configuration, but sets the feed ID to a string that will not be available to the courier after feeds have been updated. For example, replace the digits with six "x" characters:

{
  "df-xxxxxx": [
    {
      "type": "truncate"
    },
    {
      "type": "load",
      "file": "campaign-members"
    }
  ]
}

This will return an error message similar to:

Error running load-operations task
Cannot find required feeds: "df-xxxxxx"

The load operation will pull the files to the landing area and make them available for use with defining a feed schema.

Load files

You can load the contents of a data file to a domain table as an UPSERT load operation based on the primary key in the table.

"OTHER_FEED_ID": [
  {
    "type": "load",
    "file": "OTHER_FILE_TAG",
  }
]

Load ingest query

Spark SQL is a high-performance SQL query engine that is used by Amperity to ingest data, create domain tables, and extend the outcome of the Stitch process in your customer 360 database.

Use Spark SQL to define all SQL queries related to the following areas of Amperity:

  • Ingesting data, including ingest queries

  • Processing data into domain tables

  • Building custom domain tables

  • Loading data into Stitch

  • Running Stitch

  • Loading the results of Stitch into the customer 360 database

  • Defining tables in the customer 360 database

Note

Spark SQL is used to define all SQL queries related to the Stitch process up to (and including) building the tables in the customer 360 database. Presto SQL is used to define SQL queries for segments. Why both?

  • Spark SQL performs better in more traditional processes like machine learning and ETL-like processes that are resource intensive.

  • Presto SQL performs better when running real-time queries against cloud datasets.

The configuration for an ingest query load operation depends on the data source against which the ingest query will run.

Truncate, then load

You can empty the contents of a table prior to loading a data file to a domain table as a load operation.

Note

A truncate operation is always run first, regardless of where it’s specified within the load operation.

"FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "FILE_NAME"
  }
],
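The truncate-first rule in the note above can be sketched as a simple reordering of the operation list. This is an illustration of the rule, not Amperity's implementation:

```python
def order_operations(ops):
    """Truncate operations run before any loads, regardless of their
    position in the load-operation list (a sketch of the rule above)."""
    return sorted(ops, key=lambda op: op["type"] != "truncate")

# Even with the load listed first, the truncate runs first:
ops = [{"type": "load", "file": "FILE_NAME"}, {"type": "truncate"}]
assert order_operations(ops)[0] == {"type": "truncate"}
```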

Examples

The following sections provide examples for load settings and load operations by data source and/or by file type:

Apache Avro

Apache Avro is a row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop ecosystem. Avro uses JSON to define data types and protocols, and serializes data in a compact binary format.

Load settings

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.avro'",
  "object/land-as": {
    "file/tag": "FILE_NAME",
    "file/content-type": "application/avro"
  }
}

Load operations

{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME",
    }
  ]
}

Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but provides more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.

Note

Apache Parquet files are almost always partitioned: a single logical Parquet file comprises multiple physical files in a directory structure, each of them representing a partition.

Parquet partitioning optionally allows data to be nested in a directory structure determined by the value of partitioning columns. Amperity only detects Parquet partition files one directory level below the configured file pattern. For example:

"path/to/file-yyyy-MM-dd.parquet/part-0000.parquet"

Load settings

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.parquet/'",
  "object/land-as": {
    "file/tag": "FILE_NAME",
    "file/content-type": "application/x-parquet"
  }
}

Load operations

{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME",
    }
  ]
}

Campaign Monitor

Campaign Monitor is an email marketing platform that tracks details related to email campaigns (opens, clicks, bounces, unsubscribes, spam complaints, and recipients) and email subscriber lists (active, unconfirmed, bounced, and deleted subscribers), and other details.

Load settings

The Campaign Monitor REST API has a clearly defined set of files that can be made available to Amperity. The load settings are built into Amperity by default.

{
   "open": "opens-file",
   "unsubscribe": "unsubscribes-file",
   "spam": "spam-file",
   "unsubscribed-subscriber": "unsubscribed-subscribers-file",
   "unconfirmed-subscriber": "unconfirmed-subscribers-file",
   "recipient": "recipients-file",
   "bounced-subscriber": "bounced-subscribers-file",
   "bounce": "bounces-file",
   "click": "clicks-file",
   "campaign": "campaigns-file",
   "deleted-subscriber": "deleted-subscribers-file",
   "suppression": "suppression-list-file",
   "subscriber-list": "list-stats-file",
   "active-subscriber": "active-subscribers-file"
 }

Load operations

"OPENS_FEED_ID": [
  {
    "type": "load",
    "file": "opens-file"
  }
],
"BOUNCED-SUBSCRIBERS_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "bounced-subscribers-file"
  }
],
"ACTIVE-SUBSCRIBERS_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "active-subscribers-file"
  }
],
"CLICKS_FEED_ID": [
  {
    "type": "load",
    "file": "clicks-file"
  }
],
"DELETED-SUBSCRIBERS_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "deleted-subscribers-file"
  }
],
"SUPPRESSION-LIST_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "suppression-list-file"
  }
],
"CAMPAIGNS_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
     "file": "campaigns-file"
  }
],
"RECIPIENTS_FEED_ID": [
  {
    "type": "load",
    "file": "recipients-file"
  }
],
"BOUNCES_FEED_ID": [
  {
    "type": "load",
    "file": "bounces-file"
  }
],
"LIST-STATS_FEED_ID": [
  {
    "type": "load",
    "file": "list-stats-file"
  }
],
"SPAM_FEED_ID": [
  {
    "type": "load",
    "file": "spam-file"
  }
],
"UNSUBSCRIBED-SUBSCRIBERS_FEED_ID": [
  {
    "type": "truncate"
  },
  {
    "type": "load",
    "file": "unsubscribed-subscribers-file"
  }
],
"UNSUBSCRIBES_FEED_ID": [
  {
    "type": "load",
    "file": "unsubscribes-file"
  }
]

CBOR

CBOR (Concise Binary Object Representation) is a binary data serialization format loosely based on JSON. Like JSON it allows the transmission of data objects that contain name–value pairs, but in a more concise manner. This increases processing and transfer speeds at the cost of human-readability.

Load settings for Amazon AWS

{
  "object/type": "file",
  "object/file-pattern": "'ingest/stream/TENANT/STREAM_ID/'yyyy-MM-dd'/'*'.cbor'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/ingest-pack+cbor"
  }
},

Load settings for Microsoft Azure

{
  "object/type": "file",
  "object/file-pattern": "'STREAM_ID/'yyyy-MM-dd'/'*'.cbor'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/ingest-pack+cbor"
  }
},

Load operations

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "rowTag": "row"
          },
          "schema": {
            "fields": [
              {
                "metadata": {},
                "name": "field-1",
                "type": "string",
                "nullable": true
              },
              ...
              {
                "metadata": {},
                "name": "nested-group-1",
                "type": {
                  "fields": [
                    {
                      "metadata": {},
                      "name": "field-a",
                      "type": "string",
                      "nullable": true
                    },
                    {
                      "metadata": {},
                      "name": "nested-group-a",
                      "type": {
                        "fields": [
                          ...
                        ],
                        "type": "struct"
                      },
                      "nullable": true
                    },
                    {
                      "metadata": {},
                      "name": "field-xyz",
                      "type": "string",
                      "nullable": true
                    }
                  ],
                  "type": "struct"
                },
                "nullable": true
              },
              ...
            ],
            "type": "struct"
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

Important

The "schema" must match the structure of the incoming file, including all nested groupings and data types. Set "nullable" to true to allow a field to contain NULL values. A CBOR file can have hundreds of fields. The ellipses (...) in this example represent locations within this example structure where additional fields may be present.

Tip

Set rowTag to the element in the CBOR file that should be treated as a row in a table. The default value is row.

CSV

A comma-separated values (CSV) file, defined by RFC 4180 , is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

Load settings with header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.csv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/csv"
  }
},

Load settings with non-standard quotes and separators

Some CSV files may use non-standard characters for quotes and separators, such as ' for quotes and \ for separators. If a CSV file contains non-standard characters, you must specify these characters in the courier load settings.

Load settings without header rows

A headerless CSV file does not contain a row of data that defines headers for the data set. When working with a headerless CSV file you can configure the load settings for the courier to accept headerless files, or you can use an ingest query when the data must be changed in some way prior to loading it to Amperity.

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.csv'",
  "object/land-as": {
     "file/header-rows": 0,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/csv"
  }
},
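To see how a non-standard quote character changes parsing, here is a minimal Python sketch using the standard csv module. It is analogous to overriding the quote character in courier load settings, not Amperity code:

```python
import csv
import io

# A CSV that quotes fields with ' instead of the standard " character.
data = "id,name\n1,'Smith, Jane'\n"

# Telling the parser about the non-standard quote keeps the embedded
# comma inside a single field.
rows = list(csv.reader(io.StringIO(data), quotechar="'"))
assert rows == [["id", "name"], ["1", "Smith, Jane"]]
```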

Load operations for feed

{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME",
    }
  ]
}

Load operations for ingest query

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "delimiter": ",",
            "escape": "\\",
            "multiline": "true",
            "quote": "\""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

Caution

Spark does not correctly implement RFC 4180 for escape characters in CSV files. The most common implementations of CSV files expect a double quote " as an escape character while Spark uses a backslash \. For more information about this issue view the SPARK-22236 issue within the Spark project.

You can override this behavior when working with RFC-compliant CSV files by specifying an escape character in the courier load operations using ' or " as the escape character.

For example:

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
      {
        "file": "FILE_NAME",
        "options": {
          "escape": "'"
        }
      }
    ],
    "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

If a CSV file uses \ as the delimiter, configure the load operation to specify an empty delimiter value, after which Spark will automatically apply the \ character as the delimiter.

For example:

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
      {
        "file": "FILE_NAME",
        "options": {
          "delimiter": ""
        }
      }
    ],
    "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

JSON

JavaScript Object Notation (JSON) is a language-independent data format that is derived from (and structured similarly to) JavaScript.

Load settings

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.json'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/json"
  }
},

Load operations

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

NDJSON

Newline-delimited JSON (NDJSON) is a data format for structured data that uses newlines to separate JSON values. Each line in an NDJSON file is a valid JSON value.
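Because every line is a complete JSON value, an NDJSON file can be parsed one line at a time with the standard json module, as this short sketch shows:

```python
import json

# Two records, one complete JSON value per line.
ndjson = '{"id": 1, "email": "a@example.com"}\n{"id": 2, "email": "b@example.com"}\n'

# Parse each non-empty line independently.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
assert [r["id"] for r in records] == [1, 2]
```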

Load settings with header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.ndjson'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/x-ndjson"
  }
},

Load settings without header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.ndjson'",
  "object/land-as": {
     "file/header-rows": 0,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/x-ndjson"
  }
},

Load operations

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

PSV

A pipe-separated values (PSV) file is a delimited text file that uses a pipe to separate values. A PSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by pipes. The use of the pipe as a field separator is the source of the name for this file format.

Load settings with header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/pipe-separated-values"
  }
},

Load settings with non-standard quotes and separators

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/quote": "*",
     "file/separator": ";"
     "file/content-type": "text/pipe-separated-values"
  }
},

Load settings without header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": p,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/pipe-separated-values"
  }
},

Load operations for feed

{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME",
    }
  ]
}

Load operations for ingest query

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
          "options": {
            "delimiter": "\|",
            "escape": "\\",
            "multiline": "true",
            "quote": "\""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

Salesforce Commerce Cloud

Salesforce Commerce Cloud is a multi-tenant, cloud-based commerce platform that empowers brands to create intelligent, unified buying experiences across all channels.

Load settings

The Salesforce Commerce Cloud REST API has a clearly defined set of files that can be made available to Amperity. The load settings are built into Amperity by default. (Salesforce Commerce Cloud was previously known as Demandware.)

[
  {
    "entity-type": "returns",
     "file-tag": "returns-file"
  },
  {
    "entity-type": "customers",
    "file-tag": "customers-file"
  },
  {
    "entity-type": "sites",
    "file-tag": "sites-file"
  },
  {
    "entity-type": "shipping_orders",
    "file-tag": "shipping_orders-file"
  },
  {
    "entity-type": "purchase_orders",
    "file-tag": "purchase_orders-file"
  },
  {
    "entity-type": "vendors",
    "file-tag": "vendors-file"
  },
  {
    "entity-type": "return_orders",
    "file-tag": "return_orders-file"
  },
  {
    "entity-type": "items",
    "file-tag": "items-file"
  },
  {
    "entity-type": "shipments",
    "file-tag": "shipments-file"
  },
  {
    "entity-type": "payments",
    "file-tag": "payments-file"
  },
  {
    "entity-type": "invoices",
    "file-tag": "invoices-file"
  },
  {
    "entity-type": "orders",
    "file-tag": "orders-file",
    "expand": [
      "invoices",
      "order_items"
    ],
    "json-path": "..."
  }
]

Load operations

{
  "SHIPMENTS_FEED_ID": [
    {
      "type": "load",
      "file": "shipments-file"
    }
  ],
  "PURCHASE_ORDERS_FEED_ID": [
    {
      "type": "load",
      "file": "purchase_orders-file"
    }
  ],
  "INVOICES_FEED_ID": [
    {
      "type": "load",
      "file": "invoices-file"
    }
  ],
  "SITES_FEED_ID": [
    {
      "type": "load",
      "file": "sites-file"
    }
  ],
  "RETURN_ORDERS_FEED_ID": [
    {
      "type": "load",
      "file": "return_orders-file"
    }
  ],
  "ORDERS_FEED_ID": [
    {
      "type": "load",
      "file": "orders-file"
    }
  ],
  "PAYMENTS_FEED_ID": [
    {
      "type": "load",
      "file": "payments-file"
    }
  ],
  "ITEMS_FEED_ID": [
    {
      "type": "load",
      "file": "items-file"
    }
  ],
  "VENDORS_FEED_ID": [
    {
      "type": "load",
      "file": "vendors-file"
    }
  ],
  "RETURNS_FEED_ID": [
    {
      "type": "load",
      "file": "returns-file"
    }
  ],
  "SHIPPING_ORDERS_FEED_ID": [
    {
      "type": "load",
      "file": "shipping_orders-file"
    }
  ],
  "CUSTOMERS_FEED_ID": [
    {
      "type": "load",
      "file": "customers-file"
    }
  ]
}

Salesforce Sales Cloud

Salesforce Sales Cloud brings customer information together into an integrated platform, and then provides access to thousands of applications through the AppExchange.

Load settings

The Sales Cloud integration allows you to use SQL patterns to specify which fields in an Object should be brought back to Amperity. Use the fields grouping to define which fields to bring back. Use * for all fields; otherwise specify a list of fields. Use where to specify values in the fields. The following sections show examples of Objects and the equivalent SQL queries used to define load settings.

[
  {
    "from": "ObjectName",
    "file/tag": "objectname-file",
    "fields": [
      "*"
    ]
  },
  {
    "from": "CustomObject",
    "file/tag": "custom-object-file",
    "fields": [
      "*"
    ]
  },
  {
    "from": "AnotherObject",
    "file/tag": "another-object-file",
    "fields": [
      "field-a.name",
      "field-b.name"
    ]
  },
  {
    "from": "Object2",
    "file/tag": "object2-file",
    "fields": [
      "field-one.name",
      "field-two.name"
    ],
    "where": "field-two = 'true'"
  }
]

Select all fields in an Object

The following SQL query:

SELECT * FROM Account

Is equivalent to the following load operation:

{
  "from": "Account",
  "file/tag": "account-file",
  "fields": [
    "*"
  ]
},

Select fields in an Object with specified values

The following SQL query:

SELECT Id, Name FROM Opportunity
WHERE Name = 'John'

Is equivalent to the following load operation:

{
  "from": "Opportunity",
  "file/tag": "opportunity-file",
  "fields": [
    "Id",
    "Name"
  ],
  "where": "Name = 'John'"
},

Select only direct reports

The following SQL query:

SELECT *, ReportsTo.Name FROM Contact

Is equivalent to the following load operation:

{
  "from": "Contact",
  "file/tag": "contact-file",
  "fields": [
    "*",
    "ReportsTo.Name"
  ]
},

Select only rows with a certain value in a custom column

The following SQL query:

SELECT * FROM CustomTable__c
WHERE CustomField__c = 34

Is equivalent to the following load operation:

{
  "from": "CustomObject",
  "file/tag": "custom-object-file",
  "fields": [
    "*"
  ],
  "where": "CustomField__c = 34"
},
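
The examples above follow a mechanical mapping from load settings to SQL. A rough sketch of that mapping in Python (the function name and structure are illustrative, not part of Amperity):

```python
def to_soql(entry):
    """Render a load-settings entry as the equivalent SELECT statement."""
    query = "SELECT {} FROM {}".format(", ".join(entry["fields"]), entry["from"])
    if "where" in entry:
        query += " WHERE " + entry["where"]
    return query

# A load-settings entry matching one of the examples above.
entry = {
    "from": "Opportunity",
    "file/tag": "opportunity-file",
    "fields": ["Id", "Name"],
    "where": "Name = 'John'"
}
print(to_soql(entry))  # SELECT Id, Name FROM Opportunity WHERE Name = 'John'
```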

Load operations

{
  "ACCOUNTS_FEED": [
    {
      "type": "truncate"
    },
    {
      "type": "load",
      "file": "accounts-file"
    }
  ],
  "CUSTOM_OBJECTS_FEED": [
    {
      "type": "truncate"
    },
    {
      "type": "load",
      "file": "custom-objects-file"
    }
 ]
}

Snowflake

Snowflake is an analytic data warehouse that is fast, easy to use, and flexible. Snowflake uses a SQL database engine that is designed for the cloud. Snowflake can provide tables as a data source to Amperity.

Load settings

For tables in a data warehouse, such as Snowflake, a list of table names must be specified.

[
   "table.name",
   "table.name"
]
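
The placeholder entries above stand for fully qualified table names. Assuming Snowflake's database.schema.table naming, the list might look like the following (the names are illustrative):

```json
[
   "marketing.public.customer",
   "marketing.public.orders"
]
```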

Load operations

{
  "FEED_ID": [
    {
      "type": "load",
      "file": "marketing.public.customer"
    }
  ]
}

Streaming JSON

Streaming JSON is a way to send increments of data using NDJSON (newline-delimited JSON) formatting within each increment. Each line in an NDJSON file is a valid JSON value.
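
As a quick illustration of the format (the records here are invented), each line of an NDJSON increment parses independently as JSON:

```python
import json

# Two records in NDJSON form: one complete JSON value per line.
ndjson = '{"id": 1, "email": "a@example.com"}\n{"id": 2, "email": "b@example.com"}'

# Parse line by line; there is no surrounding array and no commas between records.
records = [json.loads(line) for line in ndjson.splitlines()]
print(records[1]["email"])  # b@example.com
```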

Load settings

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.ndjson'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "application/x-json-stream"
  }
},

Load operations

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

TSV

A tab-separated values (TSV) file is a delimited text file that uses a tab to separate values. A TSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by tabs. The use of the tab as a field separator is the source of the name for this file format.
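
A minimal sketch of reading such a file with Python's standard csv module (the rows shown are invented):

```python
import csv
import io

# A two-record TSV payload with a single header row.
data = "id\temail\n1\ta@example.com\n2\tb@example.com\n"

# The csv module handles tab-separated data via the delimiter option.
reader = csv.DictReader(io.StringIO(data), delimiter="\t")
rows = list(reader)
print(rows[0]["email"])  # a@example.com
```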

Load settings with header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.tsv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/tab-separated-values"
  }
},

Load settings with non-standard quotes and separators

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.tsv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/quote": "*",
     "file/separator": ";",
     "file/content-type": "text/tab-separated-values"
  }
},

Load settings without header rows

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.tsv'",
  "object/land-as": {
     "file/header-rows": 0,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/tab-separated-values"
  }
},

Load operations for feed

{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME"
    }
  ]
}

Load operations for ingest query

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "delimiter": "\t",
            "escape": "\\",
            "multiline": "true",
            "quote": "\""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

XML

eXtensible Markup Language (XML) is a supported data format for customer data sources.

Load settings

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-yyyy-MM-dd'.xml'",
  "object/land-as": {
    "file/tag": "FILE_NAME",
    "file/content-type": "application/xml"
  }
}

Load operations

{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "rowTag": "row"
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

Tip

Set rowTag to the element in the XML schema that should be treated as a row in a table. For example, if the XML schema contained:

<salesTransactions>
  <salesTransaction> ... </salesTransaction>
</salesTransactions>

then use salesTransaction as the value for rowTag. The default value is row.

{
  "df-5Jagkabc": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "PosData",
          "options": {
            "rowTag": "salesTransaction"
          }
        }
      ],
      "spark-sql-query": "API_Test_Headers"
    }
  ]
}
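
To see how rowTag selects rows, here is a small sketch with Python's standard library (the XML content is invented for illustration):

```python
import xml.etree.ElementTree as ET

xml_doc = """<salesTransactions>
  <salesTransaction><id>1</id></salesTransaction>
  <salesTransaction><id>2</id></salesTransaction>
</salesTransactions>"""

# With rowTag set to salesTransaction, each matching element becomes one row.
root = ET.fromstring(xml_doc)
rows = root.findall("salesTransaction")
print(len(rows))  # 2
```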

How-tos

This section describes tasks related to managing couriers in Amperity:

Add courier

Use the Add Courier button to add a courier to Amperity. A courier should be created for each feed that exists in Amperity.

For smaller data sources, a courier may be associated with more than one feed. For larger data sources, a courier should be associated with a single feed. This is because couriers run in parallel, whereas multiple feeds associated with a single courier run sequentially.

For example: if Snowflake is configured to send six tables to Amperity via six feeds, but all running as part of the same courier, table one must finish before table two, which must finish before table three, and so on. Whereas if each table is configured with its own courier, all six tables could start processing at the same time.

A courier configured from the Amperity UI must be configured to use one of the existing plugins in Amperity, such as for Amazon S3, Azure Blob Storage, Azure Data Lake Storage, SFTP, or Snowflake.

Some of these plugins have more than one option for credentials.

Use SnapPass to securely share configuration data with your Amperity representative.

To add a courier

  1. From the Sources tab, click Add Courier. The Add Courier page opens.

  2. Enter the name of the courier.

  3. From the Plugin drop-down, select a plugin.

    Note

    The settings for a courier will vary, depending on the courier selected from the Plugin drop-down.

  4. Enter the credentials for the courier type.

  5. Enter any courier-specific settings.

  6. Under <COURIER NAME> Settings configure the file load settings. This is done in two parts: a list of files that should be available to Amperity (including how they are made available), and then a series of load operations that associates each file in the list to a feed.

  7. Click Save.

Add courier as copy

You may add a courier by copying an existing courier. This is useful when couriers share plugin, credential, and other common settings. A copied courier retains all of the configured settings of the original, but is assigned a unique name based on the name of the copied courier.

To add a courier as a copy

  1. From the Sources tab, open the menu for a courier, and then select Make a copy. The Add Courier page opens.

  2. Update the name of the courier.

  3. Verify all other configuration settings. Edit them as necessary.

  4. Under <COURIER NAME> Settings configure the file load settings. This is done in two parts: a list of files that should be available to Amperity (including how they are made available), and then a series of load operations that associates each file in the list to a feed.

  5. Click Save.

Add to courier group

A courier group is a list of one (or more) couriers that are run as a group, either ad hoc or as part of an automated schedule. A courier group can be configured to act as a constraint on downstream workflows.

To add a courier to a courier group

  1. From the Sources tab, click Add Courier Group. This opens the Create Courier Group dialog box.

  2. Enter the name of the courier group.

  3. Add a cron string to the Schedule field to define a schedule for the courier group.

    A schedule defines the frequency at which a courier group runs. All couriers in the same courier group run as a unit and all tasks must complete before a downstream process can be started. The schedule is defined using cron.

    Cron syntax specifies the fixed time, date, or interval at which cron will run. Each line represents a job, and is defined like this:

    ┌───────── minute (0 - 59)
    │ ┌─────────── hour (0 - 23)
    │ │ ┌───────────── day of the month (1 - 31)
    │ │ │ ┌────────────── month (1 - 12)
    │ │ │ │ ┌─────────────── day of the week (0 - 6) (Sunday to Saturday)
    │ │ │ │ │
    │ │ │ │ │
    │ │ │ │ │
    * * * * * command to execute
    

    For example, 30 8 * * * represents “run at 8:30 AM every day” and 30 8 * * 0 represents “run at 8:30 AM every Sunday”. Amperity validates your cron syntax and shows you the results. You may also use crontab guru to validate cron syntax.

  4. Set Status to Enabled

  5. Specify a time zone.

    A courier group schedule is associated with a time zone. The time zone determines the point at which a courier group’s scheduled start time begins. A time zone should be aligned with the time zone of the system from which the data is being pulled.

    Note

    The time zone that is chosen for a courier group schedule should consider every downstream business process that requires the data, as well as the time zone(s) in which the consumers of that data will operate.

  6. Set SLA? to False. (You can change this later after you have verified the end-to-end workflows.)

  7. Add at least one courier to the courier group. Select the name of a courier from the Courier drop-down. Click + Add Courier to add additional couriers to the courier group.

  8. Click Add a courier group constraint, and then select a courier group from the drop-down list.

    A wait time is a constraint placed on a courier group that defines an extended time window for data to be made available at the source location. A courier group typically runs on an automated schedule that expects customer data to be available at the source location within a defined time window. However, in some cases, the customer data may be delayed and isn’t made available within that time window.

  9. For each courier group constraint, apply any offsets.

    An offset is a constraint placed on a courier group that defines a range of time that is older than the scheduled time, within which a courier group will accept customer data as valid for the current job.

    A courier group offset is typically set to be 24 hours. For example, it’s possible for customer data to be generated with a correct file name and datestamp appended to it, but for that datestamp to represent the previous day because of the customer’s own workflow. An offset ensures that the data at the source location is recognized by the courier as the correct data source.

    Warning

    An offset affects couriers in a courier group whether or not they run on a schedule.

  10. Click Save.
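
The offset described in step 9 can be sketched as a widened acceptance window for file datestamps (the times and the 24-hour value are illustrative):

```python
from datetime import datetime, timedelta

def file_is_valid(file_date, scheduled, offset_hours=24):
    """Accept a file datestamp that falls within offset_hours before the scheduled run."""
    return scheduled - timedelta(hours=offset_hours) <= file_date <= scheduled

scheduled = datetime(2023, 6, 2, 8, 30)
# A file stamped with the previous day's date is still accepted.
print(file_is_valid(datetime(2023, 6, 1, 8, 30), scheduled))  # True
print(file_is_valid(datetime(2023, 5, 30, 8, 30), scheduled))  # False
```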

Delete courier

Use the Delete option to remove a courier from Amperity. This should be done carefully. Verify that both upstream and downstream processes no longer depend on this courier prior to deleting it. This action will not delete the feeds associated with the courier.

To delete a courier

  1. From the Sources tab, open the menu for a courier, and then select Delete. The Delete Courier dialog box opens.

  2. Click Delete.

Edit courier

Use the Edit option in the row for a specific courier to make configuration changes. For example, a new file is added to an Amazon S3 filedrop location already configured to send data to Amperity. After the feed is created, it can be added to the existing courier objects and load operations.

In other cases, a courier may need editing because the credentials to the data source have changed.

To edit a courier

  1. From the Sources tab, open the menu for a courier, and then select Edit. The Edit Courier page opens.

  2. Make your changes.

  3. Click Save.

Load data only

A courier can be run to load data to a domain table while preventing downstream processes, such as Stitch, customer 360 database runs, SLA queries, and orchestrations, from running.

To load data (without downstream processing)

  1. From the Sources tab, open the menu for a courier, and then select Run. The Run Courier page opens.

  2. Select Load all data.

  3. To prevent downstream processing, select Load Only.

  4. Click Run.

Run couriers

Use the Run option to run the courier manually.

A courier can be run in the following ways:

for date range

A courier can be configured to load all data for a specific date range.

To run a courier for a time period

  1. From the Sources tab, open the menu for a courier, and then select Run. The Run Courier page opens.

  2. Select Load data from a specific time period.

  3. Select a start date and an end date.

    Important

    The start of the selected date range is inclusive, whereas the end of the selected date range is exclusive. For example, a range from January 1 to January 8 loads files dated January 1 through January 7.

  4. To prevent downstream processing, select Load Only.

    Warning

    When a data source is changed, and then loaded using the Load Only option, downstream processes are not started automatically. Data that contains PII must be stitched. Databases that contain interaction records must be regenerated so that attributes and predictions are recalculated.

  5. Click Run.

for all data

A courier can be configured to load all data that is available. This can be a large amount of data if the courier is running for the first time.

To run a courier to collect all available data

  1. From the Sources tab, open the menu for a courier, and then select Run. The Run Courier page opens.

  2. Select Load all data.

  3. To prevent downstream processing, select Load Only.

    Warning

    Stitch must be run for data to be available in databases. Jobs that are run as load only do not automatically run Stitch.

  4. Click Run.

without downstream processing

A courier can be configured to load data, but not start any downstream processing, including Stitch, database generation, or queries.

Warning

Stitch must be run for data to be available in databases. Jobs that are run as load only do not automatically run Stitch.

To run a courier without downstream processing

  1. From the Sources tab, open the menu for a courier, and then select Run. The Run Courier page opens.

  2. Select Load data from a specific day or Load data from a specific time period.

  3. Under Load options, select Load Only.

  4. Click Run.

without load operations

You can run a courier without load operations. Use this approach to get files to upload during feed creation, as a feed requires knowing the schema of a file before you can apply semantic tagging and other feed configuration settings.

To run a courier without load operations

  1. From the Sources tab, open the menu for a courier with an empty load operation, and then select Run. The Run Courier page opens.

  2. The settings you choose on the Run Courier page do not matter. You must choose either Load all data or Load data from a specified date, but because the load operation is set to {} no data will be loaded.

  3. Set the load operation to {} (empty).

  4. Click Run.

View error log

If a courier runs and returns an error, you may view the errors from that feed.

To view errors

  1. From the Notifications pane, for the stage error, open the View Load Details link.

  2. From the View Load Details pane, select View Error Log for the feed with errors.

  3. Investigate the errors reported.

Restart job

If a courier runs and returns an error, you may view the error, resolve that error by updating the feed configuration or Spark SQL query, and then restart it without having to reload the data associated with the job.

Note

Only non-SLA couriers may be rerun.

To restart a job

  1. From the Notifications pane, for the stage error, open the View Load Details link and investigate why the job failed.

  2. Edit the feed configuration or Spark SQL query to address the reasons for the error.

  3. From the Notifications pane, click Restart Ingest Job.

View courier

The Sources tab shows the status of every courier, including when it last ran or updated, and its current status.

To view a courier

From the Sources tab, open the menu for a courier, and then select View. The View Courier page opens.