About File Formats

This topic covers details that are common to files that can be pulled to or sent from Amperity using any filedrop source.

Pull files to Amperity

The following sections apply to all files that are pulled to Amperity:

Connect to source

Amperity requires the ability to connect to, and then read data from, a filedrop location. The credentials that allow that connection and the permissions to read data are entered into the Amperity user interface while configuring a courier. These credentials are created and managed by the owner of the filedrop location, which is often external to Amperity (but is sometimes a system that is owned by Amperity). The customer may need to provide credentials to Amperity using SnapPass to complete the configuration.

SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single or multi-line secret, along with an expiration time, and then generate a one-time use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.

Pull file formats

Amperity supports the following file formats when using a filedrop location to pull data to Amperity:

Note

Amperity can ingest data from a wide variety of data sources, such as legacy data outputs like AS/400. Ask your Amperity Support representative about formats that are not directly listed in this section to determine if those data formats can be used as a way to provide data to Amperity.

Date formats

Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“HH:mm:ss.SSS”) is optional, as are the milliseconds within it. For example:

  • 2019-01-28 18:32:05.123

  • 2019-01-28 18:32:05

  • 2019-01-28

When the date format is not similar to the expected date format, Amperity will attempt to convert the date and time values. If date formats are mixed, Amperity will use the first one that matches.

Tip

Spark SQL may be used to transform source data into a supported date format prior to loading it to Amperity.
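The expected format can also be produced before a file is written. The following is a minimal sketch in Python, used here only for illustration; the function name to_expected_format is hypothetical and is not part of Amperity.

```python
from datetime import datetime


def to_expected_format(value: datetime) -> str:
    """Render a datetime as yyyy-MM-dd HH:mm:ss.SSS (millisecond precision)."""
    # strftime has no millisecond directive, so truncate microseconds by hand.
    millis = value.microsecond // 1000
    return f"{value:%Y-%m-%d %H:%M:%S}.{millis:03d}"


print(to_expected_format(datetime(2019, 1, 28, 18, 32, 5, 123000)))
# prints 2019-01-28 18:32:05.123
```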

Date values conversion order

Date values are converted in the following order of precedence:

Order  Date type               Format
1      basic-date              “yyyyMMdd”
2      date                    “yyyy-MM-dd”
3      slash-date              “yyyy/MM/dd”
4      us-date-4y              “MM/dd/yyyy”
5      us-date-2y              “MM/dd/yy”
6      date-month-4y           “dd-MMM-yyyy”
7      date-month-2y           “dd-MMM-yy”
8      date-month-4y-spaced    “dd MMM yyyy”
9      date-month-4y-no-space  “ddMMMyyyy”
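The order of precedence can be pictured as a first-match loop over the listed formats. The sketch below is an illustration only and is not Amperity's implementation: it assumes Python's strptime directives as stand-ins for the patterns listed above, and the names DATE_FORMATS and parse_date are hypothetical.

```python
from datetime import date, datetime

# strptime equivalents of the date formats listed above, in the same order.
DATE_FORMATS = [
    ("basic-date", "%Y%m%d"),
    ("date", "%Y-%m-%d"),
    ("slash-date", "%Y/%m/%d"),
    ("us-date-4y", "%m/%d/%Y"),
    ("us-date-2y", "%m/%d/%y"),
    ("date-month-4y", "%d-%b-%Y"),
    ("date-month-2y", "%d-%b-%y"),
    ("date-month-4y-spaced", "%d %b %Y"),
    ("date-month-4y-no-space", "%d%b%Y"),
]


def parse_date(text: str) -> tuple[str, date]:
    """Return (date type, parsed date) for the first format that matches."""
    for name, fmt in DATE_FORMATS:
        try:
            return name, datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"no listed date format matches {text!r}")


for sample in ("20190128", "01/28/19", "28 Jan 2019"):
    print(sample, "->", parse_date(sample))
```

Because the loop stops at the first match, the position of a format in the list matters when a value could be read more than one way.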

Time values conversion order

Time values (when present) are converted in the following order of precedence:

Order  Time type                          Format
1      basic-time-Z                       “HHmmssZ”
2      basic-time                         “HHmmss”
3      basic-time-millis-Z                “HHmmss.SSSSSSSSSZ”
4      basic-time-millis                  “HHmmss.SSSSSSSSS”
5      24-hour-minute-second-millis-zone  “HH:mm:ss.SSS z”
6      24-hour-minute-second-millis-Z     “HH:mm:ss.SSSSSSSSSZ”
7      24-hour-minute-second-millis       “HH:mm:ss.SSSSSSSSS”
8      24-hour-minute-second-Z            “HH:mm:ssZ”
9      24-hour-minute-second              “HH:mm:ss”
10     24-hour-minute                     “HH:mm”
11     24-hour                            “HH”
12     12-hour-minute-second              “hh:mm:ss a”
13     12-hour-minute                     “hh:mm a”
14     12-hour-minute-zone                “hh:mma z”

File compression / archive

Amperity supports the following compression and archiving formats:

  • Tar

  • Tgz

  • Zip

  • GZip

Large datasets

A large dataset is a file over 500 GB in size.

Amperity recommends that large datasets:

  • Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to leverage their massively parallel I/O capabilities

  • Use compression to reduce file sizes

  • Use the Amperity Streaming Ingest REST API, in certain cases, to avoid batched data drops

Input examples

The following examples show how files input to Amperity are unpacked, depending on various combinations of encryption, compression type, and file format. All examples use yyyy_MM_dd for the date format.

For single files

PGP, TGZ, CSV

  1. Input to Amperity: table_name_yyyy_MM_dd.tgz.pgp

  2. After decryption: table_name_yyyy_MM_dd.tgz

  3. After decompression: table_name_yyyy_MM_dd.csv

PGP, GZip, TAR, CSV

  1. Input to Amperity: table_name_yyyy_MM_dd.csv.tar.gz.pgp

  2. After decryption: table_name_yyyy_MM_dd.csv.tar.gz

  3. After decompression: table_name_yyyy_MM_dd.csv.tar

  4. After the archive is opened: table_name_yyyy_MM_dd.csv

PGP, TAR, Apache Parquet

  1. Input to Amperity: table_name_yyyy_MM_dd.tar.pgp

  2. After decryption: table_name_yyyy_MM_dd.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.

For multiple files

PGP, TAR, Apache Parquet

  1. Input to Amperity: input_name_yyyy_MM_dd.parquet.tar.pgp

  2. After decryption: input_name_yyyy_MM_dd.parquet.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.

PGP, TGZ, CSV

  1. Input to Amperity: input_name_yyyy_MM_dd.csv.tgz.pgp

  2. After decryption: input_name_yyyy_MM_dd.csv.tgz

  3. After decompression: table_name.csv, where all tables that were input are located within a single directory.

Send files from Amperity

The following sections apply to all files that are sent from Amperity:

Connect to destination

Amperity requires the ability to connect to, and then write data to, the location in which files will be dropped. The credentials that allow Amperity to write data to that location are configured in Amperity. If this location is not managed by Amperity, the customer will need to provide these credentials to Amperity using SnapPass to complete the configuration.

SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single or multi-line secret, along with an expiration time, and then generate a one-time use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.

Send file formats

Amperity supports the following file formats when using a filedrop location to send data from Amperity:

Campaign template patterns

Data templates that are made available to campaigns may use variables to apply campaign names, group names, and send dates to the names of files that are sent from Amperity for campaigns.

Note

A date is automatically appended to the filename for one-time campaigns.

Important

Campaign templates use the same tokens and Joda-Time filters as file-based templates.

Campaign names

Use the {{ campaign_name }} variable to define where the name of the campaign is added to the filename for a campaign, as it will be received by the downstream system.

Use this variable by itself to use the campaign name as the filename. For example, when the filename template is set to {{ campaign_name }} and the name of the campaign is acme_subscriber_bogo_20220815_1 the filename will be acme_subscriber_bogo_20220815_1.

You may use this variable by itself or with {{ group_name }} in any order.

Group names

Use the {{ group_name }} variable to define where the name of a recipient group is added to the filename for a campaign, as it will be received by the downstream system.

Use this variable by itself to use the recipient group name as the filename. For example, when the filename template is set to {{ group_name }} and the name of the recipient group is Group1_bogo_20220815_1 the filename will be Group1_bogo_20220815_1.

You may use this variable by itself or with {{ campaign_name }} in any order.

List names

Use the {{ list_name }} variable to use the name of the campaign as the filename for a campaign, as it will be received by the downstream system.
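The substitution described for these variables can be sketched as follows. This is an illustration under stated assumptions, not Amperity's template engine: the function render_filename is hypothetical, the recipient group value Group1 is made up for the example, and the underscore used to join two variables is an assumed separator.

```python
import re

# Matches {{ name }} with optional spaces inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_filename(template: str, values: dict[str, str]) -> str:
    """Replace each {{ variable }} in the template with its value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown template variable: {name}")
        return values[name]

    return _VARIABLE.sub(substitute, template)


values = {
    "campaign_name": "acme_subscriber_bogo_20220815_1",
    "group_name": "Group1",  # hypothetical recipient group name
}
print(render_filename("{{ campaign_name }}", values))
# prints acme_subscriber_bogo_20220815_1
print(render_filename("{{ group_name }}_{{ campaign_name }}", values))
# prints Group1_acme_subscriber_bogo_20220815_1
```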

Filename template patterns

A filename template defines the naming patterns for files that are sent by Amperity to a location in which files are dropped. A filename template specifies the name of the file and then uses Jinja-style string formats to append a date to the filename to ensure that any downstream process can identify which file is the one to be picked up.

Joda-Time is an open-source date and time library that is used by Amperity to establish consistency in filename patterns. The recommended pattern is “Segment_Name_MM-dd-YYYY”, where “Segment_Name” is the name of the segment and “MM-dd-YYYY” will append the current date.

Text variables

Strings in a filename template are literal by default. Use the {{ text }} variable to apply special rendering to the text value.

Filters

Use a filter to shift a timezone, and then format it as a string. The following filters are available:

Filter    Description
local     Use the local filter to shift a datetime to a given timezone using a string in Joda-Time format. For example: local:'America/Los_Angeles'.
format    Use the format filter to format a datetime with a string in Joda-Time format. For example: format:'MM-dd-yyyy'.
next_day  Use the next_day filter to shift the datetime to the next day. For example: now|local:'America/Los_Angeles'|next_day.

Tokens

Use a token to specify how to apply a datetime to a file. The following tokens are available:

Token  Description
now    Use the now token to apply a datetime to a file that is current at the time a file is written.
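To show what the now token and the filters compute, the following sketch mirrors them with Python's standard library. It is an analogy rather than the template engine itself: the function names are hypothetical, a fixed instant stands in for now so that the result is repeatable, and the strftime pattern %m-%d-%Y is assumed to be the equivalent of the Joda-Time pattern MM-dd-yyyy.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local(moment: datetime, zone: str) -> datetime:
    """Shift a timezone-aware datetime to the named timezone."""
    return moment.astimezone(ZoneInfo(zone))


def next_day(moment: datetime) -> datetime:
    """Shift a datetime forward by one calendar day."""
    return moment + timedelta(days=1)


def format_date(moment: datetime) -> str:
    """Render MM-dd-yyyy; %m-%d-%Y is the strftime equivalent."""
    return moment.strftime("%m-%d-%Y")


# A fixed instant stands in for the now token so the result is repeatable.
now = datetime(2022, 8, 16, 2, 0, tzinfo=timezone.utc)

print(format_date(local(now, "America/Los_Angeles")))
# prints 08-15-2022
print(format_date(next_day(local(now, "America/Los_Angeles"))))
# prints 08-16-2022
```

The instant used here falls on August 16 in UTC but on August 15 in Los Angeles, which is why shifting the timezone before formatting changes the date that ends up in the filename.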

File compression

Amperity supports the following compression and archiving formats:

  • Tar

  • Tgz

  • Zip

  • GZip

When Tar or Zip options are not specified, a folder is created using the name from the filename template specified for the orchestration. This folder will contain one (or more) files, each of which has a generated name.

Tip

Compression and archive file extensions are not added to the filename template automatically. These may be added while configuring an orchestration. To add the file compression format to the output filename, append .tar, .tgz, .zip, or .gz after the file format extension in the filename template. For example: parquet.tar, csv.zip, or tsv.gz.

Output examples

The following examples show how files output by Amperity are named, depending on the various combinations of options for file format, compression type, and encryption that are available. All examples use yyyy_MM_dd for the date format.

For queries

Apache Parquet, TAR, PGP

  1. Output from Amperity: query_name_yyyy_MM_dd.tar.pgp

  2. After decryption: query_name_yyyy_MM_dd.tar

  3. After decompression: query_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.

CSV, TGZ, PGP

  1. Output from Amperity: query_name_yyyy_MM_dd.tgz.pgp

  2. After decryption: query_name_yyyy_MM_dd.tgz

  3. After decompression: query_name_yyyy_MM_dd.csv

CSV, GZip, PGP

  1. Output from Amperity: query_name_yyyy_MM_dd.csv.gz.pgp

  2. After decryption: query_name_yyyy_MM_dd.csv.gz

  3. After decompression: query_name_yyyy_MM_dd.csv

For single tables

Apache Parquet, TAR, PGP

  1. Output from Amperity: table_name_yyyy_MM_dd.tar.pgp

  2. After decryption: table_name_yyyy_MM_dd.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.

CSV, TGZ, PGP

  1. Output from Amperity: table_name_yyyy_MM_dd.tgz.pgp

  2. After decryption: table_name_yyyy_MM_dd.tgz

  3. After decompression: table_name_yyyy_MM_dd.csv

CSV, GZip, PGP

  1. Output from Amperity: table_name_yyyy_MM_dd.csv.gz.pgp

  2. After decryption: table_name_yyyy_MM_dd.csv.gz

  3. After decompression: table_name_yyyy_MM_dd.csv

For multiple tables

Apache Parquet, TAR, PGP

  1. Output from Amperity: output_name_yyyy_MM_dd.parquet.tar.pgp

  2. After decryption: output_name_yyyy_MM_dd.parquet.tar

  3. After decompression: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.

CSV, TGZ, PGP

  1. Output from Amperity: output_name_yyyy_MM_dd.csv.tgz.pgp

  2. After decryption: output_name_yyyy_MM_dd.csv.tgz

  3. After decompression: table_name.csv, where all tables that were output are located within a single directory.
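On the receiving side, the decompression step in the examples above might look like the following sketch. It assumes the file has already been decrypted, picks a method from the file extension, and uses a hypothetical helper named unpack; it is not a tool provided by Amperity. Only unpack archives from a source you trust.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def unpack(archive: Path, dest: Path) -> list[str]:
    """Unpack a decrypted file into dest and return the extracted names."""
    dest.mkdir(parents=True, exist_ok=True)
    name = archive.name
    if name.endswith((".tar", ".tgz", ".tar.gz")):
        # tarfile detects gzip compression automatically in the default mode.
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
    elif name.endswith(".zip"):
        with zipfile.ZipFile(archive) as zipped:
            zipped.extractall(dest)
    elif name.endswith(".gz"):
        # A plain gzip file holds a single file: drop the .gz suffix.
        target = dest / name[: -len(".gz")]
        with gzip.open(archive, "rb") as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
    else:
        raise ValueError(f"unsupported archive type: {name}")
    files = (p for p in dest.rglob("*") if p.is_file())
    return sorted(p.relative_to(dest).as_posix() for p in files)
```

For example, calling unpack on a decrypted file named query_name_yyyy_MM_dd.csv.gz writes query_name_yyyy_MM_dd.csv to the destination directory, and calling it on a .tgz file that holds several tables returns one entry per table file.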

Pretty Good Privacy (PGP)

Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.

PGP encryption is the encryption type that may be applied to files sent to Amperity to improve data security and help to ensure file integrity and completeness. Amperity recommends:

  • 4096-bit keys

  • Protected by a strong passphrase

  • One PGP key per tenant (minimum); one PGP key per system (recommended)

Amperity Support will generate PGP keys (both public and private key-pairs) to use when generating PGP encrypted files to be sent to Amperity. Key pairs are created in the same cloud (Amazon AWS or Microsoft Azure) in which the customer’s tenant is located.

Amperity will provide the public key to the customer using SnapPass. The customer must use that key to encrypt files prior to adding them to the filedrop location. Files that are encrypted using PGP should be compressed prior to encryption. (Compression applied after encryption does not reduce the size of the file.) Amperity will use the private key to decrypt files prior to loading them.

Important

There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”

File encryption

Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP encryption.

  • GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.

  • GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.

Caution

The tenant and Amperity must use the same tool to encrypt and decrypt files.

Tip

Use the following command to encrypt a file:

$ gpg --encrypt --recipient s3@acme.amperity.com data.csv

This will encrypt a file named “data.csv” and will output a file named “data.csv.gpg”. Change data.csv to the name of the file to be encrypted. Change s3@acme.amperity.com to the location in Amperity to which the data will be sent.
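Because files should be compressed before they are encrypted, the two steps can be chained. The sketch below is a hedged illustration: the helper names are hypothetical, gzip is chosen only as one of the supported compression formats, and the gpg command is built from the flags shown above but is not executed here.

```python
import gzip
import shutil
from pathlib import Path


def compress_for_encryption(source: Path) -> Path:
    """Gzip source next to the original and return the new .gz path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as out:
        shutil.copyfileobj(src, out)
    return target


def gpg_encrypt_command(path: Path, recipient: str) -> list[str]:
    """Build (but do not run) the gpg command shown above."""
    return ["gpg", "--encrypt", "--recipient", recipient, str(path)]
```

On a machine where gpg is installed and the public key has been imported, the returned list can be passed to subprocess.run to produce the encrypted file, for example data.csv.gz.gpg from data.csv.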

File decryption

Important

The customer must generate PGP keys (both public and private key-pairs) for use when sending PGP encrypted files from Amperity. The customer should provide the public key to Amperity using SnapPass. Amperity will use that key to encrypt the files before sending them to the filedrop location. Files that are encrypted using PGP are appended with the .pgp extension.

Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP decryption.

  • GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.

  • GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.

Caution

The tenant and Amperity must use the same tool to encrypt and decrypt files.

Tip

Use the following command to decrypt a file:

$ gpg --output data.csv --decrypt --recipient [location]@acme.amperity.com data.csv.gpg

This will decrypt a file named “data.csv.gpg” and will output a file named “data.csv”. (Without the --output option, gpg writes the decrypted data to standard output.) Change data.csv.gpg to the name of the file to be decrypted and data.csv to the name of the output file. Change [location] to the parameter that indicates the cloud platform to which the data will be sent.

Key rotations

Amperity performs key rotations on a periodic basis as a best practice. Key rotations are sometimes necessary in situations where a key may have been compromised. When a key rotation happens, Amperity will:

  1. Generate a key pair

  2. Create a keyring file that contains the old and new private keys and uses the same passphrase

  3. Install the keyring file to the courier

  4. Share the new public key with the customer using SnapPass

  5. Wait for confirmation from the customer that the public key is updated

  6. Create a keyring file that contains only the updated private key

  7. Install the keyring file that contains only the updated private key to the courier

Public keys for SFTP

RSA is a cryptographic system that may be used to generate public and private key pairs for the purpose of securing data transmission to and from Amperity via SFTP. The public key is used to encrypt data. The private key is derived from two very large prime numbers and is used to decrypt data.

To use an RSA key to secure data transmission to Amperity

  1. Generate an RSA key pair:

    $ ssh-keygen -t rsa -m PEM -f generated-key
    

    This will write two files: generated-key (the private key) and generated-key.pub (the public key).

  2. Add the public key to the SFTP location from which data is sent to Amperity.

  3. Add the private key to the Amperity SFTP courier.