About file formats¶
This topic covers details that are common to files that can be pulled to or sent from Amperity using Amazon S3, Azure Blob Storage, Google Cloud Storage, or any SFTP site.
Pull files to Amperity¶
The following sections apply to all files that are pulled to Amperity:
Connect to source¶
Amperity requires the ability to connect to, and then read data from, a filedrop location. The credentials that allow that connection and the permissions to read data are entered into the Amperity user interface while configuring a courier. These credentials are created and managed by the owner of the filedrop location, which is often external to Amperity (but is sometimes a system that is owned by Amperity). The customer may need to provide credentials to Amperity using SnapPass to complete the configuration.
SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single-line or multi-line secret, along with an expiration time, and then generate a one-time-use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.
Pull file formats¶
Amperity supports the following file formats when using a filedrop location to pull data to Amperity:
Note
Amperity can ingest data from a wide variety of data sources, such as legacy data outputs like AS/400. Ask your Amperity Support representative about formats that are not directly listed in this section to determine if those data formats can be used as a way to provide data to Amperity.
Date formats¶
Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“HH:mm:ss.SSS”) is optional, as are the milliseconds (“.SSS”) within it. For example:
2019-01-28 18:32:05.123
2019-01-28 18:32:05
2019-01-28
When a value does not match the expected date format, Amperity will attempt to convert the date and time values using the orders of precedence listed below. If date formats are mixed, Amperity will use the first one that matches.
Tip
Spark SQL may be used to transform source data into a supported date format prior to loading it to Amperity.
Date values conversion order¶
Date values are converted in the following order of precedence:
Order | Date type | Format |
---|---|---|
1 | basic-date | “yyyyMMdd” |
2 | date | “yyyy-MM-dd” |
3 | slash-date | “yyyy/MM/dd” |
4 | us-date-4y | “MM/dd/yyyy” |
5 | us-date-2y | “MM/dd/yy” |
6 | date-month-4y | “dd-MMM-yyyy” |
7 | date-month-2y | “dd-MMM-yy” |
8 | date-month-4y-spaced | “dd MMM yyyy” |
9 | date-month-4y-no-space | “ddMMMyyyy” |
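The "first match wins" behavior described for date values can be illustrated with a short Python sketch. This is not Amperity's implementation: the function name `convert_date` is hypothetical, and each pattern in the table has been mapped to an approximate Python `strptime` equivalent, which is an assumption of this example. Month abbreviations such as "Jan" are matched using the default (English) locale.

```python
from datetime import date, datetime

# Order of precedence from the table above. Each pattern is mapped to an
# approximate Python strptime equivalent (an assumption of this sketch).
DATE_FORMATS = [
    ("basic-date", "%Y%m%d"),
    ("date", "%Y-%m-%d"),
    ("slash-date", "%Y/%m/%d"),
    ("us-date-4y", "%m/%d/%Y"),
    ("us-date-2y", "%m/%d/%y"),
    ("date-month-4y", "%d-%b-%Y"),
    ("date-month-2y", "%d-%b-%y"),
    ("date-month-4y-spaced", "%d %b %Y"),
    ("date-month-4y-no-space", "%d%b%Y"),
]


def convert_date(value):
    """Return (date type, date) for the first format that matches, else None."""
    text = value.strip()
    for name, pattern in DATE_FORMATS:
        try:
            return name, datetime.strptime(text, pattern).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


if __name__ == "__main__":
    for sample in ["20190128", "2019/01/28", "01/28/19", "28-Jan-2019", "not a date"]:
        print(sample, "->", convert_date(sample))
```

Because the formats are tried in order, a value is labeled with the earliest date type that parses it, and a value that matches none of the formats returns `None`.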
Time values conversion order¶
Time values (when present) are converted in the following order of precedence:
Order | Time type | Format |
---|---|---|
1 | basic-time-Z | “HHmmssZ” |
2 | basic-time | “HHmmss” |
3 | basic-time-millis-Z | “HHmmss.SSSSSSSSSZ” |
4 | basic-time-millis | “HHmmss.SSSSSSSSS” |
5 | 24-hour-minute-second-millis-zone | “HH:mm:ss.SSS z” |
6 | 24-hour-minute-second-millis-Z | “HH:mm:ss.SSSSSSSSSZ” |
7 | 24-hour-minute-second-millis | “HH:mm:ss.SSSSSSSSS” |
8 | 24-hour-minute-second-Z | “HH:mm:ssZ” |
9 | 24-hour-minute-second | “HH:mm:ss” |
10 | 24-hour-minute | “HH:mm” |
11 | 24-hour | “HH” |
12 | 12-hour-minute-second | “hh:mm:ss a” |
13 | 12-hour-minute | “hh:mm a” |
14 | 12-hour-minute-zone | “hh:mma z” |
File compression / archive¶
Amperity supports the following compression and archiving formats:
Tar
Tgz
Zip
GZip
Large datasets¶
A large dataset is a file over 500GB in size.
Amperity recommends that large datasets:
Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to use their massively parallel I/O capabilities
Use compression to reduce file sizes
In certain cases, be sent using the Amperity Streaming Ingest API to avoid batched data drops
Input examples¶
The following examples show how files input to Amperity are unpacked, depending on various combinations of encryption, compression type, and file format. All examples use yyyy_MM_dd for the date format.
for single files¶
PGP, TGZ, CSV
Input to Amperity: table_name_yyyy_MM_dd.tgz.pgp
After decryption: table_name_yyyy_MM_dd.tgz
After decompression: table_name_yyyy_MM_dd.csv
PGP, GZip, TAR, CSV
Input to Amperity: table_name_yyyy_MM_dd.csv.tar.gz.pgp
After decryption: table_name_yyyy_MM_dd.csv.tar.gz
After decompression: table_name_yyyy_MM_dd.csv.tar
After the archive is opened: table_name_yyyy_MM_dd.csv
PGP, TAR, Apache Parquet
Input to Amperity: table_name_yyyy_MM_dd.tar.pgp
After decryption: table_name_yyyy_MM_dd.tar
After the archive is opened: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
for multiple files¶
PGP, TAR, Apache Parquet
Input to Amperity: input_name_yyyy_MM_dd.parquet.tar.pgp
After decryption: input_name_yyyy_MM_dd.parquet.tar
After the archive is opened: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.
PGP, TGZ, CSV
Input to Amperity: input_name_yyyy_MM_dd.csv.tgz.pgp
After decryption: input_name_yyyy_MM_dd.csv.tgz
After decompression: table_name.csv, where all tables that were input are located within a single directory.
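The unpacking order shown in the "PGP, GZip, TAR, CSV" example (decompress first, then open the archive) can be sketched with the Python standard library. This is only an illustration: the PGP decryption step is omitted, the helper names `make_csv_tar_gz` and `unpack_csv_tar_gz` are hypothetical, and the sample file is built in memory as a stand-in for an already-decrypted input.

```python
import gzip
import io
import tarfile


def make_csv_tar_gz(csv_name, csv_bytes):
    """Build an in-memory .csv.tar.gz payload (stand-in for a decrypted input)."""
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=csv_name)
        info.size = len(csv_bytes)
        archive.addfile(info, io.BytesIO(csv_bytes))
    return gzip.compress(tar_buffer.getvalue())


def unpack_csv_tar_gz(payload):
    """Decompress (.csv.tar.gz -> .csv.tar), then open the archive (-> .csv)."""
    tar_bytes = gzip.decompress(payload)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


if __name__ == "__main__":
    payload = make_csv_tar_gz("table_name_2019_01_28.csv", b"id,email\n1,a@example.com\n")
    for name, content in unpack_csv_tar_gz(payload).items():
        print(name, len(content), "bytes")
```

The result is a mapping of each file name in the archive to its contents, which mirrors the "After the archive is opened" stage in the examples.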
Send files from Amperity¶
The following sections apply to all files that are sent from Amperity:
Connect to destination¶
Amperity requires the ability to connect to, and then write data to, the location in which files will be dropped. The credentials that allow Amperity to write data to that location are configured in Amperity. If this location is not managed by Amperity, the customer will need to provide these credentials to Amperity using SnapPass to complete the configuration.
SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single-line or multi-line secret, along with an expiration time, and then generate a one-time-use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.
Send file formats¶
Amperity supports the following file formats when sending files from Amperity:
Campaign template patterns¶
Data templates that are made available to campaigns may use variables to apply campaign names, group names, and send dates to the names of the files that are sent from Amperity for a campaign.
Note
A date is automatically appended to the filename for one-time campaigns.
Important
Campaign templates use the same tokens and Joda-Time filters as file-based templates.
Campaign names¶
Use the {{ campaign_name }} variable to define where the name of the campaign is added to the filename for a campaign, as it will be received by the downstream system.
Use this variable by itself to use the campaign name as the filename. For example, when the filename template is set to {{ campaign_name }} and the name of the campaign is acme_subscriber_bogo_20220815_1, the filename will be acme_subscriber_bogo_20220815_1.
You may use this variable by itself or with {{ group_name }} in any order.
Group names¶
Use the {{ group_name }} variable to define where the name of a treatment group is added to the filename for a campaign, as it will be received by the downstream system.
Use this variable by itself to use the treatment group name as the filename. For example, when the filename template is set to {{ group_name }} and the name of the treatment group is Group1_bogo_20220815_1, the filename will be Group1_bogo_20220815_1.
You may use this variable by itself or with {{ campaign_name }} in any order.
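The way the {{ campaign_name }} and {{ group_name }} variables combine in a filename template can be sketched as a simple placeholder substitution. This is an illustration only and does not reflect how Amperity renders templates internally; the function name `render_filename` and the underscore used to join the two variables are assumptions of this example.

```python
import re

# Matches placeholders such as {{ campaign_name }} or {{group_name}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_filename(template, values):
    """Replace each {{ name }} placeholder with its value; fail on unknown names."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for template variable {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    values = {
        "campaign_name": "acme_subscriber_bogo_20220815_1",
        "group_name": "Group1_bogo_20220815_1",
    }
    print(render_filename("{{ campaign_name }}", values))
    print(render_filename("{{ group_name }}_{{ campaign_name }}", values))
```

Text outside the placeholders is kept as-is, so the variables may appear alone or together, in either order.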
List names¶
Use the {{ list_name }} variable to use the name of the campaign as the filename for a campaign, as it will be received by the downstream system.
Filename template patterns¶
A filename template defines the naming pattern for files that are sent from Amperity. Specify the name of the file, and then use Jinja-style string formatting to append a date or timestamp to the filename.
Joda-Time is an open-source date and time library that is used by Amperity to establish consistency in filename patterns. The recommended pattern is “Segment_Name_MM-dd-YYYY”, where “Segment_Name” is the name of the segment and “MM-dd-YYYY” will append the current date.
Text variables¶
Strings in a filename template are literal by default. Use the {{ text }} variable to apply special rendering to the text value.
Filters¶
Use a filter to shift a timezone, and then format it as a string. The following filters are available:
Filter | Description |
---|---|
local | Use the |
format | Use the |
next day | Use the |
Tokens¶
Use a token to specify how to apply a datetime to a file. The following tokens are available:
Token | Description |
---|---|
now | Use the |
File compression¶
Amperity supports the following compression and archiving formats:
Tar
Tgz
Zip
GZip
When Tar or Zip options are not specified, a folder is created using the filename template specified for the orchestration. This folder will contain one (or more) files, each of which has a generated name.
Tip
Compression and archive file extensions are not added to the filename template automatically. These may be added while configuring an orchestration. To add the file compression format to the output filename, append .tar, .tgz, .zip, or .gz after the file format extension in the filename template. For example: parquet.tar, csv.zip, or tsv.gz.
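As a rough illustration of how the recommended “Segment_Name_MM-dd-YYYY” pattern, the file format extension, and an optional compression extension fit together, the following Python sketch builds an output filename. The function name `output_filename` is hypothetical, and the Python format string `%m-%d-%Y` is used here as an approximate stand-in for the “MM-dd-YYYY” pattern; neither is taken from Amperity.

```python
from datetime import date

# Compression and archive extensions listed in this section.
SUPPORTED_EXTENSIONS = {"tar", "tgz", "zip", "gz"}


def output_filename(name, file_format, send_date, compression=None):
    """Build name_MM-dd-YYYY.format, optionally followed by .compression."""
    stamped = f"{name}_{send_date:%m-%d-%Y}.{file_format}"
    if compression is None:
        return stamped
    if compression not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported compression extension: {compression!r}")
    # The compression extension goes after the file format extension.
    return f"{stamped}.{compression}"


if __name__ == "__main__":
    print(output_filename("Segment_Name", "csv", date(2022, 8, 15), "gz"))
    print(output_filename("Segment_Name", "parquet", date(2022, 8, 15), "tar"))
```

The key point is the ordering: the compression or archive extension always follows the file format extension, as in csv.gz or parquet.tar.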
Output examples¶
The following examples show how files output by Amperity are named, depending on the combination of options that are selected for file format, compression type, and encryption. All examples use yyyy_MM_dd for the date format.
for queries¶
Apache Parquet, TAR, PGP
Output from Amperity: query_name_yyyy_MM_dd.tar.pgp
After decryption: query_name_yyyy_MM_dd.tar
After the archive is opened: query_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
CSV, TGZ, PGP
Output from Amperity: query_name_yyyy_MM_dd.tgz.pgp
After decryption: query_name_yyyy_MM_dd.tgz
After decompression: query_name_yyyy_MM_dd.csv
CSV, GZip, PGP
Output from Amperity: query_name_yyyy_MM_dd.csv.gz.pgp
After decryption: query_name_yyyy_MM_dd.csv.gz
After decompression: query_name_yyyy_MM_dd.csv
for single tables¶
Apache Parquet, TAR, PGP
Output from Amperity: table_name_yyyy_MM_dd.tar.pgp
After decryption: table_name_yyyy_MM_dd.tar
After the archive is opened: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
CSV, TGZ, PGP
Output from Amperity: table_name_yyyy_MM_dd.tgz.pgp
After decryption: table_name_yyyy_MM_dd.tgz
After decompression: table_name_yyyy_MM_dd.csv
CSV, GZip, PGP
Output from Amperity: table_name_yyyy_MM_dd.csv.gz.pgp
After decryption: table_name_yyyy_MM_dd.csv.gz
After decompression: table_name_yyyy_MM_dd.csv
for multiple tables¶
Apache Parquet, TAR, PGP
Output from Amperity: output_name_yyyy_MM_dd.parquet.tar.pgp
After decryption: output_name_yyyy_MM_dd.parquet.tar
After the archive is opened: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.
CSV, TGZ, PGP
Output from Amperity: output_name_yyyy_MM_dd.csv.tgz.pgp
After decryption: output_name_yyyy_MM_dd.csv.tgz
After decompression: table_name.csv, where all tables that were output are located within a single directory.
SFTP and encryption¶
You must use encryption when using Secure File Transfer Protocol (SFTP) to transfer files to and from Amperity. Encryption protects the information in the files and requires the use of secure protocols and encryption programs:
Secure Shell (SSH) is a secure protocol that protects your files as they are transferred to and from Amperity
Pretty Good Privacy (PGP) is an encryption program that signs, encrypts, and decrypts files, protecting your files while they are at rest
Secure Shell (SSH)¶
Secure Shell (SSH) is a secure protocol that protects your files while they are transferred to and from Amperity.
Important
Amperity prefers RSA as the algorithm for generating the SSH keys that are used by SFTP connections to pull data to or send data from Amperity. Amperity prefers the size of the SSH key to be 4096 bits.
Amperity also supports the ECDSA and Ed25519 key types.
Generate SSH keypairs¶
RSA is a cryptographic system that may be used to generate public and private key pairs for the purpose of securing data transmission to and from Amperity via SFTP. The public key is used to encrypt data. The private key is derived from two very large prime numbers and is used to decrypt data.
To generate SSH keypairs
Run the following command to generate a 4096-bit RSA key pair:
$ ssh-keygen -t rsa -b 4096 -m PEM -f generated-key
This will write two files: generated-key (the private key) and generated-key.pub (the public key).
The location in which the public and private keys should be placed depends on the location to which data is transferred.
Add the public key to the SFTP location from which data is sent to Amperity.
Add the private key to the Amperity SFTP courier.
Pretty Good Privacy (PGP)¶
Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.
PGP helps protect your files while they are at rest. Amperity recommends using:
4096-bit keys
A strong passphrase
One PGP key per tenant (minimum); one PGP key per system (recommended)
Files that are encrypted using PGP are appended with the .pgp extension.
Important
There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”
Encrypt files¶
Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP encryption.
GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.
GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.
Caution
The tenant and Amperity must use the same tool to encrypt and decrypt files.
Tip
Use the following command to encrypt a file:
$ gpg --encrypt --recipient s3@acme.amperity.com data.csv
This will encrypt a file named “data.csv” and will output a file named “data.csv.gpg”. Change data.csv to the name of the file to be encrypted. Change s3@acme.amperity.com to the location in Amperity to which the data will be sent.
Decrypt files¶
Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP decryption.
GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.
GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.
Caution
The tenant and Amperity must use the same tool to encrypt and decrypt files.
Tip
Use the following command to decrypt a file:
$ gpg --output data.csv --decrypt --recipient [location]@acme.amperity.com data.csv.gpg
This will decrypt a file named “data.csv.gpg” and will write the output to a file named “data.csv”. Without the --output option, gpg --decrypt writes the decrypted data to standard output instead of to a file. Change data.csv.gpg to the name of the file to be decrypted. Change [location] to the parameter that indicates the cloud platform to which the data will be sent.
About data transfers¶
You must use encryption while transferring files to and from Amperity. SSH protects your files as they are transferred to and from Amperity. PGP protects your files while they are at rest.
The combination of public and private keys that is used for a specific workflow depends on whether data is being pulled to Amperity from an upstream system or sent from Amperity to a downstream system.
Pull files to Amperity¶
When Amperity pulls files from upstream systems using SFTP, use the following combinations for SSH and PGP keys:
for SSH
The owner of the upstream system will create the SSH keypair and will maintain the private SSH key.
The public SSH key is configured in Amperity; you may send the public SSH key to your Amperity representative using SnapPass.
for PGP
Amperity Support will create the PGP keypair and will maintain the private PGP key.
Amperity Support will send you the public PGP key using SnapPass; the owner of the upstream system will encrypt files using the public PGP key prior to adding the files to the location from which Amperity will pull data.
Tip
Use file compression before encrypting files; compression applied after encryption will not reduce the size of the file.
Amperity will use the private PGP key to decrypt files pulled from the upstream system.
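The tip about compressing before encrypting can be demonstrated with a small Python sketch. Encrypted output is designed to look like random data, so this example uses random bytes from `os.urandom` as a stand-in for a PGP-encrypted file (an assumption of the sketch; no actual encryption is performed). The helper name `compression_ratio` is hypothetical.

```python
import gzip
import os


def compression_ratio(data):
    """Return compressed size divided by original size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


# Repetitive CSV rows, similar in shape to a typical data file.
csv_rows = "".join(
    f"{i},customer{i}@example.com,2019-01-28\n" for i in range(5000)
).encode()

# Random bytes of the same length, standing in for encrypted output.
encrypted_stand_in = os.urandom(len(csv_rows))

if __name__ == "__main__":
    print(f"CSV compressed to {compression_ratio(csv_rows):.0%} of its size")
    print(f"Random data compressed to {compression_ratio(encrypted_stand_in):.0%} of its size")
```

The plain CSV shrinks substantially, while the random stand-in stays at roughly its original size (or grows slightly), which is why compression should be applied before encryption.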
Send files from Amperity¶
When Amperity sends files to downstream systems using SFTP, use the following combinations for SSH and PGP keys:
for SSH
Amperity Support will create the SSH keypair and will maintain the private SSH key
Amperity Support will send you the public SSH key using SnapPass; add the public SSH key to the downstream system to which Amperity is configured to send data
for PGP
The owner of the downstream system creates the PGP keypair and maintains the private PGP key; you may send the public PGP key to your Amperity representative using SnapPass
Amperity will use the public PGP key to encrypt files before sending them to the downstream system.
The downstream system will use the private PGP key to decrypt files sent from Amperity
Key rotations¶
Amperity performs key rotations on a periodic basis as a best practice. Key rotations are sometimes necessary in situations where a key may have been compromised. When a key rotation happens, Amperity will:
Generate a new key pair
Create a keyring file that contains the old and new private keys and uses the same passphrase
Install the keyring file to the courier
Share the new public key with the customer using SnapPass
Wait for confirmation from the customer that the public key is updated
Create a keyring file that contains only the updated private key
Install the keyring file that contains only the updated private key to the courier
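The reason for the transitional keyring in the steps above can be shown with a toy model in Python. Keys are represented only by identifiers, and the names `rotation_keyrings`, `can_decrypt`, "old-key", and "new-key" are all hypothetical; this models the sequence of steps and is not how Amperity stores or uses keys.

```python
def rotation_keyrings(old_key_id, new_key_id):
    """Return the (transitional, final) keyrings used during a key rotation."""
    transitional = {old_key_id, new_key_id}  # old and new private keys together
    final = {new_key_id}                     # only the updated private key
    return transitional, final


def can_decrypt(keyring, encrypted_with_key_id):
    """A file can be decrypted only if the matching private key is in the keyring."""
    return encrypted_with_key_id in keyring


if __name__ == "__main__":
    transitional, final = rotation_keyrings("old-key", "new-key")
    # Before the customer confirms the update, files may arrive under either key.
    print(can_decrypt(transitional, "old-key"), can_decrypt(transitional, "new-key"))
    # After confirmation, only files encrypted with the new public key are accepted.
    print(can_decrypt(final, "old-key"), can_decrypt(final, "new-key"))
```

While the customer is still switching to the new public key, the transitional keyring lets files encrypted with either key be decrypted; once the update is confirmed, the final keyring retires the old key.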