About File Formats¶
This topic covers details that are common to files that can be pulled to or sent from Amperity using any filedrop source.
Pull files to Amperity¶
The following sections apply to all files that are pulled to Amperity:
Connect to source¶
Amperity requires the ability to connect to, and then read data from, a filedrop location. The credentials that allow that connection and the permissions to read data are entered into the Amperity user interface while configuring a courier. These credentials are created and managed by the owner of the filedrop location, which is often external to Amperity (but is sometimes a system that is owned by Amperity). The customer may need to provide credentials to Amperity using SnapPass to complete the configuration.
SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single or multi-line secret, along with an expiration time, and then generate a one-time-use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.
Pull file formats¶
Amperity supports the following file formats when using a filedrop location to pull data to Amperity:
Note
Amperity can ingest data from a wide variety of data sources, such as legacy data outputs like AS/400. Ask your Amperity Support representative about formats that are not directly listed in this section to determine if those data formats can be used as a way to provide data to Amperity.
Date formats¶
Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“HH:mm:ss.SSS”) is optional, and so are its milliseconds. For example:
2019-01-28 18:32:05.123
2019-01-28 18:32:05
2019-01-28
When the date format is not similar to the expected date format, Amperity will attempt to convert the date and time values. If date formats are mixed, Amperity will use the first one that matches.
Tip
Spark SQL may be used to transform source data into a supported date format prior to loading it to Amperity.
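As an illustration of preparing dates before loading, the following Python sketch rewrites a source value into the expected “yyyy-MM-dd HH:mm:ss.SSS” format. The function name `to_amperity_timestamp` and the source pattern in the example call are hypothetical and chosen only for this sketch; this is not an Amperity API.

```python
from datetime import datetime


def to_amperity_timestamp(value: str, source_pattern: str) -> str:
    """Reformat a source date string as yyyy-MM-dd HH:mm:ss.SSS."""
    parsed = datetime.strptime(value, source_pattern)
    # strftime has no millisecond directive, so derive it from microseconds.
    millis = parsed.microsecond // 1000
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + f".{millis:03d}"


# A day-first source value (hypothetical input) rewritten to the expected format.
print(to_amperity_timestamp("28/01/2019 18:32:05", "%d/%m/%Y %H:%M:%S"))
# prints 2019-01-28 18:32:05.000
```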
Date values conversion order¶
Date values are converted in the following order of precedence:
Order | Date type | Format |
---|---|---|
1 | basic-date | “yyyyMMdd” |
2 | date | “yyyy-MM-dd” |
3 | slash-date | “yyyy/MM/dd” |
4 | us-date-4y | “MM/dd/yyyy” |
5 | us-date-2y | “MM/dd/yy” |
6 | date-month-4y | “dd-MMM-yyyy” |
7 | date-month-2y | “dd-MMM-yy” |
8 | date-month-4y-spaced | “dd MMM yyyy” |
9 | date-month-4y-no-space | “ddMMMyyyy” |
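The order of precedence above can be sketched as a first-match loop. This Python example is only an illustration of ordered matching, not Amperity's implementation: the `strptime` directives are assumed equivalents of the patterns in the table, and `convert_date` is a hypothetical helper name.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# Date types from the table, in order, paired with assumed strptime equivalents.
# %b matches abbreviated month names for the current locale (for example "Jan").
DATE_FORMATS = [
    ("basic-date", "%Y%m%d"),
    ("date", "%Y-%m-%d"),
    ("slash-date", "%Y/%m/%d"),
    ("us-date-4y", "%m/%d/%Y"),
    ("us-date-2y", "%m/%d/%y"),
    ("date-month-4y", "%d-%b-%Y"),
    ("date-month-2y", "%d-%b-%y"),
    ("date-month-4y-spaced", "%d %b %Y"),
    ("date-month-4y-no-space", "%d%b%Y"),
]


def convert_date(value: str) -> Optional[Tuple[str, date]]:
    """Return (date type, parsed date) for the first format that matches."""
    for name, pattern in DATE_FORMATS:
        try:
            return name, datetime.strptime(value.strip(), pattern).date()
        except ValueError:
            continue  # this format does not match; try the next one
    return None  # no format in the list matched
```

For example, `convert_date("01/28/19")` falls through the first four formats and matches `us-date-2y`.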
Time values conversion order¶
Time values (when present) are converted in the following order of precedence:
Order | Time type | Format |
---|---|---|
1 | basic-time-Z | “HHmmssZ” |
2 | basic-time | “HHmmss” |
3 | basic-time-millis-Z | “HHmmss.SSSSSSSSSZ” |
4 | basic-time-millis | “HHmmss.SSSSSSSSS” |
5 | 24-hour-minute-second-millis-zone | “HH:mm:ss.SSS z” |
6 | 24-hour-minute-second-millis-Z | “HH:mm:ss.SSSSSSSSSZ” |
7 | 24-hour-minute-second-millis | “HH:mm:ss.SSSSSSSSS” |
8 | 24-hour-minute-second-Z | “HH:mm:ssZ” |
9 | 24-hour-minute-second | “HH:mm:ss” |
10 | 24-hour-minute | “HH:mm” |
11 | 24-hour | “HH” |
12 | 12-hour-minute-second | “hh:mm:ss a” |
13 | 12-hour-minute | “hh:mm a” |
14 | 12-hour-minute-zone | “hh:mma z” |
File compression / archive¶
Amperity supports the following compression and archiving formats:
Tar
Tgz
Zip
GZip
Large datasets¶
A large dataset is a file over 500GB in size.
Amperity recommends that large datasets:
Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to use their massively parallel I/O capabilities
Use compression to reduce file sizes
In certain cases, be sent using the Amperity Streaming Ingest REST API to avoid batched data drops
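As a sketch of the compression recommendation, the following Python example writes a GZip-compressed copy of a file using only the standard library. The helper name `gzip_file` is hypothetical; any tool that produces a valid GZip file would serve the same purpose.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy of source next to it and return its path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        # copyfileobj streams in chunks, so the file is never fully in memory.
        shutil.copyfileobj(src, dst)
    return target
```

Calling `gzip_file(Path("table_name.csv"))` would produce `table_name.csv.gz` alongside the original file.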
Input examples¶
The following examples show how files input to Amperity are unpacked, depending on various combinations of encryption, compression type, and file format. All examples use yyyy_MM_dd for the date format.
for single files¶
PGP, TGZ, CSV
Input to Amperity: table_name_yyyy_MM_dd.tgz.pgp
After decryption: table_name_yyyy_MM_dd.tgz
After decompression: table_name_yyyy_MM_dd.csv
PGP, GZip, TAR, CSV
Input to Amperity: table_name_yyyy_MM_dd.csv.tar.gz.pgp
After decryption: table_name_yyyy_MM_dd.csv.tar.gz
After decompression: table_name_yyyy_MM_dd.csv.tar
After the archive is opened: table_name_yyyy_MM_dd.csv
PGP, TAR, Apache Parquet
Input to Amperity: table_name_yyyy_MM_dd.tar.pgp
After decryption: table_name_yyyy_MM_dd.tar
After decompression: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
for multiple files¶
PGP, TAR, Apache Parquet
Input to Amperity: input_name_yyyy_MM_dd.parquet.tar.pgp
After decryption: input_name_yyyy_MM_dd.parquet.tar
After decompression: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.
PGP, TGZ, CSV
Input to Amperity: input_name_yyyy_MM_dd.csv.tgz.pgp
After decryption: input_name_yyyy_MM_dd.csv.tgz
After decompression: table_name.csv, where all tables that were input are located within a single directory.
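To illustrate the unpacking step for TAR and TGZ inputs after decryption, here is a minimal Python sketch using the standard `tarfile` module. It is not how Amperity itself unpacks files; `read_archive_members` is a hypothetical helper, and reading every member into memory is only suitable for small illustrative archives.

```python
import tarfile


def read_archive_members(archive_path: str) -> dict:
    """Map the name of each regular file in a tar archive to its bytes.

    Mode "r:*" lets tarfile detect compression on its own, so the same
    call handles .tar, .tar.gz, and .tgz inputs.
    """
    contents = {}
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():  # skip directories and links
                contents[member.name] = archive.extractfile(member).read()
    return contents
```

For a multi-table input such as `input_name_yyyy_MM_dd.csv.tgz`, the returned dictionary would contain one entry per table file in the archive.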
Send files from Amperity¶
The following sections apply to all files that are sent from Amperity:
Connect to destination¶
Amperity requires the ability to connect to, and then write data to, the location in which files will be dropped. The credentials that allow Amperity to write data to that location are configured in Amperity. If this location is not managed by Amperity, the customer will need to provide these credentials to Amperity using SnapPass to complete the configuration.
SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single or multi-line secret, along with an expiration time, and then generate a one-time-use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.
Send file formats¶
Amperity supports the following file formats when using a filedrop location to send data from Amperity:
Campaign template patterns¶
Data templates that are made available to campaigns may use variables to apply campaign names, group names, and send dates to the names of campaigns that are sent from Amperity.
Note
A date is automatically appended to the filename for one-time campaigns.
Important
Campaign templates use the same tokens and Joda-Time filters as file-based templates.
Campaign names¶
Use the {{ campaign_name }} variable to define where the name of the campaign is added to the filename for a campaign, as it will be received by the downstream system.
Use this variable by itself to use the campaign name as the filename. For example, when the filename template is set to {{ campaign_name }} and the name of the campaign is acme_subscriber_bogo_20220815_1, the filename will be acme_subscriber_bogo_20220815_1.
You may use this variable by itself or with {{ group_name }} in any order.
Group names¶
Use the {{ group_name }} variable to define where the name of a recipient group is added to the filename for a campaign, as it will be received by the downstream system.
Use this variable by itself to use the recipient group name as the filename. For example, when the filename template is set to {{ group_name }} and the name of the recipient group is Group1_bogo_20220815_1, the filename will be Group1_bogo_20220815_1.
You may use this variable by itself or with {{ campaign_name }} in any order.
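The substitution behavior described for {{ campaign_name }} and {{ group_name }} can be sketched in Python as follows. This is an illustration only: the regular expression and the `render_filename` helper are assumptions made for this example, not Amperity's template engine.

```python
import re

# Matches a placeholder such as {{ campaign_name }}, with optional inner spaces.
TOKEN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_filename(template: str, **values: str) -> str:
    """Replace each {{ name }} placeholder in template with its value."""

    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for {name!r}")
        return values[name]

    return TOKEN.sub(substitute, template)


print(render_filename("{{ campaign_name }}",
                      campaign_name="acme_subscriber_bogo_20220815_1"))
# prints acme_subscriber_bogo_20220815_1
```

Because the variables may appear in any order, a template such as `{{ group_name }}_{{ campaign_name }}` renders the group name first.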
List names¶
Use the {{ list_name }} variable to use the name of the campaign as the filename for a campaign, as it will be received by the downstream system.
Filename template patterns¶
A filename template defines the naming patterns for files that are sent by Amperity to a location in which files are dropped. A filename template specifies the name of the file and then uses Jinja-style string formats to append a date to the filename to ensure that any downstream process can identify which file is the one to be picked up.
Joda-Time is an open-source date and time library that is used by Amperity to establish consistency in filename patterns. The recommended pattern is “Segment_Name_MM-dd-YYYY”, where “Segment_Name” is the name of the segment and “MM-dd-YYYY” will append the current date.
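As a rough illustration of the recommended “Segment_Name_MM-dd-YYYY” pattern, the following Python sketch appends a date to a segment name. The `strftime` directives `%m-%d-%Y` are an assumed equivalent of the Joda-Time pattern, and `dated_filename` is a hypothetical helper, not part of Amperity.

```python
from datetime import date


def dated_filename(segment_name: str, on: date) -> str:
    """Build Segment_Name_MM-dd-YYYY using the assumed strftime equivalent."""
    return f"{segment_name}_{on.strftime('%m-%d-%Y')}"


print(dated_filename("Segment_Name", date(2022, 8, 15)))
# prints Segment_Name_08-15-2022
```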
Text variables¶
Strings in a filename template are literal by default. Use the {{ text }} variable to apply special rendering to the text value.
Filters¶
Use a filter to shift a timezone, and then format it as a string. The following filters are available:
Filter | Description |
---|---|
local | Use the |
format | Use the |
next day | Use the |
Tokens¶
Use a token to specify how to apply a datetime to a file. The following tokens are available:
Token | Description |
---|---|
now | Use the |
File compression¶
Amperity supports the following compression and archiving formats:
Tar
Tgz
Zip
GZip
When Tar or Zip options are not specified, a folder is created using the filename template specified for the orchestration. This folder will contain one (or more) files, each of which has a generated name.
Tip
Compression and archive file extensions are not added to the filename template automatically. These may be added while configuring an orchestration. To add the file compression format to the output filename, append .tar, .tgz, .zip, or .gz after the file format extension in the filename template. For example: parquet.tar, csv.zip, or tsv.gz.
Output examples¶
The following examples show how files output by Amperity are named depending on the various combinations of options for file format, compression type, and encryption that are available. All examples use yyyy_MM_dd for the date format.
for queries¶
Apache Parquet, TAR, PGP
Output from Amperity: query_name_yyyy_MM_dd.tar.pgp
After decryption: query_name_yyyy_MM_dd.tar
After decompression: query_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
CSV, TGZ, PGP
Output from Amperity: query_name_yyyy_MM_dd.tgz.pgp
After decryption: query_name_yyyy_MM_dd.tgz
After decompression: query_name_yyyy_MM_dd.csv
CSV, GZip, PGP
Output from Amperity: query_name_yyyy_MM_dd.csv.gz.pgp
After decryption: query_name_yyyy_MM_dd.csv.gz
After decompression: query_name_yyyy_MM_dd.csv
for single tables¶
Apache Parquet, TAR, PGP
Output from Amperity: table_name_yyyy_MM_dd.tar.pgp
After decryption: table_name_yyyy_MM_dd.tar
After decompression: table_name_yyyy_MM_dd.parquet, with 1 to n Apache Parquet part files within the directory.
CSV, TGZ, PGP
Output from Amperity: table_name_yyyy_MM_dd.tgz.pgp
After decryption: table_name_yyyy_MM_dd.tgz
After decompression: table_name_yyyy_MM_dd.csv
CSV, GZip, PGP
Output from Amperity: table_name_yyyy_MM_dd.csv.gz.pgp
After decryption: table_name_yyyy_MM_dd.csv.gz
After decompression: table_name_yyyy_MM_dd.csv
for multiple tables¶
Apache Parquet, TAR, PGP
Output from Amperity: output_name_yyyy_MM_dd.parquet.tar.pgp
After decryption: output_name_yyyy_MM_dd.parquet.tar
After decompression: table_name_yyyy_MM_dd.parquet, where, for each table, 1 to n Apache Parquet files will be located within a single directory.
CSV, TGZ, PGP
Output from Amperity: output_name_yyyy_MM_dd.csv.tgz.pgp
After decryption: output_name_yyyy_MM_dd.csv.tgz
After decompression: table_name.csv, where all tables that were output are located within a single directory.
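The naming pattern in these examples can be read as layers of suffixes removed from the outside in. The Python sketch below lists the filename after each layer is removed. It is an illustration only: `unwrap_steps` is a hypothetical helper, and it covers only names in which the file format extension is present (a name such as `query_name_yyyy_MM_dd.tgz.pgp` does not say what the archive contains, so .tgz is not handled).

```python
# Wrapper suffixes for encryption, compression, and archiving.
WRAPPERS = (".pgp", ".gz", ".tar", ".zip")


def unwrap_steps(filename: str) -> list:
    """List the filename after each wrapper suffix is removed, outermost first."""
    steps = []
    current = filename
    while current.endswith(WRAPPERS):
        current = current[: current.rfind(".")]  # drop the last suffix
        steps.append(current)
    return steps


print(unwrap_steps("query_name_2022_08_15.csv.gz.pgp"))
# prints ['query_name_2022_08_15.csv.gz', 'query_name_2022_08_15.csv']
```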
Pretty Good Privacy (PGP)¶
Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.
PGP encryption may be applied to files sent to Amperity to improve data security and to help ensure file integrity and completeness. Amperity's requirements and recommendations for PGP keys are:
4096-bit keys
Protected by a strong passphrase
One PGP key per-tenant (minimum); one PGP key per system (recommended)
Amperity Support will generate PGP key pairs (a public key and a private key) to use when generating PGP encrypted files to be sent to Amperity. Key pairs are created in the same cloud (Amazon AWS or Microsoft Azure) in which the customer’s tenant is located.
Amperity will provide to the customer the public key using SnapPass. The customer must use that key to encrypt files prior to adding them to the filedrop location. Files that are encrypted using PGP should be compressed prior to encryption. (Compression applied after encryption does not reduce the size of the file.) Amperity will use the private key to decrypt files prior to loading them.
Important
There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”
File encryption¶
Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP encryption.
GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.
GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.
Caution
The tenant and Amperity must use the same tool to encrypt and decrypt files.
Tip
Use the following command to encrypt a file:
$ gpg --encrypt --recipient s3@acme.amperity.com data.csv
This will encrypt a file named “data.csv” and will output a file named “data.csv.gpg”. Change data.csv to the name of the file to be encrypted. Change s3@acme.amperity.com to the location in Amperity to which the data will be sent.
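When encryption is scripted, the same gpg command can be built as an argument list. The Python sketch below is an assumption-laden illustration rather than a required workflow: `gpg_encrypt_args` and `encrypt` are hypothetical helpers, and naming the output with a .pgp extension (through gpg's `--output` option) is a choice made here to match the input examples earlier in this topic.

```python
import shutil
import subprocess


def gpg_encrypt_args(recipient: str, source: str) -> list:
    """Build the gpg argument list; the output file is named source + ".pgp"."""
    return [
        "gpg",
        "--output", source + ".pgp",  # assumed naming; gpg defaults to .gpg
        "--encrypt",
        "--recipient", recipient,
        source,
    ]


def encrypt(recipient: str, source: str) -> None:
    """Run gpg; requires gpg on PATH and the recipient's public key imported."""
    if shutil.which("gpg") is None:
        raise RuntimeError("gpg is not installed or not on PATH")
    subprocess.run(gpg_encrypt_args(recipient, source), check=True)
```

Passing an already compressed file, such as `data.csv.gz`, keeps compression ahead of encryption.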
File decryption¶
Important
The customer must generate a PGP key pair (a public key and a private key) for use when sending PGP encrypted files from Amperity. The customer should provide the public key to Amperity using SnapPass. Amperity will use that key to encrypt the files before sending them to the filedrop location. Files that are encrypted using PGP are appended with the .pgp extension.
Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP decryption.
GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for how to use GNU Privacy Guard are available from that site.
GPG Tools. Available from https://gpgtools.org/. Instructions for how to use GPG Tools are available from that site.
Caution
The tenant and Amperity must use the same tool to encrypt and decrypt files.
Tip
Use the following command to decrypt a file:
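$ gpg --output data.csv --decrypt --recipient [location]@acme.amperity.com data.csv.gpg
This will decrypt a file named “data.csv.gpg” and will write the result to a file named “data.csv”. (Without the --output option, gpg writes the decrypted data to standard output.) Change data.csv.gpg to the name of the file to be decrypted and data.csv to the name of the output file. Change [location] to the parameter that indicates the cloud platform to which the data will be sent.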
Key rotations¶
Amperity performs key rotations on a periodic basis as a best practice. Key rotations are sometimes necessary in situations where a key may have been compromised. When a key rotation happens, Amperity will:
Generate a key pair
Create a keyring file that contains the old and new private keys and uses the same passphrase
Install the keyring file to the courier
Share the new public key with the customer using SnapPass
Wait for confirmation from the customer that the public key is updated
Create a keyring file that contains only the updated private key
Install the keyring file that contains only the updated private key to the courier
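The effect of these steps on what the courier can decrypt can be modeled with a small Python sketch. This is a toy model under stated assumptions: keys are represented by names only, and `CourierKeyring` and `rotate` are hypothetical and do not correspond to any Amperity interface.

```python
from dataclasses import dataclass, field


@dataclass
class CourierKeyring:
    """Names of the private keys currently installed on a courier."""
    private_keys: list = field(default_factory=list)

    def can_decrypt(self, encrypted_with: str) -> bool:
        return encrypted_with in self.private_keys


def rotate(keyring: CourierKeyring, old_key: str, new_key: str,
           customer_confirmed: bool) -> CourierKeyring:
    # Install a keyring that holds both the old and the new private keys,
    # so files encrypted with either public key can still be loaded.
    keyring.private_keys = [old_key, new_key]
    # The new public key is shared with the customer; only after the customer
    # confirms the update is the old private key dropped from the keyring.
    if customer_confirmed:
        keyring.private_keys = [new_key]
    return keyring
```

The overlap period, in which both keys are installed, is what allows the customer to switch public keys without failed loads.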
Public keys for SFTP¶
RSA is a cryptographic system that may be used to generate public and private key pairs for the purpose of securing data transmission to and from Amperity via SFTP. The public key is used to encrypt data. The private key is derived from two very large prime numbers and is used to decrypt data.
To use an RSA key to secure data transmission to Amperity
Generate an RSA key pair:
$ ssh-keygen -t rsa -m PEM -f generated-key
This will write two files: generated-key (the private key) and generated-key.pub (the public key).
Add the public key to the SFTP location from which data is sent to Amperity.
Add the private key to the Amperity SFTP courier.