General Advice

This topic contains general advice and recommendations for sending data to Amperity.

Sending data to Amperity is the combination of:

  1. Identifying a data source. It is important to define data sources to have predictable handoffs.

  2. Determining the location from which that data source will be made available to Amperity, the file format to be provided, and the process that will be used (cloud-based storage, SFTP, FiveTran, REST API, or Snowflake) to make that available.

    Important

    If a data source does not appear in the lists of data sources shown as “available” (such as on the Amperity website or on pages within this documentation), this does not mean you cannot send data from that source. A significant percentage of data sources used by Amperity customers are enabled using cloud-based storage.

  3. Configuring Amperity to process that data.

For a production environment, most data sources are configured to run once per 24-hour period. Within that period, the data must be made available to Amperity early enough for Amperity to complete all downstream processing. Downstream processing includes:

  • Running Stitch for identity resolution

  • Refreshing all databases based on Stitch output

  • Processing all queries and segments that have downstream dependencies

  • Sending query results or audiences to all configured destinations and marketing channels

Note

Preprocessing or filtering data before sending it to Amperity is typically not required, but sometimes business and security concerns will require it.

The following sections contain specific advice and recommendations:

Credentials and Secrets

Amperity requires the ability to connect to, and then read data from, the data source. The credentials that allow that connection and that read access are entered into the Amperity user interface while configuring a courier.

These credentials are created and managed by the owner of the data source, which is often external to Amperity (but is sometimes a system that is owned by Amperity, such as Amazon S3 or Azure Blob Storage). Credentials must be provided to Amperity using SnapPass to complete the configuration.

SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single-line or multi-line secret, along with an expiration time, and then generate a one-time-use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.

File Formats

The following data formats are ranked in terms of preference:

  1. Apache Parquet files

  2. CSV files

  3. JSON files with simple nested data only

  4. Other data formats

Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but provides more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.

CSV

A comma-separated values (CSV) file, defined by RFC 4180, is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

Note

Other delimited-file formats, TSV and PSV, are supported. They follow the same recommendations as CSV files and may be considered interchangeable.
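As a minimal illustration of the structure described above (the file name and field values are examples only), a CSV file consists of a header row followed by one record per line, with values optionally quoted:

```shell
# Write a minimal example CSV: a header row plus two data records.
# The file name and values are illustrative only.
cat <<'EOF' > example.csv
id,email,signup_date
1,"alice@example.com","2019-01-28 18:32:05"
2,"bob@example.com","2019-01-28"
EOF
```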

JSON

JavaScript Object Notation (JSON) is a language-independent data format that is derived from (and structured similarly to) JavaScript.

Other data formats

Amperity can ingest data from many types of data sources, such as:

  • Output from relational database management systems

  • Apache Parquet files along with non-Parquet files that are common to Apache Hadoop environments, such as Apache Avro

  • Legacy data outputs, such as DAT

  • PSV and TSV files

  • NDJSON

  • JSON and Streaming JSON

  • Many REST APIs

  • Snowflake tables, including data sources that use FiveTran to send data

  • CBOR

Pull vs. Push

Data may be provided to Amperity in the following ways:

  1. Recommended. Amperity pulls data from a cloud-accessible storage location, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or any SFTP site.

    This location may be customer-managed (recommended) or Amperity-managed. For Amazon AWS, it is recommended to use cross-account role assumption. For Microsoft Azure, it is recommended to use Azure Data Share.

    Some data sources provide a REST API that may be used to provide data to Amperity, such as Campaign Monitor.

    Many data sources are eligible to use FiveTran as the interface that pulls data to Amperity, such as HubSpot, Klaviyo, Kustomer, Shopify, Sailthru, and Square.

  2. The customer pushes data to Amperity via the Streaming Ingest REST API.

    Note

    This scenario should only be used for transactional or event-like data that would be streamed as it happens.
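For the cross-account role assumption recommended for Amazon AWS above, a minimal sketch of the customer-side role trust policy might look like the following. The account ID, external ID, role name, and file name are all placeholders; your Amperity representative provides the actual values:

```shell
# Hypothetical sketch: write the trust policy for a customer-owned IAM role
# that Amperity can assume. The account ID and external ID are placeholders.
cat <<'EOF' > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "example-external-id" }
      }
    }
  ]
}
EOF
# Attach the policy when creating the role (requires the AWS CLI):
#   aws iam create-role --role-name amperity-read \
#     --assume-role-policy-document file://trust-policy.json
```

The role would also need a permissions policy that allows read access (such as s3:GetObject and s3:ListBucket) on the bucket that Amperity pulls from.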

Amperity strongly prefers that data exchange use customer-managed cloud storage locations. This is because many REST APIs are designed for smaller volumes or have record limits. An additional challenge is that many REST APIs are record oriented rather than change oriented. This can result in scenarios like deleted records not appearing in incremental pulls, or sources missing discrete data on upstream merges.

Systems that support change data capture (CDC) are often suitable, but those types of systems are uncommon. Even when systems do support all of these properties, upstream changes, such as normalizing a status column or changing a billing code, can cause updates to large percentages of records, which can be risky given the preference for 24-hour cadences for all workflows.

A hybrid path where a REST API is used for partial incremental changes, and then a separate file-based delivery path is used for catch-ups (either on regular intervals or on-demand) adds more surface area (i.e. risk) to the workflow.

Some REST APIs support bulk delivery, which can perform with the same type of reliability as cloud-accessible storage locations.

A complete file-based delivery using cloud-accessible storage locations is the most reliable way to get very large data volumes to Amperity.

Push Data to Amperity

To push data to Amperity, use the Streaming Ingest REST API.

Apache Spark

Apache Spark prefers load sizes in the range of 1-10,000 files and file sizes in the range of 1-1,000 MB. Apache Spark will parse 100 x 10 MB files faster than 10 x 100 MB files, and much faster than 1 x 1,000 MB file. When loading large files to Amperity, as a general guideline to optimize the performance of Apache Spark, look to create situations where:

  • The number of individual files is below 3000.

  • The size of each individual file is below 100 MB.

Put differently, Apache Spark will parse 3000 x 100 MB files faster than 300 x 1000 MB files and much faster than 30 x 10000 MB files.
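As a hypothetical sketch using GNU coreutils (the file names and the 100 MB threshold are example values), a single oversized CSV export can be split into header-preserving pieces before upload:

```shell
# Create a small sample CSV (a stand-in for a much larger export), then
# split it into pieces of at most 100 MB each, keeping the header row in
# every piece. File names and the size threshold are example values.
printf 'id,amount\n1,10\n2,20\n3,30\n' > transactions.csv

header=$(head -n 1 transactions.csv)
# GNU split: -C limits each piece to 100 MB without breaking lines.
tail -n +2 transactions.csv | split -C 100m - part_
for f in part_*; do
  { echo "$header"; cat "$f"; } > "$f.csv" && rm "$f"
done
```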

PGP Encryption

Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.

PGP encryption may be applied to files sent to Amperity to improve data security and to help ensure file integrity and completeness. Amperity recommends:

  • 4096-bit keys

  • Protected by a strong passphrase

  • One PGP key per tenant (minimum); one PGP key per system (recommended)

Amperity Support will generate PGP keys (both public and private key-pairs) to use when generating PGP encrypted files to be sent to Amperity. Key pairs are created in the same cloud (Amazon AWS or Microsoft Azure) in which the customer’s tenant is located.

Amperity will provide to the customer the public key using SnapPass. The customer must use that key to encrypt files prior to adding them to the filedrop location. Files that are encrypted using PGP should be compressed prior to encryption. (Compression applied after encryption does not reduce the size of the file.) Amperity will use the private key to decrypt files prior to loading them.
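The compress-then-encrypt order above can be sketched as follows. A throwaway key is generated here only to keep the example self-contained; in practice you import the public key that Amperity shares via SnapPass and use its identity as the recipient, and real keys are 4096-bit and passphrase-protected:

```shell
# Compress first, then encrypt; compression applied after encryption does
# not reduce file size. The key identity "s3@acme.example" is a placeholder.
export GNUPGHOME=$(mktemp -d)
gpg --batch --quiet --gen-key <<'EOF'
%no-protection
Key-Type: RSA
Key-Length: 2048
Subkey-Type: RSA
Name-Email: s3@acme.example
%commit
EOF

printf 'id,email\n1,alice@example.com\n' > data.csv
gzip --keep data.csv                          # produces data.csv.gz
gpg --batch --trust-model always --encrypt \
    --recipient s3@acme.example data.csv.gz   # produces data.csv.gz.gpg
```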

Important

There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”

Encrypt Files

Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP encryption.

  • GNU Privacy Guard. Available from https://www.gnupg.org/ . Instructions for how to use GNU Privacy Guard are available from that site.

  • GPG Tools. Available from https://gpgtools.org/ . Instructions for how to use GPG Tools are available from that site.

Caution

The tenant and Amperity must use the same tool to encrypt and decrypt files.

Tip

Use the following command to encrypt a file:

$ gpg --encrypt --recipient s3@acme.amperity.com data.csv

This will encrypt a file named “data.csv” and output a file named “data.csv.gpg”. Change data.csv to the name of the file to be encrypted. Change s3@acme.amperity.com to the recipient that is associated with the public key for your Amperity tenant.

Rotate Keys

Amperity performs key rotations on a periodic basis as a best practice. Key rotations are sometimes necessary in situations where a key may have been compromised. When a key rotation happens, Amperity will:

  1. Generate a key pair

  2. Create a keyring file that contains the old and new private keys and uses the same passphrase

  3. Install the keyring file to the courier

  4. Share the new public key with the customer using SnapPass

  5. Wait for confirmation from the customer that the public key is updated

  6. Create a keyring file that contains only the updated private key

  7. Install the keyring file that contains only the updated private key to the courier
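The keyring steps above might be sketched with GnuPG as follows. The key identities are placeholders, and throwaway 2048-bit keys without passphrases are used only to keep the example self-contained; real keys are 4096-bit and passphrase-protected:

```shell
# Hypothetical sketch of the rotation keyrings, in a throwaway GNUPGHOME.
export GNUPGHOME=$(mktemp -d)
for who in old new; do
gpg --batch --quiet --gen-key <<EOF
%no-protection
Key-Type: RSA
Key-Length: 2048
Subkey-Type: RSA
Name-Email: $who@tenant.example
%commit
EOF
done

# Keyring containing both private keys, used while rotation is in progress:
gpg --export-secret-keys old@tenant.example new@tenant.example > both-keys.gpg
# Keyring containing only the new private key, installed once the customer
# confirms that the new public key is in place:
gpg --export-secret-keys new@tenant.example > new-key-only.gpg
```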

Connection Details

The following connection details are needed for customer-owned Amazon S3, Azure Blob Storage, and SFTP locations.

Amazon S3: Access key, secret key, and bucket name.

Azure Blob Storage: Using shared access credentials: the name of the container, the blob prefix, and credential details.

SFTP: Host name, user name, and public key (preferred); or host name, user name, and passphrase.

Date Formats

Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“HH:mm:ss.SSS”) is optional. For example:

  • 2019-01-28 18:32:05.123

  • 2019-01-28 18:32:05

  • 2019-01-28

When a date is not in the expected format, Amperity will attempt to convert the date and time values. If date formats are mixed, Amperity will use the first format that matches.
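As a small illustration (assuming a POSIX-style date command), the following commands emit values in the accepted formats:

```shell
# Emit timestamps in the preferred "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd"
# formats, in UTC.
date -u +"%Y-%m-%d %H:%M:%S"
date -u +"%Y-%m-%d"
```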

IP Allowlists

Warning

IP allowlists are not recommended. Many issues can arise when an IP address is on an allowlist within Amazon AWS or Microsoft Azure because both services use their own internal networks for routing.

  • Amazon AWS recommends against using allowlists on the SourceIP condition because doing so denies access to AWS services that make calls on your behalf.

  • Microsoft Azure suggests that using IP allowlists for shared access signature (SAS) tokens is only recommended for use with IP addresses that are located outside of Microsoft Azure.

Tip

Alternatives to using an allowlist include:

  1. Cross-account roles within Amazon AWS, which requires using an Amazon Resource Name (ARN) for the role with cross-account access.

  2. Using Azure Data Share.

Discuss these options with your Amperity representative prior to making a decision to allowlist IP addresses.

Use the following IP addresses when IP allowlists are required:

Amazon AWS:

  • Loading Dock: 54.70.74.198

  • Amperity: 52.42.237.53

Azure:

  • Loading Dock: 20.186.51.237

  • Amperity: 104.46.106.84

  • Corporate: 76.121.66.238

Azure EU:

  • Amperity: 20.123.127.54

Large Datasets

A large dataset is a file over 500 GB in size.

Amperity recommends that large datasets:

  • Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to leverage their massively parallel I/O capabilities

  • Use compression to reduce file sizes

  • In certain cases, use the Amperity Streaming Ingest REST API to avoid batched data drops