General Advice¶
This topic contains general advice and recommendations for sending data to Amperity.
Sending data to Amperity involves:
Identifying a data source. Defining data sources clearly is important for predictable handoffs.
Determining the location from which that data source will be made available to Amperity, the file format in which it will be provided, and the process (cloud-based storage, SFTP, Fivetran, REST API, or Snowflake) that will be used to make it available.
Important
Even if a data source does not appear in the various lists of "available" data sources (such as on the Amperity website or on pages within the documentation site), you can still send data from that source. A significant percentage of data sources used by Amperity customers are enabled using cloud-based storage.
Configuring Amperity to process that data.
For a production environment, most data sources are configured to run once per 24-hour period. Within that period, the data must be made available to Amperity early enough for Amperity to complete all downstream processing. Downstream processing includes:
Running Stitch for identity resolution
Refreshing all databases based on Stitch output
Processing all queries and segments that have downstream dependencies
Sending query results or audiences to all configured destinations and marketing channels
Note
Preprocessing or filtering data before sending it to Amperity is typically not required, but sometimes business and security concerns will require it.
The following sections contain specific advice and recommendations:
Credentials and Secrets¶
Amperity requires the ability to connect to, and then read data from, the data source. The credentials that allow that connection and that read access are entered in the Amperity user interface while configuring a courier.
These credentials are created and managed by the owner of the data source, which is often external to Amperity (but is sometimes a system that is owned by Amperity, such as Amazon S3 or Azure Blob Storage). Credentials must be provided to Amperity using SnapPass to complete the configuration.
SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single or multi-line secret, along with an expiration time, and then generate a one-time use URL that may be shared with anyone. Amperity uses SnapPass for sharing credentials to systems with customers.
File Formats¶
The following data formats are ranked in terms of preference:
Apache Parquet¶
Apache Parquet is a free and open-source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but provides more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.
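As an illustration, here is a minimal sketch of producing a Parquet file with Python and the pyarrow library; the file name and column names are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative customer records; column names are hypothetical.
table = pa.table({
    "customer_id": ["C-001", "C-002"],
    "email": ["jane@example.com", "joe@example.com"],
    "updated_at": ["2019-01-28 18:32:05", "2019-01-28 19:04:11"],
})

# Snappy compression is the pyarrow default and keeps files compact.
pq.write_table(table, "customers.parquet", compression="snappy")
```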
CSV¶
A comma-separated values (CSV) file, defined by RFC 4180, is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
Note
Other delimited-file formats, such as TSV and PSV, are supported. They follow the same recommendations as CSV files and may be considered interchangeable.
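As an illustration, here is a minimal sketch of writing an RFC 4180-style file with Python's standard csv module. The field names are hypothetical; pass delimiter="\t" or delimiter="|" to produce TSV or PSV instead:

```python
import csv

rows = [
    {"customer_id": "C-001", "full_name": 'Jane "JJ" Doe', "city": "Seattle, WA"},
    {"customer_id": "C-002", "full_name": "Joe Smith", "city": "Portland, OR"},
]

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    # QUOTE_ALL quotes every field so that embedded commas survive;
    # the csv module doubles embedded quotes per RFC 4180.
    writer = csv.DictWriter(f, fieldnames=["customer_id", "full_name", "city"],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
```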
JSON¶
JavaScript Object Notation (JSON) is a language-independent data format that is derived from (and structured similarly to) JavaScript.
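For streaming JSON (newline-delimited JSON), each line is one complete JSON object. A minimal sketch, with hypothetical field names:

```python
import json

records = [
    {"customer_id": "C-001", "event": "purchase", "amount": 42.50},
    {"customer_id": "C-002", "event": "signup"},
]

with open("events.ndjson", "w", encoding="utf-8") as f:
    for record in records:
        # One JSON object per line; no enclosing array or trailing commas.
        f.write(json.dumps(record) + "\n")
```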
Other data formats¶
Amperity can ingest data from many types of data sources, such as:
Output from relational database management systems
Apache Parquet files along with non-Parquet files that are common to Apache Hadoop environments, such as Apache Avro
Legacy data outputs, such as DAT
JSON and Streaming JSON
Many REST APIs
Snowflake tables, including data sources that use Fivetran to send data
Pull vs. Push¶
Data may be provided to Amperity in the following ways:
Recommended. Amperity pulls data from a cloud-accessible storage location, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or any SFTP site.
This location may be customer-managed (recommended) or Amperity-managed. For Amazon AWS, cross-account role assumption is recommended; for Microsoft Azure, Azure Data Share is recommended.
Some data sources provide a REST API that may be used to provide data to Amperity, such as Campaign Monitor.
Many data sources are eligible to use Fivetran as the interface that pulls data to Amperity, such as HubSpot, Klaviyo, Kustomer, Shopify, Sailthru, and Square.
The customer pushes data to Amperity via the Streaming Ingest REST API.
Note
This scenario should only be used for transactional or event-like data that would be streamed as it happens.
Amperity strongly prefers data exchange through customer-managed cloud storage locations. Many REST APIs are designed for smaller volumes or have record limits. An additional challenge is that many REST APIs are record-oriented rather than change-oriented, which can result in scenarios like deleted records not showing up in incremental pulls, or sources missing discrete data after upstream merges.
Systems that support change data capture (CDC) are often suitable, but those types of systems are uncommon. Even when systems do support all of these properties, upstream changes, such as normalizing a status column or changing a billing code, can cause updates to large percentages of records, which can be risky given the preference for 24-hour cadences for all workflows.
A hybrid path where a REST API is used for partial incremental changes, and then a separate file-based delivery path is used for catch-ups (either on regular intervals or on-demand) adds more surface area (i.e. risk) to the workflow.
Some REST APIs support bulk delivery, which can perform with the same type of reliability as cloud-accessible storage locations.
A complete file-based delivery using cloud-accessible storage locations is the most reliable way to get very large data volumes to Amperity.
Push Data to Amperity¶
To push data to Amperity you may use the Streaming Ingest REST API.
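A hedged sketch of pushing a record over HTTPS with Python's requests library follows. The endpoint URL, authentication header, and payload shape shown here are placeholders, not the documented Streaming Ingest API contract; consult the Streaming Ingest REST API reference for the actual values for your tenant:

```python
import json
import requests

# Hypothetical values; the real endpoint and auth scheme come from
# the Streaming Ingest REST API documentation for your tenant.
ENDPOINT = "https://streaming-ingest.example.amperity.com/v0/events"
API_KEY = "replace-with-tenant-api-key"

record = {"customer_id": "C-001", "event": "purchase", "amount": 42.50}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    data=json.dumps(record),
    timeout=10,
)
response.raise_for_status()
```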
Apache Spark¶
Apache Spark prefers load sizes to range between 1-10000 files and file sizes to range between 1-1000 MB. Apache Spark will parse 100 x 10 MB files faster than 10 x 100 MB files, and much faster than 1 x 10000 MB file. When loading large files to Amperity, as a general guideline to optimize the performance of Apache Spark, aim to keep:
The number of individual files below 3000.
Individual file sizes below 100 MB.
Put differently, Apache Spark will parse 3000 x 100 MB files faster than 300 x 1000 MB files, and much faster than 30 x 10000 MB files.
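One way to stay inside those guidelines is to split a large delimited file into roughly 100 MB parts on line boundaries before upload. A minimal sketch; the file and directory names are illustrative, and note that only the first part will contain the header row (repeat the header per part if your feed expects one):

```python
import os

CHUNK_BYTES = 100 * 1024 * 1024  # target roughly 100 MB per output file

def split_file(path, out_prefix):
    """Split a delimited file into ~100 MB parts on line boundaries."""
    part, written, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            if out is None or written >= CHUNK_BYTES:
                if out is not None:
                    out.close()
                out = open(f"{out_prefix}-{part:05d}.csv", "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()

os.makedirs("parts", exist_ok=True)
split_file("transactions.csv", "parts/transactions")
```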
PGP Encryption¶
Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication by signing, encrypting, and decrypting data files and formats. Amperity supports PGP encryption.
PGP encryption may be applied to files sent to Amperity to improve data security and help ensure file integrity and completeness. Amperity recommends:
4096-bit keys
Keys protected by a strong passphrase
One PGP key per tenant (minimum); one PGP key per system (recommended)
Amperity Support will generate PGP keys (both public and private key-pairs) to use when generating PGP-encrypted files to be sent to Amperity. Key pairs are created in the same cloud (Amazon AWS or Microsoft Azure) in which the customer’s tenant is located.
Amperity will provide the public key to the customer using SnapPass. The customer must use that key to encrypt files prior to adding them to the filedrop location. Files that are encrypted using PGP should be compressed prior to encryption. (Compression applied after encryption does not reduce the size of the file.) Amperity will use the private key to decrypt files prior to loading them.
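A minimal sketch of the compress-first ordering, using Python's standard gzip module; the resulting .gz file is what you would then PGP-encrypt (see the gpg example below). File names are illustrative:

```python
import gzip
import shutil

# Compress first: PGP-encrypted output is effectively random bytes,
# so compression applied after encryption gains nothing.
with open("customers.csv", "rb") as src, gzip.open("customers.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```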
Important
There are two types of PGP public keys: a primary key and a subkey. Amperity does not allow the use of a primary key for public-private key encryption. If you attempt to use a primary key you will see an error similar to “Destination failed validation: PGP public key is a primary key. Please provide a subkey or a keyring with exactly one subkey.”
Encrypt Files¶
Any tool that is compliant with the OpenPGP standard, as defined by RFC 4880, may be used for PGP encryption. For example:
GNU Privacy Guard. Available from https://www.gnupg.org/. Instructions for using GNU Privacy Guard are available from that site.
GPG Tools. Available from https://gpgtools.org/. Instructions for using GPG Tools are available from that site.
Caution
The tenant and Amperity must use the same tool to encrypt and decrypt files.
Tip
Use the following command to encrypt a file:
$ gpg --encrypt --recipient s3@acme.amperity.com data.csv
This will encrypt a file named “data.csv” and will output a file named “data.csv.gpg”. Change data.csv to the name of the file to be encrypted. Change s3@acme.amperity.com to the location in Amperity to which the data will be sent.
Rotate Keys¶
Amperity performs key rotations on a periodic basis as a best practice. Key rotations are sometimes necessary in situations where a key may have been compromised. When a key rotation happens, Amperity will:
Generate a key pair
Create a keyring file that contains the old and new private keys and uses the same passphrase
Install the keyring file to the courier
Share the new public key with the customer using SnapPass
Wait for confirmation from the customer that the public key is updated
Create a keyring file that contains only the updated private key
Install the keyring file that contains only the updated private key to the courier
Connection Details¶
The following connection details are needed for customer-owned Amazon S3, Azure Blob Storage, and SFTP locations.
| Location | Details |
| --- | --- |
| Amazon S3 | Access key, secret key, and bucket name. |
| Azure Blob Storage | Using shared access credentials: the name of the container, the blob prefix, and credential details. |
| SFTP | Host name, user name, and public key (preferred); or host name, user name, and passphrase. |
Date Formats¶
Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“HH:mm:ss.SSS”) is optional. For example:
2019-01-28 18:32:05.123
2019-01-28 18:32:05
2019-01-28
When the date format does not match the expected format, Amperity will attempt to convert the date and time values. If date formats are mixed, Amperity will use the first one that matches.
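The “yyyy-MM-dd HH:mm:ss.SSS” pattern above uses Java-style notation. A minimal Python sketch that emits the same three shapes shown in the examples:

```python
from datetime import datetime

ts = datetime(2019, 1, 28, 18, 32, 5, 123000)

# strftime has no milliseconds directive, so trim the 6-digit
# microseconds (%f) down to 3 digits for the .SSS portion.
print(ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])  # 2019-01-28 18:32:05.123
print(ts.strftime("%Y-%m-%d %H:%M:%S"))          # 2019-01-28 18:32:05
print(ts.strftime("%Y-%m-%d"))                   # 2019-01-28
```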
IP Allowlists¶
Use the following IP addresses when IP allowlists are required:
| Service | IP address |
| --- | --- |
| Amazon AWS | Loading Dock: 54.70.74.198; Amperity: 52.42.237.53; SFTP: 52.11.51.214 |
| Amazon AWS Canada | Loading Dock: 3.99.255.250 |
| Azure | Loading Dock: 20.186.51.237; Amperity: 104.46.106.84; Corporate: 76.121.66.238; SFTP: 20.36.236.80 |
| Azure EU | Loading Dock: 20.67.201.150; Amperity: 20.123.127.54 |
Tip
Alternatives to using an allowlist include:
Cross-account roles within Amazon AWS, which require an Amazon Resource Name (ARN) for the role with cross-account access (a sketch of the mechanics follows this tip).
Using Azure Data Share.
Discuss these options with your Amperity representative prior to making a decision to allowlist IP addresses.
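For context, a hedged sketch of cross-account role assumption mechanics using boto3. The role ARN, bucket, and key names are placeholders for values agreed with Amperity, and in the Amperity pull model it is Amperity’s account that assumes the role in order to read from your bucket:

```python
import boto3

# Hypothetical ARN; the actual role is agreed with Amperity.
ROLE_ARN = "arn:aws:iam::123456789012:role/amperity-data-exchange"

# Exchange the caller's identity for temporary role credentials.
creds = boto3.client("sts").assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="amperity-feed",
)["Credentials"]

# Use the temporary credentials to access the shared bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("customers.csv.gz", "example-bucket", "inbound/customers.csv.gz")
```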
Warning
IP allowlists are not recommended. Many issues can arise when an IP address is on an allowlist within Amazon AWS or Microsoft Azure because both services use their own internal networks for routing.
Amazon AWS recommends against using allowlists on the aws:SourceIp condition because doing so denies access to AWS services that make calls on your behalf.
Microsoft Azure recommends using IP allowlists for shared access signature (SAS) tokens only with IP addresses that are located outside of Microsoft Azure.
Large Datasets¶
A large dataset is a file over 500 GB in size.
Amperity recommends that large datasets:
Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to take advantage of their massively parallel I/O capabilities
Use compression to reduce file sizes
In certain cases, use the Amperity Streaming Ingest REST API to avoid batched data drops
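A minimal sketch combining the first two recommendations above: compress, then upload to Amazon S3 with boto3, whose managed transfer splits large files into parallel multipart uploads. The bucket and key names are illustrative, and the threshold and concurrency values are assumptions to tune for your environment:

```python
import gzip
import shutil

import boto3
from boto3.s3.transfer import TransferConfig

# Compress first to reduce transfer time and storage.
with open("transactions.csv", "rb") as src, \
        gzip.open("transactions.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Managed transfer: files above multipart_threshold are uploaded
# as parallel multipart chunks.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        max_concurrency=8)
boto3.client("s3").upload_file(
    "transactions.csv.gz", "example-bucket", "inbound/transactions.csv.gz",
    Config=config,
)
```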