General advice

This topic contains general advice and recommendations for sending data to Amperity.

Sending data to Amperity is the combination of:

  1. Identifying a data source. It is important to define data sources to have predictable handoffs.

  2. Determining the location from which that data source will be made available to Amperity, the file format to be provided, and the process that will be used (cloud-based storage, SFTP, FiveTran, REST API, or Snowflake) to make that data available.

    Important

    A data source does not need to appear in a list of “available” data sources (such as on the Amperity website or on pages within the documentation site) for you to send data from that source. A significant percentage of data sources used by Amperity customers are enabled using cloud-based storage.

  3. Configuring Amperity to process that data.

For a production environment, most data sources are configured to run once per 24-hour time period. The data must be made available to Amperity early enough within that period to allow Amperity to complete all downstream processing. Downstream processing includes:

  • Running Stitch for identity resolution

  • Refreshing all databases based on Stitch output

  • Processing all queries and segments that have downstream dependencies

  • Sending query results or audiences to all configured destinations and marketing channels

Note

Preprocessing or filtering data before sending it to Amperity is typically not required, but sometimes business and security concerns will require it.

The following sections contain specific advice and/or recommendations:

Credentials and Secrets

Amperity requires the ability to connect to, and then read data from, the data source. The credentials that allow that connection and the ability to read that data are entered into the Amperity user interface while configuring a courier.

These credentials are created and managed by the owner of the data source, which is often external to Amperity (but is sometimes a system that is owned by Amperity, such as Amazon S3 or Azure Blob Storage). Credentials must be provided to Amperity using SnapPass to complete the configuration.

SnapPass allows secrets to be shared in a secure, ephemeral way. Input a single-line or multi-line secret, along with an expiration time, and then generate a one-time use URL that may be shared with anyone. Amperity uses SnapPass to share system credentials with customers.

File Formats

The following data formats are ranked in terms of preference:

  1. Apache Parquet files

  2. CSV files

  3. JSON files with simple nested data only

  4. Other data formats

Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but provides more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.

CSV

A comma-separated values (CSV) file, defined by RFC 4180, is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

Note

Other delimited-file formats (TSV and PSV) are supported. They follow the same recommendations as CSV files and may be considered interchangeable.
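
As a minimal sketch of producing a delimited file that follows RFC 4180, the following uses the csv module from the Python standard library. The column names and sample rows are hypothetical. Quoting every field is an assumption made here for simplicity, not a stated Amperity requirement.

```python
import csv
import io

def write_delimited(rows, fieldnames, delimiter=","):
    """Serialize dict rows to delimited text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter=delimiter,      # "," for CSV, "\t" for TSV, "|" for PSV
        quoting=csv.QUOTE_ALL,    # quote every field so embedded delimiters are safe
        lineterminator="\r\n",    # RFC 4180 uses CRLF line endings
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample records
rows = [
    {"customer_id": "1001", "email": "a@example.com", "city": "Portland, OR"},
    {"customer_id": "1002", "email": "b@example.com", "city": 'The "Rose" City'},
]
columns = ["customer_id", "email", "city"]
csv_text = write_delimited(rows, columns)
psv_text = write_delimited(rows, columns, delimiter="|")
```

Only the delimiter changes between the CSV and PSV outputs, which is consistent with treating the delimited formats as interchangeable.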

JSON

JavaScript Object Notation (JSON) is a language-independent data format that is derived from (and structured similarly to) JavaScript.
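
The ranking of file formats prefers JSON with simple nested data only. One way to simplify a deeply nested record before sending it is to flatten nested objects into a single level of keys. The following is a sketch under assumptions: the record, the helper name flatten, and the underscore separator are hypothetical, and whether flattening is needed depends on the data source.

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Collapse nested objects into a single level of keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat

# Hypothetical nested record
nested = json.loads(
    '{"id": 7, "name": {"first": "Ada", "last": "Lovelace"},'
    ' "address": {"geo": {"lat": 51.5, "lon": -0.1}}}'
)
flat = flatten(nested)
```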

Other data formats

Amperity can ingest data from many types of data sources, such as:

  • Output from relational database management systems

  • Apache Parquet files along with non-Parquet files that are common to Apache Hadoop environments, such as Apache Avro

  • Legacy data outputs, such as DAT

  • PSV and TSV files

  • NDJSON

  • JSON and Streaming JSON

  • Many REST APIs

  • Snowflake tables, including data sources that use FiveTran to send data

  • CBOR

Pull vs. Push

Data may be provided to Amperity in the following ways:

  1. Recommended. Amperity pulls data from a cloud-accessible storage location, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or any SFTP site.

    This location may be customer-managed (recommended) or Amperity-managed. For Amazon AWS, it is recommended to use cross-account role assumption. For Microsoft Azure, it is recommended to use Azure Data Share.

    Some data sources provide a REST API that may be used to provide data to Amperity, such as Campaign Monitor.

    Many data sources are eligible to use FiveTran as the interface that pulls data to Amperity, such as HubSpot, Klaviyo, Kustomer, Shopify, Sailthru, and Square.

  2. The customer pushes data to Amperity via the Streaming Ingest API.

    Note

    This scenario should only be used for transactional or event-like data that would be streamed as it happens.

Amperity strongly recommends and prefers data exchange to use customer-managed cloud storage locations. This is because many REST APIs are designed for smaller volumes or have record limits. An additional challenge is that many REST APIs are record oriented rather than change oriented. This can result in scenarios like deleted records not showing up in incremental pulls or sources that are missing discrete data on upstream merges.
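
To illustrate why record-oriented pulls can miss deletions, the following sketch compares two complete snapshots keyed by record ID. The record IDs, fields, and the helper name diff_snapshots are hypothetical, and this is an illustration of the problem, not a description of how Amperity processes data.

```python
def diff_snapshots(previous, current):
    """Compare two full snapshots keyed by record ID."""
    inserted = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    updated = sorted(
        key for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}

# Hypothetical snapshots from two consecutive days
yesterday = {
    "c1": {"status": "active"},
    "c2": {"status": "active"},
    "c3": {"status": "lapsed"},
}
today = {
    "c1": {"status": "active"},
    "c3": {"status": "active"},
    "c4": {"status": "active"},
}

changes = diff_snapshots(yesterday, today)
```

A record-oriented incremental pull would typically return only the new and changed records (c3 and c4 in this example). The removal of c2 is visible only when complete snapshots are compared, which is one reason a full file-based delivery is easier to reason about.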

Systems that support change data capture (CDC) are often suitable, but those types of systems are uncommon. Even when systems do support all of these properties, upstream changes, such as normalizing a status column or changing a billing code, can cause updates to large percentages of records, which can be risky given the preference for 24-hour cadences for all workflows.

A hybrid path where a REST API is used for partial incremental changes, and then a separate file-based delivery path is used for catch-ups (either on regular intervals or on-demand), adds more surface area, and therefore more risk, to the workflow.

Some REST APIs support bulk delivery, which can perform with the same type of reliability as cloud-accessible storage locations.

A complete file-based delivery using cloud-accessible storage locations is the most reliable way to get very large data volumes to Amperity.

Push Data to Amperity

To push data to Amperity you may use the Streaming Ingest API.

Apache Spark

Apache Spark prefers load sizes of 1 to 10000 files and file sizes of 1 to 1000 MB. Apache Spark will parse 100 x 10 MB files faster than 10 x 100 MB files and much faster than 1 x 10000 MB file. When loading large files to Amperity, as a general guideline to optimize the performance of Apache Spark, look to create situations where:

  • The number of individual files is below 3000.

  • Individual file sizes are below 100 MB.

Put differently, Apache Spark will parse 3000 x 100 MB files faster than 300 x 1000 MB files and much faster than 30 x 10000 MB files.
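
The guideline above can be turned into a rough planning calculation. The following is a minimal sketch: the function name plan_split and its return shape are hypothetical, and the two limits are taken from the guideline (treated here as "at most" rather than strictly "below").

```python
import math

MAX_FILE_MB = 100   # guideline: keep individual files around or under 100 MB
MAX_FILES = 3000    # guideline: keep the number of files under about 3000

def plan_split(total_mb, max_file_mb=MAX_FILE_MB, max_files=MAX_FILES):
    """Return (file_count, approx_mb_per_file, within_guideline)."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    file_count = math.ceil(total_mb / max_file_mb)
    within_guideline = file_count <= max_files
    if not within_guideline:
        # Too much data to satisfy both limits: cap the count and let files grow.
        file_count = max_files
    return file_count, total_mb / file_count, within_guideline
```

For example, plan_split(10000) suggests splitting a 10000 MB load into 100 files of 100 MB each, while a 600000 MB load cannot meet both limits and is reported as outside the guideline.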

Connection Details

The following connection details are needed for customer-owned Amazon S3, Azure Blob Storage, and SFTP locations.

  • Amazon S3: Access key, secret key, and bucket name.

  • Azure Blob Storage: Using shared access credentials, the name of the container, the blob prefix, and credential details.

  • SFTP: Host name, user name, and public key (preferred), or host name, user name, and passphrase.
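
Before sharing connection details, it can help to check that nothing is missing. The following sketch encodes the details listed above. The field names and the helper missing_details are hypothetical labels chosen for illustration and are not Amperity setting names.

```python
# Required details per location, taken from the list above.
# Each location maps to one or more acceptable combinations.
REQUIRED_DETAILS = {
    "amazon_s3": [{"access_key", "secret_key", "bucket_name"}],
    "azure_blob_storage": [
        {"shared_access_credentials", "container_name", "blob_prefix"},
    ],
    "sftp": [
        {"host_name", "user_name", "public_key"},   # preferred
        {"host_name", "user_name", "passphrase"},
    ],
}

def missing_details(location, provided):
    """Return the shortest list of details still missing for a location."""
    provided_keys = {key for key, value in provided.items() if value}
    gaps = [sorted(option - provided_keys) for option in REQUIRED_DETAILS[location]]
    # On a tie, min() keeps the first option, which is the preferred one.
    return min(gaps, key=len)
```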

Date Formats

Dates should be quoted and should be in the “yyyy-MM-dd HH:mm:ss.SSS” format. The time portion (“:mm:ss.SSS”) is optional. For example:

  • 2019-01-28 18:32:05.123

  • 2019-01-28 18:32:05

  • 2019-01-28

When the date format is not similar to the expected date format, Amperity will attempt to convert the date and time values. If date formats are mixed, Amperity will use the first one that matches.
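
As a minimal sketch of producing values in the expected format, the following uses the Python standard library. The helper names are hypothetical. Python formats fractional seconds as microseconds, so the milliseconds are computed separately.

```python
from datetime import date, datetime

def format_timestamp(value):
    """Render a datetime as yyyy-MM-dd HH:mm:ss.SSS (millisecond precision)."""
    millis = value.microsecond // 1000
    return value.strftime("%Y-%m-%d %H:%M:%S") + f".{millis:03d}"

def format_date(value):
    """Render a date as yyyy-MM-dd."""
    return value.strftime("%Y-%m-%d")

def quoted(text):
    """Wrap a formatted value in double quotes for a delimited file."""
    return f'"{text}"'

stamp = format_timestamp(datetime(2019, 1, 28, 18, 32, 5, 123000))
day = format_date(date(2019, 1, 28))
```

Using one format consistently within a column avoids relying on the conversion behavior described above.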

IP addresses for allowlists

You can add Amperity services to allowlists that may be required by upstream systems. The IP address that should be added to the allowlist for the upstream system depends on the service to which that upstream system will connect.

Important

Amperity does not maintain allowlists for connections that are made to Amperity services from upstream systems.

Warning

Using an IP allowlist is not recommended. Many issues can arise when an IP address is on an allowlist within Amazon AWS or Microsoft Azure because both services use their own internal networks for routing.

  • Amazon AWS recommends against using allowlists on the SourceIP condition because it denies access to AWS services that make calls on your behalf.

  • Microsoft Azure recommends using IP allowlists for shared access signature (SAS) tokens only with IP addresses that are located outside of Microsoft Azure.

When connecting to your Amperity tenant

Most connections are made directly to your Amperity tenant. Use one of the following Amperity IP addresses for an allowlist that is required by an upstream system. The specific IP address to use depends on the location in which your tenant is hosted:

  • On Amazon AWS use “52.42.237.53”

  • On Amazon AWS (Canada) use “3.98.199.97”

  • On Microsoft Azure use “104.46.106.84” and “20.81.91.210”

  • On Microsoft Azure (EU) use “20.123.127.54”

When connecting to the attached SFTP site

Some connections are made directly to the SFTP site that is included with your Amperity tenant. The specific IP address to use depends on the location in which your tenant is hosted:

  • On Amazon AWS use “52.11.51.214”

  • On Amazon AWS (Canada) use “52.60.229.171”

  • On Microsoft Azure use “20.36.236.80”

  • On Microsoft Azure (EU) use “51.104.139.110”
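
If an allowlist is unavoidable, the addresses above can be kept in one lookup so that the right entry is chosen for the hosting location and the service. This is a sketch only: the location and service keys and the helper allowlist_cidrs are hypothetical, the addresses are copied from the lists above, and they should be confirmed against the current documentation before use.

```python
import ipaddress

# Addresses copied from the lists above, keyed by (hosting location, service).
AMPERITY_IPS = {
    ("aws", "tenant"): ["52.42.237.53"],
    ("aws-canada", "tenant"): ["3.98.199.97"],
    ("azure", "tenant"): ["104.46.106.84", "20.81.91.210"],
    ("azure-eu", "tenant"): ["20.123.127.54"],
    ("aws", "sftp"): ["52.11.51.214"],
    ("aws-canada", "sftp"): ["52.60.229.171"],
    ("azure", "sftp"): ["20.36.236.80"],
    ("azure-eu", "sftp"): ["51.104.139.110"],
}

def allowlist_cidrs(location, service):
    """Return the addresses for a location and service as /32 CIDR strings."""
    addresses = AMPERITY_IPS[(location, service)]
    # ip_network validates each address and normalizes it to CIDR notation.
    return [str(ipaddress.ip_network(f"{address}/32")) for address in addresses]
```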

Tip

Alternatives to using an allowlist include:

  1. Cross-account roles within Amazon AWS, which requires using an Amazon Resource Name (ARN) for the role with cross-account access.

  2. Using Azure Data Share.

Discuss these options with your Amperity representative prior to making a decision to allowlist IP addresses.
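
For the cross-account role option, the role in your AWS account needs a trust policy that names the role allowed to assume it. The following sketch builds a policy document in the standard IAM trust policy shape. The ARN and external ID are placeholders, the use of an external ID condition is an assumption, and the actual values would come from your Amperity representative.

```python
import json

def cross_account_trust_policy(trusted_role_arn, external_id):
    """Build an IAM trust policy that lets one external role assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

policy = cross_account_trust_policy(
    "arn:aws:iam::111122223333:role/example-reader",  # placeholder ARN
    "example-external-id",                            # placeholder value
)
print(json.dumps(policy, indent=2))
```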

Large Datasets

A large dataset is a file over 500 GB in size.

Amperity recommends that large datasets:

  • Be provided to Amperity using Amazon S3, Azure Blob Storage, or Google Cloud Storage to use their massively parallel I/O capabilities

  • Use compression to reduce file sizes

  • In certain cases, may use the Amperity Streaming Ingest API to avoid batched data drops
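
The compression recommendation can be combined with splitting a large extract into smaller parts. The following is a sketch under assumptions: gzip is used as the compression format, the header is repeated in every part, and parts are split by row count as a simple proxy for file size. None of these choices are stated requirements, and the file naming is hypothetical.

```python
import csv
import gzip
import os
import tempfile

def write_compressed_parts(header, rows, out_dir, rows_per_file):
    """Write rows into gzip-compressed CSV part files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    handle = None
    writer = None
    for index, row in enumerate(rows):
        if index % rows_per_file == 0:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"part-{len(paths):05d}.csv.gz")
            handle = gzip.open(path, "wt", newline="", encoding="utf-8")
            writer = csv.writer(handle)
            writer.writerow(header)  # repeat the header in every part
            paths.append(path)
        writer.writerow(row)
    if handle is not None:
        handle.close()
    return paths

# Usage with hypothetical sample data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    sample = [[str(i), f"user{i}@example.com"] for i in range(5)]
    part_paths = write_compressed_parts(["id", "email"], sample, tmp, rows_per_file=2)
    part_names = [os.path.basename(p) for p in part_paths]
    with gzip.open(part_paths[-1], "rt", newline="", encoding="utf-8") as fh:
        last_part = list(csv.reader(fh))
```

The resulting part files can then be uploaded to cloud storage, where they can be read in parallel.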