File format: PSV¶

A pipe-separated values (PSV) file is a delimited text file that uses a pipe (|) to separate values. A PSV file stores tabular data in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by pipes. The use of the pipe as a field separator is the source of the name for this file format.
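
As an illustration of the record and field structure described above, the following minimal sketch reads PSV text with the Python standard library `csv` module. The column names and sample values are made up for the example, and the sketch assumes the first line is a header row.

```python
import csv
import io


def read_psv(text_stream):
    """Return one dict per data record, keyed by the header row."""
    # The only difference from reading CSV is the pipe delimiter.
    reader = csv.DictReader(text_stream, delimiter="|")
    return list(reader)


# Hypothetical sample data: one header row and two records.
sample = io.StringIO(
    "customer_id|email|city\n"
    "1|ada@example.com|London\n"
    "2|grace@example.com|Arlington\n"
)
records = read_psv(sample)
print(records[0]["email"])  # ada@example.com
```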

Tip

Consider using Apache Avro and Apache Parquet file formats instead of PSV.

Avro uses a JSON-like schema and stores data in rows. Avro files are compact and transfer quickly.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.

File sizes¶

The size of a PSV file cannot exceed 10 GB. A PSV file that is larger than 10 GB must be split into smaller files before it is made available to Amperity. The total number of PSV files in a single ingest job cannot exceed 500,000.
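
One way to split an oversized PSV file before making it available is sketched below. This is not an Amperity tool. The function name `split_psv`, the `part-00001.psv` naming pattern, and the choice to repeat the header row in every part are assumptions made for the example. The sketch also assumes that no field contains an embedded newline and that the header plus any single record fits within `max_bytes`.

```python
import os


def split_psv(path, out_dir, max_bytes):
    """Split the PSV file at `path` into parts no larger than `max_bytes`.

    The header row is copied to the top of every part. Returns the list of
    part paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    written = 0
    try:
        with open(path, "rb") as src:
            header = src.readline()
            for line in src:
                # Start a new part when the next record would exceed the limit.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    # Hypothetical naming pattern for the parts.
                    part_path = os.path.join(
                        out_dir, f"part-{len(parts) + 1:05d}.psv"
                    )
                    out = open(part_path, "wb")
                    out.write(header)
                    written = len(header)
                    parts.append(part_path)
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

For a real file, pass a `max_bytes` value comfortably below the 10 GB limit, for example `split_psv("customers.psv", "parts", 9 * 1000**3)`, where the file name and the safety margin are illustrative.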

Pull PSV files¶

To pull PSV files to Amperity:

Select a data source.
Configure a courier for the location and name of the PSV file.
Define a feed to associate the fields that were selected from the PSV file with semantic tags for customer profiles and interactions, as necessary.

Data sources¶

Pull PSV files to Amperity using one of the following data sources:

Load data¶

Use a feed to associate fields in the PSV file with semantic tags and a courier to pull the PSV file from its upstream data source.

Couriers
Feeds

Couriers¶

A courier brings data from an external system to Amperity.

A courier must specify the location of the PSV file, and then define how that file is to be pulled to Amperity.

File settings
Feed selection

File settings¶

Use the File settings section of the courier configuration page to specify the path to the PSV file and to define formatting within the file, such as escape character, quote character, compression type, or header row.
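
To make these settings concrete, the sketch below shows how the same four properties (escape character, quote character, compression type, and header row) affect parsing when a gzip-compressed PSV file is read with the Python standard library. It illustrates the concepts only and does not reflect how Amperity parses files. The function name and its default values are assumptions.

```python
import csv
import gzip


def read_compressed_psv(path, quotechar='"', escapechar="\\", has_header=True):
    """Read a gzip-compressed PSV file and return (header, rows).

    `header` is None when `has_header` is False.
    """
    # Compression type: gzip. Text mode with newline="" lets csv handle
    # line endings inside quoted fields.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(
            handle,
            delimiter="|",
            quotechar=quotechar,    # wraps fields that contain a pipe
            escapechar=escapechar,  # removes special meaning from the next character
        )
        rows = list(reader)
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows
```

With these defaults, a record such as `1|"Smith|Jones"` yields two fields, `1` and `Smith|Jones`, because the quote character protects the embedded pipe.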

Feed selection¶

Use the Feed selection section of the courier configuration page to identify the feed for which this courier pulls data, and then specify which files are loaded.

From the Load type dropdown select one of:

Load: Use this option to load data to the associated domain table.
Truncate and load: Use this option to delete all rows in the associated domain table, and then load data.

Feeds¶

A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.

Apply profile (PII) semantics to customer records, and apply transaction and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand values that exist across data sources.

Domain SQL¶

Domain SQL reshapes data that has been loaded to Amperity before that data is made available to downstream processes, such as Stitch or customer profiles. Domain SQL uses Spark SQL to support use cases such as building new tables from existing domain tables or reshaping data so that semantic tags for transactions can be applied correctly.

Domain SQL allows the data in PSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:

Send PSV files¶

Amperity can send PSV files to downstream workflows using any of the following destinations:

Split outputs¶

Split delimiter-separated output (CSV, PSV, TSV, or files with custom delimiters) into multiple files to ensure downstream file limits are not exceeded.

Choose “Rows” and set “Rows limit” to a value between “50000” and “10000000”. This is the maximum number of rows for split output files.

Choose “Megabytes” and set “Megabytes limit” to a value between “1 MB” and “2000 MB”. This is the maximum file size.

Additional configuration is required for filename templates.

Set the value of “Split filename template” to “{{file_number}}.csv” to apply a unique, seven-digit, left-padded integer to the filename. For example: “0000001.csv”, “0000002.csv”, and “0000003.csv”.
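
The row-based split and the numbered filenames can be pictured with the following sketch. It mimics the behavior described above in plain Python and is not Amperity's implementation. The function names are hypothetical, the template substitution is a simple string replacement, and the small row limit is used only to keep the example short (the configurable range is 50000 to 10000000).

```python
def split_filename(template, file_number):
    """Fill the file number into the template as a seven-digit, zero-padded integer."""
    return template.replace("{{file_number}}", f"{file_number:07d}")


def chunk_rows(rows, rows_limit):
    """Yield consecutive slices of `rows`, each with at most `rows_limit` rows."""
    for start in range(0, len(rows), rows_limit):
        yield rows[start:start + rows_limit]


def plan_split(rows, rows_limit, template="{{file_number}}.csv"):
    """Map each output filename to the rows it would contain."""
    return {
        split_filename(template, number): chunk
        for number, chunk in enumerate(chunk_rows(rows, rows_limit), start=1)
    }


# Seven rows with a limit of three rows per file produce three files.
plan = plan_split(list(range(7)), rows_limit=3)
print(list(plan))  # ['0000001.csv', '0000002.csv', '0000003.csv']
```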

Use the “Split file directory template” to name the directory into which split files are added.

For example: if the value of “Split file directory template” is “{{now|format:'YYYY'}}.tgz” and the value of “Split filename template” is “{{file_number}}.csv”, Amperity outputs a gzipped tarball named “2025.tgz” that contains files named “0000001.csv”, “0000002.csv”, and “0000003.csv”.
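
The resulting archive layout can be reproduced locally with the Python `tarfile` module, which can help when testing a downstream consumer of split files. The sketch below is an approximation under stated assumptions: `build_tarball` is a hypothetical helper, the year is taken from a `datetime` value in place of the directory template, and the file contents are placeholders.

```python
import io
import tarfile
from datetime import datetime


def build_tarball(files, now=None):
    """Pack `files` (name -> bytes) into a gzipped tarball held in memory.

    Returns (archive_name, archive_bytes). The archive is named after the
    four-digit year of `now`, for example "2025.tgz".
    """
    now = now or datetime.now()
    archive_name = f"{now:%Y}.tgz"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return archive_name, buffer.getvalue()


# Placeholder contents for three split files.
name, data = build_tarball(
    {
        "0000001.csv": b"id|name\n1|Ada\n",
        "0000002.csv": b"id|name\n2|Grace\n",
        "0000003.csv": b"id|name\n3|Edsger\n",
    },
    now=datetime(2025, 6, 1),
)
print(name)  # 2025.tgz
```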