File format: PSV

A pipe-separated values (PSV) file is a delimited text file that uses a pipe to separate values. A PSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by pipes. The use of the pipe as a field separator is the source of the name for this file format.

Tip

Consider using Apache Avro and Apache Parquet file formats instead of PSV.

  • Avro uses a JSON-like schema that stores data in rows. Avro files have a very small file size that transfers quickly.

  • Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.

Pull PSV files

To pull PSV files to Amperity:

  1. Select a filedrop data source.

  2. Define a feed to associate fields in the PSV file with semantic tags; in some situations you may need to use an ingest query to transform data in the PSV file prior to loading it to Amperity.

  3. Configure a courier for the location and name of the PSV file, and then for the load operation (either directly from a feed or from an ingest query).

Data sources

Pull PSV files to Amperity using any filedrop data source:

Load data

For most PSV files, use a feed to associate fields in the PSV file with semantic tags. In some situations, an ingest query may be necessary to transform data prior to loading it to Amperity.

Feeds

A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates that column contains customer profile (PII) and transactions data.

Apply profile (PII) semantics to customer records and transaction, and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand how field relationships should be understood when those values are present across your data sources.

Files with header rows

Most PSV files contain a row of data that defines headers for the data set. Configure the load settings for the courier to accept the headers.

Files with non-standard quotes and separators

Some PSV files may use non-standard characters for quotes and separators, such as * for quotes and ; for separators. If a PSV file contains non-standard characters, you must specify the character in the courier load settings. For example:

"file/quote": "*",
"file/separator": ";"
Files without header rows

A headerless PSV file does not contain a row of data that defines headers for the data set. When working with a headerless PSV file you can configure the load settings for the courier to accept headerless files or you may use an ingest query when the data must be changed in some way prior to loading it to Amperity.

Tip

Before using an ingest query to change data prior to loading it to Amperity, consider using domain SQL after it has been loaded to Amperity. Using domain SQL allows for easier changes to data schemas and (more importantly) does not require data to be truncated, and then reloaded on a regular basis.

Domain SQL

Domain SQL is a reference to using Spark SQL to reshape data that has been loaded to Amperity prior to making it available to downstream processes, such as Stitch or the customer 360 database. Domain SQL is often used to build new tables from existing domain tables, and then reshaping that data into a new table that allows semantic tags for transactions and itemized transactions to be applied correctly.

Domain SQL allows the data in PSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:

Ingest queries

An ingest query is a SQL statement that may be applied to data prior to loading it to a domain table. An ingest query is defined using Spark SQL syntax.

Use Spark SQL to define an ingest query for the PSV file. Use a SELECT statement to specify which fields should be loaded to Amperity. Apply transforms to those fields as necessary.

Couriers

A courier brings data from an external system to Amperity.

A courier must specify the location of the PSV file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:

  1. Load settings

  2. Load operations

Load settings

Use courier load settings to specify the path to the PSV file, a file tag (which can be the same as the name of the PSV file), if the file does not contain header rows, and the "text/pipe-separated-values" content type.

With some PSV files it may be necessary to specify the character used for quotes and the character used as a separator.

with header rows
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/pipe-separated-values"
  }
},
with non-standard quotes and separators
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": 1,
     "file/tag": "FILE_NAME",
     "file/quote": "*",
     "file/separator": ";",
     "file/content-type": "text/pipe-separated-values"
  }
},
without header rows

If a PSV file is headerless, change the load setting for "file/header-rows" from 1 to 0.

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
  "object/land-as": {
     "file/header-rows": p,
     "file/tag": "FILE_NAME",
     "file/content-type": "text/pipe-separated-values"
  }
},
wildcards in archives

A wildcard can be used to match one (or more) files in an archive.

The following example shows how to use a wildcard to match a set of CSV files contained within the same archive:

[
  {
    "archive/contents": {
      "ArchiveName/files_*.csv:" {
        "subobject/land-as": {
          "file/header-rows": 1,
          "file/separator": ",",
          "file/tag": "launch",
          "file/content-type": "text/csv"
        }
      }
    },
    "object/type": "archive",
    "object/file-pattern": "/path/to/archive/ArchiveName.zip"
  }
]

Load operations

Use courier load operations to associate a feed ID to the courier, apply the same file tag as the one used for load settings. Load operations for an ingest query may specify a series of options.

Load from feed
{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME"
    }
  ]
}
Load from ingest query
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
          "options": {
            "delimiter": "\|",
            "escape": "\\",
            "multiline": "true",
            "quote": "\""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}

Send PSV files

Amperity can send PSV files to downstream workflows using any filedrop destination: