File format: PSV¶
A pipe-separated values (PSV) file is a delimited text file that uses a pipe to separate values. A PSV file stores tabular data in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by pipes. The use of the pipe as a field separator is the source of the name for this file format.
Tip
Consider using Apache Avro and Apache Parquet file formats instead of PSV.
Avro uses a JSON-like schema that stores data in rows. Avro files have a very small file size that transfers quickly.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.
File sizes¶
The size of a PSV file cannot exceed 10 GB. A PSV file that is larger than 10 GB must be split into smaller files before it is made available to Amperity. The total number of PSV files in a single ingest job cannot exceed 500,000.
Pull PSV files¶
To pull PSV files to Amperity:
Select a filedrop data source.
Define a feed to associate fields in the PSV file with semantic tags; in some situations you may need to use an ingest query to transform data in the PSV file prior to loading it to Amperity.
Configure a courier for the location and name of the PSV file, and then for the load operation (either directly from a feed or from an ingest query).
Data sources¶
Pull PSV files to Amperity using any filedrop data source:
Load data¶
For most PSV files, use a feed to associate fields in the PSV file with semantic tags. In some situations, an ingest query may be necessary to transform data prior to loading it to Amperity.
Feeds¶
A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.
Apply profile (PII) semantics to customer records and transaction, and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand how field relationships should be understood when those values are present across your data sources.
Files with header rows¶
Most PSV files contain a row of data that defines headers for the data set. Configure the load settings for the courier to accept the headers.
Files with non-standard quotes and separators¶
Some PSV files may use non-standard characters for quotes and separators, such as *
for quotes and ;
for separators. If a PSV file contains non-standard characters, you must specify the character in the courier load settings. For example:
"file/quote": "*",
"file/separator": ";"
Files without header rows¶
A headerless PSV file does not contain a row of data that defines headers for the data set. When working with a headerless PSV file you can configure the load settings for the courier to accept headerless files or you may use an ingest query when the data must be changed in some way prior to loading it to Amperity.
Tip
Before using an ingest query to change data prior to loading it to Amperity, consider using domain SQL after it has been loaded to Amperity. Using domain SQL allows for easier changes to data schemas and (more importantly) does not require data to be truncated, and then reloaded on a regular basis.
Domain SQL¶
Domain SQL reshapes data before loading it to Amperity and making that data available to downstream process, such as Stitch or customer profiles. Domain SQL uses Spark SQL to support use cases, such as building new tables from existing domain tables or reshaping data to allow correctly apply semantic tags for transactions.
Domain SQL allows the data in PSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:
Ingest queries¶
An ingest query is a SQL statement that runs against data before loading data to a domain table. Use Spark SQL to define an ingest query.
Use Spark SQL to define an ingest query for the PSV file. Use a SELECT statement to specify which fields should be loaded to Amperity. Apply transforms to those fields as necessary.
Couriers¶
A courier brings data from an external system to Amperity.
A courier must specify the location of the PSV file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:
Load settings¶
Use courier load settings to specify the path to the PSV file, a file tag (which can be the same as the name of the PSV file), if the file does not contain header rows, and the "text/pipe-separated-values"
content type.
With some PSV files it may be necessary to specify the character used for quotes and the character used as a separator.
with header rows¶
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "FILE_NAME",
"file/content-type": "text/pipe-separated-values"
}
},
with non-standard quotes and separators¶
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "FILE_NAME",
"file/quote": "*",
"file/separator": ";",
"file/content-type": "text/pipe-separated-values"
}
},
without header rows¶
If a PSV file is headerless, change the load setting for "file/header-rows"
from 1
to 0
.
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.psv'",
"object/land-as": {
"file/header-rows": p,
"file/tag": "FILE_NAME",
"file/content-type": "text/pipe-separated-values"
}
},
wildcards in archives¶
A wildcard can be used to match one (or more) files in an archive.
The following example shows how to use a wildcard to match a set of CSV files contained within the same archive:
[
{
"archive/contents": {
"ArchiveName/files_*.csv:" {
"subobject/land-as": {
"file/header-rows": 1,
"file/separator": ",",
"file/tag": "launch",
"file/content-type": "text/csv"
}
}
},
"object/type": "archive",
"object/file-pattern": "/path/to/archive/ArchiveName.zip"
}
]
Load operations¶
Use courier load operations to associate a feed ID to the courier, apply the same file tag as the one used for load settings. Load operations for an ingest query may specify a series of options.
Load from feed¶
{
"FEED_ID": [
{
"type": "OPERATION",
"file": "FILE_NAME"
}
]
}
Load from ingest query¶
{
"FEED_ID": [
{
"type": "spark-sql",
"spark-sql-files": [
{
"file": "FILE_NAME"
"options": {
"delimiter": "\|",
"escape": "\\",
"multiline": "true",
"quote": "\""
}
}
],
"spark-sql-query": "INGEST_QUERY_NAME"
}
]
}
Send PSV files¶
Amperity can send PSV files to downstream workflows using any filedrop destination: