File format: TSV¶
A tab-separated values (TSV) file is a delimited text file that uses a tab to separate values and stores tabular data in plain text. Each line in the file is a data record. Each record consists of one or more fields, separated by tabs.
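The record and field structure described above can be illustrated with a minimal sketch that uses Python's standard `csv` module. The sample data and the `read_tsv` helper are hypothetical and only show how tabs separate fields and newlines separate records; they are not part of Amperity.

```python
import csv
import io

# Hypothetical sample data. A real file would typically be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample = (
    "customer_id\temail\tcity\n"
    "1\ta@example.com\tSeattle\n"
    "2\tb@example.com\tPortland\n"
)

def read_tsv(stream):
    """Return one dict per data record, keyed by the header row."""
    reader = csv.DictReader(stream, delimiter="\t")
    return list(reader)

records = read_tsv(io.StringIO(sample))
print(records[0]["email"])
```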
Tip
Consider using Apache Avro and Apache Parquet file formats instead of TSV.
Avro uses a JSON-like schema that stores data in rows. Avro files have a very small file size that transfers quickly.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.
File sizes¶
The size of a TSV file cannot exceed 10 GB. A TSV file that is larger than 10 GB must be split into smaller files before it is made available to Amperity. The total number of TSV files in a single ingest job cannot exceed 500,000.
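One way to split an oversized file before making it available is sketched below. This is an illustration under stated assumptions, not an Amperity tool: the `split_tsv` helper and the `part_0001.tsv` naming are hypothetical, 10 GB is taken as decimal gigabytes, the first line is assumed to be a header row that should be repeated in every part, and fields are assumed not to contain embedded newlines.

```python
import os
import tempfile

MAX_BYTES = 10 * 1000**3  # assumption: 10 GB counted in decimal gigabytes

def split_tsv(path, out_dir, max_bytes=MAX_BYTES):
    """Split a TSV file into parts no larger than max_bytes.

    The header row is repeated at the top of every part. A single row
    that is larger than the limit still gets written to its own part.
    """
    parts = []
    out = None
    size = 0
    with open(path, "rb") as src:
        header = src.readline()
        for line in src:
            # Start a new part when the next row would exceed the limit.
            if out is None or size + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"part_{len(parts) + 1:04d}.tsv")
                out = open(part_path, "wb")
                out.write(header)
                size = len(header)
                parts.append(part_path)
            out.write(line)
            size += len(line)
    if out is not None:
        out.close()
    return parts

# Small demonstration with a tiny limit so the split is visible.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "big.tsv")
    with open(src_path, "wb") as f:
        f.write(b"id\tname\n" + b"".join(b"%d\ta\n" % i for i in range(1, 6)))
    parts = split_tsv(src_path, tmp, max_bytes=16)
    sizes = [os.path.getsize(p) for p in parts]
print(len(parts), sizes)
```

Splitting on row boundaries keeps every record intact, and repeating the header keeps each part loadable on its own.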
Pull TSV files¶
To pull TSV files to Amperity:
Select a data source.
Configure a courier for the location and name of the TSV file.
Define a feed to associate fields in the TSV file with semantic tags.
Data sources¶
Pull TSV files to Amperity using one of the following data sources:
Recommendations¶
When using TSV files, it is recommended to:
Use column headers (with no special characters except underscores)
Ensure duplicate header names are not present
Ensure one (or more) fields are present that can act as a unique identifier
Use a tab as the delimiter for fields; use a newline character as the delimiter for rows
Escape tabs or quotes that appear in the data
Quote string values
Encode files in UTF-8 or UTF-16. Amperity automatically detects the 2-byte header present with the UTF-16 encoding format. If the 2-byte header is missing, the file is treated as UTF-8.
Compress files prior to encryption using ZIP, GZIP, and/or TAR. Amperity automatically decompresses GZIP files; ZIP and TAR decompression must be specified in courier file load settings.
Encrypt files using PGP; compression will not reduce the size of an encrypted file
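Some of the recommendations above can be checked before a file is sent. The following sketch is illustrative only: `check_headers` and `guess_encoding` are hypothetical helpers, and the pattern of letters, digits, and underscores is an assumed reading of "no special characters except underscores". The encoding check mirrors the behavior described above, where a missing 2-byte header means the file is treated as UTF-8.

```python
import codecs
import re

# Assumption: "no special characters" means letters, digits, and underscores only.
HEADER_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")

def check_headers(header_line):
    """Return a list of problems found in a tab-separated header row."""
    problems = []
    seen = set()
    for name in header_line.rstrip("\r\n").split("\t"):
        if not HEADER_PATTERN.match(name):
            problems.append(f"invalid characters: {name!r}")
        if name in seen:
            problems.append(f"duplicate header: {name!r}")
        seen.add(name)
    return problems

def guess_encoding(first_bytes):
    """Return "utf-16" when a 2-byte byte order mark is present, else "utf-8"."""
    if first_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"

problems = check_headers("customer_id\temail\temail\tfirst name\n")
print(problems)
print(guess_encoding(b"\xff\xfeh\x00"))
```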
Load data¶
Use a feed to associate fields in the TSV file with semantic tags and a courier to pull the TSV file from its upstream data source.
Couriers¶
A courier brings data from an external system to Amperity.
A courier must specify the location of the TSV file, and then define how that file is to be pulled to Amperity.
File settings¶
Use the File settings section of the courier configuration page to specify the path to the TSV file and to define formatting within the file, such as the escape character, quote character, compression type, or header row.
Feed selection¶
Use the Feed selection section of the courier configuration page to identify the feed for which this courier pulls data, and then which files are loaded.
From the Load type dropdown select one of:
Load: Use this option to load data to the associated domain table.
Truncate and load: Use this option to delete all rows in the associated domain table, and then load data.
Feeds¶
A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.
Apply profile (PII) semantics to customer records, and apply transaction and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand values that exist across data sources.
Domain SQL¶
Domain SQL reshapes data before loading it to Amperity and making that data available to downstream processes, such as Stitch or customer profiles. Domain SQL uses Spark SQL to support use cases such as building new tables from existing domain tables or reshaping data so that semantic tags for transactions can be applied correctly.
Domain SQL allows the data in TSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:
Send TSV files¶
Amperity can send TSV files to downstream workflows using any of the following destinations:
Split outputs¶
Split delimiter-separated output (CSV, PSV, TSV, or files with custom delimiters) into multiple files to ensure downstream file limits are not exceeded.
Choose “Rows” and set “Rows limit” to a value between “50000” and “10000000”. This is the maximum number of rows for split output files.
Choose “Megabytes” and set “Megabytes limit” to a value between “1 MB” and “2000 MB”. This is the maximum file size.
Additional configuration is required for filename templates.
Set the value of “Split filename template” to “{{file_number}}.csv” to apply a unique, seven-digit, left-padded integer to the filename. For example: “0000001.csv”, “0000002.csv”, and “0000003.csv”.
Use the “Split file directory template” to name the directory into which split files are added.
For example: if the value of “Split file directory template” is “{{now|format:'YYYY'}}.tgz” and the value of “Split filename template” is “{{file_number}}.csv”, Amperity will output a gzipped tarball named “2025.tgz” with subfiles named “0000001.csv”, “0000002.csv”, and “0000003.csv”.
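The relationship between a rows limit and the resulting file names can be sketched as follows. This is not Amperity's implementation: `split_filenames` is a hypothetical helper that assumes numbering starts at 1 and that the “{{file_number}}” placeholder is replaced with a seven-digit, zero-padded integer, as in the examples above.

```python
import math

def split_filenames(total_rows, rows_limit, template="{{file_number}}.csv"):
    """Return the file names expected when total_rows are split by rows_limit."""
    if not 50_000 <= rows_limit <= 10_000_000:
        raise ValueError("rows_limit must be between 50000 and 10000000")
    # Each file holds at most rows_limit rows; the last file may hold fewer.
    count = math.ceil(total_rows / rows_limit)
    return [
        template.replace("{{file_number}}", f"{n:07d}")
        for n in range(1, count + 1)
    ]

names = split_filenames(total_rows=120_000, rows_limit=50_000)
print(names)
```

With 120,000 rows and a rows limit of 50,000, the sketch yields three names, matching the “0000001.csv”, “0000002.csv”, and “0000003.csv” pattern shown above.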