File format: TSV¶
A tab-separated values (TSV) file is a delimited text file that uses a tab to separate values. A TSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by tabs. The use of the tab as a field separator is the source of the name for this file format.
Tip
Consider using Apache Avro and Apache Parquet file formats instead of TSV.
Avro uses a JSON-like schema that stores data in rows. Avro files have a very small file size that transfers quickly.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.
Pull TSV files¶
To pull TSV files to Amperity:
Select a filedrop data source.
Define a feed to associate fields in the TSV file with semantic tags; in some situations you may need to use an ingest query to transform data in the TSV file prior to loading it to Amperity.
Configure a courier for the location and name of the TSV file, and then for the load operation (either directly from a feed or from a ingest query).
Data sources¶
Pull TSV files to Amperity using any filedrop data source:
Recommendations¶
When using TSV files, it is recommend to:
Use column headers (with no special characters except underscores)
Ensure duplicate header names are not present
Ensure one (or more) fields are present that can act as a unique identifier
Use a comma as the delimiter for fields; use a newline character as the delimiter for rows
Escape commas or quotes that appear in the data
Quote string values
Encode files in UTF-8 or UTF-16. Amperity automatically detects the 2-byte header present with the UTF-16 encoding format. If the 2-byte header is missing, the file is treated as UTF-8.
Compress files prior to encryption using ZIP, GZIP, and/or TAR. Amperity automatically decompresses GZIP files; ZIP and TAR decompression must be specified in courier file load settings.
Encrypt files using PGP; compression will not reduce the size of an encrypted file
Load data¶
For most TSV files, use a feed to associate fields in the TSV file with semantic tags. In some situations, an ingest query may be necessary to transform data prior to loading it to Amperity.
Feeds¶
A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates that column contains customer profile (PII) and transactions data.
Apply profile (PII) semantics to customer records and transaction, and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand how field relationships should be understood when those values are present across your data sources.
Files with header rows¶
Most TSV files contain a row of data that defines headers for the data set. Configure the load settings for the courier to accept the headers.
Files with non-standard quotes and separators¶
Some TSV files may use non-standard characters for quotes and separators, such as *
for quotes and ;
for separators. If a TSV file contains non-standard characters, you must specify the character in the courier load settings. For example:
"file/quote": "*",
"file/separator": ";"
Files without header rows¶
A headerless TSV file does not contain a row of data that defines headers for the data set. When working with a headerless TSV file you can configure the load settings for the courier to accept headerless files or you may use an ingest query when the data must be changed in some way prior to loading it to Amperity.
Tip
Before using an ingest query to change data prior to loading it to Amperity, consider using domain SQL after it has been loaded to Amperity. Using domain SQL allows for easier changes to data schemas and (more importantly) does not require data to be truncated, and then reloaded on a regular basis.
Domain SQL¶
Domain SQL is a reference to using Spark SQL to reshape data that has been loaded to Amperity prior to making it available to downstream processes, such as Stitch or the customer 360 database. Domain SQL is often used to build new tables from existing domain tables, and then reshaping that data into a new table that allows semantic tags for transactions and itemized transactions to be applied correctly.
Domain SQL allows the data in TSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:
Ingest queries¶
An ingest query is a SQL statement that may be applied to data prior to loading it to a domain table. An ingest query is defined using Spark SQL syntax.
Use Spark SQL to define an ingest query for the TSV file. Use a SELECT statement to specify which fields should be loaded to Amperity. Apply transforms to those fields as necessary.
Ordinal column names¶
Some situations require using an ingest query to load a headerless TSV file. Use a SELECT statement to apply ordinal column names, and then aliasing each column name to a field name if desired.
A headerless file should use column names that are based on ordinal positions using the following pattern:
_c0
_c1
_c2
_c3
...
_cN
Add these column names within the SELECT statement. You may use aliasing to apply better column names within the ingest query.
For example, using ordinal columns to filter a long list of clickstream data points down to a usable schema for use within Amperity:
SELECT
_c0 AS click_time_gmt
,_c1 AS datetime
,_c2 AS ip_address
,_c3 AS page_event
,_c4 AS duplicate_purchase
,_c5 AS campaign_name
,_c6 AS click_source
,_c7 AS channel_name
,_c8 AS product_list
...
FROM table
Couriers¶
A courier brings data from an external system to Amperity.
A courier must specify the location of the TSV file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:
Load settings¶
Use courier load settings to specify the path to the TSV file, a file tag (which can be the same as the name of the TSV file), if the file does not contain header rows, and the "text/tab-separated-values"
content type.
With some TSV files it may be necessary to specify the character used for quotes and the character used as a separator.
with header rows¶
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.tsv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "FILE_NAME",
"file/content-type": "text/tab-separated-values"
}
},
with non-standard quotes and separators¶
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.tsv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "FILE_NAME",
"file/quote": "*",
"file/separator": ";",
"file/content-type": "text/tab-separated-values"
}
},
without header rows¶
If a TSV file is headerless, change the load setting for "file/header-rows"
from 1
to 0
.
{
"object/type": "file",
"object/file-pattern": "'path/to/file'-YYYY-MM-dd'.tsv'",
"object/land-as": {
"file/header-rows": 0,
"file/tag": "FILE_NAME",
"file/content-type": "text/tab-separated-values"
}
},
wildcards in archives¶
A wildcard can be used to match one (or more) files in an archive.
The following example shows how to use a wildcard to match a set of CSV files contained within the same archive:
[
{
"archive/contents": {
"ArchiveName/files_*.csv:" {
"subobject/land-as": {
"file/header-rows": 1,
"file/separator": ",",
"file/tag": "launch",
"file/content-type": "text/csv"
}
}
},
"object/type": "archive",
"object/file-pattern": "/path/to/archive/ArchiveName.zip"
}
]
Load operations¶
Use courier load operations to associate a feed ID to the courier, apply the same file tag as the one used for load settings. Load operations for an ingest query may specify a series of options.
Load from feed¶
{
"FEED_ID": [
{
"type": "OPERATION",
"file": "FILE_NAME"
}
]
}
Load from ingest query¶
{
"FEED_ID": [
{
"type": "spark-sql",
"spark-sql-files": [
{
"file": "FILE_NAME",
"options": {
"delimiter": "\t",
"escape": "\\",
"multiline": "true",
"quote": "\""
}
}
],
"spark-sql-query": "INGEST_QUERY_NAME"
}
]
}
Send TSV files¶
Amperity can send TSV files to downstream workflows using any filedrop destination: