CSV files¶

A comma-separated values (CSV) file, defined by RFC 4180 , is a delimited text file that uses a comma to separate values. A CSV file stores tabular data–numbers and text–in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

Tip

Consider using Apache Avro and Apache Parquet file formats instead of CSV.

Avro uses a JSON-like schema that stores data in rows. Avro files have a very small file size that transfers quickly.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.

File sizes¶

The size of a CSV file cannot exceed 10 GB. A CSV file that is larger than 10 GB must be split into smaller files before it is made available to Amperity. The total number of CSV files in a single ingest job cannot exceed 500,000.

Double quotes¶

RFC 4180 has the following guidelines for the use of double quotes in CSV files:

A field may or may not be enclosed by double quotes. When a field is not enclosed by double quotes, double quotes may not appear inside a field.
A field that contains line breaks (CRLF), double quotes, and/or commas should be enclosed in double quotes.
When a field is enclosed by double quotes, a double quote that appears inside a field must be escaped using a double quote.

When an escape character is not specified, Amperity will assume that the escape character is double quotes. This follows the RFC as closely as possible: “If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.”

CSV files, generally, do not have a formal specification, which allows for a wide variety of implementations that can create very specific problems.

If you notice data quality issues from a data source that uses a CSV file, consider updating that CSV file to adhere to the RFC as closely as possible, including applying a consistent approach for the use of double quotes, and then ensuring the escape character is compatible with the data provided by that data source.

For example, many Polynesian languages use an apostrophe (’) or a single quote (ʻ) at the start of (and within) first, last, and location names: Hawai’i, O’ahu, Kau’i, U’ilani. In Hawai’ian, a grave accent (`) is often used instead of an apostrophe: Hawai`ian. (Apostrophes within names is not unique to Polynesian languages.)

A courier quote configuration similar to """ will allow string values like:

"Kau'i","Hawai'i","O'ahu","A description of 'places'"

and:

Kau'i,Hawai'i,O'ahu,A description of 'places'

If the escape character is a single quote ("'"), similar string values must be escaped:

Kau''i,Hawai''i,O''ahu,A description of ''places''

Pull CSV files¶

To pull CSV files to Amperity:

Select a data source.
Configure a courier for the location and name of the CSV file.
Define a feed to associate fields in the CSV file with semantic tags.

Data sources¶

Pull CSV files to Amperity using one of the following data sources:

Recommendations¶

When using CSV files, it is recommend to:

Use column headers (with no special characters except underscores)

Note

When loading data from a CSV file that does not have headers, all fields in the feed are enabled and are required.
Ensure duplicate header names are not present
Ensure one (or more) fields are present that can act as a unique identifier
Use a comma as the delimiter for fields; use a newline character as the delimiter for rows
Escape commas or quotes that appear in the data
Quote string values
Encode files in UTF-8 or UTF-16. Amperity automatically detects the 2-byte header present with the UTF-16 encoding format. If the 2-byte header is missing, the file is treated as UTF-8.
Compress files prior to encryption using ZIP, GZIP, and/or TAR. Amperity automatically decompresses GZIP files; ZIP and TAR decompression must be specified in courier file load settings.
Encrypt files using PGP; compression will not reduce the size of an encrypted file

Load data¶

Use a feed to associate fields in the CSV file with semantic tags and a courier to pull the CSV file from its upstream data source.

Couriers
Feeds

Couriers¶

A courier brings data from an external system to Amperity.

A courier must specify the location of the CSV file, and then define how that file is to be pulled to Amperity.

File settings
Feed selection

File settings¶

Use the File settings section of the courier configuration page to specify the path to the CSV file and to define formattting within the file, such as escape character, quote charcter, compression type or header row.

Feed selection¶

Use the Feed selection section of the courier configuration page to identify the feed for which this courier pulls data, and then which files are loaded.

From the Load type dropdown select one of:

Load Use this option to load data to the associated domain table.
Truncate and load Use this option to delete all rows in the associated domain table, and then load data.

AS/400¶

AS/400 is a computer platform that can provide raw data to Amperity as zipped CSV files configured as a one-time export or as a recurring export. For recurring exports, the customer must set up a workflow that collects this data, and then puts it into a location from which Amperity can pull data (Amazon S3, Azure Blob Storage, Google Cloud Storage, or SFTP).

Data must be configured to be exported as a collection of zipped CSV tables. For example, tables for customer profiles, orders, order items, shipments, emails, and so on.
Records within these tables should have a common identifier, such as a customer number. Use this identifier for foreign key tagging within feeds.
A data dictionary must be defined for these tables, and then provided to Amperity.
Verify date formats. These may require additional processing within Amperity that transforms them into the preferred format.
Review the data within the tables, and then use domain SQL for additional data shaping, as necessary.

Feeds¶

A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.

Apply profile (PII) semantics to customer records and transaction, and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand values that exist across data sources.

Domain SQL¶

Domain SQL reshapes data before loading it to Amperity and making that data available to downstream process, such as Stitch or customer profiles. Domain SQL uses Spark SQL to support use cases, such as building new tables from existing domain tables or reshaping data to allow correctly apply semantic tags for transactions.

Domain SQL allows the data in CSV files to be transformed after it has been loaded to Amperity. Some common use cases for using domain SQL to transform data include:

Send CSV files¶

Amperity can send CSV files to downstream workflows using any of the following destinations:

Split outputs¶

Split delimiter-separated output–CSV, PSV, TSV, or files with custom delimiters–into multiple files to ensure downstream file limits are not exceeded.

Choose “Rows” and set “Rows limit” to a value between “50000” and “10000000”. This is the maximum number of rows for split output files.

Choose “Megabytes” and set “Megabytes limit” to a value between “1 MB” and “2000 MB”. This is the maximum file size.

Additional configuration is required for filename templates.

Set the value of “Split filename template” to “{{file_number}}.csv” to apply a unique seven digit left-padded integer to the filename. For example: “0000001.csv”, “0000002.csv”, and “0000003.csv”.

Use the “Split file directory template” to name the directory into which split files are added.

For example: if the value of “Split file directory template” is {{now|format:’YYYY’}}.tgz and the value of “Split filename template” is “{{file_number}}.csv” Amperity will output a gzipped tarball named “2025.tgz” with subfiles named “0000001.csv”, “0000002.csv”, and “0000003.csv”.