File format: CSV¶
A comma-separated values (CSV) file, defined by RFC 4180, is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
Tip
Consider using Apache Avro and Apache Parquet file formats instead of CSV.
Avro uses a JSON-based schema and stores data in a compact, row-based binary format, which keeps file sizes small and transfers fast.
Parquet is highly compact, can be transferred easily, and avoids escape character and data formatting issues that can be present in other formats.
Double quotes¶
RFC 4180 has the following guidelines for the use of double quotes in CSV files:
A field may or may not be enclosed by double quotes. When a field is not enclosed by double quotes, double quotes may not appear inside a field.
A field that contains line breaks (CRLF), double quotes, and/or commas should be enclosed in double quotes.
When a field is enclosed by double quotes, a double quote that appears inside a field must be escaped using a double quote.
When an escape character is not specified, Amperity assumes that the escape character is a double quote. This follows the RFC as closely as possible: “If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.”
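For example, in the following record (a minimal illustration with hypothetical values), the second field contains both a comma and an embedded double quote, so the field is enclosed in double quotes and the inner double quote is doubled:
name,description
Leilani,"Aloha, and ""welcome"" to the island"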
In practice, however, CSV files often do not adhere to a formal specification, which allows for a wide variety of implementations that can create very specific problems.
If you notice data quality issues from a data source that uses a CSV file, consider updating that CSV file to adhere to the RFC as closely as possible, including applying a consistent approach for the use of double quotes, and then ensuring the escape character is compatible with the data provided by that data source.
For example, many Polynesian languages use an apostrophe (’) or a single quote (ʻ) at the start of (and within) first, last, and location names: Hawai’i, O’ahu, Kau’i, U’ilani. In Hawai’ian, a grave accent (`) is often used instead of an apostrophe: Hawai`ian. (The use of apostrophes within names is, of course, not unique to Polynesian languages.)
A courier configuration similar to:
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.csv'",
  "object/land-as": {
    "file/escape": "?",
    "file/quote": "\""
  }
},
will allow string values like:
"Kau'i","Hawai'i","O'ahu","A description of 'places'"
and:
Kau'i,Hawai'i,O'ahu,A description of 'places'
If the escape character is a single quote ("file/escape": "'"), similar string values must be escaped:
Kau''i,Hawai''i,O''ahu,A description of ''places''
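A courier configuration that declares the single quote as the escape character might look similar to the following sketch (reusing the placeholder file pattern from the example above):
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.csv'",
  "object/land-as": {
    "file/escape": "'",
    "file/quote": "\""
  }
},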
Pull CSV files¶
To pull CSV files to Amperity:
Select a filedrop data source.
Define a feed to associate fields in the CSV file with semantic tags; in some situations you may need to use an ingest query to transform data in the CSV file prior to loading it to Amperity.
Configure a courier for the location and name of the CSV file, and then for the load operation (either directly from a feed or from an ingest query).
Data sources¶
Pull CSV files to Amperity using any filedrop data source: Amazon S3, Azure Blob Storage, Google Cloud Storage, or SFTP.
Recommendations¶
When using CSV files, it is recommended that you:
Use column headers (with no special characters except underscores)
Note
When loading data from a CSV file that does not have headers, all fields in the feed are enabled and are required.
Ensure duplicate header names are not present
Ensure one (or more) fields are present that can act as a unique identifier
Use a comma as the delimiter for fields; use a newline character as the delimiter for rows
Escape commas or quotes that appear in the data
Quote string values
Encode files in UTF-8 or UTF-16. Amperity automatically detects the 2-byte header present with the UTF-16 encoding format. If the 2-byte header is missing, the file is treated as UTF-8.
Compress files prior to encryption using ZIP, GZIP, and/or TAR. Amperity automatically decompresses GZIP files; ZIP and TAR decompression must be specified in courier file load settings.
Encrypt files using PGP; compression will not reduce the size of an encrypted file
Load data¶
For most CSV files, use a feed to associate fields in the CSV file with semantic tags. In some situations, an ingest query may be necessary to transform data prior to loading it to Amperity.
Feeds¶
A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transactions data.
Apply profile (PII) semantics to customer records, and transaction and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand field relationships when those values are present across your data sources.
Domain SQL¶
Domain SQL refers to using Spark SQL to reshape data that has been loaded to Amperity prior to making it available to downstream processes, such as Stitch or the customer 360 database. Domain SQL is often used to build new tables from existing domain tables, reshaping the data into a new table so that semantic tags for transactions and itemized transactions can be applied correctly.
Domain SQL allows the data in CSV files to be transformed after it has been loaded to Amperity. A common use case is reshaping loaded data into the table shapes that transactions and itemized transactions semantics require.
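A domain SQL statement similar to the following sketch (the table and field names are hypothetical) might aggregate an order items table into an order-level table before those semantics are applied:
SELECT
  order_id
  ,customer_id
  ,TO_DATE(order_datetime) AS order_date
  -- one row per order, with the order total summed from its items
  ,SUM(item_amount) AS order_total
FROM OrderItems
GROUP BY order_id, customer_id, TO_DATE(order_datetime)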
Ingest queries¶
An ingest query is a SQL statement that may be applied to data prior to loading it to a domain table. An ingest query is defined using Spark SQL syntax.
Use Spark SQL to define an ingest query for the CSV file. Use a SELECT statement to specify which fields should be loaded to Amperity. Apply transforms to those fields as necessary.
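For example, an ingest query similar to the following sketch (the field and table names are hypothetical) selects only the fields to load and normalizes values along the way:
SELECT
  user_id
  ,LOWER(TRIM(email)) AS email
  ,UPPER(state) AS state
  -- normalize a text date into a DATE value
  ,TO_DATE(signup_date, 'MM/dd/yyyy') AS signup_date
FROM SignupsTable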
Flatten rows¶
Some CSV output contains rows where a field within that row contains multiple lines of data. For example, conversions data pulled from Attentive Mobile has a field named message_text that contains the lines of an SMS message, using \n to separate each line of the message.
This should be flattened using an ingest query to remove the \n prior to pulling this data to Amperity. For example:
SELECT
  client_id
  ,phone
  ,message_id
  ,message_name
  ,REGEXP_REPLACE(message_text, "\n", "") AS message_text
  ...
FROM AttentiveTable
Nested records¶
Occasionally, a CSV file will contain nested records. For example, a field within the CSV file contains JSON data similar to:
[{"street": "123 Main Street", "apt": "101", "city": "Seattle", "state": "WA", "zip": "98101"}]
which is JSON data in a nested format:
[
  {
    "street": "123 Main Street",
    "apt": "101",
    "city": "Seattle",
    "state": "WA",
    "zip": "98101"
  }
]
There are two general approaches to take when a CSV file contains nested records:
Recommended. The customer should update the process for how the data is provided to Amperity to ensure the file does not contain nested records.
If the file cannot be provided without nested records, use an ingest query to flatten the data in these fields prior to loading it to Amperity.
Note
This will increase the preprocessing effort required by Amperity. Large datasets will take more time to process; avoid this approach with very large datasets and use it carefully with datasets that are processed on a daily basis.
For example:
WITH explodedData AS (
  SELECT
    table_id
    ,EXPLODE(combined_address) AS address
  FROM Source
)
SELECT
  table_id
  ,address.street AS address
  ,address.apt AS address2
  ,address.city AS city
  ,address.state AS state
  ,address.zip AS postal
FROM explodedData
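EXPLODE() requires an array value. When the nested records land as a JSON string (as is typical in a CSV file), parse the string first, for example with FROM_JSON(). This is a sketch that assumes the address schema shown above:
WITH parsedData AS (
  SELECT
    table_id
    -- parse the JSON string into an array of address structs
    ,FROM_JSON(
      combined_address,
      'ARRAY<STRUCT<street: STRING, apt: STRING, city: STRING, state: STRING, zip: STRING>>'
    ) AS combined_address
  FROM Source
)
SELECT
  table_id
  ,EXPLODE(combined_address) AS address
FROM parsedData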
Ordinal column names¶
Some situations require using an ingest query to load a headerless CSV file. Use a SELECT statement to apply ordinal column names, and then alias each column name to a field name if desired.
A headerless file should use column names that are based on ordinal positions using the following pattern:
_c0
_c1
_c2
_c3
...
_cN
Add these column names within the SELECT statement. You may use aliasing to apply better column names within the ingest query.
For example, using ordinal columns to define a schema for profile data:
SELECT
  _c0 AS user_id
  ,_c1 AS given_name
  ,_c2 AS surname
  ,_c3 AS email
  ,_c4 AS phone
FROM table
Couriers¶
A courier brings data from an external system to Amperity.
A courier must specify the location of the CSV file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:
Load settings¶
Use courier load settings to specify the path to the CSV file, a file tag (which can be the same as the name of the CSV file), whether the file contains header rows, and the "text/csv" content type.
With some CSV files it may be necessary to specify the character used for quotes and the character used as a separator.
with header rows¶
Most CSV files contain a header row that defines the column names for the data set. Configure the load settings for the courier to accept header rows.
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.csv'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "FILE_NAME",
    "file/content-type": "text/csv"
  }
},
with non-standard quotes and separators¶
Some CSV files may use non-standard characters for quotes and separators, such as ' for quotes and \ for separators. If a CSV file contains non-standard characters, you must specify these characters in the courier load settings.
Important
The syntax used within courier configuration blocks is JSON, which uses a backslash as the escape character. You must escape a backslash in JSON using a backslash: "file/escape": "\\" (and not "file/escape": "\").
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.csv'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "FILE_NAME",
    "file/quote": "'",
    "file/separator": "\\",
    "file/content-type": "text/csv"
  }
},
without header rows¶
A headerless CSV file does not contain a row of data that defines headers for the data set. When working with a headerless CSV file, you can configure the load settings for the courier to accept headerless files, or you can use an ingest query when the data must be changed in some way prior to loading it to Amperity.
Tip
Before using an ingest query to change data prior to loading it to Amperity, consider using domain SQL after it has been loaded to Amperity. Using domain SQL allows for easier changes to data schemas and (more importantly) does not require data to be truncated, and then reloaded on a regular basis.
If a CSV file is headerless, change the load setting for "file/header-rows" from 1 to 0.
{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.csv'",
  "object/land-as": {
    "file/header-rows": 0,
    "file/tag": "FILE_NAME",
    "file/content-type": "text/csv"
  }
},
wildcards in archives¶
A wildcard can be used to match one (or more) files in an archive.
The following example shows how to use a wildcard to match a set of CSV files contained within the same archive:
[
  {
    "archive/contents": {
      "ArchiveName/files_*.csv": {
        "subobject/land-as": {
          "file/header-rows": 1,
          "file/separator": ",",
          "file/tag": "launch",
          "file/content-type": "text/csv"
        }
      }
    },
    "object/type": "archive",
    "object/file-pattern": "/path/to/archive/ArchiveName.zip"
  }
]
AS/400¶
AS/400 is a computer platform that can provide raw data to Amperity as zipped CSV files configured as a one-time export or as a recurring export. For recurring exports, the customer must set up a workflow that collects this data, and then puts it into a location from which Amperity can pull data (Amazon S3, Azure Blob Storage, Google Cloud Storage, or SFTP).
Data must be configured to be exported as a collection of zipped CSV tables. For example, tables for customer profiles, orders, order items, shipments, emails, and so on.
Records within these tables should have a common identifier, such as a customer number. Use this identifier for foreign key tagging within feeds.
A data dictionary must be defined for these tables, and then provided to Amperity.
Verify date formats. These may require additional processing within Amperity to transform them into the preferred format, as shown in the sketch after this list.
Review the data within the tables, and then use domain SQL for additional data shaping, as necessary.
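For example, if dates arrive as numeric yyyyMMdd values, domain SQL similar to the following sketch (the table and field names are hypothetical) could convert them to the preferred format:
SELECT
  customer_number
  -- convert a numeric yyyyMMdd value to a DATE
  ,TO_DATE(CAST(order_date AS STRING), 'yyyyMMdd') AS order_date
FROM Orders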
Load operations¶
Use courier load operations to associate a feed ID with the courier, and then apply the same file tag used in the load settings. Load operations for an ingest query may specify a series of options.
Load from feed¶
{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME"
    }
  ]
}
Load from ingest query¶
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "delimiter": ",",
            "escape": "\\",
            "multiline": "true",
            "quote": "\""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}
Caution
Spark does not correctly implement RFC 4180 for escape characters in CSV files. The most common implementations of CSV files expect a double quote (") as an escape character, while Spark uses a backslash (\). For more information about this issue, view the SPARK-22236 issue within the Spark project.
You can override this behavior when working with RFC-compliant CSV files by specifying an escape character in the courier load operations, using ' or " as the escape character.
For example:
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "escape": "'"
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}
If a CSV file uses \ as the delimiter, configure the load operation to specify an empty delimiter value, after which Spark will automatically apply the \ character as the delimiter.
For example:
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME",
          "options": {
            "delimiter": ""
          }
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}
Send CSV files¶
Amperity can send CSV files to downstream workflows using any filedrop destination: Amazon S3, Azure Blob Storage, Google Cloud Storage, or SFTP.