File format: Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but provides more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.

Apache Parquet may be used with any upstream or downstream customer environment that supports the use of Parquet. Parquet is highly compact and can be transferred easily. Parquet embeds data-typing and avoids escape character and data formatting issues that can be present in other formats like CSV and TSV.

Pull Parquet files

To pull Parquet files to Amperity:

  1. Select a filedrop data source.

  2. Define a feed to associate fields in the Parquet file with semantic tags; in some situations you may need to use an ingest query to transform data in the Parquet file prior to loading it to Amperity.

  3. Configure a courier with the location and name of the Parquet file and, if an ingest query is used, the name of that ingest query.

Note

The Zstandard (zstd) compression scheme is not supported when ingesting Apache Parquet files.

Data sources

Pull Apache Parquet files to Amperity using any filedrop data source.

Load data

For most Parquet files, use a feed to associate fields in the Parquet file with semantic tags. In some situations, an ingest query may be necessary to transform data prior to loading it to Amperity.

Feeds

A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transactions data.

Apply profile (PII) semantics to customer records, and transaction and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should interpret field relationships when those values are present across your data sources.
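
A minimal sketch of how fields might map to semantic tags, using hypothetical field names and illustrative (not literal) tag names:

  email_address  ->  profile (PII) semantic for email
  customer_id    ->  foreign key (fk) semantic, shared across profile and transaction files
  order_id       ->  transaction semantic for the order identifier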

Ingest queries

An ingest query is a SQL statement that may be applied to data prior to loading it to a domain table. An ingest query is defined using Spark SQL syntax.

Use Spark SQL to define an ingest query for the Parquet file. Use a SELECT statement to specify which fields should be pulled to Amperity. Apply transforms to those fields as necessary.
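
A minimal sketch of an ingest query, assuming a hypothetical file tag of parquet_orders (referenced here as the table name) and hypothetical column names; because Parquet embeds data types, casts are only needed when a transform is required:

SELECT
    order_id
    ,customer_id
    ,CAST(order_total AS DECIMAL(10,2)) AS order_total
    ,TO_DATE(order_datetime) AS order_date
FROM parquet_orders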

Couriers

A courier brings data from an external system to Amperity.

A courier must specify the location of the Parquet file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:

  1. Load settings

  2. Load operations

Load settings

Use courier load settings to specify the path to the Parquet file, a file tag (which can be the same as the name of the Parquet file), and the "application/x-parquet" content type.

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.parquet/'",
  "object/land-as": {
    "file/tag": "FILE_NAME",
    "file/content-type": "application/x-parquet"
  }
}
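
For example, with a hypothetical run date of January 15, 2024, the file pattern above would resolve to path/to/file-2024-01-15.parquet/.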

Note

Apache Parquet files are almost always partitioned, meaning that a single logical Parquet file is composed of multiple physical files in a directory structure, each representing a partition.

Parquet partitioning optionally allows data to be nested in a directory structure determined by the values of partitioning columns. Amperity only detects Parquet partition files one directory level below the configured file pattern. For example:

"path/to/file-YYYY-MM-dd.parquet/part-0000.parquet"

Load operations

Use courier load operations to associate a feed ID with the courier and to apply the same file tag used in the load settings. Load operations for an ingest query may specify a series of options.

Load from feed
{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME"
    }
  ]
}
Load from ingest query
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}
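
A filled-in sketch of the feed case, using a hypothetical feed ID, a hypothetical file tag, and truncate-and-load as an assumed operation type (consult the feed documentation for the operation types available to you):

{
  "df-12345": [
    {
      "type": "truncate-and-load",
      "file": "parquet_orders"
    }
  ]
}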

Send Parquet files

Apache Parquet is the recommended format for any customer environment that supports the use of Parquet. This is because Parquet embeds data-typing and avoids escape character and data formatting issues that can be present in other formats, such as CSV and TSV. Parquet is also highly compact (file sizes can be up to 20 times smaller) and is a format that systems can load and use quickly.

Amperity overwrites Apache Parquet files when they are sent to the same location.

Tip

By default, Amperity uses Snappy to compress Apache Parquet files prior to sending them to a destination. Unless you need PGP encryption or output as a single file, do not configure additional compression.

Note

Unless Parquet files are configured to be compressed/archived during orchestration, Amperity creates a folder that contains one (or more) Parquet files.

Amperity can send Apache Parquet files to downstream workflows using any filedrop destination.

Configure Parquet directories

The filename template defines the name of the directory into which Apache Parquet files are placed. By default, Amperity appends ".parquet" to the name of the directory. To omit ".parquet" from the directory name, use the Exclude Parquet extension from the directory name setting, which is available when configuring any destination that supports sending Apache Parquet files from Amperity, including many SFTP sites, Amazon S3, Azure Blob Storage, and Google Cloud Storage.
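
For example, a hypothetical filename template of customer_profiles produces:

customer_profiles.parquet/    (default: ".parquet" appended to the directory name)
customer_profiles/            (with Exclude Parquet extension from the directory name enabled)

The Parquet part files are placed inside that directory.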