File format: Apache Parquet

Apache Parquet is a free and open source column-oriented data storage format developed within the Apache Hadoop ecosystem. It is similar to RCFile and ORC, but offers more efficient data compression and encoding schemes with enhanced performance and can better handle large amounts of complex bulk data.

Apache Parquet may be used with any upstream or downstream customer environment that supports the use of Parquet. Parquet is highly compact and can be transferred easily. Parquet embeds data-typing and avoids escape character and data formatting issues that can be present in other formats like CSV and TSV.
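
The data-typing point can be illustrated with a minimal Python sketch that uses only the standard library. The column names and values are made up for illustration. It round-trips one record through CSV: every value comes back as a string, and the embedded comma and quotation marks survive only because the writer escapes them. A Parquet file stores column types in its schema, so a reader does not have to infer them again.

```python
import csv
import io

# Hypothetical record: an integer, a float, and text with a comma and quotes.
record = {"customer_id": 42, "total": 19.99, "note": 'said "hi", then left'}

# Write the record as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# Read it back: CSV carries no type information, so every value is a string.
buffer.seek(0)
restored = next(csv.DictReader(buffer))

print(restored["customer_id"], type(restored["customer_id"]).__name__)
# Show the escaped data row exactly as it was written.
print(buffer.getvalue().splitlines()[1])
```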

Pull Parquet files

To pull Parquet files to Amperity:

  1. Select a filedrop data source.

  2. Configure a courier for the location and name of the Parquet file.

  3. Define a feed to associate fields in the Parquet file with semantic tags.

Note

The Zstandard (zstd) compression scheme is not supported when ingesting Apache Parquet files.
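
One way to check a file before it is placed in a pickup location is to read the codec recorded for each column chunk in the Parquet footer. The sketch below assumes the third-party pyarrow library is installed. The function names are hypothetical and are not part of Amperity.

```python
def codecs_in_file(path):
    """Return the set of compression codec names used in a Parquet file."""
    # pyarrow is imported lazily so that uses_zstd() works without it.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    codecs = set()
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            # Each column chunk records its own codec, for example "SNAPPY".
            codecs.add(row_group.column(col).compression)
    return codecs


def uses_zstd(codecs):
    """Return True if any codec name is Zstandard (pyarrow reports "ZSTD")."""
    return any(name.upper() == "ZSTD" for name in codecs)
```

For a hypothetical file named orders.parquet, uses_zstd(codecs_in_file("orders.parquet")) returns True when at least one column chunk is compressed with zstd. Such a file would need to be rewritten with a different codec before ingestion.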

Data sources

Pull Apache Parquet files to Amperity using one of the following data sources:

Load data

Use a feed to associate fields in the Apache Parquet file with semantic tags and a courier to pull the Apache Parquet file from its upstream data source.

Couriers

A courier brings data from an external system to Amperity.

A courier must specify the location of the Apache Parquet file, and then define how that file is to be pulled to Amperity.

  1. File settings

  2. Feed selection

File settings

Use the File settings section of the courier configuration page to specify the path to the Apache Parquet file and to define formatting within the file.

Note

Apache Parquet files are partitioned: a single logical Parquet file is composed of multiple physical files in a directory structure, each of which represents a partition.

Parquet partitioning optionally allows data to be nested in a directory structure determined by the values of partitioning columns. Amperity detects Parquet partition files only one directory level below the configured file pattern. For example:

"path/to/file-YYYY-MM-dd.parquet/part-0000.parquet"
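
The one-level rule can be sketched in Python as follows. This illustrates the behavior described above and is not Amperity's implementation. The function name, the glob-style pattern, and the sample paths (including the date) are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def detected_partition_files(paths, directory_pattern):
    """Keep only files that sit directly inside a directory matching the pattern."""
    depth = len(PurePosixPath(directory_pattern).parts)
    detected = []
    for path in paths:
        parts = PurePosixPath(path).parts
        # Exactly one level below the pattern: the directory parts plus a file name.
        if len(parts) != depth + 1:
            continue
        parent = str(PurePosixPath(*parts[:depth]))
        if fnmatchcase(parent, directory_pattern):
            detected.append(path)
    return detected


paths = [
    # One level below the configured pattern: detected.
    "path/to/file-2024-01-15.parquet/part-0000.parquet",
    # Nested under a partition-column directory: two levels below, not detected.
    "path/to/file-2024-01-15.parquet/region=us/part-0000.parquet",
]
print(detected_partition_files(paths, "path/to/file-*.parquet"))
```

In this sketch only the first path is returned, because the second path places the partition file two directory levels below the pattern.
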

Feed selection

Use the Feed selection section of the courier configuration page to identify the feed for which this courier pulls data, and then choose how files are loaded.

From the Load type dropdown select one of:

  • Load: Use this option to load data to the associated domain table.

  • Spark: Use this option to load data when the Apache Parquet file contains complex types, such as structs, arrays, or maps.

  • Truncate and load: Use this option to delete all rows in the associated domain table, and then load data.

Feeds

A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.

Apply profile (PII) semantics to customer records, and apply transaction and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand values that exist across data sources.

Send Parquet files

Apache Parquet is the recommended format for any customer environment that supports the use of Parquet. This is because Parquet embeds data-typing and avoids the escape character and data formatting issues that can be present in other formats like CSV and TSV. Parquet is also highly compact (file sizes can be up to 20 times smaller) and is a format that systems can load and use quickly.

Amperity overwrites Apache Parquet files when they are sent to the same location.

Tip

By default, Amperity uses Snappy to compress Apache Parquet files prior to sending them to a destination. If you do not need PGP encryption or output as a single file, you should not apply additional compression.

Note

A folder is created with one (or more) files unless Parquet files are configured to be compressed/archived during orchestration.

Amperity can send Apache Parquet files to downstream workflows using any of the following destinations:

Configure Parquet directories

The filename template defines the name of the directory into which Apache Parquet files are placed. By default, Amperity appends “.parquet” to the name of the directory. To omit “.parquet” from the name, use the Exclude Parquet extension from the directory name setting when configuring a destination that supports sending Apache Parquet files from Amperity. These destinations include many SFTP sites, Amazon S3, Azure Blob Storage, and Google Cloud Storage.
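
The naming rule can be illustrated with a short Python sketch. The function name, the exclude_extension parameter, and the sample template value are hypothetical stand-ins for the setting described above. This is not Amperity's implementation.

```python
def parquet_directory_name(rendered_template, exclude_extension=False):
    """Return the directory name for an already-rendered filename template."""
    # Stand-in for the "Exclude Parquet extension from the directory name" setting.
    if exclude_extension:
        return rendered_template
    # Default behavior: ".parquet" is appended to the directory name.
    return rendered_template + ".parquet"


print(parquet_directory_name("customers-2024-01-15"))
print(parquet_directory_name("customers-2024-01-15", exclude_extension=True))
```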