Format: Apache Avro

Apache Avro is a row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop ecosystem. Avro uses JSON to define data types and protocols, and serializes data in a compact binary format.

Apache Avro may be used with any upstream or downstream customer environment that supports Avro. Avro is the most compact file format available for use with Amperity.

Pull Avro files

To pull Avro files to Amperity:

  1. Select a filedrop data source

  2. Define a feed to associate fields in the Avro file with semantic tags; in some situations you may need to use an ingest query to transform data in the Avro file prior to loading it to Amperity

  3. Configure a courier for the location and name of the Avro file, and then for the name of an ingest query

Data sources

Pull Apache Avro files to Amperity using any filedrop data source.


Feeds

A feed defines how data should be loaded into a domain table, including which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transactions data.

Apply profile (PII) semantics to customer records. Apply transaction, itemized transaction, and product catalog semantics to interaction records. Use blocking key (bk), foreign key (fk), and separation key (sk) semantic tags to define how Amperity should understand field relationships when those values are present across your data sources.

Ingest queries

An ingest query is a SQL statement that may be applied to data prior to loading it to a domain table. An ingest query is defined using Spark SQL syntax.

Use Spark SQL to define an ingest query for the Avro file. Use a SELECT statement to specify which fields should be pulled to Amperity. Apply transforms to those fields as necessary.
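For example, a minimal ingest query might select a handful of fields and apply light transforms. All table and field names below are illustrative, not part of any actual schema:

```sql
-- Illustrative ingest query; the table name and all field names are hypothetical.
SELECT
  customer_id
  ,TRIM(given_name) AS given_name
  ,TRIM(surname) AS surname
  ,LOWER(email) AS email
  ,TO_DATE(signup_date, 'yyyy-MM-dd') AS signup_date
FROM customer_records
```

Only the fields named in the SELECT statement are pulled to Amperity; any other fields in the Avro file are ignored.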

Couriers

A courier brings data from an external system to Amperity. A courier relies on a feed to know which fileset to bring to Amperity for processing.

A courier must specify the location of the Avro file, and then define how that file is to be pulled to Amperity. This is done using a combination of configuration blocks:

  1. Load settings

  2. Load operations

Load settings

Use courier load settings to specify the path to the Avro file, a file tag (which can be the same as the name of the Avro file), and the "application/avro" content type.

{
  "object/type": "file",
  "object/file-pattern": "'path/to/file'-YYYY-MM-dd'.avro'",
  "object/land-as": {
    "file/tag": "FILE_NAME",
    "file/content-type": "application/avro"
  }
}
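As a concrete sketch, load settings for a daily file named like "customers-2024-05-01.avro" might look as follows. The path and file tag shown here are hypothetical:

```json
{
  "object/type": "file",
  "object/file-pattern": "'upload/customers'-YYYY-MM-dd'.avro'",
  "object/land-as": {
    "file/tag": "customers",
    "file/content-type": "application/avro"
  }
}
```

The date components of the file pattern (YYYY-MM-dd) are evaluated at run time, which is how the courier matches each day's file.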

Load operations

Use courier load operations to associate a feed ID with the courier and to apply the same file tag as the one used for load settings. Load operations for an ingest query may specify a series of options.

Load from feed
{
  "FEED_ID": [
    {
      "type": "OPERATION",
      "file": "FILE_NAME",
    }
  ]
}
Load from ingest query
{
  "FEED_ID": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "FILE_NAME"
        }
      ],
      "spark-sql-query": "INGEST_QUERY_NAME"
    }
  ]
}
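For example, assuming a hypothetical feed ID of "df-abc123", a file tag of "customers", and an ingest query named "customers-correct-dates", the ingest query pattern above might be filled in as:

```json
{
  "df-abc123": [
    {
      "type": "spark-sql",
      "spark-sql-files": [
        {
          "file": "customers"
        }
      ],
      "spark-sql-query": "customers-correct-dates"
    }
  ]
}
```

The file tag here must match the file tag defined in the courier's load settings.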

Send Avro files

Important

Amperity does not send Avro files to downstream workflows.