About Presto SQL

Presto is a distributed SQL query engine that is designed to efficiently query vast amounts of data using distributed queries. Presto is used within the Queries and Segments pages in the Amperity user interface.

Why should I use this reference?

The SQL Segment Editor in the Queries and Segment tabs uses Presto SQL as the underlying SQL engine. Amperity segments are built almost exclusively by using the SELECT statement, along with the clauses, operators, expressions, and functions you would expect to be available.

This reference is focused on how Presto SQL is used with Amperity and is not focused on anything that you would not expect to do from the Queries and Segment tabs in Amperity.

Please refer to this reference first, and then to the official Presto SQL documentation.

Amazon AWS vs. Azure

Amperity uses different versions of Presto SQL, depending on if the tenant runs in Amazon AWS or Microsoft Azure.

Amperity behaves the same on either platform when using Presto SQL to build segments from within the Segments tab, with the following exceptions:

  1. The way Presto SQL handles fixed precision rounding for a DECIMAL data type was changed in Presto SQL version 0.198. A decimal without an explicit type specifier–for example, 1.2–is parsed as a DOUBLE on tenants that run in Amazon AWS and as a DECIMAL on tenants that run in Azure.

Recommendations

This is a list of recommendations for using Presto SQL with Amperity segments.

General

Do

  • Use consistent and descriptive identifiers and names.

  • Make judicious use of white space and indentation to make code easier to read.

  • Store ISO-8601 compliant time and date information (YYYY-MM-DD HH:MM:SS.SSSSS).

  • Try to use only standard SQL functions instead of vendor-specific functions for reasons of portability.

  • Keep code succinct and devoid of redundant SQL, such as unnecessary quoting or parentheses or WHERE clauses that can otherwise be derived.

  • Include comments in SQL code where necessary.

    You may use the C-language style slash and asterisk opening (/*) and closing (*/) pair of delimiters to enclose comments single- and multi-line comments, as long as comments are not nested.

    You may precede single-line comments using --.

  • Convert string values to lowercase. For example: LOWER("column_name") = 'value'.

  • Code review complex SQL statements.

  • Keep the Segments page free of draft segments as much as possible.

  • Ensure segments are run automatically.

  • Use one-way SHA-256 hashes for fields that contain PII data.

  • Use double quotes ("") for database table and field names. Use single quotes ('') for values used in comparisons. For example: "State" = 'Washington'.

  • Use parentheses to group the contents of AND, OR, and NOT operators within complex SQL to ensure that each element within the WHERE clause runs in the correct order.

Avoid

  • Using reserved keyword names as identifiers.

  • CamelCase, with the exception of the table name. CamelCase is difficult to scan quickly.

  • Descriptive prefixes or Hungarian notation such as sp_ or tbl.

  • Plurals. Use the more natural collective term where possible instead. For example staff instead of employees or people instead of individuals.

  • Quote identifiers. Use SQL92 double quotes for portability, if possible.

  • Object-oriented design principles with SQL or database structures.

Commas

Commas are used as separators in SQL queries and are often added at the end of a line, like this:

SELECT
  amperity_id AS amperity_id,
  max(given_name) AS given_name,
  max(surname) AS surname,
  max(email) AS email,
  max(phone) AS phone,
  max(address) AS address,
  max(city) AS city,
  max(state) AS state,
  max(postal) AS postal,
  max(birthdate) AS birthdate,
  max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id

Adding the comma at the start of the row is recommended for two reasons:

  1. It makes it visually obvious that a comma was not missed.

  2. It allows a field to be commented out without breaking anything else in the query.

For example:

SELECT
  amperity_id AS amperity_id
  ,max(given_name) AS given_name
  ,max(surname) AS surname
  ,max(email) AS email
  ,max(phone) AS phone
  ,max(address) AS address
--  ,max(address_2) AS address_line_two
  ,max(city) AS city
  ,max(state) AS state
  ,max(postal) AS postal
  ,max(birthdate) AS birthdate
  ,max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id

Indentation

To ensure that SQL is readable it is important that standards of indentation are followed. Use a two-character indentation pattern for SQL queries as often as possible to help ensure consistency and readability. A two-character indentation pattern also helps minimize the width of the query, making it easier to read on the page and less likely to wrap or run off the page to the right.

Formalisms

Make use of BETWEEN where possible instead of combining multiple statements with AND. Similarly use IN() instead of multiple OR clauses. Where a value needs to be interpreted before leaving the database use the CASE expression. CASE expressions can be nested to form more complex logical structures. Avoid the use of UNION clauses and temporary tables where possible. If the schema can be optimized to remove the reliance on these features then it most likely should be.

SELECT CASE postcode
  WHEN 'BN1' THEN 'Brighton'
  WHEN 'EH1' THEN 'Edinburgh'
  END AS 'city_uk'
FROM office_locations
WHERE country = 'United Kingdom'
  AND opening_time BETWEEN 8 AND 9
  AND postcode IN ('EH1', 'BN1', 'NN1', 'KW1')

Joins

Joins should be aligned with the FROM clause and grouped with a new line where necessary. Indenting the ON and AND statements in a JOIN is optional, but can make it easier to see individual joins when many are present.

SELECT r.last_name
FROM riders AS r
INNER JOIN bikes AS b
  ON r.bike_vin_num = b.vin_num
  AND b.engine_tally > 2
INNER JOIN crew AS c
  ON r.crew_chief_last_name = c.last_name
  AND c.chief = 'Y';

Subqueries

Subqueries should be aligned to the line above them, but then follow standard indentation patters from that location. Sometimes it will make sense to have the closing parenthesis on a new line at the same character position as its opening partner—this is especially true where you have nested subqueries.

SELECT r.last_name,
  (SELECT MAX(YEAR(championship_date))
  FROM champions AS c
  WHERE c.last_name = r.last_name
    AND c.confirmed = 'Y') AS `last_championship_year`
FROM riders AS r
WHERE r.last_name IN
  (SELECT c.last_name
  FROM champions AS c
  WHERE YEAR(championship_date) > '2008'
    AND c.confirmed = 'Y');

Naming conventions

Ensure the name is unique and does not exist as a reserved keyword. Keep the length to a maximum of 30 bytes—in practice this is 30 characters unless you are using multi-byte character set. Names must begin with a letter and may not end with an underscore. Only use letters, numbers, and underscores in names. Avoid the use of multiple consecutive underscores—these can be hard to read. Use underscores where you would naturally include a space in the name (first name becomes first_name). Avoid abbreviations and if you have to use them make sure they are commonly understood.

SELECT first_name
FROM Customer360

Alias names

An alias assigns a temporary name to a column or a table that only exists for the duration of a query. An alias is defined after the AS keyword.

For columns:

SELECT column_name AS alias_name
FROM table_name;

For tables:

SELECT column_name(s)
FROM table_name AS alias_name;

When using an alias:

  • Assign names that clearly relate to the column or table.

    Important

    Do not use an alias when its name matches the original column name.

  • Always use the AS keyword to ensure readability.

    ..tip:: Include the AS keyword when aliasing columns in a SELECT statement.

  • For computed data – SUM() or AVG() – use the name you would give it were it a column defined in the schema.

  • Always wrap an aliased field name with single back ticks:

    `given_name`
    

The following example shows using aliases to add clarity to a SELECT statement that selects columns from more than one table. The “c360” alias refers to the Customer360 table and the “el” alias refers to the Email_List table:

SELECT c360.email, c360.given_name, c360.surname, el.status
FROM Customer360 c360

LEFT JOIN Email_List el ON c360.email = el.email
WHERE status IS NOT NULL

Column names

Always use the singular name. Where possible avoid simply using id as the primary identifier for the table. Do not add a column with the same name as its table and vice versa. Always use lowercase except where it may make sense not to, such as with proper nouns.

Correlation names

A correlation name assigns a temporary name to a table. A correlation name is defined within the FROM clause after the table name.

Syntax:

FROM table_name correlation_name

Example:

FROM Merged_Customers mc

When using correlations:

  • Assign names that clearly relate to the table. For example: a table named “Merged_Customers” should have a correlation name of “mc”.

  • Always use the first letter of the table as the first letter of the correlation.

  • Append additional alphanumeric characters as necessary’s to ensure unique correlation names.

  • Ensure that correlation names are used only within the same query.

Suffix names

The following suffixes represent patterns that should be applied to column names in SQL tables. These patterns help ensure the purpose of columns can be easily understood by all users. Use the correct suffixes, use them consistently, and use them everywhere they are appropriate.

Suffix

Description

_id

A unique identifier such as a column that is a primary key.

_status

A flag value or some other status of any type such as publication_status.

_total

The total or sum of a collection of values.

_num

Denotes the field contains any kind of number.

_name

Signifies a name such as first_name.

_seq

Contains a contiguous sequence of values.

_date

Denotes a column that contains the date of something.

_tally

A count.

_size

The size of something such as a file size or clothing.

_addr

An address for the record could be physical or intangible such as ip_addr.

Table names

Use a collective name or, less ideally, a plural form. For example (in order of preference) staff and employees. Do not prefix with tbl or any other such descriptive prefix or Hungarian notation. Never give a table the same name as one of its columns and vice versa. Avoid, where possible, concatenating two table names together to create the name of a relationship table. Rather than cars_mechanics prefer services.

NULL values

Functions may fail when they encounter a NULL value and others may return NULL values if any of their arguments return NULL values.

  • Use the COALESCE() function to convert to a zero-length string when using the CONCAT() and SUM() functions.

  • Use the COALESCE() function to to identify math. For example, multiplication will return NULL if any field is NULL. For example, because 1 is the identity for multiplication, use COALESCE(myColumn, 1)

Reserved words

Always use uppercase for the reserved keywords like SELECT and WHERE. It is best to avoid the abbreviated keywords and use the full length ones where available (prefer ABSOLUTE to ABS). Do not use database server specific keywords where an ANSI SQL keyword already exists performing the same function. This helps to make code more portable.

SELECT model_num
FROM Phones AS p
WHERE p.release_date > '2014-09-30'

Reserved keywords

The following keywords are reserved and must be double-quoted to be used as an identifier.

  • ALTER

  • AND

  • AS

  • BETWEEN

  • BY

  • CASE

  • CAST

  • CONSTRAINT

  • CREATE

  • CROSS

  • CUBE

  • CURRENT_DATE

  • CURRENT_TIME

  • CURRENT_TIMESTAMP

  • CURRENT_USER

  • DEALLOCATE

  • DELETE

  • DESCRIBE

  • DISTINCT

  • DROP

  • ELSE

  • END

  • ESCAPE

  • EXCEPT

  • EXECUTE

  • EXISTS

  • EXTRACT

  • FALSE

  • FOR

  • FROM

  • FULL

  • GROUP

  • GROUPING

  • HAVING

  • IN

  • INNER

  • INSERT

  • INTERSECT

  • INTO

  • IS

  • JOIN

  • LEFT

  • LIKE

  • LOCALTIME

  • LOCALTIMESTAMP

  • NATURAL

  • NORMALIZE

  • NOT

  • NULL

  • ON

  • OR

  • ORDER

  • OUTER

  • PREPARE

  • RECURSIVE

  • RIGHT

  • ROLLUP

  • SELECT

  • TABLE

  • THEN

  • TRUE

  • UESCAPE

  • UNION

  • UNNEST

  • USING

  • VALUES

  • WHEN

  • WHERE

  • WITH

One-way SHA-256 hashes

Apply one-way SHA-256 hashes to fields that contain PII data. A one-way hash ensures that data can no longer be recognizable as valid PII, yet still allows that data to applied to segments that report on users who were flagged for removal or deletion and helps ensure that removed or obfuscated data is not sent from Amperity.

A one-way SHA-256 hash has the following syntax:

TO_HEX(SHA256(TO_UTF8(UPPER(TRIM(FIELD)))))

and uses the following Presto SQL functions:

  • TRIM() removes whitespace from the field.

  • UPPER() sets all characters to upper-case.

  • TO_UTF8() converts the data into a binary format.

  • SHA256() hashes data with a one-way SHA-256 hash.

  • TO_HEX() converts the binary data into a string.

Whitespace

To make the code easier to read it is important that the correct complement of spacing is used. Do not crowd code or remove natural language spaces.

Line spacing

Always include newlines/vertical space:

  • before AND or OR

  • after semicolons to separate queries for easier reading

  • after each keyword definition

  • before a comma when separating multiple columns into logical groups

  • to separate code into related sections, which helps to ease the readability of large chunks of code.

Putting commas and conjunctions at the start of the line makes it easier to comment out a single line without disturbing the rest of the query

SELECT
  a.title
  ,a.release_date
  --,a.recording_date
  ,a.production_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
  OR a.title = 'The New Danger';

Spaces

Spaces should be used to line up the code so that the root keywords all start on the same character boundary, and also so that This makes it easy to keep track of where you are in a query that may be multiple layers deep.

(SELECT f.species_name
  ,AVG(f.height) AS `average_height`
  ,AVG(f.diameter) AS `average_diameter`
FROM flora AS f
WHERE f.species_name = 'Banksia'
  OR f.species_name = 'Sheoak'
  OR f.species_name = 'Wattle'
GROUP BY f.species_name, f.observation_date)

UNION ALL

(SELECT b.species_name
  ,AVG(b.height) AS `average_height`
  ,AVG(b.diameter) AS `average_diameter`
FROM botanic_garden_flora AS b
WHERE b.species_name = 'Banksia'
  OR b.species_name = 'Sheoak'
  OR b.species_name = 'Wattle'
GROUP BY b.species_name, b.observation_date)

Although not exhaustive always include spaces:

  • before and after equals (=)

  • after commas (,)

  • surrounding apostrophes ('), but not within parentheses or with a trailing comma or semicolon.

SELECT
  a.title
  ,a.release_date
  ,a.recording_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
  OR a.title = 'The New Danger'

Example SELECT statement

The following example shows selecting the Amperity ID, purchase date, and order ID, and then uses two window functions to rank by transaction totals, and then put in descending order to sort the most recent purchase date first:

SELECT
  t.Amperity_Id,
  t.purchasedate,
  t.orderid,
  rank() OVER (PARTITION BY t.Amperity_Id
               ORDER BY t.transactiontotal DESC) AS rank,
  t.transactiontotal,
  sum(t.transactiontotal) OVER (PARTITION BY t.Amperity_Id
                                ORDER BY t.purchasedate) AS rolling_sum
FROM
  TransactionsEcomm t
ORDER BY t.Amperity_Id, rank
LIMIT 100

WITH clause

The WITH clause defines a common table expression (CTE).

A common table expression (CTE) is a named subquery defined by the WITH clause. A CTE defines a set of variables that act as a temporary view, an optional list of column names, and a query expression. The results of the query expression behaves like a table. Each column of that table is defined by the optional list of column names.

The following example shows using multiple CTEs:

WITH new_in_21 AS (
  SELECT
    amperity_id
    ,one_and_done
  FROM Transaction_Attributes
  WHERE YEAR(first_order_datetime) = 2021
),

product_categories AS (
  SELECT DISTINCT
    new_in_21.amperity_id
    ,one_and_done
    ,product_category
  FROM Unified_Itemized_Transactions uit
  INNER JOIN new_in_21 ON new_in_21.amperity_id=uit.amperity_id
)

SELECT
  product_category
  ,COUNT(distinct amperity_id) customer_count
  ,1.0000*SUM(CASE when one_and_done THEN 1 ELSE 0 END) / COUNT(DISTINCT amperity_id) pct_one_done
FROM product_categories
GROUP BY 1
ORDER BY 3 DESC

SELECT statement

The SELECT statement defines a set of data to be returned from a data table. The set of returned data is often referred to as the result-set. Use the SELECT statement to retrieve rows from any table in the customer 360 database.

A SELECT statement can be complex, depending on the type of query you need to make. For example, the following SELECT statement ranks transactions by Amperity ID and by largest dollar totals:

SELECT
  t.orderid,
  t.Amperity_Id,
  t.transactiontotal,
  rank() OVER (partition by t.Amperity_Id order BY t.transactiontotal DESC) AS rank
FROM
  TransactionsEcomm t
ORDER BY t.Amperity_Id, rank asc
LIMIT 100

The rest of this topic describes the clauses, expressions, functions, and operators that are the most commonly used within the SQL Segment Editor in Amperity. More functionality than what is described in this topic is supported, as the segment editors use Presto SQL.

Important

Not all of the functionality described in the official documentation for Presto SQL should be used in the SQL Segment Editor.

Subquery predicates

A subquery is an expression which is composed of a query. The subquery is correlated when it refers to columns outside of the subquery. Logically, the subquery will be evaluated for each row in the surrounding query. The referenced columns will thus be constant during any single evaluation of the subquery.

Note

Support for correlated subqueries is limited. Not every standard form is supported.

EXISTS predicate

The EXISTS predicate determines if a subquery returns any rows:

SELECT name
FROM nation
WHERE EXISTS (SELECT * FROM region WHERE region.regionkey = nation.regionkey)

IN predicate

The IN predicate determines if any values produced by the subquery are equal to the provided expression. The result of IN follows the standard rules for NULL values. The subquery must produce exactly one column:

SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)

SELECT DISTINCT statement

Use the SELECT DISTINCT statement instead of SELECT to return only distinct rows of data.

For example:

SELECT DISTINCT
  amperity_id
  ,order_id
  ,order_quantity
  ,order_datetime
FROM Unified_Transactions

FROM clause

The FROM clause specifies the name of the data table against which the SQL query will be run and is part of every SELECT statement.

TABLESAMPLE BERNOULLI clause

Use the TABLESAMPLE BERNOULLI clause to return a random sample of records. Specify the size of the random sample within parentheses. For example:

SELECT
  amperity_id
FROM Customer_360 TABLESAMPLE BERNOULLI (50)

will return a random sample of records from the Customer 360 table at a 50% rate.

The following example shows a segment for use with AmpIQ and campaigns that samples all customers with the surname “Smith” who do not have the given name “Joe” at a 30% rate:

SELECT
  "amperity_id"
FROM "Customer_360" TABLESAMPLE BERNOULLI (30)
WHERE
  (
    (
      "amperity_id" IN (
        SELECT DISTINCT "t0"."amperity_id"
        FROM "Customer_360" "t0"
        INNER JOIN "Customer_360" "t1"
        ON "t0"."amperity_id" = "t1"."amperity_id"
        WHERE
          (
            LOWER("t0"."surname") = 'smith'
            AND LOWER("t1"."given_name") <> 'joe'
          )
      )
    )
    AND LOWER("surname") = 'smith'
    AND LOWER("given_name") <> 'joe'
  )

and returns a list of Amperity IDs ready for use with a campaign.

LEFT JOIN clause

The LEFT JOIN clause joins rows from two tables. For a LEFT JOIN, each row in the left table is joined with all matching rows from the right table. For rows with no match in the right table, the join is completed with NULL to represent column values.

For example:

SELECT * FROM (VALUES 1, 2) t("left")
  LEFT JOIN (VALUES 1, 1) u("right")
  ON t."left" = u."right";

will return a table similar to:

-------- --------
 left     right
-------- --------
 1        1
 1        1
 2        NULL
-------- --------

WHERE clause

The WHERE clause filters records, and then returns only records that match the specified condition and value. The WHERE clause can be combined with AND, OR, and NOT operators.

Tip

Use parentheses to group the contents of AND, OR, and NOT operators within complex SQL to ensure that each element within the WHERE clause runs in the correct order. For example:

Do

AND (c360.email IS NOT NULL OR c360.phone IS NOT NULL)
AND (
    (is_cancellation IS NULL)
  OR (NOT is_cancellation)
)
AND (
  (is_return IS NULL)
  OR (NOT is_return)
)

Avoid

AND c360.email IS NOT NULL OR c360.phone IS NOT NULL
AND is_cancellation IS NULL OR NOT is_cancellation
AND is_return IS NULL OR NOT is_return

AND operator

The AND operator is used to filter records based on more than one condition. The AND operator displays a record if all the conditions separated by AND are TRUE.

OR operator

The AND operator is used to filter records based on more than one condition. The OR operator displays a record if any of the conditions separated by OR is TRUE.

NOT operator

The NOT operator displays a record if the condition(s) is NOT TRUE.

CASE expression

The standard SQL CASE expression has two forms: simple and searched.

Note

If a CASE expression is too long, add a new line between WHEN and THEN clauses to make the line more readable.

Simple form

The simple form searches each value expression from left to right until it finds one that equals expression:

CASE expression
  WHEN value THEN result
  [ WHEN ... ]
  [ ELSE result ]
END

The result for the matching value is returned.

If no match is found, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:

SELECT a,
  CASE a
    WHEN 1 THEN 'one'
    WHEN 2 THEN 'two'
    ELSE 'many'
  END

Searched form

The searched form evaluates each boolean condition from left to right until one is true and returns the matching result:

CASE
  WHEN condition THEN result
  [ WHEN ... ]
  [ ELSE result ]
END

If no conditions are true, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:

SELECT a, b,
  CASE
    WHEN a = 1 THEN 'aaa'
    WHEN b = 2 THEN 'bbb'
    ELSE 'ccc'
  END

Group states by time zones

The following CASE expression groups states by North American time zone:

,CASE
  WHEN state = 'HI' THEN 'Hawaii'
  WHEN state = 'AK' THEN 'Alaska'
  WHEN state IN ('WA','OR','CA','NV') THEN 'Pacific Time (US & Canada)'
  WHEN state IN ('MT','ID','WY','UT','CO','AZ','NM') THEN 'Mountain Time (US & Canada)'
  WHEN state IN ('ND','SD','NE','KS','OK','TX','MN','WI','IA','IL','MO','AR','TN','LA','MS','AL') THEN 'Central Time (US & Canada)'
  WHEN state IN ('MI','IN','OH','KY','FL','GA','SC','NC','WV','VA','MD','DE','DC','NJ','PA', 'NY','CT','RI','MA','VT','NH','ME') THEN 'Eastern Time (US & Canada)'
  WHEN state = 'PR' THEN 'Puerto Rico'
  WHEN state = 'GU' THEN 'Guam'
  WHEN state = 'VI' THEN 'US Virgin Islands'
  WHEN state = 'AS' THEN 'American Samoa'
  ELSE '' END AS time_zone

GROUP BY clause

The GROUP BY clause divides the output of a SELECT statement into groups of rows containing matching values. A simple GROUP BY clause may contain any expression composed of input columns or it may be an ordinal number selecting an output column by position (starting at one).

The following queries are equivalent. They both group the output by the given_name input column with the first query using the ordinal position of the output column and the second query using the input column name:

SELECT COUNT(*), given_name FROM Customer_360 GROUP BY 2

is equivalent to:

SELECT COUNT(*), given_name FROM Customer_360 GROUP BY given_name

GROUP BY clauses can group output by input column names not appearing in the output of a select statement. For example, the following query generates row counts for the customer table using the input column surname:

SELECT COUNT(*) FROM Customer_360 GROUP BY surname

returns a table similar to:

 _col0
-------
 82
 468
 42
 9
 1
(3984 rows)

When a GROUP BY clause is used in a SELECT statement all output expressions must be either aggregate functions or columns present in the GROUP BY clause.

CUBE operator

The CUBE operator generates all possible groupings for the specified group of columns.

For example, the following query:

SELECT
   amperity_id
   ,purchase_brand
   ,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY CUBE(amperity_id, purchase_brand)
ORDER BY amperity_id, purchase_brand

will return a table similar to:

------------- ---------------- ---------------------
 amperity_id   purchase_brand   total_order_revenue
------------- ---------------- ---------------------
               FoxEssentials    3976068.32
               FoxEssentials    3976068.32
               GoldenSwan       1358357.54
               GoldenSwan       1358357.54
               TrendyBear       5703142.95
 ...
------------- ---------------- ---------------------

ROLLUP operator

The ROLLUP operator generates all possible subtotals for the specified group of columns.

For example, the following query:

SELECT
  amperity_id
  ,purchase_brand
  ,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY ROLLUP(amperity_id, purchase_brand)
ORDER BY amperity_id, purchase_brand

will return a table similar to:

------------- ---------------- ---------------------
 amperity_id   purchase_brand   total_order_revenue
------------- ---------------- ---------------------
               FoxEssentials    3976068.32
               GoldenSwan       1358357.54
               TrendyBear       5703142.95
                                11037568.81
 ...
------------- ---------------- ---------------------

HAVING clause

The HAVING clause sorts a result set by one or more output expressions. Use in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. A HAVING clause eliminates groups that do not satisfy the given conditions and filters groups after groups and aggregates are computed.

For example:

SELECT
  COUNT(*)
  ,purchase_brand
  ,payment_method
  ,CAST(SUM(order_quantity) AS bigint) AS quantity
FROM Unified_Transactions
GROUP BY purchase_brand, payment_method
HAVING SUM(order_quantity) > 100
ORDER BY quantity DESC

returns a table similar to:

---------------- ---------------- ----------
 purchase_brand   payment_method   quantity
---------------- ---------------- ----------
 FoxEssentials    VISA             51753
 GoldenSwan       VISA             50933
 TrendyBear                        34329
 FoxEssentials    PayPal           33513
 GoldenSwan       MASTER           16651
 GoldenSwan       MASTER           6482
 FoxEssentials    DISCOVER         4673
 ...
---------------- ---------------- ----------

Set operations

Use one of the following set operation clauses to combine the results of more than one SELECT statement into a single result set:

EXCEPT clause

Use the EXCEPT clause to return the rows that are in the result set of the first query, but not the second.

For example, to compare two tables and find out which rows are in table B, but not in table A:

SELECT table_b
EXCEPT SELECT table_a

Tip

The EXCEPT clause deduplicates records. For example, if all of the rows in table A are present in table B, the results should be 0 rows.

Use a query similar to the following to verify that 0 rows is accurate:

SELECT COUNT(*)
FROM (
  SELECT DISTINCT *
  FROM table_b
)

INTERSECT clause

Use the INTERSECT clause to return only the rows that are in the result sets of both the first and the second queries.

For example, to find all rows in both tables A and B:

SELECT table_b
INTERSECT SELECT table_a

UNION clause

The UNION clause combines the the unique results of one query with the unique results of a second. For example:

SELECT shirts, pants
UNION
SELECT shoes

will return:

 _col0
-------
 shirts
 pants
 shoes

The UNION clause supports two arguments: ALL and DISTINCT. Use UNION ALL to return all rows, even if they are identical. UNION DISTINCT is the default behavior of UNION (when an argument is not specified) and will return only unique rows.

Tip

Use GROUP BY CUBE and/or GROUP BY ROLLUP clauses for more efficient unions of records.

For example, the following query:

SELECT
  amperity_id
  ,purchase_brand
  ,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY amperity_id, purchase_brand

UNION ALL

SELECT
  amperity_id
  ,NULL
  ,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY amperity_id

ORDER BY purchase_brand NULLS LAST

will return a table similar to:

------------- ---------------- ---------------------
 amperity_id   purchase_brand   total_order_revenue
------------- ---------------- ---------------------
               FoxEssentials    3976068.32
               GoldenSwan       1358357.54
               TrendyBear       5703142.95
                                11037568.81
 ...
------------- ---------------- ---------------------

UNNEST clause

The UNNEST clause expands an ARRAY or MAP into a relation.

  • Arrays are expanded into a single column.

  • Maps are expanded into two columns (key, value).

UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument (the other columns are padded with NULL values).

UNNEST can optionally have a WITH ORDINALITY clause, in which case an additional ordinality column is added to the end. UNNEST is normally used with a JOIN and can reference columns from relations on the left side of the join.

Use the UNNEST clause in the SQL segment to expand phone and/or email addresses so that the query looks at individual bad values. The following example shows using an UNNEST clause to expand email addresses that are part of a bad-values blocklist SQL query:

SELECT
  REGEXP_REPLACE(v.value,'\+.*@','@') AS value
  ,M.semantic
  ,M.datasource
  ,M.domain_table
  ,M.proxy
FROM (
  SELECT DISTINCT
    SPLIT(UPPER(email), ',') AS vs
    ,'email' AS semantic
    ,REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource
    ,datasource AS domain_table
    ,UPPER(given_name) AS proxy
  FROM Unified_Coalesced AS UC
) AS M
CROSS JOIN UNNEST(vs) AS v(value)
WHERE value IS NOT NULL
AND value <> ''),

VALUES clause

The VALUES clause can be used anywhere a query can be used, such as the FROM clause of a SELECT statement, an INSERT, or even at the top level. The VALUES clause creates an anonymous table without column names, but the table and columns can be named using an AS keyword with column aliases.

To return a table with one column and three rows:

VALUES 1, 2, 3

To return a table with two columns and three rows:

VALUES
  (1, 'a'),
  (2, 'b'),
  (3, 'c')

To return table with column ID and name:

SELECT * FROM (
  VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
) AS t (id, name)

To add a table with column ID and name:

CREATE TABLE example AS
SELECT * FROM (
  VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
) AS t (id, name)

Window functions

A window function performs calculations across rows. A window function runs after the HAVING clause, but before the ORDER BY clause. A window function requires special syntax using the OVER clause to specify the window.

Tip

When a window function is too long, add a new code line to improve readability.

A window function has the following components:

  1. A list of columns that act as inputs to the window function.

  2. A window function that defines the operation to be performed by the window function. For example: RANK() or SUM(). This may be any of the ranking functions, value functions, or aggregate functions available in Presto SQL.

  3. OVER() defines the window frame, which represents a sliding window.

  4. PARTITION BY separates the input into different partitions; only rows in the specified partition will be considered by the window function

  5. ORDER BY determines the order in which the input will be processed by the window function.

    Tip

    Use ROWS BETWEEN to define any number of preceding and following rows, as related to the current row.

    For example:

    ROWS BETWEEN 1 preceding AND 1 following
    
    ROWS BETWEEN current row AND 1 following
    
    ROWS BETWEEN 5 preceding AND 2 preceding
    
  6. DESC sorts the output in descending order.

    Tip

    Add _uuid_pk to window function ordering–DESC, _uuid_pk–to return deterministic results in case of ties.

For example:

SELECT input_1, input_2, input_3,
  window_function() OVER (PARTITION BY input_2
                          ORDER BY input_3 DESC, _uuid_pk) AS rnk
FROM table_name
ORDER BY input_2, rnk

Rolling 7-day window

The following example shows a rolling seven day window for order revenue.

SELECT
  *
FROM (
  SELECT
    purchase_channel
    ,order_day
    ,SUM(order_revenue) OVER (PARTITION BY purchase_channel ORDER BY order_day ROWS BETWEEN 6 preceding AND current row) rolling_7_day_revenue
  FROM (
    SELECT
      purchase_channel
      ,DATE(order_datetime) order_day
      ,SUM(order_revenue) order_revenue
    FROM Unified_Transactions
    WHERE amperity_id IS NOT NULL
    AND order_datetime > (CURRENT_DATE - interval '36' day)
    GROUP BY 1,2
  )
)
WHERE order_day > (CURRENT_DATE - interval '30' day)
ORDER BY 1,2

Retention rate, previous year

The following example shows how to use a rolling 1-year window to return the values necessary for calculating retention rate during the previous year.

WITH customers AS (
  SELECT DISTINCT amperity_id AS amperity_id
  FROM Unified_Transactions
  WHERE order_datetime < CAST(CURRENT_DATE - interval '1' year AS date)
)
,orders AS (
  SELECT DISTINCT amperity_id AS ordered_in_past_year
  FROM Unified_Transactions
  WHERE order_datetime >= CAST(CURRENT_DATE - interval '1' year AS date)
)
SELECT
  SUM(CAST(ordered_in_past_year IS NULL AS int)) AS not_retained
  ,SUM(CAST(ordered_in_past_year IS NOT NULL AS int)) AS retained
FROM customers
LEFT JOIN orders ON amperity_id = ordered_in_past_year

This query will return all customers who purchased more than 1 year ago, and then separates them into two groups: customers who did purchase within the previous year (retained) and customers who did not (not_retained).

Retention rate, in this example, is the number of customers who made a repeat purchase within the previous year represented as a percentage of all customers.

For example, if this query returned a table similar to:

-------------- ----------
 not_retained   retained
-------------- ----------
 525217         87421
-------------- ----------

this means there are 612,638 Amperity IDs (not_retained plus retained) and 87,421 Amperity IDs that made a repeat purchase (retained). Divide the retained amount by the total number of Amperity IDs to get the retention rate: 14.2%.

ORDER BY clause

The ORDER BY clause sorts a result set by one or more output expressions.

ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]

Each expression may be composed of output columns or it may be an ordinal number selecting an output column by position (starting at one). The ORDER BY clause is evaluated as the last step of a query after any GROUP BY or HAVING clause. The default null ordering is NULLS LAST, regardless of the ordering direction.

LIMIT clause

The LIMIT clause restricts the number of rows in the result set. The following example queries a large table, but the limit clause restricts the output to only have five rows (because the query lacks an ORDER BY, exactly which rows are returned is arbitrary):

SELECT orderdate FROM orders LIMIT 5;

returns a table similar to:

 o_orderdate
-------------
 1996-04-14
 1992-01-15
 1995-02-01
 1995-11-12
 1992-04-26
 (5 rows)

Operators

An operator compares two data items, and then returns a result.

BETWEEN

Use the BETWEEN operator to test if a value falls within the specified range using the syntax BETWEEN [minimum value] AND [maximum value]:

SELECT 3 BETWEEN 2 AND 6;

The statement shown above is equivalent to the following statement:

SELECT 3 >= 2 AND 3 <= 6;

Presence of NULL evaluates to NULL

The presence of NULL will result in the statement evaluating to NULL:

SELECT NULL BETWEEN 2 AND 4;

and:

SELECT 2 BETWEEN NULL AND 6;

String arguments must be of same type

Use the BETWEEN operator to evaluate string arguments as long as the value, min, and max parameters are of the same type:

SELECT 'Paul' BETWEEN 'John' AND 'Ringo';

whereas this query will produce an error:

SELECT '2.3' BETWEEN 'John' AND '35.2';

Within a CASE statement

CASE
  WHEN field BETWEEN 20 AND 1 THEN 'match'
  ELSE 'noop'
END

is equivalent to:

CASE
  WHEN field < 20 AND field > 1 THEN 'match'
  ELSE 'noop'
END

Comparison operators

The following comparison operators are available:

Operator

Description

<

Less than

>

Greater than

<=

Less than or equal to

>=

Greater than or equal to

=

Equal

<>

Not equal

!=

Not equal (non-standard but popular syntax)

Quantifiers–ALL, ANY, SOME–may be used along with comparison operators to provide more specific comparisons.

ALL

Use ALL together with comparison operators in the following way:

expression operator ALL ( subquery )

For example:

SELECT 21 < ALL (VALUES 19, 20, 21); -- false

Examples of ALL comparison operator combinations:

Expression

Meaning

A = ALL (…)

Evaluates to TRUE when A is equal to all values.

A <> ALL (…)

Evaluates to TRUE when A doesn’t match any value.

A < ALL (…)

Evaluates to TRUE when A is smaller than the smallest value.

ANY

Use ANY together with comparison operators in the following way:

expression operator ANY ( subquery )

For example:

SELECT 21 < ALL (VALUES 19, 20, 21); -- false

Examples of ANY comparison operator combinations:

Expression

Meaning

A = ANY (…)

Evaluates to TRUE when A is equal to any of the values.

Note

This form is equivalent to A IN (…).

A <> ANY (…)

Evaluates to TRUE when A doesn’t match one or more values.

A < ANY (…)

Evaluates to TRUE when A is smaller than the biggest value.

SOME

Use SOME together with comparison operators in the following way:

expression operator SOME ( subquery )

For example:

SELECT 42 >= SOME (SELECT 41 UNION ALL SELECT 42 UNION ALL SELECT 43); -- true

Examples of SOME comparison operator combinations:

Expression

Meaning

A = SOME (…)

Evaluates to TRUE when A is equal to any of the values.

Note

This form is equivalent to A IN (…).

A <> SOME (…)

Evaluates to TRUE when A doesn’t match one or more values.

A < SOME (…)

Evaluates to TRUE when A is smaller than the biggest value.

IS NULL

The IS NULL operator tests whether a value is NULL. This operator works for all data types.

Using NULL with IS NULL evaluates to TRUE:

select NULL IS NULL;

But any other constant does not evaluate to FALSE:

SELECT 3.0 IS NULL;

IS NOT NULL

The IS NULL operator tests whether a value is not NULL. This operator works for all data types.

Using NULL with IS NOT NULL evaluates to FALSE:

select NULL IS NOT NULL;

But any other constant evaluates TRUE:

SELECT 3.0 IS NOT NULL;

PIVOT operator

Warning

Presto SQL does not support the use of a PIVOT operator.

Functions

A function is a SQL statement that accepts input parameters, performs actions, and then returns results. This section highlights some useful functions for use when building segments in Amperity.

Note

This section highlights a very small subset of the complete list of functions available in Presto SQL , many of which can be useful depending on the type of query.

The following list contains some of the most frequently used functions for building segments via the SQL Segment Editor (alphabetized):

ARRAY_AGG()

Use the ARRAY_AGG(x) function to return an array created by x.

Aggregate order IDs

The following SQL query uses the ARRAY_JOIN() and ARRAY_AGG() functions to aggregate an array of order IDs using a comma as the delimiter:

SELECT
  amperity_id
  ,ARRAY_JOIN(ARRAY_AGG(order_id), ', ') AS Orders
FROM Transactions
GROUP BY amperity_id
LIMIT 1000

Sort in chronological order

The following SQL query uses the ARRAY_AGG() function to sort a list of products in chronological order:

SELECT DISTINCT
  sub.amperity_id
  ,ARRAY_JOIN(ARRAY_AGG(sub.item_name) OVER(
    PARTITION BY sub.amperity_id
    ORDER BY sub.order_datetime DESC
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ), ', ') AS Products_Ordered
FROM (
  SELECT DISTINCT
    amperity_id
    ,item_name
    ,order_datetime
  FROM DW_OrderDetail Z
  WHERE order_closed = true AND order_status <> 'Void'
) AS sub

Warning

This example only runs in tenants that are running on Microsoft Azure.

Separate product IDs with pipe symbol

The following example joins product IDs and uses the pipe symbol (|) as a separator:

,ARRAY_JOIN(ARRAY_AGG(invoices), '|') AS product_ids

ARRAY_JOIN()

Use the ARRAY_JOIN(array, delimiter, string) function to concatenate elements of a given array using a delimiter and an optional string to replace NULL values.

Join order IDs

The following SQL query uses the ARRAY_JOIN() and ARRAY_AGG() functions to aggregate an array of order IDs using a comma as the delimiter:

SELECT
  amperity_id
  ,ARRAY_JOIN(ARRAY_AGG(order_id), ', ') AS Orders
FROM Transactions
GROUP BY amperity_id
LIMIT 1000

AVG()

Use the AVG(x) function to return the average of all input values.

Return mean

The following example returns the mean.

SELECT
  product_category
  ,1.0*AVG(item_revenue) avg_item_revenue
FROM Unified_Itemized_Transactions
WHERE amperity_id IS NOT NULL
GROUP BY 1
ORDER BY 1 DESC

CAST()

Use the CAST(value AS type) function to cast the value of value as type.

Cast RFM as REAL data type

The following example shows using the CAST() function to cast the values of Recency, Frequency, and Monetary to the real data type (a floating-point, 32-bit inexact, variable-precision value).

,ROUND((CAST(Recency AS real) + CAST(Frequency AS real) + CAST(Monetary AS real)) / 3, 2)

Cast as UUID

The following example casts email addresses as a UUID:

,CAST(email AS UUID) AS "email_address"

Cast date as year

,CAST(YEAR(purchase_date) AS VARCHAR) AS purchase_year

CHR()

Use the CHR(codepoint) function to return the Unicode codepoint as a single-character string.

Tip

Use the CODEPOINT(string) function to return the Unicode codepoint as the only character of string. For example:

CODEPOINT(')

Returns:

39

Add leading apostrophe

Some use cases for CSV files have a downstream dependency on Microsoft Excel. Excel automatically removes leading zeros and converts large numbers to scientific notation, such as 1.23E+15, to ensure that formulas and math operations work correctly.

Long strings are output from Amperity in the form of the Amperity ID, customer keys, loyalty program IDs, product codes, and so on. Add a leading apostrophe () to these strings within the Amperity segment to ensure these string values are interpreted as text by Excel.

Use the CONCAT() function to build a string with a leading character, and then use the CHR() function to apply Unicode codepoint 39, which is an apostrophe.

For example:

CONCAT(CHR(39),customer_key)

COALESCE()

Use the COALESCE(value1, value2[, …]) function to return the first non-null value in the argument list. Arguments are only evaluated if necessary.

CONCAT()

Use the CONCAT(array1, array2, …) function to concatenate a set of arrays into a single value that can be used elsewhere in a SQL query.

Note

The CONCAT() function will return NULL if the value of any field is NULL. Use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example, use:

CONCAT(COALESCE(firstname, ""), " ", COALESE(lastname, ""))

to concatenate a FullName field.

COUNT()

Use the COUNT(*) function to return the number of input rows.

Count customers by state

The following example counts customers in the United States, and then also in California, Oregon, Washington, Alaska, and Hawaii who also belong to the loyalty program (which is indicated when loyalty_id is not NULL):

SELECT
  state
  ,COUNT(amperity_id) AS TotalCustomers
FROM
  Customer360
WHERE (UPPER("country") = 'US'
AND UPPER("state") in ('AK', 'CA', 'HI', 'OR', 'WA')
AND LOWER("loyalty_id") IS NOT NULL)
GROUP BY state

CURRENT_DATE

Use the CURRENT_DATE function to return the current date as of the start of the query. You may use an interval to offset the returned date by the value of the interval.

Birthdays, tomorrow

The following example returns all users who have a birthday tomorrow:

SELECT DISTINCT
  amperity_id
  ,email
  ,given_name
  ,birthdate
FROM Customer360
WHERE MONTH('birthdate') = MONTH(CURRENT_DATE + interval '1' day)
AND DAY('birthdate') = DAY(CURRENT_DATE + interval '1' day)

CURRENT_TIMESTAMP

Use the CURRENT_TIMESTAMP function to return the current timestamp for the time zone specified by the AT TIME ZONE operator:

CURRENT_TIMESTAMP AT TIME ZONE 'time/zone'+ interval '1' day

where time/zone is a valid TZ database name , such as “America/Los_Angeles” or “America/New_York”.

Important

When the Enable performance mode option is enabled for queries, the CURRENT_TIMESTAMP function must be cast as a timestamp. For example:

CAST(CURRENT_TIMESTAMP AS timestamp)

DATE_DIFF()

Use the DATE_DIFF(unit, timestamp1, timestamp2) function as a way to return the difference between two columns that contain timestamp data, expressed in terms of an interval unit.

Possible interval unit values:

  • millisecond

  • second

  • minute

  • hour

  • day

  • week

  • month

  • quarter

  • year

Calculate purchase day

SELECT
  ,t.amperity_id
  ,t.purchasedate
  ,firstpurchase.firstpurchasedate
  ,DATE_DIFF('day', firstpurchase.firstpurchasedate, t.purchasedate) AS PurchaseDecay
FROM TransactionsEcomm t
INNER JOIN
  (
  SELECT
    i.amperity_id
    ,MIN(i.purchasedate) AS firstpurchasedate
  FROM TransactionsEcomm i
  GROUP BY i.amperity_id
  LIMIT 1000
  ) firstpurchase
ON t.amperity_id = firstpurchase.amperity_id
LIMIT 1000

DATE_FORMAT()

Use the DATE_FORMAT(timestamp, format) function to format a timestamp as a string based on the format specifier.

Specifier

Description

%a

Abbreviated weekday name (Sun .. Sat)

%b

Abbreviated month name (Jan .. Dec)

%c

Month, numeric (1 .. 12); This specifier does not support 0 as a month or day.

%d

Day of the month, numeric (01 .. 31); This specifier does not support 0 as a month or day.

%e

Day of the month, numeric (1 .. 31); This specifier does not support 0 as a month or day.

%f

Fraction of second (6 digits for printing: 000000 .. 999000; 1 - 9 digits for parsing: 0 .. 999999999), truncated to milliseconds

%H

Hour (00 .. 23)

%h

Hour (01 .. 12)

%I

Hour (01 .. 12)

%i

Minutes, numeric (00 .. 59)

%j

Day of year (001 .. 366)

%k

Hour (0 .. 23)

%l

Hour (1 .. 12)

%M

Month name (January .. December)

%m

Month, numeric (01 .. 12); This specifier does not support 0 as a month or day.

%p

AM or PM

%r

Time, 12-hour (hh:mm:ss followed by AM or PM)

%S

Seconds (00 .. 59)

%s

Seconds (00 .. 59)

%T

Time, 24-hour (hh:mm:ss)

%v

Week (01 .. 53), where Monday is the first day of the week; used with %x

%W

Weekday name (Sunday .. Saturday)

%x

Year for the week, where Monday is the first day of the week, numeric, four digits; used with %v

%Y

Year, numeric, four digits

%y

Year, numeric (two digits); When parsing, two-digit year format assumes range 1970 to 2069, so “70” will result in year 1970 but “69” will produce 2069

%%

A literal % character

%x

x, for any x not listed above

Format as Stitch date

,DATE_FORMAT(CURRENT_DATE, '%Y-%m-%d 00:00:00') AS "amperity_stitch_date"

DATE_TRUNC()

Use the DATE_TRUNC(unit, x) function to return x truncated to one of the following unit values (shown in bold):

Unit

Example Truncated Value

second

2001-08-22 03:04:05.000

minute

2001-08-22 03:04:00.000

hour

2001-08-22 03:00:00.000

day

2001-08-22 00:00:00.000

week

2001-08-20 00:00:00.000

month

2001-08-01 00:00:00.000

quarter

2001-07-01 00:00:00.000

year

2001-01-01 00:00:00.000

Last purchase, this year

WHERE DATE_TRUNC('year', LastPurchase) = DATE_TRUNC('year', Now())

DAY(), MONTH(), YEAR()

Identifying the day, month, and year from within a date can be helpful when building segments that use these values to improve the timing of communication:

  • Use DAY() to return the day of the month from a date or timestamp.

  • Use MONTH() to return the month of the year from a date or timestamp.

  • Use YEAR() to return the year from a date or timestamp.

Tip

Functions also exist for HOUR(), MINUTE(), QUARTER(), and WEEK(). These functions are used in the same manner as DAY(), MONTH(), and YEAR(), but against different elements within a date or timestamp column value.

One and dones, by year

The following example shows using a common table expression to identify all one-and-done purchasers for a single calendar year:

WITH one_and_dones_2022 AS (
  SELECT
    amperity_id
  FROM Transaction_Attributes
  WHERE one_and_done AND YEAR(first_order_datetime) = 2022
)

SELECT
  COUNT(*) one_and_dones_2022
FROM
  one_and_dones_2022

Find guests who canceled last month

The following example uses member, consumer, and confirmation IDs, along with the reservation confirmation number and the checkin date to return all customers who canceled their stay one month ago:

SELECT
  g.amperity_id AS AMPERITY_ID
  ,g.member_id AS MEMBER_ID
  ,g.consumer_id AS CONSUMER_ID
  ,r.confirmation_id AS CONFIRMATION_ID
  ,r.booking_id AS CONFIRMATION_NUMBER
  ,r.stay_start AS CHECKIN_DATE
FROM Reservation_Table r
JOIN Guest_Table g ON r.amperity_id = g.amperity_id
WHERE (
  r.stay_status = 'canceled'
  AND DAY(r.stay_start) = DAY(CURRENT_TIMESTAMP + INTERVAL '1' MONTH)
)

Find tomorrow’s birthdays

SELECT
  amperity_id
  ,email
  ,given_name
  ,birthdate
FROM "Customer360"
WHERE
  MONTH(birthdate) = MONTH(CURRENT_TIMESTAMP AT TIME ZONE 'America/Los_Angeles'+ INTERVAL '1' DAY)
AND
  DAY(birthdate) = DAY(CURRENT_TIMESTAMP AT TIME ZONE 'America/Los_Angeles' + INTERVAL '1' DAY)

DENSE_RANK()

Use the DENSE_RANK() function to compute the rank of a value in a group of values. DENSE_RANK() will assign the same rank to rows with the same value and will not create gaps in the ranked sequence.

Note

The DENSE_RANK() and RANK() functions both assign a rank to rows with the same value. The difference is that RANK() will create gaps in the sequence. The following example shows rows that are ranked in ascending order by column B:

------- ------- ------------ ------
 col_A   col_B   dense_rank   rank
------- ------- ------------ ------
 A       30      3            4
 A       20      2            3
 A       20      2            3
 A       10      1            1
 A       10      1            1
------- ------- ------------ ------

FROM_UNIXTIME()

Use the FROM_UNIXTIME(unixtime) function to return unixtime (a UNIX timestamp) as a timestamp.

  • Use FROM_UNIXTIME(unixtime, string) to return unixtime as a timestamp and string as a timezone.

  • Use FROM_UNIXTIME(unixtime, hours, minutes) to return unixtime as a timestamp and hours and minutes as a timezone offset.

Convert _updated to a date

Amperity requires each feed to specify a field that describes when each record was last updated. If multiple records in the incoming data and/or the existing domain table have the same primary key, the record with the most recent “last updated” field will be retained. This may be associated with a field that has a datetime field type, or an integer (such as for unix timestamps).

Note

Amperity does not use a field with a date data type because that value is not granular enough to determine priority.

If you have no such updated field, you can choose to autogenerate a field, in which case the following logic is used to determine which record to keep in the case a primary key appears more than once:

  1. Records from newly-ingested data will always overwrite records that already exist in the domain table.

  2. If couriers are run over a date range, records from files associated with later dates will be retained.

  3. If multiple files are loaded for the same date, records for the latest-loaded file are retained. File loading order depends on the behavior of the source system, but is generally deterministic.

  4. If the same primary key appears on multiple records on the same text-based file, the latest row on the file is retained.

Note

When using ingest queries, the above tiebreakers are unavailable, so upserting behavior can be nondeterministic. Ensure that you either specify a “last updated” field, or that your ingest query only returns a single record for each primary key, to ensure deterministic results.

Use the FROM_UNIXTIME(unixtime) function to convert the values in the last updated column (_updated) into a date format.

CAST(FROM_UNIXTIME(_updated / 1048576000) AS date) AS "Date"

GREATEST()

Use the GREATEST(column_name, column_name, …) function to return the column with the greatest value among the values of all named columns.

IF()

Use the IF(condition,x) function to evaluate and return true, NULL, or false.

  • Use IF(condition,true_value) to evaluate and return true_value when the condition is true, otherwise return NULL.

  • Use IF(condition,true_value, false_value) to return true_value when the condition is true or return false_value when the condition is false.

LEAST()

Use the LEAST(column_name, column_name, …) function to return the column with the lowest value among the values of all named columns.

Find earliest dates

SELECT
  amperity_id
  ,email
  ,given_name
  ,birthdate
FROM "Customer360"
WHERE
  LEAST(date_col1, date_col2, date_col3, date_col4)

LENGTH()

Use the LENGTH(string) function to return the length of string in characters.

LOWER()

Use the LOWER(string) function to convert string to lowercase.

MAX()

Use the MAX() function to return the maximum value of all input values.

Collapse rows by Amperity ID

The following SQL will collapse all rows with Amperity IDs in the Unified Coalesced table into a single row per Amperity ID:

SELECT
  amperity_id AS amperity_id
  ,MAX(given_name) AS given_name
  ,MAX(surname) AS surname
  ,MAX(email) AS email
  ,MAX(phone) AS phone
  ,MAX(address) AS address
  ,MAX(city) AS city
  ,MAX(state) AS state
  ,MAX(postal) AS postal
  ,MAX(birthdate) AS birthdate
  ,MAX(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id

MEDIAN()

Warning

The MEDIAN() function is not supported within Amperity. Use NTILE() of 2 over the value, GROUP BY over the NTILE(), and then take the MAX() of the first value or the MIN() of the second value.

MIN()

Use the MIN() function to return the minimum value of all input values.

Replace NULL with 0

Wrap the MIN() function to replace NULL values with 0:

MIN(CASE
  WHEN field IS NULL THEN 0
  ELSE field
END)

or:

MIN(COALESCE(field,0))

NOW()

Use the NOW() function to return the current timestamp (with time zone) as of the start of the query. This function is an alias for current_timestamp.

NTILE()

Use the NTILE(n) window function to evenly distribute large numbers of rows across n buckets, ranging from 1 to n.

Important

Results can be unpredictable when the number of returned records is less than the (n) value of NTILE().

Tip

Use the PERCENT_RANK() function when the expected number of returned rows is a small number.

Partition predicted CLV by brand

The following example shows how to separate values into brand tiers using predicted customer lifetime value.

NTILE(100) OVER (PARTITION BY brand ORDER BY predicted_clv desc, _uuid_pk)

Percentiles by 10

The following example shows how to return 10th percentiles.

SELECT
  tier*.1 AS tier
  ,MIN(lifetime_order_revenue) lifetime_revenue_amount
FROM (
  SELECT
    amperity_id
    ,lifetime_order_revenue
    ,NTILE(10) OVER (ORDER BY lifetime_order_revenue) tier
  FROM Transaction_Attributes
)
GROUP BY 1
ORDER BY 1

Percentiles by 4

The following example shows how to return the 25th, median, and 75th percentiles.

SELECT
  tier*.25 tier
  ,MIN(lifetime_order_revenue) lifetime_revenue_amount
FROM (
  SELECT DISTINCT
    amperity_id
    ,lifetime_order_revenue
    ,NTILE(4) OVER (ORDER BY lifetime_order_revenue) tier
  FROM Transaction_Attributes
)
GROUP BY 1
ORDER BY 1

NULLIF()

Use the NULLIF(expression1, expression2) function to return NULL if expression1 is equal to expression2.

PERCENT_RANK()

Use the PERCENT_RANK() function to return the percentage ranking of a value in group of values. The result is (r - 1) / (n - 1) where r is the rank of the row and n is the total number of rows in the window partition.

Percent rank of all purchases

The following example shows how to return the percent rank of all purchases.

SELECT DISTINCT
  amperity_id
  ,order_revenue
  ,PERCENT_RANK() OVER (ORDER BY order_revenue) pct_rank
FROM Unified_Transactions
WHERE order_datetime > DATE('2023-03-01') AND amperity_id IS NOT NULL
ORDER BY pct_rank DESC

PERCENTILE()

Warning

The PERCENTILE() function is not supported within Amperity.

RANK()

Use the RANK() function to return the rank of a value within a group of values. The rank is one plus the number of rows preceding the row that is not a peer to the ranked row. Tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.

Rank by Amperity ID

The following SQL query uses the RANK() function to find the largest transactions by Amperity ID, and then returns them in ascending order:

WITH ranked_transactions AS (
SELECT
  t.orderid
  ,t.amperity_id
  ,t.transactiontotal
  ,RANK() OVER (PARTITION BY t.amperity_id ORDER BY t.transactiontotal DESC) AS rank
FROM
  TransactionsEcomm t
ORDER BY t.amperity_id, rank ASC
LIMIT 100
)
SELECT * FROM ranked_transactions WHERE rank = 1

REGEXP_EXTRACT()

Use the REGEXP_EXTRACT(string, pattern) function to replace every instance of the substring matched by the regex pattern from string.

REGEXP_LIKE()

Use the REGEXP_LIKE(string, ‘expression’) function to return true when string matches expression.

Validate email addresses

The following example shows using the REGEXP_LIKE() function within a CASE statement to return valid email addresses:

CASE
  WHEN REGEXP_LIKE(email, '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$')
  AND email_completion > 0 THEN true
  ELSE false
END AS contactable_email

Validate phone numbers

The following example shows using the REGEXP_LIKE() function within a CASE statement to return valid phone numbers:

CASE
  WHEN REGEXP_LIKE(phone, '^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$')
  AND phone_completion > 0 THEN true
  ELSE false
END AS contactable_phone

REGEXP_REPLACE()

Use the REGEXP_REPLACE(string, regex) function to remove every instance of the substring matched by the regex pattern from string.

Use the REGEXP_REPLACE(string, regex, replace) function to replace every instance of the substring.

Remove whitespace

NULLIF(REGEXP_REPLACE(field, '^\\s*(\\S.*\\S)\\s*$', '\\1'), '')

Remove spaces

SELECT
  REGEXP_REPLACE(AddressLine1, '(\s*)([ ])', '')
  ,AddressLine1
FROM Customer360
LIMIT 100

Keep A-z, 0-9 from string

SELECT
  REGEXP_REPLACE(AddressLine1, '(\s*)([^a-zA-Z0-9])', '')
  ,AddressLine1
FROM Customer360
LIMIT 100

Keep alphabetical characters

,REGEXP_REPLACE(LOWER(given_name), '[^a-zA-Z]+', '') AS given_name
,REGEXP_REPLACE(LOWER(surname), '[^a-zA-Z]+', '') AS surname
,REGEXP_REPLACE(LOWER(full_name), '[^a-zA-Z ]+', '') AS full_name

Replace characters after + symbol

,REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@') AS email

Keep characters before @ symbol

,SPLIT(REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@'), '@')[1] AS email_username

Keep characters after @ symbol

,REPLACE(LOWER(email), SPLIT(REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@'), '@')[1], '') AS email_domain

Clean up semantic values

,REGEXP_REPLACE(LOWER(phone), '[^0-9]+', '') AS phone
,REGEXP_REPLACE(LOWER(address), '[.]+', '') AS address
,REGEXP_REPLACE(LOWER(address2), '[.]+', '') AS address2
,REGEXP_REPLACE(LOWER(city), '[^a-zA-Z]+', '') AS city
,REGEXP_REPLACE(LOWER(state), '[^a-zA-Z]+', '') AS state

Formatting for Facebook Ads

Note

Facebook Ads has specific requirements for data formatting and naming of certain fields. Destinations and campaigns configured for Facebook Ads automatically apply the correct formatting to fields that are required by Facebook Ads. This example shows the regular expression formatting for those fields.

,REGEXP_REPLACE(LOWER(given_name), '[.,\/#!$%\^&\*;:{}=\-_`~()@'\"+ ]', '') AS FN
,REGEXP_REPLACE(LOWER(surname), '[.,\/#!$%\^&\*;:{}=\-_`~()@'\"+ ]', '') AS LN
,REGEXP_REPLACE(LOWER(city), '[^a-z]', '') AS CT
,REGEXP_REPLACE(LOWER(state), '[^a-z]', '') AS ST
,REGEXP_REPLACE(LOWER(postal), ' " "', '') AS ZIP
,REGEXP_REPLACE(LOWER(email), '[^a-z]', '') AS EMAIL

REPLACE()

Use the REPLACE() function to remove and/or replace all instances of search from string. There are two variants:

  • Use REPLACE(string, search) to remove all instances of search from string, i.e. “replace string with nothing”.

  • Use REPLACE(string, search, replace) to replace all instances of search from string with replace.

ROUND()

Use the ROUND(x) function to return x rounded to the nearest integer.

SPLIT()

Use the SPLIT(string, regex, limit) function to split a string at occurrences that match regex, and then return the results. Add a positive integer to LIMIT the number of occurrences to that integer.

Note

In Presto SQL, the index starts at 1.

Return username from email

To return only the username from an email address (the characters before the @ symbol):

SPLIT(REGEXP_REPLACE(LOWER(email), '[.]+', ''), '@')[1] AS email_username

SPLIT_PART()

Use the SPLIT_PART(string, delimiter, index) function to split a string at occurrences that match index, and then return the results. If the index is larger than the number of available characters, or if the specified index position is out of range, the function will return NULL.

Note

In Presto SQL, the index starts at 1.

For example:

TRIM(SPLIT_PART(GIFTCARDCODE, '-', 1))

Tip

Use the SPLIT() function in Spark SQL. For example:

SPLIT(GIFTCARDCODE,'-')[0]

In Spark SQL, the index starts at 0.

SUBSTR()

Use the SUBSTR() function to return N characters in a string. There are two variants:

  • Use SUBSTR(string, start) to return string from the start position that is equal to the value of start. A positive starting position (1) is relative to the start of string; a negative starting position (-1) is relative to the end of string.

  • Use SUBSTR(string, start, length) to return string from the start position that contains the number of characters specified by length. A positive starting position (1) is relative to the start of string; a negative starting position (-1) is relative to the end of string.

Return two characters

To return the last two characters of column_name, use either of:

SUBSTR(column_name, -2)

or

SUBSTR(column_name, -1, 2)

SUM()

Use the SUM(x) function to return the sum of all input values.

Note

The SUM() function will return NULL if the value of any field is NULL. In some situations you must use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example:

SUM(COALESCE(lifetimeValue,0))

Sum invoices as total amount

,SUM(orders.invoice_amount) AS total_amount

Total revenue, descending order

The following example uses the SUM() function to return total revenue, returns, and quantity, groups them, and then arranges the results in descending order:

SELECT
  purchase_brand
  ,SUM(order_revenue) AS total_revenue
  ,SUM(order_returned_revenue) AS total_returns
  ,SUM(order_quantity) AS total_quantity
FROM Unified_Transactions
GROUP BY 1
ORDER BY 1 DESC

TO_HEX()

Use the TO_HEX(binary) function to encode binary into a hex string representation.

One-way SHA-256 hash

The following example applies a one-way SHA-256 hash to email addresses:

TO_HEX(SHA256(TO_UTF8(UPPER(TRIM(email)))))

TO_UNIXTIME()

Use the TO_UNIXTIME(timestamp) function to return timestamp as a UNIX timestamp.

TRIM()

Use the TRIM(string) function to remove leading and trailing whitespace from string.

Tip

To remove only leading whitespace use the LTRIM(string) function instead of TRIM(string).

To remove only trailing whitespace use the RTRIM(string) function instead of TRIM(string).

TRY_CAST()

Use the TRY_CAST(value AS type) function to cast a value as a type, and then return NULL if the cast fails.

UPPER()

Use the UPPER(string) function to convert string to uppercase.