About Presto SQL¶
Presto is a distributed SQL query engine that is designed to efficiently query vast amounts of data using distributed queries. Presto is used within the Queries and Segments pages in the Amperity user interface.
Why should I use this reference?¶
The SQL Segment Editor in the Queries and Segment tabs uses Presto SQL as the underlying SQL engine. Amperity segments are built almost exclusively by using the SELECT statement, along with the clauses, operators, expressions, and functions you would expect to be available.
This reference is focused on how Presto SQL is used with Amperity and is not focused on anything that you would not expect to do from the Queries and Segment tabs in Amperity.
Please refer to this reference first, and then to the official Presto SQL documentation.
Amazon AWS vs. Azure¶
Amperity uses different versions of Presto SQL, depending on if the tenant runs in Amazon AWS or Microsoft Azure.
Tenants that run in Amazon AWS run Athena engine version 2.
Tenants in Azure run Trino, version 346: release notes, SELECT statement reference (for current version), functions reference (for current version).
Amperity behaves the same on either platform when using Presto SQL to build segments from within the Segments tab, with the following exceptions:
The way Presto SQL handles fixed precision rounding for a DECIMAL data type was changed in Presto SQL version 0.198. A decimal without an explicit type specifier–for example, 1.2–is parsed as a
DOUBLE
on tenants that run in Amazon AWS and as aDECIMAL
on tenants that run in Azure.
Recommendations¶
This is a list of recommendations for using Presto SQL with Amperity segments.
General¶
Do
Use consistent and descriptive identifiers and names.
Make judicious use of white space and indentation to make code easier to read.
Store ISO-8601 compliant time and date information (YYYY-MM-DD HH:MM:SS.SSSSS).
Try to use only standard SQL functions instead of vendor-specific functions for reasons of portability.
Keep code succinct and devoid of redundant SQL, such as unnecessary quoting or parentheses or WHERE clauses that can otherwise be derived.
Include comments in SQL code where necessary.
You may use the C-language style slash and asterisk opening (
/*
) and closing (*/
) pair of delimiters to enclose comments single- and multi-line comments, as long as comments are not nested.You may precede single-line comments using
--
.Convert string values to lowercase. For example:
LOWER("column_name") = 'value'
.Code review complex SQL statements.
Keep the Segments page free of draft segments as much as possible.
Ensure segments are run automatically.
Use one-way SHA-256 hashes for fields that contain PII data.
Use double quotes (
""
) for database table and field names. Use single quotes (''
) for values used in comparisons. For example:"State" = 'Washington'
.Use parentheses to group the contents of AND, OR, and NOT operators within complex SQL to ensure that each element within the WHERE clause runs in the correct order.
Avoid
Using reserved keyword names as identifiers.
CamelCase, with the exception of the table name. CamelCase is difficult to scan quickly.
Descriptive prefixes or Hungarian notation such as
sp_
ortbl
.Plurals. Use the more natural collective term where possible instead. For example staff instead of employees or people instead of individuals.
Quote identifiers. Use SQL92 double quotes for portability, if possible.
Object-oriented design principles with SQL or database structures.
Commas¶
Commas are used as separators in SQL queries and are often added at the end of a line, like this:
SELECT
amperity_id AS amperity_id,
max(given_name) AS given_name,
max(surname) AS surname,
max(email) AS email,
max(phone) AS phone,
max(address) AS address,
max(city) AS city,
max(state) AS state,
max(postal) AS postal,
max(birthdate) AS birthdate,
max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id
Adding the comma at the start of the row is recommended for two reasons:
It makes it visually obvious that a comma was not missed.
It allows a field to be commented out without breaking anything else in the query.
For example:
SELECT
amperity_id AS amperity_id
,max(given_name) AS given_name
,max(surname) AS surname
,max(email) AS email
,max(phone) AS phone
,max(address) AS address
-- ,max(address_2) AS address_line_two
,max(city) AS city
,max(state) AS state
,max(postal) AS postal
,max(birthdate) AS birthdate
,max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id
Indentation¶
To ensure that SQL is readable it is important that standards of indentation are followed. Use a two-character indentation pattern for SQL queries as often as possible to help ensure consistency and readability. A two-character indentation pattern also helps minimize the width of the query, making it easier to read on the page and less likely to wrap or run off the page to the right.
Formalisms¶
Make use of BETWEEN where possible instead of combining multiple statements with AND. Similarly use IN() instead of multiple OR clauses. Where a value needs to be interpreted before leaving the database use the CASE expression. CASE expressions can be nested to form more complex logical structures. Avoid the use of UNION clauses and temporary tables where possible. If the schema can be optimized to remove the reliance on these features then it most likely should be.
SELECT CASE postcode
WHEN 'BN1' THEN 'Brighton'
WHEN 'EH1' THEN 'Edinburgh'
END AS 'city_uk'
FROM office_locations
WHERE country = 'United Kingdom'
AND opening_time BETWEEN 8 AND 9
AND postcode IN ('EH1', 'BN1', 'NN1', 'KW1')
Joins¶
Joins should be aligned with the FROM clause and grouped with a new line where necessary. Indenting the ON and AND statements in a JOIN is optional, but can make it easier to see individual joins when many are present.
SELECT r.last_name
FROM riders AS r
INNER JOIN bikes AS b
ON r.bike_vin_num = b.vin_num
AND b.engine_tally > 2
INNER JOIN crew AS c
ON r.crew_chief_last_name = c.last_name
AND c.chief = 'Y';
Subqueries¶
Subqueries should be aligned to the line above them, but then follow standard indentation patters from that location. Sometimes it will make sense to have the closing parenthesis on a new line at the same character position as its opening partner—this is especially true where you have nested subqueries.
SELECT r.last_name,
(SELECT MAX(YEAR(championship_date))
FROM champions AS c
WHERE c.last_name = r.last_name
AND c.confirmed = 'Y') AS `last_championship_year`
FROM riders AS r
WHERE r.last_name IN
(SELECT c.last_name
FROM champions AS c
WHERE YEAR(championship_date) > '2008'
AND c.confirmed = 'Y');
Naming conventions¶
Ensure the name is unique and does not exist as a reserved keyword. Keep the length to a maximum of 30 bytes—in practice this is 30 characters unless you are using multi-byte character set. Names must begin with a letter and may not end with an underscore. Only use letters, numbers, and underscores in names. Avoid the use of multiple consecutive underscores—these can be hard to read. Use underscores where you would naturally include a space in the name (first name becomes first_name
). Avoid abbreviations and if you have to use them make sure they are commonly understood.
SELECT first_name
FROM Customer360
Alias names¶
An alias assigns a temporary name to a column or a table that only exists for the duration of a query. An alias is defined after the AS keyword.
For columns:
SELECT column_name AS alias_name
FROM table_name;
For tables:
SELECT column_name(s)
FROM table_name AS alias_name;
When using an alias:
Assign names that clearly relate to the column or table.
Important
Do not use an alias when its name matches the original column name.
Always use the AS keyword to ensure readability.
..tip:: Include the AS keyword when aliasing columns in a SELECT statement.
For computed data – SUM() or AVG() – use the name you would give it were it a column defined in the schema.
Always wrap an aliased field name with single back ticks:
`given_name`
The following example shows using aliases to add clarity to a SELECT statement that selects columns from more than one table. The “c360” alias refers to the Customer360
table and the “el” alias refers to the Email_List
table:
SELECT c360.email, c360.given_name, c360.surname, el.status
FROM Customer360 c360
LEFT JOIN Email_List el ON c360.email = el.email
WHERE status IS NOT NULL
Column names¶
Always use the singular name. Where possible avoid simply using id as the primary identifier for the table. Do not add a column with the same name as its table and vice versa. Always use lowercase except where it may make sense not to, such as with proper nouns.
Correlation names¶
A correlation name assigns a temporary name to a table. A correlation name is defined within the FROM clause after the table name.
Syntax:
FROM table_name correlation_name
Example:
FROM Merged_Customers mc
When using correlations:
Assign names that clearly relate to the table. For example: a table named “Merged_Customers” should have a correlation name of “mc”.
Always use the first letter of the table as the first letter of the correlation.
Append additional alphanumeric characters as necessary’s to ensure unique correlation names.
Ensure that correlation names are used only within the same query.
Suffix names¶
The following suffixes represent patterns that should be applied to column names in SQL tables. These patterns help ensure the purpose of columns can be easily understood by all users. Use the correct suffixes, use them consistently, and use them everywhere they are appropriate.
Suffix |
Description |
---|---|
_id |
A unique identifier such as a column that is a primary key. |
_status |
A flag value or some other status of any type such as |
_total |
The total or sum of a collection of values. |
_num |
Denotes the field contains any kind of number. |
_name |
Signifies a name such as |
_seq |
Contains a contiguous sequence of values. |
_date |
Denotes a column that contains the date of something. |
_tally |
A count. |
_size |
The size of something such as a file size or clothing. |
_addr |
An address for the record could be physical or intangible such as |
Table names¶
Use a collective name or, less ideally, a plural form. For example (in order of preference) staff and employees. Do not prefix with tbl or any other such descriptive prefix or Hungarian notation. Never give a table the same name as one of its columns and vice versa. Avoid, where possible, concatenating two table names together to create the name of a relationship table. Rather than cars_mechanics prefer services.
NULL values¶
Functions may fail when they encounter a NULL value and others may return NULL values if any of their arguments return NULL values.
Use the COALESCE() function to convert to a zero-length string when using the CONCAT() and SUM() functions.
Use the COALESCE() function to to identify math. For example, multiplication will return NULL if any field is NULL. For example, because
1
is the identity for multiplication, useCOALESCE(myColumn, 1)
Reserved words¶
Always use uppercase for the reserved keywords like SELECT and WHERE. It is best to avoid the abbreviated keywords and use the full length ones where available (prefer ABSOLUTE to ABS). Do not use database server specific keywords where an ANSI SQL keyword already exists performing the same function. This helps to make code more portable.
SELECT model_num
FROM Phones AS p
WHERE p.release_date > '2014-09-30'
Reserved keywords¶
The following keywords are reserved and must be double-quoted to be used as an identifier.
ALTER
AND
AS
BETWEEN
BY
CASE
CAST
CONSTRAINT
CREATE
CROSS
CUBE
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
CURRENT_USER
DEALLOCATE
DELETE
DESCRIBE
DISTINCT
DROP
ELSE
END
ESCAPE
EXCEPT
EXECUTE
EXISTS
EXTRACT
FALSE
FOR
FROM
FULL
GROUP
GROUPING
HAVING
IN
INNER
INSERT
INTERSECT
INTO
IS
JOIN
LEFT
LIKE
LOCALTIME
LOCALTIMESTAMP
NATURAL
NORMALIZE
NOT
NULL
ON
OR
ORDER
OUTER
PREPARE
RECURSIVE
RIGHT
ROLLUP
SELECT
TABLE
THEN
TRUE
UESCAPE
UNION
UNNEST
USING
VALUES
WHEN
WHERE
WITH
One-way SHA-256 hashes¶
Apply one-way SHA-256 hashes to fields that contain PII data. A one-way hash ensures that data can no longer be recognizable as valid PII, yet still allows that data to applied to segments that report on users who were flagged for removal or deletion and helps ensure that removed or obfuscated data is not sent from Amperity.
A one-way SHA-256 hash has the following syntax:
TO_HEX(SHA256(TO_UTF8(UPPER(TRIM(FIELD)))))
and uses the following Presto SQL functions:
TRIM()
removes whitespace from the field.UPPER()
sets all characters to upper-case.TO_UTF8()
converts the data into a binary format.SHA256()
hashes data with a one-way SHA-256 hash.TO_HEX()
converts the binary data into a string.
Whitespace¶
To make the code easier to read it is important that the correct complement of spacing is used. Do not crowd code or remove natural language spaces.
Line spacing¶
Always include newlines/vertical space:
before AND or OR
after semicolons to separate queries for easier reading
after each keyword definition
before a comma when separating multiple columns into logical groups
to separate code into related sections, which helps to ease the readability of large chunks of code.
Putting commas and conjunctions at the start of the line makes it easier to comment out a single line without disturbing the rest of the query
SELECT
a.title
,a.release_date
--,a.recording_date
,a.production_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
OR a.title = 'The New Danger';
Spaces¶
Spaces should be used to line up the code so that the root keywords all start on the same character boundary, and also so that This makes it easy to keep track of where you are in a query that may be multiple layers deep.
(SELECT f.species_name
,AVG(f.height) AS `average_height`
,AVG(f.diameter) AS `average_diameter`
FROM flora AS f
WHERE f.species_name = 'Banksia'
OR f.species_name = 'Sheoak'
OR f.species_name = 'Wattle'
GROUP BY f.species_name, f.observation_date)
UNION ALL
(SELECT b.species_name
,AVG(b.height) AS `average_height`
,AVG(b.diameter) AS `average_diameter`
FROM botanic_garden_flora AS b
WHERE b.species_name = 'Banksia'
OR b.species_name = 'Sheoak'
OR b.species_name = 'Wattle'
GROUP BY b.species_name, b.observation_date)
Although not exhaustive always include spaces:
before and after equals (
=
)after commas (
,
)surrounding apostrophes (
'
), but not within parentheses or with a trailing comma or semicolon.
SELECT
a.title
,a.release_date
,a.recording_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
OR a.title = 'The New Danger'
Example SELECT statement¶
The following example shows selecting the Amperity ID, purchase date, and order ID, and then uses two window functions to rank by transaction totals, and then put in descending order to sort the most recent purchase date first:
SELECT
t.Amperity_Id,
t.purchasedate,
t.orderid,
rank() OVER (PARTITION BY t.Amperity_Id
ORDER BY t.transactiontotal DESC) AS rank,
t.transactiontotal,
sum(t.transactiontotal) OVER (PARTITION BY t.Amperity_Id
ORDER BY t.purchasedate) AS rolling_sum
FROM
TransactionsEcomm t
ORDER BY t.Amperity_Id, rank
LIMIT 100
WITH clause¶
The WITH clause defines a common table expression (CTE).
A common table expression (CTE) is a named subquery defined by the WITH
clause. A CTE defines a set of variables that act as a temporary view, an optional list of column names, and a query expression. The results of the query expression behaves like a table. Each column of that table is defined by the optional list of column names.
The following example shows using multiple CTEs:
WITH new_in_21 AS (
SELECT
amperity_id
,one_and_done
FROM Transaction_Attributes
WHERE YEAR(first_order_datetime) = 2021
),
product_categories AS (
SELECT DISTINCT
new_in_21.amperity_id
,one_and_done
,product_category
FROM Unified_Itemized_Transactions uit
INNER JOIN new_in_21 ON new_in_21.amperity_id=uit.amperity_id
)
SELECT
product_category
,COUNT(distinct amperity_id) customer_count
,1.0000*SUM(CASE when one_and_done THEN 1 ELSE 0 END) / COUNT(DISTINCT amperity_id) pct_one_done
FROM product_categories
GROUP BY 1
ORDER BY 3 DESC
SELECT statement¶
The SELECT statement defines a set of data to be returned from a data table. The set of returned data is often referred to as the result-set. Use the SELECT statement to retrieve rows from any table in the customer 360 database.
A SELECT statement can be complex, depending on the type of query you need to make. For example, the following SELECT statement ranks transactions by Amperity ID and by largest dollar totals:
SELECT
t.orderid,
t.Amperity_Id,
t.transactiontotal,
rank() OVER (partition by t.Amperity_Id order BY t.transactiontotal DESC) AS rank
FROM
TransactionsEcomm t
ORDER BY t.Amperity_Id, rank asc
LIMIT 100
The rest of this topic describes the clauses, expressions, functions, and operators that are the most commonly used within the SQL Segment Editor in Amperity. More functionality than what is described in this topic is supported, as the segment editors use Presto SQL.
Important
Not all of the functionality described in the official documentation for Presto SQL should be used in the SQL Segment Editor.
Subquery predicates¶
A subquery is an expression which is composed of a query. The subquery is correlated when it refers to columns outside of the subquery. Logically, the subquery will be evaluated for each row in the surrounding query. The referenced columns will thus be constant during any single evaluation of the subquery.
Note
Support for correlated subqueries is limited. Not every standard form is supported.
EXISTS predicate¶
The EXISTS predicate determines if a subquery returns any rows:
SELECT name
FROM nation
WHERE EXISTS (SELECT * FROM region WHERE region.regionkey = nation.regionkey)
IN predicate¶
The IN predicate determines if any values produced by the subquery are equal to the provided expression. The result of IN follows the standard rules for NULL values. The subquery must produce exactly one column:
SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)
SELECT DISTINCT statement¶
Use the SELECT DISTINCT statement instead of SELECT to return only distinct rows of data.
For example:
SELECT DISTINCT
amperity_id
,order_id
,order_quantity
,order_datetime
FROM Unified_Transactions
FROM clause¶
The FROM clause specifies the name of the data table against which the SQL query will be run and is part of every SELECT statement.
TABLESAMPLE BERNOULLI clause¶
Use the TABLESAMPLE BERNOULLI clause to return a random sample of records. Specify the size of the random sample within parentheses. For example:
SELECT
amperity_id
FROM Customer_360 TABLESAMPLE BERNOULLI (50)
will return a random sample of records from the Customer 360 table at a 50% rate.
The following example shows a segment for use with campaigns that samples all customers with the surname “Smith” who do not have the given name “Joe” at a 30% rate:
SELECT
"amperity_id"
FROM "Customer_360" TABLESAMPLE BERNOULLI (30)
WHERE
(
(
"amperity_id" IN (
SELECT DISTINCT "t0"."amperity_id"
FROM "Customer_360" "t0"
INNER JOIN "Customer_360" "t1"
ON "t0"."amperity_id" = "t1"."amperity_id"
WHERE
(
LOWER("t0"."surname") = 'smith'
AND LOWER("t1"."given_name") <> 'joe'
)
)
)
AND LOWER("surname") = 'smith'
AND LOWER("given_name") <> 'joe'
)
and returns a list of Amperity IDs ready for use with a campaign.
LEFT JOIN clause¶
The LEFT JOIN clause joins rows from two tables. For a LEFT JOIN, each row in the left table is joined with all matching rows from the right table. For rows with no match in the right table, the join is completed with NULL to represent column values.
For example:
SELECT * FROM (VALUES 1, 2) t("left")
LEFT JOIN (VALUES 1, 1) u("right")
ON t."left" = u."right";
will return a table similar to:
-------- --------
left right
-------- --------
1 1
1 1
2 NULL
-------- --------
WHERE clause¶
The WHERE clause filters records, and then returns only records that match the specified condition and value. The WHERE clause can be combined with AND, OR, and NOT operators.
Tip
Use parentheses to group the contents of AND, OR, and NOT operators within complex SQL to ensure that each element within the WHERE clause runs in the correct order. For example:
Do
AND (c360.email IS NOT NULL OR c360.phone IS NOT NULL)
AND (
(is_cancellation IS NULL)
OR (NOT is_cancellation)
)
AND (
(is_return IS NULL)
OR (NOT is_return)
)
Avoid
AND c360.email IS NOT NULL OR c360.phone IS NOT NULL
AND is_cancellation IS NULL OR NOT is_cancellation
AND is_return IS NULL OR NOT is_return
AND operator¶
The AND operator is used to filter records based on more than one condition. The AND operator displays a record if all the conditions separated by AND are TRUE
.
OR operator¶
The AND operator is used to filter records based on more than one condition. The OR operator displays a record if any of the conditions separated by OR is TRUE
.
NOT operator¶
The NOT operator displays a record if the condition(s) is NOT TRUE
.
CASE expression¶
The standard SQL CASE expression has two forms: simple and searched.
Note
If a CASE expression is too long, add a new line between WHEN and THEN clauses to make the line more readable.
Simple form¶
The simple form searches each value expression from left to right until it finds one that equals expression:
CASE expression
WHEN value THEN result
[ WHEN ... ]
[ ELSE result ]
END
The result for the matching value is returned.
If no match is found, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:
SELECT a,
CASE a
WHEN 1 THEN 'one'
WHEN 2 THEN 'two'
ELSE 'many'
END
Searched form¶
The searched form evaluates each boolean condition from left to right until one is true and returns the matching result:
CASE
WHEN condition THEN result
[ WHEN ... ]
[ ELSE result ]
END
If no conditions are true, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:
SELECT a, b,
CASE
WHEN a = 1 THEN 'aaa'
WHEN b = 2 THEN 'bbb'
ELSE 'ccc'
END
Group states by time zones¶
The following CASE expression groups states by North American time zone:
,CASE
WHEN state = 'HI' THEN 'Hawaii'
WHEN state = 'AK' THEN 'Alaska'
WHEN state IN ('WA','OR','CA','NV') THEN 'Pacific Time (US & Canada)'
WHEN state IN ('MT','ID','WY','UT','CO','AZ','NM') THEN 'Mountain Time (US & Canada)'
WHEN state IN ('ND','SD','NE','KS','OK','TX','MN','WI','IA','IL','MO','AR','TN','LA','MS','AL') THEN 'Central Time (US & Canada)'
WHEN state IN ('MI','IN','OH','KY','FL','GA','SC','NC','WV','VA','MD','DE','DC','NJ','PA', 'NY','CT','RI','MA','VT','NH','ME') THEN 'Eastern Time (US & Canada)'
WHEN state = 'PR' THEN 'Puerto Rico'
WHEN state = 'GU' THEN 'Guam'
WHEN state = 'VI' THEN 'US Virgin Islands'
WHEN state = 'AS' THEN 'American Samoa'
ELSE '' END AS time_zone
GROUP BY clause¶
The GROUP BY clause divides the output of a SELECT statement into groups of rows containing matching values. A simple GROUP BY clause may contain any expression composed of input columns or it may be an ordinal number selecting an output column by position (starting at one).
The following queries are equivalent. They both group the output by the given_name
input column with the first query using the ordinal position of the output column and the second query using the input column name:
SELECT COUNT(*), given_name FROM Customer_360 GROUP BY 2
is equivalent to:
SELECT COUNT(*), given_name FROM Customer_360 GROUP BY given_name
GROUP BY clauses can group output by input column names not appearing in the output of a select statement. For example, the following query generates row counts for the customer table using the input column surname
:
SELECT COUNT(*) FROM Customer_360 GROUP BY surname
returns a table similar to:
_col0
-------
82
468
42
9
1
(3984 rows)
When a GROUP BY clause is used in a SELECT statement all output expressions must be either aggregate functions or columns present in the GROUP BY clause.
CUBE operator¶
The CUBE operator generates all possible groupings for the specified group of columns.
For example, the following query:
SELECT
amperity_id
,purchase_brand
,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY CUBE(amperity_id, purchase_brand)
ORDER BY amperity_id, purchase_brand
will return a table similar to:
------------- ---------------- ---------------------
amperity_id purchase_brand total_order_revenue
------------- ---------------- ---------------------
FoxEssentials 3976068.32
FoxEssentials 3976068.32
GoldenSwan 1358357.54
GoldenSwan 1358357.54
TrendyBear 5703142.95
...
------------- ---------------- ---------------------
ROLLUP operator¶
The ROLLUP operator generates all possible subtotals for the specified group of columns.
For example, the following query:
SELECT
amperity_id
,purchase_brand
,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY ROLLUP(amperity_id, purchase_brand)
ORDER BY amperity_id, purchase_brand
will return a table similar to:
------------- ---------------- ---------------------
amperity_id purchase_brand total_order_revenue
------------- ---------------- ---------------------
FoxEssentials 3976068.32
GoldenSwan 1358357.54
TrendyBear 5703142.95
11037568.81
...
------------- ---------------- ---------------------
HAVING clause¶
The HAVING clause sorts a result set by one or more output expressions. Use in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. A HAVING clause eliminates groups that do not satisfy the given conditions and filters groups after groups and aggregates are computed.
For example:
SELECT
COUNT(*)
,purchase_brand
,payment_method
,CAST(SUM(order_quantity) AS bigint) AS quantity
FROM Unified_Transactions
GROUP BY purchase_brand, payment_method
HAVING SUM(order_quantity) > 100
ORDER BY quantity DESC
returns a table similar to:
---------------- ---------------- ----------
purchase_brand payment_method quantity
---------------- ---------------- ----------
FoxEssentials VISA 51753
GoldenSwan VISA 50933
TrendyBear 34329
FoxEssentials PayPal 33513
GoldenSwan MASTER 16651
GoldenSwan MASTER 6482
FoxEssentials DISCOVER 4673
...
---------------- ---------------- ----------
Set operations¶
Use one of the following set operation clauses to combine the results of more than one SELECT statement into a single result set:
EXCEPT clause¶
Use the EXCEPT clause to return the rows that are in the result set of the first query, but not the second.
For example, to compare two tables and find out which rows are in table B, but not in table A:
SELECT table_b
EXCEPT SELECT table_a
Tip
The EXCEPT clause deduplicates records. For example, if all of the rows in table A are present in table B, the results should be 0 rows.
Use a query similar to the following to verify that 0 rows is accurate:
SELECT COUNT(*)
FROM (
SELECT DISTINCT *
FROM table_b
)
INTERSECT clause¶
Use the INTERSECT clause to return only the rows that are in the result sets of both the first and the second queries.
For example, to find all rows in both tables A and B:
SELECT table_b
INTERSECT SELECT table_a
UNION clause¶
The UNION clause combines the the unique results of one query with the unique results of a second. For example:
SELECT shirts, pants
UNION
SELECT shoes
will return:
_col0
-------
shirts
pants
shoes
The UNION clause supports two arguments: ALL and DISTINCT. Use UNION ALL to return all rows, even if they are identical. UNION DISTINCT is the default behavior of UNION (when an argument is not specified) and will return only unique rows.
Tip
Use GROUP BY CUBE and/or GROUP BY ROLLUP clauses for more efficient unions of records.
For example, the following query:
SELECT
amperity_id
,purchase_brand
,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY amperity_id, purchase_brand
UNION ALL
SELECT
amperity_id
,NULL
,SUM(order_revenue) AS total_order_revenue
FROM Unified_Transactions
--WHERE amperity_id = '12345-abcde'
GROUP BY amperity_id
ORDER BY purchase_brand NULLS LAST
will return a table similar to:
------------- ---------------- ---------------------
amperity_id purchase_brand total_order_revenue
------------- ---------------- ---------------------
FoxEssentials 3976068.32
GoldenSwan 1358357.54
TrendyBear 5703142.95
11037568.81
...
------------- ---------------- ---------------------
UNNEST clause¶
The UNNEST clause expands an ARRAY or MAP into a relation.
Arrays are expanded into a single column.
Maps are expanded into two columns (key, value).
UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument (the other columns are padded with NULL values).
UNNEST can optionally have a WITH ORDINALITY clause, in which case an additional ordinality column is added to the end. UNNEST is normally used with a JOIN and can reference columns from relations on the left side of the join.
Use the UNNEST clause in the SQL segment to expand phone and/or email addresses so that the query looks at individual bad values. The following example shows using an UNNEST clause to expand email addresses that are part of a bad-values blocklist SQL query:
SELECT
REGEXP_REPLACE(v.value,'\+.*@','@') AS value
,M.semantic
,M.datasource
,M.domain_table
,M.proxy
FROM (
SELECT DISTINCT
SPLIT(UPPER(email), ',') AS vs
,'email' AS semantic
,REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource
,datasource AS domain_table
,UPPER(given_name) AS proxy
FROM Unified_Coalesced AS UC
) AS M
CROSS JOIN UNNEST(vs) AS v(value)
WHERE value IS NOT NULL
AND value <> ''),
VALUES clause¶
The VALUES clause can be used anywhere a query can be used, such as the FROM clause of a SELECT statement, an INSERT, or even at the top level. The VALUES clause creates an anonymous table without column names, but the table and columns can be named using an AS keyword with column aliases.
To return a table with one column and three rows:
VALUES 1, 2, 3
To return a table with two columns and three rows:
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c')
To return table with column ID and name:
SELECT * FROM (
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c')
) AS t (id, name)
To add a table with column ID and name:
CREATE TABLE example AS
SELECT * FROM (
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c')
) AS t (id, name)
Window functions¶
A window function performs calculations across rows. A window function runs after the HAVING clause, but before the ORDER BY clause. A window function requires special syntax using the OVER clause to specify the window.
Tip
When a window function is too long, add a new code line to improve readability.
A window function has the following components:
A list of columns that act as inputs to the window function.
A window function that defines the operation to be performed by the window function. For example: RANK() or SUM(). This may be any of the ranking functions, value functions, or aggregate functions available in Presto SQL.
OVER() defines the window frame, which represents a sliding window.
PARTITION BY separates the input into different partitions; only rows in the specified partition will be considered by the window function
ORDER BY determines the order in which the input will be processed by the window function.
Tip
Use ROWS BETWEEN to define any number of preceding and following rows, as related to the current row.
For example:
ROWS BETWEEN 1 preceding AND 1 following
ROWS BETWEEN current row AND 1 following
ROWS BETWEEN 5 preceding AND 2 preceding
DESC sorts the output in descending order.
Tip
Add
_uuid_pk
to window function ordering–DESC, _uuid_pk
–to return deterministic results in case of ties.
For example:
SELECT input_1, input_2, input_3,
window_function() OVER (PARTITION BY input_2
ORDER BY input_3 DESC, _uuid_pk) AS rnk
FROM table_name
ORDER BY input_2, rnk
Rolling 7-day window¶
The following example shows a rolling seven day window for order revenue.
SELECT
*
FROM (
SELECT
purchase_channel
,order_day
,SUM(order_revenue) OVER (PARTITION BY purchase_channel ORDER BY order_day ROWS BETWEEN 6 preceding AND current row) rolling_7_day_revenue
FROM (
SELECT
purchase_channel
,DATE(order_datetime) order_day
,SUM(order_revenue) order_revenue
FROM Unified_Transactions
WHERE amperity_id IS NOT NULL
AND order_datetime > (CURRENT_DATE - interval '36' day)
GROUP BY 1,2
)
)
WHERE order_day > (CURRENT_DATE - interval '30' day)
ORDER BY 1,2
Retention rate, previous year¶
The following example shows how to use a rolling 1-year window to return the values necessary for calculating retention rate during the previous year.
WITH customers AS (
SELECT DISTINCT amperity_id AS amperity_id
FROM Unified_Transactions
WHERE order_datetime < CAST(CURRENT_DATE - interval '1' year AS date)
)
,orders AS (
SELECT DISTINCT amperity_id AS ordered_in_past_year
FROM Unified_Transactions
WHERE order_datetime >= CAST(CURRENT_DATE - interval '1' year AS date)
)
SELECT
SUM(CAST(ordered_in_past_year IS NULL AS int)) AS not_retained
,SUM(CAST(ordered_in_past_year IS NOT NULL AS int)) AS retained
FROM customers
LEFT JOIN orders ON amperity_id = ordered_in_past_year
This query will return all customers who purchased more than 1 year ago, and then separates them into two groups: customers who did purchase within the previous year (retained
) and customers who did not (not_retained
).
Retention rate, in this example, is the number of customers who made a repeat purchase within the previous year represented as a percentage of all customers.
For example, if this query returned a table similar to:
-------------- ----------
not_retained retained
-------------- ----------
525217 87421
-------------- ----------
this means there are 612,638 Amperity IDs (not_retained plus retained) and 87,421 Amperity IDs that made a repeat purchase (retained). Divide the retained amount by the total number of Amperity IDs to get the retention rate: 14.2%.
ORDER BY clause¶
The ORDER BY clause sorts a result set by one or more output expressions.
ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
Each expression may be composed of output columns or it may be an ordinal number selecting an output column by position (starting at one). The ORDER BY clause is evaluated as the last step of a query after any GROUP BY or HAVING clause. The default null ordering is NULLS LAST, regardless of the ordering direction.
LIMIT clause¶
The LIMIT clause restricts the number of rows in the result set. The following example queries a large table, but the limit clause restricts the output to only have five rows (because the query lacks an ORDER BY, exactly which rows are returned is arbitrary):
SELECT orderdate FROM orders LIMIT 5;
returns a table similar to:
o_orderdate
-------------
1996-04-14
1992-01-15
1995-02-01
1995-11-12
1992-04-26
(5 rows)
Operators¶
An operator compares two data items, and then returns a result.
BETWEEN¶
Use the BETWEEN operator to test if a value falls within the specified range using the syntax BETWEEN [minimum value] AND [maximum value]:
SELECT 3 BETWEEN 2 AND 6;
The statement shown above is equivalent to the following statement:
SELECT 3 >= 2 AND 3 <= 6;
Presence of NULL evaluates to NULL
The presence of NULL will result in the statement evaluating to NULL:
SELECT NULL BETWEEN 2 AND 4;
and:
SELECT 2 BETWEEN NULL AND 6;
String arguments must be of same type
Use the BETWEEN operator to evaluate string arguments as long as the value, min, and max parameters are of the same type:
SELECT 'Paul' BETWEEN 'John' AND 'Ringo';
whereas this query will produce an error:
SELECT '2.3' BETWEEN 'John' AND '35.2';
Within a CASE statement
CASE
WHEN field BETWEEN 20 AND 1 THEN 'match'
ELSE 'noop'
END
is equivalent to:
CASE
WHEN field < 20 AND field > 1 THEN 'match'
ELSE 'noop'
END
Comparison operators¶
The following comparison operators are available:
Operator |
Description |
---|---|
< |
Less than |
> |
Greater than |
<= |
Less than or equal to |
>= |
Greater than or equal to |
= |
Equal |
<> |
Not equal |
!= |
Not equal (non-standard but popular syntax) |
Quantifiers–ALL, ANY, SOME–may be used along with comparison operators to provide more specific comparisons.
ALL¶
Use ALL together with comparison operators in the following way:
expression operator ALL ( subquery )
For example:
SELECT 21 < ALL (VALUES 19, 20, 21); -- false
Examples of ALL comparison operator combinations:
Expression |
Meaning |
---|---|
A = ALL (…) |
Evaluates to |
A <> ALL (…) |
Evaluates to |
A < ALL (…) |
Evaluates to |
ANY¶
Use ANY together with comparison operators in the following way:
expression operator ANY ( subquery )
For example:
SELECT 21 < ALL (VALUES 19, 20, 21); -- false
Examples of ANY comparison operator combinations:
Expression |
Meaning |
---|---|
A = ANY (…) |
Evaluates to Note This form is equivalent to A IN (…). |
A <> ANY (…) |
Evaluates to |
A < ANY (…) |
Evaluates to |
SOME¶
Use SOME together with comparison operators in the following way:
expression operator SOME ( subquery )
For example:
SELECT 42 >= SOME (SELECT 41 UNION ALL SELECT 42 UNION ALL SELECT 43); -- true
Examples of SOME comparison operator combinations:
Expression |
Meaning |
---|---|
A = SOME (…) |
Evaluates to Note This form is equivalent to A IN (…). |
A <> SOME (…) |
Evaluates to |
A < SOME (…) |
Evaluates to |
IS NULL¶
The IS NULL operator tests whether a value is NULL. This operator works for all data types.
Using NULL with IS NULL evaluates to TRUE
:
select NULL IS NULL;
But any other constant does not evaluate to FALSE
:
SELECT 3.0 IS NULL;
IS NOT NULL¶
The IS NULL operator tests whether a value is not NULL. This operator works for all data types.
Using NULL with IS NOT NULL evaluates to FALSE
:
select NULL IS NOT NULL;
But any other constant evaluates TRUE
:
SELECT 3.0 IS NOT NULL;
PIVOT operator¶
Warning
Presto SQL does not support the use of a PIVOT operator.
Functions¶
A function is a SQL statement that accepts input parameters, performs actions, and then returns results. This section highlights some useful functions for use when building segments in Amperity.
Note
This section highlights a very small subset of the complete list of functions available in Presto SQL, many of which can be useful depending on the type of query.
The following list contains some of the most frequently used functions for building segments via the SQL Segment Editor (alphabetized):
ARRAY_AGG()¶
Use the ARRAY_AGG(x) function to return an array created by x
.
Aggregate order IDs¶
The following SQL query uses the ARRAY_JOIN() and ARRAY_AGG() functions to aggregate an array of order IDs using a comma as the delimiter:
SELECT
amperity_id
,ARRAY_JOIN(ARRAY_AGG(order_id), ', ') AS Orders
FROM Transactions
GROUP BY amperity_id
LIMIT 1000
Sort in chronological order¶
The following SQL query uses the ARRAY_AGG() function to sort a list of products in chronological order:
SELECT DISTINCT
sub.amperity_id
,ARRAY_JOIN(ARRAY_AGG(sub.item_name) OVER(
PARTITION BY sub.amperity_id
ORDER BY sub.order_datetime DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
), ', ') AS Products_Ordered
FROM (
SELECT DISTINCT
amperity_id
,item_name
,order_datetime
FROM DW_OrderDetail Z
WHERE order_closed = true AND order_status <> 'Void'
) AS sub
Warning
This example only runs in tenants that are running on Microsoft Azure.
Separate product IDs with pipe symbol¶
The following example joins product IDs and uses the pipe symbol (|
) as a separator:
,ARRAY_JOIN(ARRAY_AGG(invoices), '|') AS product_ids
ARRAY_JOIN()¶
Use the ARRAY_JOIN(array, delimiter, string) function to concatenate elements of a given array
using a delimiter
and an optional string
to replace NULL values.
Join order IDs¶
The following SQL query uses the ARRAY_JOIN() and ARRAY_AGG() functions to aggregate an array of order IDs using a comma as the delimiter:
SELECT
amperity_id
,ARRAY_JOIN(ARRAY_AGG(order_id), ', ') AS Orders
FROM Transactions
GROUP BY amperity_id
LIMIT 1000
AVG()¶
Use the AVG(x) function to return the average of all input values.
Return mean¶
The following example returns the mean.
SELECT
product_category
,1.0*AVG(item_revenue) avg_item_revenue
FROM Unified_Itemized_Transactions
WHERE amperity_id IS NOT NULL
GROUP BY 1
ORDER BY 1 DESC
CAST()¶
Use the CAST(value AS type) function to cast the value of value
as type
.
Cast RFM as REAL data type¶
The following example shows using the CAST() function to cast the values of Recency
, Frequency
, and Monetary
to the real
data type (a floating-point, 32-bit inexact, variable-precision value).
,ROUND((CAST(Recency AS real) + CAST(Frequency AS real) + CAST(Monetary AS real)) / 3, 2)
Cast as UUID¶
The following example casts email addresses as a UUID:
,CAST(email AS UUID) AS "email_address"
Cast date as year¶
,CAST(YEAR(purchase_date) AS VARCHAR) AS purchase_year
CHR()¶
Use the CHR(codepoint) function to return the Unicode codepoint
as a single-character string.
Tip
Use the CODEPOINT(string) function to return the Unicode codepoint
as the only character of string
. For example:
CODEPOINT(')
Returns:
39
Add leading apostrophe¶
Some use cases for CSV files have a downstream dependency on Microsoft Excel. Excel automatically removes leading zeros and converts large numbers to scientific notation, such as 1.23E+15, to ensure that formulas and math operations work correctly.
Long strings are output from Amperity in the form of the Amperity ID, customer keys, loyalty program IDs, product codes, and so on. Add a leading apostrophe (’) to these strings within the Amperity segment to ensure these string values are interpreted as text by Excel.
Use the CONCAT() function to build a string with a leading character, and then use the CHR() function to apply Unicode codepoint 39
, which is an apostrophe.
For example:
CONCAT(CHR(39),customer_key)
COALESCE()¶
Use the COALESCE(value1, value2[, …]) function to return the first non-null value in the argument list. Arguments are only evaluated if necessary.
CONCAT()¶
Use the CONCAT(array1, array2, …) function to concatenate a set of arrays into a single value that can be used elsewhere in a SQL query.
Note
The CONCAT() function will return NULL if the value of any field is NULL. Use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example, use:
CONCAT(COALESCE(firstname, ""), " ", COALESE(lastname, ""))
to concatenate a FullName field.
COUNT()¶
Use the COUNT(*) function to return the number of input rows.
Count customers by state¶
The following example counts customers in the United States, and then also in California, Oregon, Washington, Alaska, and Hawaii who also belong to the loyalty program (which is indicated when loyalty_id
is not NULL):
SELECT
state
,COUNT(amperity_id) AS TotalCustomers
FROM
Customer360
WHERE (UPPER("country") = 'US'
AND UPPER("state") in ('AK', 'CA', 'HI', 'OR', 'WA')
AND LOWER("loyalty_id") IS NOT NULL)
GROUP BY state
CURRENT_DATE¶
Use the CURRENT_DATE function to return the current date as of the start of the query. You may use an interval to offset the returned date by the value of the interval.
Birthdays, tomorrow¶
The following example returns all users who have a birthday tomorrow:
SELECT DISTINCT
amperity_id
,email
,given_name
,birthdate
FROM Customer360
WHERE MONTH('birthdate') = MONTH(CURRENT_DATE + interval '1' day)
AND DAY('birthdate') = DAY(CURRENT_DATE + interval '1' day)
CURRENT_TIMESTAMP¶
Use the CURRENT_TIMESTAMP function to return the current timestamp for the time zone specified by the AT TIME ZONE operator:
CURRENT_TIMESTAMP AT TIME ZONE 'time/zone'+ interval '1' day
where time/zone
is a valid TZ database name , such as “America/Los_Angeles” or “America/New_York”.
Important
When the Enable performance mode option is enabled for queries, the CURRENT_TIMESTAMP function must be cast as a timestamp. For example:
CAST(CURRENT_TIMESTAMP AS timestamp)
DATE_DIFF()¶
Use the DATE_DIFF(unit, timestamp1, timestamp2) function as a way to return the difference between two columns that contain timestamp data, expressed in terms of an interval unit
.
Possible interval unit values:
millisecond
second
minute
hour
day
week
month
quarter
year
Calculate purchase day¶
SELECT
,t.amperity_id
,t.purchasedate
,firstpurchase.firstpurchasedate
,DATE_DIFF('day', firstpurchase.firstpurchasedate, t.purchasedate) AS PurchaseDecay
FROM TransactionsEcomm t
INNER JOIN
(
SELECT
i.amperity_id
,MIN(i.purchasedate) AS firstpurchasedate
FROM TransactionsEcomm i
GROUP BY i.amperity_id
LIMIT 1000
) firstpurchase
ON t.amperity_id = firstpurchase.amperity_id
LIMIT 1000
DATE_FORMAT()¶
Use the DATE_FORMAT(timestamp, format) function to format a timestamp
as a string based on the format
specifier.
Specifier |
Description |
---|---|
%a |
Abbreviated weekday name (Sun .. Sat) |
%b |
Abbreviated month name (Jan .. Dec) |
%c |
Month, numeric (1 .. 12); This specifier does not support 0 as a month or day. |
%d |
Day of the month, numeric (01 .. 31); This specifier does not support 0 as a month or day. |
%e |
Day of the month, numeric (1 .. 31); This specifier does not support 0 as a month or day. |
%f |
Fraction of second (6 digits for printing: 000000 .. 999000; 1 - 9 digits for parsing: 0 .. 999999999), truncated to milliseconds |
%H |
Hour (00 .. 23) |
%h |
Hour (01 .. 12) |
%I |
Hour (01 .. 12) |
%i |
Minutes, numeric (00 .. 59) |
%j |
Day of year (001 .. 366) |
%k |
Hour (0 .. 23) |
%l |
Hour (1 .. 12) |
%M |
Month name (January .. December) |
%m |
Month, numeric (01 .. 12); This specifier does not support 0 as a month or day. |
%p |
AM or PM |
%r |
Time, 12-hour (hh:mm:ss followed by AM or PM) |
%S |
Seconds (00 .. 59) |
%s |
Seconds (00 .. 59) |
%T |
Time, 24-hour (hh:mm:ss) |
%v |
Week (01 .. 53), where Monday is the first day of the week; used with %x |
%W |
Weekday name (Sunday .. Saturday) |
%x |
Year for the week, where Monday is the first day of the week, numeric, four digits; used with %v |
%Y |
Year, numeric, four digits |
%y |
Year, numeric (two digits); When parsing, two-digit year format assumes range 1970 to 2069, so “70” will result in year 1970 but “69” will produce 2069 |
%% |
A literal % character |
%x |
x, for any x not listed above |
Format as Stitch date¶
,DATE_FORMAT(CURRENT_DATE, '%Y-%m-%d 00:00:00') AS "amperity_stitch_date"
DATE_TRUNC()¶
Use the DATE_TRUNC(unit, x) function to return x
truncated to one of the following unit
values (shown in bold):
Unit |
Example Truncated Value |
---|---|
second |
2001-08-22 03:04:05.000 |
minute |
2001-08-22 03:04:00.000 |
hour |
2001-08-22 03:00:00.000 |
day |
2001-08-22 00:00:00.000 |
week |
2001-08-20 00:00:00.000 |
month |
2001-08-01 00:00:00.000 |
quarter |
2001-07-01 00:00:00.000 |
year |
2001-01-01 00:00:00.000 |
Last purchase, this year¶
WHERE DATE_TRUNC('year', LastPurchase) = DATE_TRUNC('year', Now())
DAY(), MONTH(), YEAR()¶
Identifying the day, month, and year from within a date can be helpful when building segments that use these values to improve the timing of communication:
Use DAY() to return the day of the month from a date or timestamp.
Use MONTH() to return the month of the year from a date or timestamp.
Use YEAR() to return the year from a date or timestamp.
Tip
Functions also exist for HOUR(), MINUTE(), QUARTER(), and WEEK(). These functions are used in the same manner as DAY(), MONTH(), and YEAR(), but against different elements within a date or timestamp column value.
One and dones, by year¶
The following example shows using a common table expression to identify all one-and-done purchasers for a single calendar year:
WITH one_and_dones_2022 AS (
SELECT
amperity_id
FROM Transaction_Attributes
WHERE one_and_done AND YEAR(first_order_datetime) = 2022
)
SELECT
COUNT(*) one_and_dones_2022
FROM
one_and_dones_2022
Find guests who canceled last month¶
The following example uses member, consumer, and confirmation IDs, along with the reservation confirmation number and the checkin date to return all customers who canceled their stay one month ago:
SELECT
g.amperity_id AS AMPERITY_ID
,g.member_id AS MEMBER_ID
,g.consumer_id AS CONSUMER_ID
,r.confirmation_id AS CONFIRMATION_ID
,r.booking_id AS CONFIRMATION_NUMBER
,r.stay_start AS CHECKIN_DATE
FROM Reservation_Table r
JOIN Guest_Table g ON r.amperity_id = g.amperity_id
WHERE (
r.stay_status = 'canceled'
AND DAY(r.stay_start) = DAY(CURRENT_TIMESTAMP + INTERVAL '1' MONTH)
)
Find tomorrow’s birthdays¶
SELECT
amperity_id
,email
,given_name
,birthdate
FROM "Customer360"
WHERE
MONTH(birthdate) = MONTH(CURRENT_TIMESTAMP AT TIME ZONE 'America/Los_Angeles'+ INTERVAL '1' DAY)
AND
DAY(birthdate) = DAY(CURRENT_TIMESTAMP AT TIME ZONE 'America/Los_Angeles' + INTERVAL '1' DAY)
DENSE_RANK()¶
Use the DENSE_RANK() function to compute the rank of a value in a group of values. DENSE_RANK() will assign the same rank to rows with the same value and will not create gaps in the ranked sequence.
Note
The DENSE_RANK() and RANK() functions both assign a rank to rows with the same value. The difference is that RANK() will create gaps in the sequence. The following example shows rows that are ranked in ascending order by column B:
------- ------- ------------ ------
col_A col_B dense_rank rank
------- ------- ------------ ------
A 30 3 4
A 20 2 3
A 20 2 3
A 10 1 1
A 10 1 1
------- ------- ------------ ------
FROM_UNIXTIME()¶
Use the FROM_UNIXTIME(unixtime) function to return unixtime
(a UNIX timestamp) as a timestamp.
Use FROM_UNIXTIME(unixtime, string) to return
unixtime
as a timestamp andstring
as a timezone.Use FROM_UNIXTIME(unixtime, hours, minutes) to return
unixtime
as a timestamp andhours
andminutes
as a timezone offset.
Convert _updated to a date¶
Amperity requires each feed to specify a field that describes when each record was last updated. If multiple records in the incoming data and/or the existing domain table have the same primary key, the record with the most recent “last updated” field will be retained. This may be associated with a field that has a datetime field type, or an integer (such as for unix timestamps).
Note
Amperity does not use a field with a date data type because that value is not granular enough to determine priority.
If you have no such updated field, you can choose to autogenerate a field, in which case the following logic is used to determine which record to keep in the case a primary key appears more than once:
Records from newly-ingested data will always overwrite records that already exist in the domain table.
If couriers are run over a date range, records from files associated with later dates will be retained.
If multiple files are loaded for the same date, records for the latest-loaded file are retained. File loading order depends on the behavior of the source system, but is generally deterministic.
If the same primary key appears on multiple records on the same text-based file, the latest row on the file is retained.
Note
When using ingest queries, the above tiebreakers are unavailable, so upserting behavior can be nondeterministic. Ensure that you either specify a “last updated” field, or that your ingest query only returns a single record for each primary key, to ensure deterministic results.
Use the FROM_UNIXTIME(unixtime) function to convert the values in the last updated column (_updated
) into a date format.
CAST(FROM_UNIXTIME(_updated / 1048576000) AS date) AS "Date"
GREATEST()¶
Use the GREATEST(column_name, column_name, …) function to return the column with the greatest value among the values of all named columns.
IF()¶
Use the IF(condition,x) function to evaluate and return true, NULL, or false.
Use IF(condition,true_value) to evaluate and return
true_value
when thecondition
istrue
, otherwise return NULL.Use IF(condition,true_value, false_value) to return
true_value
when thecondition
istrue
or returnfalse_value
when thecondition
isfalse
.
LEAST()¶
Use the LEAST(column_name, column_name, …) function to return the column with the lowest value among the values of all named columns.
Find earliest dates¶
SELECT
amperity_id
,email
,given_name
,birthdate
FROM "Customer360"
WHERE
LEAST(date_col1, date_col2, date_col3, date_col4)
LENGTH()¶
Use the LENGTH(string) function to return the length of string
in characters.
LOWER()¶
Use the LOWER(string) function to convert string
to lowercase.
MAX()¶
Use the MAX() function to return the maximum value of all input values.
Collapse rows by Amperity ID¶
The following SQL will collapse all rows with Amperity IDs in the Unified Coalesced table into a single row per Amperity ID:
SELECT
amperity_id AS amperity_id
,MAX(given_name) AS given_name
,MAX(surname) AS surname
,MAX(email) AS email
,MAX(phone) AS phone
,MAX(address) AS address
,MAX(city) AS city
,MAX(state) AS state
,MAX(postal) AS postal
,MAX(birthdate) AS birthdate
,MAX(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id
MEDIAN()¶
Warning
The MEDIAN() function is not supported within Amperity. Use NTILE() of 2 over the value, GROUP BY over the NTILE(), and then take the MAX() of the first value or the MIN() of the second value.
MIN()¶
Use the MIN() function to return the minimum value of all input values.
Replace NULL with 0¶
Wrap the MIN() function to replace NULL values with 0
:
MIN(CASE
WHEN field IS NULL THEN 0
ELSE field
END)
or:
MIN(COALESCE(field,0))
NOW()¶
Use the NOW() function to return the current timestamp (with time zone) as of the start of the query. This function is an alias for current_timestamp
.
NTILE()¶
Use the NTILE(n) window function to evenly distribute large numbers of rows across n
buckets, ranging from 1 to n.
Important
Results can be unpredictable when the number of returned records is less than the (n)
value of NTILE().
Tip
Use the PERCENT_RANK() function when the expected number of returned rows is a small number.
Partition predicted CLV by brand¶
The following example shows how to separate values into brand tiers using predicted customer lifetime value.
NTILE(100) OVER (PARTITION BY brand ORDER BY predicted_clv desc, _uuid_pk)
Percentiles by 10¶
The following example shows how to return 10th percentiles.
SELECT
tier*.1 AS tier
,MIN(lifetime_order_revenue) lifetime_revenue_amount
FROM (
SELECT
amperity_id
,lifetime_order_revenue
,NTILE(10) OVER (ORDER BY lifetime_order_revenue) tier
FROM Transaction_Attributes
)
GROUP BY 1
ORDER BY 1
Percentiles by 4¶
The following example shows how to return the 25th, median, and 75th percentiles.
SELECT
tier*.25 tier
,MIN(lifetime_order_revenue) lifetime_revenue_amount
FROM (
SELECT DISTINCT
amperity_id
,lifetime_order_revenue
,NTILE(4) OVER (ORDER BY lifetime_order_revenue) tier
FROM Transaction_Attributes
)
GROUP BY 1
ORDER BY 1
NULLIF()¶
Use the NULLIF(expression1, expression2) function to return NULL if expression1
is equal to expression2
.
PERCENT_RANK()¶
Use the PERCENT_RANK() function to return the percentage ranking of a value in group of values. The result is (r - 1) / (n - 1)
where r
is the rank of the row and n
is the total number of rows in the window partition.
Percent rank of all purchases¶
The following example shows how to return the percent rank of all purchases.
SELECT DISTINCT
amperity_id
,order_revenue
,PERCENT_RANK() OVER (ORDER BY order_revenue) pct_rank
FROM Unified_Transactions
WHERE order_datetime > DATE('2023-03-01') AND amperity_id IS NOT NULL
ORDER BY pct_rank DESC
PERCENTILE()¶
Warning
The PERCENTILE() function is not supported within Amperity.
RANK()¶
Use the RANK() function to return the rank of a value within a group of values. The rank is one plus the number of rows preceding the row that is not a peer to the ranked row. Tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.
Rank by Amperity ID¶
The following SQL query uses the RANK() function to find the largest transactions by Amperity ID, and then returns them in ascending order:
WITH ranked_transactions AS (
SELECT
t.orderid
,t.amperity_id
,t.transactiontotal
,RANK() OVER (PARTITION BY t.amperity_id ORDER BY t.transactiontotal DESC) AS rank
FROM
TransactionsEcomm t
ORDER BY t.amperity_id, rank ASC
LIMIT 100
)
SELECT * FROM ranked_transactions WHERE rank = 1
REGEXP_EXTRACT()¶
Use the REGEXP_EXTRACT(string, pattern) function to replace every instance of the substring matched by the regex
pattern from string
.
REGEXP_LIKE()¶
Use the REGEXP_LIKE(string, ‘expression’) function to return true when string
matches expression
.
Validate email addresses¶
The following example shows using the REGEXP_LIKE() function within a CASE statement to return valid email addresses:
CASE
WHEN REGEXP_LIKE(email, '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$')
AND email_completion > 0 THEN true
ELSE false
END AS contactable_email
Validate phone numbers¶
The following example shows using the REGEXP_LIKE() function within a CASE statement to return valid phone numbers:
CASE
WHEN REGEXP_LIKE(phone, '^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$')
AND phone_completion > 0 THEN true
ELSE false
END AS contactable_phone
REGEXP_REPLACE()¶
Use the REGEXP_REPLACE(string, regex) function to remove every instance of the substring matched by the regex
pattern from string
.
Use the REGEXP_REPLACE(string, regex, replace) function to replace every instance of the substring.
Remove whitespace¶
NULLIF(REGEXP_REPLACE(field, '^\\s*(\\S.*\\S)\\s*$', '\\1'), '')
Remove spaces¶
SELECT
REGEXP_REPLACE(AddressLine1, '(\s*)([ ])', '')
,AddressLine1
FROM Customer360
LIMIT 100
Keep A-z, 0-9 from string¶
SELECT
REGEXP_REPLACE(AddressLine1, '(\s*)([^a-zA-Z0-9])', '')
,AddressLine1
FROM Customer360
LIMIT 100
Keep alphabetical characters¶
,REGEXP_REPLACE(LOWER(given_name), '[^a-zA-Z]+', '') AS given_name
,REGEXP_REPLACE(LOWER(surname), '[^a-zA-Z]+', '') AS surname
,REGEXP_REPLACE(LOWER(full_name), '[^a-zA-Z ]+', '') AS full_name
Replace characters after + symbol¶
,REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@') AS email
Keep characters before @ symbol¶
,SPLIT(REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@'), '@')[1] AS email_username
Keep characters after @ symbol¶
,REPLACE(LOWER(email), SPLIT(REGEXP_REPLACE(LOWER(email), '\+(.*?)\@', '@'), '@')[1], '') AS email_domain
Clean up semantic values¶
,REGEXP_REPLACE(LOWER(phone), '[^0-9]+', '') AS phone
,REGEXP_REPLACE(LOWER(address), '[.]+', '') AS address
,REGEXP_REPLACE(LOWER(address2), '[.]+', '') AS address2
,REGEXP_REPLACE(LOWER(city), '[^a-zA-Z]+', '') AS city
,REGEXP_REPLACE(LOWER(state), '[^a-zA-Z]+', '') AS state
Formatting for Facebook Ads¶
Note
Facebook Ads has specific requirements for data formatting and naming of certain fields. Destinations and campaigns configured for Facebook Ads automatically apply the correct formatting to fields that are required by Facebook Ads. This example shows the regular expression formatting for those fields.
,REGEXP_REPLACE(LOWER(given_name), '[.,\/#!$%\^&\*;:{}=\-_`~()@'\"+ ]', '') AS FN
,REGEXP_REPLACE(LOWER(surname), '[.,\/#!$%\^&\*;:{}=\-_`~()@'\"+ ]', '') AS LN
,REGEXP_REPLACE(LOWER(city), '[^a-z]', '') AS CT
,REGEXP_REPLACE(LOWER(state), '[^a-z]', '') AS ST
,REGEXP_REPLACE(LOWER(postal), ' " "', '') AS ZIP
,REGEXP_REPLACE(LOWER(email), '[^a-z]', '') AS EMAIL
REPLACE()¶
Use the REPLACE() function to remove and/or replace all instances of search
from string
. There are two variants:
Use REPLACE(string, search) to remove all instances of
search
fromstring
, i.e. “replace string with nothing”.Use REPLACE(string, search, replace) to replace all instances of
search
fromstring
withreplace
.
ROUND()¶
Use the ROUND(x) function to return x
rounded to the nearest integer.
SPLIT()¶
Use the SPLIT(string, regex, limit) function to split a string
at occurrences that match regex
, and then return the results. Add a positive integer to LIMIT the number of occurrences to that integer.
Note
In Presto SQL, the index starts at 1.
Return username from email¶
To return only the username from an email address (the characters before the @ symbol):
SPLIT(REGEXP_REPLACE(LOWER(email), '[.]+', ''), '@')[1] AS email_username
SPLIT_PART()¶
Use the SPLIT_PART(string, delimiter, index) function to split a string
at occurrences that match index
, and then return the results. If the index is larger than the number of available characters, or if the specified index position is out of range, the function will return NULL.
Note
In Presto SQL, the index starts at 1.
For example:
TRIM(SPLIT_PART(GIFTCARDCODE, '-', 1))
Tip
Use the SPLIT()
function in Spark SQL. For example:
SPLIT(GIFTCARDCODE,'-')[0]
In Spark SQL, the index starts at 0.
SUBSTR()¶
Use the SUBSTR() function to return N characters in a string. There are two variants:
Use SUBSTR(string, start) to return
string
from thestart
position that is equal to the value ofstart
. A positive starting position (1
) is relative to the start ofstring
; a negative starting position (-1
) is relative to the end ofstring
.Use SUBSTR(string, start, length) to return
string
from thestart
position that contains the number of characters specified bylength
. A positive starting position (1
) is relative to the start ofstring
; a negative starting position (-1
) is relative to the end ofstring
.
Return two characters¶
To return the last two characters of column_name
, use either of:
SUBSTR(column_name, -2)
or
SUBSTR(column_name, -1, 2)
SUM()¶
Use the SUM(x) function to return the sum of all input values.
Note
The SUM() function will return NULL if the value of any field is NULL. In some situations you must use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example:
SUM(COALESCE(lifetimeValue,0))
Sum invoices as total amount¶
,SUM(orders.invoice_amount) AS total_amount
Total revenue, descending order¶
The following example uses the SUM() function to return total revenue, returns, and quantity, groups them, and then arranges the results in descending order:
SELECT
purchase_brand
,SUM(order_revenue) AS total_revenue
,SUM(order_returned_revenue) AS total_returns
,SUM(order_quantity) AS total_quantity
FROM Unified_Transactions
GROUP BY 1
ORDER BY 1 DESC
TO_HEX()¶
Use the TO_HEX(binary) function to encode binary
into a hex string representation.
One-way SHA-256 hash¶
The following example applies a one-way SHA-256 hash to email addresses:
TO_HEX(SHA256(TO_UTF8(UPPER(TRIM(email)))))
TO_UNIXTIME()¶
Use the TO_UNIXTIME(timestamp) function to return timestamp
as a UNIX timestamp.
TRIM()¶
Use the TRIM(string) function to remove leading and trailing whitespace from string
.
Tip
To remove only leading whitespace use the LTRIM(string) function instead of TRIM(string).
To remove only trailing whitespace use the RTRIM(string) function instead of TRIM(string).
TRY_CAST()¶
Use the TRY_CAST(value AS type) function to cast a value
as a type
, and then return NULL if the cast fails.
UPPER()¶
Use the UPPER(string) function to convert string
to uppercase.