About Spark SQL¶
Spark SQL is a high-performance SQL query engine that Amperity uses to ingest data, create domain tables, and extend the outcome of the Stitch process within your customer profile database.
Why should I use this reference?¶
The Sources and Databases pages use Spark SQL as the underlying SQL engine. Amperity database tables, custom domain tables, and ingest queries are built almost exclusively by using the SELECT statement, along with the clauses, operators, expressions, and functions you would expect to be available, though you may use additional functionality within Spark SQL as necessary.
Please refer to this reference first, and then to the official Spark SQL 3.5.1 documentation.
About Spark SQL¶
Use Spark SQL to define all SQL queries related to the following areas of Amperity:
Ingesting data, including ingest queries
Processing data into domain tables
Building custom domain tables
Loading data into Stitch
Running Stitch
Loading the results of Stitch into the customer 360 database
Defining tables in the customer 360 database
Note
Spark SQL is used to define all SQL queries related to the Stitch process up to (and including) building the tables in the customer 360 database. Presto SQL is used to define SQL queries for segments. Why both?
Spark SQL performs better for resource-intensive, more traditional workloads, such as machine learning and ETL-like processing.
Presto SQL performs better when running real-time queries against cloud datasets.
Tip
Amperity uses Spark SQL, version 3.5.1.
Recommendations¶
This is a list of recommendations for using Spark SQL with Amperity.
General¶
Do
Use consistent and descriptive identifiers and names.
Include comments in SQL code where necessary. Use the C-style opening /* and closing */ where possible; otherwise precede comments with -- and finish them with a new line.
Make judicious use of white space and indentation to make code easier to read.
Store ISO-8601 compliant date and time information (YYYY-MM-DD HH:MM:SS.SSSSS).
Try to use only standard SQL functions instead of vendor-specific functions for reasons of portability.
Keep code succinct and devoid of redundant SQL, such as unnecessary quoting or parentheses or WHERE clauses that can otherwise be derived.
Convert string values to lowercase. For example: LOWER("column_name") = 'value'.
Code review complex SQL statements.
Treat varchars as string values.
Use one-way SHA-256 hashes for fields that contain PII data.
Ensure that window functions are complete and use OVER() along with PARTITION BY and ORDER BY.
Use backticks around column names when they contain spaces or special characters, when they are fully numeric, or when the column name is also a reserved word.
Avoid
Reserved keyword names that are used as identifiers.
Columns aliased to the same name. For example, avoid email AS email.
Quotes (" ") around column names unless using them is necessary.
CamelCase, with the exception of the table name. CamelCase is more difficult to scan quickly.
Descriptive prefixes or Hungarian notation such as sp_ or tbl.
Plurals. Use the more natural collective term where possible. For example, staff instead of employees or people instead of individuals.
Quoted identifiers. If you must quote identifiers, use SQL-92 double quotes for portability.
Object-oriented design principles with SQL or database structures.
Boolean values wrapped in quotes. Use TRUE and FALSE without quotes.
Large tables defined within the NOT IN expression of a WHERE statement. These will run slowly and/or result in out-of-memory errors.
Add comments¶
Be sure to add comments to all SQL code that defines database tables or performs any type of pre-ingest processing, such as an ingest query.
Code comments should describe:
The purpose of the table, especially for an intermediate table that is consumed by other tables in a database.
The purpose of any sub-queries.
Situations where data is coalesced into single fields. Be sure to note what the rule is so that future maintainers of that SQL statement can apply it to their changes. For example: “Items are populated in the following order, when available: from loyalty tables, then POS tables, and then any historical tables.”
Non-obvious actions within the SQL statement. For example, when inserting static text as a date value or when joining on fields that appear to have different data types.
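For example, a brief sketch that follows these guidelines (the table and field names are illustrative, not from a real tenant):

/* Builds the customer email summary.
   Emails are populated in the following order, when available:
   from loyalty tables, then POS tables, and then any historical tables. */
SELECT
   amperity_id
   -- Static date inserted on purpose: this feed does not supply a load date.
   ,CAST('2024-01-01' AS DATE) AS load_date
   ,COALESCE(loyalty_email, pos_email, historical_email) AS email
FROM Email_Sources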
Avoid low cardinality¶
A JOIN operation will run slowly, use more memory, and return lower quality results when many rows on both sides of the operation contain the same value. (It’s fine if one side contains many rows with the same value.)
For example, suppose a pos table contains over a million rows of POS data with an email column where 50% of the rows contain the email address "no@email.com", and a loyaltyProgram table contains 100,000 rows, also 50% "no@email.com".
Do this
SELECT *
FROM pos
LEFT JOIN loyaltyProgram AS l ON (pos.email = l.email)
WHERE pos.email <> 'no@email.com'
This will filter out records with "no@email.com", and then perform the join.
Not this
SELECT *
FROM pos
LEFT JOIN loyaltyProgram AS l ON (pos.email = l.email)
This will cause the query processor to try to match every "no@email.com" email in the pos table to every email in the loyaltyProgram table, which results in 500,000 * 50,000 matches, or 25,000,000,000 records.
Avoid multiple passes¶
In some cases information from Amperity must be added to the data ingest process so that Amperity can correctly shape the data prior to the Stitch process. This should be done in a single pass.
At the end of processing on day 1 create a “feedback segment” that contains the data to be shaped on day 2, and then output that segment data so that it may be ingested by Amperity on day 2.
On day 2 ingest data from the customer’s sources normally, and then ingest the feedback segment.
Join the day 2 data to the feedback segment, and then shape the results for Stitch.
Ensure the data is made available to Stitch, and then run Stitch.
Repeat the feedback segment process on subsequent days if necessary.
Commas¶
Commas are used as separators in SQL queries and are typically added at the end of a line, like this:
SELECT
   amperity_id AS amperity_id,
   max(given_name) AS given_name,
   max(surname) AS surname,
   max(email) AS email,
   max(phone) AS phone,
   max(address) AS address,
   max(city) AS city,
   max(state) AS state,
   max(postal) AS postal,
   max(birthdate) AS birthdate,
   max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id
Adding the comma at the start of the row is recommended for two reasons:
It makes it visually obvious that a comma was not missed.
It allows a field to be commented out without breaking anything else in the query.
For example:
SELECT
   amperity_id AS amperity_id
   ,max(given_name) AS given_name
   ,max(surname) AS surname
   ,max(email) AS email
   ,max(phone) AS phone
   ,max(address) AS address
-- ,max(address_2) AS address_line_two
   ,max(city) AS city
   ,max(state) AS state
   ,max(postal) AS postal
   ,max(birthdate) AS birthdate
   ,max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id
Escape characters¶
Use a backslash ( \ ) as an escape character for strings that contain single and double quotes. For example:
,CASE
   WHEN store_name = 'Macy\'s'
   THEN 'Macy\'s'
   ELSE 'Other'
END AS store_name
Important
A backslash ( \ ) contained within a regular expression MUST be escaped using another backslash. For example, a regular expression that matches a single numeric digit should be \\d and not \d.
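For example, a minimal sketch using the REGEXP_EXTRACT() function (the input string is illustrative):

SELECT REGEXP_EXTRACT('order-1234', '\\d+', 0) AS order_number
-- returns '1234'; with a single backslash the SQL parser would consume it
-- and the pattern would not match digits as intended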
Daisy-chain LEFT JOINs¶
Daisy-chain the ON conditions when joining several tables together with a LEFT JOIN.
Do this
FROM tableA AS a
LEFT JOIN tableB AS b ON (a.key = b.key)
LEFT JOIN tableC AS c ON (b.key = c.key)
This allows the query processor to prune from the right when executing the query, only evaluating the rows of table A against the rows of table B (with whatever columns came along from table C).
Not this
FROM tableA AS a
LEFT JOIN tableB AS b ON (a.key = b.key)
LEFT JOIN tableC AS c ON (a.key = c.key)
If b.key and c.key are unique, the query will work as expected. If “b.key” and “c.key” are not unique in their source tables, the query will run for a long time and return the wrong data because the system will return rows for all matching pairs of “b.key” and “c.key” for each value of “a.key”. An acceptable workaround in situations like this is to first aggregate table B and table C on the key in a subquery, and then join to table A.
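A sketch of that workaround, assuming hypothetical value columns value_b and value_c:

FROM tableA AS a
LEFT JOIN (
   SELECT key, MAX(value_b) AS value_b
   FROM tableB
   GROUP BY key
) AS b ON (a.key = b.key)
LEFT JOIN (
   SELECT key, MAX(value_c) AS value_c
   FROM tableC
   GROUP BY key
) AS c ON (a.key = c.key)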
Faster operations are better¶
The cost of operations in Spark SQL depends on many things.
----------- ------------------------------------------------------------------
 Cost        Discussion
----------- ------------------------------------------------------------------
 Minimal     A query will run fastest when:
 Low         The following actions have a minimal effect on Spark SQL
             performance:
 Medium      The following actions will reduce the performance of Spark SQL
             in various ways. They are often necessary, but should be used
             carefully and with consideration of the performance cost:
 Expensive   The following actions are expensive:
----------- ------------------------------------------------------------------
Filter, then combine¶
A pattern that filters out unneeded rows and selects only necessary columns is much faster than first gathering together all of the information that may be required.
Do this
SELECT
   uct.amperity_id
   ,uct.first_name
   ,uct.last_name
   ,ecomm.last_order_date
FROM
(
   SELECT
      amperity_id
      ,MAX(first_name) AS first_name
      ,MAX(last_name) AS last_name
   FROM Unified
   WHERE datasource <> 'Example'
   GROUP BY amperity_id
) AS uct
LEFT JOIN ecomm ON (ecomm.amperity_id = uct.amperity_id)
Not this
SELECT
   Unified.amperity_id
   ,Unified.first_name
   ,Unified.last_name
   ,ecomm.last_order_date
FROM Unified
LEFT JOIN ecomm ON (ecomm.amperity_id = Unified.amperity_id)
WHERE Unified.datasource <> 'Example'
Filter NULL values, then join¶
In some cases the presence of NULL values can cause skew when joined to other tables. You can reduce skew by filtering out NULL values prior to the join operation. For example:
LEFT JOIN (
   SELECT *
   FROM Merged_Customers
   WHERE email IS NOT NULL
   AND phone IS NOT NULL
) mc
ON table_A.email = mc.email
OR table_A.phone = mc.phone
Identifiers¶
An identifier is a string that associates a database, table, or column to its parent and child objects within the same database or table. An identifier is qualified using a dot separator ( . ) and is often referred to as “dot notation”. For example:
database_name.table_name.column_name
To use table identifiers to flatten nested XML data
Note
This example uses an example XML file as the data source for sales transactions.
Use identifiers and aliases to flatten nested XML data with an ingest query, similar to:
SELECT
   salesTransactionId AS id
   ,type
   ,dateTime AS salesDateTime
   ,salesOrder.salesOrderId AS salesOrderId
   ,salesOrder.channelType AS channelType
   ,salesOrder.orderSummary.totalAmount AS totalAmount
FROM PosXml
returns a table similar to:
----- ------ ---------------------- -------------- ------------- -------------
id type salesDateTime salesOrderId channelType totalAmount
----- ------ ---------------------- -------------- ------------- -------------
ABC Add 2020-11-15T04:54:34Z A1zyBCxwvDu Cafe 120
DEF Add 2020-11-15T04:55:25Z B1yxCDwvuEt Cafe 14
GHI Add 2020-11-15T04:57:12Z C1xwDEvutFs Cafe 27
----- ------ ---------------------- -------------- ------------- -------------
Indentation¶
To ensure that SQL is readable it is important that standards of indentation are followed.
Formalisms¶
Make use of BETWEEN where possible instead of combining multiple statements with AND. Similarly, use IN() instead of multiple OR clauses. Where a value needs to be interpreted before leaving the database use the CASE expression. CASE statements can be nested to form more complex logical structures. Avoid the use of UNION clauses and temporary tables where possible.
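A few sketches of these preferences (column names are illustrative):

-- Prefer BETWEEN over combining comparisons with AND:
WHERE order_total BETWEEN 100 AND 500
-- rather than: WHERE order_total >= 100 AND order_total <= 500

-- Prefer IN() over multiple OR clauses:
WHERE state IN ('WA', 'OR', 'CA')
-- rather than: WHERE state = 'WA' OR state = 'OR' OR state = 'CA'

-- Interpret a value before it leaves the database with CASE:
SELECT CASE WHEN order_total >= 500 THEN 'high' ELSE 'standard' END AS order_tier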
Join operations¶
Joins should be aligned with the FROM and grouped with a new line where necessary.
Subqueries¶
Subqueries should be aligned to the line above them, but then follow standard indentation patterns from that location. Sometimes it will make sense to have the closing parenthesis on a new line at the same character position as its opening partner; this is especially true where you have nested subqueries.
Limit expensive operations¶
Apache Spark breaks problems into little parts, works on each little part, and then rolls them back together. This means that operations that can be easily done in chunks, like “find everyone named ‘Justin’”, run much faster than operations over the whole dataset, like “give me a list of unique first names.”
Tip
This behavior also affects operations like ORDER BY for similar reasons. In order for Spark to determine the correct order it breaks problems into little parts, works on each little part, and then rolls them back together.
The following diagram shows how Apache Spark finds everyone named Justin:

The following diagram shows how Apache Spark finds a list of unique first names. Each worker deduplicates its first names; the results are then recombined, broken into pieces again, and deduplicated further.
This process continues until there is no overlap between record sets for each worker. Apache Spark has to break the work apart and bring it back together again several times to get to a final set of unique names:

Limit extra work¶
Limit the work done by a SQL query to improve overall performance. The following examples show how to use window functions to create subsections, order them by date, and then pick the first one.
Do this
SELECT
   amperity_id,
   FIRST_VALUE(LOWER(EMAIL)) OVER
     (PARTITION BY amperity_id)
     AS email_address,
   CAST('2017-12-22' AS DATE) as merged_date
FROM FlatFiles_CampsiteGawEmailListDec2017
WHERE EMAIL IS NOT NULL
This example performs better and achieves the same result.
Not this
SELECT DISTINCT
   amperity_id
   ,FIRST_VALUE(LOWER(EMAIL_ADDRESS_)) OVER
     (PARTITION BY amperity_id ORDER BY merged_date DESC)
     AS email_address
   ,FIRST_VALUE(merged_date) OVER
     (PARTITION BY LOWER(EMAIL_ADDRESS_) ORDER BY merged_date DESC)
     AS mergeddate
FROM (
   SELECT
      amperity_id
      ,EMAIL_ADDRESS_
      ,CAST('2017-12-22' AS DATE) as merged_date
   FROM FlatFiles_TevaSweepstakesDecember2017
   WHERE EMAIL_ADDRESS_ IS NOT NULL)
WHERE merged_date IS NOT NULL
This example does extra work to set all dates to the same value and is slower.
Limit tables¶
A database should only include tables that are useful to downstream activity, such as building segments or for a database export. As a general rule, segment authors must never use any tables with a name that starts with “Unified” to build a segment. Use the passthrough option to make available certain domain tables that contain data that is useful for segmentation.
Load sizes¶
Apache Spark prefers load sizes to range between 1-10000 files and file sizes to range between 1-1000 MB. Apache Spark will parse 100 x 10 MB files faster than 10 x 100 MB files and much faster than 1 x 1000 MB file. When loading large files to Amperity, as a general guideline to optimize the performance of Apache Spark, look to create situations where:
The number of individual files is below 3000.
The range of individual file sizes is below 100 MB.
Put differently, Apache Spark will parse 3000 x 100 MB files faster than 300 x 1000 MB files and much faster than 30 x 10000 MB files.
Naming conventions¶
Ensure the name is unique and does not exist as a reserved keyword. Keep the length to a maximum of 30 bytes; in practice this is 30 characters unless you are using a multi-byte character set. Names must begin with a letter and may not end with an underscore. Only use letters, numbers, and underscores in names. Avoid the use of multiple consecutive underscores, as they can be hard to read. Use underscores where you would naturally include a space in the name (first name becomes first_name). Avoid abbreviations; if you have to use them, make sure they are commonly understood.
SELECT first_name
FROM Customer360
Alias names¶
An alias assigns a temporary name to a column or a table that only exists for the duration of a query. An alias is defined after the AS keyword.
For columns:
SELECT column_name AS alias_name
FROM table_name;
For tables:
SELECT column_name(s)
FROM table_name AS alias_name;
When using an alias:
Assign names that clearly relate to the column or table. Do not use an alias when its name matches the original column name.
Always use the AS keyword when aliasing columns in a SELECT statement to ensure readability.
For computed data, such as SUM() or AVG(), use the name you would give it were it a column defined in the schema.
Always wrap an aliased field name in backticks:
`given_name`
The following example shows using aliases to add clarity to a SELECT statement that selects columns from more than one table. The “c360” alias refers to the Customer 360 table and the “el” alias refers to the Email_List table:
SELECT c360.email, c360.given_name, c360.surname, el.status
FROM Customer360 c360
LEFT JOIN Email_List el ON c360.email = el.email
WHERE el.status IS NOT NULL
Column names¶
Always use the singular name. Where possible avoid simply using id as the primary identifier for the table. Do not add a column with the same name as its table and vice versa. Always use lowercase except where it may make sense not to, such as with proper nouns.
Correlation names¶
A correlation name assigns a temporary name to a table. A correlation name is defined within the FROM clause after the table name. For example:
FROM table_name correlation_name;
When using correlations:
Assign names that clearly relate to the table. For example, a table named "Merged_Customers" should have a correlation name of "mc".
Always use the first letter of the table as the first letter of the correlation.
Append numbers as necessary to ensure unique correlation names.
Only use the correlation name within the rest of the same query.
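For example (the Merged_Transactions table is illustrative):

SELECT mc.email, mt.order_id
FROM Merged_Customers mc
LEFT JOIN Merged_Transactions mt ON (mc.amperity_id = mt.amperity_id)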
Suffix names¶
The following suffixes represent patterns that should be applied to column names in SQL tables. These patterns help ensure the purpose of columns can be easily understood by all users. Use the correct suffixes, use them consistently, and use them everywhere they are appropriate.
---------- -------------------------------------------------------------------
 Suffix     Description
---------- -------------------------------------------------------------------
 _id        A unique identifier, such as a column that is a primary key.
 _status    A flag value or some other status of any type, such as
            "publication_status".
 _total     The total or sum of a collection of values.
 _num       Denotes that the field contains any kind of number.
 _name      Signifies a name, such as "first_name".
 _seq       Contains a contiguous sequence of values.
 _date      Denotes a column that contains the date of something.
 _tally     A count.
 _size      The size of something, such as a file size or clothing.
 _addr      An address for the record, physical or intangible, such as
            "ip_addr".
---------- -------------------------------------------------------------------
Table names¶
Table names must be unique within the same database or database generation will fail.
NULL values¶
Some functions fail when they encounter a NULL value; others return NULL if any of their arguments are NULL.
Use the COALESCE() function to convert NULL values to a zero-length string when using the CONCAT() and SUM() functions.
Use the COALESCE() function to supply the identity value for math operations. For example, multiplication will return NULL if any field is NULL; because 1 is the identity for multiplication, use COALESCE(myColumn, 1).
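A brief sketch of both patterns (table and column names are illustrative):

-- Addition: 0 is the identity for addition, so NULL quantities do not
-- turn the entire sum into NULL.
SELECT SUM(COALESCE(order_quantity, 0)) AS total_quantity
FROM orders

-- Multiplication: 1 is the identity for multiplication, so a NULL
-- multiplier passes the other value through unchanged.
SELECT COALESCE(quantity, 1) * COALESCE(unit_price, 1) AS line_total
FROM orders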
One-way SHA-256 hashes¶
Apply one-way SHA-256 hashes to fields that contain PII data. A one-way hash ensures that data is no longer recognizable as valid PII, yet still allows that data to be identified by downstream processes.
A one-way SHA-256 hash has the following syntax:
SHA2(LOWER(TRIM(FIELD)),256) AS FIELD
and uses the following Spark SQL functions:
SHA2() hashes data with a one-way SHA-256 hash.
LOWER() sets all characters to lowercase.
TRIM() removes leading and trailing whitespace.
FIELD is the name of the field that contains PII data.
For example:
SHA2(LOWER(TRIM(email)),256) AS email
Reserved words¶
Always use uppercase for the reserved keywords like SELECT and WHERE. It is best to avoid the abbreviated keywords and use the full length ones where available (prefer ABSOLUTE to ABS). Do not use database server specific keywords where an ANSI SQL keyword already exists performing the same function. This helps to make code more portable.
Reserved keywords¶
The following keywords are reserved and must be double-quoted to be used as an identifier.
ALTER
AND
AS
BETWEEN
BY
CASE
CAST
CONSTRAINT
CREATE
CROSS
CUBE
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
CURRENT_USER
DEALLOCATE
DELETE
DESCRIBE
DISTINCT
DROP
ELSE
END
ESCAPE
EXCEPT
EXECUTE
EXISTS
EXTRACT
FALSE
FOR
FROM
FULL
GROUP
GROUPING
HAVING
IN
INNER
INSERT
INTERSECT
INTO
IS
JOIN
LEFT
LIKE
LOCALTIME
LOCALTIMESTAMP
NATURAL
NORMALIZE
NOT
NULL
ON
OR
ORDER
OUTER
PREPARE
RECURSIVE
RIGHT
ROLLUP
SELECT
TABLE
THEN
TRUE
UESCAPE
UNION
UNNEST
USING
VALUES
WHEN
WHERE
WITH
Skew¶
When Spark SQL runs domain tables and database tables it partitions the work and distributes each partition to an individual executor. When each partition is roughly equal in size each executor is assigned a roughly equal amount of work.
Skew occurs when one partition is assigned a greater amount of work than other partitions. All of the executors assigned lesser amounts of work will finish first, and then wait for the executor that was assigned the greater amount of work to finish.
Skew increases the amount of time it takes to finish processing. Small amounts of skew are common and are sometimes unavoidable. Large amounts of skew can lead to increased costs and increased runtimes.
Common causes of skew¶
Common causes of skew include:
Unintended duplication of values in upstream tables. Removing this duplication helps prevent skew.
Values that are not needed in the final output, including NULL values or placeholders for NULL values, such as “1900-01-01”. Filter these values out to prevent them from causing skew.
If these values are required, one approach is to replace them with a primary key. For example:
COALESCE(NULLIF({field}, '{bad_value}'), _uuid_pk)
Note
Replacing a value that causes skew with a primary key works because:
A primary key is an extremely well-distributed field. All records have a unique primary key, which ensures that distribution of work is even across executors.
The value of a primary key will never match a value on the other side of a JOIN operation. The non-match is the same result as would have occurred with the NULL or placeholder value that the primary key replaced.
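A sketch of that replacement inside a JOIN operation (table and field names are illustrative; _uuid_pk is the primary key described above):

SELECT a.amperity_id, b.loyalty_status
FROM tableA a
LEFT JOIN tableB b
ON COALESCE(NULLIF(a.email, 'no@email.com'), a._uuid_pk) = b.email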
Avoiding skew¶
To avoid large amounts of skew:
Ensure window functions are partitioned by fields in a PARTITION BY() clause.
Caution
A window function without a PARTITION BY() clause is calculated by a single executor.
Ensure that fields that are used to join tables within JOIN operations are well-distributed.
JOIN operations¶
One-to-many and many-to-many JOIN operations result in tables with duplicated results. This is sometimes the intended outcome, but more often it is unintended.
One-to-many and many-to-many JOIN operations often cause incorrect aggregate calculations. High duplication rates lead to skew, increased costs, and increased runtimes. A small amount of duplication repeated across many JOIN operations can cause exponential growth in the number of records for a given join key.
Avoid duplication in JOIN operations by:
Setting good values for primary keys, especially for domain tables.
Avoiding using all fields in the table as primary key values. This may introduce bad data or lead to higher rates of duplication.
Confirming the uniqueness of fields before using them in a JOIN operation.
Avoiding the use of DISTINCT statements. Use the GROUP BY clause instead.
Fields in a GROUP BY clause are always unique by grouping and can be used in a JOIN operation without risk of duplication.
Using the db/unique semantic tag on fields that are supposed to be unique.
This semantic tag adds an extra check to ensure that table results are unique by that field. If unexpected duplication occurs it will show as a warning in database notifications.
The following SQL query can help find JOIN operations with high duplication rates.
SELECT
   {field1}
   ,COUNT(*)
FROM {table}
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
Note
JOIN operations are processed in the defined order. Even when join keys are unique for each table and are evenly distributed, they can cause skew when more than two tables are joined. For example:
SELECT
   {field1}
   ,{field2}
   ,{field3}
FROM {tableA} a
LEFT JOIN {tableB} b
ON a.{joinkey1} = b.{joinkey1}
LEFT JOIN {tableC} c
ON b.{joinkey2} = c.{joinkey2}
If “Table A” has many keys that are not present in “Table B” then the LEFT JOIN operation between those tables will result in an intermediate table with many records where the value is NULL. This will create skew during the second JOIN operation between tables B and C.
PARTITION BY clauses¶
Use a query to help identify which field is causing skew in a PARTITION BY clause within a window function. Run the following query for each {field}
in the PARTITION BY clause:
SELECT
   {field1}
   ,{field2}
   ,{field3}
   ,{field4}
   ,COUNT(*)
FROM {table}
GROUP BY 1,2,3,4
ORDER BY 5 DESC
LIMIT 100
Values that appear more often are sources of skew. A value that appears significantly more often than others may be causing a high amount of skew and should be filtered out or converted to a primary key.
High frequency, low cardinality¶
In some situations a JOIN operation must be done using a high frequency or low cardinality field. For example the “action type” for email engagement data may have a small number of types: “send”, “open”, “click”. A JOIN operation on the “action type” will cause a single executor to process all of the “send” action types.
Salting and hashing the joining keys can turn high frequency or low cardinality fields into values that can be spread across a greater number of executors. This will reduce skew, but will also increase the total amount of data that needs processing.
Ask your Amperity representative about using salting and hashing to mitigate skew with high frequency or low cardinality fields.
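For illustration only, a rough sketch of the salting approach (the table names, the hot action_type key, and the bucket count of 8 are all assumptions; confirm the approach with your Amperity representative before using it):

SELECT e.*, al.campaign_name
FROM (
   -- Large side: scatter each hot key across 8 random buckets.
   SELECT *, CONCAT(action_type, '_', CAST(FLOOR(RAND() * 8) AS STRING)) AS salted_key
   FROM engagement
) e
JOIN (
   -- Small side: replicate each row once per bucket so every salt can match.
   SELECT a.*, CONCAT(a.action_type, '_', CAST(sv.salt AS STRING)) AS salted_key
   FROM action_lookup a
   LATERAL VIEW EXPLODE(SEQUENCE(0, 7)) sv AS salt
) al
ON e.salted_key = al.salted_key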
Subqueries¶
A subquery can be useful for shaping data prior to running a query. A subquery is a task that is required to be completed before additional processing can be performed. A subquery often runs quickly and can be used with little risk. That said, a poorly formed subquery can have adverse performance effects. Keep the following tips in mind when using a subquery:
Don’t join to an outer query from a subquery, as it may cause the subquery to run recursively for every value of the join key in the outer query. This may cause the subquery to run many times (or even millions of times). This type of subquery is sometimes referred to as a correlated subquery. The best way to avoid them is to never join between the inner and outer queries.
Each level of a subquery adds overhead. The fewer levels within a subquery, the easier it is to optimize the query. Try to flatten queries using joins instead of using additional levels. If you need to pull data into a subquery for multiple fields, try to use a single subquery instead of many.
Remove columns and/or use as few columns in a subquery as possible. For example, use:
SELECT column_1 ,column_2 FROM
instead of
SELECT * FROM
This will reduce the amount of data that must be processed.
Temporary tables¶
A temporary table is a table that is built from another table in the database and can be a useful way to process a repeatable task once, as opposed to using a subquery that runs many times. A temporary table is built into the database, which means it will be visible to users who have access to authoring segments. This is not always a desired outcome.
Consider using a temporary table in place of a subquery, but only when that subquery is being run multiple times.
Caution
If you need to use a temporary table, be sure to use a naming convention for the table that makes it obvious to segment authors that it is not a table that should be used for segmentation. For example, prefix a temporary table with ZZ_TEMP_
so that it appears at the end of the list of tables. Column names within the temporary table should be easy to comprehend.
Whitespace¶
To make the code easier to read it is important that the correct complement of spacing is used. Do not crowd code or remove natural language spaces.
Line spacing¶
Always include newlines/vertical space:
before AND or OR
after semicolons to separate queries for easier reading
after each keyword definition
before a comma when separating multiple columns into logical groups
to separate code into related sections, which helps to ease the readability of large chunks of code.
Putting commas and conjunctions at the start of the line makes it easier to comment out a single line without disturbing the rest of the query:
SELECT a.title
   ,a.release_date
   --,a.recording_date
   ,a.production_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
   OR a.title = 'The New Danger'
Spaces¶
Spaces should be used to line up the code so that the root keywords all start on the same character boundary, and also so that this makes it easy to keep track of where you are in a query that may be multiple layers deep.
(SELECT f.species_name
   ,AVG(f.height) AS `average_height`
   ,AVG(f.diameter) AS `average_diameter`
FROM flora AS f
WHERE f.species_name = 'Banksia'
   OR f.species_name = 'Sheoak'
   OR f.species_name = 'Wattle'
GROUP BY f.species_name, f.observation_date)

UNION ALL

(SELECT b.species_name
   ,AVG(b.height) AS `average_height`
   ,AVG(b.diameter) AS `average_diameter`
FROM botanic_garden_flora AS b
WHERE b.species_name = 'Banksia'
   OR b.species_name = 'Sheoak'
   OR b.species_name = 'Wattle'
GROUP BY b.species_name, b.observation_date)
Although not exhaustive, always include spaces:
before and after equals ( = )
after commas ( , )
surrounding apostrophes ( ' ) where not within parentheses or with a trailing comma or semicolon.
SELECT a.title, a.release_date, a.recording_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
   OR a.title = 'The New Danger'
WITH clause¶
The WITH clause defines a common table expression (CTE).
A common table expression (CTE) is a named subquery defined by the WITH clause. A CTE acts as a temporary view, with an optional list of column names and a query expression. The results of the query expression behave like a table.
A CTE is distinct from a window function, which is often used to group subsets of data prior to running a query. A window function should always include the OVER and PARTITION BY clauses to optimize the performance of your query.
Caution
A window function with an OVER statement that does not include a PARTITION BY clause often leads to performance issues when the OVER statement is asked to run across a large number of rows.
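A minimal CTE sketch (table and column names are illustrative):

WITH recent_orders AS (
   SELECT amperity_id, MAX(order_datetime) AS latest_order_datetime
   FROM Unified_Transactions
   GROUP BY amperity_id
)
SELECT c.amperity_id, c.email, r.latest_order_datetime
FROM Customer360 c
LEFT JOIN recent_orders r ON c.amperity_id = r.amperity_id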
LATERAL VIEW clause¶
The LATERAL VIEW clause generates a virtual table containing one or more rows, where each row is applied to the original output row.
Use LATERAL VIEW OUTER to return NULL if the input array or map is empty or NULL.
Note
The LATERAL VIEW clause is often used along with the EXPLODE generator function.
For example, the following common table expression generates virtual tables for email and physical addresses:
WITH
   emails AS (
     SELECT
        unique_id
        ,UPPER(em.email_address) AS email_address
     FROM customer_table
     LATERAL VIEW OUTER EXPLODE(email) AS em
   ),
   addresses AS (
     SELECT
        unique_id
        ,UPPER(ad.address) AS `address`
        ,UPPER(ad.address2) AS `address2`
        ,UPPER(ad.city) AS `city`
        ,UPPER(ad.state) AS `state`
        ,UPPER(ad.country) AS `country`
        ,ad.zip AS `postal`
        ,ad.zip4 AS `postal4`
     FROM customer_table
     LATERAL VIEW OUTER EXPLODE(address) AS ad
   ),
after which you can use the SELECT statement to select individual columns from the virtual tables, and then join them:
SELECT
   c.unique_id AS `uuid`
   ,a.address
   ,a.address2
   ,a.city
   ,a.state
   ,a.country
   ,a.postal
   ,a.postal4
   ,e.email_address AS `email`
FROM customer_table c
LEFT JOIN emails e ON e.unique_id = c.unique_id
LEFT JOIN addresses a ON a.unique_id = c.unique_id
SELECT statement¶
The SELECT statement is used to define a set of data to be returned from a data table. The set of returned data is often referred to as the result-set. Use the SELECT statement to retrieve rows from any table.
SELECT DISTINCT statement¶
The SELECT DISTINCT statement is used instead of SELECT to return only distinct rows of data.
FROM clause¶
The FROM clause specifies the name of the data table against which the SQL query will be run and is part of every SELECT statement.
JOIN operations¶
Spark SQL supports all JOIN operations available in standard SQL.
Warning
Joining on fields with many matching values can be slow and can introduce skew. Spark SQL finds the match between each possible pair of matching values and the number of pairs goes up by the product of the matching duplicates in each table.
Joining two tables with 1 million records in each table, where each record in one table matches only 1 record in the other will result in 1 million pairs.
Joining two tables with 1 million records in each table, where all the records match will result in 1 trillion records.
One-to-many and many-to-many JOIN operations often cause incorrect aggregate calculations. High duplication rates lead to skew, increased costs, and increased runtimes.
ON() clause¶
Use the ON() clause to specify the join criteria. Contents of the ON() clause must be wrapped in parentheses.
For example:
1FULL JOIN table_name AS tn
2ON (Members.`amperity_id` = tn.`Members_PK`)
USING() clause¶
The USING() clause can be applied on column names to specify the join criteria. Contents of the USING() clause must be wrapped in parentheses.
For example:
1FULL JOIN table_name AS tn
2USING (column_name)
CROSS JOIN¶
A CROSS JOIN returns a Cartesian product. A Cartesian product combines every item in the first table with every item in the second. For example, if table A has three items and table B has three items, the Cartesian product is 9 pairs.
A CROSS JOIN should generally be avoided due to typical size of tables in Amperity databases, which can often have millions of rows.
A Cartesian product in Amperity between any two tables is often a very, very large number of pairs and, as a result, is an expensive operation. It is recommended to optimize your Spark SQL queries to avoid implicit cross joins.
Tip
Take steps to avoid implicit cross-joins:
Use the ON() or USING() clauses with any JOIN statement.
For situations where you need to run a SELECT statement like
SELECT * FROM Table_A, Table_B
first filter out NULL values before running the SELECT statement.
FULL OUTER JOIN¶
A FULL OUTER JOIN returns all of the rows in both tables A and B, whether or not there is a row in the other table with the same key value. Where there is a matching key value, columns from the other table are brought in; where there is no matching value, key columns from the other table are set to NULL. While there is no filtering applied in a FULL OUTER JOIN, rows that match on the key are only included once, so the row count of the results will always be at most the count of table A plus the count of table B, and will usually be considerably smaller.
INNER JOIN¶
An INNER JOIN returns only those rows where there are values for the field key in both tables. Any rows in either table A or table B that have a value for key that doesn’t appear in the other table are not returned.
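Following the pattern of the LEFT JOIN example below, an INNER JOIN on the same inline tables drops the unmatched row:

SELECT *
FROM (VALUES 1, 2) t(a)
INNER JOIN (VALUES 1, 1) u(b)
ON t.a = u.b
-- returns (1, 1) twice; the row with 2 has no match and is dropped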
LEFT JOIN¶
The LEFT JOIN clause joins rows from two tables. For a LEFT JOIN, each row in the left table is joined with all matching rows from the right table. For rows with no match in the right table, the join is completed with NULL to represent column values.
For example:
SELECT *
FROM (VALUES 1, 2) t("left")
LEFT JOIN (VALUES 1, 1) u("right")
ON t."left" = u."right";
will return a table similar to:
-------- --------
left right
-------- --------
1 1
1 1
2 NULL
-------- --------
Note
In some situations you may want to use a LEFT ANTI JOIN. A LEFT ANTI JOIN returns values from the left-side table when they do not have matches on the right-side table. Use a LEFT ANTI JOIN to avoid performance issues that may be caused by a broadcast join.
OUTER JOIN¶
An outer join brings in all of the rows from the named table (left for LEFT JOIN, right for RIGHT JOIN), but the overlapping rows can be removed by adding the condition WHERE <key value in the other table> IS NULL. You can do the same to select the non-intersecting portion of a FULL OUTER JOIN by using OR with both key values as IS NULL. The same technique applies to both LEFT JOIN and RIGHT JOIN.
RIGHT JOIN¶
The RIGHT JOIN clause joins rows from two tables. For a RIGHT JOIN, each row in the right table is joined with all matching rows from the left table. For rows with no match in the left table, the join is completed with NULL to represent column values.
WHERE clause¶
The WHERE clause is used to filter records, and then return only those records that match the specified condition and value.
The WHERE clause can be combined with the AND, OR, and NOT operators and the IN, NOT IN, EXISTS, and NOT EXISTS expressions.
AND operator¶
The AND operator is used to filter records based on more than one condition and displays a record if all the conditions separated by AND are TRUE.
OR operator¶
The OR operator is used to filter records based on more than one condition and displays a record if any of the conditions separated by OR is TRUE.
NOT operator¶
The NOT operator displays a record if the condition is NOT TRUE.
IN expression¶
The IN expression returns a TRUE or FALSE value or UNKNOWN when the returned list contains NULL.
Tip
The IN expression is equivalent to a chain of OR comparisons. For example: x IN (a, b) is equivalent to x = a OR x = b.
NOT IN expression¶
The NOT IN expression returns a TRUE or FALSE value or UNKNOWN when the returned list contains NULL. Use WHERE … NOT IN … when the argument within the NOT IN expression is:
A small table with no more than a few thousand rows and is not expected to grow significantly over time.
A hard-coded list of values. For example:
WHERE color NOT IN ('red', 'blue', 'green')
Warning
The use of WHERE … NOT IN … will cause performance issues when the argument within NOT IN is a large table or large query. This is because Spark SQL will implement the table or query as a broadcast join, and then attempt to broadcast that table or query to every worker node in the Spark cluster. This may cause Spark SQL to run slowly or result in out-of-memory errors.
To prevent slow queries and/or out-of-memory errors use a LEFT ANTI JOIN operation instead of a WHERE … NOT IN <table or query>. For example:
SELECT *
FROM table_a
LEFT ANTI JOIN (SELECT id FROM table_b) AS remove
ON table_a.id = remove.id
instead of:
SELECT *
FROM table_a
WHERE table_a.id NOT IN (SELECT id FROM table_b)
EXISTS expression¶
The EXISTS expression returns TRUE when the subquery to which it refers returns one or more rows; otherwise it returns FALSE.
NOT EXISTS expression¶
The NOT EXISTS expression returns TRUE when the subquery to which it refers returns zero rows; otherwise it returns FALSE.
Tip
The NOT EXISTS expression runs the same way in Spark SQL and Presto SQL. This is useful for running the same query in the Customer 360 and the Queries pages. For example:
SELECT *
FROM table_a
WHERE NOT EXISTS (SELECT id FROM table_b WHERE table_b.id = table_a.id)
is similar to:
SELECT *
FROM table_a
LEFT ANTI JOIN (SELECT id FROM table_b) AS remove
ON table_a.id = remove.id
The important difference is that WHERE … NOT EXISTS … runs the same way in both Spark SQL and Presto SQL, whereas LEFT ANTI JOIN may only be used in Spark SQL.
CASE expression¶
The standard SQL CASE expression has two forms: simple and searched.
Note
If a CASE expression is too long, add a new line between WHEN and THEN clauses to make the line more readable.
Simple form¶
The simple form searches each value expression from left to right until it finds one that equals expression:
CASE expression
WHEN value THEN result
[ WHEN ... ]
[ ELSE result ]
END
The result for the matching value is returned.
If no match is found, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:
SELECT a,
CASE a
WHEN 1 THEN 'one'
WHEN 2 THEN 'two'
ELSE 'many'
END
Searched form¶
The searched form evaluates each boolean condition from left to right until one is true and returns the matching result:
CASE
WHEN condition THEN result
[ WHEN ... ]
[ ELSE result ]
END
If no conditions are true, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:
SELECT a, b,
CASE
WHEN a = 1 THEN 'aaa'
WHEN b = 2 THEN 'bbb'
ELSE 'ccc'
END
GROUP BY clause¶
The GROUP BY clause divides the output of a SELECT statement into groups of rows containing matching values. A simple GROUP BY clause may contain any expression composed of input columns or it may be an ordinal number selecting an output column by position (starting at one).
The following queries are equivalent. They both group the output by the nationkey input column with the first query using the ordinal position of the output column and the second query using the input column name:
SELECT
COUNT(*)
,nationkey
FROM customer
GROUP BY 2
is equivalent to:
SELECT
COUNT(*)
,nationkey
FROM customer
GROUP BY nationkey
GROUP BY clauses can group output by input column names not appearing in the output of a select statement. For example, the following query generates row counts for the customer table using the input column “mktsegment”:
SELECT
COUNT(*)
FROM customer
GROUP BY mktsegment
returns a table similar to:
_col0
-------
29968
30142
30189
29949
29752
(5 rows)
When a GROUP BY clause is used in a SELECT statement all output expressions must be either aggregate functions or columns present in the GROUP BY clause.
HAVING clause¶
The HAVING clause is used in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. A HAVING clause eliminates groups that do not satisfy the given conditions; it filters groups after grouping and aggregation are computed.
For example:
SELECT
   COUNT(*)
   ,segment
   ,key
   ,CAST(SUM(actual_loyalty) AS bigint) AS loyalty
FROM Customer360
GROUP BY segment, key
HAVING SUM(actual_loyalty) > 5700000
ORDER BY loyalty DESC;
returns a table similar to:
------- ------------ ------ ----------
id segment key loyalty
------- ------------ ------ ----------
1272 CARS 19 5856939
1253 PLANES 14 5794887
1248 HOTELS 9 5784628
1243 HOTELS 12 5757371
1231 CARS 3 5753216
1251 CARS 2 5719140
1247 PLANES 8 5701952
------- ------------ ------ ----------
Window functions¶
Window functions are a way to evaluate rows around each row as it is being evaluated. There is great flexibility in controlling how the windows are made (i.e. which other rows to consider), but most uses within Amperity databases rely on a relatively small subset: group the rows in the data set by the unique values of some field (similar to a GROUP BY), and then select a row from that group. There is also a powerful set of functions that can be run across each group, of which Amperity generally uses only a small portion. Use the guidelines below for what to write, and explore the broader expressiveness of window functions as needed.
Caution
From a performance point of view, window functions tend to be relatively performant, with performance similar to a JOIN operation, as long as the OVER() clause contains both PARTITION BY and ORDER BY.
Tip
When a window function is too long, add a new code line to improve readability.
An OVER() clause that does not contain a PARTITION BY clause may result in Spark SQL running 50 million rows through a single partition, which can cause a significant reduction in performance.
A common use in Amperity might look like this:
FIRST_VALUE(LOWER(EMAIL_ADDRESS_)) OVER (
PARTITION BY amperity_id
ORDER BY merged_date DESC
) AS email_address,
It says “Group the records by amperity_id, and for each group return the one with the latest merged_date”. Or, more briefly, “Give me the most recent email address for each customer.” The window function bits do the following:
FIRST_VALUE(<field name>) - the FIRST_VALUE() window function picks the first record out of some set of records. In this case, it selects from the window / partition, which is defined next.
OVER() - the OVER() clause sets up the window within which records are found. It should include both PARTITION BY and ORDER BY.
PARTITION BY amperity_id - the PARTITION BY <field name> clause behaves in a way analogous to a GROUP BY in that it groups together all records with unique values for the specified field. Here it creates a subset of rows for each Amperity ID.
ORDER BY merged_date DESC - the ORDER BY clause works just as it does in a SELECT statement: it sorts the rows being operated on. The only difference is that it sorts only the rows within the partition, so in this example it sorts the rows for each Amperity ID. DESC sorts in descending order, so the most recent date is first.
AS <field name> - the final clause of the statement isn't part of the window function at all; it is the AS statement that sets the name of the projected column. In this example the results will be put in a column named email_address.
ORDER BY clause¶
The ORDER BY clause is used to sort a result set by one or more output expressions.
ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
Each expression may be composed of output columns or it may be an ordinal number selecting an output column by position (starting at one).
The default sort order is ascending (ASC). NULL values will sort first when the sort order is ascending and will sort last when the sort order is descending (DESC). Use NULLS FIRST to sort NULL values first, regardless of sort order. Use NULLS LAST to sort NULL values last, regardless of sort order.
The ORDER BY clause is evaluated as the last step of a query after any GROUP BY or HAVING clause.
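For example (the column names are illustrative), to sort the most recent dates first while pushing NULL dates to the end:

SELECT email, last_order_date
FROM Customer360
ORDER BY last_order_date DESC NULLS LAST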
LIMIT clause¶
The LIMIT clause restricts the number of rows in the result set. The following example queries a large table, but the limit clause restricts the output to only have five rows (because the query lacks an ORDER BY, exactly which rows are returned is arbitrary):
SELECT orderdate
FROM orders
LIMIT 5
returns a table similar to:
orderdate
-------------
1996-04-14
1992-01-15
1995-02-01
1995-11-12
1992-04-26
(5 rows)
Operators¶
An operator compares two data items, and then returns a result.
BETWEEN¶
Use the BETWEEN operator to test if a value falls within the specified range using the syntax value BETWEEN min AND max:
SELECT 3 BETWEEN 2 AND 6
The statement shown above is equivalent to the following statement:
SELECT 3 >= 2 AND 3 <= 6
Presence of NULL evaluates to NULL
The presence of NULL will result in the statement evaluating to NULL:
SELECT NULL BETWEEN 2 AND 4
and:
SELECT 2 BETWEEN NULL AND 6
String arguments must be of the same type
Use the BETWEEN operator to evaluate string arguments as long as the value, min, and max parameters are of the same type:
SELECT 'Paul' BETWEEN 'John' AND 'Ringo'
whereas this query will produce an error because the values are not all of the same type:
SELECT 2.3 BETWEEN 'John' AND 35.2
Comparison operators¶
The following comparison operators are available:
----------- ---------------------------------------------
 Operator    Description
----------- ---------------------------------------------
 <           Less than
 >           Greater than
 <=          Less than or equal to
 >=          Greater than or equal to
 =           Equal
 <>          Not equal
 !=          Not equal (non-standard but popular syntax)
----------- ---------------------------------------------
ALL¶
Use ALL together with comparison operators in the following way:
expression operator ALL ( subquery )
For example:
SELECT 21 < ALL (VALUES 19, 20, 21); -- false
Examples of ALL comparison operator combinations:
---------------- --------------------------------------------------------------
 Expression       Meaning
---------------- --------------------------------------------------------------
 A = ALL (...)    Evaluates to TRUE when A is equal to all values.
 A <> ALL (...)   Evaluates to TRUE when A doesn't match any value.
 A < ALL (...)    Evaluates to TRUE when A is smaller than the smallest value.
---------------- --------------------------------------------------------------
ANY¶
Use ANY together with comparison operators in the following way:
expression operator ANY ( subquery )
For example:
SELECT 'hello' = ANY (VALUES 'hello', 'world'); -- true
Examples of ANY comparison operator combinations:
---------------- --------------------------------------------------------------
 Expression       Meaning
---------------- --------------------------------------------------------------
 A = ANY (...)    Evaluates to TRUE when A is equal to any of the values.
                  This form is equivalent to A IN (...).
 A <> ANY (...)   Evaluates to TRUE when A doesn't match one or more values.
 A < ANY (...)    Evaluates to TRUE when A is smaller than the biggest value.
---------------- --------------------------------------------------------------
SOME¶
Use SOME together with comparison operators in the following way:
expression operator SOME ( subquery )
For example:
SELECT 42 >= SOME (SELECT 41 UNION ALL SELECT 42 UNION ALL SELECT 43); -- true
Examples of SOME comparison operator combinations:
---------------- --------------------------------------------------------------
 Expression       Meaning
---------------- --------------------------------------------------------------
 A = SOME (...)   Evaluates to TRUE when A is equal to any of the values.
                  This form is equivalent to A IN (...).
 A <> SOME (...)  Evaluates to TRUE when A doesn't match one or more values.
 A < SOME (...)   Evaluates to TRUE when A is smaller than the biggest value.
---------------- --------------------------------------------------------------
IS NULL¶
The IS NULL operator tests whether a value is NULL. This operator works for all data types.
Using NULL with IS NULL evaluates to TRUE:
SELECT NULL IS NULL
Any other constant evaluates to FALSE:
SELECT 3.0 IS NULL
IS NOT NULL¶
The IS NOT NULL operator tests whether a value is not NULL. This operator works for all data types.
Using NULL with IS NOT NULL evaluates to FALSE:
SELECT NULL IS NOT NULL
Any other constant evaluates to TRUE:
SELECT 3.0 IS NOT NULL
PIVOT operator¶
The PIVOT operator allows data to be reshaped into a new table. Use values from within a field to define column names, and then group rows by those values and by other fields, such as the Amperity ID. A pivot must declare how data is to be reshaped (SUM(), MAX(), or some other grouping or aggregation), and then must declare from which field in the existing table the new column names are derived.
For example, a transactions table has a field with three values: PUR (purchases), RET (returns), and OTH (other). Use PIVOT to return a table that contains columns named Purchases, Returns, and Other, each containing the sum for that value in the transactions table, and then a row for each Amperity ID and brand:
SELECT
   amperity_id
   ,brand
   ,PUR AS Purchases
   ,RET AS Returns
   ,OTH AS Other
FROM transactions
PIVOT
   (
     SUM(net_amount)
     FOR category IN ('PUR', 'RET', 'OTH')
   )
Functions¶
A function is a SQL statement that accepts input parameters, performs actions, and then returns results. This section highlights some useful functions for use when using Spark SQL in Amperity.
Note
This section highlights a very small subset of the complete list of functions available in Spark SQL, version 3.5.1, many of which can be useful depending on the type of query.
The following list contains some of the most frequently used functions for building segments via the SQL Segment Editor (alphabetized):
ABS()¶
Use the ABS(expression) function to return the absolute value of “expression”.
Determine lifecycle status¶
The following example is a common table expression that uses the ABS() function within a CASE statement to determine how many days have passed since a customer's most recent purchase, and then applies a lifecycle status to each set of customers who fall within each range: active, lapsed, dormant, prospect, or no status.
customer_lifecycle_status AS (
   SELECT
      c.amperity_id
      ,CASE
         WHEN ABS(DATEDIFF(tae.latest_order_datetime, current_date)) <= 365 THEN 'active'
         WHEN ABS(DATEDIFF(tae.latest_order_datetime, current_date)) > 365
           AND ABS(DATEDIFF(tae.latest_order_datetime, current_date)) <= 730 THEN 'lapsed'
         WHEN ABS(DATEDIFF(tae.latest_order_datetime, current_date)) > 730 THEN 'dormant'
         ELSE 'prospect'
      END AS lifecycle_status
   FROM Customer_360 c
   LEFT JOIN Transaction_Attributes_Extended tae
   ON c.amperity_id = tae.amperity_id
),
AVG()¶
Use the AVG(x) function to return the average of all input values.
CAST()¶
Use the CAST(field-or-expression AS type) function to cast a “field” or an “expression” as “type”. For example, to cast a varchar to a numeric or cast a date or datetime to a timestamp.
For example:
CAST(order_date AS timestamp)
COALESCE()¶
Use the COALESCE(column_name, value) function to return the first non-NULL argument. Use a value of "", 0, or 1.
Tip
Some functions will fail if they encounter NULL values. Use this function prior to the CONCAT() and SUM() functions to ensure they do not encounter NULL values.
Note
The CONCAT() function will return NULL if the value of any field is NULL. Use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example, use:
CONCAT(COALESCE(firstname, ""), " ", COALESCE(lastname, ""))
to concatenate a FullName field.
Coalesce to empty (non-NULL) values¶
,COALESCE(column_name, "")
Coalesce to 0 (non-NULL) values¶
,COALESCE(column_name, 0)
Coalesce to 1 (non-NULL) values¶
,COALESCE(column_name, 1)
Multiple delimiter types¶
The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores ( _ ) instead of a colon ( : ) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores (and any related custom domain tables), and may return NULL values.
If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:
bad_addresses
bad_emails
bad_phones
For each SELECT statement, change:
REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,
to:
COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,
This update will allow these SELECT statements to continue using a regular expression to find domain tables, use * to find custom domain tables, and prevent NULL values from being returned.
Get the first non-NULL value¶
,COALESCE(order_quantity, canceled_quantity) AS qty_coalesce
Add columns (accounting for NULL values)¶
COALESCE(order_quantity,0)
   + COALESCE(canceled_quantity,0)
AS qty_sum
Standardize values for USA states¶
The following example standardizes values for all fifty states in the United States to only a two-character value (AK, AL, AR, etc.). The CASE statement uses the following strings to determine:
The correct two-character value
The correct spelled out value
Other variations that appear in the data, which may be common (or known) abbreviations, misspellings, slang, or shortcuts
CASE
   WHEN UPPER(TRIM(COALESCE(state))) IN ('AL','ALABAMA', 'BAMA') THEN 'AL'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('AK','ALASKA') THEN 'AK'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('AZ','ARIZONA') THEN 'AZ'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('AR','ARKANSAS') THEN 'AR'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('CA','CALIF','CALIFORNIA','CALIFORNIZ','CALIFRONIA') THEN 'CA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('CO','COLORADO') THEN 'CO'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('CT','CONNECTICUT', 'CONNETICUT') THEN 'CT'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('DE','DELAWARE', 'DELWARE') THEN 'DE'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('FL','FLORIDA') THEN 'FL'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('GA','GEORGIA') THEN 'GA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('HI','HAWAII', 'HAWAI\'I') THEN 'HI'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('ID','IDAHO') THEN 'ID'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('IL','ILLINOIS') THEN 'IL'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('IN','INDIANA') THEN 'IN'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('IA','IOWA') THEN 'IA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('KS','KANSAS') THEN 'KS'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('KY','KENTUCKY') THEN 'KY'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('LA','LOUISIANA', 'LOUSIANA') THEN 'LA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('ME','MAINE') THEN 'ME'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MD','MARYLAND') THEN 'MD'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MA','MASS','MASSACHUSETES','MASSACHUSETTS','MASSACHUSETTES') THEN 'MA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MI','MICHIGAN') THEN 'MI'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MN','MINNESOTA') THEN 'MN'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MS','MISSISSIPPI') THEN 'MS'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MO','MISSOURI') THEN 'MO'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('MT','MONTANA') THEN 'MT'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NE','NEBRASKA') THEN 'NE'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NV','NEVADA') THEN 'NV'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NH','NEW HAMPSHIRE') THEN 'NH'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NJ','NEW JERSEY', 'JERSEY') THEN 'NJ'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NM','NEW MEXICO') THEN 'NM'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NY','NEW YORK') THEN 'NY'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('NC','NORTH CAROLINA') THEN 'NC'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('ND','NORTH DAKOTA') THEN 'ND'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('OH','OHIO') THEN 'OH'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('OK','OKLAHOMA') THEN 'OK'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('OR','ORE','OREGON','OREGONE') THEN 'OR'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('PA','PENNSYLVANIA') THEN 'PA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('RI','RHODE ISLAND') THEN 'RI'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('SC','SOUTH CAROLINA') THEN 'SC'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('SD','SOUTH DAKOTA') THEN 'SD'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('TN','TENNESSEE') THEN 'TN'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('TX','TEXAS') THEN 'TX'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('UT','UTAH') THEN 'UT'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('VT','VERMONT') THEN 'VT'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('VA','VIRGINIA') THEN 'VA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('WA','WASHINGTON') THEN 'WA'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('WV','WEST VIRGINIA') THEN 'WV'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('WI','WISCONSIN') THEN 'WI'
   WHEN UPPER(TRIM(COALESCE(state))) IN ('WY','WYOMING') THEN 'WY'
   ELSE NULL
END
COLLECT_LIST()¶
Use the COLLECT_LIST(expression) function to collect and return a list of non-unique elements, as defined by “expression”.
Note
This function is non-deterministic because it depends on the order of the list of non-unique elements that are returned, which may be non-deterministic. Wrap the COLLECT_LIST() function in a SORT_ARRAY() function to return a deterministic list of non-unique elements.
Return array of ordered IDs¶
Collect a list from the Unified Transactions table, and then return it as an array of order IDs:
SELECT
  COLLECT_LIST(ut.order_id) AS array_order_ids
FROM Unified_Transactions AS ut
COLLECT_SET()¶
Use the COLLECT_SET(expression) function to collect and return a set of unique elements, as defined by “expression”.
Note
This function is non-deterministic because it depends on the order of the list of unique elements that are returned, which may be non-deterministic. Wrap the COLLECT_SET() function in a SORT_ARRAY() function to return a deterministic list of unique elements.
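Return deterministic set of channels¶
The following is a minimal sketch (the order_channel column name is illustrative) that collects the unique order channels for each Amperity ID, and then sorts the array so the result is deterministic:
SELECT
  amperity_id
  -- SORT_ARRAY() makes the order of the collected set deterministic
  ,SORT_ARRAY(COLLECT_SET(order_channel)) AS unique_channels
FROM Unified_Transactions
GROUP BY amperity_id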
CONCAT()¶
Use the CONCAT(column1, column2, …, columnN) function to return a concatenation of all columns. This function returns NULL if any value of any field is NULL. Use the COALESCE() function to map fields to zero-length string values prior to using this function.
Concatenate to full name¶
,CONCAT(COALESCE(firstname, ""), " ", COALESCE(lastname, ""))
Concatenate order IDs and dates¶
,CONCAT(order_id, order_date) AS order_id
Add a +1 to phone numbers¶
The following CASE statement uses the TRIM() function to find empty or NULL phone numbers, sets them to NULL, and then standardizes all phone numbers to add a +1:
CASE
  WHEN TRIM(phone) LIKE ''
  OR TRIM(phone) IS NULL
  THEN NULL
  ELSE CONCAT('+1','',REGEXP_REPLACE(phone,'[^0123456789]',''))
END AS phone
CONCAT_WS()¶
Use the CONCAT_WS(separator, [string | array(string)]+) function to return a concatenation of “string” or an “array” of “string”, as separated by “separator”.
Combine five- and four-digit postal codes¶
Some data sources do not contain fields for complete postal codes and instead contain fields that separate the five- and four-digit codes. Some use cases require a single field for postal codes that includes both components, after which the postal semantic tag is applied.
The following example shows how to use a CASE statement to do the following:
Find situations where the five- and four-digit codes are both present, and then use the CONCAT_WS() function to combine them, returning “zip_code” and “zip_code_plus_four” separated by “-”.
Find situations where only the five-digit code is present, and then use only the five-digit code.
Use NULL for situations where the five-digit code is not present.
Return the result as the postal field, to which the postal semantic tag may be applied.
,CASE
  WHEN zip_code != '(NULL)'
  AND zip_code_plus_four != '(NULL)'
  THEN CONCAT_WS('-',zip_code, zip_code_plus_four)

  WHEN zip_code != '(NULL)'
  THEN zip_code

  ELSE NULL
END AS postal
Combine hardcoded values and columns¶
The following example concatenates columns into a single field, where “ACME” represents a hardcoded value present in the filename.
SELECT
  'ACME' AS Brand_Name
  ,CONCAT_WS('-'
    ,'ACME'
    ,card_number) AS Loyalty_ID
  ,CONCAT_WS(' '
    ,residential_unit_number
    ,residential_street_number
    ,residential_street) AS residential_address
  ,*
FROM Loyalty_ACME
CLUSTER BY card_number
COUNT()¶
Use the COUNT(*) function to return the total number of rows, including rows that contain duplicate and NULL values.
Use the COUNT(expression) function to return the number of non-NULL rows for “expression”.
COUNT(*) AS `L12M_order_channels`,
or:
-- Join for one-time customers
LEFT JOIN (
  SELECT
    amperity_id
    ,CASE
      WHEN COUNT(*) = 1 THEN '1X Customer'
      WHEN COUNT(*) > 1 THEN 'Repeat Customer'
    END AS repeat_purchaser
  FROM Unified_Transactions
  GROUP BY amperity_id
) AS onetime ON onetime.amperity_id = customers.amperity_id
or:
COUNT(channel) AS `channel_frequency`
COUNT(DISTINCT)¶
Use the COUNT(DISTINCT expression) function to return the number of unique and non-NULL rows for “expression”. For example:
,COUNT(DISTINCT tx.channel) AS `L12M_order_channels`
CURRENT_DATE()¶
Use the CURRENT_DATE() function to return the current date at the start of the query evaluation.
Note
You may use CURRENT_DATE without the parentheses to return the current date at the start of query evaluation.
Important
A table that uses the CURRENT_DATE() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.
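For example, a short sketch (the order_date column name is illustrative) that returns the number of days since a customer's most recent order:
,DATEDIFF(CURRENT_DATE(), MAX(order_date)) AS days_since_last_order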
CURRENT_TIMESTAMP()¶
Use the CURRENT_TIMESTAMP() function to return the current timestamp at the start of the query evaluation. All calls of CURRENT_TIMESTAMP() within the same query will return the same timestamp.
Note
You may use CURRENT_TIMESTAMP without the parentheses to return the current timestamp at the start of query evaluation.
Important
A table that uses the CURRENT_TIMESTAMP() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.
DATE()¶
Use the DATE(expression) function to cast the value of “expression” to a date data type.
Compare dates by date range¶
The following SELECT statement returns all orders between November 22, 2019 and November 21, 2020. The DATE() function compares the value of the order_datetime field in customer data to the date range.
SELECT
  amperity_id
  ,MIN(order_datetime) AS min_order_date
  ,COUNT(DISTINCT order_id) AS orders
  ,SUM(order_revenue) AS order_revenue
FROM Unified_Transactions
WHERE amperity_id IS NOT NULL
AND DATE(order_datetime) >= DATE('2019-11-22')
AND DATE(order_datetime) <= DATE('2020-11-21')
Convert DATETIME() to DATE()¶
The following examples convert values from a datetime data type to a date data type.
To do the type conversion once, use:
DATE(MAX(order_datetime))
To do the type conversion for every row, use:
MAX(DATE(order_datetime))
For example, to return order_datetime as a date (and not a datetime) for each Amperity ID:
SELECT
  ut.amperity_id
  ,DATE(MAX(ut.order_datetime)) AS latest_order_date
FROM Unified_Transactions AS ut
GROUP BY 1
DATE_ADD()¶
Use the DATE_ADD(startdate, number) function to return the “number” of days after “startdate”.
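For example:
DATE_ADD('2021-01-01', 30)
returns “2021-01-31”.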
DATE_TRUNC()¶
Use the DATE_TRUNC(format, timestamp) function to return a “timestamp” truncated to “format”. A timestamp is represented by a column in Amperity data that contains a valid timestamp.
Format may be any of the following:
YEAR
YYYY
YY
MON
MONTH
MM
DAY
DD
Note
HOUR, MINUTE, SECOND, WEEK, and QUARTER are also valid format values.
Return purchases by day¶
To return last purchase by day:
DATE_TRUNC('DAY',LastPurchase)
Return purchases by month¶
To return last purchase by month:
DATE_TRUNC('MONTH',LastPurchase)
Return purchases by year¶
To return last purchase by year:
DATE_TRUNC('YEAR',LastPurchase)
DATEDIFF()¶
Use the DATEDIFF(enddate, startdate) function to return the number of days between “startdate” and “enddate”.
Find early repeat purchasers¶
SELECT
  amperity_id
  ,CASE
    WHEN DATEDIFF(second_order_date, first_order_date) <= 30 THEN true
    WHEN DATEDIFF(second_order_date, first_order_date) > 30 THEN false
    WHEN DATEDIFF(CURRENT_DATE(), first_order_date) > 30 THEN false
  END AS early_repeat_purchaser
FROM Transaction_Attributes_Extended
Find transactions within 30 days¶
L30D_transactions AS (
  SELECT
    *
  FROM Unified_Transactions
  WHERE DATEDIFF(CURRENT_DATE(), order_date) <= 30
),
Categorize by named age group¶
CASE
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 7300 THEN 'Gen Z'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 14600 THEN 'Millennials'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 21900 THEN 'Gen X'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 29200 THEN 'Boomers'
  ELSE 'Older than Boomers'
END AS age_group
DECODE()¶
Use the DECODE(binary,charset) function to decode the value of “binary” using the character set specified by “charset”, such as US-ASCII, UTF-16, or UTF-8.
Tip
Use the UNBASE64() function to convert a string to binary by using it in place of “binary” within the DECODE() function:
DECODE(UNBASE64(column_name),charset)
Identify email not using US-ASCII¶
The following CASE statement decodes customer emails, identifies customer emails that are not encoded using the US-ASCII character set, and then sets them to NULL.
CASE
  WHEN UPPER(DECODE(UNBASE64(customer_email),'US-ASCII')) = 'UNDEFINED'
  THEN NULL
  ELSE UPPER(DECODE(UNBASE64(customer_email),'US-ASCII'))
END AS email,
DENSE_RANK()¶
Use the DENSE_RANK() function to compute the rank of a value in a group of values. DENSE_RANK will assign the same rank to rows with the same value and will not create gaps in the ranked sequence.
Note
The DENSE_RANK() and RANK() functions both assign a rank to rows with the same value. The difference is that RANK() will create gaps in the sequence. The following example shows rows that are ranked in ascending order by column B:
------- ------- ------------ ------
col_A   col_B   dense_rank   rank
------- ------- ------------ ------
A       30      3            5
A       20      2            3
A       20      2            3
A       10      1            1
A       10      1            1
------- ------- ------------ ------
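A minimal sketch of the window functions that could produce the dense_rank and rank columns shown above:
,DENSE_RANK() OVER (ORDER BY col_B) AS dense_rank
,RANK() OVER (ORDER BY col_B) AS rank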
EXPLODE()¶
Use the EXPLODE(expression) function to use “expression” to:
Separate the elements of an array into multiple rows
Separate the key-value pairs of a map into multiple rows, with one column for the keys and one column for the values
Load XML data as ingest query¶
Note
This example uses an example XML file as the data source for sales transactions.
Use the EXPLODE() function to process sales transaction data into a table using an ingest query similar to:
WITH explodedData AS (
  SELECT
    salesTransactionId
    ,EXPLODE(salesOrder.tenders.tender) AS tender
  FROM PosXml
)

SELECT
  salesTransactionId
  ,tender.type AS type
  ,tender.amount AS amount
FROM explodedData
FIRST_VALUE()¶
Use the FIRST_VALUE(expression) function to return the first value of “expression”.
Use the FIRST_VALUE(expression, isIgnoreNull) function to return only non-NULL values.
Prioritize Amperity ID by order ID¶
The following SQL prioritizes the Amperity ID by the most recent order ID:
WITH amp_priority AS (
  SELECT DISTINCT
    ut.order_id
    ,ut.datasource
    ,FIRST_VALUE(uc.amperity_id) OVER (
      PARTITION BY ut.order_id, ut.datasource
      ORDER BY uc.update_dt DESC
    ) AS amperity_id
  FROM (SELECT amperity_id, datasource, update_dt FROM Unified_Coalesced) uc
  JOIN (SELECT amperity_id, datasource, order_id FROM Unified_Transactions) ut
  ON uc.amperity_id = ut.amperity_id
)

SELECT t.* FROM table_name t
JOIN amp_priority ap ON t.order_id=ap.order_id
IF()¶
Use the IF(expression1, expression2, expression3) function to return “expression2” when “expression1” evaluates to “true”, otherwise return “expression3”.
Build birthdate¶
If incoming data contains birthdate data split by day, month, and year, you can build a complete birthdate using an ingest query. For example, incoming data has the following fields:
------- ----- ------
month   day   year
------- ----- ------
08      12    1969
11      25    1978
09      15    1981
------- ----- ------
The following example uses the IF() function to concatenate three fields together using a forward slash ( / ) as a separator:
SELECT
  *
  ,IF(
    birth_month != '0' AND birth_day != '0' AND birth_year != '0',
    birth_month||'/'||birth_day||'/'||birth_year, NULL
  ) AS birthdate
FROM Source_Table
ISNOTNULL()¶
Use the ISNOTNULL(expression) function to return true when “expression” is NOT NULL.
ISNULL()¶
Use the ISNULL(expression) function to return true when “expression” is NULL.
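For example, a short sketch (the email column name is illustrative) that flags profiles that are missing an email address:
,ISNULL(email) AS email_is_missing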
LAST()¶
Use the LAST(expression) function to return the last value of “expression”.
Use the LAST(expression, isIgnoreNull) function to return only non-NULL values.
Most recent store ID¶
LAST(store_id) AS last_store_id
LEAST()¶
Use the LEAST(column_name, column_name, …) function to return the smallest value from the list of named columns, skipping NULL values.
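For example:
LEAST(10, 4, 25)
returns “4”.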
LEFT()¶
Use the LEFT(string, length) function to return the leftmost “length” from “string”. An empty string is returned when “length” is less than or equal to 0.
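For example:
LEFT('Amperity', 3)
returns “Amp”.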
Set partitions by characters¶
Note
Changing LEFT(amperity_id, 1) to use more characters, such as LEFT(amperity_id, 2), will improve distribution and reduce the error rate.
LENGTH()¶
Use the LENGTH(expression) function to return one of the following:
The character length for string data, inclusive of trailing spaces.
The number of bytes for binary data, inclusive of binary zeroes.
Return states as 2 characters¶
The following example uses the LENGTH() function to identify state values that are exactly two characters (AK, AL, AR, and so on), and then sets all other values to NULL:
CASE
  WHEN LENGTH(state) = 2
  THEN state
  ELSE NULL
END AS state
LOCATE()¶
Use the LOCATE(substring, string[, position]) function to return the first position of the first occurrence of “substring” in “string” after “position”. The given position and return value are 1-based.
Note
The STRPOS() function is the same as the LOCATE() function.
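For example:
LOCATE('per', 'Amperity')
returns “3”.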
LOWER()¶
Use the LOWER(string) function to convert “string” to lowercase.
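For example, to standardize email addresses prior to comparison:
,LOWER(TRIM(email)) AS email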
LPAD()¶
Use the LPAD(string, length, padding) function to return “string” left-padded with “padding” to a length of “length”.
Note
If “string” is longer than “length”, the return value is shortened to “length”.
Clean postal codes prior to Stitch¶
The following example will ensure postal codes have consistent character counts prior to making them available to Stitch:
,LPAD(postal,5,'0') as postal
Note
This example is for United States-only postal codes, requires using a custom domain table, and is only useful for ad hoc cleanup of a small number of postal codes. This example is not a substitute for address standardization.
Return first five characters¶
The following example returns the first five characters from the stay_location field; values shorter than five characters are left-padded with “0”:
,LPAD(stay_location,5,'0') AS stay_location
MAX()¶
Use the MAX(expression) function to return the maximum value of “expression”.
MIN()¶
Use the MIN(expression) function to return the minimum value of “expression”.
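Return first and latest order dates¶
The following sketch returns the earliest and most recent order dates for each Amperity ID from the Unified Transactions table:
SELECT
  amperity_id
  ,MIN(order_datetime) AS first_order_date
  ,MAX(order_datetime) AS latest_order_date
FROM Unified_Transactions
GROUP BY amperity_id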
MONTHS_BETWEEN()¶
Use the MONTHS_BETWEEN(timestamp1, timestamp2) function to return the number of months in-between two timestamps. If “timestamp1” is later than “timestamp2”, the value returned is positive.
Months between order and today¶
L12M_transactions AS (
  SELECT
    *
  FROM Unified_Transactions
  WHERE MONTHS_BETWEEN(CURRENT_DATE(), order_date) <= 12
),
NOW()¶
Use the NOW() function to return the current timestamp as of the start of the query.
NTILE()¶
Use the NTILE(n) function to divide the rows for each window partition into “n” buckets, ranging from 1 to at most “n”.
Compute ntiles over large datasets¶
The following example shows a window function that divides rows into 10 buckets and uses a combination of PARTITION BY and the LEFT() function to define the buckets using the first characters of the Amperity ID:
Top_10 AS (
  SELECT DISTINCT
    amperity_id
    ,CLV_Spend
    ,NTILE(10) OVER (PARTITION BY LEFT(amperity_id, 1) ORDER BY CLV_Spend DESC) AS CLVPercentile

  FROM (
    SELECT
      amperity_id
      ,SUM(Customer_Lifetime_Spend) AS CLV_Spend
    FROM customer_360_v
    WHERE Customer_Lifetime_Spend IS NOT NULL
    GROUP BY amperity_id
  )
),
Partition predicted CLV by brand¶
The following example shows how to separate values into brand tiers using predicted customer lifetime value.
NTILE(100) OVER (PARTITION BY brand ORDER BY predicted_clv desc, _uuid_pk)
Bucket RFM scores¶
L12M_rfm AS (
  SELECT
    a.amperity_id
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY b.order_date ASC) AS `L12M_recency_score`
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY a.L12M_order_frequency ASC) AS `L12M_frequency_score`
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY a.L12M_order_total_amount ASC) AS `L12M_monetary_score`
  FROM L12M_attributes AS `a`
  JOIN (
    SELECT
      amperity_id
      ,order_date
    FROM latest_order
    WHERE MONTHS_BETWEEN(CURRENT_DATE(), order_date) <= 12
  ) AS `b` ON a.amperity_id = b.amperity_id
)
NULLIF()¶
Use the NULLIF(expression1, expression2) function to return NULL if “expression1” is equal to “expression2”.
Return NULL for empty string values¶
The following SELECT statement is an ingest query that returns NULL if the field is empty after trimming whitespace from before and after the initial value:
SELECT
  NULLIF(TRIM(BrandName),'') AS BrandName
  ,NULLIF(TRIM(AttributeName),'') AS AttributeName
  ,NULLIF(TRIM(Priority),'') AS Priority
FROM Customer_Table
RAND()¶
Use the RAND() function to return a random value, drawn independently and identically distributed (i.i.d.) from a uniform distribution in [0, 1).
Important
A table that uses the RAND() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.
RANK()¶
Use the RANK() function to return the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row.
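For example, a sketch (the order_revenue column name is illustrative) that ranks each customer's orders from highest to lowest revenue:
RANK() OVER (PARTITION BY amperity_id ORDER BY order_revenue DESC) AS revenue_rank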
REGEXP_EXTRACT()¶
Use the REGEXP_EXTRACT(string, pattern, id) function to extract the first substring of “string” that matches the “pattern” regular expression, returning the group that corresponds to “id”.
Extract first and last names¶
Some data sources do not contain fields that can be directly assigned the given-name and surname semantic tags. These tags are important to downstream Stitch processes. When a field is present in the data source that contains data that can be tagged with the full-name semantic tag, you can use domain SQL to extract the first and last name details from that field, add them as new columns, and then apply the correct semantic tags.
Use the REGEXP_EXTRACT() function to:
Trim whitespace from before (or after) the first and last names.
Individually extract the first and last names from the field that contains the full name.
Add columns for the first and last names.
The following example shows part of a SELECT statement that extracts first and last names from the BILLING_NAME field, and then adds columns for first and last names:
,REGEXP_EXTRACT(TRIM(BILLING_NAME),'(^\\S*)',1) AS GIVEN_NAME
,REGEXP_EXTRACT(TRIM(BILLING_NAME),'((?<=\\s).*)',1) AS SURNAME
,TRIM(BILLING_NAME) AS `BILLING_NAME`
REGEXP_REPLACE()¶
Use the REGEXP_REPLACE(string, regex) function to remove every instance of the substring matched by the “regex” pattern from “string”.
Use the REGEXP_REPLACE(string, regex, replace) function to replace every instance of the substring.
Remove whitespace¶
NULLIF(REGEXP_REPLACE(field, '^\\s*(\\S.*\\S)\\s*$', '\\1'), '')
RIGHT()¶
Use the RIGHT(string, length) function to return the rightmost “length” characters from “string”.
Return only email address domains¶
You can use the RIGHT() function, along with the CHAR_LENGTH() and INSTR() functions, to remove all characters that are not to the right of the “@” symbol in an email address, returning only the domain. The syntax is:
RIGHT(string, CHAR_LENGTH(string) - INSTR(string, substring)) AS email_domain
For example:
RIGHT(email, CHAR_LENGTH(email) - INSTR(email, '@')) AS email_domain
RLIKE()¶
Use the RLIKE(string, ‘expression’) function to return true when “string” matches “expression”.
Caution
A backslash ( \ ) contained within a regular expression MUST be escaped using another backslash. For example, a regular expression that matches a single numeric digit should be \\d and not \d.
Validate email addresses¶
The following example shows using the RLIKE() function within a CASE statement to return valid email addresses:
CASE
  WHEN email RLIKE('^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$')
  AND email_completion > 0 THEN true
  ELSE false
END AS contactable_email
Validate phone numbers¶
The following example shows using the RLIKE() function within a CASE statement to return valid phone numbers:
CASE
  WHEN phone RLIKE('^(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}$')
  AND phone_completion > 0 THEN true
  ELSE false
END AS contactable_phone
ROUND()¶
Use the ROUND(expression, x) function to return “expression” rounded to “x” decimal places.
Return rounded sum¶
The following example returns the average of a, b, and c, rounded to two decimal places, as the ABC score:
ROUND((CAST(a AS float) + CAST(b AS float) + CAST(c AS float)) / 3, 2) AS AbcScore,
ROW_NUMBER()¶
Use the ROW_NUMBER() function within a window function to assign a unique, sequential number to each row, starting with 1. For example:
ROW_NUMBER() OVER (PARTITION BY amperity_id ORDER BY order_date DESC) AS `rn`
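Keep the most recent row per Amperity ID¶
A common pattern, shown here as a sketch against the Unified Transactions table, uses ROW_NUMBER() to keep only the most recent order for each Amperity ID:
SELECT * FROM (
  SELECT
    *
    -- rn = 1 marks the most recent order for each Amperity ID
    ,ROW_NUMBER() OVER (PARTITION BY amperity_id ORDER BY order_date DESC) AS `rn`
  FROM Unified_Transactions
) AS ranked
WHERE rn = 1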
SHA2()¶
Use the SHA2(expression, 256) function to apply a one-way SHA-256 hash to fields, such as fields that contain PII data, so the values are no longer recognizable as valid PII.
For example:
SHA2(LOWER(TRIM(email)),256) AS email
Rehash PII data¶
Some segments send results downstream to support CCPA and/or GDPR workflows. Some CCPA and GDPR workflows send this data back to Amperity, which typically requires the data to be hashed using a domain table.
For example, to hash the name, email, and phone fields in a table named “tohash_ccpa”:
SELECT
  *
  ,SHA2(UPPER(TRIM(firstname)),256) AS Hash_firstname
  ,SHA2(UPPER(TRIM(lastname)),256) AS Hash_lastname
  ,SHA2(UPPER(TRIM(email)),256) AS Hash_email
  ,SHA2(UPPER(TRIM(phone_number)),256) AS Hash_phone
FROM tohash_ccpa
SPLIT()¶
Use the SPLIT(string, regex) function to split “string” at occurrences that match “regex”, and then return the results.
Note
In Spark SQL, the index starts at 0.
Split using index position¶
Specify an index to return a specific element from the split results. If the specified index position is out of range (larger than the number of elements returned by the split), the function returns NULL. For example:
SPLIT(GIFTCARDCODE,'-')[4]
Tip
Use the SPLIT_PART() function in Presto SQL. For example:
TRIM(SPLIT_PART(GIFTCARDCODE, '-', 1))
In Presto SQL, the index starts at 1.
Return username from email¶
To return only the username from an email address (the characters before the @ symbol):
SPLIT(REGEXP_REPLACE(LOWER(email), '[.]+', ''), '@')[0] AS email_username
Parse fields with multiple separators¶
Sometimes incoming data will contain data that should be tagged with more than one semantic tag, but also contain different separators within the incoming field. For example:
----------- ---------- ------------------- ------- ---------------------- -------
firstName lastName street poBox location zip
----------- ---------- ------------------- ------- ---------------------- -------
John Smith 123 Main #101 US - Yelm , WA 98597
Andy Jones 456 South Avenue US - Bellingham, WA 98115
Anne Andersen 999 S. Bergen Way US - Seattle ,WA 98104
----------- ---------- ------------------- ------- ---------------------- -------
where “location” represents country, city, and state, always separated by a dash ( - ) between the country and city, and a comma ( , ) between the city and state. Some fields contain extra whitespace between and around the strings.
The “location” field needs to be split into individual city, state, and country fields, with the two delimiters and the extra whitespace removed.
Use domain SQL similar to the following:
SELECT
  location
  ,TRIM(SPLIT(location, '-')[0]) AS country
  ,TRIM(SPLIT(SPLIT(location, '-')[1],',')[0]) AS city
  ,TRIM(SPLIT(location, ',')[1]) AS state
FROM domain_table
and then tag the city, state, and country fields with the appropriate semantic tags.
STRPOS()¶
Use the STRPOS(substring, string[, position]) function to return the first position of the first occurrence of “substring” in “string” after “position”. The given position and return value are 1-based.
Note
The LOCATE() function is the same as the STRPOS() function.
SUBSTR()¶
Use the SUBSTR(string, position, length) function to return the substring of “string” that starts at “position” for the specified “length”.
Return string as lowercase¶
To return the first character of the string contained in the “Gender” field as lowercase:
SUBSTR(LOWER(Gender),1,1) as GEN
SUBSTRING()¶
Use the SUBSTRING(string, position[, length]) function to return the substring of “string” that starts at “position” for the specified “length”.
Note
The “length” parameter is optional. Use it to return only the number of characters specified by “length”. When “length” is not specified, SUBSTRING() returns all characters from “position” through the end of “string”. For example:
SELECT SUBSTRING('Amperity', 5)
returns “rity”.
SELECT SUBSTRING('Amperity', -3)
returns “ity”.
SELECT SUBSTRING('Amperity', 5, 1)
returns “r”.
Remove country code from phone numbers¶
The following example returns phone numbers from multiple tables, and then removes the leading “+1” or “1” country code.
SELECT
  COALESCE(email,'NULL')||COALESCE(phone,'NULL') AS pk
  ,MAX(email) AS email
  ,MAX(phone) AS phone
FROM (
  SELECT
    sms.email
    ,CASE
      WHEN sms.phone LIKE '+1%' THEN SUBSTRING(sms.phone,3)
      WHEN sms.phone LIKE '1%' THEN SUBSTRING(sms.phone,2)
      ELSE sms.phone
    END AS phone
  FROM Attentive_Mobile_SMS sms
  UNION
  SELECT
    sub.Email
    ,CASE
      WHEN sub.phone LIKE '+1%' THEN SUBSTRING(sub.phone,3)
      WHEN sub.phone LIKE '1%' THEN SUBSTRING(sub.phone,2)
      ELSE sub.phone
    END AS phone
  FROM Attentive_Mobile_Subscribers sub
)
GROUP BY 1
Parse fields from DAT file¶
The following example shows an ingest query that parses fields from a DAT file. Each field (fields 1-6) has a starting point within the DAT file (1, 21, 52, 63, 69, 70) and a length (20, 30, 10, 15, 1, 140). Use an ordinal, such as “_c0”, to define each source field within the DAT file.
SELECT
  NULLIF(TRIM(SUBSTR(`_c0`,1,20)),'') AS Field1
  ,NULLIF(TRIM(SUBSTR(`_c0`,21,30)),'') AS Field2
  ,NULLIF(TRIM(SUBSTR(`_c0`,52,10)),'') AS Field3
  ,NULLIF(TRIM(SUBSTR(`_c0`,63,15)),'') AS Field4
  ,NULLIF(TRIM(SUBSTR(`_c0`,69,1)),'') AS Field5
  ,NULLIF(TRIM(SUBSTR(`_c0`,70,140)),'') AS Field6
FROM DAT_FILE_NAME
SUM()¶
Use the SUM(expression) function to return a sum calculated from the values defined in “expression”.
Use the COALESCE() function to map fields with NULL values to zero prior to using this function.
Note
The SUM() function returns NULL when “expression” is NULL for every row it aggregates. In some situations you must use the COALESCE() function to coalesce NULL values to zero prior to summing. For example:
SUM(COALESCE(lifetimeValue,0))
TO_DATE()¶
Caution
Do not use this function when writing ingest queries. Instead, use the datetime picker in the Feed Editor to define a date.
Use the TO_DATE(timestamp) function to parse “timestamp”.
Use the TO_DATE(timestamp +/- interval) function to parse “timestamp”, and then add ( + ) or subtract ( - ) “interval”, where “interval” is one of “hours”, “minutes”, “days”, and so on.
A date format is not required.
Tip
This function returns NULL when the input is invalid.
Find a date six hours ago¶
Use a minus sign ( - ) to subtract from a value. For example, to find 6 hours before the current time:
TO_DATE(CURRENT_TIMESTAMP - INTERVAL 6 HOURS)
Correct 2-digit years¶
Correcting a date that was incorrectly ingested into the year 19xx is not as simple as adding a fixed number of years; lots of things can go wrong. In some cases you may need to correct birthdates so that people are not born in the future. The recommended approach is to parse the date into its components, correct the year, and then re-parse it as a date.
The following example shows two ways to do this:
The first TO_DATE() function shows how to use the current year as a breakpoint, which assumes that birthdates don’t occur in the future and that the vast majority of people are less than 100 years old.
The second TO_DATE() function shows how to handle birthdates that occur after the year 2000.
SELECT
  customer_id
  ,birth_dt
  ,TO_DATE(
    CONCAT_WS(
      '-'
      ,INT(IF(SPLIT(birth_dt, "/")[2] > INT(DATE_FORMAT(CURRENT_DATE(), 'yy')), SPLIT(birth_dt, "/")[2] + 1900, SPLIT(birth_dt, "/")[2] + 2000))
      ,SPLIT(birth_dt, "/")[0]
      ,SPLIT(birth_dt, "/")[1]
    )
    ,'yyyy-MM-dd'
  ) AS birth_dt_corrected
  ,TO_DATE(
    CONCAT_WS(
      '-'
      ,INT(SPLIT(birth_dt, "/")[2]) + 2000
      ,SPLIT(birth_dt, "/")[0]
      ,SPLIT(birth_dt, "/")[1]
    )
    ,'yyyy-MM-dd'
  ) AS birth_dt_twentyfirstcentury
FROM CustomersPOS
TO_TIMESTAMP()¶
Caution
Do not use this function when writing ingest queries. Instead, use the datetime picker in the Feed Editor to define a timestamp.
Use the TO_TIMESTAMP(expression, format) function to return “expression” in the (optional) “format”.
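For example:
TO_TIMESTAMP('2021-11-10 14:30:00', 'yyyy-MM-dd HH:mm:ss')
returns that date and time as a timestamp value.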
TRANSLATE()¶
Use the TRANSLATE(input, from, to) function to replace the “from” characters in “input” with the “to” characters.
This function is useful for direct replacement of bad characters, simple cleanup tasks, and other situations that do not require any context or the use of regular expressions.
For example, TRANSLATE('abcde', 'cd', '45') will translate “abcde” to “ab45e”.
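Characters in “from” that have no counterpart in “to” are removed from the result, which makes this function useful for stripping unwanted characters. As a sketch (the phone column name is illustrative), the following removes common punctuation from phone numbers:
TRANSLATE(phone, '()-. ', '')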
Tip
Use the REGEXP_REPLACE() function to replace characters in string values using a regular expression.
TRIM()¶
Use the TRIM(string) function to remove leading and trailing characters from “string”.
Use LEADING, TRAILING, or BOTH as keywords to define the location from which characters are removed, and then FROM to specify the string to which the keyword is applied. BOTH is the default keyword.
The TRIM(string) function assumes it should trim spaces, which means all that is required to trim leading and/or trailing spaces is to specify the column name. For example:
TRIM(column_name)
Clean phone numbers¶
The following example uses the TRIM() function to find empty or NULL phone numbers, sets them to NULL, and then standardizes all phone numbers to add a +1:
CASE
  WHEN TRIM(phone) LIKE ''
  OR TRIM(phone) IS NULL
  THEN NULL
  ELSE CONCAT('+1','',REGEXP_REPLACE(phone,'[^0123456789]',''))
END AS phone
Trim leading characters¶
SELECT TRIM(LEADING '123' FROM '123string')
returns:
string
Trim trailing characters¶
SELECT TRIM(TRAILING 'ing' FROM '123string')
returns:
123str
Trim leading and trailing characters¶
SELECT TRIM(BOTH '12g' FROM '123string')
returns:
3strin
Trim, then set to uppercase¶
TRIM(UPPER(field)) AS field
UNIX_TIMESTAMP()¶
Use the UNIX_TIMESTAMP(time, format) function to return the UNIX timestamp for “time” in the specified “format”.
For example, assuming a UTC session time zone:
UNIX_TIMESTAMP('2021-11-10', 'yyyy-MM-dd');
will return:
1636502400
UPPER()¶
Use the UPPER(string) function to convert “string” to uppercase.
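For example, to standardize state values prior to comparison:
,UPPER(TRIM(state)) AS state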