About Spark SQL

Spark SQL is a high performance SQL query engine that is used by Amperity to ingest data, create domain tables, and extend the outcome of the Stitch process in your customer 360 database.

Why should I use this reference?

The Sources and Databases tabs use Spark SQL as the underlying SQL engine. Amperity database tables, custom domain tables, and ingest queries are built almost exclusively by using the SELECT statement, along with the clauses, operators, expressions, and functions you would expect to be available, though you may use additional functionality within Spark SQL as necessary.

Please refer to this reference first, and then to the official Spark SQL, version 3.1.2 documentation.

About Spark SQL

Use Spark SQL to define all SQL queries related to the following areas of Amperity:

  • Ingesting data, including ingest queries

  • Processing data into domain tables

  • Building custom domain tables

  • Loading data into Stitch

  • Running Stitch

  • Loading the results of Stitch into the customer 360 database

  • Defining tables in the customer 360 database

Note

Spark SQL is used to define all SQL queries related to the Stitch process up to (and including) building the tables in the customer 360 database. Presto SQL is used to define SQL queries for segments. Why both?

  • Spark SQL performs better in more traditional processes like machine learning and ETL-like processes that are resource intensive.

  • Presto SQL performs better when running real-time queries against cloud datasets.

Tip

Amperity uses Spark SQL, version 3.1.2 .

Recommendations

This is a list of recommendations for using Spark SQL with Amperity.

General

Do

  • Use consistent and descriptive identifiers and names.

  • Include comments in SQL code where necessary. Use the C style opening /* and closing */ where possible; otherwise precede comments with -- and finish them with a new line.

  • Make judicious use of white space and indentation to make code easier to read.

  • Store ISO-8601 compliant date and time information (YYYY-MM-DD HH:MM:SS.SSSSS).

  • Try to use only standard SQL functions instead of vendor-specific functions for reasons of portability.

  • Keep code succinct and devoid of redundant SQL, such as unnecessary quoting or parentheses or WHERE clauses that can otherwise be derived.

  • Convert string values to lowercase. For example: LOWER(“column_name”) = ‘value’.

  • Code review complex SQL statements.

  • Treat varchars as string values.

  • Use one-way SHA-256 hashes for fields that contain PII data.

  • Ensure that window functions are complete and use OVER() along with PARTITION BY and ORDER BY.

  • Use backticks around column names when they contain spaces or special characters, when they are fully numeric, or when the column name is also a reserved word.

Avoid

  • Reserved keyword names that are used as identifiers.

  • Columns aliased to the same name. For example, avoid email AS email.

  • Quotes (" ") around column names unless using them is necessary.

  • CamelCase, with the exception of the table name. CamelCase is more difficult to scan quickly.

  • Descriptive prefixes or Hungarian notation such as sp_ or tbl.

  • Plurals. Use the more natural collective term where possible instead. For example staff instead of employees or people instead of individuals.

  • Quote identifiers. Use SQL92 double quotes for portability, if possible.

  • Object-oriented design principles with SQL or database structures.

  • Boolean values wrapped in quotes. Use TRUE and FALSE without quotes.

  • Large tables defined within the NOT IN expression of a WHERE statement. These will run slowly and/or result in out-of-memory errors.

Add comments

Be sure to add comments to all SQL code that defines database tables or performs any type of pre-ingest processing, such as an ingest query.

Code comments should describe:

  • The purpose of the table, especially for an intermediate table that is consumed by other tables in a database.

  • The purpose of any sub-queries.

  • Situations where data is coalesced into single fields. Be sure to note what the rule is so that future maintainers of that SQL statement can apply it to their changes. For example: “Items are populated in the following order, when available: from loyalty tables, then POS tables, and then any historical tables.”

  • Non-obvious actions within the SQL statement. For example, when inserting static text as a date value or when joining on fields that appear to have different data types.

Avoid low cardinality

A JOIN operation will run slowly, use more memory, and return lower quality results when many rows on both sides of the operation contain the same value. (It’s fine if one side contains many rows with the same value.)

For example, a pos table that contains over a million rows of POS data with an email column where 50% of the rows have the email address “no@email.com”. In addition, a loyaltyProgram table with 100,000 rows, also 50% “no@email.com”.

Do this

SELECT *
FROM pos
LEFT JOIN loyaltyTable AS l ON (pos.email = l.email)
WHERE pos.email <> 'no@email.com'

This will filter out records with “no@email.com”, and then join them.

Not this

SELECT *
FROM pos
LEFT JOIN loyaltyTable AS l ON (pos.email = l.email)

This will cause the query processor to try and match every “no@email.com” email in the pos table to every email in the loyaltyProgram table, which results in 500,000 * 50,000 matches, or 25,000,000,000 records.

Avoid multiple passes

In some cases information from Amperity must be added to the data ingest process so that Amperity can correctly shape the data prior to the Stitch process. This should be done in a single pass.

  1. At the end of processing on day 1 create a “feedback segment” that contains the data to be shaped on day 2, and then output that segment data so that it may be ingested by Amperity on day 2.

  2. On day 2 ingest data from the customer’s sources normally, and then ingest the feedback segment.

  3. Join the day 2 data to the feedback segment, and then shape the results for Stitch.

  4. Ensure the data is made available to Stitch, and then run Stitch.

  5. Repeat the feedback segment process on subsequent days if necessary.

Commas

Commas are used as separators in SQL queries and are typically added at the end of a line, like this:

SELECT
  amperity_id AS amperity_id,
  max(given_name) AS given_name,
  max(surname) AS surname,
  max(email) AS email,
  max(phone) AS phone,
  max(address) AS address,
  max(city) AS city,
  max(state) AS state,
  max(postal) AS postal,
  max(birthdate) AS birthdate,
  max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id

Adding the comma at the start of the row is recommended for two reasons:

  1. It makes it visually obvious that a comma was not missed.

  2. It allows a field to be commented out without breaking anything else in the query.

For example:

SELECT
  amperity_id AS amperity_id
  ,max(given_name) AS given_name
  ,max(surname) AS surname
  ,max(email) AS email
  ,max(phone) AS phone
  ,max(address) AS address
--  ,max(address_2) AS address_line_two
  ,max(city) AS city
  ,max(state) AS state
  ,max(postal) AS postal
  ,max(birthdate) AS birthdate
  ,max(gender) AS gender
FROM Unified_Coalesced
GROUP BY amperity_id

Escape characters

Use a backslash ( \ ) as an escape character for strings that contain single and double quotes. For example:

,CASE
  WHEN store_name = 'Macy\'s'
  THEN 'Macy\'s' ELSE 'Other'
END AS store_name

Important

A backslash ( \ ) contained within a regular expression MUST be escaped using another backslash. For example, a regular expression that matches a single numeric digit should be \\d and not \d.

Daisy-chain LEFT JOINs

Daisy-chain the ON conditions when joining several tables together with a LEFT JOIN.

Do this

FROM tableA AS a
LEFT JOIN tableB AS b ON (a.key = b.key)
LEFT JOIN tableC AS c ON (b.key = c.key)

This allows the query processor to prune from the right when executing the query, only evaluating the rows of table A against the rows of table B (with whatever columns came along from table C).

Not this

FROM tableA as A
LEFT JOIN tableB AS b ON (a.key = b.key)
LEFT JOIN tableC AS c ON (a.key = c.key)

If b.key and c.key are unique, the query will work as expected. If “b.key” and “c.key” are not unique in their source tables, the query will run for a long time and return the wrong data because the system will return rows for all matching pairs of “b.key” and “c.key” for each value of “a.key”. An acceptable workaround in situations like this is to first aggregate table B and table C on the key in a subquery, and then join to table A.

Faster operations are better

The cost of operations in Spark SQL depends on many things.

Cost

Discussion

Minimal

A query will run fastest when:

  • SELECT statements are restricted to specific columns.

  • WHERE clauses are restricted to specific rows.

  • JOIN clauses on more than one field run faster when fewer rows are evaluated.

Low

The following actions have a minimal affect on Spark SQL performance:

  • Value-atomic functions, such as type conversions and TRIM(), operate on a single value.

  • Row-atomic functions, such as CONCAT(), operate on a single row, which keeps the dataset small.

  • COALESCE() runs across rows, but generally stops when it finds the first value.

  • UNION appends one set of rows to another, without deduplication effort.

Medium

The following actions will reduce the performance of Spark SQL in various ways. They are often necessary, but should be used carefully and with consideration of the performance cost:

  • SELECT DISTINCT operations require building at least one index over the data and may require shuffling work between workers. SELECT DISTINCT operations should be done at the outermost level of a query against data that has already had some deduplication performed against it.

    Note

    SELECT DISTINCT operations that remove duplicates using a single subquery can be done in that location, often with a reasonable performance cost.

  • Windowing functions, such as OVER, PARTITION BY, and ORDER BY, break the dataset up into pieces that are parallelizable, but a significant amount of work may be required on each piece and there may be many pieces, so the aggregate cost can still be very high.

Expensive

The following actions are expensive:

  • A global ORDER BY can be difficult to partition effectively and results in significant shuffles of data across workers. This type of operation is generally unnecessary because database contents are not rendered directly.

Filter, then combine

A pattern that filters out unneeded rows and selects only necessary columns is much faster than first gathering together all of the information that may be required.

Do this

SELECT
  uct.amperity_id,
  uct.first_name,
  uct.last_name,
  ecomm.last_order_date
FROM
(
  SELECT
    amperity_id,
    first_name,
    last_name
  FROM Unified
  WHERE datasource <> 'Example'
  GROUP BY amperity_id
) AS uct
LEFT JOIN ecomm ON (ecomm.amperity_id = uct.amperity_id)

Not this

SELECT
  uct.amperity_id,
  uct.first_name,
  uct.last_name,
  ecomm.last_order_date
FROM Unified
LEFT JOIN ecomm ON (ecomm.amperity_id = uct.amperity_id)
WHERE Unified.datasource <> 'Example'

Filter NULL values, then join

In some cases the presence of NULL values can cause skew when joined to other tables. You can reduce skew by filtering out NULL values prior to the join operation. For example:

LEFT JOIN (
  SELECT *
  FROM Merged_Customers
  WHERE email IS NOT NULL
  AND phone IS NOT NULL
) mc
on table_A.email = mc.email
or table_A.phone = mc.phone

Identifiers

An identifier is a string that associates a database, table, or column to its parent and child objects within the same database or table. An identifier is qualified using a dot separator ( . ) and is often referred to as “dot notation”. For example:

database_name.table_name.column_name

To use table identifiers to flatten nested XML data

Note

This example uses an example XML file as the data source for sales transactions.

Use identifiers and aliases to flatten nested XML data with an ingest query, similar to:

SELECT
  salesTransactionId AS id
  ,type
  ,dateTime AS salesDateTime
  ,salesOrder.salesOrderId AS salesOrderId
  ,salesOrder.channelType AS channelType
  ,salesOrder.orderSummary.totalAmount AS totalAmount
FROM PosXml

returns a table similar to:

----- ------ ---------------------- -------------- ------------- -------------
 id    type   salesDateTime          salesOrderId   channelType   totalAmount
----- ------ ---------------------- -------------- ------------- -------------
 ABC   Add    2020-11-15T04:54:34Z   A1zyBCxwvDu    Cafe          120
 DEF   Add    2020-11-15T04:55:25Z   B1yxCDwvuEt    Cafe          14
 GHI   Add    2020-11-15T04:57:12Z   C1xwDEvutFs    Cafe          27
----- ------ ---------------------- -------------- ------------- -------------

Indentation

To ensure that SQL is readable it is important that standards of indentation are followed.

Formalisms

Make use of BETWEEN where possible instead of combining multiple statements with AND. Similarly use IN() instead of multiple OR clauses. Where a value needs to be interpreted before leaving the database use the CASE expression. CASE statements can be nested to form more complex logical structures. Avoid the use of UNION clauses and temporary tables where possible.

Join operations

Joins should be aligned with the FROM and grouped with a new line where necessary.

Subqueries

Subqueries should be aligned to the line above them, but then follow standard indentation patters from that location. Sometimes it will make sense to have the closing parenthesis on a new line at the same character position as its opening partner—this is especially true where you have nested subqueries.

Limit expensive operations

Apache Spark breaks problems into little parts, works on each little part, and then rolls them back together. This means that operations that can be easily done in chunks, like “find everyone named ‘Justin’”, run much faster than operations over the whole dataset, like “give me a list of unique first names.”

Tip

This behavior also affects operations like ORDER BY for similar reasons. In order for Spark to determine the correct order it breaks problems into little parts, works on each little part, and then rolls them back together.

The following diagram shows how Apache Spark finds everyone named Justin:

Limit expensive load operations.

The following diagram shows how Apache Spark finds a list of unique first names. Each worker must deduplicate first names, be recombined, and then broken into pieces (again), after which more deduplication is performed.

This process continues until there is no overlap between record sets for each worker. Apache Spark has to break the work apart and bring it back together again several times to get to a final set of unique names:

Limit expensive load operations.

Limit extra work

Limit the work done by a SQL query to improve overall performance. The following examples show how to use window functions to create subsections, order them by date, and then pick the first one.

Do this

SELECT
  amperity_id,
  FIRST_VALUE(LOWER(EMAIL)) OVER
    (PARTITION BY amperity_id)
    AS email_address,
  CAST('2017-12-22' AS DATE) as merged_date
FROM FlatFiles_CampsiteGawEmailListDec2017
WHERE EMAIL IS NOT NULL

This example performs better and achieves the same result.

Not this

SELECT DISTINCT
  amperity_id,
  FIRST_VALUE(LOWER(EMAIL_ADDRESS_)) OVER
    (PARTITION BY amperity_id ORDER BY merged_date DESC)
    AS email_address,
  FIRST_VALUE(merged_date) OVER
    (PARTITION BY LOWER(EMAIL_ADDRESS_) ORDER BY merged_date DESC)
    AS mergeddate
  FROM (
    SELECT
      amperity_id,
      EMAIL_ADDRESS_,
      CAST('2017-12-22' AS DATE) as merged_date
    FROM FlatFiles_TevaSweepstakesDecember2017
    WHERE EMAIL_ADDRESS_ IS NOT NULL)
  WHERE merged_date IS NOT NULL)

This example does extra work to set all dates to the same value and is slower.

Limit tables

A database should only include tables that are useful to downstream activity, such as building segments or for a database export. As a general rule, segment authors should never use any tables with a name that starts with “Unified” to build a segment. Use the passthrough option to make available certain domain tables that contain data that is useful for segmentation.

Load sizes

Apache Spark prefers load sizes to range between 1-10000 files and file sizes to range between 1-1000 MB. Apache Spark will parse 100 x 10 MB files faster than 10 x 100 MB files and much faster than 1 x 10000 MB file. When loading large files to Amperity, as a general guideline to optimize the performance of Apache Spark, look to create situations where:

  • The number of individual files is below 3000.

  • The range of individual file sizes is below 100 MB.

Put differently, Apache Spark will parse 3000 x 100 MB files faster than 300 x 1000 MB files and much faster than 30 x 10000 MB files.

Naming conventions

Ensure the name is unique and does not exist as a reserved keyword. Keep the length to a maximum of 30 bytes—in practice this is 30 characters unless you are using multi-byte character set. Names must begin with a letter and may not end with an underscore. Only use letters, numbers, and underscores in names. Avoid the use of multiple consecutive underscores, as they can be hard to read. Use underscores where you would naturally include a space in the name (first name becomes first_name). Avoid abbreviations and if you have to use them make sure they are commonly understood.

SELECT first_name
FROM Customer360

Alias names

An alias assigns a temporary name to a column or a table that only exists for the duration of a query. An alias is defined after the AS keyword.

For columns:

SELECT column_name AS alias_name
FROM table_name;

For tables:

SELECT column_name(s)
FROM table_name AS alias_name;

When using an alias:

  • Assign names that clearly relate to the column or table. .. importatn:: “Do not use an alias when its name matches the original column name.

  • Always use the AS keyword to ensure readability. .. tip:: Include the AS keyword when aliasing columns in a ‘SELECT’ statement.

  • For computed data–SUM() or AVG()–use the name you would give it were it a column defined in the schema.

  • Always wrap an aliased field name with single back ticks:

    `given_name`
    

The following example shows using aliases to add clarity to a SELECT statement that selects columns from more than one table. The “c360” alias refers to the Customer 360 table and the “el” alias refers to the Email_List table:

SELECT c360.email, c360.given_name, c360.surname, el.status
FROM Customer360 c360

LEFT JOIN Email_List el ON c360.email = el.email
WHERE status IS NOT NULL

Column names

Always use the singular name. Where possible avoid simply using id as the primary identifier for the table. Do not add a column with the same name as its table and vice versa. Always use lowercase except where it may make sense not to, such as with proper nouns.

Correlation names

A correlation name assigns a temporary name to a table. A correlation name is defined within the FROM clause after the table name. For example:

FROM table_name correlation_name;

When using correlations:

  • Assign names that clearly relate to the table. For example: a table named “Merged_Customers” should have a correlation name of “um”.

  • Always use the first letter of the table as the first letter of the correlation.

  • Append numbers as necessary to ensure unique correlation names.

  • Only use the correlation name within the rest of same query.

Suffix names

The following suffixes represent patterns that should be applied to column names in SQL tables. These patterns help ensure the purpose of columns can be easily understood by all users. Use the correct suffixes, use them consistently, and use them everywhere they are appropriate.

Suffix

Description

_id

A unique identifier such as a column that is a primary key.

_status

A flag value or some other status of any type such as “publication_status”.

_total

The total or sum of a collection of values.

_num

Denotes the field contains any kind of number.

_name

Signifies a name such as “first_name”.

_seq

Contains a contiguous sequence of values.

_date

Denotes a column that contains the date of something.

_tally

A count.

_size

The size of something such as a file size or clothing.

_addr

An address for the record could be physical or intangible such as “ip_addr”.

Table names

Table names must be unique within the same database or database generation will fail.

NULL values

Functions may fail when they encounter a NULL value and others may return NULL values if any of their arguments return NULL values.

  • Use the COALESCE() function to convert to a zero-length string when using the CONCAT() and SUM() functions.

  • Use the COALESCE() function to to identify math. For example, multiplication will return NULL if any field is NULL. For example, because 1 is the identity for multiplication, use COALESCE(myColumn, 1)

One-way SHA-256 hashes

Apply one-way SHA-256 hashes to fields that contain PII data. A one-way hash ensures that data can no longer be recognizable as valid PII, yet still allows that data to identified by downstream processes.

A one-way SHA-256 hash has the following syntax:

SHA2(LOWER(TRIM(FIELD)),256) AS FIELD

and uses the following Spark SQL functions:

  • SHA2() hashes data with a one-way SHA-256 hash.

  • LOWER() sets all characters to lowercase.

  • TRIM() removes leading and trailing whitespace.

  • FIELD is the name of the field that contains PII data.

For example:

SHA2(LOWER(TRIM(email)),256) AS email

Reserved words

Always use uppercase for the reserved keywords like SELECT and WHERE. It is best to avoid the abbreviated keywords and use the full length ones where available (prefer ABSOLUTE to ABS). Do not use database server specific keywords where an ANSI SQL keyword already exists performing the same function. This helps to make code more portable.

Reserved keywords

The following keywords are reserved and must be double-quoted to be used as an identifier.

  • ALTER

  • AND

  • AS

  • BETWEEN

  • BY

  • CASE

  • CAST

  • CONSTRAINT

  • CREATE

  • CROSS

  • CUBE

  • CURRENT_DATE

  • CURRENT_TIME

  • CURRENT_TIMESTAMP

  • CURRENT_USER

  • DEALLOCATE

  • DELETE

  • DESCRIBE

  • DISTINCT

  • DROP

  • ELSE

  • END

  • ESCAPE

  • EXCEPT

  • EXECUTE

  • EXISTS

  • EXTRACT

  • FALSE

  • FOR

  • FROM

  • FULL

  • GROUP

  • GROUPING

  • HAVING

  • IN

  • INNER

  • INSERT

  • INTERSECT

  • INTO

  • IS

  • JOIN

  • LEFT

  • LIKE

  • LOCALTIME

  • LOCALTIMESTAMP

  • NATURAL

  • NORMALIZE

  • NOT

  • NULL

  • ON

  • OR

  • ORDER

  • OUTER

  • PREPARE

  • RECURSIVE

  • RIGHT

  • ROLLUP

  • SELECT

  • TABLE

  • THEN

  • TRUE

  • UESCAPE

  • UNION

  • UNNEST

  • USING

  • VALUES

  • WHEN

  • WHERE

  • WITH

Subqueries

A subquery can be useful for shaping data prior to running a query. A subquery is a task that is required to be completed before additional processing can be performed. A subquery often runs quickly and can be used with little risk. That said, a poorly formed subquery can have adverse performance effects. Keep the following tips in mind when using a subquery:

  1. Don’t join to an outer query from a subquery, as it may cause the subquery to run recursively for every value of the join key in the outer query. This may cause the subquery to run many times (or even millions of times). This type of subquery is sometimes referred to as a correlated subquery. The best way to avoid them is to never join between the inner and outer queries.

  2. Each level of a subquery adds overhead. The fewer levels within a subquery, the easier it is to optimize the query. Try to flatten queries using joins instead of using additional levels. If you need to pull data into a subquery for multiple fields, try to use a single subquery instead of many.

  3. Remove columns and/or use as few columns in a subquery as possible. For example, use:

    SELECT
      column_1
     ,column_2
    FROM
    

    instead of

    SELECT
      *
    FROM
    

    This will reduce the amount of data that must be processed.

Temporary tables

A temporary table is a table that is built from another table in the database and can be a useful way to processes repeatable tasks one time as opposed to using a subquery that runs many times. A temporary table is built into the database, which means it will be visible to users who have access to authoring segments. This is not always a desired outcome.

Consider using a temporary table in place of a subquery, but only when that subquery is being run multiple times.

Caution

If you need to use a temporary table, be sure to use a naming convention for the table that makes it obvious to segment authors that it is not a table that should be used for segmentation. For example, prefix a temporary table with ZZ_TEMP_ so that it appears at the end of the list of tables. Column names within the temporary table should be easy to comprehend.

Whitespace

To make the code easier to read it is important that the correct complement of spacing is used. Do not crowd code or remove natural language spaces.

Line spacing

Always include newlines/vertical space:

  • before AND or OR

  • after semicolons to separate queries for easier reading

  • after each keyword definition

  • before a comma when separating multiple columns into logical groups

  • to separate code into related sections, which helps to ease the readability of large chunks of code.

Putting commas and conjunctions at the start of the line makes it easier to comment out a single line without disturbing the rest of the query

SELECT a.title
  ,a.release_date
  --,a.recording_date
  ,a.production_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
  OR a.title = 'The New Danger';

Spaces

Spaces should be used to line up the code so that the root keywords all start on the same character boundary, and also so that this makes it easy to keep track of where you are in a query that may be multiple layers deep.

(SELECT f.species_name
  ,AVG(f.height) AS `average_height`
  ,AVG(f.diameter) AS `average_diameter`
FROM flora AS f
WHERE f.species_name = 'Banksia'
  OR f.species_name = 'Sheoak'
  OR f.species_name = 'Wattle'
GROUP BY f.species_name, f.observation_date)

UNION ALL

(SELECT b.species_name
  ,AVG(b.height) AS `average_height`
  ,AVG(b.diameter) AS `average_diameter`
FROM botanic_garden_flora AS b
WHERE b.species_name = 'Banksia'
  OR b.species_name = 'Sheoak'
  OR b.species_name = 'Wattle'
GROUP BY b.species_name, b.observation_date)

Although not exhaustive always include spaces:

  • before and after equals ( = )

  • after commas ( , )

  • surrounding apostrophes ( ‘ ) where not within parentheses or with a trailing comma or semicolon.

SELECT a.title, a.release_date, a.recording_date
FROM albums AS a
WHERE a.title = 'Charcoal Lane'
  OR a.title = 'The New Danger'

WITH clause

The WITH clause defines a common table expression (CTE).

A common table expression (CTE) is a named subquery defined by the WITH clause. A CTE defines a set of variables that act as a temporary view, an optional list of column names, and a query expression. The results of the query expression behaves like a table. Each column of that table is defined by the optional list of column names.

A window function is often used to group subsets of data prior to running a query. A window function should always include the OVER and PARTITION BY clauses to optimize the performance of your query.

Caution

A window function with an OVER statement that does not include a PARTITION BY clause often leads to performance issues when the OVER statement is asked to run across a large number of rows.

LATERAL VIEW clause

The LATERAL VIEW clause generates a virtual table containing one (or more) rows, where each row is applied to the original output row.

Use LATERAL VIEW OUTER to return NULL if the input array or map is empty or NULL.

Note

The LATERAL VIEW clause is often used along with the EXPLODE generator function.

For example, the following common table expression generates virtual tables for email and physical addresses:

WITH
  emails AS (
    SELECT
      unique_id
      ,UPPER(em.email_address) AS email_address
    FROM customer_table
    LATERAL VIEW OUTER EXPLODE(email) as em
  ),
  addresses AS (
    SELECT
      unique_id
      ,UPPER(ad.address) AS `address`
      ,UPPER(ad.address2) AS `address2`
      ,UPPER(ad.city) AS `city`
      ,UPPER(ad.state) AS `state`
      ,UPPER(ad.country) AS `country`
      ,ad.zip AS `postal`
      ,ad.zip4 AS `postal4`
    FROM customer_table
    LATERAL VIEW OUTER EXPLODE(address) as ad
  ),

after which you can use the SELECT statement to select individual columns from the virtual tables, and then join them:

SELECT
  c.unique_id AS `uuid`
  a.address
  a.address2
  a.city
  a.state
  a.country
  a.postal
  a.postal4
  e.email_address AS `email`
FROM customer_table c
LEFT JOIN emails e ON e.unique_id = c.unique_id
LEFT JOIN addresses a ON a.v = c.unique_id

SELECT statement

The SELECT statement is used to define a set of data to be returned from a data table. The set of returned data is often referred to as the result-set. Use the SELECT statement to retrieve rows from any table.

SELECT DISTINCT statement

The SELECT DISTINCT statement is used instead of SELECT to return only distinct rows of data.

FROM clause

The FROM clause specifies the name of the data table against which the SQL query will be run and is part of every SELECT statement.

JOIN

Spark SQL supports all JOIN operations available in standard SQL.

Warning

Joining on fields with many matching values is slow. The query processor in Spark SQL finds the match between each possible pair of matching values and the number of pairs goes up by the product of the matching duplicates in each table.

Joining two tables with 1 million records in each table, where each record in one table matches only 1 record in the other will result in 1 million pairs.

Joining two tables with 1 million records in each table, where all the records match will result in 1 trillion records.

ON() clause

Use the ON() clause to specify the join criteria. Contents of the ON() clause must be wrapped in parentheses.

For example:

FULL JOIN table_name AS tn
ON (Members.`amperity_id` = tn.`Members_PK`)

USING() clause

Use the USING() clause to use column names to specify the join criteria. Contents of the USING() clause must be wrapped in parentheses.

For example:

FULL JOIN table_name AS tn
USING (column_name)

CROSS JOIN

A CROSS JOIN returns a Cartesian product. A Cartesian product combines every item in the first table with every item in the second. For example, if table A has three items and table B has three items, the Cartesian product is 9 pairs.

A CROSS JOIN should generally be avoided due to typical size of tables in Amperity databases, which can often have millions of rows.

A Cartesian product in Amperity between any two tables is often a very, very large number of pairs and, as a result, is an expensive operation. It is recommended to optimize your Spark SQL queries to avoid implicit cross joins.

Tip

Take steps to avoid implicit cross-joins:

  1. Use the ON() or USING() clauses with any JOIN statement.

  2. For situations where you need to run a SELECT statement like

    SELECT * FROM Table_A, Table_B
    

    first filter out NULL values before running the SELECT statement.

FULL OUTER JOIN

A FULL OUTER JOIN returns all of the rows in both tables A and B, no matter whether there’s a row in the other table with the same key value. Where there is a matching key value, columns from the other table will be brought in, where there’s no matching value for key columns from the other table will be set to NULL. While there is no filtering applied in a FULL OUTER JOIN, rows that match on the key are only included once, so the row count of the results will always be at most equal to the count of table A + the count of table B, and will usually be considerably smaller.

INNER JOIN

An INNER JOIN returns only those rows where there are values for the field key in both tables. Any rows in either table A or table B that have a value for key that doesn’t appear in the other table are not returned.

LEFT JOIN

The LEFT JOIN clause joins rows from two tables. For a LEFT JOIN, each row in the left table is joined with all matching rows from the right table. For rows with no match in the right table, the join is completed with NULL to represent column values.

For example:

SELECT * FROM (VALUES 1, 2) t("left")
  LEFT JOIN (VALUES 1, 1) u("right")
  ON t."left" = u."right";

will return a table similar to:

-------- --------
 left     right
-------- --------
 1        1
 1        1
 2        NULL
-------- --------

Note

In some situations you may want to use an LEFT ANTI JOIN. A LEFT ANTI JOIN returns values from the left-side table when they do not have matches on the right-side table. Use a LEFT ANTI JOIN to avoid performance issues that may be caused by a broadcast join.

OUTER JOIN

An outer joins bring all of the rows from the named table (left for LEFT JOIN, right for RIGHT JOIN), but the overlapping rows can be removed by adding the condition WHERE <key value in the other table> IS NULL. You can do the same to select the non-intersecting portion of a FULL OUTER join by using OR with both key values as IS NULL.

Same as LEFT JOIN and RIGHT JOIN.

RIGHT JOIN

The RIGHT JOIN clause joins rows from two tables. For a RIGHT JOIN, each row in the right table is joined with all matching rows from the left table. For rows with no match in the left table, the join is completed with NULL to represent column values.

WHERE clause

The WHERE clause is used to filter records, and then return only those records that match the specified condition and value.

The WHERE clause can be combined with the AND, OR, and NOT operators and the IN, NOT IN, EXISTS, and NOT EXISTS expressions.

AND operator

The AND operator is used to filter records based on more than one condition and displays a record if all the conditions separated by AND are TRUE.

OR operator

The OR operator is used to filter records based on more than one condition and displays a record if any of the conditions separated by OR is TRUE.

NOT operator

The NOT operator displays a record if the condition is NOT TRUE.

IN expression

The IN expression returns a TRUE or FALSE value or UNKNOWN when the returned list contains NULL.

Tip

The IN expression is equivalent to the OR operator. For example: IN (a, b) is equivalent to a OR b.

NOT IN expression

The NOT IN expression returns a TRUE or FALSE value or UNKNOWN when the returned list contains NULL. Use WHERE … NOT IN … when the argument within the NOT IN expression is:

  • A small table with no more than a few thousand rows and is not expected to grow significantly over time.

  • A hard-coded list of values. For example:

    WHERE color NOT IN ('red', 'blue', 'green')
    

Warning

The use of WHERE … NOT IN … will cause performance issues when the argument within NOT IN is a large table or large query. This is because Spark SQL will implement the table or query as a broadcast join, and then attempt to broadcast that table or query to every worker node in the Spark cluster. This may cause Spark SQL to run slowly or result in out-of-memory errors.

To prevent slow queries and/or out-of-memory errors use a LEFT ANTI JOIN operation instead of a WHERE … NOT IN <table or query>. For example:

SELECT *
FROM table_a
LEFT ANTI JOIN (SELECT id FROM table_b) AS remove
ON table_a.id = remove.id

instead of:

SELECT *
FROM table_a
WHERE table_a.id NOT IN (SELECT id FROM table_b)

EXISTS expression

The EXISTS expression returns TRUE when the subquery to which it refers returns one (or more) rows, or returns FALSE.

NOT EXISTS expression

The NOT EXISTS returns TRUE when the subquery to which it refers returns no rows or zero rows, or returns FALSE.

Tip

The NOT EXISTS expression runs the same way in Spark SQL and Presto SQL. This is useful for running the same query in the Customer 360 and the Queries pages. For example:

SELECT *
FROM table_a
WHERE table_a.id NOT EXISTS (SELECT id FROM table_b)

is similar to:

SELECT *
FROM table_a
LEFT ANTI JOIN (SELECT id FROM table_b) AS remove
ON table_a.id = remove.id

The important difference is WHERE … NOT EXISTS … runs the same way in both Spark SQL and Presto SQL, wheres LEFT ANTI JOIN may only be used in Spark SQL.

CASE expression

The standard SQL CASE expression has two forms: simple and searched.

Note

If a CASE expression is too long, add a new line between WHEN and THEN clauses to make the line more readable.

Simple form

The simple form searches each value expression from left to right until it finds one that equals expression:

CASE expression
  WHEN value THEN result
  [ WHEN ... ]
  [ ELSE result ]
END

The result for the matching value is returned.

If no match is found, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:

SELECT a,
  CASE a
    WHEN 1 THEN 'one'
    WHEN 2 THEN 'two'
    ELSE 'many'
  END

Searched form

The searched form evaluates each boolean condition from left to right until one is true and returns the matching result:

CASE
  WHEN condition THEN result
  [ WHEN ... ]
  [ ELSE result ]
END

If no conditions are true, the result from the ELSE clause is returned if it exists, otherwise NULL is returned:

SELECT a, b,
  CASE
    WHEN a = 1 THEN 'aaa'
    WHEN b = 2 THEN 'bbb'
    ELSE 'ccc'
  END

GROUP BY clause

The GROUP BY clause divides the output of a SELECT statement into groups of rows containing matching values. A simple GROUP BY clause may contain any expression composed of input columns or it may be an ordinal number selecting an output column by position (starting at one).

The following queries are equivalent. They both group the output by the nationkey input column with the first query using the ordinal position of the output column and the second query using the input column name:

SELECT count(*), nationkey FROM customer GROUP BY 2;

is equivalent to:

SELECT count(*), nationkey FROM customer GROUP BY nationkey;

GROUP BY clauses can group output by input column names not appearing in the output of a select statement. For example, the following query generates row counts for the customer table using the input column “mktsegment”:

SELECT count(*) FROM customer GROUP BY mktsegment;

returns a table similar to:

 _col0
-------
 29968
 30142
 30189
 29949
 29752
(5 rows)

When a GROUP BY clause is used in a SELECT statement all output expressions must be either aggregate functions or columns present in the GROUP BY clause.

HAVING clause

The HAVING clause is used to sort a result set by one or more output expressions. Use in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. A HAVING clause eliminates groups that do not satisfy the given conditions and filters groups after groups and aggregates are computed.

For example:

SELECT COUNT(*), segment, key,
       CAST(SUM(actual_loyalty) AS bigint) AS loyalty
FROM Customer360
GROUP BY segment, key
HAVING SUM(actual_loyalty) > 5700000
ORDER BY loyalty DESC;

returns a table similar to:

------- ------------ ------ ----------
 id      segment      key    loyalty
------- ------------ ------ ----------
 1272    CARS         19     5856939
 1253    PLANES       14     5794887
 1248    HOTELS       9      5784628
 1243    HOTELS       12     5757371
 1231    CARS         3      5753216
 1251    CARS         2      5719140
 1247    PLANES       8      5701952
------- ------------ ------ ----------

Window functions

Window functions are a way to evaluate rows around each row as it is being evaluated. There’s great flexibility in controlling how the windows are made (i.e. which other rows to consider), but for most uses I’ve seen in Amperity databases, we use a relatively small subset to group the rows in the data set by the unique values of some field (i.e. like a GROUP BY) and then select a row from that group. In addition to great flexibility on which rows to include in a group, there’s a powerful set of functions you can run across the group as well, and again the portions we generally use in Amperity are relatively small. So, you can use the info below for guidelines of what to write, and can learn more of the expressiveness available at your leisure.

Caution

From a performance point of view, window functions tend to be relatively performant as long as the OVER() function contains both PARTITION BY and ORDER BY and will run quickly in Spark and have performance similar to a JOIN operation.

Tip

When a WINDOWS function is too long, add a new code line to improve readability.

That said, an OVER() clause that does not contain both PARTITION BY and ORDER BY may run into situations where a window function without PARTITION BY will result in Spark running 50 million rows through a single partition, which can cause a significant reduction in performance.

A common use in Amperity might look like this:

FIRST_VALUE(LOWER(EMAIL_ADDRESS_)) OVER (
  PARTITION BY amperity_id
  ORDER BY merged_date DESC)
  AS email_address,

It says “Group the records by amperity_id, and for each group return the one with the latest merged_date”. Or, more briefly, “Give me the most recent email address for each customer.” The window function bits do the following:

  • FIRST_VALUE(<field name>) - the FIRST_VALUE() window function picks the first record out of some set of records. In this case, it’s selecting from the window / partition, which is defined next.

  • OVER() - the OVER() function sets up the window that we’ll be finding records within. This function should include both PARTITION BY and ORDER_BY functions.

  • PARTITION BY amperity_id - the PARTITION BY <field name> function behaves in a way analogous to a GROUP BY in that it groups all records with unique values for the specified field together. So here it creates a subset of rows for each Amperity ID.

  • ORDER BY merged_date DESC - the ORDER BY() function is just like it is in a SELECT statement, it sorts the rows being operated on. The only difference is that in this case, it is only sorting the rows within the partition, so in this example it’s sorting the rows for each Amperity ID. And DESC simply says sort in descending order, so most recent date is first.

  • AS <field name> - the final clause of the statement above isn’t really part of the window function at all, but is the AS statement you’ve used before to set the name of the projected column. In this example the results will be put in a column with the name email address.

ORDER BY clause

The ORDER BY clause is used to sort a result set by one or more output expressions.

ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]

Each expression may be composed of output columns or it may be an ordinal number selecting an output column by position (starting at one).

The default sort order is ascending (ASC). NULL values will sort first when the sort order is ascending and will sort last when the sort order is descending (DESC). Use NULLS FIRST to sort NULL values first, regardless of sort order. Use NULLS LAST to sort NULL values last, regardless of sort order.

The ORDER BY clause is evaluated as the last step of a query after any GROUP BY or HAVING clause.

LIMIT clause

The LIMIT clause restricts the number of rows in the result set. The following example queries a large table, but the limit clause restricts the output to only have five rows (because the query lacks an ORDER BY, exactly which rows are returned is arbitrary):

SELECT orderdate FROM orders LIMIT 5;

returns a table similar to:

 o_orderdate
-------------
 1996-04-14
 1992-01-15
 1995-02-01
 1995-11-12
 1992-04-26
 (5 rows)

Operators

An operator compares two data items, and then returns a result.

BETWEEN

Use the BETWEEN operator to test if a value falls within the specified range using the syntax value BETWEEN min AND max:

SELECT 3 BETWEEN 2 AND 6;

The statement shown above is equivalent to the following statement:

SELECT 3 >= 2 AND 3 <= 6;

Presence of NULL evaluates to NULL

The presence of NULL will result in the statement evaluating to NULL:

SELECT NULL BETWEEN 2 AND 4;

and:

SELECT 2 BETWEEN NULL AND 6;

String arguments must be of same type

Use the BETWEEN operator to evaluate string arguments as long as the value, min, and max parameters are of the same type:

SELECT 'Paul' BETWEEN 'John' AND 'Ringo';

whereas this query will produce an error:

SELECT '2.3' BETWEEN 'John' AND '35.2';

Comparison operators

The following comparison operators are available:

Operator

Description

<

Less than

>

Greater than

<=

Less than or equal to

>=

Greater than or equal to

=

Equal

<>

Not equal

!=

Not equal (non-standard but popular syntax)

ALL

Use ALL together with comparison operators in the following way:

expression operator ALL ( subquery )

For example:

SELECT 21 < ALL (VALUES 19, 20, 21); -- false

Examples of ALL comparison operator combinations:

Expression

Meaning

A = ALL (…)

Evaluates to TRUE when A is equal to all values.

A <> ALL (…)

Evaluates to TRUE when A doesn’t match any value.

A < ALL (…)

Evaluates to TRUE when A is smaller than the smallest value.

ANY

Use ANY together with comparison operators in the following way:

expression operator ANY ( subquery )

For example:

SELECT 21 < ALL (VALUES 19, 20, 21); -- false

Examples of ANY comparison operator combinations:

Expression

Meaning

A = ANY (…)

Evaluates to TRUE when A is equal to any of the values.

Note

This form is equivalent to A IN (…).

A <> ANY (…)

Evaluates to TRUE when A doesn’t match one or more values.

A < ANY (…)

Evaluates to TRUE when A is smaller than the biggest value.

SOME

Use SOME together with comparison operators in the following way:

expression operator SOME ( subquery )

For example:

SELECT 42 >= SOME (SELECT 41 UNION ALL SELECT 42 UNION ALL SELECT 43); -- true

Examples of SOME comparison operator combinations:

Expression

Meaning

A = SOME (…)

Evaluates to TRUE when A is equal to any of the values.

Note

This form is equivalent to A IN (…).

A <> SOME (…)

Evaluates to TRUE when A doesn’t match one or more values.

A < SOME (…)

Evaluates to TRUE when A is smaller than the biggest value.

IS NULL

The IS NULL operator tests whether a value is NULL. This operator works for all data types.

Using NULL with IS NULL evaluates to TRUE:

select NULL IS NULL;

But any other constant does not evaluate to FALSE:

SELECT 3.0 IS NULL;

IS NOT NULL

The IS NOT NULL operator tests whether a value is not NULL. This operator works for all data types.

Using NULL with IS NOT NULL evaluates to FALSE:

select NULL IS NOT NULL;

But any other constant evaluates TRUE:

SELECT 3.0 IS NOT NULL;

PIVOT operator

The PIVOT operator allows data to be reshaped into a new table. Use values from within a field to define column names, and then group rows by those values and by other fields, such as the Amperity ID. A pivot must declare how data is to be reshaped – SUM(), MAX(), or by some other grouping or aggregation – and then must declare from which field in the existing table the new column names are derived.

For example, a transactions table has a field with three values: PUR (purchases), RET (returns), and OTH (other). Use PIVOT to return a table that contains columns named Purchases, Returns, and Other, each containing the sum for that value in the transactions table, and then a row for each Amperity ID and brand:

SELECT
  amperity_id
  ,brand
  ,PUR AS Purchases
  ,RET AS Returns
  ,OTH AS Other
FROM transactions
PIVOT
  (
    SUM(net_amount)
    FOR category IN ('PUR', 'RET', 'OTH')
  )

Functions

A function is a SQL statement that accepts input parameters, performs actions, and then returns results. This section highlights some useful functions for use when using Spark SQL in Amperity.

Note

This section highlights a very small subset of the complete list of functions available in Spark SQL, version 3.1.2 , many of which can be useful depending on the type of query.

The following list contains some of the most frequently used functions for building segments via the SQL Segment Editor (alphabetized):

ABS()

Use the ABS(expression) function to return the absolute value of “expression”.

Coalesce to empty (non-NULL) values

The following example is a common table expression that uses the ABS() function within a CASE statement to determine the how many days have passed since a customer’s most recent purchase, and then to apply a lifecycle status to each set of customers who fall into each range: active, lapsed, dormant, prospect, or no status.

customer_lifecycle_status AS (
  SELECT
    c.amperity_id
    ,CASE
      WHEN ABS(DATEDIFF(ta.latest_order_datetime, current_date)) <= 365 THEN 'active'
      WHEN ABS(DATEDIFF(ta.latest_order_datetime, current_date)) > 365 THEN ''
      AND ABS(DATEDIFF(ta.latest_order_datetime, current_date)) <= 730 THEN 'lapsed'
      WHEN ABS(DATEDIFF(ta.latest_order_datetime, current_date)) > 730 THEN 'dormant'
      ELSE 'prospect'
    END AS lifecycle_status
  FROM Customer_360 c
  LEFT JOIN Transaction_Attributes ta ON c.amperity_id = ta.amperity_id
),

AVG()

Use the AVG(x) function to return the average of all input values.

CAST()

Use the CAST(field-or-expression AS type) function to cast a “field” or an “expression” as “type”. For example, to cast a varchar to a numeric or cast a date or datetime to a timestamp.

For example:

CAST(order_date AS timestamp)

COALESCE()

Use the COALESCE(column_name, value) function to return the first non-NULL argument. Use a value of "", 0, or 1.

Tip

Some functions will fail if they encounter NULL values. Use this function prior to the CONCAT() and SUM() functions to ensure they do not encounter NULL values.

Note

The CONCAT() function will return NULL if the value of any field is NULL. Use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example, use:

CONCAT(COALESCE(firstname, ""), " ", COALESE(lastname, ""))

to concatenate a FullName field.

Coalesce to empty (non-NULL) values

,COALESCE(column_name, "")

Coalesce to 0 (non-NULL) values

,COALESCE(column_name, 0)

Coalesce to 1 (non-NULL) values

,COALESCE(column_name, 1)

Multiple delimiter types

The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores (_) instead of a colon (:) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores (and any related custom domain tables), and may return NULL values.

If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:

  • bad_addresses

  • bad_emails

  • bad_phones

For each SELECT statement, change:

REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,

to:

COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,

This update will allow these SELECT statements to continue using a regular expression to find domain tables, and then use * to find custom domain tables and will prevent NULL values from being returned.

Get the first non-NULL value

,COALESCE(order_quantity, canceled_quantity, order_quantity) AS qty_coalesce

Add columns (accounting for NULL values)

COALESCE(order_quantity,0)
  + COALESCE(canceled_quantity,0)
  + COALESCE(canceled_quantity,0)
AS qty_sum

Standardize values for USA states

The following example standardizes values for all fifty states in the United States to only a two-character value (AK, AL, AR, etc.). The CASE statement uses the following strings to determine:

  1. The correct two-character value

  2. The correct spelled out value

  3. Other variations that appear in the data, which may be common (or known) abbreviations, misspellings, slang, or shortcuts

CASE
  WHEN UPPER(TRIM(COALESCE(state))) IN ('AL','ALABAMA', 'BAMA') THEN 'AL'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('AK','ALASKA') THEN 'AK'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('AZ','ARIZONA') THEN 'AZ'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('AR','ARKANSAS') THEN 'AR'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('CA','CALIF','CALIFORNIA','CALIFORNIZ','CALIFRONIA') THEN 'CA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('CO','COLORADO') THEN 'CO'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('CT','CONNECTICUT', 'CONNETICUT') THEN 'CT'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('DE','DELAWARE', 'DELWARE') THEN 'DE'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('FL','FLORIDA') THEN 'FL'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('GA','GEORGIA') THEN 'GA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('HI','HAWAII', 'HAWAI\'I') THEN 'HI'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('ID','IDAHO') THEN 'ID'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('IL','ILLINOIS') THEN 'IL'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('IN','INDIANA') THEN 'IN'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('IA','IOWA') THEN 'IA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('KS','KANSAS') THEN 'KS'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('KY','KENTUCKY') THEN 'KY'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('LA','LOUISIANA', 'LOUSIANA') THEN 'LA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('ME','MAINE') THEN 'ME'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MD','MARYLAND') THEN 'MD'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MA','MASS','MASSACHUSETES','MASSACHUSETTS','MASSACHUSETTES') THEN 'MA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MI','MICHIGAN') THEN 'MI'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MN','MINNESOTA') THEN 'MN'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MS','MISSISSIPPI') THEN 'MS'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MO','MISSOURI') THEN 'MO'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('MT','MONTANA') THEN 'MT'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NE','NEBRASKA') THEN 'NE'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NV','NEVADA') THEN 'NV'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NH','NEW HAMPSHIRE') THEN 'NH'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NJ','NEW JERSEY', 'JERSEY') THEN 'NJ'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NM','NEW MEXICO') THEN 'NM'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NY','NEW YORK') THEN 'NY'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('NC','NORTH CAROLINA') THEN 'NC'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('ND','NORTH DAKOTA') THEN 'ND'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('OH','OHIO') THEN 'OH'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('OK','OKLAHOMA') THEN 'OK'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('OR','ORE','OREGON','OREGONE') THEN 'OR'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('PA','PENNSYLVANIA') THEN 'PA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('RI','RHODE ISLAND') THEN 'RI'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('SC','SOUTH CAROLINA') THEN 'SC'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('SD','SOUTH DAKOTA') THEN 'SD'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('TN','TENNESSEE') THEN 'TN'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('TX','TEXAS') THEN 'TX'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('UT','UTAH') THEN 'UT'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('VT','VERMONT') THEN 'VT'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('VA','VIRGINIA') THEN 'VA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('WA','WASHINGTON') THEN 'WA'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('WV','WEST VIRGINIA') THEN 'WV'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('WI','WISCONSIN') THEN 'WI'
  WHEN UPPER(TRIM(COALESCE(state))) IN ('WY','WYOMING') THEN 'WY'
ELSE NULL

COLLECT_LIST()

Use the COLLECT_LIST(expression) function to collect and return a list of non-unique elements, as defined by “expression”.

Note

This function is non-deterministic because it depends on the order of the list of non-unique elements that are returned, which may be non-deterministic. Wrap the COLLECT_LIST() function in a SORT_ARRAY() function to return a deterministic list of non-unique elements.

Return array of ordered IDs

Collect a list from the Unified Transactions table, and then returned it as an array of order IDs:

SELECT
  COLLECT_LIST(ut.order_id) AS array_order_ids
FROM
  Unified_Transactions AS ut

COLLECT_SET()

Use the COLLECT_LIST(expression) function to collect and return a set of unique elements.

Note

This function is non-deterministic because it depends on the order of the list of unique elements that are returned, which may be non-deterministic. Wrap the COLLECT_SET() function in a SORT_ARRAY() function to return a deterministic list of unique elements.

CONCAT()

Use the CONCAT(column1, column2, …, columnN) function to return a concatenation of all columns. This function returns NULL if any value of any field is NULL. Use the COALESCE() function to map fields to zero-length string values prior to using this function.

Concatenate to full name

,CONCAT(COALESCE(firstname, ""), " ", COALESE(lastname, ""))

Concatenate order IDs and dates

,CONCAT(order_id, order_date) AS order_id

Add a +1 to phone numbers

The following CASE statement uses the TRIM() function to find empty or NULL phone numbers, sets them to NULL, and then standardizes all phone numbers to add a +1:

CASE
  WHEN TRIM(phone) LIKE ''
  OR TRIM(phone) IS NULL
  THEN NULL
  ELSE CONCAT('+1','',REGEXP_REPLACE(phone,'[^0123456789]',''))
END AS phone

CONCAT_WS()

Use the CONCAT_WS(separator, [string | array(string)]+ function to return a concatenation of “string” or an “array” of “string”, as separated by “separator”.

Combine five- and four-digit postal codes

Some data sources do not contain fields for complete postal codes and instead contain fields that separate the five- and four-digit codes. Some use cases require a single field for postal codes that includes both components, after which the postal semantic tag is applied.

The following example shows how to use a CASE statement to do the following:

  1. Find situations where the five- and four-digit codes are both present, and then combine them.

  2. Find situations where only the five-digit code is present, and then use only the five-digit code.

  3. Uses the CONCAT_WS function to return “zip_code” and “zip_code_plus_four” separated by “-“.

  4. Use NULL for situations where the five-digit code is not present.

  5. Return as the postal field, to which the postal semantic tag may be applied.

,CASE
  WHEN zip_code != '(NULL)' AND zip_code_plus_four != '(NULL)'
  THEN CONCAT_WS('-',zip_code, zip_code_plus_four)

  WHEN zip_code != '(NULL)'
  THEN zip_code

  ELSE NULL
END AS postal

Combine five- and four-digit postal codes

The following example concatenates columns into a single field, where “ACME” represents a hardcoded value present in the filename.

SELECT
  'ACME' AS Brand_Name
  ,CONCAT_WS('-',
             ,'ACME'
             ,card_number) AS Loyalty_ID
  ,CONCAT_WS(' '
             ,residential_unit_number
             ,residential_street_number
             ,residential_street) AS residential_address
  ,*
FROM Loyalty_ACME
CLUSTER BY card_number

COUNT()

Use the COUNT(*) function to return the total number of rows, including rows with unique and non-NULL values.

Use the COUNT(expression) function to return the number of non-NULL rows for “expression”.

COUNT(*) AS `L12M_order_channels`,

or:

-- Join for one-time customers
LEFT JOIN (
  SELECT
    amperity_id
    ,CASE
       WHEN COUNT(*) = 1 THEN '1X Customer'
       WHEN COUNT(*) > 1 THEN 'Repeat Customer'
    END
    AS repeat_purchaser
  FROM Unified_Transactions
  GROUP BY amperity_id
) AS onetime ON onetime.amperity_id = customers.amperity_id

or:

COUNT(channel) AS `channel_frequency`

COUNT(DISTINCT)

Use the COUNT(DISTINCT expression) function to return the number of unique and non-NULL rows for “expression”. For example:

,COUNT(DISTINCT tx.channel) AS `L12M_order_channels`

CURRENT_DATE()

Use the CURRENT_DATE() function to return the current date at the start of the query evaluation.

Note

You may use CURRENT_DATE without the parentheses to return the current date at the start of query evaluation.

Important

A table that uses the CURRENT_DATE() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.

CURRENT_TIMESTAMP()

Use the CURRENT_TIMESTAMP() function to return the current timestamp at the start of the query evaluation. All calls of CURRENT_TIMESTAMP() within the same query will return the same timestamp.

Note

You may use CURRENT_TIMESTAMP without the parentheses to return the current timestamp at the start of query evaluation.

Important

A table that uses the CURRENT_TIMESTAMP() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.

DATE()

Use the DATE(expression) function to cast the value of “expression” to a date data type.

Compare dates by date range

The following SELECT statement returns all orders between November 22, 2019 and November 21, 2020. The DATE() function compares the value of the order_datetime field in customer data to the date range.

SELECT
  amperity_id
  ,MIN(order_datetime) AS min_order_date
  ,COUNT(DISTINCT order_id) AS orders
  ,SUM(order_revenue) AS order_revenue
FROM Unified_Transactions
WHERE amperity_id IS NOT NULL AND
  DATE(order_datetime) >= DATE('2019-11-22') AND
  DATE(order_datetime) <= DATE('2020-11-21')

Convert DATETIME() to DATE()

The following examples will convert attributes that are in DATETIME() to DATE().

To do the type conversion once, use:

DATE(MAX(order_datetime))

To do the type conversion for every row, use:

MAX(DATE(order_datetime))

For example, to return order_datetime as a date (and not a datetime) for each Amperity ID:

SELECT
  ut.amperity_id
  ,DATE(MAX(ut.order_datetime)) AS latest_order_date
FROM Unified_Transactions AS ut
GROUP BY 1

DATE_ADD()

Use the DATE_ADD(startdate, number) function to return the “number” of days after “startdate”.

DATE_TRUNC()

Use the DATE_TRUNC(format, timestamp) function to return a “timestamp” truncated to “format”. A timestamp is represented by a column in Amperity data that contains a valid timestamp.

Format may be any of the following:

  • YEAR

  • YYYY

  • YY

  • MON

  • MONTH

  • MM

  • DAY

  • DD

Note

HOUR, MINUTE, SECOND, WEEK, and QUARTER are also valid format values.

Return purchases by day

To return last purchase by day:

DATE_TRUNC('DAY',LastPurchase)

Return purchases by month

To return last purchase by month:

DATE_TRUNC('MONTH',LastPurchase)

Return purchases by year

To return last purchase by year:

DATE_TRUNC('YEAR',LastPurchase)

DATEDIFF()

Use the DATEDIFF(enddate, startdate) function to return the number of days between “startdate” and “enddate”.

Find early repeat purchasers

SELECT
  amperity_id,
  CASE
    WHEN DATEDIFF(second_order_date, first_order_date) <= 30 THEN true
    WHEN DATEDIFF(second_order_date, first_order_date) > 30 THEN false
    WHEN DATEDIFF(CURRENT_DATE(), first_order_date) > 30 THEN false
  END AS early_repeat_purchaser
FROM
  Transaction_Attributes

Find transactions within 30 days

L30D_transactions AS (
  SELECT
    *
  FROM
    Unified_Transactions
  WHERE
    DATEDIFF(CURRENT_DATE(), order_date) <= 30
),

Categorize by named age group

CASE
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 7300 THEN 'Gen Z'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 14600 THEN 'Millenials'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 21900 THEN 'Gen X'
  WHEN DATEDIFF(CURRENT_DATE(), customers.birth_dt) < 29200 THEN 'Boomers'
ELSE 'Older than Boomers'

DECODE()

Use the DECODE(binary,charset) function to decode the value of “binary” using the character set specified by “charset”, such as US-ASCII, UTF-16, or UTF-8.

Tip

Use the UNBASE64() function to convert a string to binary by using it in place of “binary” within the DECODE() function:

DECODE(UNBASE64(column_name),charset)

Identify email not using US-ASCII

The following CASE statement decodes customer emails, identifies customer emails that are not encoded using the US-ASCII character set, and then sets them to NULL.

CASE
  WHEN UPPER(DECODE(UNBASE64(customer_email),'US-ASCII')) = 'UNDEFINED'
  THEN NULL
  ELSE UPPER(DECODE(UNBASE64(customer_email),'US-ASCII'))
END AS email,

DENSE_RANK()

Use the DENSE_RANK() function to compute the rank of a value in a group of values. DENSE_RANK will assign the same rank to rows with the same value and will not create gaps in the ranked sequence.

Note

The DENSE_RANK and RANK() functions both assign a rank to rows with the same value. The difference is that RANK() will create gaps in the sequence. The following example shows rows that are ranked in ascending order by column B:

------- ------- ------------ ------
 col_A   col_B   dense_rank   rank
------- ------- ------------ ------
 A       30      3            4
 A       20      2            3
 A       20      2            3
 A       10      1            1
 A       10      1            1
------- ------- ------------ ------

EXPLODE()

Use the EXPLODE(expression) function to use “expression” to:

  1. Separate elements of an array into multiple columns

  2. Separate elements of a map into multiple rows and columns

Load XML data as ingest query

Note

This example uses an example XML file as the data source for sales transactions.

Use the EXPLODE() function to process sales transaction data into a table using an ingest query similar to:

WITH explodedData AS (
  SELECT
    salesTransactionId
    ,EXPLODE(salesOrder.tenders.tender) AS tender FROM PosXml
)

SELECT
  salesTransactionId
  ,tender.type AS type
  ,tender.amount AS amount
FROM
  explodedData

FIRST_VALUE()

Use the FIRST_VALUE(expression) function to return the first value of “expression”.

Use the FIRST_VALUE(expression, isIgnoreNull) function to return only non-NULL values.

Prioritize Amperity ID by order ID

The following SQL prioritizes the Amperity ID by the most recent order ID:

WITH
  amp_priority AS (
    SELECT DISTINCT
      ut.order_id
      ,ut.datasource
      ,FIRST_VALUE(uc.amperity_id) OVER (
        PARTITION BY ut.order_id, ut.datasource
        ORDER BY uc.update_dt DESC
      ) AS amperity_id
    FROM (SELECT amperity_id, datasource, update_dt FROM Unified_Coalesced) uc
    JOIN (SELECT amperity_id, datasource, order_id FROM Unified_Transactions) ut
    ON uc.amperity_id = ut.amperity_id
  )

SELECT t.* FROM table_name t
JOIN amp_priority ap ON t.order_id=ap.order_id

IF()

Use the IF(expression1, expression2, expression3) function to return “expression2” when “expression1” evaluates to “true”, otherwise return “expression3”.

Build birthdate

If incoming data contains birthdate data split by day, month, and year, you can build a complete birthdate using an ingest query. For example, incoming data has the following fields:

----- ------- ------
 day   month  year
----- ------- ------
 08    12     1969
 11    25     1978
 09    15     1981
----- ------- ------

The following example uses the IF() function to concatenate three fields together using a forward slash ( / ) as a separator:

SELECT
  *
  ,IF(birth_month != '0' AND birth_day != '0' AND  birth_year != '0',
      birth_month||'/'||birth_day||'/'||birth_year, NULL) AS birthdate
FROM Source_Table

ISNOTNULL()

Use the ISNOTNULL(expression) function to return true when “expression” is NOT NULL.

ISNULL()

Use the ISNULL(expression) function to return true when “expression” is NULL.

LAST()

Use the LAST(expression) function to return the last value of “expression”.

Use the LAST(expression, isIgnoreNull) function to return only non-NULL values.

Most recent store ID

LAST(store_id) AS last_store_id

LEAST()

Use the LEAST(column_name, column_name, …) function to return the column with the lowest value among the values of all named columns.

LEFT()

Use the LEFT(string, length) function to return the leftmost “length” from “string”. An empty string is returned when “length” is less than or equal to 0.

Set partitions by characters

end-before:

Note

Partitioning LEFT(amperity_id, 1) to use more characters–LEFT(amperity_id, 2)–will improve distribution and the error rate.

LENGTH()

Use the LENGTH(expression) function to return one of the following:

  • The character length for string data, inclusive of trailing spaces.

  • The number of bytes for binary data, inclusive of binary zeroes.

Return states as 2 characters

The following example uses the LENGTH() function to identify columns with two character values for states (AK, AL, AR, etc.), and then sets columns without two character values to NULL:

CASE
  WHEN LENGTH(state) = 2
  THEN state
  ELSE NULL
END AS state

LOCATE

Use the LOCATE(substring, string[, position]) function to return the first position of the first occurrence of “substring” in “string” after “position”. The given position and return value are 1-based.

Note

The STRPOS() function is the same as the LOCATE() function.

LOWER()

Use the LOWER(string) function to convert “string” to lowercase.

LPAD()

Use the LPAD(string, length, padding) function to return “string” left-padded with “padding” to a length of “length”.

Note

If “string” is longer than “length”, the return value is shortened to “length”.

Clean postal codes prior to Stitch

The following example will ensure postal codes have consistent character counts prior to making them available to Stitch:

,LPAD(postal,5,'0') as postal

Note

This example is for United States-only postal codes, requires using a custom domain table, and is only useful for ad hoc cleanup of a small number of postal codes. This example is not a substitute for address standardization.

Return first five characters

The following example returns only the first five characters from the stay_location field:

,LPAD(stay_location,5,'0') AS stay_location

MAX()

Use the MAX(expression) function to return the maximum value of “expression”.

MIN()

Use the MIN(expression) function to return the minimum value of “expression”.

MONTHS_BETWEEN()

Use the MONTHS_BETWEEN(timestamp1, timestamp2) function to return the number of months in-between two timestamps. If “timestamp1” is later than “timestamp2”, the value returned is positive.

Months between order and today

L12M_transactions AS (
  SELECT
    *
  FROM
    Unified_Transactions
  WHERE
    MONTHS_BETWEEN(CURRENT_DATE(), order_date) <= 12
),

NOW()

Use the NOW() function to return the current timestamp as of the start of the query.

NTILE()

Use the NTILE(n) function to divide rows for a window partition into “n” buckets, ranging from 1 to n.

Use the NTILE(n, int) function to specify the number of buckets into which rows are divided as “int”.

Compute ntiles over large datasets

The following example shows a window function that divides rows into 10 buckets and uses a combination of PARTITION BY and the LEFT() function to define the buckets using the first characters of the Amperity ID:

Top_10 AS (
  SELECT DISTINCT
    amperity_id
    ,CLV_Spend
    ,NTILE(10) OVER (PARTITION BY LEFT(amperity_id, 1) ORDER BY CLV_Spend DESC) AS CLVPercentile

  FROM (
    SELECT
      amperity_id
      ,SUM(Customer_Lifetime_Spend) AS CLV_Spend
    FROM customer_360_v
    WHERE Customer_Lifetime_Spend IS NOT NULL
    GROUP BY amperity_id
  )
),

Partition predicted CLV by brand

The following example shows how to separate values into brand tiers using predicted customer lifetime value.

NTILE(100) OVER (PARTITION BY brand ORDER BY predicted_clv desc, _uuid_pk)

NULLIF()

Use the NULLIF(expression1, expression2) function to return NULL if “expression1” is equal to “expression2”.

Return NULL for empty string values

The following SELECT statement is an ingest query that returns NULL if the field is empty after trimming whitespace from before and after the initial value:

SELECT
  NULLIF(TRIM(BrandName),'') AS BrandName
  ,NULLIF(TRIM(AttributeName),'') AS AttributeName
  ,NULLIF(TRIM(Priority),'') AS Priority
FROM
  Customer_Table

Bucket RFM scores

L12M_rfm AS (
  SELECT
    a.amperity_id
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY b.order_date ASC) AS `L12M_recency_score`
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY a.L12M_order_frequency ASC) AS `L12M_frequency_score`
    ,NTILE(5) OVER (PARTITION BY amperity_id ORDER BY a.L12M_order_total_amount ASC) AS `L12M_monetary_score`
  FROM
    L12M_attributes AS `a`
  JOIN (
    SELECT
      amperity_id
      ,order_date
    FROM
      latest_order
    WHERE
      MONTHS_BETWEEN(CURRENT_DATE(), order_date) <= 12
  ) AS `b` ON a.amperity_id = b.amperity_id
)

RAND()

Use the RAND([0,1]) function to return a random value with independent and identically distributed (IID) uniformly distributed values.

Important

A table that uses the RAND() function is considered a “volatile table” and is always refreshed when running a database with a normal refresh.

RANK()

Use the RANK(value, value, …) function to return the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row.

REGEXP_EXTRACT()

Use the REGEXP_EXTRACT(string, pattern, id) function to replace every instance of the substring matched by the “regex” pattern from “string”.

Extract first and last names

Some data sources do not contain fields that can be directly assigned the given-name and surname semantic tags. These tags are important to downstream Stitch processes. When a field is present in the data source that contains data that can be tagged with the full-name semantic tag, you can use domain SQL to extract the first and last name details from that field, add them as new columns, and then apply the correct semantic tags.

Use the REGEXP_EXTRACT() function to:

  1. Trim whitespace from before (or after) the first and last names.

  2. Individually extract the first and last names from the field that contains the full name.

  3. Add columns for the first and last names.

The following example shows part of a SELECT statement that extracts first and last names from the BILLING_NAME field, and then adds columns for first and last names:

,REGEXP_EXTRACT(TRIM(BILLING_NAME),'(^\\S*)',1) AS GIVEN_NAME
,REGEXP_EXTRACT(TRIM(BILLING_NAME),'((?<=\\s).*)',1) AS SURNAME
,TRIM(BILLING_NAME) AS `BILLING_NAME`

REGEXP_REPLACE()

Use the REGEXP_REPLACE(string, regex) function to remove every instance of the substring matched by the “regex” pattern from “string”.

Use the REGEXP_REPLACE(string, regex, replace) function to replace every instance of the substring.

Remove whitespace

NULLIF(REGEXP_REPLACE(field, '^\\s*(\\S.*\\S)\\s*$', '\\1'), '')

RLIKE()

Use the RLIKE(string, ‘expression’) function to return true when “string” matches “expression”.

Caution

A backslash ( \ ) contained within a regular expression MUST be escaped using another backslash. For example, a regular expression that matches a single numeric digit should be \\d and not \d.

Validate email addresses

The following example shows using the RLIKE() function within a CASE statement to return valid email addresses:

CASE
  WHEN email RLIKE('^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$')
  AND email_completion > 0 THEN true
  ELSE false
END AS contactable_email

Validate phone numbers

The following example shows using the RLIKE() function within a CASE statement to return valid phone numbers:

CASE
  WHEN phone RLIKE('^(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}$')
  AND phone_completion > 0 THEN true
  ELSE false
END AS contactable_phone

ROUND()

Use the ROUND(expression, x) function to return “expression” rounded to “x” decimal places.

Return rounded sum

The following example returns the sum of a, b, and c rounded to two decimal places as the ABC score:

ROUND((CAST(a AS float) + CAST(b AS float) + CAST(c AS float)) / 3, 2) AS AbcScore,

ROW_NUMBER()

Use the ROW_NUMBER() function within a window function to assign a unique, sequential number to each row, starting with 1. For example:

ROW_NUMBER() OVER (PARTITION BY amperity_id ORDER BY order_date DESC) AS `rn`

SHA2()

Use the SHA2(expression, 256) function to apply one-way SHA-256 hashes to fields, such as those that contain PII data to ensure they are no longer recognizable as valid PII.

For example:

SHA2(LOWER(TRIM(email)),256) AS email

Rehash PII data

Some segments send results downstream to support CCPA and/or GDPR workflows. Some CCPA and GDPR workflows send this data back to Amperity, which typically requires the data to be hashed using a domain table.

For example, to hash the name, email, and phone fields in a table named “tohash_ccpa”:

SELECT
  *
  ,SHA2(UPPER(TRIM(firstname))) AS Hash_firstname
  ,SHA2(UPPER(TRIM(lastname))) AS Hash_lastname
  ,SHA2(UPPER(TRIM(email))) AS Hash_email
  ,SHA2(UPPER(TRIM(phone_number))) AS Hash_phone
FROM tohash_ccpa

SPLIT()

Use the SPLIT(string, regex) function to split a “string” at occurances that match “regex”, and then return the results.

Note

In Spark SQL, the index starts at 0.

Split using index position

Specify an index to split a string at occurrences that match the index. If the index is larger than the number of available characters, or if the specified index position is out of range, the function will return NULL. For example:

SPLIT(GIFTCARDCODE,'-')[4]

Note

In Spark SQL, the index starts at 0.

Tip

Use the SPLIT_PART() function in Presto SQL. For example:

TRIM(SPLIT_PART(GIFTCARDCODE, '-', 1))

In Presto SQL, the index starts at 1.

Return username from email

To return only the username from an email address (the characters before the @ symbol):

SPLIT(REGEXP_REPLACE(LOWER(email), '[.]+', ''), '@')[0] AS email_username

Parse fields with multiple separators

Sometimes incoming data will contain data that should be tagged with more than one semantic tag, but also contain different separators within the incoming field. For example:

----------- ---------- ------------------- ------- ---------------------- -------
 firstName   lastName   street              poBox   location               zip
----------- ---------- ------------------- ------- ---------------------- -------
 John        Smith      123 Main            #101    US - Yelm , WA         98597
 Andy        Jones      456 South Avenue            US - Bellingham, WA    98115
 Anne        Andersen   999 S. Bergen Way           US - Seattle ,WA       98104
----------- ---------- ------------------- ------- ---------------------- -------

where “location” represents country, city, and state, always separated with a dash ( - ) between the country and city, and then a comma ( , ) between the city and the state. Some fields contain extra white space between and around the strings.

The “location” field needs to be split into individual city, state, and country fields, the two delimiters can be removed, along with the extra whitespace.

Use domain SQL similar to the following:

SELECT
  location
  ,TRIM(SPLIT(location, '-')[0]) AS country
  ,TRIM(SPLIT(SPLIT(location, '-')[1],',')[0]) AS city
  ,TRIM(SPLIT(location, ',')[1]) AS state
FROM domain_table

and then tag the city, state, and country fields with the appropriate semantic tags.

STRPOS()

Use the STRPOS(substring, string[, position]) function to return the first position of the first occurrence of “substring” in “string” after “position”. The given position and return value are 1-based.

Note

The LOCATE() function is the same as the STRPOS() function.

SUBSTR()

Use the SUBSTR(string, position, length) function to return the substring of “string” that starts at “position” for the specified “length”.

Return string as lowercase

To return as lowercase the string contained in the “Gender” field, and then only the first character of that string:

SUBSTR(LOWER(Gender),1,1) as GEN

SUBSTRING()

Use the SUBSTRING(string, position[, length]) function to return the substring of “string” that starts at “position” for the specified “length”.

Note

The “length” parameter is optional. Use it to return only the number of characters specified by “length”. When not specified, SUBSTRING() will use “position” to determine the number of characters to return. For example:

SELECT SUBSTRING('Amperity', 5)

returns “erity”.

SELECT SUBSTRING('Amperity', -3)

returns “ity”.

SELECT SUBSTRING('Amperity', 5, 1)

returns “e”.

Remove country code from phone numbers

The following example returns phone numbers from multiple tables, and then removes “+1 “, “+1-”, “1 “ and “1-“.

SELECT
  COALESCE(email,'NULL')||COALESCE(phone,'NULL') AS pk
  ,MAX(email) AS email
  ,MAX(phone) AS phone
FROM (
  SELECT
    sms.email
    ,CASE WHEN sms.phone LIKE '+1%' THEN SUBSTRING(sms.phone,3)
          WHEN sms.phone LIKE '1%' THEN SUBSTRING(sms.phone,2)
          ELSE sms.phone
    END AS phone
  FROM Attentive_Mobile_SMS sms
  UNION
  SELECT
    sub.Email
    ,CASE WHEN sub.phone LIKE '+1%' THEN SUBSTRING(sub.phone,3)
          WHEN sub.phone LIKE '1%' THEN SUBSTRING(sub.phone,2)
          ELSE sub.phone
    END AS phone
  FROM Attentive_Mobile_Subscribers sub
)
GROUP BY 1

Parse fields from DAT file

The following example shows an ingest query that parses fields from a DAT file. Each field (fields 1-6) has a starting point within the DAT file (1, 21, 52, 63, 69, 70) and a length (20, 30, 10, 15, 1, 140). Use an ordinal ( _c0 ) to define each source field within the DAT file.

SELECT
  ,NULLIF(TRIM(SUBSTR(`_c0`,1,20)),'') AS Field1
  ,NULLIF(TRIM(SUBSTR(`_c0`,21,30)),'') AS Field2
  ,NULLIF(TRIM(SUBSTR(`_c0`,52,10)),'') AS Field3
  ,NULLIF(TRIM(SUBSTR(`_c0`,63,15)),'') AS Field4
  ,NULLIF(TRIM(SUBSTR(`_c0`,69,1)),'') AS Field5
  ,NULLIF(TRIM(SUBSTR(`_c0`,70,140)),'') AS Field6
FROM DAT_FILE_NAME

SUM()

Use the SUM(expression) to return a sum calculated from values defined in “expression”.

Use the COALESCE() function to map fields with NULL values to zero prior to using this function.

Note

The SUM() function will return NULL if the value of any field is NULL. In some situations you must use the COALESCE() function to coalesce to a zero-length string prior to concatenation. For example:

SUM(COALESCE(lifetimeValue,0))

TO_DATE()

Caution

Do not use this function when writing ingest queries. Instead, use the datetime picker in the Feed Editor to define a date.

Use the TO_DATE(timestamp) function to parse “timestamp”.

Use the TO_DATE(timestamp +/- interval) function to parse “timestamp”, and then add ( + ) or subtract ( - ) “interval”, where “interval” is one of “hours”, “minutes”, “days”, and so on.

A date format is not required.

Tip

This function returns NULL when the input is invalid.

Find a date six hours ago

Use a minus sign ( - ) to subtract from a value. For example. to find 6 hours before the current time:

TO_DATE(CURRENT_TIMESTAMP - 6 HOURS)

Correct 2-digit years

Converting an incorrectly ingested date in the year 19xx is not as simple as adding 365*2000 years. Lots of things can go wrong. In some cases you may need to correct birthdates so that people are not born in the future. The recommended approach is to parse the date correctly, and then re-parse it.

The following example shows two ways to do this:

  1. The first TO_DATE() function shows how to use the current year as a breakpoint, which assumes that birthdates don’t occur in the future and that the vast majority of people are less than 100 years old.

  2. The second TO_DATE() function shows to handle birthdates that occur after the year 2000.

SELECT
  customer_id
  ,birth_dt
  ,TO_DATE(
    CONCAT_WS(
      '-'
      ,INT(IF(SPLIT(birth_dt, "/")[2] > INT(DATE_FORMAT(CURRENT_DATE(), 'yy')), SPLIT(birth_dt, "/")[2] + 1900, SPLIT(birth_dt, "/")[2] + 2000))
      ,SPLIT(birth_dt, "/")[0]
      ,SPLIT(birth_dt, "/")[1]
    )
    ,'yyyy-MM-dd'
  ) AS birth_dt_corrected
  ,TO_DATE(
    CONCAT_WS(
      '-'
      ,INT(SPLIT(birth_dt, "/")[2]) + 2000
      ,SPLIT(birth_dt, "/")[0]
      ,SPLIT(birth_dt, "/")[1]
    )
    ,'yyyy-MM-dd'
  ) AS birth_dt_twentyfirstcentury
FROM CustomersPOS

TO_TIMESTAMP()

Caution

Do not use this function when writing ingest queries. Instead, use the datetime picker in the Feed Editor to define a timestamp.

Use the TO_TIMESTAMP(expression, format) function to return “expression” in the (optional) “format”.

TRANSLATE()

Use the TRANSLATE(input, from, to) function to replace the “from” characters in “input” with the “to” characters.

This function is useful for direct replacement of bad characters, simple cleanup tasks, and other situations that do not require any context or the use of regular expressions.

For example, TRANSLATE(abcde, cd, 45) will translate “abcde” to “ab45e”.

Tip

Use the REGEXP_REPLACE() function to replace characters in string values using a regular expression.

TRIM()

Use the TRIM(string) function to remove leading and trailing characters from “string”.

Use LEADING, TRAILING, or BOTH as keywords to define the location from which characters are removed, and then FROM to specify the string to which the keyword is applied. BOTH is the default keyword.

The TRIM(string) function assumes it should trim spaces, which means all that is required to trim leading and/or trailing spaces is to specify the column name. For example:

TRIM('column_name')

Clean phone numbers

The following example uses the TRIM() function to find empty or NULL phone numbers, sets them to NULL, and then standardizes all phone numbers to add a +1:

CASE
  WHEN TRIM(phone) LIKE ''
  OR TRIM(phone) IS NULL
  THEN NULL
  ELSE CONCAT('+1','',REGEXP_REPLACE(phone,'[^0123456789]',''))
END AS phone

Trim leading characters

SELECT TRIM(LEADING '123' FROM '123string')

returns:

string

Trim trailing characters

SELECT TRIM(TRAILING 'ing' FROM '123string')

returns:

123str

Trim leading and trailing characters

SELECT TRIM(BOTH '12g' FROM '123string')

returns:

3strin

Trim, then set to uppercase

TRIM(UPPER(field)) AS field

UNIX_TIMESTAMP()

Use the UNIX_TIMESTAMP(time, format) function to return the UNIX timestamp for “time” in the specified “format”.

For example:

UNIX_TIMESTAMP('2021-11-10', 'yyyy-MM-dd');

will return:

1636612010

UPPER()

Use the UPPER(string) function to convert “string” to uppercase.