About custom domain tables
Some customer data sources are only available in a state that requires the use of SQL to construct a complete record that can be made available to the Stitch process. This is often true with interaction records, which typically require some data shaping to map the data that is available in the data source to the semantic tags that are required by Amperity for transactions.
These semantic tags generate the Unified Itemized Transactions table, which is then used as the underlying reference for the Unified Transactions, Transaction Attributes, and Transaction Attributes Extended tables.
A custom domain table is built directly using Spark SQL to define a schema for that data source, after which semantic tags are applied and the primary key is identified. A custom domain table may reference other custom domain tables.
Note
When a database is run, any custom domain table that has changed is run first, and then Stitch runs. Stitch does not run if no custom domain tables have changed, or if the only custom domain tables that changed are not configured for Stitch.
Domain SQL
Domain SQL refers to using Spark SQL to reshape data that has been loaded to Amperity before making it available to downstream processes, such as Stitch or the customer 360 database. Domain SQL is often used to build a new table from existing domain tables, reshaping the data so that semantic tags for transactions and itemized transactions can be applied correctly.
Use cases
The following examples describe some of the more common use cases for domain SQL:
Note
These examples are not meant to be copied and pasted, but they should work for most tenants as a good starting point.
Combine day, month, year as birthdate
Some data sources do not contain fields for complete birthdates and instead contain values by day, month, and year in separate fields. These individual fields must be combined in order to use the birthdate semantic tag.
The following example shows an IF statement within a SELECT statement that finds the values in the day, month, and year fields, and then combines them into a field that captures the birthdate value as MM/DD/YYYY:
SELECT
  *
  ,IF(birth_day != '0' AND birth_month != '0' AND birth_year != '0',
    birth_month||'/'||birth_day||'/'||birth_year, NULL) AS birthdate
FROM table
Combine five- and four-digit postal codes
Some data sources do not contain fields for complete postal codes and instead contain fields that separate the five- and four-digit codes. Some use cases require a single field for postal codes that includes both components, after which the postal semantic tag is applied.
The following example shows how to use a CASE statement to do the following:
Find situations where the five- and four-digit codes are both present, and then combine them.
Find situations where only the five-digit code is present, and then use only the five-digit code.
Use the CONCAT_WS function to join “zip_code” and “zip_code_plus_four”, separated by a hyphen (-).
Use NULL for situations where the five-digit code is not present.
Return as the postal field, to which the postal semantic tag may be applied.
,CASE
  WHEN zip_code != '(NULL)' AND zip_code_plus_four != '(NULL)'
  THEN CONCAT_WS('-',zip_code, zip_code_plus_four)
  WHEN zip_code != '(NULL)'
  THEN zip_code
  ELSE NULL
END AS postal
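The branching in this CASE statement can be checked outside of Amperity with a small Python sketch. This is an illustration only, not Amperity code: `combine_postal` is a hypothetical helper, and it assumes (as the SQL does) that missing values arrive as the literal string '(NULL)'.

```python
MISSING = "(NULL)"  # assumption: the source exports missing values as this literal string


def combine_postal(zip_code, zip_code_plus_four):
    """Mirror the CASE expression: ZIP+4 when both parts exist, ZIP alone, else None."""
    has_zip = zip_code is not None and zip_code != MISSING
    has_plus_four = zip_code_plus_four is not None and zip_code_plus_four != MISSING
    if has_zip and has_plus_four:
        # Same result as CONCAT_WS('-', zip_code, zip_code_plus_four)
        return "-".join([zip_code, zip_code_plus_four])
    if has_zip:
        return zip_code
    return None  # ELSE NULL


print(combine_postal("98597", "1234"))    # 98597-1234
print(combine_postal("98597", "(NULL)"))  # 98597
print(combine_postal("(NULL)", "1234"))   # None
```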
Extract first and last names
Some data sources do not contain fields that can be directly assigned the given-name and surname semantic tags. These tags are important to downstream Stitch processes. When a field is present in the data source that contains data that can be tagged with the full-name semantic tag, you can use domain SQL to extract the first and last name details from that field, add them as new columns, and then apply the correct semantic tags.
Use the TRIM() and REGEXP_EXTRACT() functions to:
Trim whitespace from before (or after) the first and last names.
Individually extract the first and last names from the field that contains the full name.
Add columns for the first and last names.
The following example shows part of a SELECT statement that extracts first and last names from the BILLING_NAME field, and then adds columns for first and last names:
,REGEXP_EXTRACT(TRIM(BILLING_NAME),'(^\\S*)',1) AS GIVEN_NAME
,REGEXP_EXTRACT(TRIM(BILLING_NAME),'((?<=\\s).*)',1) AS SURNAME
,TRIM(BILLING_NAME) AS `BILLING_NAME`
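To see what the two patterns return for different name shapes, the same regular expressions can be tried in Python. This is a rough sketch rather than Amperity code: `regexp_extract` and `split_name` are hypothetical helpers, the sample names are invented, and the helper returns an empty string on no match to mirror Spark SQL's REGEXP_EXTRACT. The SQL writes `\\S` and `\\s` because the backslash is escaped inside the SQL string literal.

```python
import re


def regexp_extract(value, pattern, group):
    """Stand-in for Spark SQL REGEXP_EXTRACT: returns '' when nothing matches."""
    match = re.search(pattern, value)
    return match.group(group) if match else ""


def split_name(billing_name):
    trimmed = billing_name.strip(" ")  # TRIM() removes leading and trailing spaces
    # Characters before the first whitespace
    given_name = regexp_extract(trimmed, r"(^\S*)", 1)
    # Everything after the first whitespace character
    surname = regexp_extract(trimmed, r"((?<=\s).*)", 1)
    return given_name, surname


print(split_name("  John Smith "))   # ('John', 'Smith')
print(split_name("Mary Ann Jones"))  # ('Mary', 'Ann Jones')
print(split_name("Cher"))            # ('Cher', '')
```

Note that middle names end up in the surname column, and a single-word name produces an empty surname, so the results are worth reviewing before the semantic tags are applied.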
Hash PII data that has been resent to Amperity
Some segments send results downstream to support CCPA and/or GDPR workflows. Some CCPA and GDPR workflows send this data back to Amperity, which typically requires the data to be hashed using a domain table.
For example, to hash the name, email, and phone fields in a table named “tohash_ccpa”:
SELECT
  *
  -- SHA2() in Spark SQL requires a bit length; 256 (SHA-256) is assumed here
  ,SHA2(UPPER(TRIM(firstname)), 256) AS Hash_firstname
  ,SHA2(UPPER(TRIM(lastname)), 256) AS Hash_lastname
  ,SHA2(UPPER(TRIM(email)), 256) AS Hash_email
  ,SHA2(UPPER(TRIM(phone_number)), 256) AS Hash_phone
FROM tohash_ccpa
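The trim, uppercase, and hash steps can be reproduced in Python to confirm that differently formatted values normalize to the same hash. This is a minimal sketch under assumptions: `hash_pii` is a hypothetical helper, the email address is made up, and a 256-bit digest (SHA-256) is assumed because Spark SQL's SHA2 function takes the bit length as its second argument.

```python
import hashlib


def hash_pii(value):
    """Trim spaces, uppercase, then return the SHA-256 digest as lowercase hex."""
    if value is None:
        return None  # the SQL functions return NULL for NULL input
    normalized = value.strip(" ").upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Different spacing and casing normalize to the same hash.
print(hash_pii("  Jane.Doe@Example.com ") == hash_pii("JANE.DOE@EXAMPLE.COM"))  # True
print(len(hash_pii("jane.doe@example.com")))  # 64
```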
Parse fields with multiple separators
Incoming data sometimes contains a single field that should be tagged with more than one semantic tag and that uses different separators within the field. For example:
----------- ---------- ------------------- ------- ---------------------- -------
firstName   lastName   street              poBox   location               zip
----------- ---------- ------------------- ------- ---------------------- -------
John        Smith      123 Main            #101    US - Yelm , WA         98597
Andy        Jones      456 South Avenue            US - Bellingham, WA    98115
Anne        Andersen   999 S. Bergen Way           US - Seattle ,WA       98104
----------- ---------- ------------------- ------- ---------------------- -------
where “location” represents country, city, and state, always separated by a dash ( - ) between the country and the city, and by a comma ( , ) between the city and the state. Some fields contain extra whitespace between and around the strings.
The “location” field needs to be split into individual city, state, and country fields, with the two delimiters and the extra whitespace removed.
Use domain SQL similar to the following:
SELECT
  location
  ,TRIM(SPLIT(location, '-')[0]) AS country
  ,TRIM(SPLIT(SPLIT(location, '-')[1],',')[0]) AS city
  ,TRIM(SPLIT(location, ',')[1]) AS state
FROM domain_table
and then tag the city, state, and country fields with the appropriate semantic tags.
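The nested SPLIT() calls can be traced against the sample rows with a short Python sketch. It is illustrative only: `parse_location` is a hypothetical helper that mirrors the SQL expression by expression.

```python
def parse_location(location):
    """Split 'country - city , state' the same way the nested SPLIT() calls do."""
    country = location.split("-")[0].strip(" ")
    city = location.split("-")[1].split(",")[0].strip(" ")
    state = location.split(",")[1].strip(" ")
    return country, city, state


for raw in ["US - Yelm , WA", "US - Bellingham, WA", "US - Seattle ,WA"]:
    print(parse_location(raw))
# ('US', 'Yelm', 'WA')
# ('US', 'Bellingham', 'WA')
# ('US', 'Seattle', 'WA')
```

One limitation applies to both the sketch and the SQL: because the city is taken from the second dash-separated element, a city name that itself contains a dash (for example, "US - Winston-Salem, NC") would be cut off at that dash.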
Reference custom domain tables
A custom domain table may reference another custom domain table. For example:
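SELECT
  -- order_id exists in both tables, so each reference is qualified;
  -- the order_id_two alias is illustrative
  one.order_id
  ,two.order_id AS order_id_two
FROM custom_domain_table1 one
LEFT JOIN custom_domain_table2 two ON one.order_id = two.order_id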
Set non-US-ASCII email addresses to NULL
The following CASE statement decodes customer emails, identifies customer emails that are not encoded using the US-ASCII character set, and then sets them to NULL.
CASE
WHEN UPPER(DECODE(UNBASE64(customer_email),'US-ASCII')) = 'UNDEFINED'
THEN NULL
ELSE UPPER(DECODE(UNBASE64(customer_email),'US-ASCII'))
END AS email,
Standardize values for USA states
The following example standardizes values for all fifty states in the United States to a two-character value (AK, AL, AR, etc.). The CASE statement matches each state against the following strings:
The correct two-character value
The correct spelled out value
Other variations that appear in the data, which may be common (or known) abbreviations, misspellings, slang, or shortcuts
CASE
WHEN UPPER(TRIM(COALESCE(state))) IN ('AL','ALABAMA', 'BAMA') THEN 'AL'
WHEN UPPER(TRIM(COALESCE(state))) IN ('AK','ALASKA') THEN 'AK'
WHEN UPPER(TRIM(COALESCE(state))) IN ('AZ','ARIZONA') THEN 'AZ'
WHEN UPPER(TRIM(COALESCE(state))) IN ('AR','ARKANSAS') THEN 'AR'
WHEN UPPER(TRIM(COALESCE(state))) IN ('CA','CALIF','CALIFORNIA','CALIFORNIZ','CALIFRONIA') THEN 'CA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('CO','COLORADO') THEN 'CO'
WHEN UPPER(TRIM(COALESCE(state))) IN ('CT','CONNECTICUT', 'CONNETICUT') THEN 'CT'
WHEN UPPER(TRIM(COALESCE(state))) IN ('DE','DELAWARE', 'DELWARE') THEN 'DE'
WHEN UPPER(TRIM(COALESCE(state))) IN ('FL','FLORIDA') THEN 'FL'
WHEN UPPER(TRIM(COALESCE(state))) IN ('GA','GEORGIA') THEN 'GA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('HI','HAWAII', 'HAWAI\'I') THEN 'HI'
WHEN UPPER(TRIM(COALESCE(state))) IN ('ID','IDAHO') THEN 'ID'
WHEN UPPER(TRIM(COALESCE(state))) IN ('IL','ILLINOIS') THEN 'IL'
WHEN UPPER(TRIM(COALESCE(state))) IN ('IN','INDIANA') THEN 'IN'
WHEN UPPER(TRIM(COALESCE(state))) IN ('IA','IOWA') THEN 'IA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('KS','KANSAS') THEN 'KS'
WHEN UPPER(TRIM(COALESCE(state))) IN ('KY','KENTUCKY') THEN 'KY'
WHEN UPPER(TRIM(COALESCE(state))) IN ('LA','LOUISIANA', 'LOUSIANA') THEN 'LA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('ME','MAINE') THEN 'ME'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MD','MARYLAND') THEN 'MD'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MA','MASS','MASSACHUSETES','MASSACHUSETTS','MASSACHUSETTES') THEN 'MA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MI','MICHIGAN') THEN 'MI'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MN','MINNESOTA') THEN 'MN'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MS','MISSISSIPPI') THEN 'MS'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MO','MISSOURI') THEN 'MO'
WHEN UPPER(TRIM(COALESCE(state))) IN ('MT','MONTANA') THEN 'MT'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NE','NEBRASKA') THEN 'NE'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NV','NEVADA') THEN 'NV'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NH','NEW HAMPSHIRE') THEN 'NH'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NJ','NEW JERSEY', 'JERSEY') THEN 'NJ'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NM','NEW MEXICO') THEN 'NM'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NY','NEW YORK') THEN 'NY'
WHEN UPPER(TRIM(COALESCE(state))) IN ('NC','NORTH CAROLINA') THEN 'NC'
WHEN UPPER(TRIM(COALESCE(state))) IN ('ND','NORTH DAKOTA') THEN 'ND'
WHEN UPPER(TRIM(COALESCE(state))) IN ('OH','OHIO') THEN 'OH'
WHEN UPPER(TRIM(COALESCE(state))) IN ('OK','OKLAHOMA') THEN 'OK'
WHEN UPPER(TRIM(COALESCE(state))) IN ('OR','ORE','OREGON','OREGONE') THEN 'OR'
WHEN UPPER(TRIM(COALESCE(state))) IN ('PA','PENNSYLVANIA') THEN 'PA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('RI','RHODE ISLAND') THEN 'RI'
WHEN UPPER(TRIM(COALESCE(state))) IN ('SC','SOUTH CAROLINA') THEN 'SC'
WHEN UPPER(TRIM(COALESCE(state))) IN ('SD','SOUTH DAKOTA') THEN 'SD'
WHEN UPPER(TRIM(COALESCE(state))) IN ('TN','TENNESSEE') THEN 'TN'
WHEN UPPER(TRIM(COALESCE(state))) IN ('TX','TEXAS') THEN 'TX'
WHEN UPPER(TRIM(COALESCE(state))) IN ('UT','UTAH') THEN 'UT'
WHEN UPPER(TRIM(COALESCE(state))) IN ('VT','VERMONT') THEN 'VT'
WHEN UPPER(TRIM(COALESCE(state))) IN ('VA','VIRGINIA') THEN 'VA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('WA','WASHINGTON') THEN 'WA'
WHEN UPPER(TRIM(COALESCE(state))) IN ('WV','WEST VIRGINIA') THEN 'WV'
WHEN UPPER(TRIM(COALESCE(state))) IN ('WI','WISCONSIN') THEN 'WI'
WHEN UPPER(TRIM(COALESCE(state))) IN ('WY','WYOMING') THEN 'WY'
ELSE NULL
END
Update blocklists
The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores (_) instead of a colon (:) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores (and any related custom domain tables), and may return NULL values.
If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:
bad_addresses
bad_emails
bad_phones
For each SELECT statement, change:
REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,
to:
COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,
This update allows these SELECT statements to continue using a regular expression to find domain tables, and then to use * for custom domain tables, which prevents NULL values from being returned.
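The effect of the COALESCE() wrapper can be illustrated with a short Python sketch. This is a model of the behavior described above, not the blocklist implementation: `datasource_label` is a hypothetical helper, the table names are placeholders, and a failed match is represented as None to stand in for the NULL value.

```python
import re

PATTERN = r".+?(?=:)"  # shortest run of characters that is followed by a colon


def datasource_label(datasource):
    """Return the text before the first colon, or '*' when there is no colon."""
    match = re.search(PATTERN, datasource)
    extracted = match.group(0) if match else None  # None stands in for NULL
    return extracted if extracted is not None else "*"  # COALESCE(..., '*')


print(datasource_label("source:feed"))          # source
print(datasource_label("custom_domain_table"))  # *
```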