Bad-values blocklist¶
Some values can be very common in data sets. These values may have been entered into the data for the purpose of testing, but often they were entered with a goal of intentionally providing bad data. This pattern is especially common with phone numbers, email addresses, and physical addresses. For example, bad phone numbers like 555-555-5555, 555-1212, and 867-5309 may be provided by individuals instead of their real phone number.
Amperity should be configured to ignore these types of fake values when they appear in columns that are associated with certain semantic values for PII data. Especially when they occur at a very high frequency in customer data. By ignoring these values, this prevents the Stitch process from assigning Amperity IDs as if these fake values were real values.
Important
The bad-values blocklist supports the following PII semantics: given-name, surname, email, phone, and address (which is handled as a combination of fields to which the address, address2, city, state, and postal semantic tags are applied). You do not have to implement all of these semantics as part of the query template for automated blocklist values. Automated blocklist values will not work for address2 (standalone), birthdate, city (standalone), company, full-name, gender, generational-suffix, postal (standalone), state (standalone), title, or for any custom semantic.
Automatic bad value detection¶
Amperity is configured to automatically apply blocklists to email addresses, phone numbers, and physical addresses when a bad value is discovered at a frequency that exceeds the defined threshold.
The default configuration is:
:amperity.stitch.settings/badvalues-config [
{:threshold 20, :proxy "given-name", :semantic "email"}
{:threshold 20, :proxy "surname", :semantic "email"}
{:threshold 20, :proxy "given-name", :semantic "phone"}
{:threshold 40, :proxy "given-name", :semantic "address"}]
How the default configuration works:
An email address is added to the bad-values blocklist when the same email addresses is associated with more than 20 distinct given names.
A phone number is added to the bad-values blocklist when the same phone number is associated with more than 20 distinct given names.
A physical addresses is added to the bad-values blocklist when the same physical addresses is associated with more than 40 distinct given names.
The threshold values can be increased or decreased as needed for your tenant.
You can add blocklist items using the following syntax:
{:threshold 00, :proxy "semantic-name", :semantic "semantic-name"}
Disable automatic bad value detection¶
You can disable automatic bad-value blocklists by editing the configuration to be empty square brackets:
:amperity.stitch.settings/badvalues-config []
Stitch_BadValues table¶
The Stitch BadValues table contains all of the values that were added to the bad-values blocklist.
How to understand rows in the Stitch_BadValues table
For a configuration of:
{:threshold 20, :proxy "given-name", :semantic "email"}
A row in the Stitch BadValues table may contain the following values:
----- ---------- ----------- ------------ ----------- -----
... semantic value num_values num_proxy ...
----- ---------- ----------- ------------ ----------- -----
email a@aol.com 189 91
----- ---------- ----------- ------------ ----------- -----
This means that the email address “a@aol.com” was discovered 189 times with 91 different given-name proxies, which means this email address is associated with a large number of individuals. This exceeds the threshold of “20” and is added to the bad-values blocklist.
The Stitch BadValues table contains the following columns:
Column name |
Data type |
Description |
---|---|---|
Datasource |
String |
The name of the data source from which the value originated. |
Semantic |
String |
The semantic tag that is associated with the value that was added to the bad-values blocklist. |
Value |
String |
The value that was added to the bad-values blocklist. |
Num Values |
String |
The number of times the value appeared in the data. |
Num Proxy |
String |
The number of proxies to which the value was associated. When this number exceeds the threshold defined for in the blocklist a value is added to the bad-values blocklist. |
Domain Table |
String |
The domain table from which the value originated. |
Is Preprocessed |
String |
Indicates if the value was changed during pre-processing by Stitch. For example, phone numbers often have dashes within their values. Dashes are removed by Stitch during pre-processing. When this column is true, and a phone number value is “12065551212”, the original value for that phone number in the associated datasource may be “1-206-555-1212”. |
Global blocklist values¶
Amperity uses a global blocklist to automatically remove a set of known bad values from the data to prevent them from being used to identify clusters.
Semantic |
Values |
---|---|
All semantics |
The following values are ignored by Stitch when performing identity resolution:
|
address |
The following values associated with the address semantic are ignored by Stitch when performing identity resolution:
|
birthdate |
The following values associated with the birthdate semantic are ignored by Stitch when performing identity resolution:
|
city |
The following values associated with the city semantic are ignored by Stitch when performing identity resolution:
|
Stitch will automatically blocklist email address values that do not contain an @ symbol. The following values associated with the email semantic are ignored by Stitch when performing identity resolution:
The values in bold are always ignored. Warning Stitch may be configured to ignore a partial list of generic email addresses. When Stitch is configured to ignore a partial list of generic email addresses, additional work to update the bad-values blocklist may be necessary. |
|
full-name |
The following values associated with the full-name semantic are ignored by Stitch when performing identity resolution:
|
given-name |
The following values associated with the given-name semantic are ignored by Stitch when performing identity resolution:
|
phone |
The following values associated with the phone semantic are ignored by Stitch when performing identity resolution:
|
postal |
The following values associated with the postal semantic are ignored by Stitch when performing identity resolution:
|
state |
The following values associated with the state semantic are ignored by Stitch when performing identity resolution:
|
surname |
The following values associated with the surname semantic are ignored by Stitch when performing identity resolution:
|
Managed bad-value blocklists¶
A bad-values blocklist contains known values that appear frequently in data and should be excluded from the Stitch process.
In addition to using the default behavior of the bad-values blocklist, you may configure custom workflows for managing bad-values in your tenant’s blocklist:
Add a query for automated blocklist values.
Run the query, and then validate the results.
Note
These steps describe how to add a bad-values blocklist. They show examples for only the email and phone semantics. Use the example SQL query if you want to add the address semantic. If you want to also add given-name and surname they both follow the same pattern, but be sure to associate the proxy for given-name to some other semantic to ensure the validation process can compare the results alongside some other column.
Add SQL query¶
A query is SQL that is run from the Queries page against database tables in the Customer 360 page. A query returns a refined and filtered subset of useful customer data.
The easiest way to build a bad-values blocklist is to define a SQL query that returns a list of values associated with specific Amperity semantics. This list should only return common values that appear over a certain threshold.
This query must do the following:
Use SELECT to get data from a table in the customer 360 database.
Specify the associated Amperity semantic for that column.
Specify the column in which potential bad values are located.
Apply a count threshold, over which a value is returned for inclusion in the bad-values blocklist.
Return the data into columns named datasource, semantic, and values, one unique value per row.
Important
Only the PII semantic fields email, phone, given-name, surname, and address will have corresponding _blv columns added to the Unified Coalesced table, but any semantic may be added to the blocklist.
The name of the query should contain the word “blocklist” to help clearly identify its purpose. For example: “Bad-values Blocklist”.
For example, a SQL query that returns values that appear more than 20 times for email and phone is similar to:
WITH
bad_emails AS (
SELECT * FROM (
SELECT
UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
,'email' AS semantic
,datasource AS domain_table
,UPPER(given_name) AS proxy
FROM Unified_Coalesced)
WHERE value IS NOT null
AND value <> ''),
bad_phones AS (
SELECT * FROM (
SELECT
UPPER(phone) AS value
,'phone' AS semantic
,datasource AS domain_table
,UPPER(given_name) AS proxy
FROM Unified_Coalesced)
WHERE value IS NOT null
AND value <> '')
SELECT
*
,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
FROM (
SELECT
'*' AS datasource
,semantic
,value
,COUNT(value) AS num_values
,COUNT(DISTINCT proxy) AS num_proxy
,'' AS domain_table
FROM bad_emails
GROUP BY 1,2,3,6
HAVING COUNT(DISTINCT proxy) > 20
UNION
SELECT
'*' AS datasource
,semantic
,value
,COUNT(value) AS num_values
,COUNT(DISTINCT proxy) AS num_proxy
,'' AS domain_table
FROM bad_phones
GROUP BY 1,2,3,6
HAVING COUNT(DISTINCT proxy) > 20
ORDER BY datasource, semantic, num_proxy desc
)
Run query¶
Run the query from the SQL Query Editor to validate the syntax and to verify the output. Fix any errors that may be returned. If the values in the returned output don’t seem correct, update the threshold for values counts to see if that improves the results.
Tip
The goal with a bad-values blocklist isn’t to catch every single bad value, but rather to remove from the Stitch process the most comon bad values.
Download CSV¶
After the bad-values blocklist query has been validated and run successfully, download the results as a CSV file.
The bad-values blocklist output is a table similar to:
----- ------------ ------------ ----------------------------
row datasource semantic value
----- ------------ ------------ ----------------------------
1 MTK email internaltest@matchstik.com
2 MTK phone 555-555-5555
4 MTK fk 172fdr9HpTafWNpQmyVR
----- ------------ ------------ ----------------------------
Each row associates a bad value to an Amperity semantic and a customer data source. The CSV file itself would be similar to:
datasource,semantic,value
MTK,email,internaltest@matchstik.com
MTK,phone,555-555-5555
MTK,fk,172fdr9HpTafWNpQmyVR
Each row associates a bad value to an Amperity semantic and a customer data source.
Add feed¶
A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates that column contains customer profile (PII) and transactions data.
To add a feed for blocklist values
Click the Sources page.
Add a data source named “Amperity” if one does not already exist.
Add a feed named “Blocklist”.
Upload the bad-values blocklist CSV file that was previously downloaded from the Queries page.
Set all field types to string.
Set labels for fields in the CSV file.
The downloaded CSV file for the bad-values blocklist has three columns: datasource, semantic, and value.
These columns must be associated with specific labels in the feed for the bad-values blocklist. When these columns are associated in a feed, the SEMANTIC must be mapped to specific labels:
The following labels are required:
Incoming Field
Blocklist Semantic
datasource
blv/data-source
semantic
blv/semantic
value
blv/value
Note
This is different from how semantics are applied to a non-blocklist feed. This is because the actual data source, actual semantic, and actual values are in the data source itself, as they were output from the customer 360 data as CSV data, and then re-ingested as CSV data in the feed created for the bad-values blocklist. This, effectively, round-trips output from the customer 360 database as a CSV file that creates a new domain table for use during the Stitch process.
Define the primary key. This is required by the Stitch process, but for the purpose of creating the bad-values blocklist is not important. Use any row ID.
Uncheck the Make available to Stitch option.
Activate the feed.
Run Stitch¶
Run the Stitch process to update the results for the bad-values blackist.
On the Stitch page, click Run.
Re-run each data table in the customer 360 database that contains data that could be affected by the bad-values blocklist. This will add the results of the most recent Stitch run to these data tables, including adding a series of columns that indicate the presence of bad-values.
The
has_blv
column indicates if blocklist values for address, email, phone, given-name, or surname are present in customer records.
Edit Unified Coalesced table¶
The Unified Coalesced table contains all the PII data that has been processed through Stitch. The data is organized by the semantic tag as the column heading and then coalesced into one single table. A unique Amperity ID may appear in more than one row.
The Unified Coalesced table is preconfigured in the Stitch QA database. When a passthrough table is added to a database all of the columns in that table are selected by default.
Important
If a column is removed from the Unified Coalesced table, it will be removed automatically from the passthrough table. If a column is added to the Unified Coalesced table, you must edit the passthrough table in the Stitch QA database to select the added columns.
For example, the bad-values blocklist is typically configured after the initial customer 360 and Stitch QA databases. After the bad-values blocklist is configured, a collection of columns are added to the Unified Coalesced table that is output by Stitch. These columns are not enabled automatically within the Unified Coalesced table. You must edit the Unified Coalesced table, select these columns, activate the table, and then rerun the Stitch QA database to ensure that bad-values blocklist data is available to the Stitch QA database.
To edit the Unified Coalesced table
From the Customer 360 page, under All Databases, open the menu for the Stitch QA database, and then select View. The Database Editor page opens.
Under Database Tables, open the menu for the Unified Coalesced table, and then select Edit. The Database Table page opens.
Click Next to open the Database Table page, and then find and select the following columns: blv_address, blv_email, blv_given_name, blv_phone, and blv_surname.
Click Save. This returns you to the Database Editor page for the Stitch QA database.
Click Activate. Run the database.
Tip
If these changes also affect the Merged Customers table, update both tables before running the Stitch QA database.
Update Merged Customers table¶
When using the bad-values blocklist, you must update the Merged Customers table to include the bad-values blocklist items in the merge rules.
To update the Merged Customers table
From the Customer 360 page, under All Databases, open the menu for the customer 360 database, and then select Edit.
Open the menu for the Merged Customers table and select Edit.
Find the Unified_Prioritized section, and then update the first value blocks for email and phone to add the .blv entries:
,FIRST_VALUE(email_struct) OVER ( PARTITION BY amperity_id ORDER BY email_struct.completion DESC ,email_struct.blv ,email_struct.priority ,email_struct.update_dt DESC ,email_struct.pk ) AS email_struct ,FIRST_VALUE(phone_struct) OVER ( PARTITION BY amperity_id ORDER BY phone_struct.completion DESC ,phone_struct.blv ,phone_struct.priority ,phone_struct.update_dt DESC ,phone_struct.pk ) AS phone_struct
Find the SELECT statement that builds the Merged Customers table, and then add the columns for email_internal:
,up.email_struct.email ,up.email_struct.pk AS `email_pk` ,up.email_struct.update_dt AS `email_update_dt` ,up.email_struct.datasource AS `email_datasource` ,up.email_struct.priority AS `email_priority` ,up.email_struct.blv AS `email_blv` ,up.email_struct.completion AS `email_completion` ,up.phone_struct.phone ,up.phone_struct.pk AS `phone_pk` ,up.phone_struct.update_dt AS `phone_update_dt` ,up.phone_struct.datasource AS `phone_datasource` ,up.phone_struct.priority AS `phone_priority` ,up.phone_struct.blv AS `phone_blv` ,up.phone_struct.completion AS `phone_completion`
Validate the query.
Click Next. This opens the Database Table Definition page.
Verify these settings, and then click Save.
Run the customer 360 database.
Note
If you add other PII semantics to the bad-values blocklist, such as for addresses, be sure to make the same changes to the structs.
Add Stitch BlocklistValues table¶
Important
This table cannot be added to the Stitch QA database until after the bad-values blocklist has been configured.
To add the Stitch BlocklistValues table
From the Customer 360 page, under All Databases, select the menu for the Stitch QA database, and then click Edit.
From the Database Editor, click Add Table.
Name the table “Stitch_BlocklistValues”.
Set the build mode to Passthrough.
From the Source Table drop-down, select the table associated with the bad-values blocklist.
Click Next.
Add a table description. This enables a tooltip that is visible from other areas in Amperity.
Leave everything else unchanged.
Click Save.
Activate, and then run the Stitch QA database.
Update Stitch QA queries¶
Update the detailed examples Stitch QA query to uncomment all of the fields that contain _blv, and then run it to verify the results.
The following Stitch QA queries can be configured to exclude values for the bad-values blocklist:
After configuring the bad-values blocklist, examine each of the Stitch QA queries and determine which bad values (if any) should be excluded from the query.
Automate blocklist¶
The bad-values blocklist must be refreshed on a daily basis. To do this, configure a destination to send the results of the bad-values blocklist query to the cloud-based storage location that is included with your tenant – Amazon S3 or Microsoft Azure Blob Storage. Update your courier to pull the bad-values blocklist from that location, and then configure the end-to-end workflow to run automatically.
Tip
The following settings are recommended whenever the bad-values blocklist belongs to a courier group:
Set Notify when missing? to “Disabled”.
Set Abort when missing? to “Disabled”.
Configure data to load 1 day older than the scheduled date and time.
Advanced options¶
There are more ways to do blocklists:
Per-data source¶
You can create a group of bad-values queries, with one query per data source. This allows you to tune the SQL query results to a specific data set for both semantic associations and thresholds. Use the existing SQL query as a template and the create a unique query for each bad-values blocklist you want to build based on results in the customer 360 database.
Per-database¶
You could use a single SQL query that is run against the customer 360 database to return a bad-values blocklist that is global and not specific to a single data table. Download that as a CSV file. Then follow the same steps.
Expand returned values¶
In situations where multiple values exist for phone and/or email addresses and the field in the Unified Coalesced table is a comma-separated concatenation of all values, the automated blocklist SQL query may fail to catch individual bad values for phones or email addresses because the query looks at occurrances of the whole (concatenated) value.
Use the UNNEST clause in the SQL segment to expand phone and/or email addresses so that the query looks at individual bad values. The following example shows using an UNNEST clause to expand email addresses that are part of a bad-values blocklist SQL query:
SELECT
REGEXP_REPLACE(v.value,'\+.*@','@') AS value
,M.semantic
,M.datasource
,M.domain_table
,M.proxy
FROM (
SELECT DISTINCT
SPLIT(UPPER(email), ',') AS vs
,'email' AS semantic
,REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource
,datasource AS domain_table
,UPPER(given_name) AS proxy
FROM Unified_Coalesced AS UC
) AS M
CROSS JOIN UNNEST(vs) AS v(value)
WHERE value IS NOT NULL
AND value <> ''),
Addresses¶
The profile (PII) semantics to which a bad-values blocklist can be applied are email, phone, given-name, surname, and address.
When address is added to the bad-values blocklist, be sure to verify that real values, such as university addresses, apartment complex addresses, and condominium addresses are not being added to the bad-values blocklist.
Tip
Stitch processes addresses in a bad-values blocklist in this order:
Each record is normalized, and the address and address2 are concatenated together into a single address field.
Each blocklist entry is processed for the normalized address.
Values are removed from the preprocessed record if they occur in any preprocessed blocklist record.
A blocklist cannot define independent address and address2 semantics. When the values in address2 must be part of the bad-values blocklist, the value of the address in its post-processed concatenated state is the best value to use in the blocklist.
An effective bad-values blocklist for address often requires tuning and validation of the results to ensure that the right level of values are removed from the data. Start with a high threshold (at least “40”, but higher if necessary) for address, verify the results, and then adjust the threshold carefully until the desired level of accuracy is achieved. Use an Internet search to help verify each address that is blocklisted as part of the verification process.
When the bad-values blocklist is applied to address keep in mind that it also considers city and state along with address before determining if the threshold is met. This group—address, city, and state—does not replace the address value in the Stitch_BadValues table; the same address value may appear multiple times for each city and state pair. When ordinals are used with the address semantic tag, the address group for each ordinal is checked.
address2 fields¶
The address field may be used in the bad-values blocklist as part of a complete, normalized address. This should be done carefully because the fields associated with this semantic can follow many patterns and contain many types of values so applying a bad-values blocklist to them is more difficult and the results are less accurate.
Important
The address2 field should never be used in the bad-values blocklist in isolation.
In certain situations, the address2 field is the source of a value that may need to be added to the bad-values blocklist. There are two ways to do this:
Caution
Both approaches require visual inspection of the results to verify that the blocklist values are being applied correctly and only to the intended fields and not to a more global set of fields.
Remove address¶
To remove a single address from post-processed output, add SQL similar to the following to the bad-values blocklist query to remove a single address value. Be sure to use the post-processed, normalized format for the data as the value to be removed:
semantic="address", value="123 MAIN ST STE 500"
Concatenate address¶
To concatenate the address and address2 fields in the bad-values blocklist, update the following line in the bad_addresses block from:
UPPER(address) AS value
to:
CASE
WHEN address2 IS null
THEN UPPER(address)
ELSE UPPER(CONCAT(address,' ',address2))
END AS value
Caution
This approach should not be used when the address field is known to contain a high volume of bad addresses that should be removed, and also the address2 fields within that subset of records contains a variety of junk values. Updating the bad_address block in this scenario may cause addresses to be missed as the blocklist values are applied.
Custom CSV files¶
You may create a CSV file to use for custom bad-value blocklists. The table structure of this CSV file must be contain the following columns:
datasource
semantic
value
and then must contain a unique value per row, associated to a single Amperity semantic. Use a wildcard value (*
) to associate a value with all data sources:
----- ------------ ----------- ----------------------------
row datasource semantic value
----- ------------ ----------- ----------------------------
1 ACME email internaltest@matchstik.com
2 ACME email externaltext@matchstik.com
3 ACME email foo@matchstik.com
4 ACME phone 555-555-5555
5 ACME phone 555-1212
6 * phone 867-5309
7 ACME phone 111-111-1111
8 * address One Infinite Loop
9 ACME fk 172fdr9HpTafWNpQmyVR
----- ------------ ----------- ----------------------------
The CSV file itself would be similar to:
datasource,semantic,value
ACME,email,internaltest@matchstik.com
ACME,email,externaltext@matchstik.com
ACME,email,foo@matchstik.com
ACME,phone,555-555-5555
ACME,phone,555-1212
*,phone,867-5309
ACME,phone,111-111-1111
*,address,One Infinite Loop
ACME,fk,172fdr9HpTafWNpQmyVR
To use a custom CSV file to blocklist birthdates
Use a custom CSV file to blocklist birthdates that exist across multiple data sources. For example:
datasource,semantic,value *,birthdate,0001-01-03 *,birthdate,0001-01-01 *,birthdate,1901-12-13
Import this CSV file as a feed.
Set the semantic and value fields as a primary key.
Uncheck the Make available to Stitch setting.
Tip
This example shows how to use a CSV file to provide custom values that should be blocklisted. However, adding birthdates to a custom blocklist is often unnecessary because Amperity will not use birthdate values to cluster two individuals with distinct names. The likelihood of distinct individuals with very similar PII, such as both having the same name and email address, along with having the same birthdate is very low.
Before adding values to a custom blocklist, first verify if any of these values have created false-positive clusters in the Stitch results.
Custom domain tables¶
The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores (_
) instead of a colon (:
) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores (and any related custom domain tables), and may return NULL values.
If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:
bad_addresses
bad_emails
bad_phones
For each SELECT statement, change:
REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,
to:
COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,
This update will allow these SELECT statements to continue using a regular expression to find domain tables, and then use *
to find custom domain tables and will prevent NULL values from being returned.
External blocklists¶
A blocklist does not need to be round-tripped as CSV output from the customer 360 database back into Amperity via a feed to a domain table. A blocklist may come from any source, including external sources that contain known bad values. The only requirement for this approach is the table structure must contain fields for datasource, semantic, and values.
Update existing blocklist¶
You can upload new data to an existing blocklist feed as long as the schema for the uploaded data matches the schema that already exists in the blocklist feed. Select the feed menu options, and then choose Load new data. In the dialog box, select the Upload new option and click Upload to select a file. The table may be truncated on load. Then load the file.
Derived semantics¶
A bad-values blocklist works best for email addresses, physical addresses, and phone numbers. That said, other Amperity PII semantics may be added to the blocklist.
The bad-values blocklist operates against pre-processed semantic rows, and not against raw customer data. When adding blocklist values, consider which semantics are best to target rather than which fields to target, even if they are often the same.
Caution
Blocklist values are pre-processed using the same semantic configuration as the targeted data source. This is intentional, in part to ensure the individual who creates the bad-values blocklist does not have to understand what the customer data will look like after it has been pre-processed.
This can be a problem with derived semantics.
A derived semantic is a semantic that can be inferred from existing data. For example, “given-name” and “surname” semantics can be inferred from “full-name” and “gender” can be inferred from “title”. Amperity creates derived semantics automatically to ensure the greatest number of individual semantics are available to Stitch during identity resolution.
If the targeted data source is configured to derive given-name and surname semantics from the full-name semantic, values in the full-name semantic should never be added to a bad-values blocklist.
For example, if “BLACK BLACK” were added to the bad-values blocklist via the full-name semantic, then “BLACK” for both given-name and surname semantics would be blocklisted too. This type of situation is not common, but must be considered when deciding which semantics to use for building the bad-values blocklist.
SQL query example¶
This SQL query provides a working example of a bad-values blocklist that looks for common values that exceed defined thresholds for physical addresses, email addresses, and phone numbers. Use this as a template for defining a bad-values query against your data.
-- QUERY NAME: Blocklist Values
-- DESCRIPTION: Use this query to verify the presence of blocklisted values.
-- REQUIREMENTS: Configure for semantics and use `proxy` to assess the quality
-- of the bad values returned by the query. The recommended proxy is `given_name`.
-- OPTIONS:
-- NOTES:
WITH bad_addresses AS (
SELECT *
FROM (
SELECT
CASE
WHEN address2 IS NULL
THEN UPPER(address)
ELSE UPPER(CONCAT(address,' ',address2))
END AS value
,UPPER(REPLACE(CONCAT(
COALESCE(address,''),
COALESCE(address2,''),
COALESCE(city,''),
COALESCE(state,'')),
' ')) AS measured_value
,'address' AS semantic
,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
,datasource AS domain_table
,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
FROM Unified_Coalesced
)
WHERE value IS NOT NULL
AND value <> ''
)
,bad_emails AS (
SELECT *
FROM (
SELECT
UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
,'email' AS semantic
,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
,datasource AS domain_table
,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
FROM Unified_Coalesced
)
WHERE value IS NOT NULL
AND value <> '')
,bad_phones AS (
SELECT *
FROM (
SELECT
UPPER(phone) AS value
,'phone' AS semantic
,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
,datasource AS domain_table
,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
FROM Unified_Coalesced
)
WHERE value IS NOT NULL
AND value <> '')
SELECT
*
,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
FROM (
SELECT
datasource
,semantic
,value
,SUM(num_values) AS num_values
,SUM(num_proxy) AS num_proxy
,domain_table
FROM (
SELECT
'*' AS datasource
,semantic
,value
,measured_value
,COUNT(value) AS num_values
,COUNT(DISTINCT proxy) AS num_proxy
,'' AS domain_table
FROM bad_addresses
GROUP BY 1,2,3,4,7
HAVING COUNT(DISTINCT proxy) > 40
)
GROUP BY 1,2,3,6
UNION
SELECT
'*' AS datasource
,semantic
,value
,COUNT(value) AS num_values
,COUNT(DISTINCT proxy) AS num_proxy
,'' AS domain_table
FROM bad_emails
GROUP BY 1,2,3,6
HAVING COUNT(DISTINCT proxy) > 20
UNION
SELECT
'*' AS datasource
,semantic
,value
,COUNT(value) AS num_values
,COUNT(DISTINCT proxy) AS num_proxy
,'' AS domain_table
FROM bad_phones
GROUP BY 1,2,3,6
HAVING COUNT(DISTINCT proxy) > 20
)