Bad-values blocklist

Some values can be very common in data sets. These values may have been entered into the data for the purpose of testing, but often they were entered with a goal of intentionally providing bad data. This pattern is especially common with phone numbers, email addresses, and physical addresses. For example, bad phone numbers like 555-555-5555, 555-1212, and 867-5309 may be provided by individuals instead of their real phone number.

Amperity should be configured to ignore these types of fake values when they appear in columns that are associated with certain semantic values for PII data. Especially when they occur at a very high frequency in customer data. By ignoring these values, this prevents the Stitch process from assigning Amperity IDs as if these fake values were real values.

Important

The bad-values blocklist supports the following PII semantics: given-name, surname, email, phone, and address (which is handled as a combination of fields to which the address, address2, city, state, and postal semantic tags are applied). You do not have to implement all of these semantics as part of the query template for automated blocklist values. Automated blocklist values will not work for address2 (standalone), birthdate, city (standalone), company, full-name, gender, generational-suffix, postal (standalone), state (standalone), title, or for any custom semantic.

Bad values

Use the Bad values tab in the Stitch settings dialog to configure values that should be ignored by Stitch during identity resolution.

The Bad values tab in the Stitch settings dialog box.

Values can be configured to be:

  • Detected automatically using a threshold and an association between two fields. The value is ignored when the first field exceeds the configured threshold as it relates to the second field.

    For example, “Ignore any email address with more than 8 given names.” where email is the first field, the threshold is “8”, and the second field is given-name. When an email address is associated with more than 8 given names, that email address will be added to the bad-values blocklist and ignored by Stitch, after which it is not used for identity resolution.

    Automatic detection is available for all fields to which customer profile semantic tags were applied, such as email, phone, address, given-name and for fields that contain foreign keys.

  • Detected manually. Specific values can be added for email addresses, phone numbers, given names, and surnames.

Tip

Review the documentation for managing bad values for additional options for managing a bad-values blocklist.

Automatic bad value detection

Automatic detection uses rules that define thresholds above which values are ignored automatically. Rules can be defined to look for values in fields created by customer profile semantic tags (address, birthdate, city, company, email, gender, generational-suffix, given-name, phone, postal, state, and surname), along with foreign keys (fk-) and Post Office boxes.

For example:

Ignore any email with more than 8 given names

will automatically ignore an email address that is associated with 8 or more distinct values for given_name.

Note

Default thresholds are set for the following combinations:

  1. Ignore any email address with more than 20 given names.

  2. Ignore any email address with more than 20 surnames.

  3. Ignore any phone number with more than 20 given names.

  4. Ignore any physical address with more than 40 given names.

Values can be added to the bad-values blocklist manually for email addresses, phone numbers, given names, and surnames. Type the value into the field, and then hit enter to manually add a value to the bad-values blocklist.

Tip

If a value needs to be removed from the bad-values blocklist click the “X” icon next to the value.

Disable automatic bad value detection

You can disable automatic bad-value blocklists by updating the configuration to add the following setting with empty square brackets:

:amperity.stitch.settings/badvalues-config []

Stitch_BadValues table

The Stitch BadValues table contains all of the values that were added to the bad-values blocklist.

How to understand rows in the Stitch_BadValues table

For a configuration of:

{:threshold 20, :proxy "given-name", :semantic "email"}

A row in the Stitch BadValues table may contain the following values:

----- ---------- ----------- ------------ -----------  -----
 ...   semantic   value       num_values   num_proxy    ...
----- ---------- ----------- ------------ -----------  -----
       email      a@aol.com   189          91
----- ---------- ----------- ------------ -----------  -----

This means that the email address “a@aol.com” was discovered 189 times with 91 different given-name proxies, which means this email address is associated with a large number of individuals. This exceeds the threshold of “20” and is added to the bad-values blocklist.

The Stitch BadValues table has the following columns:

Column name

Data type

Description

Datasource

String

The name of the data source from which the value originated.

Semantic

String

The semantic tag that is associated with the value that was added to the bad-values blocklist.

Value

String

The value that was added to the bad-values blocklist.

Num Values

String

The number of times the value appeared in the data.

Num Proxy

String

The number of proxies to which the value was associated. When this number exceeds the threshold defined for in the blocklist a value is added to the bad-values blocklist.

Domain Table

String

The domain table from which the value originated.

Is Preprocessed

String

Indicates if the value was changed during pre-processing by Stitch.

For example, phone numbers often have dashes within their values. Dashes are removed by Stitch during pre-processing. When this column is true, and a phone number value is “12065551212”, the original value for that phone number in the associated datasource may be “1-206-555-1212”.

Global blocklist values

Amperity uses a global blocklist to automatically remove a set of known bad values from the data to prevent them from being used to identify clusters.

Semantic

Values

All semantics

The following values are ignored by Stitch when performing identity resolution:

  • BLOCKED

  • COLLECT

  • DECLINE

  • DECLINED

  • DECLINES

  • DENIED

  • EMPTY

  • INFO

  • N/A

  • NAME

  • NONE

  • NONE GIVEN

  • NO VALUE

  • NOT AVAILABLE

  • NOT GIVEN

  • NOTAVAILABLE

  • NULL

  • REFUSED

  • TBD

  • TDB

  • TRAVEL

  • UNKNOWN

address

The following values associated with the address semantic are ignored by Stitch when performing identity resolution:

  • GENERAL DELIVERY

  • NEED ADDRESS

  • NO ADDRESS GIVEN

  • P O BOX

  • PO BOX

  • UNLISTED

  • YOUR STREET ADDRESS

birthdate

The following values associated with the birthdate semantic are ignored by Stitch when performing identity resolution:

  • 1899-12-31

  • 1900-01-01

  • 1920-01-01

city

The following values associated with the city semantic are ignored by Stitch when performing identity resolution:

  • NA

  • NAN

  • NEED CITY

  • NO

  • NO ADDRESS

email

Stitch will automatically blocklist email address values that do not contain an @ symbol.

The following values associated with the email semantic are ignored by Stitch when performing identity resolution:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 0000000000

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • ADMIN@

  • BOOKING@

  • CLIENT@

  • CLIENTS@

  • CONFIRMATION@

  • CONFIRMATIONS@

  • CONTACT@

  • CUSTOMERSERVICE

  • CUSTOMERSERVICE@

  • CUSTOMERSERVICES

  • CUSTOMERSERVICES@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • EMAIL@

  • @EMAIL.TST

  • EXAMPLE@

  • FAKENAME@

  • GUEST@

  • GUESTS@

  • HELP@

  • HELPS@

  • HOTELHELP@

  • HOTELPARTNER@

  • HOTELPARTNERS@

  • INFO@

  • JUNK@

  • MAIL@

  • ME@

  • N@A

  • NAME@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • ONLINERESERVATION

  • ONLINERESERVATION@

  • ONLINERESERVATIONS

  • ONLINERESERVATIONS@

  • OPERATION@

  • OPERATIONS@

  • QUERIES@

  • QUERY@

  • REFUSED@

  • RES@

  • RESERVAS

  • RESERVATION@

  • RESERVATIONS@

  • ROOMRESERVATION@

  • ROOMRESERVATIONS@

  • SAMPLE@

  • SAMPLES@

  • SERVICE@

  • SHOP@

  • TEST@

  • TESTING@

  • TESTEMAIL@

  • TRAVEL@

  • TRAVELS

  • VENDOR@

  • VENDORS@

  • XXX@

The values in bold are always ignored.

Warning

Stitch may be configured to ignore a partial list of generic email addresses. When Stitch is configured to ignore a partial list of generic email addresses, additional work to update the bad-values blocklist may be necessary.

full-name

The following values associated with the full-name semantic are ignored by Stitch when performing identity resolution:

  • DEFAULT CUSTOMER

  • PREPAID CARDHOLDER

  • RED CARD

  • RED CARDS

  • RESERVED GUEST

  • RESERVED GUESTS

  • TEST TEST

given-name

The following values associated with the given-name semantic are ignored by Stitch when performing identity resolution:

  • BLOCK

  • CONTACT

  • GUEST

  • GUESTS

  • NO NAME

  • RESERVED

  • USE

phone

The following values associated with the phone semantic are ignored by Stitch when performing identity resolution:

  • 0000000000

  • 0000000001

  • 1111111111

  • 1234567890

  • 1234567891

  • 2222222222

  • 3333333333

  • 4444444444

  • 5555555555

  • 6666666666

  • 7777777777

  • 8888888888

  • 9999999999

postal

The following values associated with the postal semantic are ignored by Stitch when performing identity resolution:

  • 00000

  • 11111

  • 1111

  • 111

  • 11

  • 1

  • NA

  • NAN

  • NO

state

The following values associated with the state semantic are ignored by Stitch when performing identity resolution:

  • NA

  • NAN

  • NEED STATE

  • NO

surname

The following values associated with the surname semantic are ignored by Stitch when performing identity resolution:

  • CONTACT

  • GUESTS

  • NO NAME

  • RESERVED

  • USE

Managed bad-value blocklists

A bad-values blocklist has known values that appear often in data. The Stitch process should exclude values from the bad-values blocklist.

In addition to using the default behavior of the bad-values blocklist, you may configure custom workflows for managing bad-values in your tenant’s blocklist:

  1. Add a query for automated blocklist values.

  2. Run the query, and then validate the results.

  3. Download the results as a CSV file.

  4. Add a feed to ingest the CSV file.

  5. Run Stitch.

  6. Edit Unified Coalesced table.

  7. Update the Merged Customers table.

  8. Update Stitch QA queries.

  9. Automate blocklist.

Note

These steps describe how to add a bad-values blocklist. They show examples for only the email and phone semantics. Use the example SQL query if you want to add the address semantic. If you want to also add given-name and surname they both follow the same pattern, but be sure to associate the proxy for given-name to some other semantic to ensure the validation process can compare the results alongside some other column.

Add SQL query

A query is SQL that runs from the Queries page against database tables in the Customer 360 page. A query returns a refined and filtered subset of useful customer data.

The easiest way to build a bad-values blocklist is to define a SQL query that returns a list of values associated with specific Amperity semantics. This list should only return common values that appear over a certain threshold.

This query must do the following:

  • Use SELECT to get data from a table in the customer 360 database.

  • Specify the associated Amperity semantic for that column.

  • Specify the column in which potential bad values are located.

  • Apply a count threshold, over which a value is returned for inclusion in the bad-values blocklist.

  • Return the data into columns named datasource, semantic, and values, one unique value per row.

Important

Only the PII semantic fields email, phone, given-name, surname, and address have corresponding _blv columns added to the Unified Coalesced table, but any semantic may be added to the blocklist.

The name of the query should contain the word “blocklist” to help identify its purpose. For example: “Bad-values Blocklist”.

For example, a SQL query that returns values that appear more than 20 times for email and phone is similar to:

 1WITH
 2
 3bad_emails AS (
 4  SELECT * FROM (
 5    SELECT
 6      UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
 7      ,'email' AS semantic
 8      ,datasource AS domain_table
 9      ,UPPER(given_name) AS proxy
10    FROM Unified_Coalesced)
11  WHERE value IS NOT null
12  AND value <> ''),
13
14bad_phones AS (
15  SELECT * FROM (
16    SELECT
17      UPPER(phone) AS value
18      ,'phone' AS semantic
19      ,datasource AS domain_table
20      ,UPPER(given_name) AS proxy
21    FROM Unified_Coalesced)
22  WHERE value IS NOT null
23  AND value <> '')
24
25SELECT
26  *
27  ,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
28FROM (
29  SELECT
30    '*' AS datasource
31    ,semantic
32    ,value
33    ,COUNT(value) AS num_values
34    ,COUNT(DISTINCT proxy) AS num_proxy
35    ,'' AS domain_table
36  FROM bad_emails
37  GROUP BY 1,2,3,6
38  HAVING COUNT(DISTINCT proxy) > 20
39
40  UNION
41
42  SELECT
43    '*' AS datasource
44    ,semantic
45    ,value
46    ,COUNT(value) AS num_values
47    ,COUNT(DISTINCT proxy) AS num_proxy
48    ,'' AS domain_table
49  FROM bad_phones
50  GROUP BY 1,2,3,6
51  HAVING COUNT(DISTINCT proxy) > 20
52
53  ORDER BY datasource, semantic, num_proxy desc
54)

Run query

Run the query from the SQL Query Editor to validate the syntax and to verify the output. Fix any errors that may be returned. If the values in the returned output do not seem correct, update the threshold for values counts to see if that improves the results.

Tip

The goal of the bad-values blocklist is not to catch every single bad value, but rather to remove from the Stitch process the most comon bad values.

Download CSV

After the bad-values blocklist query has been validated and run successfully, download the results as a CSV file.

The bad-values blocklist output is a table similar to:

----- ------------ ------------ ----------------------------
 row   datasource   semantic     value
----- ------------ ------------ ----------------------------
 1     MTK          email        internaltest@matchstik.com
 2     MTK          phone        555-555-5555
 4     MTK          fk           172fdr9HpTafWNpQmyVR
----- ------------ ------------ ----------------------------

Each row associates a bad value to an Amperity semantic and a customer data source. The CSV file itself would be similar to:

datasource,semantic,value
MTK,email,internaltest@matchstik.com
MTK,phone,555-555-5555
MTK,fk,172fdr9HpTafWNpQmyVR

Each row associates a bad value to an Amperity semantic and a customer data source.

Add feed

A feed defines how to load data into a domain table, including specifying required columns and columns with semantic tags for customer profile (PII) or transactions data.

To add a feed for blocklist values

  1. Click the Sources page.

  2. Add a data source named “Amperity” if one does not already exist.

  3. Add a feed named “Blocklist”.

  4. Upload the bad-values blocklist CSV file that was previously downloaded from the Queries page.

  5. Set all field types to string.

  6. Set labels for fields in the CSV file.

    The downloaded CSV file for the bad-values blocklist has three columns: datasource, semantic, and value.

    These columns must be associated with specific labels in the feed for the bad-values blocklist. When these columns are associated in a feed, the SEMANTIC must be mapped to specific labels:

    The following labels are required:

    Incoming Field

    Blocklist Semantic

    datasource

    blv/data-source

    semantic

    blv/semantic

    value

    blv/value

    Note

    This is different from how semantics are applied to a non-blocklist feed. This is because the actual data source, actual semantic, and actual values are in the data source itself, as they were output from the customer 360 data as CSV data, and then re-ingested as CSV data in the feed created for the bad-values blocklist. This round-trips output from the customer 360 database as a CSV file that creates a new domain table for use during the Stitch process.

  7. Define the primary key. This is required by the Stitch process, but for the purpose of creating the bad-values blocklist is not important. Use any row ID.

  8. Uncheck the Make available to Stitch option.

  9. Activate the feed.

Run Stitch

Run the Stitch process to update the results for the bad-values blackist.

  1. On the Stitch page, click Run.

  2. Re-run each data table in the customer 360 database that contains data that could be affected by the bad-values blocklist. This adds the results of the most recent Stitch run to these data tables, including adding a series of columns that indicate the presence of bad-values.

    The has_blv column indicates if blocklist values for address, email, phone, given-name, or surname are present in customer profiles.

Edit Unified Coalesced table

The Unified Coalesced table has all PII data processed through Stitch. Each semantic tag is a column header. All data is coalesced into a single table. A unique Amperity ID may appear in more than one row.

The Unified Coalesced table is preconfigured in the Stitch QA database. When a passthrough table is added to a database all of the columns in that table are selected by default.

Important

If a column is removed from the Unified Coalesced table, it will be removed automatically from the passthrough table. If a column is added to the Unified Coalesced table, you must edit the passthrough table in the Stitch QA database to select the added columns.

For example, the bad-values blocklist is typically configured after the initial customer 360 and Stitch QA databases. After the bad-values blocklist is configured, a collection of columns are added to the Unified Coalesced table that is output by Stitch. These columns are not enabled automatically within the Unified Coalesced table. You must edit the Unified Coalesced table, select these columns, activate the table, and then rerun the Stitch QA database to ensure that bad-values blocklist data is available to the Stitch QA database.

To edit the Unified Coalesced table

  1. From the Customer 360 page, under All Databases, open the menu for the Stitch QA database, and then select View. The Database Editor page opens.

  2. Under Database Tables, open the menu for the Unified Coalesced table, and then select Edit. The Database Table page opens.

  3. Click Next to open the Database Table page, and then find and select the following columns: blv_address, blv_email, blv_given_name, blv_phone, and blv_surname.

  4. Click Save. This returns you to the Database Editor page for the Stitch QA database.

  5. Click Activate. Run the database.

Tip

If these changes also affect the Merged Customers table, update both tables before running the Stitch QA database.

Update Merged Customers table

When using the bad-values blocklist, you must update the Merged Customers table to include the bad-values blocklist items in the merge rules.

To update the Merged Customers table

  1. From the Customer 360 page, under All Databases, open the menu for the customer 360 database, and then select Edit.

  2. Open the menu for the Merged Customers table and select Edit.

  3. Find the Unified_Prioritized section, and then update the first value blocks for email and phone to add the .blv entries:

     1,FIRST_VALUE(email_struct)
     2  OVER (
     3    PARTITION BY amperity_id
     4    ORDER BY email_struct.completion DESC
     5             ,email_struct.blv
     6             ,email_struct.priority
     7             ,email_struct.update_dt DESC
     8             ,email_struct.pk
     9  ) AS email_struct
    10
    11,FIRST_VALUE(phone_struct)
    12  OVER (
    13    PARTITION BY amperity_id
    14    ORDER BY phone_struct.completion DESC
    15             ,phone_struct.blv
    16             ,phone_struct.priority
    17             ,phone_struct.update_dt DESC
    18             ,phone_struct.pk
    19  ) AS phone_struct
    
  4. Find the SELECT statement that builds the Merged Customers table, and then add the columns for email_internal:

     1,up.email_struct.email
     2,up.email_struct.pk AS `email_pk`
     3,up.email_struct.update_dt AS `email_update_dt`
     4,up.email_struct.datasource AS `email_datasource`
     5,up.email_struct.priority AS `email_priority`
     6,up.email_struct.blv AS `email_blv`
     7,up.email_struct.completion AS `email_completion`
     8
     9,up.phone_struct.phone
    10,up.phone_struct.pk AS `phone_pk`
    11,up.phone_struct.update_dt AS `phone_update_dt`
    12,up.phone_struct.datasource AS `phone_datasource`
    13,up.phone_struct.priority AS `phone_priority`
    14,up.phone_struct.blv AS `phone_blv`
    15,up.phone_struct.completion AS `phone_completion`
    
  5. Validate the query.

  6. Click Next. This opens the Database Table Definition page.

  7. Verify these settings, and then click Save.

  8. Run the customer 360 database.

Note

If you add other PII semantics to the bad-values blocklist, such as for addresses, be sure to make the same changes to the structs.

Add Stitch BlocklistValues table

Important

This table cannot be added to the Stitch QA database until after the bad-values blocklist has been configured.

To add the Stitch BlocklistValues table

  1. From the Customer 360 page, under All Databases, select the menu for the Stitch QA database, and then click Edit.

  2. From the Database Editor, click Add Table.

  3. Name the table “Stitch_BlocklistValues”.

  4. Set the build mode to Passthrough.

  5. From the Source Table dropdown, select the table associated with the bad-values blocklist.

  6. Click Next.

  7. Add a table description. This enables a tooltip that is visible from other areas in Amperity.

  8. Leave everything else unchanged.

  9. Click Save.

  10. Activate, and then run the Stitch QA database.

Update Stitch QA queries

Update the detailed examples Stitch QA query to uncomment all of the fields that contain _blv, and then run it to verify the results.

The following Stitch QA queries can be configured to exclude values for the bad-values blocklist:

  1. Common values

  2. Detailed examples

  3. Many Amperity IDs

  4. Split clusters

  5. Unmatched semantic values

After configuring the bad-values blocklist, examine each of the Stitch QA queries and determine which bad values should be excluded from the query.

Automate blocklist

The bad-values blocklist must be refreshed on a daily basis. To do this, configure a destination to send the results of the bad-values blocklist query to the cloud-based storage location that is included with your tenant – Amazon S3 or Microsoft Azure Blob Storage. Update your courier to pull the bad-values blocklist from that location, and then configure the end-to-end workflow to run automatically.

Tip

The following settings are recommended whenever the bad-values blocklist belongs to a courier group:

  1. Set Notify when missing? to “Disabled”.

  2. Set Abort when missing? to “Disabled”.

  3. Configure data to load 1 day older than the scheduled date and time.

Advanced options

There are more ways to do blocklists:

By data source

You can create a group of bad-values queries, with one query per data source. This allows you to tune the SQL query results to a specific data set for both semantic associations and thresholds. Use the existing SQL query as a template and the create a unique query for each bad-values blocklist you want to build based on results in the customer 360 database.

By database

You could use a single SQL query that is run against the customer 360 database to return a bad-values blocklist that is global and not specific to a single data table. Download that as a CSV file. Then follow the same steps.

Expand returned values

In situations where multiple values exist for phone numbers or email addresses and the field in the Unified Coalesced table is a comma-separated concatenation of all values, the automated blocklist SQL query may fail to catch individual bad values for phones or email addresses because the query looks at occurrances of the whole concatenated value.

Use the UNNEST clause in the SQL segment to expand phone numbers or email addresses so that the query looks at individual bad values. The following example shows using an UNNEST clause to expand email addresses that are part of a bad-values blocklist SQL query:

 1SELECT
 2  REGEXP_REPLACE(v.value,'\+.*@','@') AS value
 3  ,M.semantic
 4  ,M.datasource
 5  ,M.domain_table
 6  ,M.proxy
 7FROM (
 8  SELECT DISTINCT
 9    SPLIT(UPPER(email), ',') AS vs
10    ,'email' AS semantic
11    ,REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource
12    ,datasource AS domain_table
13    ,UPPER(given_name) AS proxy
14  FROM Unified_Coalesced AS UC
15) AS M
16CROSS JOIN UNNEST(vs) AS v(value)
17WHERE value IS NOT NULL
18AND value <> ''),

Addresses

The profile (PII) semantics to which a bad-values blocklist can be applied are email, phone, given-name, surname, and address.

When address is added to the bad-values blocklist, be sure to verify that real values, such as university addresses, apartment complex addresses, and condominium addresses are not being added to the bad-values blocklist.

Tip

Stitch processes addresses in a bad-values blocklist in this order:

  1. Each record is normalized, and the address and address2 are concatenated together into a single address field.

  2. Each blocklist entry is processed for the normalized address.

  3. Values are removed from the preprocessed record if they occur in any preprocessed blocklist record.

A blocklist cannot define independent address and address2 semantics. When the values in address2 must be part of the bad-values blocklist, the value of the address in its post-processed concatenated state is the best value to use in the blocklist.

An effective bad-values blocklist for address often requires tuning and validation of the results to ensure that the right level of values are removed from the data. Start with a high threshold–at least “40”, but higher if necessary–for address, verify the results, and then adjust the threshold until the desired level of accuracy is achieved. Use an Internet search to help verify each address that is blocklisted as part of the verification process.

When the bad-values blocklist is applied to address keep in mind that it also considers city and state along with address before determining if the threshold is met. This group–address, city, and state–does not replace the address value in the Stitch_BadValues table. The same address value may appear multiple times for each city and state pair. When ordinals are used with the address semantic tag, the address group for each ordinal is checked.

address2 fields

The address field may be used in the bad-values blocklist as part of a complete, normalized address.

Caution

The fields associated with this semantic can follow many patterns and contain many types of values. Applying a bad-values blocklist to them is more difficult and the results are less accurate.

Important

The address2 field should never be used in the bad-values blocklist in isolation.

In certain situations, the address2 field is the source of a value that may need to be added to the bad-values blocklist. There are two ways to do this:

  1. Remove a single address.

  2. Concatenate address and address2.

Caution

Both approaches require visual inspection of the results to verify that the blocklist values are being applied correctly and only to the intended fields and not to a more global set of fields.

Remove address

To remove a single address from post-processed output, add SQL similar to the following to the bad-values blocklist query to remove a single address value. Be sure to use the post-processed, normalized format for the data as the value to be removed:

semantic="address", value="123 MAIN ST STE 500"

Concatenate address

To concatenate the address and address2 fields in the bad-values blocklist, update the following line in the bad_addresses block from:

UPPER(address) AS value

to:

CASE
  WHEN address2 IS null
  THEN UPPER(address)
  ELSE UPPER(CONCAT(address,' ',address2))
END AS value

Caution

This approach should not be used when the address field is known to contain a high volume of bad addresses that should be removed, and also the address2 fields within that subset of records contains a variety of junk values. Updating the bad_address block in this scenario may cause addresses to be missed as the blocklist values are applied.

Custom CSV files

You may create a CSV file to use for custom bad-value blocklists. The table structure of this CSV file must be contain the following columns:

  • datasource

  • semantic

  • value

and then must contain a unique value per row, associated to a single Amperity semantic. Use a wildcard value (*) to associate a value with all data sources:

----- ------------ ----------- ----------------------------
 row   datasource   semantic    value
----- ------------ ----------- ----------------------------
 1     ACME         email       internaltest@matchstik.com
 2     ACME         email       externaltext@matchstik.com
 3     ACME         email       foo@matchstik.com
 4     ACME         phone       555-555-5555
 5     ACME         phone       555-1212
 6     *            phone       867-5309
 7     ACME         phone       111-111-1111
 8     *            address     One Infinite Loop
 9     ACME         fk          172fdr9HpTafWNpQmyVR
----- ------------ ----------- ----------------------------

The CSV file itself would be similar to:

datasource,semantic,value
ACME,email,internaltest@matchstik.com
ACME,email,externaltext@matchstik.com
ACME,email,foo@matchstik.com
ACME,phone,555-555-5555
ACME,phone,555-1212
*,phone,867-5309
ACME,phone,111-111-1111
*,address,One Infinite Loop
ACME,fk,172fdr9HpTafWNpQmyVR

To use a custom CSV file to blocklist birthdates

  1. Use a custom CSV file to blocklist birthdates that exist across multiple data sources. For example:

    datasource,semantic,value
    *,birthdate,0001-01-03
    *,birthdate,0001-01-01
    *,birthdate,1901-12-13
    
  2. Import this CSV file as a feed.

  3. Set the semantic and value fields as a primary key.

  4. Uncheck the Make available to Stitch setting.

Tip

This example shows how to use a CSV file to provide custom values that should be blocklisted. However, adding birthdates to a custom blocklist is often unnecessary because Amperity will not use birthdate values to cluster two individuals with distinct names. The likelihood of distinct individuals with very similar PII, such as both having the same name and email address, along with having the same birthdate is very low.

Before adding values to a custom blocklist, first verify if any of these values have created false-positive clusters in the Stitch results.

Custom domain tables

The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores (_) instead of a colon (:) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores and any related custom domain tables, and may return NULL values.

If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:

  • bad_addresses

  • bad_emails

  • bad_phones

For each SELECT statement, change:

REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,

to:

COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,

This update allows these SELECT statements to continue using a regular expression to find domain tables, and then use * to find custom domain tables and prevents NULL values from being returned.

External blocklists

A blocklist does not need to be round-tripped as CSV output from the customer 360 database back into Amperity via a feed to a domain table. A blocklist may come from any source, including external sources that contain known bad values. The only requirement for this approach is the table structure must contain fields for datasource, semantic, and values.

Update existing blocklist

You can upload new data to an existing blocklist feed as long as the schema for the uploaded data matches the schema that already exists in the blocklist feed. Select the feed menu options, and then choose Load new data. In the dialog box, select the Upload new option and click Upload to select a file. The table may be truncated on load. Then load the file.

Derived semantics

A bad-values blocklist works best for email addresses, physical addresses, and phone numbers. That said, other Amperity PII semantics may be added to the blocklist.

The bad-values blocklist operates against pre-processed semantic rows, and not against raw customer data. When adding blocklist values, consider which semantics are best to target rather than which fields to target, even if they are often the same.

Caution

Blocklist values are pre-processed using the same semantic configuration as the targeted data source. This is intentional, in part to ensure the individual who creates the bad-values blocklist does not have to understand what the customer data will look like after it has been pre-processed.

This can be a problem with derived semantics.

A derived semantic is inferred from existing data. For example, “given-name” and “surname” semantics are from “full-name”. “gender” is from “title”. Amperity creates derived semantics automatically to ensure that individual semantics are available to Stitch during identity resolution.

If the targeted data source is configured to derive given-name and surname semantics from the full-name semantic, values in the full-name semantic should never be added to a bad-values blocklist.

For example, if “BLACK BLACK” were added to the bad-values blocklist via the full-name semantic, then “BLACK” for both given-name and surname semantics would be blocklisted too. This type of situation is not common, but must be considered when deciding which semantics to use for building the bad-values blocklist.

SQL query example

This SQL query provides a working example of a bad-values blocklist that looks for common values that exceed defined thresholds for physical addresses, email addresses, and phone numbers. Use this as a template for defining a bad-values query against your data.

  1-- QUERY NAME: Blocklist Values
  2-- DESCRIPTION: Use this query to verify the presence of blocklisted values.
  3-- REQUIREMENTS: Configure for semantics and use `proxy` to assess the quality
  4-- of the bad values returned by the query. The recommended proxy is `given_name`.
  5-- OPTIONS:
  6-- NOTES:
  7
  8
  9WITH bad_addresses AS (
 10  SELECT *
 11  FROM (
 12    SELECT
 13      CASE
 14        WHEN address2 IS NULL
 15        THEN UPPER(address)
 16        ELSE UPPER(CONCAT(address,' ',address2))
 17      END AS value
 18      ,UPPER(REPLACE(CONCAT(
 19        COALESCE(address,''),
 20        COALESCE(address2,''),
 21        COALESCE(city,''),
 22        COALESCE(state,'')),
 23        ' ')) AS measured_value
 24      ,'address' AS semantic
 25      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
 26      ,datasource AS domain_table
 27      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
 28    FROM Unified_Coalesced
 29  )
 30  WHERE value IS NOT NULL
 31  AND value <> ''
 32)
 33
 34,bad_emails AS (
 35  SELECT *
 36  FROM (
 37    SELECT
 38      UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
 39      ,'email' AS semantic
 40      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
 41      ,datasource AS domain_table
 42      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
 43    FROM Unified_Coalesced
 44  )
 45  WHERE value IS NOT NULL
 46  AND value <> '')
 47
 48,bad_phones AS (
 49  SELECT *
 50  FROM (
 51    SELECT
 52      UPPER(phone) AS value
 53      ,'phone' AS semantic
 54      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
 55      ,datasource AS domain_table
 56      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
 57    FROM Unified_Coalesced
 58  )
 59  WHERE value IS NOT NULL
 60  AND value <> '')
 61
 62SELECT
 63  *
 64  ,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
 65FROM (
 66  SELECT
 67    datasource
 68    ,semantic
 69    ,value
 70    ,SUM(num_values) AS num_values
 71    ,SUM(num_proxy) AS num_proxy
 72    ,domain_table
 73  FROM (
 74    SELECT
 75      '*' AS datasource
 76      ,semantic
 77      ,value
 78      ,measured_value
 79      ,COUNT(value) AS num_values
 80      ,COUNT(DISTINCT proxy) AS num_proxy
 81      ,'' AS domain_table
 82    FROM bad_addresses
 83    GROUP BY 1,2,3,4,7
 84    HAVING COUNT(DISTINCT proxy) > 40
 85  )
 86  GROUP BY 1,2,3,6
 87
 88  UNION
 89
 90  SELECT
 91    '*' AS datasource
 92    ,semantic
 93    ,value
 94    ,COUNT(value) AS num_values
 95    ,COUNT(DISTINCT proxy) AS num_proxy
 96    ,'' AS domain_table
 97  FROM bad_emails
 98  GROUP BY 1,2,3,6
 99  HAVING COUNT(DISTINCT proxy) > 20
100
101  UNION
102
103  SELECT
104    '*' AS datasource
105    ,semantic
106    ,value
107    ,COUNT(value) AS num_values
108    ,COUNT(DISTINCT proxy) AS num_proxy
109    ,'' AS domain_table
110  FROM bad_phones
111  GROUP BY 1,2,3,6
112  HAVING COUNT(DISTINCT proxy) > 20
113)