Bad-values blocklist

Some values can be very common in data sets. These values may have been entered into the data for the purpose of testing, but often they were entered with a goal of intentionally providing bad data. This pattern is especially common with phone numbers, email addresses, and physical addresses. For example, bad phone numbers like 555-555-5555, 555-1212, and 867-5309 may be provided by individuals instead of their real phone number.

Amperity should be configured to ignore these types of fake values when they appear in columns that are associated with certain semantic values for PII data. Especially when they occur at a very high frequency in customer data. By ignoring these values, this prevents the Stitch process from assigning Amperity IDs as if these fake values were real values.

Important

The bad-values blocklist supports the following PII semantics: given-name, surname, email, phone, and address (which is handled as a combination of fields to which the address, address2, city, state, and postal semantic tags are applied). You do not have to implement all of these semantics as part of the query template for automated blocklist values. Automated blocklist values will not work for address2 (standalone), birthdate, city (standalone), company, full-name, gender, generational-suffix, postal (standalone), state (standalone), title, or for any custom semantic.

Automatic bad value detection

Amperity is configured to automatically apply blocklists to email addresses, phone numbers, and physical addresses when a bad value is discovered at a frequency that exceeds the defined threshold.

The default configuration is:

:amperity.stitch.settings/badvalues-config [
{:threshold 20, :proxy "given-name", :semantic "email"}
{:threshold 20, :proxy "surname", :semantic "email"}
{:threshold 20, :proxy "given-name", :semantic "phone"}
{:threshold 40, :proxy "given-name", :semantic "address"}]

How the default configuration works:

  1. An email address is added to the bad-values blocklist when the same email addresses is associated with more than 20 distinct given names.

  2. A phone number is added to the bad-values blocklist when the same phone number is associated with more than 20 distinct given names.

  3. A physical addresses is added to the bad-values blocklist when the same physical addresses is associated with more than 40 distinct given names.

The threshold values can be increased or decreased as needed for your tenant.

You can add blocklist items using the following syntax:

{:threshold 00, :proxy "semantic-name", :semantic "semantic-name"}

Disable automatic bad value detection

You can disable automatic bad-value blocklists by editing the configuration to be empty square brackets:

:amperity.stitch.settings/badvalues-config []

Stitch_BadValues table

The Stitch BadValues table contains all of the values that were added to the bad-values blocklist.

How to understand rows in the Stitch_BadValues table

For a configuration of:

{:threshold 20, :proxy "given-name", :semantic "email"}

A row in the Stitch BadValues table may contain the following values:

----- ---------- ----------- ------------ -----------  -----
 ...   semantic   value       num_values   num_proxy    ...
----- ---------- ----------- ------------ -----------  -----
       email      a@aol.com   189          91
----- ---------- ----------- ------------ -----------  -----

This means that the email address “a@aol.com” was discovered 189 times with 91 different given-name proxies, which means this email address is associated with a large number of individuals. This exceeds the threshold of “20” and is added to the bad-values blocklist.

The Stitch BadValues table contains the following columns:

Column name

Data type

Description

Datasource

String

The name of the data source from which the value originated.

Semantic

String

The semantic tag that is associated with the value that was added to the bad-values blocklist.

Value

String

The value that was added to the bad-values blocklist.

Num Values

String

The number of times the value appeared in the data.

Num Proxy

String

The number of proxies to which the value was associated. When this number exceeds the threshold defined for in the blocklist a value is added to the bad-values blocklist.

Domain Table

String

The domain table from which the value originated.

Is Preprocessed

String

Indicates if the value was changed during pre-processing by Stitch.

For example, phone numbers often have dashes within their values. Dashes are removed by Stitch during pre-processing. When this column is true, and a phone number value is “12065551212”, the original value for that phone number in the associated datasource may be “1-206-555-1212”.

Global blocklist values

Amperity uses a global blocklist to automatically remove a set of known bad values from the data to prevent them from being used to identify clusters.

Semantic

Values

All semantics

The following values are ignored by Stitch when performing identity resolution:

  • BLOCKED

  • COLLECT

  • DECLINE

  • DECLINED

  • DECLINES

  • DENIED

  • EMPTY

  • INFO

  • N/A

  • NAME

  • NONE

  • NONE GIVEN

  • NO VALUE

  • NOT AVAILABLE

  • NOT GIVEN

  • NOTAVAILABLE

  • NULL

  • REFUSED

  • TBD

  • TDB

  • TRAVEL

  • UNKNOWN

address

The following values associated with the address semantic are ignored by Stitch when performing identity resolution:

  • GENERAL DELIVERY

  • NEED ADDRESS

  • NO ADDRESS GIVEN

  • P O BOX

  • PO BOX

  • UNLISTED

  • YOUR STREET ADDRESS

birthdate

The following values associated with the birthdate semantic are ignored by Stitch when performing identity resolution:

  • 1899-12-31

  • 1900-01-01

  • 1920-01-01

city

The following values associated with the city semantic are ignored by Stitch when performing identity resolution:

  • NA

  • NAN

  • NEED CITY

  • NO

  • NO ADDRESS

email

Stitch will automatically blocklist email address values that do not contain an @ symbol.

The following values associated with the email semantic are ignored by Stitch when performing identity resolution:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 0000000000

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • ADMIN@

  • BOOKING@

  • CLIENT@

  • CLIENTS@

  • CONFIRMATION@

  • CONFIRMATIONS@

  • CONTACT@

  • CUSTOMERSERVICE

  • CUSTOMERSERVICE@

  • CUSTOMERSERVICES

  • CUSTOMERSERVICES@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • EMAIL@

  • @EMAIL.TST

  • EXAMPLE@

  • FAKENAME@

  • GUEST@

  • GUESTS@

  • HELP@

  • HELPS@

  • HOTELHELP@

  • HOTELPARTNER@

  • HOTELPARTNERS@

  • INFO@

  • JUNK@

  • MAIL@

  • ME@

  • N@A

  • NAME@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • ONLINERESERVATION

  • ONLINERESERVATION@

  • ONLINERESERVATIONS

  • ONLINERESERVATIONS@

  • OPERATION@

  • OPERATIONS@

  • QUERIES@

  • QUERY@

  • REFUSED@

  • RES@

  • RESERVAS

  • RESERVATION@

  • RESERVATIONS@

  • ROOMRESERVATION@

  • ROOMRESERVATIONS@

  • SAMPLE@

  • SAMPLES@

  • SERVICE@

  • SHOP@

  • TEST@

  • TESTING@

  • TESTEMAIL@

  • TRAVEL@

  • TRAVELS

  • VENDOR@

  • VENDORS@

  • XXX@

The values in bold are always ignored.

Warning

Stitch may be configured to ignore a partial list of generic email addresses. When Stitch is configured to ignore a partial list of generic email addresses, additional work to update the bad-values blocklist may be necessary.

full-name

The following values associated with the full-name semantic are ignored by Stitch when performing identity resolution:

  • DEFAULT CUSTOMER

  • PREPAID CARDHOLDER

  • RED CARD

  • RED CARDS

  • RESERVED GUEST

  • RESERVED GUESTS

  • TEST TEST

given-name

The following values associated with the given-name semantic are ignored by Stitch when performing identity resolution:

  • BLOCK

  • CONTACT

  • GUEST

  • GUESTS

  • NO NAME

  • RESERVED

  • USE

phone

The following values associated with the phone semantic are ignored by Stitch when performing identity resolution:

  • 0000000000

  • 0000000001

  • 1111111111

  • 1234567890

  • 1234567891

  • 2222222222

  • 3333333333

  • 4444444444

  • 5555555555

  • 6666666666

  • 7777777777

  • 8888888888

  • 9999999999

postal

The following values associated with the postal semantic are ignored by Stitch when performing identity resolution:

  • 00000

  • 11111

  • 1111

  • 111

  • 11

  • 1

  • NA

  • NAN

  • NO

state

The following values associated with the state semantic are ignored by Stitch when performing identity resolution:

  • NA

  • NAN

  • NEED STATE

  • NO

surname

The following values associated with the surname semantic are ignored by Stitch when performing identity resolution:

  • CONTACT

  • GUESTS

  • NO NAME

  • RESERVED

  • USE

Managed bad-value blocklists

A bad-values blocklist contains known values that appear frequently in data and should be excluded from the Stitch process.

In addition to using the default behavior of the bad-values blocklist, you may configure custom workflows for managing bad-values in your tenant’s blocklist:

  1. Add a query for automated blocklist values.

  2. Run the query, and then validate the results.

  3. Download the results as a CSV file.

  4. Add a feed to ingest the CSV file.

  5. Run Stitch.

  6. Edit Unified Coalesced table.

  7. Update the Merged Customers table.

  8. Update Stitch QA queries.

  9. Automate blocklist.

Note

These steps describe how to add a bad-values blocklist. They show examples for only the email and phone semantics. Use the example SQL query if you want to add the address semantic. If you want to also add given-name and surname they both follow the same pattern, but be sure to associate the proxy for given-name to some other semantic to ensure the validation process can compare the results alongside some other column.

Add SQL query

A query is SQL that is run from the Queries page against database tables in the Customer 360 page. A query returns a refined and filtered subset of useful customer data.

The easiest way to build a bad-values blocklist is to define a SQL query that returns a list of values associated with specific Amperity semantics. This list should only return common values that appear over a certain threshold.

This query must do the following:

  • Use SELECT to get data from a table in the customer 360 database.

  • Specify the associated Amperity semantic for that column.

  • Specify the column in which potential bad values are located.

  • Apply a count threshold, over which a value is returned for inclusion in the bad-values blocklist.

  • Return the data into columns named datasource, semantic, and values, one unique value per row.

Important

Only the PII semantic fields email, phone, given-name, surname, and address will have corresponding _blv columns added to the Unified Coalesced table, but any semantic may be added to the blocklist.

The name of the query should contain the word “blocklist” to help clearly identify its purpose. For example: “Bad-values Blocklist”.

For example, a SQL query that returns values that appear more than 20 times for email and phone is similar to:

WITH

bad_emails AS (
  SELECT * FROM (
    SELECT
      UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
      ,'email' AS semantic
      ,datasource AS domain_table
      ,UPPER(given_name) AS proxy
    FROM Unified_Coalesced)
  WHERE value IS NOT null
  AND value <> ''),

bad_phones AS (
  SELECT * FROM (
    SELECT
      UPPER(phone) AS value
      ,'phone' AS semantic
      ,datasource AS domain_table
      ,UPPER(given_name) AS proxy
    FROM Unified_Coalesced)
  WHERE value IS NOT null
  AND value <> '')

SELECT
  *
  ,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
FROM (
  SELECT
    '*' AS datasource
    ,semantic
    ,value
    ,COUNT(value) AS num_values
    ,COUNT(DISTINCT proxy) AS num_proxy
    ,'' AS domain_table
  FROM bad_emails
  GROUP BY 1,2,3,6
  HAVING COUNT(DISTINCT proxy) > 20

  UNION

  SELECT
    '*' AS datasource
    ,semantic
    ,value
    ,COUNT(value) AS num_values
    ,COUNT(DISTINCT proxy) AS num_proxy
    ,'' AS domain_table
  FROM bad_phones
  GROUP BY 1,2,3,6
  HAVING COUNT(DISTINCT proxy) > 20

  ORDER BY datasource, semantic, num_proxy desc
)

Run query

Run the query from the SQL Query Editor to validate the syntax and to verify the output. Fix any errors that may be returned. If the values in the returned output don’t seem correct, update the threshold for values counts to see if that improves the results.

Tip

The goal with a bad-values blocklist isn’t to catch every single bad value, but rather to remove from the Stitch process the most comon bad values.

Download CSV

After the bad-values blocklist query has been validated and run successfully, download the results as a CSV file.

The bad-values blocklist output is a table similar to:

----- ------------ ------------ ----------------------------
 row   datasource   semantic     value
----- ------------ ------------ ----------------------------
 1     MTK          email        internaltest@matchstik.com
 2     MTK          phone        555-555-5555
 4     MTK          fk           172fdr9HpTafWNpQmyVR
----- ------------ ------------ ----------------------------

Each row associates a bad value to an Amperity semantic and a customer data source. The CSV file itself would be similar to:

datasource,semantic,value
MTK,email,internaltest@matchstik.com
MTK,phone,555-555-5555
MTK,fk,172fdr9HpTafWNpQmyVR

Each row associates a bad value to an Amperity semantic and a customer data source.

Add feed

A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates that column contains customer profile (PII) and transactions data.

To add a feed for blocklist values

  1. Click the Sources page.

  2. Add a data source named “Amperity” if one does not already exist.

  3. Add a feed named “Blocklist”.

  4. Upload the bad-values blocklist CSV file that was previously downloaded from the Queries page.

  5. Set all field types to string.

  6. Set labels for fields in the CSV file.

    The downloaded CSV file for the bad-values blocklist has three columns: datasource, semantic, and value.

    These columns must be associated with specific labels in the feed for the bad-values blocklist. When these columns are associated in a feed, the SEMANTIC must be mapped to specific labels:

    The following labels are required:

    Incoming Field

    Blocklist Semantic

    datasource

    blv/data-source

    semantic

    blv/semantic

    value

    blv/value

    Note

    This is different from how semantics are applied to a non-blocklist feed. This is because the actual data source, actual semantic, and actual values are in the data source itself, as they were output from the customer 360 data as CSV data, and then re-ingested as CSV data in the feed created for the bad-values blocklist. This, effectively, round-trips output from the customer 360 database as a CSV file that creates a new domain table for use during the Stitch process.

  7. Define the primary key. This is required by the Stitch process, but for the purpose of creating the bad-values blocklist is not important. Use any row ID.

  8. Uncheck the Make available to Stitch option.

  9. Activate the feed.

Run Stitch

Run the Stitch process to update the results for the bad-values blackist.

  1. On the Stitch page, click Run.

  2. Re-run each data table in the customer 360 database that contains data that could be affected by the bad-values blocklist. This will add the results of the most recent Stitch run to these data tables, including adding a series of columns that indicate the presence of bad-values.

    The has_blv column indicates if blocklist values for address, email, phone, given-name, or surname are present in customer records.

Edit Unified Coalesced table

The Unified Coalesced table contains all the PII data that has been processed through Stitch. The data is organized by the semantic tag as the column heading and then coalesced into one single table. A unique Amperity ID may appear in more than one row.

The Unified Coalesced table is preconfigured in the Stitch QA database. When a passthrough table is added to a database all of the columns in that table are selected by default.

Important

If a column is removed from the Unified Coalesced table, it will be removed automatically from the passthrough table. If a column is added to the Unified Coalesced table, you must edit the passthrough table in the Stitch QA database to select the added columns.

For example, the bad-values blocklist is typically configured after the initial customer 360 and Stitch QA databases. After the bad-values blocklist is configured, a collection of columns are added to the Unified Coalesced table that is output by Stitch. These columns are not enabled automatically within the Unified Coalesced table. You must edit the Unified Coalesced table, select these columns, activate the table, and then rerun the Stitch QA database to ensure that bad-values blocklist data is available to the Stitch QA database.

To edit the Unified Coalesced table

  1. From the Customer 360 page, under All Databases, open the menu for the Stitch QA database, and then select View. The Database Editor page opens.

  2. Under Database Tables, open the menu for the Unified Coalesced table, and then select Edit. The Database Table page opens.

  3. Click Next to open the Database Table page, and then find and select the following columns: blv_address, blv_email, blv_given_name, blv_phone, and blv_surname.

  4. Click Save. This returns you to the Database Editor page for the Stitch QA database.

  5. Click Activate. Run the database.

Tip

If these changes also affect the Merged Customers table, update both tables before running the Stitch QA database.

Update Merged Customers table

When using the bad-values blocklist, you must update the Merged Customers table to include the bad-values blocklist items in the merge rules.

To update the Merged Customers table

  1. From the Customer 360 page, under All Databases, open the menu for the customer 360 database, and then select Edit.

  2. Open the menu for the Merged Customers table and select Edit.

  3. Find the Unified_Prioritized section, and then update the first value blocks for email and phone to add the .blv entries:

    ,FIRST_VALUE(email_struct)
      OVER (
        PARTITION BY amperity_id
        ORDER BY email_struct.completion DESC
                 ,email_struct.blv
                 ,email_struct.priority
                 ,email_struct.update_dt DESC
                 ,email_struct.pk
      ) AS email_struct
    
    ,FIRST_VALUE(phone_struct)
      OVER (
        PARTITION BY amperity_id
        ORDER BY phone_struct.completion DESC
                 ,phone_struct.blv
                 ,phone_struct.priority
                 ,phone_struct.update_dt DESC
                 ,phone_struct.pk
      ) AS phone_struct
    
  4. Find the SELECT statement that builds the Merged Customers table, and then add the columns for email_internal:

    ,up.email_struct.email
    ,up.email_struct.pk AS `email_pk`
    ,up.email_struct.update_dt AS `email_update_dt`
    ,up.email_struct.datasource AS `email_datasource`
    ,up.email_struct.priority AS `email_priority`
    ,up.email_struct.blv AS `email_blv`
    ,up.email_struct.completion AS `email_completion`
    
    ,up.phone_struct.phone
    ,up.phone_struct.pk AS `phone_pk`
    ,up.phone_struct.update_dt AS `phone_update_dt`
    ,up.phone_struct.datasource AS `phone_datasource`
    ,up.phone_struct.priority AS `phone_priority`
    ,up.phone_struct.blv AS `phone_blv`
    ,up.phone_struct.completion AS `phone_completion`
    
  5. Validate the query.

  6. Click Next. This opens the Database Table Definition page.

  7. Verify these settings, and then click Save.

  8. Run the customer 360 database.

Note

If you add other PII semantics to the bad-values blocklist, such as for addresses, be sure to make the same changes to the structs.

Add Stitch BlocklistValues table

Important

This table cannot be added to the Stitch QA database until after the bad-values blocklist has been configured.

To add the Stitch BlocklistValues table

  1. From the Customer 360 page, under All Databases, select the menu for the Stitch QA database, and then click Edit.

  2. From the Database Editor, click Add Table.

  3. Name the table “Stitch_BlocklistValues”.

  4. Set the build mode to Passthrough.

  5. From the Source Table drop-down, select the table associated with the bad-values blocklist.

  6. Click Next.

  7. Add a table description. This enables a tooltip that is visible from other areas in Amperity.

  8. Leave everything else unchanged.

  9. Click Save.

  10. Activate, and then run the Stitch QA database.

Update Stitch QA queries

Update the detailed examples Stitch QA query to uncomment all of the fields that contain _blv, and then run it to verify the results.

The following Stitch QA queries can be configured to exclude values for the bad-values blocklist:

  1. Common values

  2. Detailed examples

  3. Many Amperity IDs

  4. Split clusters

  5. Unmatched semantic values

After configuring the bad-values blocklist, examine each of the Stitch QA queries and determine which bad values (if any) should be excluded from the query.

Automate blocklist

The bad-values blocklist must be refreshed on a daily basis. To do this, configure a destination to send the results of the bad-values blocklist query to the cloud-based storage location that is included with your tenant – Amazon S3 or Microsoft Azure Blob Storage. Update your courier to pull the bad-values blocklist from that location, and then configure the end-to-end workflow to run automatically.

Tip

The following settings are recommended whenever the bad-values blocklist belongs to a courier group:

  1. Set Notify when missing? to “Disabled”.

  2. Set Abort when missing? to “Disabled”.

  3. Configure data to load 1 day older than the scheduled date and time.

Advanced options

There are more ways to do blocklists:

Per-data source

You can create a group of bad-values queries, with one query per data source. This allows you to tune the SQL query results to a specific data set for both semantic associations and thresholds. Use the existing SQL query as a template and the create a unique query for each bad-values blocklist you want to build based on results in the customer 360 database.

Per-database

You could use a single SQL query that is run against the customer 360 database to return a bad-values blocklist that is global and not specific to a single data table. Download that as a CSV file. Then follow the same steps.

Expand returned values

In situations where multiple values exist for phone and/or email addresses and the field in the Unified Coalesced table is a comma-separated concatenation of all values, the automated blocklist SQL query may fail to catch individual bad values for phones or email addresses because the query looks at occurrances of the whole (concatenated) value.

Use the UNNEST clause in the SQL segment to expand phone and/or email addresses so that the query looks at individual bad values. The following example shows using an UNNEST clause to expand email addresses that are part of a bad-values blocklist SQL query:

SELECT
  REGEXP_REPLACE(v.value,'\+.*@','@') AS value
  ,M.semantic
  ,M.datasource
  ,M.domain_table
  ,M.proxy
FROM (
  SELECT DISTINCT
    SPLIT(UPPER(email), ',') AS vs
    ,'email' AS semantic
    ,REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource
    ,datasource AS domain_table
    ,UPPER(given_name) AS proxy
  FROM Unified_Coalesced AS UC
) AS M
CROSS JOIN UNNEST(vs) AS v(value)
WHERE value IS NOT NULL
AND value <> ''),

Addresses

The profile (PII) semantics to which a bad-values blocklist can be applied are email, phone, given-name, surname, and address.

When address is added to the bad-values blocklist, be sure to verify that real values, such as university addresses, apartment complex addresses, and condominium addresses are not being added to the bad-values blocklist.

Tip

Stitch processes addresses in a bad-values blocklist in this order:

  1. Each record is normalized, and the address and address2 are concatenated together into a single address field.

  2. Each blocklist entry is processed for the normalized address.

  3. Values are removed from the preprocessed record if they occur in any preprocessed blocklist record.

A blocklist cannot define independent address and address2 semantics. When the values in address2 must be part of the bad-values blocklist, the value of the address in its post-processed concatenated state is the best value to use in the blocklist.

An effective bad-values blocklist for address often requires tuning and validation of the results to ensure that the right level of values are removed from the data. Start with a high threshold (at least “40”, but higher if necessary) for address, verify the results, and then adjust the threshold carefully until the desired level of accuracy is achieved. Use an Internet search to help verify each address that is blocklisted as part of the verification process.

When the bad-values blocklist is applied to address keep in mind that it also considers city and state along with address before determining if the threshold is met. This group—address, city, and state—does not replace the address value in the Stitch_BadValues table; the same address value may appear multiple times for each city and state pair. When ordinals are used with the address semantic tag, the address group for each ordinal is checked.

address2 fields

The address field may be used in the bad-values blocklist as part of a complete, normalized address. This should be done carefully because the fields associated with this semantic can follow many patterns and contain many types of values so applying a bad-values blocklist to them is more difficult and the results are less accurate.

Important

The address2 field should never be used in the bad-values blocklist in isolation.

In certain situations, the address2 field is the source of a value that may need to be added to the bad-values blocklist. There are two ways to do this:

  1. Remove a single address.

  2. Concatenate address and address2.

Caution

Both approaches require visual inspection of the results to verify that the blocklist values are being applied correctly and only to the intended fields and not to a more global set of fields.

Remove address

To remove a single address from post-processed output, add SQL similar to the following to the bad-values blocklist query to remove a single address value. Be sure to use the post-processed, normalized format for the data as the value to be removed:

semantic="address", value="123 MAIN ST STE 500"

Concatenate address

To concatenate the address and address2 fields in the bad-values blocklist, update the following line in the bad_addresses block from:

UPPER(address) AS value

to:

CASE
  WHEN address2 IS null
  THEN UPPER(address)
  ELSE UPPER(CONCAT(address,' ',address2))
END AS value

Caution

This approach should not be used when the address field is known to contain a high volume of bad addresses that should be removed, and also the address2 fields within that subset of records contains a variety of junk values. Updating the bad_address block in this scenario may cause addresses to be missed as the blocklist values are applied.

Custom CSV files

You may create a CSV file to use for custom bad-value blocklists. The table structure of this CSV file must be contain the following columns:

  • datasource

  • semantic

  • value

and then must contain a unique value per row, associated to a single Amperity semantic. Use a wildcard value (*) to associate a value with all data sources:

----- ------------ ----------- ----------------------------
 row   datasource   semantic    value
----- ------------ ----------- ----------------------------
 1     ACME         email       internaltest@matchstik.com
 2     ACME         email       externaltext@matchstik.com
 3     ACME         email       foo@matchstik.com
 4     ACME         phone       555-555-5555
 5     ACME         phone       555-1212
 6     *            phone       867-5309
 7     ACME         phone       111-111-1111
 8     *            address     One Infinite Loop
 9     ACME         fk          172fdr9HpTafWNpQmyVR
----- ------------ ----------- ----------------------------

The CSV file itself would be similar to:

datasource,semantic,value
ACME,email,internaltest@matchstik.com
ACME,email,externaltext@matchstik.com
ACME,email,foo@matchstik.com
ACME,phone,555-555-5555
ACME,phone,555-1212
*,phone,867-5309
ACME,phone,111-111-1111
*,address,One Infinite Loop
ACME,fk,172fdr9HpTafWNpQmyVR

To use a custom CSV file to blocklist birthdates

  1. Use a custom CSV file to blocklist birthdates that exist across multiple data sources. For example:

    datasource,semantic,value
    *,birthdate,0001-01-03
    *,birthdate,0001-01-01
    *,birthdate,1901-12-13
    
  2. Import this CSV file as a feed.

  3. Set the semantic and value fields as a primary key.

  4. Uncheck the Make available to Stitch setting.

Tip

This example shows how to use a CSV file to provide custom values that should be blocklisted. However, adding birthdates to a custom blocklist is often unnecessary because Amperity will not use birthdate values to cluster two individuals with distinct names. The likelihood of distinct individuals with very similar PII, such as both having the same name and email address, along with having the same birthdate is very low.

Before adding values to a custom blocklist, first verify if any of these values have created false-positive clusters in the Stitch results.

Custom domain tables

The bad-values blocklist uses a regular expression to identify domain tables. Domain tables are built using a source:feed pattern, whereas custom domain tables use a SQL-safe pattern that uses underscores (_) instead of a colon (:) as a delimiter. When custom domain table names are present, the default regular expression will not identify the underscores (and any related custom domain tables), and may return NULL values.

If a blocklist returns NULL values and if custom domain tables are present, update the regular expression in the SELECT statements for the following sections:

  • bad_addresses

  • bad_emails

  • bad_phones

For each SELECT statement, change:

REGEXP_EXTRACT(datasource, '.+?(?=:)') AS datasource,

to:

COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource,

This update will allow these SELECT statements to continue using a regular expression to find domain tables, and then use * to find custom domain tables and will prevent NULL values from being returned.

External blocklists

A blocklist does not need to be round-tripped as CSV output from the customer 360 database back into Amperity via a feed to a domain table. A blocklist may come from any source, including external sources that contain known bad values. The only requirement for this approach is the table structure must contain fields for datasource, semantic, and values.

Update existing blocklist

You can upload new data to an existing blocklist feed as long as the schema for the uploaded data matches the schema that already exists in the blocklist feed. Select the feed menu options, and then choose Load new data. In the dialog box, select the Upload new option and click Upload to select a file. The table may be truncated on load. Then load the file.

Derived semantics

A bad-values blocklist works best for email addresses, physical addresses, and phone numbers. That said, other Amperity PII semantics may be added to the blocklist.

The bad-values blocklist operates against pre-processed semantic rows, and not against raw customer data. When adding blocklist values, consider which semantics are best to target rather than which fields to target, even if they are often the same.

Caution

Blocklist values are pre-processed using the same semantic configuration as the targeted data source. This is intentional, in part to ensure the individual who creates the bad-values blocklist does not have to understand what the customer data will look like after it has been pre-processed.

This can be a problem with derived semantics.

A derived semantic is a semantic that can be inferred from existing data. For example, “given-name” and “surname” semantics can be inferred from “full-name” and “gender” can be inferred from “title”. Amperity creates derived semantics automatically to ensure the greatest number of individual semantics are available to Stitch during identity resolution.

If the targeted data source is configured to derive given-name and surname semantics from the full-name semantic, values in the full-name semantic should never be added to a bad-values blocklist.

For example, if “BLACK BLACK” were added to the bad-values blocklist via the full-name semantic, then “BLACK” for both given-name and surname semantics would be blocklisted too. This type of situation is not common, but must be considered when deciding which semantics to use for building the bad-values blocklist.

SQL query example

This SQL query provides a working example of a bad-values blocklist that looks for common values that exceed defined thresholds for physical addresses, email addresses, and phone numbers. Use this as a template for defining a bad-values query against your data.

-- QUERY NAME: Blocklist Values
-- DESCRIPTION: Use this query to verify the presence of blocklisted values.
-- REQUIREMENTS: Configure for semantics and use `proxy` to assess the quality
-- of the bad values returned by the query. The recommended proxy is `given_name`.
-- OPTIONS:
-- NOTES:


WITH bad_addresses AS (
  SELECT *
  FROM (
    SELECT
      CASE
        WHEN address2 IS NULL
        THEN UPPER(address)
        ELSE UPPER(CONCAT(address,' ',address2))
      END AS value
      ,UPPER(REPLACE(CONCAT(
        COALESCE(address,''),
        COALESCE(address2,''),
        COALESCE(city,''),
        COALESCE(state,'')),
        ' ')) AS measured_value
      ,'address' AS semantic
      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
      ,datasource AS domain_table
      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
    FROM Unified_Coalesced
  )
  WHERE value IS NOT NULL
  AND value <> ''
)

,bad_emails AS (
  SELECT *
  FROM (
    SELECT
      UPPER(REGEXP_REPLACE(email,'(?:\.|\+.*)(?=.*?@.+\..+)','')) AS value
      ,'email' AS semantic
      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
      ,datasource AS domain_table
      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
    FROM Unified_Coalesced
  )
  WHERE value IS NOT NULL
  AND value <> '')

,bad_phones AS (
  SELECT *
  FROM (
    SELECT
      UPPER(phone) AS value
      ,'phone' AS semantic
      ,COALESCE(REGEXP_EXTRACT(datasource, '.+?(?=:)'), '*') AS datasource
      ,datasource AS domain_table
      ,UPPER(COALESCE(given_name, SPLIT(full_name,' ')[1])) AS proxy
    FROM Unified_Coalesced
  )
  WHERE value IS NOT NULL
  AND value <> '')

SELECT
  *
  ,TO_HEX(MD5(TO_UTF8(CONCAT(domain_table,semantic,value)))) AS pk
FROM (
  SELECT
    datasource
    ,semantic
    ,value
    ,SUM(num_values) AS num_values
    ,SUM(num_proxy) AS num_proxy
    ,domain_table
  FROM (
    SELECT
      '*' AS datasource
      ,semantic
      ,value
      ,measured_value
      ,COUNT(value) AS num_values
      ,COUNT(DISTINCT proxy) AS num_proxy
      ,'' AS domain_table
    FROM bad_addresses
    GROUP BY 1,2,3,4,7
    HAVING COUNT(DISTINCT proxy) > 40
  )
  GROUP BY 1,2,3,6

  UNION

  SELECT
    '*' AS datasource
    ,semantic
    ,value
    ,COUNT(value) AS num_values
    ,COUNT(DISTINCT proxy) AS num_proxy
    ,'' AS domain_table
  FROM bad_emails
  GROUP BY 1,2,3,6
  HAVING COUNT(DISTINCT proxy) > 20

  UNION

  SELECT
    '*' AS datasource
    ,semantic
    ,value
    ,COUNT(value) AS num_values
    ,COUNT(DISTINCT proxy) AS num_proxy
    ,'' AS domain_table
  FROM bad_phones
  GROUP BY 1,2,3,6
  HAVING COUNT(DISTINCT proxy) > 20
)