Stitch QA

Stitch QA is a process that monitors the quality of Stitch results. Stitch QA has three components: a Stitch QA database, a set of queries to run against the Stitch QA database, and a prescribed workflow to follow while analyzing the results of those queries.

Use these results to identify over- and underclustering, to identify values to be blocklisted or labeled, or to discover situations where tuning Stitch to match your tenant’s data is necessary.

Enable Stitch QA

The Stitch QA workflow requires a collection of database tables and a collection of queries that are run against those tables. These collections are added in two steps:

  1. Add Stitch QA database

  2. Add Stitch QA queries

Add Stitch QA database

The recommended way to add the Stitch QA database to your tenant is to use the “Stitch QA” database template.

To add the Stitch QA database

  1. From the Customer 360 page click Create Database.

  2. Name the database “Stitch QA”.

  3. From the Template drop-down, select Stitch QA.

  4. Click Create. This opens the Database Editor page and shows a list of preconfigured database tables.

  5. Under Database Type, ensure that Amp360 and AmpIQ are not selected.

  6. Click the Activate button.

  7. Open the menu for the activated Stitch QA database, and then select Run.

  8. Take a few minutes to review the tables that are added by the “Stitch QA” database template.

  9. Click Activate.

  10. From the Customer 360 page, under All Databases, open the menu for the Stitch QA database, select**Run Options**, Normal, and then Run.

The Stitch QA database is preconfigured with the following tables:

Tip

The following links open standalone pages for each table. This topic discusses how each of these tables fits within the recommended Stitch QA workflow.

Tip

The Unified Coalesced and Unified Preprocessed Raw tables are frequently used together during Stitch QA because many investigation patterns will compare the values in one table to the values in the other.

Add Stitch QA queries

The recommended way to add Stitch QA queries to your tenant is to use the “Stitch QA” query template. You can do this after you have added the Stitch QA database.

To add Stitch QA queries

  1. From the Queries page click Create, and then select Add Folder. This opens the Create Folder dialog box.

  2. Name the folder “Stitch QA”.

  3. From the Template drop-down, select Stitch QA.

  4. Click Create. This will create a folder, into which a collection of draft Stitch QA queries are added.

  5. Take a few minutes to review the queries that are added by the “Stitch QA” query template.

The Stitch QA folder is preconfigured with the following queries:

Tip

The following links open standalone pages for each query. This topic discusses how each of these queries fits within the recommended Stitch QA workflow.

Note

You should expect to customize and extend many of these queries to match the Stitch QA workflow requirements for your tenant. You may add queries to your Stitch QA workflow, as necessary.

Extend Stitch QA

The Stitch QA database should be extended to support the use of the bad-values blocklist and to JOIN the Unified Preprocessed Raw table to the Unified Coalesced table.

  1. Add Stitch BlocklistValues table.

  2. JOIN the Unified Preprocessed Raw table to Unified Coalesced table

Add Stitch BlocklistValues

Important

This table cannot be added to the Stitch QA database until after the bad-values blocklist has been configured.

To add the Stitch BlocklistValues table

  1. From the Customer 360 page, under All Databases, select the menu for the Stitch QA database, and then click Edit.

  2. From the Database Editor, click Add Table.

  3. Name the table “Stitch_BlocklistValues”.

  4. Set the build mode to Passthrough.

  5. From the Source Table drop-down, select the table associated with the bad-values blocklist.

  6. Click Next.

  7. Add a table description. This enables a tooltip that is visible from other areas in Amperity.

  8. Leave everything else unchanged.

  9. Click Save.

  10. Activate, and then run the Stitch QA database.

Update Unified Preprocessed Raw

The Unified Preprocessed Raw table is an output of Stitch. Use this table as part of Stitch QA to view data exactly how values were used by Stitch after preprocessing and blocklisting. This table contains one row for each trivial duplicate. Only the semantics used by Stitch are shown. Semantic values represent pre-processing values and/or bad values that were removed (and replaced by NULL).

You can compare cleaned data to uncleaned data to improve the quality of data in the customer 360 database.

  • The Unified Preprocessed Raw table contains a row from every table that is associated with an Amperity ID and contains the results of Stitch processing. Values in this table are “cleaned” data.

  • The Unified Coalesced table contains the data exactly as it was processed by Stitch. Values in this table are “uncleaned data”.

Use a query similar to the following to join fields in the Unified Preprocessed Raw table to the Unified Coalesced table, and then compare the results.

SELECT
  upr.amperity_id AS "upr_amperity_id"
  ,uc.has_blv
  ,uc.address AS "uc_address"
  ,upr.address AS "upr_address"
  -- ,uc.birthdate AS "uc_birthdate"
  -- ,upr.birthdate AS "upr_birthdate"
  ,uc.city AS "uc_city"
  ,upr.city AS "upr_city"
  ,uc.email AS "uc_email"
  ,upr.email AS "upr_email"
  ,upr.login_trimmed AS "upr_login_trimmed"
  -- ,uc.gender AS "uc_gender"
  -- ,upr.gender AS "upr_gender"
  ,uc.given_name AS "uc_given_name"
  ,upr.given_name AS "upr_given_name"
  ,uc.phone AS "uc_phone"
  ,upr.phone AS "upr_phone"
  ,upr.po_box AS "upr_po_box"
  ,uc.postal AS "uc_postal"
  ,upr.postal AS "upr_postal"
  ,uc.generational_suffix AS "uc_generational_suffix"
  ,upr.sk_generational_suffix AS "upr_sk_generational_suffix"
  ,upr.sk_given_name AS "upr_sk_given_name"
  ,uc.state AS "uc_state"
  ,upr.state AS "upr_state"
  ,uc.surname AS "uc_surname"
  ,upr.surname AS "upr_surname"
  ,upr.datasource AS "upr_datasource"
  ,upr.pk AS "pk"
  ,upr.supersized_id AS "upr_supersized_id"
  -- ,upr.component_id AS "upr_component_id"
  -- ,upr.int_id AS "upr_int_id"
FROM Unified_Coalesced uc
JOIN Unified_Preprocessed_Raw upr
ON uc.rep_ds=upr.datasource AND uc.rep_pk=upr.pk
WHERE UPPER(uc.address) <> upr.address
-- WHERE UPPER(uc.given_name) <> up.given_name
-- WHERE UPPER(uc.gender) <> upr.gender
-- WHERE UPPER(uc.postal) <> upr.postal
-- WHERE UPPER(uc.email) <> upr.email
-- WHERE UPPER(uc.city) <> upr.city
-- WHERE UPPER(uc.phone) <> upr.phone
-- WHERE uc.amperity_id = 'amperity-id'
LIMIT 1000

Review the scenarios for using cleaned vs. uncleaned data during Stitch QA, including comparing semantic values, exploring cleaned values, and filtering by Amperity ID.

What to look for?

It is important to review the quality of Stitch results. The general process for reviewing these results, also referred to as Stitch QA, is to use a series of SQL queries to pull back results that identify issues like overclustering, underclustering, or supersized clusters that indicate areas within the customer data that need more attention.

These issues are most likely caused by values within the data itself, such as from the presence of nicknames or common values that are associated with business addresses, unmonitored email accounts, and so on. In most cases, additional configuration within Amperity, such as adding values to a blocklist, will resolve the issue. In some cases, updating configuration settings in Amperity will be the best approach for fine-tuning the quality of Stitch results.

Important

Every tenant is different and the values, thresholds, and outcomes will be unique. Stitch QA starts with a generalized approach, but often requires tenant-specific changes to SQL queries and configuration settings within Amperity, such as for how to handle issues related to dedupe keys, data source quality, foreign keys, and PII. You should not take any tenant-specific steps until you have built the Stitch QA database, completed adding the Stitch QA SQL queries, and reviewed the initial results.

Anomalies in Stitch output

It’s important to look for anomalies in Stitch output.

What to look for

  1. Run the NULL Amperity IDs query to discover if NULL Amperity IDs are present in the customer 360 database. Investigate any records that are returned with NULL Amperity IDs.

    Important

    The number of NULL Amperity IDs that should be present is zero.

  2. Use the Detailed Examples table to review a detailed set of examples that show you what is in your customer 360 database.

    Use this data to identify which types of features lead to scores with the largest effects on overall Stitch results, including how they are associated to various fields that contain profile (PII) data.

    Important

    The Detailed Examples table must be configured to match the customer profile fields, blocklist values, and foreign keys that are in your tenant.

  3. Identify values that can be added to the bad-values blocklist, which prevents values from being part of the Stitch process.

Blocking keys

A blocking key defines a specific combination of characters to be used as a blocking strategy. For example, the first three characters in given-name, the first character in surname, and birthdate represent a blocking key.

Blocking is a process that uses simple rules to divide massive sets of data records into small blocks that are rapidly processed and offer higher probabilities of discovering matching records.

A blocking strategy acts like a filter against a very large data set. Each blocking strategy applies its filter and all records that match are grouped together into a block. Each record that matches a blocking strategy is a blocking key.

What to look for

  1. The blocking size limit for blocking keys is 100. When blocking returns a group of records that exceeds this limit the group is split apart.

    In some cases, large clusters are split even with matching PII because of this limit. For example:

    SELECT
      COUNT(*)
    FROM Unified_Preprocessed_Raw
    WHERE LOWER(email) = 'someone@email.com'
    

    For example, the following query returns 185 rows:

    SELECT COUNT(*)
    FROM Unified_Preprocessed_Raw
    WHERE LOWER(email) = 'derek@amperity.com'
    

    185 rows exceeds the blocking limit of 100. It is likely that this email address should not be assocated with 185 distinct records.

  2. Use similar queries to return values that exceed the blocking limit. Look for values that should not be associated with high numbers of distinct records and that exceed the blocking size limit threshold. Add these values to the bad-values blocklist as necessary.

Customer keys

A customer key is a column in a data table that contains a unique identifier that represents a unique ID for that data as it exists in the customer’s data outside of Amperity.

What to look for

  1. Run the dedupe keys per Amperity ID query to return a rough distribution of dedupe keys to Amperity ID. Customer keys are interesting because they show before and after for entity distribution.

    Update this query to use customer keys, and then run the query. You should not expect obvious “good” or “bad” results, instead focus on clusters that return 2 (or more) dedupe keys.

Foreign keys

A foreign key is a column in a data table that acts as primary key and can be used for deterministic matching of records. A record pair is assigned an exact match score (5.0) when foreign keys contain identical values during pairwise comparison.

The fk-[namespace] semantic tag identifies a field as a foreign key. A foreign key semantic tag must be namespaced. For example: fk-customer, fk-interaction, fk-audience, or fk-brand.

A foreign key semantic tag may be applied to any column in any data source, but should be associated with a field that can also act as a primary key for that data source and is present in other tables.

A foreign key may be used once within a table. A table may have more than one foreign key. For example, if a data source contains customer and audience identifiers, apply fk-customer to the customer identifier and fk-audience to the audience identifier.

What to look for

  1. Look for records where the Amperity ID and a foreign key. are equal, but one (or more) values associated with email, given-name, and surname are not.

  2. Compare foreign keys across data sources.

  3. Return a rough distribution of deduplicated foreign keys, and then look for clusters of records 2 (or more).

  4. Investigate if blocking size limits for foreign keys caused splits to groups of records.

  5. Are foreign keys being prioritized over separation keys correctly?

  6. Look for records that score at 5.0, but do not have foreign keys in common.

Bad foreign key matches

Run the bad foreign key matches query to return records where the Amperity ID and a foreign key are equal, but one (or more) values associated with email, given-name, and surname are not equal.

Tip

Foreign keys are matched deterministically which can lead to incorrect matches or overclustering.

The presence of bad foreign key matches is an indicator of overclustering.

Compare foreign keys

Run the foreign key validation query and compare coverage and distribution results for foreign keys across data sources.

Note

The foreign key validation query is not added by the “Stitch QA” queries template.

Important

The foreign key validation query is a series of small SELECT statements that can help focus on and narrow-down specific use cases for validating foreign keys. Use this query to:

  1. Look for name IDs with multiple Amperity IDs.

  2. Look for name IDs with different Amperity IDs.

  3. Inspect name IDs across tables.

  4. Check for foreign keys with multiple Amperity IDs across tables.

  5. Check for foreign keys with different Amperity IDs across tables.

  6. Inspect foreign keys across tables.

Deduplicated distribution

Run the dedupe keys per Amperity ID query to return a rough distribution of dedupe keys to Amperity ID.

Foreign keys are interesting because Amperity deterministically matches on them in most situations. When there are 2 (or more) dedupe keys related to foreign keys, this is an indicator that records were connected beyond what could have been connected deterministically.

Update the query to use foreign keys, and then run the query. You should not expect obvious “good” or “bad” results, instead focus on clusters that return 2 (or more) dedupe keys. Evidence of records that are connected, but were previously independent, is an indicator of overclustering.

Tip

Use concatenated values, such as “datasource + fk”, to focus the results on a per-source level.

Foreign key limits

Blocking is a process that uses simple rules to divide massive sets of data records into small blocks that are rapidly processed and offer higher probabilities of discovering matching records.

The size limit for foreign keys is 1000. When blocking returns a group of records that exceeds this limit the group is split apart.

In some cases, large clusters are split even with matching PII because of this limit. This most often occurs with data sourced from interaction records, test accounts, booking agents, common addresses, and other values that appear at high rates within data. Use the bad-values blocklist to remove these values, and then rerun Amperity and review the updated Stitch results.

Foreign key prioritization

Are foreign keys being prioritized over separation keys?

When foreign key matching is the priority, Amperity scores record pairs in the following order:

  1. Does the record contain identical foreign key values?

  2. If true, assign score 5.0. Stop.

  3. If false, does the record contain conflicting separation key values?

  4. If true, assign score 0.0. Stop.

  5. If false, use pairwise comparison scoring.

Foreign key matching priority.

Amperity is configured by default to prioritize foreign key matching over separation key unmatching.

Many tenants use separation keys to split records that do not match. In some use cases separation key unmatching must be prioritized over foreign key matching. In this situation, configure Stitch to prioritize separation keys over foreign keys.

Separation key matching priority.

Presence of trivial duplicates

A trivial duplicate is a set of nearly-identical records that share enough matching PII to clearly identify a single unique individual. Trivial duplicates are identified by Stitch early in the identity resolution process. Only one of these records is passed downstream for additional Stitch processing; the other records—the trivial duplicates—are not.

In some cases, more than one nearly-identical record is passed to downstream Stitch processes. These nearly-identical records are referred to as “trivial duplicates”.

For example, a situation where a high number of records are associated to the same foreign key, the same email address, and the same name, but many individual record are associated to a unique physical address.

In this case, the address field is consistently unreliable. A semantic exclusion configured for address, city, state, and postal would cause these unique address fields to be ignored when collapsing trivial duplicates.

This approach can help records survive blocking by ensuring groups of records are more likely to be under the blocking size limit. The address, city, state, and postal fields are not ignored during pairwise comparison scoring.

What to look for

  1. Use a combination of the Unified Coalesced and Unified Preprocessed Raw tables to review and compare the data that is associated with these records to help understand why groups of records were matched or why certain records were assigned to a cluster.

    Tip

    The Unified Coalesced table contains two useful columns that help identify trivial duplicates: rep_ds and rep_pk.

    • rep_pk is an identifier that represents the first grouping of records done by Stitch. This grouping is based on identical semantic patterns.

    • rep_ds is the datasource that is associated with the rep_pk column.

    The combination of rep_ds and rep_pk represent nearly-identical records that were grouped together by Stitch early in the identity resolution process. (These nearly-identical records are also referred to as a “trivial duplicate”.) All of these nearly-identical records are treated as a single record by downstream Stitch processes.

    Note

    If foreign keys are linked together by a trivial duplicate they will appear in the Unified Preprocessed Raw table as a comma-separated list.

  2. Look for records that score at 5.0, but do not have foreign keys in common. One (or both) records likely have a trivial duplicate with another record with a different foreign key, causing there to be a foreign key match between two groups of trivial duplicates.

  3. If a trivial duplicate is identified, configure Stitch to use a semantic exclusion.

Warning

Semantic exclusions should be applied very carefully. Use a sandbox to configure and apply a semantic exclusion, and then carefully review and validate that all downstream processes are not adversely affected by the change prior to applying a semantic change to a production environment.

Overclustering

An overcluster, or a false positive, occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution and should be investigated to understand why it occurred.

Important

The presence of overclustering has the greatest effect on Stitch quality!

For example, two records with the same Amperity ID look as if they should be two records with different Amperity IDs because:

What to look for

  1. Run the bad foreign key matches query to return records where the Amperity ID and a foreign key are equal, but one (or more) values associated with email, given-name, and surname are not equal.

  2. Run the many semantic values query to explore cases where a single Amperity ID is associated with many different semantic values or associated with combinations of semantic values that are generally good identifiers.

  3. Run the cluster size distribution query and look for a long-tail of supersized clusters.

  4. Run the dedupe keys per Amperity ID query to return a rough distribution of dedupe keys to Amperity ID.

    Foreign keys are interesting because Amperity deterministically matches on them in most situations. Customer keys are interesting because they show before and after for entity distribution.

    You should not expect obvious “good” or “bad” results, instead focus on clusters that return 2 (or more) dedupe keys.

  5. Run the common values query to return common values across semantics in a single view that shows cardinality for semantic values and the context for values that occur across records and clusters.

  6. A hard conflict occurs when profile values are different enough to ensure that two records should not be grouped together. A hard conflict is most often associated with a birthdate or a generational suffix, but can be associated with other combinations of profile data.

    If you identify values with a hard conflict, use Stitch labels.

  7. Investigate the record pairs for the source of the 5.0 score. Are there any transitive connections between the two records caused by the foreign key?

    A transitive connection exists between individual records when any two records share a strong match to an intermediate record, but do not have a strong match to each other. For example: record 1 matches record 2, record 3 matches record 2, neither records 1 or 3 match to each other, but they have a transitive connection because both match record 2.

  8. Are trivial duplicates present?

    A trivial duplicate is a set of nearly-identical records that share enough matching PII to clearly identify a single unique individual. Trivial duplicates are identified by Stitch early in the identity resolution process. Only one of these records is passed downstream for additional Stitch processing; the other records—the trivial duplicates—are not.

  9. Use a combination of the Unified Coalesced and Unified Preprocessed Raw tables to review and compare the values associated overclustering.

Problematic nicknames

Nicknames can affect the results of blocking and clustering when they prevent obvious matches from being matched. For example, Mike is a nickname for Michael. Blocking strategies that include given-name use the first three characters to match records.

In this example, “Mik” and “Mic” do not match, but depending the presence of foreign keys and other profile values (like email, surname, phone, and address), these records may still be scored together.

Amperity pre-loads a set of common nicknames to your tenant in the form of a static CSV file. This file has thousands of nicknames, including all of the most common nicknames, along with many variations. You can upload your own static CSV files to extend the list of common nicknames to add and/or remove nicknames as needed for your tenant.

Record pairs and scores

The quality of record pairs and their associated pairwise comparison scores should be investigated, starting with low-scoring record pairs in each cluster. At the start of the Stitch QA process it is not uncommon for low-scoring record pairs in a cluster to fall below threshold for pairwise comparison scoring.

Each cluster of records will contain high-scoring record pairs with transitive connections to other high-scoring record pairs. A low-scoring record pair often does not show a transitive connection to a high-scoring record pair.

Each cluster with low-scoring record pairs should be investigated to confirm if those pairs were accurately clustered and to determine if transitive connections to other high-scoring record pairs do exist.

Note

As the Stitch QA process moves forward you should expect to see a notable decline over time for the number of clusters in which low-scoring record pairs fall below threshold for pairwise comparison scoring.

What to look for

  1. Run the weakest match query and examine low-scoring record pairs to confirm if those records were accurately clustered.

  2. Run the cluster scores query to investigate pairwise comparison scores for individual clusters or for a set of clusters.

  3. Nicknames (unusual or less common) can be a reason for a low-scoring record pair.

  4. Consider tuning Stitch to raise or lower the pairwise comparison scoring threshold. The default threshold is 3.0.

    Important

    The default threshold is recommended, regardless of data quality or volume of data or quality of your current Stitch results. A lower threshold leads to more matches and fuzzier matched pairs, whereas a higher threshold will lead to fewer matches and more precise matched pairs. The default threshold is the right balance. Look for ways to resolve low-scoring record pairs without changing the pairwise comparison scoring threshold.

Review cluster details

You should periodically review at a high level cluster sizes and source cluster distributions across all clusters, along with a count of clusters comprised of records that span data sources.

What to look for

  1. Run the cluster size distribution query to identify if distinct entities are clustered together, if clusters are supersized, or for other indicators of overclustering.

  2. Run the source cluster distribution query to return a breakdown of record-to-cluster size distribution by datasource.

  3. Run the connections across sources query to return a count of clusters that are comprised of records across various combinations of datasources.

  4. Run the combined statistics query to return an overview of Stitch statistics that contains the number of records, accounts, Amperity IDs, singletons, clusters with more than one record, overall deduplication rate, and the deduplication rate across all data sources.

  5. Run the separated statistics query to return an overview of Stitch statistics, separated by dedupe keys and tables.

Semantic tags

Be sure to apply semantic tags consistently across data sources, in particular for profile (PII) and keys (customer, blocking, foreign, and separation). Semantic tags that are not applied correctly may be ignored by Stitch, be recognized as custom semantic tags, or may provide unexpected results.

Use the Customer 360 page to verify that rows with profile semantic tags and foreign keys are selected to be available to the Queries page. Has the list of profile and/or foreign key semantics changed since the last time you have performed Stitch QA? This will occur when a data source is added that requires a new foreign key or in a situation where one of the less frequently used profile semantic tags is applied to a new data source.

Open the table in the Database Explorer and verify that all rows in the table that are associated with a semantic tag or a foreign key have a checkmark in the left column. Rows without a checkmark will not make the associated FIELD available to the Queries page. If rows do not have a checkmark, edit the table and apply the checkmark, save the table, activate the Stitch QA database, and then run the database to refresh the table.

Tip

Be precise with email addresses, phone numbers, and physical addresses that require ordinal or namespace semantic tags. If these tags are inconsistent or inaccurate, the values may be ignored by Stitch.

Use the Unified Preprocessed Raw table to verify if data was used by the Stitch process. For example, if two phone numbers were tagged, one with phone and one with a ordinal or namespace, but did not appear in the Unified Preprocessed Raw table, that is an indicator that the orginal or namespace semantic tagging was done improperly.

For the correct way to apply ordinal or namespace semantic tags, see:

What to look for

  1. The output of semantic tags is discoverable from the Unified Preprocessed Raw and Unified Coalesced tables.

    Look for inconsistencies in the results of semantic tagging and, if discovered, edit them to apply consistent tagging patterns, rerun Stitch, and then re-review the results.

  2. Run the split clusters query to investigate patterns in split clusters that lack transitive connections for names, email addresses, and/or post office boxes, and for semantic tagging issues that may arise when using multiple, ordinal, or namespaced semantics.

  3. Run the many Amperity IDs query to look for underclustering. The results of this query help identify incorrectly split entities, explain the cardinality of semantic values, and identify values to add to the bad-values blocklist.

  4. Run the many semantic values query to explore cases where a single Amperity ID is associated with many different semantic values or associated with combinations of semantic values that are generally good identifiers.

  5. Run the unmatched semantic values query and look for situations where unique semantic values are equal, but Amperity IDs are not.

    This query has a variety of use cases focused on individual semantic values, such as running against only email, phone, or address semantic values, combinations of semantic values, frequency limits, and so on.

Supersized clusters

A supersized cluster is a cluster of records that is discovered during the Stitch process that has more than 64 matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.

A supersized cluster is created when multiple transitive connections are present. For example, a couple named Mary Johnson and Jeffrey Johnson with the following records:

  1. Mary Johnson, maryjohnson @gmail.com, 50 1st Avenue, New York, NY, with 50 connected records.

  2. Jeffrey Johnson, jeffjohnson @gmail.com, 50 1st Avenue, New York, NY, with 25 connected records.

  3. Mary Johnson, mjohnson50 @gmail.com, 50 1st Avenue, New York, NY, with 17 connected records.

  4. Jeffrey Johnson, mjohnson50 @gmail.com, 50 1st Avenue, New York, NY, with 8 connected records.

These records block together in the following ways:

  • Records 1 and 3 block together on name and address.

  • Records 2 and 4 block together on name and address.

  • Records 3 and 4 block together on email.

All four groups of records transitively connect into a single connected cluster with a size of 100.

What to look for

  1. Run the cluster size distribution query and look for a long-tail of supersized clusters.

  2. Look at the is_supersized and supersized_id columns in the Unified Coalesced table, and then compare with the supersized_id column in the Unified Preprocessed Raw table.

  3. The presence of supersized clusters is a strong indicator of overclustering.

  4. Consider tuning Stitch to raise or lower the threshold at which a supersized cluster is defined. The default threshold is “64”.

Underclustering

An undercluster, or a false negative, occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution and should be investigated to understand why it occurred.

What to look for

  1. Run the unmatched semantic values query and look for situations where unique semantic values are equal, but Amperity IDs are not.

    This query has a variety of use cases focused on individual semantic values, such as running against only email, phone, or address semantic values, combinations of semantic values, frequency limits, and so on.

  2. Run the common values query to return common values across semantics in a single view that shows cardinality for semantic values and the context for values that occur across records and clusters.

  3. Run the many Amperity IDs query to look for underclustering. The results of this query help identify incorrectly split entities, explain the cardinality of semantic values, and identify values to add to the bad-values blocklist.

  4. Look for two records that have been assigned different Amperity IDs, but should have been assigned the same Amperity ID. Do they share a foreign key? Do they have profile (PII) values in common?

  5. Determine if your foreign key matching vs. separation key unmatching strategy needs to be modified.

  6. Are larger clusters part of a supersized cluster?

  7. Look at the first names. Are they similar to the human eye, but potentially different based on edit distance? Should they be added to the list of common nicknames?

  8. Open the Unified Coalesced and Unified Preprocessed Raw tables and compare values in the component_id column. Did these records block together?

    Tip

    Records with different component_id values may show as having been blocked together. This can occur when a connecting record pair was removed because it scored below the pairwise comparison threshold.

  9. Were blocking size limits exceeded? Does your tenant have the right set of blocking strategies configured?

  10. Are trivial duplicates present?

  11. Do business email addresses need to be allowed? Update the Stitch preprocessing profiles configuration setting to allow-business-email.

Values to blocklist

A bad-values blocklist contains known values that appear frequently in data and should be excluded from the Stitch process.

What to look for

  1. Verify that the Stitch BlocklistValues table is configured for use in the Stitch QA database.

  2. Review the values in the bad-values blocklist, including global blocklist values.

  3. Values to blocklist are often identified as an outcome of using Stitch QA to look for looking for general anomalies, inconsistent foreign keys, inconsistent blocking keys, and underclustering.

    Tip

    If your starting point in Stitch QA is the bad-values blocklist, run the following Stitch QA queries:

    1. Run the common values query to return common values across semantics in a single view that shows cardinality for semantic values and the context for values that occur across records and clusters.

    2. Run the many Amperity IDs query to identify values to add to the bad-values blocklist.

    3. Run the unmatched semantic values query and look for situations where unique semantic values are equal, but Amperity IDs are not.

      The unmatched semantic values query has a variety of use cases focused on individual semantic values, such as running against only email, phone, or address semantic values, combinations of semantic values, frequency limits, and so on.

  4. Use a combination of the Unified Coalesced and Unified Preprocessed Raw tables to review and compare the values associated with blv_ columns in the Unified Coalesced table. Are these values in the Stitch BlocklistValues table?

    Tip

    The Unified Preprocessed Raw table is the best source for verifying if profile (PII) data was removed by the bad-values blocklist.

  5. Add values to the bad-values blocklist that should be ignored by Stitch.

    Note

    If you identify values that should be processed by Stitch, but not made available to the customer 360 database, use Stitch labels.