Configure Stitch¶
Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.
Stitch configuration is defined from a list of settings in the Stitch settings dialog box, which is available from the Stitch page in Amperity. Stitch settings can be accessed by users who are assigned to the DataGrid Administrator policy.
The default settings are recommended for most situations.
Important
Changes to settings are not required, but may help improve data quality and clustering results. Validate changes in a sandbox before promoting them to your production tenant.
Stitched tables¶
Stitch only runs against selected domain tables. A domain table is made available to Stitch by the Make available to Stitch configuration setting in the Feed Editor.
A domain table that is made available to Stitch must also be selected from the list of domain tables in the Stitch settings dialog box.
Each selected table is processed and compared for identity resolution, after which Amperity IDs are assigned to each of your unique customers that are discovered across all domain tables that are included in the Stitch run.
To add tables to the Stitch run
From the Stitch tab, click Settings. This opens the Stitch settings dialog box. On the Stitched tables tab, select each of the tables to include in Stitch results, and then click Save.
After you have selected the list of tables to include in Stitch results, return to the Stitch page, and then click Run.
Note
Only tables with the Make available to Stitch setting enabled in the Feed Editor are available for selection from the Stitched tables tab in the Stitch settings dialog box.
General settings¶
The General settings tab contains a series of configuration settings that may be modified based on analysis of the data in your tenant to help improve data quality and clustering results.
General settings are divided into the following categories:
Cluster quality¶
The following settings determine how Stitch returns clusters of records:
Blocking strategy¶
Blocking is a non-trivial step for record linking. The default blocking strategy provides a reasonable recall rate for most use cases; work with your Amperity representative to identify the best approaches for tuning your tenant’s blocking strategy.
The combination of blocking keys is what creates the ideal recall rate without compromising the performance of Amperity.
An overly generous blocking strategy may result in a high recall rate (too many pairs being evaluated) along with negative system performance.
An overly conservative blocking strategy may result in a low recall rate (too few pairs being evaluated).
Individual blocking keys may be conservative or generous.
To configure the blocking strategies that are used for your tenant, open the Stitch page, and then click Stitch settings. In the list of settings, under Cluster quality, select one (or more) blocking strategies from the Blocking Strategies drop-down:
The default blocking strategies are: “dnf1”, “dnf3”, “dnf4”, “dnf5”, “dnf6”, “dnf7”, “dnf8”, “email”, and “fk”. Click the “x” next to the name of any selected blocking strategy to remove it.
Note
The order in which blocking strategies are listed does not matter. For example:
:dnf1 :dnf3 :dnf4 :dnf5
and:
:dnf1 :dnf4 :dnf5 :dnf3
will be processed in the same way and will return the same results.
In nearly all cases, the default blocking strategies provide a reasonable recall rate. Each blocking strategy looks at a different combination of PII data:
| Strategy | Key |
|---|---|
| :address | Non-default. This blocking strategy groups values associated with the full address using the address and address2 semantics. |
| :company | Non-default. This blocking strategy groups values associated with the company semantic. |
| :dnf1 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first character in surname, and birthdate. |
| :dnf2 | Non-default, use carefully. This blocking strategy groups values associated with the following semantics: the full given-name and email. |
| :dnf3 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and postal. |
| :dnf4 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and address. |
| :dnf5 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and phone. |
| :dnf6 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and the local part of an email address in email. |
| :dnf7 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and company. |
| :dnf8 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and PO box values that are derived from address. |
| :email | Default. This blocking strategy groups values associated with the following semantics: the full email address in email. |
| :fk | Default. This blocking strategy groups values associated with foreign keys. |
| :login-partial | Non-default. This blocking strategy groups values associated with cleaned email addresses derived from email. This is used for low-threshold email address matching. |
| :login-trimmed | Non-default. This blocking strategy groups values associated with the first five characters of an email address derived from email. This is used for low-threshold email address matching. |
| :name | Non-default, use carefully. This blocking strategy groups values associated with the following semantics: given-name and surname. The order of given-name and surname is sorted lexicographically. The blocking key for JOHN SMITH and SMITH JOHN is JOHN:SMITH. |
| :phone | Non-default. This blocking strategy groups values associated with the phone semantic. |
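To make the idea of a blocking key concrete, the following Python sketch derives keys for two of the strategies described above (:dnf1 and :name) and groups records that share a key. It is an illustration only, not how Stitch is implemented: the record layout, the function names, the colon-separated key format for :dnf1, the uppercasing, and the choice to skip records with missing values are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def _prefix(value: str, length: int) -> str:
    """Return the first `length` characters of a trimmed, uppercased value."""
    return value.strip().upper()[:length]


def dnf1_key(record: dict) -> Optional[str]:
    """:dnf1 -> first 3 chars of given-name, first char of surname, birthdate."""
    given = record.get("given-name")
    surname = record.get("surname")
    birthdate = record.get("birthdate")
    if not (given and surname and birthdate):
        return None  # assumption: a record missing any part produces no key
    return ":".join([_prefix(given, 3), _prefix(surname, 1), birthdate])


def name_key(record: dict) -> Optional[str]:
    """:name -> given-name and surname, sorted lexicographically."""
    given = record.get("given-name")
    surname = record.get("surname")
    if not (given and surname):
        return None
    return ":".join(sorted([given.strip().upper(), surname.strip().upper()]))


# Hypothetical registry that maps strategy names to key functions.
STRATEGIES = {":dnf1": dnf1_key, ":name": name_key}


def build_blocks(
    records: Iterable[dict], strategy_names: Iterable[str]
) -> Dict[Tuple[str, str], Set[str]]:
    """Group primary keys by (strategy, blocking key)."""
    names = list(strategy_names)
    blocks = defaultdict(set)
    for record in records:
        for name in names:
            key = STRATEGIES[name](record)
            if key is not None:
                blocks[(name, key)].add(record["pk"])
    return dict(blocks)
```

Because each strategy contributes its own independent set of blocks, calling `build_blocks` with the strategies in a different order returns the same result, which mirrors the note above about ordering.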
Case-sensitive foreign keys¶
Values associated with foreign keys are case-insensitive by default. You may configure individual foreign keys to be case-sensitive by adding the name of that foreign key to this list.
To specify which foreign keys are case-sensitive, open the Stitch page, and then click Stitch settings. In the list of settings, under Cluster quality, select one (or more) case-sensitive foreign keys from the Case-sensitive foreign keys drop-down:
The list of available case-sensitive foreign keys will match the list of foreign keys that have been defined for any domain table in the Sources page.
Matching strategy¶
Amperity prioritizes separation key unmatching over foreign key matching. You may configure Amperity to prioritize foreign key matching over separation key unmatching.
The matching strategy classifier tells Stitch how to apply the results of the blocking strategies, including which groups to analyze and the order in which that analysis should take place, when foreign keys and separation keys are present.
The default behavior prioritizes separation keys over foreign keys.
To change the matching strategy classifier
To change the matching strategy classifier, open the Stitch page, and then click Stitch settings. In the list of settings, under Cluster quality, select a matching strategy from the Matching strategy classifier drop-down:
Warning
This value should be changed only after careful consideration. If changed, be sure to validate these results carefully to ensure that any changes to pairwise comparison scoring had the desired outcome.
Foreign key matching¶
When foreign key matching is the priority, Amperity scores record pairs in the following order:
Does the record contain identical foreign key values?
If true, assign score 5.0. Stop.
If false, does the record contain conflicting separation key values?
If true, assign score 0.0. Stop.
If false, use pairwise comparison scoring.
Separation key unmatching¶
A separation key (sk) is used for deterministic unmatching of records.
By default, Amperity derives separation keys for sk-given-name and sk-generational-suffix, and prioritizes separation keys over foreign keys.
When separation key unmatching is the priority, Amperity scores record pairs in the following order:
Does the record contain conflicting separation key values?
If true, assign score 0.0. Stop.
If false, does the record contain identical foreign key values?
If true, assign score 5.0. Stop.
If false, use pairwise comparison scoring.
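The two scoring orders described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Amperity's implementation: the record layout (`fk` and `sk` dictionaries), the function names, the `fk_first` flag, and the foreign key name `fk-loyalty` used in the usage note are hypothetical. The scores 0.0 and 5.0 and the order of the checks come from the text above.

```python
from typing import Callable


def has_identical_fk(a: dict, b: dict) -> bool:
    """True when both records carry the same non-empty value for any foreign key."""
    fks_a, fks_b = a.get("fk", {}), b.get("fk", {})
    return any(v is not None and fks_b.get(k) == v for k, v in fks_a.items())


def has_conflicting_sk(a: dict, b: dict) -> bool:
    """True when both records carry different non-empty values for any separation key."""
    sks_a, sks_b = a.get("sk", {}), b.get("sk", {})
    return any(
        v is not None and sks_b.get(k) is not None and sks_b[k] != v
        for k, v in sks_a.items()
    )


def score_pair(
    a: dict,
    b: dict,
    pairwise: Callable[[dict, dict], float],
    fk_first: bool = False,
) -> float:
    """Score a record pair using the ordered checks described above.

    fk_first=False: separation key unmatching is the priority (the default).
    fk_first=True:  foreign key matching is the priority.
    """
    checks = [(has_conflicting_sk, 0.0), (has_identical_fk, 5.0)]
    if fk_first:
        checks.reverse()
    for check, score in checks:
        if check(a, b):
            return score  # first check that applies wins; stop
    return pairwise(a, b)  # fall back to pairwise comparison scoring
```

For a pair that shares a foreign key such as `{"fk-loyalty": "123"}` but has conflicting `sk-given-name` values, the sketch returns 0.0 with the default order and 5.0 when `fk_first=True`, which is the practical difference between the two strategies.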
Matching thresholds¶
Matching record pairs are included when they score at or above this value. A lower value leads to more matches and weaker pairs; a higher value leads to fewer matches and more precise pairs. Five values may be chosen: 1.0 (weak), 2.0 (moderate), 3.0 (high, default and recommended value), 4.0 (excellent), and 5.0 (exact).
Based on the precision and recall rate that was observed in the initial Stitch results, the matching threshold can be tuned to achieve better results by changing the weakest match category that can be included in Stitch results.
In general, a lower threshold will lead to more matches and fuzzier matched pairs, whereas a higher threshold will lead to fewer matches and more precise matched pairs. For the default ordinal classifier, five different thresholds can be chosen, and they are defined as follows:
| Threshold | Weakest match |
|---|---|
| 1.0 | Weak |
| 2.0 | Moderate |
| 3.0 | High |
| 4.0 | Excellent |
| 5.0 | Exact |
For each threshold, all the matches that are equal to or stronger than the weakest matching type will be clustered together. For example, if the threshold 3.0 is chosen, matches in the High, Excellent, and Exact categories are clustered together.
To change the matching threshold, open the Stitch page, and then click Stitch settings. In the list of settings, under Cluster quality, select a matching threshold from the Thresholds drop-down:
Tip
When the threshold is set to “High”, any matching pair that belongs to the High, Excellent, and Exact match types will be stitched into a single cluster, assuming all of those pairs pass successfully through the blocking phase.
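The relationship between a threshold and the match categories it admits can be expressed as a short Python sketch. The function and variable names are illustrative assumptions; the threshold values and category names come from the table above.

```python
from typing import Iterable, List, Tuple

# Threshold values and their weakest match category, as listed above.
MATCH_CATEGORIES = {
    1.0: "Weak",
    2.0: "Moderate",
    3.0: "High",
    4.0: "Excellent",
    5.0: "Exact",
}


def included_categories(threshold: float) -> List[str]:
    """Return the match categories that are equal to or stronger than the threshold."""
    if threshold not in MATCH_CATEGORIES:
        raise ValueError(f"unsupported threshold: {threshold}")
    return [name for score, name in sorted(MATCH_CATEGORIES.items()) if score >= threshold]


def pairs_to_cluster(
    scored_pairs: Iterable[Tuple[str, str, float]], threshold: float = 3.0
) -> List[Tuple[str, str]]:
    """Keep only the record pairs whose score meets the threshold."""
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]
```

With the default threshold, `included_categories(3.0)` returns the High, Excellent, and Exact categories, and a pair scored 2.0 is dropped by `pairs_to_cluster`.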
Performance¶
The following settings are useful when troubleshooting performance issues with Stitch:
Allowed empty tables¶
Stitch may be configured to allow tables to be empty. For example, some CCPA and GDPR workflows run daily, but do not always contain data, such as when zero customers make a data subject access request (DSAR).
To allow empty tables, open the Stitch page, and then click Stitch settings. In the list of settings, under Performance, from the Allow empty tables drop-down, select one (or more) tables:
The list of tables will match the list of domain tables that have been defined in the Sources page.
Days of recorded history¶
Amperity stores 365 days of history for unified changes tables, which contain all changes to clusters and primary keys between the current and previous Stitch runs.
You can configure the number of days that are stored for the Unified Changes PKs and Unified Changes Clusters tables. The default is 365 days.
To update the days of recorded history, open the Stitch page, and then click Stitch settings. In the list of settings, under Performance, update the value of the Recorded days field:
Note
Changes to this setting will not recreate history that has already been dropped.
Stitch processing¶
The following settings configure how Stitch processes data:
Force Stitch to run¶
Stitch will run when domain tables contain updates to data that is used by Stitch for identity resolution. Stitch will not run when updates are not present.
To force Stitch to run, open the Stitch page, and then click Stitch settings. In the list of settings, under Stitch processing, enable the Force Stitch to run setting:
Note
When this setting is enabled, the one-to-one Stitch setting is forced to be disabled.
Ignore jitter alerts¶
Jitter tracks changes to Amperity IDs across Stitch runs. You may configure Stitch to ignore jitter, such as during the initial configuration phase for your tenant or when running a tenant in training and/or demonstration use cases.
Note
Jitter may occur at high rates when large numbers of customer records are added to or removed from your tenant, or when two percent (or more) of all customer records are assigned an updated Amperity ID.
For example, a high jitter rate is common, and not unexpected, when you are using a tenant for training purposes or when you are loading a very large and complex set of data to an established tenant.
To ignore jitter alerts, open the Stitch page, and then click Stitch settings. In the list of settings, under Stitch processing, enable the Ignore jitter alerts setting:
Keep this setting set to false in your production tenant as often as possible.
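As a rough illustration of the two percent figure mentioned in the note above, the following Python sketch compares Amperity ID assignments from two Stitch runs. It is an assumption-laden sketch: the function names, the input shape (a mapping of primary key to Amperity ID), and the choice to measure only records present in both runs are hypothetical and do not describe how Amperity computes jitter.

```python
from typing import Dict


def jitter_rate(previous: Dict[str, str], current: Dict[str, str]) -> float:
    """Fraction of records present in both runs whose Amperity ID changed."""
    shared = previous.keys() & current.keys()
    if not shared:
        return 0.0
    changed = sum(1 for pk in shared if previous[pk] != current[pk])
    return changed / len(shared)


def exceeds_jitter_threshold(
    previous: Dict[str, str], current: Dict[str, str], threshold: float = 0.02
) -> bool:
    """True when the share of updated Amperity IDs is two percent or more."""
    return jitter_rate(previous, current) >= threshold
```

For example, when 2 of 100 shared records receive a different Amperity ID, `jitter_rate` returns 0.02 and `exceeds_jitter_threshold` returns `True`.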
Skip unified changes¶
Unified changes tables contain all of the changes to clusters and primary keys between the current and previous Stitch runs. These tables are resource intensive and may be skipped.
The Unified Changes Clusters and Unified Changes PKs tables may be resource intensive because they contain all of the changes to clusters and primary keys between the current and previous Stitch run.
The Unified Changes Clusters table contains a history of changes to cluster graphs, relative to the previous Stitch run.
The Unified Changes PKs table contains a history of changes to primary keys, relative to the previous Stitch run.
The default configuration for Stitch always builds these tables, but you can configure Stitch to skip building them.
To skip building the unified changes tables, open the Stitch page, and then click Stitch settings. In the list of settings, under Stitch processing, enable the Skip unified changes output setting:
Skip unified scores¶
The unified scores table contains all of the scoring between record pairs for the previous Stitch run. This table is resource intensive and may be skipped.
The Unified Scores table records all of the pairwise comparison scores and match categories for all groups of records, and then for each group of records all of the pairwise scores that are present between records within that group.
The Unified Scores table may be resource intensive. The default configuration for Stitch always builds this table, but you can configure Stitch to skip building it.
To skip building unified scores, open the Stitch page, and then click Stitch settings. In the list of settings, under Stitch processing, enable the Skip unified scores output setting:
Stable IDs¶
An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.
As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retains a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.
Stable ID assignment can be a resource-intensive process, in particular when:
Adding data sources that contain large numbers of rows (100+ million rows) of customer records.
Updating existing data sources with large numbers of rows on a periodic (monthly, quarterly, etc.) basis.
Data contains a very large number of duplicate values, such as 400k+ instances of an email address that is associated to a common business process.
You can configure the stable ID assignment process in the following ways:
Disable stable IDs¶
You may disable stable ID assignment to improve performance during the initial configuration phase for your tenant. Be sure to re-enable stable ID assignment before you start the Stitch QA process.
To disable stable IDs, open the Stitch page, and then click Stitch settings. In the list of settings, under Stitch processing, enable the Disable stable IDs setting:
Important
Be sure to disable this setting before you start to analyze and validate production-quality Stitch results or when performing any level of Stitch QA.
Increase partitions¶
When large differences are present between clusters of records for the current and previous Stitch runs you can increase the number of partitions that are used by Stitch during stable ID assignment. This situation can occur when data sources provide periodic updates, such as monthly, quarterly, etc., that contain a very large number of rows.
By default, stable ID assignment is based on the edges that exist between current and previous clusters of customer records, weighted by the number of shared primary keys. The shared primary keys are sorted in descending order, after which ties are broken by sorting the cluster IDs in ascending order. Edges are removed when higher-ranked edges are associated to identical clusters of customer records. Edges that survive this ranking are then used to map current cluster IDs to previous cluster IDs. Changes to stable ID assignment are captured in the Unified Changes Clusters and Unified Changes PKs tables.
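One way to read the ranking described above is as a greedy matching over weighted edges. The Python sketch below follows that reading and is not Amperity's implementation: the function name, the input shape (mappings of primary key to cluster ID), and the interpretation that an edge is dropped when a higher-ranked edge has already claimed either of its clusters are assumptions.

```python
from collections import Counter
from typing import Dict


def assign_stable_ids(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Map current cluster IDs to previous cluster IDs.

    previous: primary key -> cluster ID from the previous Stitch run
    current:  primary key -> cluster ID from the current Stitch run
    """
    # Weight each (current, previous) edge by the number of shared primary keys.
    weights = Counter(
        (cur_id, previous[pk]) for pk, cur_id in current.items() if pk in previous
    )
    # Sort by shared primary keys (descending), then by cluster IDs (ascending).
    ranked = sorted(weights.items(), key=lambda edge: (-edge[1], edge[0][0], edge[0][1]))

    mapping: Dict[str, str] = {}
    claimed_previous = set()
    for (cur_id, prev_id), _shared in ranked:
        if cur_id in mapping or prev_id in claimed_previous:
            continue  # a higher-ranked edge already uses one of these clusters
        mapping[cur_id] = prev_id
        claimed_previous.add(prev_id)
    return mapping
```

Current clusters that do not appear in the returned mapping have no surviving edge to a previous cluster; in this sketch they would be the ones that receive a new ID.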
In some cases, the differences between the current and previous clusters of customer records are very large, requiring access to a large amount of memory to complete the stable ID assignment process. In this type of situation Stitch processes may take a very long time, or even appear to be stuck; in some cases the Stitch job runs out of memory and needs to be rerun. In these situations, do the following:
Look for unusual values, such as a large set of identical email addresses, that appear in the Unified Coalesced table. This can sometimes be the cause of slow stable ID assignment. Mitigate the presence of these unusual values, and then run Stitch again.
Increase the number of partitions that are available to Stitch during the stable ID assignment process. You can increase the value of stable-id-partition-count to a value between 2 and 10 to improve the performance of Stitch during stable ID assignment.
This setting should be used temporarily, but for some tenants it may need to be left at a non-default value. Changing partitions may only be done using an advanced configuration setting:
:amperity.stitch.settings/stable-id-partition-count 6
One-to-one Stitch¶
You may configure Stitch to assign Amperity IDs to customers that are identified by a unique (and consistent) customer key. This configuration option is sometimes referred to as “deterministic Stitch”. You should not expect your Amperity IDs to be persistent when this mode is enabled.
Important
If Stitch has already been run in standard mode, and then if one-to-one mode is enabled, there will be 100% jitter. Amperity IDs are not expected to be the same between standard and one-to-one Stitch runs.
Use the following steps to configure your tenant for one-to-one Stitch:
Ensure that each table that is made available to Stitch applies the ck semantic tag to the field that contains the existing customer ID.
Tip
You may apply the ck semantic tag from a feed or from a custom domain table.
Apply all other semantic tags – customer profile, foreign key, primary key, orders and items – to the correct fields in all of your data sources. These tags will have no effect when running one-to-one Stitch, but are required for customer profiles, transactions, segment insights, and predictive modeling.
To configure Amperity for one-to-one Stitch, open the Stitch page, and then click Stitch settings. In the list of settings, under Bypass Stitch, move the slider for the Run 1:1 Stitch setting to the right:
Run Stitch.
When the run is complete, each unique customer ID will be associated with an Amperity ID.
Outcome of one-to-one Stitch
The following table describes the changes you will see in your tenant after it is configured for one-to-one Stitch.
| Tab | Changes |
|---|---|
| Stitch | The overview page will show a 0.0% deduplication rate. The Amperity ID will align to the total source IDs provided by the ck. The Data Explorer disables the tabs for Cluster Graph and Pairwise Comparison. These tabs are not available when one-to-one Stitch mode is configured. Note: The Amperity ID that is generated in one-to-one Stitch mode is based on customer keys (and not on stable clusters of customer records). |
| Customer 360 | The Unified Scores table is not generated. The following fields are removed from the Unified Coalesced table: component_id, is_supersized, rep_ds, rep_pk, and supersized_id. Fields related to the bad-values blocklist are not available, including has_blv, blv_address, blv_email, blv_given_name, blv_phone, and blv_surname. All standard tables will contain a ck field. The Stitch QA database template is not needed. |
| Queries | The Stitch QA queries template is not needed. |
Advanced settings¶
The following settings are available for advanced configuration of Stitch:
Note
The editor for Stitch report settings uses Extensible Data Notation (EDN) formatting:
{:amperity.stitch.settings/badvalues-config [],
:amperity.stitch.settings/parallelism 2,
:amperity.stitch.settings/pre-processing-profiles
#{:normalize-gender :prioritize-src-gender},
:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{},
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10,
:amperity.stitch.settings/supersized-cluster-min-size 500,
:amperity.stitch.settings/supersized-partition-max-depth 4}
Advanced configuration settings are described in more detail below. You may also override general configuration settings.
Automatic bad value detection¶
Amperity is configured to automatically apply blocklists to email addresses, phone numbers, and physical addresses when a bad value is discovered at a frequency that exceeds the defined threshold.
The default configuration is:
:amperity.stitch.settings/badvalues-config [
{:threshold 20, :proxy "given-name", :semantic "email"}
{:threshold 20, :proxy "surname", :semantic "email"}
{:threshold 20, :proxy "given-name", :semantic "phone"}
{:threshold 40, :proxy "given-name", :semantic "address"}]
How the default configuration works:
An email address is added to the bad-values blocklist when the same email address is associated with more than 20 distinct given names or more than 20 distinct surnames.
A phone number is added to the bad-values blocklist when the same phone number is associated with more than 20 distinct given names.
A physical address is added to the bad-values blocklist when the same physical address is associated with more than 40 distinct given names.
The threshold values can be increased or decreased as needed for your tenant.
You can add blocklist items using the following syntax:
{:threshold 00, :proxy "semantic-name", :semantic "semantic-name"}
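The threshold rules above amount to counting distinct proxy values for each value of a semantic. The following Python sketch illustrates that counting with the default configuration translated into Python dictionaries. It is not Amperity's implementation: the function name, the record layout, and the exact-string comparison of values are assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

# The default badvalues-config shown above, expressed as Python dictionaries.
DEFAULT_CONFIG = [
    {"threshold": 20, "proxy": "given-name", "semantic": "email"},
    {"threshold": 20, "proxy": "surname", "semantic": "email"},
    {"threshold": 20, "proxy": "given-name", "semantic": "phone"},
    {"threshold": 40, "proxy": "given-name", "semantic": "address"},
]


def detect_bad_values(
    records: Iterable[dict], config: List[dict] = DEFAULT_CONFIG
) -> Set[Tuple[str, str]]:
    """Return (semantic, value) pairs associated with too many distinct proxy values."""
    records = list(records)
    bad_values = set()
    for rule in config:
        distinct_proxies = defaultdict(set)
        for record in records:
            value = record.get(rule["semantic"])
            proxy = record.get(rule["proxy"])
            if value and proxy:
                distinct_proxies[value].add(proxy)
        for value, proxies in distinct_proxies.items():
            # "more than" the threshold, per the description above
            if len(proxies) > rule["threshold"]:
                bad_values.add((rule["semantic"], value))
    return bad_values
```

In this sketch an email address shared by 21 distinct given names is flagged, while one shared by exactly 20 is not. Passing an empty list as `config` flags nothing, which corresponds to disabling automatic detection.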
Disable automatic bad value detection¶
You can disable automatic bad-value blocklists by editing the configuration to be empty square brackets:
:amperity.stitch.settings/badvalues-config []
Clustering algorithm¶
Warning
Use :hierarchical clustering unless instructed differently.
The advanced configuration setting for the clustering algorithm is:
:stitch/clustering-algorithm :hierarchical
The default value for the Stitch clustering algorithm is :hierarchical, which applies hierarchical clustering. This value should not be changed without careful consideration. The other configuration value is :nil, which uses connected components directly.
Pre-processing profiles¶
Stitch can be configured to use non-default preprocessing profiles for the following use cases:
Australian phone numbers
Business email addresses
Clean foreign keys
Multiple preprocessing profiles
Normalize gender
Important
Use a non-default preprocessing profile in specific situations only.
The advanced configuration setting for preprocessing profiles is:
:amperity.stitch.settings/pre-processing-profiles #{ default }
Australian phone numbers¶
To use Australian phone numbers, use:
:amperity.stitch.settings/pre-processing-profiles #{ :australian-customer }
Business email addresses¶
To allow business email addresses, use:
:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email }
Clean foreign keys¶
To clean foreign keys (trim whitespace and update them to uppercase), use:
:amperity.stitch.settings/pre-processing-profiles #{ :clean-fk }
Multiple preprocessing profiles¶
To apply more than one preprocessing profile, use:
:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email :clean-fk }
Email addresses¶
Many email addresses are not useful for identity resolution. Some of them are generic, such as info@some-domain.com, and are often associated with a place of business rather than with a unique individual. Other email addresses are bogus: they were entered to satisfy a requirement to provide an email address, but are fake, such as 123@some-domain.com.
The following values associated with the email semantic are ignored by Stitch when performing identity resolution:
@NOEMAIL.COM
@NOMAIL.COM
0000000000
123@
1234@
99@
ABC@
ABC123@
ADMIN@
BOOKING@
CLIENT@
CLIENTS@
CONFIRMATION@
CONFIRMATIONS@
CONTACT@
CUSTOMERSERVICE
CUSTOMERSERVICE@
CUSTOMERSERVICES
CUSTOMERSERVICES@
DECLINE@
DECLINED@
DENIED@
EMAIL@
@EMAIL.TST
EXAMPLE@
FAKENAME@
GUEST@
GUESTS@
HELP@
HELPS@
HOTELHELP@
HOTELPARTNER@
HOTELPARTNERS@
INFO@
JUNK@
MAIL@
ME@
N@A
NAME@
NO@
NOEMAIL@
NOMAIL@
NONE@
NONENONE@
NOREPLY@
NOTHANKS@
NOTHANKYOU@
ONLINERESERVATION
ONLINERESERVATION@
ONLINERESERVATIONS
ONLINERESERVATIONS@
OPERATION@
OPERATIONS@
QUERIES@
QUERY@
REFUSED@
RES@
RESERVAS
RESERVATION@
RESERVATIONS@
ROOMRESERVATION@
ROOMRESERVATIONS@
SAMPLE@
SAMPLES@
SERVICE@
SHOP@
TEST@
TESTING@
TESTEMAIL@
TRAVEL@
TRAVELS
VENDOR@
VENDORS@
XXX@
Some of these values are always ignored, even when business email addresses are allowed.
Stitch may be configured to allow certain generic email addresses to be available to Stitch as part of identity resolution when the pre-processing-profiles configuration setting is set to:
:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email }
When this setting is updated, only the following email address patterns are ignored by Stitch:
@NOEMAIL.COM
@NOMAIL.COM
123@
1234@
99@
ABC@
ABC123@
DECLINE@
DECLINED@
DENIED@
FAKENAME@
JUNK@
NO@
NOEMAIL@
NOMAIL@
NONE@
NONENONE@
NOREPLY@
NOTHANKS@
NOTHANKYOU@
REFUSED@
XXX@
Use a bad-values blocklist to configure Amperity to continue ignoring any of the email address patterns that were removed from the default list of ignored email patterns.
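The effect of the :allow-business-email profile on the ignore lists can be sketched in Python. The sketch uses a small subset of the patterns listed above and makes assumptions that the source does not confirm: that a pattern ending in "@" is compared with the start of the uppercased email address, that a pattern starting with "@" is compared with its end, and the function and variable names.

```python
# A small subset of the patterns listed above, for illustration only.
ALWAYS_IGNORED = ["@NOEMAIL.COM", "@NOMAIL.COM", "123@", "NOREPLY@", "XXX@"]
IGNORED_UNLESS_BUSINESS_ALLOWED = ["INFO@", "ADMIN@", "CONTACT@", "HELP@"]


def is_ignored(email: str, allow_business_email: bool = False) -> bool:
    """Return True when the email address matches an ignored pattern.

    Assumed matching rules (not confirmed by the documentation):
    - "NAME@" patterns match the start of the address
    - "@DOMAIN" patterns match the end of the address
    """
    value = email.strip().upper()
    patterns = list(ALWAYS_IGNORED)
    if not allow_business_email:
        patterns += IGNORED_UNLESS_BUSINESS_ALLOWED
    for pattern in patterns:
        if pattern.startswith("@") and value.endswith(pattern):
            return True
        if pattern.endswith("@") and value.startswith(pattern):
            return True
    return False
```

Under these assumptions, `info@some-domain.com` is ignored by default but becomes available to Stitch when business email addresses are allowed, while `123@some-domain.com` is ignored in both cases.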
Normalize gender¶
To configure gender processing preferences, use:
:amperity.stitch.settings/pre-processing-profiles #{ :normalize-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :skip-derive-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-src-gender }
or:
:amperity.stitch.settings/pre-processing-profiles #{ :normalize-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :skip-derive-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-derived-gender }
Supersized clusters¶
A supersized cluster is a cluster of records that is discovered during the Stitch process that has 64 (or more) matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis. You may configure the threshold at which Stitch will discover a supersized cluster.
A supersized cluster is created when multiple transitive connections are present across clusters of records. For example, individuals named Mary Johnson and Jeffrey Johnson with the following records:
1. Mary Johnson, maryjohnson@gmail.com, 50 1st Avenue, New York, NY, with 50 connected records.
2. Jeffrey Johnson, jeffjohnson@gmail.com, 50 1st Avenue, New York, NY, with 25 connected records.
3. Mary Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 17 connected records.
4. Jeffrey Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 8 connected records.
These records block together in the following ways:
Records 1 and 3 block together on name and address.
Records 2 and 4 block together on name and address.
Records 3 and 4 block together on email.
All four groups of records transitively connect into a single connected cluster with a size of 100.
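The transitive connection in this example can be reproduced with a small union-find sketch in Python. The group sizes and the three blocking links come from the example above; the data structures, names, and the use of union-find are illustrative assumptions and not a description of how Stitch builds clusters.

```python
def find(parent: dict, item: int) -> int:
    """Return the root of the set that contains item, compressing the path."""
    while parent[item] != item:
        parent[item] = parent[parent[item]]
        item = parent[item]
    return item


def union(parent: dict, a: int, b: int) -> None:
    """Merge the sets that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


# Connected record counts for the four groups of records in the example.
group_sizes = {1: 50, 2: 25, 3: 17, 4: 8}
# Records 1 and 3, and 2 and 4, block on name and address; 3 and 4 block on email.
block_links = [(1, 3), (2, 4), (3, 4)]

parent = {group: group for group in group_sizes}
for a, b in block_links:
    union(parent, a, b)

# Sum the sizes of all groups that ended up in the same connected cluster.
cluster_sizes = {}
for group, size in group_sizes.items():
    root = find(parent, group)
    cluster_sizes[root] = cluster_sizes.get(root, 0) + size

SUPERSIZED_MIN_SIZE = 64  # the threshold described in this section
supersized = {root: size for root, size in cluster_sizes.items() if size >= SUPERSIZED_MIN_SIZE}
```

All four groups end up under one root with a combined size of 50 + 25 + 17 + 8 = 100, which meets the threshold of 64.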
Amperity defines a supersized cluster as any cluster with 64 (or more) connections. To change this threshold to a higher or lower value, update the following advanced configuration setting:
:amperity.stitch.settings/supersized-cluster-min-size 64
Trivial duplicates¶
A trivial duplicate is a set of nearly-identical records that share enough matching PII to clearly identify a single unique individual. Trivial duplicates are identified by Stitch early in the identity resolution process. Only one of these records is passed downstream for additional Stitch processing; the other records – the trivial duplicates – are not.
The following table represents a set of records that contain enough matching PII to associate all of these records to a single unique individual:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-4 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-7 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-12 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-87 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-204 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-377 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-589 Justin Currie 123 West Elm St ... ffjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
r-911 Justin Currie 123 West Elm St ... ack192390190@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
However, not all of the PII is identical. The email addresses do not match across all of the records. The trivial duplication process will collapse the identical records down, leaving the following set of records:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-589 Justin Currie 123 West Elm St ... ffjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
The complete set of records (including trivial duplicates) will be available in the Unified Coalesced table. The collapsed records will be available in the Unified Preprocessed Raw table.
A qualified trivial duplicate is a set of records that have enough matching PII to score 3.0 (or greater) and that were grouped together.
Qualified trivial duplicates are treated as a single record by downstream Stitch processes and are truncated from the Unified Preprocessed Raw table.
What are the rep_ds and rep_pk columns?
Use the rep_pk and rep_ds columns in the Unified Coalesced table to help with situations where it’s necessary to understand why two records were not clustered together.
The rep_pk column is an identifier that represents the first grouping of records done by Stitch. This grouping is based on identical semantic patterns.
The rep_ds column shows the datasource that is associated with the rep_pk column.
The combination of rep_ds and rep_pk represents qualified trivial duplicates that were discovered by Stitch early in the identity resolution process.
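The relationship between collapsed records and the rep_ds and rep_pk columns can be pictured with a minimal sketch. This is not how Stitch is implemented: it assumes that grouping is a plain equality check on PII values (Stitch groups on semantic patterns and scoring), that the first record seen becomes the representative, and that the datasource names loyalty and pos exist. All of these are hypothetical and chosen only for illustration.

```python
# Hypothetical records: (datasource, pk, first name, last name, address, email).
# The datasource names are assumptions made for this example.
records = [
    ("loyalty", "r-1", "Justin", "Currie", "123 West Elm St", "ack192390190@gmail.com"),
    ("loyalty", "r-78", "Justin", "Currie", "123 West Elm St", "doi210898r09@appco.com"),
    ("pos", "r-911", "Justin", "Currie", "123 West Elm St", "ack192390190@gmail.com"),
]


def collapse_trivial_duplicates(rows):
    """Group rows with identical PII; the first row seen represents the group."""
    representatives = {}  # PII tuple -> (rep_ds, rep_pk)
    coalesced = []        # every record, annotated with its representative
    for ds, pk, *pii in rows:
        rep_ds, rep_pk = representatives.setdefault(tuple(pii), (ds, pk))
        coalesced.append({"ds": ds, "pk": pk, "rep_ds": rep_ds, "rep_pk": rep_pk})
    # Only the representative of each group survives the collapse.
    collapsed = [
        row for row in coalesced
        if (row["ds"], row["pk"]) == (row["rep_ds"], row["rep_pk"])
    ]
    return coalesced, collapsed


coalesced, collapsed = collapse_trivial_duplicates(records)
for row in coalesced:
    print(row)
```

In this sketch, coalesced plays the role of the Unified Coalesced table (every record, each carrying a rep_ds and rep_pk) and collapsed plays the role of the collapsed set of records. Two records that share the same rep_ds and rep_pk were grouped as trivial duplicates.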
The advanced configuration setting for trivial duplicates is:
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10
Increasing this value decreases the likelihood that multiple records with trivial differences are collapsed into a single record for the identity resolution process. Tune this setting only after understanding the quality of downstream data, including stitched records.
Semantic exclusions¶
Semantic exclusions for trivial duplicates may be specified. When a semantic is excluded, the values associated with that semantic are ignored for the purpose of identifying clusters of records. Use an exclusion only in limited scenarios where certain types of data are known to be of lower quality.
To define an exclusion, add the following advanced configuration setting for Stitch:
:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"semantic_name"}
where semantic_name is the name of a semantic, such as email. (This value is nil by default.)
The following table represents a set of records that contain enough PII to identify a single unique individual:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-4 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-7 Justin Currie 123 West Elm St ... vnbb11ss0912@gmail.com
r-12 Justin Currie 123 West Elm St ... fio190912380@cgg.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-87 Justin Currie 123 West Elm St ... vnbb41090912@gmail.com
r-204 Justin Currie 123 West Elm St ... vnbb11090912@gmail.com
r-377 Justin Currie 123 West Elm St ... vnbb12290912@gmail.com
r-589 Justin Currie 123 West Elm St ... dfjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
r-911 Justin Currie 123 West Elm St ... vnbb11450912@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
All of the information is identical, except for variations in email addresses that cause each record to be unique. Because the email addresses are not identical, Amperity will not collapse them into a single record for identity resolution purposes even when it is obvious that all of these records are the same individual.
The maximum allowed number of records with trivial duplicates is set to “10” by default. When the number of records with trivial duplicates is greater than this value, each individual record will be treated as a single record. Since the previous example shows eleven records for Justin Currie, each with a unique value for the email column, eleven individual records would be created.
Semantic exclusions define a threshold over which records like these will be collapsed into a trivial duplicate. For example, let’s say you have defined a semantic exclusion for email and left the size threshold at “10”.
For each unique combination of PII (excluding email addresses), the distinct email addresses that are associated with that combination are counted. If there are more than 10 distinct email addresses, those records are collapsed into a trivial duplicate.
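The threshold check described above can be sketched as follows. This is an illustration under stated assumptions, not the Stitch implementation: the function name groups_to_collapse, the field names given_name, surname, and address, and the example.com email addresses are all hypothetical. Only the two setting names mirrored in the constants come from the configuration shown earlier.

```python
from collections import defaultdict

SIZE_THRESHOLD = 10             # mirrors soft-trivial-dupe-size-threshold
EXCLUDED_SEMANTICS = {"email"}  # mirrors soft-trivial-dupe-semantic-exclusions


def groups_to_collapse(rows, excluded=EXCLUDED_SEMANTICS, threshold=SIZE_THRESHOLD):
    """Return PII combinations whose count of distinct excluded values exceeds the threshold."""
    distinct = defaultdict(set)
    for row in rows:
        # Build the grouping key from every field except the pk and the excluded semantics.
        key = tuple(sorted(
            (name, value) for name, value in row.items()
            if name != "pk" and name not in excluded
        ))
        distinct[key].add(tuple(row.get(name) for name in sorted(excluded)))
    return {key: len(values) for key, values in distinct.items() if len(values) > threshold}


# Eleven hypothetical records that differ only by email address.
rows = [
    {"pk": f"r-{i}", "given_name": "Justin", "surname": "Currie",
     "address": "123 West Elm St", "email": f"user{i}@example.com"}
    for i in range(11)
]

result = groups_to_collapse(rows)
print(len(result), list(result.values()))
```

With eleven distinct email addresses and a threshold of 10, the single PII combination is flagged for collapsing. With ten or fewer distinct email addresses, nothing is flagged.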
Warning
Semantic exclusions should be applied very carefully. Use a sandbox to configure and apply a semantic exclusion, and then carefully review and validate that all downstream processes are not adversely affected by the change prior to applying a semantic change to a production environment.
Example: email addresses¶
A long-running promotion for a free food item results in a large number of email addresses associated with the same first name, last name, and phone number. This results in a large number of nearly identical records, each with a unique email address. You can use semantic exclusions to define a threshold over which records like this are collapsed into a trivial duplicate.
Configure Stitch to define a semantic exclusion for email addresses:
:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"email"}
and then define the threshold:
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 25
For each unique combination of PII (excluding email addresses), the distinct email addresses that are associated with that combination are counted. If there are more than 25 distinct email addresses, those records are collapsed into a trivial duplicate.
Stitch reports¶
A Stitch report shows cluster graphs for individuals associated with the Amperity ID. Ensuring that certain Amperity IDs are included (or excluded) can help improve the quality of the Stitch report. The Amperity IDs that are included appear first in the series of individuals shown when exploring Amperity IDs.
Stitch may be configured to include or exclude specific Amperity IDs in the Stitch report.
Use :amperity.stitch-report.example/exclude to define a list of Amperity IDs to be excluded from the report.
Use :amperity.stitch-report.example/include to define a list of Amperity IDs to be included in the report.
These settings are available from the Stitch page. Click Configure, add the list of Amperity IDs to be included or excluded under Stitch report configuration, and then run Stitch.
For example:
{
  :amperity.stitch-report.example/exclude
  [
    "123a456b-439c-145d-345e-123kmc901nju"
    "124a432b-743c-012d-432e-678fgh901cef"
    "125a456b-649c-032d-373e-678afd901ijk"
    "234a456b-459c-532d-346e-345lkd901dcr"
    ...
  ],
  :amperity.stitch-report.example/include
  [
    "723a456b-789c-012d-345e-321fee901lpo"
    "432a456b-789c-567d-876e-678fgh901asd"
    "567a456b-789c-543d-365e-543fda901hcg"
    "987a456b-789c-234d-645e-345plo901ikm"
    ...
  ],
}
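When the lists of Amperity IDs are maintained elsewhere, the configuration text can be generated instead of typed by hand. The following is a minimal sketch, assuming the configuration is pasted as the map shown above. The helper name render_report_config is hypothetical and is not part of Amperity. Only the two setting keys are taken from this page.

```python
EXCLUDE_KEY = ":amperity.stitch-report.example/exclude"
INCLUDE_KEY = ":amperity.stitch-report.example/include"


def render_report_config(exclude, include):
    """Render the exclude and include lists of Amperity IDs as map text (hypothetical helper)."""
    def vector(ids):
        # Quote each ID and wrap the list in square brackets.
        return "[" + " ".join(f'"{amperity_id}"' for amperity_id in ids) + "]"

    return (
        "{" + f"{EXCLUDE_KEY} {vector(exclude)}\n"
        f" {INCLUDE_KEY} {vector(include)}" + "}"
    )


config_text = render_report_config(
    exclude=["123a456b-439c-145d-345e-123kmc901nju"],
    include=["723a456b-789c-012d-345e-321fee901lpo"],
)
print(config_text)
```

The generated text can then be pasted under Stitch report configuration before running Stitch.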