Configure Stitch¶
Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.
Stitch configuration settings¶
Stitch configuration is defined by a list of settings in the Settings dialog box.
Caution
For most situations, there is no reason to change these settings. In some cases, after consultation with your Amperity representative, tuning Stitch configuration settings may be helpful.
To edit Stitch configuration settings
From the Stitch tab, click Settings.
In the Stitch Settings dialog box, under Stitch configuration, review the configuration settings.
Make your changes.
Note
Stitch settings should be configured using a sandbox. Verify Stitch results, Stitch QA results, the values in all standard database tables, and the behavior of downstream workflows within Amperity to ensure that changes to Stitch settings have the desired effects.
Click Save.
From the Stitch tab, click Run. Carefully review the results to ensure that your changes had the desired effect.
Configuration options¶
The types of configuration changes some organizations make include:
Warning
Stitch configuration is represented as a block of Clojure code exposed via the Amperity UI:
{:amperity.stitch.settings/blocking-strategies
#{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk},
:amperity.stitch.settings/classifier :general-ordinal-fk-priority,
:amperity.stitch.settings/clustering-algorithm :hierarchical,
:amperity.stitch.settings/enable-low-cardinality-alerts? false,
:amperity.stitch.settings/enable-stable-id? true,
:amperity.stitch.settings/force? false,
:amperity.stitch.settings/harvest-feature-profiles? false,
:amperity.stitch.settings/ignore-jitter? true,
:amperity.stitch.settings/metrics-partition-period 100,
:amperity.stitch.settings/output-partitions 128,
:amperity.stitch.settings/parallelism 2,
:amperity.stitch.settings/pre-processing-profile default,
:amperity.stitch.settings/samples-per-feature-signature 3,
:amperity.stitch.settings/skip-scores-output? false,
:amperity.stitch.settings/skip-unified-changes? false,
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10,
:amperity.stitch.settings/stable-id-partition-count 1,
:amperity.stitch.settings/supersized-cluster-min-size 64,
:amperity.stitch.settings/supersized-partition-max-depth 4,
:amperity.stitch.settings/threshold 3,
:amperity.stitch.settings/unified-changes-recorded-days 30,
:amperity.stitch.settings/use-uuid-key-ranges? true,
:amperity.stitch.settings/write-clustering-table? false}
Many of these settings are configurable and are described in more detail below.
Automatic bad value detection¶
Amperity is configured to automatically apply blocklists to email addresses, phone numbers, and physical addresses when a bad value is discovered at a frequency that exceeds the defined threshold.
The default configuration is:
:amperity.stitch.settings/badvalues-config [
{:threshold 20, :proxy "given-name", :semantic "email"}
{:threshold 20, :proxy "given-name", :semantic "phone"}
{:threshold 40, :proxy "given-name", :semantic "address"}]
How the default configuration works:
An email address is added to the bad-values blocklist when the same email address is associated with more than 20 distinct given names.
A phone number is added to the bad-values blocklist when the same phone number is associated with more than 20 distinct given names.
A physical address is added to the bad-values blocklist when the same physical address is associated with more than 40 distinct given names.
The threshold values can be increased or decreased as needed for your tenant.
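The default rules can be read as a distinct-count check. The following Python sketch is illustrative only and is not how Amperity implements detection: the function name `find_bad_values` and the record layout (one dictionary per record, keyed by semantic name) are assumptions made for this example. It flags a value when that value is seen with more distinct proxy values than the configured threshold.

```python
from collections import defaultdict

# Mirrors the default badvalues-config shown above.
BADVALUES_CONFIG = [
    {"threshold": 20, "proxy": "given-name", "semantic": "email"},
    {"threshold": 20, "proxy": "given-name", "semantic": "phone"},
    {"threshold": 40, "proxy": "given-name", "semantic": "address"},
]


def find_bad_values(records, config=BADVALUES_CONFIG):
    """Return {semantic: values seen with more than `threshold` distinct proxy values}.

    Hypothetical helper: `records` is a list of dicts keyed by semantic name.
    """
    flagged = {}
    for rule in config:
        proxies_by_value = defaultdict(set)
        for record in records:
            value = record.get(rule["semantic"])
            proxy = record.get(rule["proxy"])
            if value and proxy:
                proxies_by_value[value].add(proxy)
        flagged[rule["semantic"]] = {
            value
            for value, proxies in proxies_by_value.items()
            if len(proxies) > rule["threshold"]
        }
    return flagged


# 21 distinct given names share one email address; one person has their own.
records = [{"given-name": f"NAME{i}", "email": "store@example.com"} for i in range(21)]
records.append({"given-name": "JUSTIN", "email": "justin@example.com"})

flagged = find_bad_values(records)
print(flagged["email"])
```

With the default threshold of 20, the shared address is flagged because it appears with 21 distinct given names, while the individual address is not.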
You can add blocklist items using the following syntax:
{:threshold 00, :proxy "semantic-name", :semantic "semantic-name"}
Disable automatic bad value detection¶
You can disable automatic bad-value blocklists by editing the configuration to be empty square brackets:
:amperity.stitch.settings/badvalues-config []
Blocking strategy¶
Blocking is a non-trivial step for record linking in the Stitch process. An overly generous blocking strategy may result in a high recall rate (too many pairs being evaluated) along with negative system performance. An overly conservative blocking strategy may result in a low recall rate (too few pairs being evaluated). Individual blocking keys may be conservative or generous. The combination of blocking keys is what creates the ideal recall rate without compromising the performance of Amperity.
The default blocking strategy:
:stitch/blocking-strategies #{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk}
Note
The order in which blocking strategies are listed does not matter. For example:
:dnf1 :dnf3 :dnf4 :dnf5
and:
:dnf1 :dnf4 :dnf5 :dnf3
will be processed in the same way and will return the same results.
In nearly all cases for all customers, the default blocking strategy should provide a reasonable recall rate. Each individual blocking strategy looks at various combinations of PII data:
Strategy | Key |
---|---|
:address | Non-default. This blocking strategy groups values associated with the full address using the address and address2 semantics. |
:company | Non-default. This blocking strategy groups values associated with the company semantic. |
:dnf1 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first character in surname, and birthdate. |
:dnf2 | Non-default, use carefully. This blocking strategy groups values associated with the following semantics: the full given-name and email. |
:dnf3 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and postal. |
:dnf4 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and address. |
:dnf5 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and phone. |
:dnf6 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and the local part of an email address in email. |
:dnf7 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and company. |
:dnf8 | Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and PO box values that are derived from address. |
:email | Default. This blocking strategy groups values associated with the following semantics: the full email address in email. |
:fk | Default. This blocking strategy groups values associated with foreign keys. |
:login-partial | Non-default. This blocking strategy groups values associated with cleaned email addresses derived from email. This is used for low-threshold email address matching. |
:login-trimmed | Non-default. This blocking strategy groups values associated with the first five characters of an email address derived from email. This is used for low-threshold email address matching. |
:name | Non-default, use carefully. This blocking strategy groups values associated with the following semantics: given-name and surname. The order of given-name and surname is sorted lexicographically. The blocking key for JOHN SMITH and SMITH JOHN is JOHN:SMITH. |
:phone | Non-default. This blocking strategy groups values associated with the phone semantic. |
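To make the idea of a blocking key concrete, the following Python sketch builds keys for two of the strategies in the table and collects the candidate pairs that share a key. It is a simplified illustration, not the Stitch implementation: the function names, the record layout, the uppercase normalization, and the key format for `:dnf1` are assumptions. Only the `JOHN:SMITH` format for `:name` comes from the table.

```python
from collections import defaultdict
from itertools import combinations


def dnf1_key(record):
    """:dnf1 - first three characters of given-name, first character of surname, birthdate."""
    given = record.get("given-name")
    surname = record.get("surname")
    birthdate = record.get("birthdate")
    if not (given and surname and birthdate):
        return None  # assumption: a record missing any part produces no key
    return f"{given[:3].upper()}:{surname[:1].upper()}:{birthdate}"


def name_key(record):
    """:name - given-name and surname, sorted lexicographically."""
    given = record.get("given-name")
    surname = record.get("surname")
    if not (given and surname):
        return None
    return ":".join(sorted([given.upper(), surname.upper()]))


def candidate_pairs(records, key_fns):
    """Return the set of record pairs that share at least one blocking key."""
    blocks = defaultdict(list)
    for record in records:
        for key_fn in key_fns:
            key = key_fn(record)
            if key is not None:
                blocks[(key_fn.__name__, key)].append(record["pk"])
    pairs = set()
    for pks in blocks.values():
        pairs.update(combinations(sorted(pks), 2))
    return pairs


records = [
    {"pk": "r1", "given-name": "John", "surname": "Smith", "birthdate": "1980-01-01"},
    {"pk": "r2", "given-name": "Johnny", "surname": "Smyth", "birthdate": "1980-01-01"},
    {"pk": "r3", "given-name": "Smith", "surname": "John"},
]

pairs = candidate_pairs(records, [dnf1_key, name_key])
print(sorted(pairs))
```

Because the pairs are collected into a set, listing the key functions in a different order produces the same candidate pairs, which matches the note above about strategy order.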
Case-sensitive foreign keys¶
Values associated with foreign keys are case-insensitive by default. To configure values for particular foreign keys to be case-sensitive, add the following configuration setting to Stitch:
:amperity.stitch.settings/case-sensitive-fks #{"fk-name"}
where fk-name is the name of the foreign key for which values will be treated as case-sensitive.
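The effect of this setting on foreign key comparison can be sketched in Python. This is an illustration under stated assumptions, not Amperity's implementation: the foreign key names, the helper names, and the use of case folding to model case-insensitive comparison are all hypothetical.

```python
# Hypothetical foreign key name, standing in for the "fk-name" placeholder above.
CASE_SENSITIVE_FKS = {"fk-loyalty"}


def normalize_fk(fk_name, value, case_sensitive_fks=CASE_SENSITIVE_FKS):
    """Fold case unless the foreign key is configured as case-sensitive."""
    if fk_name in case_sensitive_fks:
        return value
    return value.casefold()


def fk_values_match(fk_name, left, right, case_sensitive_fks=CASE_SENSITIVE_FKS):
    """Compare two foreign key values after normalization."""
    return normalize_fk(fk_name, left, case_sensitive_fks) == normalize_fk(
        fk_name, right, case_sensitive_fks
    )


print(fk_values_match("fk-customer", "abc123", "ABC123"))
print(fk_values_match("fk-loyalty", "abc123", "ABC123"))
```

In this sketch the first comparison matches because `fk-customer` is case-insensitive by default, and the second does not because `fk-loyalty` is listed as case-sensitive.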
Clustering algorithm¶
Warning
Use :hierarchical clustering unless instructed differently.
The configuration setting for the clustering algorithm is:
:stitch/clustering-algorithm :hierarchical
The default value for the Stitch clustering algorithm is :hierarchical, which applies hierarchical clustering. This value should not be changed without careful consideration. The other configuration value is :nil, which uses connected components directly.
Days of recorded history¶
You can configure the number of days of history that are stored for the Unified_Changes_PKs and Unified_Changes_Clusters tables. The default is thirty days.
:amperity.stitch.settings/unified-changes-recorded-days 30,
Note
Changing this setting will not recreate history that has already been dropped.
Email address patterns¶
Many email addresses are not useful for identity resolution. Some of them are generic, such as info@some-domain.com, and are associated with a place of business rather than with a unique individual. Other email addresses are bogus: they were entered only because an email address was required and are fake, such as 123@some-domain.com.
The following values associated with the email semantic are ignored by Stitch when performing identity resolution:
@NOEMAIL.COM
@NOMAIL.COM
0000000000
123@
1234@
99@
ABC@
ABC123@
ADMIN@
BOOKING@
CLIENT@
CLIENTS@
CONFIRMATION@
CONFIRMATIONS@
CONTACT@
CUSTOMERSERVICE
CUSTOMERSERVICE@
CUSTOMERSERVICES
CUSTOMERSERVICES@
DECLINE@
DECLINED@
DENIED@
EMAIL@
@EMAIL.TST
EXAMPLE@
FAKENAME@
GUEST@
GUESTS@
HELP@
HELPS@
HOTELHELP@
HOTELPARTNER@
HOTELPARTNERS@
INFO@
JUNK@
MAIL@
ME@
N@A
NAME@
NO@
NOEMAIL@
NOMAIL@
NONE@
NONENONE@
NOREPLY@
NOTHANKS@
NOTHANKYOU@
ONLINERESERVATION
ONLINERESERVATION@
ONLINERESERVATIONS
ONLINERESERVATIONS@
OPERATION@
OPERATIONS@
QUERIES@
QUERY@
REFUSED@
RES@
RESERVAS
RESERVATION@
RESERVATIONS@
ROOMRESERVATION@
ROOMRESERVATIONS@
SAMPLE@
SAMPLES@
SERVICE@
SHOP@
TEST@
TESTING@
TESTEMAIL@
TRAVEL@
TRAVELS
VENDOR@
VENDORS@
XXX@
The values in bold are always ignored.
Stitch may be configured to allow certain generic email addresses to be available to Stitch as part of identity resolution when the pre-processing-profile configuration setting is set to:
pre-processing-profile :allow-business-email
When this setting is updated, only the following email address patterns are ignored by Stitch:
@NOEMAIL.COM
@NOMAIL.COM
123@
1234@
99@
ABC@
ABC123@
DECLINE@
DECLINED@
DENIED@
FAKENAME@
JUNK@
NO@
NOEMAIL@
NOMAIL@
NONE@
NONENONE@
NOREPLY@
NOTHANKS@
NOTHANKYOU@
REFUSED@
XXX@
Use a bad-values blocklist to configure Amperity to continue ignoring any of the email address patterns that were removed from the default list of ignored email patterns.
Empty tables¶
You can configure Stitch to accept empty tables. For example, some CCPA and GDPR workflows run daily, but do not always contain data, such as when zero customers make a data subject access request (DSAR).
Use the following setting to configure Stitch to ignore empty tables:
:amperity.stitch.settings/allowed-empty-tables #{"Table:Name"}
For example:
:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests"}
Stitch may be configured to ignore more than one empty table:
:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests", "CCPA:DeleteRequests"}
Force Stitch to run¶
Stitch will run when domain tables contain updates to data that is used by Stitch for identity resolution. Stitch will not run when updates are not present. To force Stitch to run, set the following setting (shown here as false) to true:
:amperity.stitch.settings/force? false
Ignore jitter¶
Jitter tracks changes to Amperity IDs across Stitch runs. You can disable tracking these changes in certain situations. For example, if you are using a tenant for training purposes or if you are loading a very large and complex set of data to an established tenant. In these situations a high jitter rate is common and is not unexpected.
You should leave this setting as false in your production tenant as often as possible. To ignore jitter, set the following setting to true:
:amperity.stitch.settings/ignore-jitter? false
Matching strategy¶
Amperity is configured by default to prioritize foreign key matching over separation key unmatching.
The matching strategy classifier tells Stitch how to apply the results of the blocking strategies, including which groups to analyze and the order in which that analysis should take place, when foreign keys and separation keys are present.
The default behavior prioritizes foreign keys over separation keys.
Foreign key matching¶
When foreign key matching is the priority, Amperity scores record pairs in the following order:
Does the record contain identical foreign key values?
If true, assign score 5.0. Stop.
If false, does the record contain conflicting separation key values?
If true, assign score 0.0. Stop.
If false, use pairwise comparison scoring.

Separation key unmatching¶
A separation key (sk) is used for deterministic unmatching of records.
By default, Amperity derives separation keys for sk-given-name and sk-generational-suffix. You may configure Amperity to prioritize separation keys over foreign keys.
When separation key unmatching is the priority, Amperity scores record pairs in the following order:
Does the record contain conflicting separation key values?
If true, assign score 0.0. Stop.
If false, does the record contain identical foreign key values?
If true, assign score 5.0. Stop.
If false, use pairwise comparison scoring.
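The two scoring orders differ only in which deterministic check runs first. The Python sketch below illustrates that difference under stated assumptions: the record layout, the helper names, and the rules for what counts as "identical" foreign keys or "conflicting" separation keys are hypothetical, and `pairwise_score` stands in for the pairwise comparison scoring that is not described here. The scores 5.0 and 0.0 and the classifier names come from this page.

```python
def has_identical_fk(a, b):
    """Assumption: any foreign key present on both records with the same value."""
    fks_b = b.get("fk", {})
    return any(name in fks_b and fks_b[name] == value for name, value in a.get("fk", {}).items())


def has_conflicting_sk(a, b):
    """Assumption: any separation key present on both records with different values."""
    sks_b = b.get("sk", {})
    return any(name in sks_b and sks_b[name] != value for name, value in a.get("sk", {}).items())


def score_pair(a, b, classifier, pairwise_score):
    """Apply the deterministic checks in the order set by the classifier."""
    checks = [(has_identical_fk, 5.0), (has_conflicting_sk, 0.0)]
    if classifier == "general-ordinal-sk-priority":
        checks.reverse()
    for check, score in checks:
        if check(a, b):
            return score  # first matching check wins; stop
    return pairwise_score(a, b)


a = {"fk": {"fk-customer": "C-100"}, "sk": {"sk-generational-suffix": "JR"}}
b = {"fk": {"fk-customer": "C-100"}, "sk": {"sk-generational-suffix": "SR"}}


def placeholder_score(a, b):
    return 2.0  # placeholder for pairwise comparison scoring


print(score_pair(a, b, "general-ordinal-fk-priority", placeholder_score))
print(score_pair(a, b, "general-ordinal-sk-priority", placeholder_score))
```

For a pair that shares a foreign key value but has conflicting separation keys, the first call returns 5.0 and the second returns 0.0, which is why the choice of classifier should be validated carefully.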

To change the matching strategy classifier
Update the configuration setting for model selection from:
:stitch/classifier :general-ordinal-fk-priority
to:
:stitch/classifier :general-ordinal-sk-priority
Warning
This value should be changed only after careful consideration. If changed, be sure to validate these results carefully to ensure that any changes to pairwise comparison scoring had the desired outcome.
Matching thresholds¶
Setting the threshold is a key step in tuning Stitch results for Amperity. The recommended threshold setting is 3.
Based on the precision and recall rates that were observed in the initial Stitch results, the matching threshold can be tuned to achieve better results by changing the weakest match category that can be included in Stitch results.
In general, a lower threshold will lead to more matches and fuzzier matched pairs, whereas a higher threshold will lead to fewer matches and more precise matched pairs. For the default ordinal classifier, five different thresholds can be chosen, and they are defined as follows:
Threshold | Weakest Match |
---|---|
1.0 | Weak |
2.0 | Moderate |
3.0 | High |
4.0 | Excellent |
5.0 | Exact |
For each threshold, all the matches that are equal to or stronger than the weakest matching type will be clustered together. For example, if the threshold 3.0 is chosen, the configuration setting for thresholds is:
:amperity.stitch.settings/threshold 3
which ensures that any pair that belongs to the High, Excellent, or Exact match types is stitched into a single cluster, assuming all pairs pass through the blocking phase.
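The relationship between the threshold and the match categories can be expressed as a simple filter. This Python sketch is illustrative: the category table is taken from this page, while the function names and the `(left, right, score)` tuple layout are assumptions.

```python
# Threshold-to-category table from this page.
MATCH_CATEGORIES = {1.0: "Weak", 2.0: "Moderate", 3.0: "High", 4.0: "Excellent", 5.0: "Exact"}


def categories_at_threshold(threshold):
    """Match categories that are equal to or stronger than the threshold."""
    return [name for score, name in sorted(MATCH_CATEGORIES.items()) if score >= threshold]


def pairs_to_cluster(scored_pairs, threshold=3):
    """Keep only the scored pairs that are eligible for clustering."""
    return [(left, right) for left, right, score in scored_pairs if score >= threshold]


scored_pairs = [("r1", "r2", 5.0), ("r2", "r3", 3.0), ("r3", "r4", 2.0)]

print(categories_at_threshold(3))
print(pairs_to_cluster(scored_pairs, threshold=3))
```

Lowering the threshold to 2 would also admit the Moderate pair `("r3", "r4")`, which is the "more matches, fuzzier pairs" trade-off described above.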
One-to-one Stitch¶
Stitch may be configured for one-to-one Stitch. This mode assigns an Amperity ID to each of your customers that are identified by a unique (and consistent) customer key that is input to Amperity. (This configuration option is sometimes referred to as “deterministic Stitch”.)
Important
If Stitch has already been run in standard mode, and then if one-to-one mode is enabled, there will be 100% jitter. Amperity IDs are not expected to be the same between standard and one-to-one Stitch runs.
Use the following steps to configure your tenant for one-to-one Stitch:
Ensure that each table that is made available to Stitch applies the ck semantic tag to the field that contains the existing customer ID.
Tip
You may apply the ck semantic tag from a feed or from a custom domain table.
Apply all other semantic tags – customer profile, foreign key, primary key, orders and items – to the correct fields in all of your data sources. These tags will have no effect when running one-to-one Stitch, but are required by Amp360 (customer profiles and transactions) and AmpIQ (segment insights and predictive modeling).
To configure Amperity for one-to-one Stitch, open the Stitch tab, and then click Settings. In the list of settings, add the following configuration setting:
:amperity.stitch.settings/one-to-one? true,
Run Stitch.
When the run is complete, each unique customer ID will be associated with an Amperity ID.
Outcome of one-to-one Stitch
The following table describes the changes you will see in your tenant after it is configured for one-to-one Stitch.
Tab | Changes |
---|---|
Stitch | The overview page will show a 0.0% deduplication rate. The Amperity ID will align to the total source IDs provided by the ck. The Data Explorer disables the tabs for Cluster Graph and Pairwise Comparison. These tabs are not available when one-to-one Stitch mode is configured. Note: The Amperity ID that is generated in one-to-one Stitch mode is based on customer keys (and not on stable clusters of customer records). |
Customer 360 | The Unified_Scores table is not generated. The following fields are removed from the Unified_Coalesced table: component_id, is_supersized, rep_ds, rep_pk, and supersized_id. Fields related to the bad-values blocklist are not available, including has_blv, blv_address, blv_email, blv_given_name, blv_phone, and blv_surname. All standard tables will contain a ck field. The Stitch QA database template is not needed. |
Queries | The Stitch QA queries template is not needed. |
Preprocessing profiles¶
Stitch can be configured to use non-default preprocessing profiles for the following use cases:
Allow business email addresses
Australian phone numbers
Clean foreign keys
Default
Normalize gender
Skip derived gender
Prioritize gender in source data
Prioritize derived gender
Important
Use a non-default preprocessing profile in specific situations only.
The default configuration setting for preprocessing profiles is:
:amperity.stitch.settings/pre-processing-profiles #{ default }
To allow business email addresses, use:
:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email }
To use Australian phone numbers, use:
:amperity.stitch.settings/pre-processing-profiles #{ :australian-customer }
To clean foreign keys (trim whitespace and update them to uppercase), use:
:amperity.stitch.settings/pre-processing-profiles #{ :clean-fk }
To configure gender processing preferences, use:
:amperity.stitch.settings/pre-processing-profiles #{ :normalize-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :skip-derive-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-src-gender }
- or -
:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-derived-gender }
To configure more than one preprocessing profile, use:
:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email :clean-fk }
Stable ID assignment¶
An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.
As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retains a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.
Stable ID assignment can be a resource-intensive process, in particular when:
Adding data sources that contain large numbers of rows (100+ million rows) of customer records.
Updating existing data sources with large numbers of rows on a periodic (monthly, quarterly, etc.) basis.
Data contains a very large number of duplicate values, such as 400k+ instances of an email address that is associated to a common business process.
You can configure the stable ID assignment process in the following ways:
Disable stable IDs¶
You can disable stable ID assignment to improve the performance of Amperity during the initial configuration phase for your tenant. (Note that some tenants, because of the way they use Amperity, do not require stable ID assignment and may choose to configure this setting to be false.)
:amperity.stitch.settings/enable-stable-id? false,
Important
Be sure to reset enable-stable-id? to true before you start to analyze and validate production-quality Stitch results or to perform any level of Stitch QA.
Increase partitions¶
When large differences are present between clusters of records for the current and previous Stitch runs you can increase the number of partitions that are used by Stitch during stable ID assignment. This situation can occur when data sources provide periodic updates, such as monthly, quarterly, etc., that contain a very large number of rows.
By default, stable ID assignment is based on the edges that exist between current and previous clusters of customer records, weighted by the number of shared primary keys. The shared primary keys are sorted in descending order, after which ties are broken by sorting the cluster IDs in ascending order. Edges are removed when higher-ranked edges are associated to identical clusters of customer records. Edges that survive this ranking are then used to map current cluster IDs to previous cluster IDs. Changes to stable ID assignment are captured in the Unified_Changes_Clusters and Unified_Changes_PKs tables.
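One way to read the ranking described above is as a greedy matching over weighted edges. The Python sketch below follows that reading and should be treated as an approximation, not the Stitch implementation: the function name, the input layout (cluster ID mapped to a set of primary keys), the exact tie-breaking order, and the assumption that each primary key belongs to exactly one previous cluster are all hypothetical.

```python
from collections import Counter


def assign_stable_ids(previous, current):
    """Map current cluster IDs to previous cluster IDs.

    previous / current: {cluster_id: set of primary keys}.
    Returns {current_id: previous_id}; unmapped current clusters need new IDs.
    """
    pk_to_previous = {pk: cid for cid, pks in previous.items() for pk in pks}

    # Edge weight = number of primary keys shared by a current and a previous cluster.
    edges = Counter()
    for current_id, pks in current.items():
        for pk in pks:
            if pk in pk_to_previous:
                edges[(current_id, pk_to_previous[pk])] += 1

    # Shared primary keys descending, then cluster IDs ascending to break ties.
    ranked = sorted(edges.items(), key=lambda edge: (-edge[1], edge[0][0], edge[0][1]))

    mapping, used_previous = {}, set()
    for (current_id, previous_id), _weight in ranked:
        # Drop an edge when a higher-ranked edge already claimed either cluster.
        if current_id in mapping or previous_id in used_previous:
            continue
        mapping[current_id] = previous_id
        used_previous.add(previous_id)
    return mapping


previous = {"A": {1, 2, 3}, "B": {4, 5}}
current = {"x": {1, 2, 4}, "y": {3, 5, 6}, "z": {7}}

mapping = assign_stable_ids(previous, current)
print(mapping)
```

In this example cluster `x` keeps the ID of `A` (two shared keys), `y` falls back to `B`, and `z` shares no keys with any previous cluster, so it would receive a new ID. The edge table grows with the number of differences between runs, which is consistent with the memory pressure described in this section.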
In some cases, the differences between the current and previous clusters of customer records are very large, requiring access to a large amount of memory to complete the stable ID assignment process. In this type of situation Stitch processes may take a very long time, or even appear to be stuck; in some cases the Stitch job runs out of memory and must be re-run. In these situations, do the following:
Look for unusual values, such as a large set of identical email addresses, that appear in the Unified_Coalesced table. This can sometimes be the cause of slow stable ID assignment. Mitigate the presence of these unusual values, and then run Stitch again.
Increase the number of partitions that are available to Stitch during the stable ID assignment process. You can increase the value of stable-id-partition-count to a value between 2 and 10 to improve the performance of Stitch during stable ID assignment. This setting should be used temporarily, but for some tenants it may need to be left at a non-default value. For example:
:amperity.stitch.settings/stable-id-partition-count 6
Skip unified changes¶
The Unified_Changes_Clusters and Unified_Changes_PKs tables may be resource intensive because they contain all of the changes to clusters and primary keys between the current and previous Stitch run.
The Unified_Changes_Clusters table contains a history of changes to cluster graphs, relative to the previous Stitch run.
The Unified_Changes_PKs table contains a history of changes to primary keys, relative to the previous Stitch run.
The default configuration for Stitch always builds these tables, but you can configure Stitch to skip building them when the following setting is set to true:
:amperity.stitch.settings/skip-unified-changes? false,
Skip unified scores¶
The Unified_Scores table records all of the pairwise comparison scores and match categories for all groups of records, and then for each group of records all of the pairwise scores that are present between records within that group.
The Unified_Scores table may be resource intensive. The default configuration for Stitch always builds this table, but you can configure Stitch to skip building it when the following setting is set to true:
:amperity.stitch.settings/skip-scores-output? false,
Stitch reports¶
A Stitch report shows cluster graphs for individuals associated with the Amperity ID. You can configure the Stitch report to include or exclude specific Amperity IDs. Ensuring that certain Amperity IDs are included (or excluded) can help improve the quality of the Stitch report. The Amperity IDs that are included will appear first in the series of individuals shown when exploring Amperity IDs.
Stitch may be configured to include or exclude specific Amperity IDs in the Stitch report.
Use :amperity.stitch-report.example/exclude to define a list of Amperity IDs to be excluded from the report.
Use :amperity.stitch-report.example/include to define a list of Amperity IDs to be included in the report.
These settings are available from the Stitch tab. Click Configure, add the list of Amperity IDs to be included or excluded under Stitch report configuration, and then run Stitch.
For example:
{
:amperity.stitch-report.example/exclude
[
"123a456b-789c-012d-345e-678fgh901ijk"
"124a456b-789c-012d-345e-678fgh901ijk"
"125a456b-789c-012d-345e-678fgh901ijk"
"126a456b-789c-012d-345e-678fgh901ijk"
...
],
:amperity.stitch-report.example/include
[
"723a456b-789c-012d-345e-678fgh901ijk"
"724a456b-789c-012d-345e-678fgh901ijk"
"725a456b-789c-012d-345e-678fgh901ijk"
"726a456b-789c-012d-345e-678fgh901ijk"
...
],
}
Supersized clusters¶
A supersized cluster is a cluster of records that is discovered during the Stitch process that has 64 (or more) matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.
A supersized cluster is created when multiple transitive connections are present across clusters of records. For example, individuals named Mary Johnson and Jeffrey Johnson with the following records:
Mary Johnson, maryjohnson@gmail.com, 50 1st Avenue, New York, NY, with 50 connected records.
Jeffrey Johnson, jeffjohnson@gmail.com, 50 1st Avenue, New York, NY, with 25 connected records.
Mary Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 17 connected records.
Jeffrey Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 8 connected records.
These records block together in the following ways:
Records 1 and 3 block together on name and address.
Records 2 and 4 block together on name and address.
Records 3 and 4 block together on email.
All four groups of records transitively connect into a single connected cluster with a size of 100.
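The transitive connection in this example can be reproduced with a small connected-components calculation. The Python sketch below uses a union-find structure, which is a standard technique chosen for illustration and is not a statement about how Stitch is implemented. The group sizes and the three blocked pairs come from the example above; the function and variable names are assumptions.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for left, right in edges:
        parent[find(left)] = find(right)

    components = {}
    for node in nodes:
        components.setdefault(find(node), []).append(node)
    return list(components.values())


# Record groups 1-4 from the example, with their connected record counts.
group_sizes = {1: 50, 2: 25, 3: 17, 4: 8}
# Groups 1 and 3, and 2 and 4, block on name and address; 3 and 4 block on email.
blocked_pairs = [(1, 3), (2, 4), (3, 4)]
SUPERSIZED_CLUSTER_MIN_SIZE = 64

components = connected_components(group_sizes, blocked_pairs)
sizes = [sum(group_sizes[node] for node in component) for component in components]
supersized = [size >= SUPERSIZED_CLUSTER_MIN_SIZE for size in sizes]

print(sizes, supersized)
```

No single group of records reaches 64 except through the shared email address, yet the three connections are enough to merge all four groups into one cluster of 100.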
Amperity defines a supersized cluster as any cluster with 64 (or more) connections. To change this threshold to a higher or lower value, update the following Stitch configuration setting:
:amperity.stitch.settings/supersized-cluster-min-size 64
Trivial duplicates¶
A trivial duplicate is a set of nearly-identical records that share enough matching PII to clearly identify a single unique individual. Trivial duplicates are identified by Stitch early in the identity resolution process. Only one of these records is passed downstream for additional Stitch processing; the other records – the trivial duplicates – are not.
The following table represents a set of records that contain enough matching PII to associate all of these records to a single unique individual:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-4 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-7 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-12 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-87 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-204 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-377 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-589 Justin Currie 123 West Elm St ... ffjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
r-911 Justin Currie 123 West Elm St ... ack192390190@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
However, not all of the PII is identical. The email addresses do not match across all of the records. The trivial duplication process will collapse the identical records down, leaving the following set of records:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-4 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-589 Justin Currie 123 West Elm St ... ffjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
The complete set of records (including trivial duplicates) will be available in the Unified_Coalesced table. The collapsed records will be available in the Unified_Preprocessed_Raw table.
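Collapsing identical records can be sketched as grouping on every field except the primary key. This Python example is a simplification for illustration: it assumes records are identical only when all non-key fields match exactly, it keeps the first record in each group as the representative, and the function name and record layout are hypothetical. The four sample rows are taken from the first table above, with the elided columns omitted.

```python
def collapse_trivial_duplicates(records, pk_field="pk"):
    """Keep the first record for each distinct combination of non-key fields."""
    representatives = {}
    for record in records:
        signature = tuple(sorted((k, v) for k, v in record.items() if k != pk_field))
        representatives.setdefault(signature, record)  # first record wins
    return list(representatives.values())


def row(pk, email):
    # All sample rows share the same name and address.
    return {
        "pk": pk,
        "given-name": "Justin",
        "surname": "Currie",
        "address": "123 West Elm St",
        "email": email,
    }


records = [
    row("r-1", "ack192390190@gmail.com"),
    row("r-7", "ack192390190@gmail.com"),
    row("r-12", "ack192390190@gmail.com"),
    row("r-78", "doi210898r09@appco.com"),
]

collapsed = collapse_trivial_duplicates(records)
print([record["pk"] for record in collapsed])
```

Records r-7 and r-12 are identical to r-1 apart from the primary key, so only r-1 and r-78 are passed on in this sketch.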
A qualified trivial duplicate is a set of records that have enough matching PII to score 3.0 (or greater) and that were grouped together.
Qualified trivial duplicates are treated as a single record by downstream Stitch processes and are truncated from the Unified_Preprocessed_Raw table.
What are the rep_ds and rep_pk columns?
Use the rep_pk and rep_ds columns in the Unified_Coalesced table to help with situations where it’s necessary to understand why two records were not clustered together.
The rep_pk column is an identifier that represents the first grouping of records done by Stitch. This grouping is based on identical semantic patterns.
The rep_ds column shows the datasource that is associated with the rep_pk column.
The combination of rep_ds and rep_pk represents qualified trivial duplicates that were discovered by Stitch early in the identity resolution process.
The configuration setting for trivial duplicates is:
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10
An increase to this value will decrease the likelihood that multiple records with trivial differences are collapsible into a single record for the identity resolution process. This setting should only be tuned after understanding the quality of downstream data, including stitched records.
Semantic exclusions¶
Semantic exclusions for trivial duplicates may be specified. When excluded, for the purpose of identifying clusters of records, the values associated with that semantic will be ignored. This should only be done for limited scenarios where certain types of data are known to be of lower quality.
To define an exclusion, add the following configuration setting for Stitch:
:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"semantic_name"}
where semantic_name is the name of a semantic, such as email. (This value is nil by default.)
The following table represents a set of records that contain enough PII that should identify a single unique individual:
------- ----------- ---------- ----------------- ------ ------------------------
pk FirstName LastName Address ... Email
------- ----------- ---------- ----------------- ------ ------------------------
r-1 Justin Currie 123 West Elm St ... ack192390190@gmail.com
r-4 Justin Currie 123 West Elm St ... vnbb1109fe12@gmail.com
r-7 Justin Currie 123 West Elm St ... vnbb11ss0912@gmail.com
r-12 Justin Currie 123 West Elm St ... fio190912380@cgg.com
r-78 Justin Currie 123 West Elm St ... doi210898r09@appco.com
r-87 Justin Currie 123 West Elm St ... vnbb41090912@gmail.com
r-204 Justin Currie 123 West Elm St ... vnbb11090912@gmail.com
r-377 Justin Currie 123 West Elm St ... vnbb12290912@gmail.com
r-589 Justin Currie 123 West Elm St ... dfjo19290190@gmail.com
r-829 Justin Currie 123 West Elm St ... vnbb11095412@gmail.com
r-911 Justin Currie 123 West Elm St ... vnbb11450912@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------
All of the information is identical, except for variations in email addresses that cause each record to be unique. Because the email addresses are not identical, Amperity will not collapse them into a single record for identity resolution purposes even when it is obvious that all of these records are the same individual.
The maximum allowed number of records with trivial duplicates is set to 10 by default. When the number of records with trivial duplicates is greater than this value, each individual record will be treated as a single record. Since the previous example shows eleven records for Justin Currie, each with a unique value for the email column, eleven individual records would be created.
Semantic exclusions define a threshold over which records like these will be collapsed into a trivial duplicate. For example, let’s say you have defined a semantic exclusion for email and left the size threshold at 10.
For each unique combination of PII–excluding email addresses!–the distinct email addresses that are associated with that unique combination of PII are compared. If there are more than 10 distinct email addresses, those records are collapsed into a trivial duplicate.
Warning
Semantic exclusions should be applied very carefully. Use a sandbox to configure and apply a semantic exclusion, and then carefully review and validate that all downstream processes are not adversely affected by the change prior to applying a semantic change to a production environment.
Example: Email addresses¶
A long-running promotion for a free food item results in a large number of email addresses associated with the same first name, last name, and phone number. This results in a large number of nearly-identical records, each with a unique email address. You can use semantic exclusions to define a threshold over which records like this are collapsed into a trivial duplicate.
Configure Stitch to define a semantic exclusion for email addresses:
:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"email"}
and then define the threshold:
:amperity.stitch.settings/soft-trivial-dupe-size-threshold 25
For each unique combination of PII–excluding email addresses!–the distinct email addresses that are associated with that unique combination of PII are compared. If there are more than 25 distinct email addresses, those records are collapsed into a trivial duplicate.
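The rule in the previous paragraph can be sketched in Python as a group-and-count step. This is an illustration of the described behavior only: the function name, the record layout, and the choice to keep the first record of a collapsed group are assumptions, and the sketch leaves groups at or below the threshold untouched rather than modeling the rest of trivial duplicate handling.

```python
from collections import defaultdict


def collapse_with_exclusion(records, excluded="email", size_threshold=25, pk_field="pk"):
    """Collapse groups that have more than `size_threshold` distinct excluded values."""
    groups = defaultdict(list)
    for record in records:
        # Group on all PII except the primary key and the excluded semantic.
        signature = tuple(
            sorted((k, v) for k, v in record.items() if k not in (pk_field, excluded))
        )
        groups[signature].append(record)

    output = []
    for group in groups.values():
        distinct_values = {record.get(excluded) for record in group}
        if len(distinct_values) > size_threshold:
            output.append(group[0])  # collapsed into a single representative
        else:
            output.extend(group)  # left as individual records
    return output


# 26 promotion sign-ups that differ only by email address.
promo = [
    {"pk": f"p-{i}", "given-name": "Pat", "surname": "Lee", "phone": "555-0100",
     "email": f"promo{i}@example.com"}
    for i in range(26)
]
# 3 records for another person, also with distinct email addresses.
other = [
    {"pk": f"o-{i}", "given-name": "Sam", "surname": "Ray", "phone": "555-0101",
     "email": f"sam{i}@example.com"}
    for i in range(3)
]

result = collapse_with_exclusion(promo + other, excluded="email", size_threshold=25)
print(len(result))
```

With the threshold at 25, the 26 promotion records collapse into one record while the three records for the other person are left as they are.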