Configure Stitch

Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.

Stitch configuration settings

Stitch configuration is defined by a list of settings in the Settings dialog box.

Caution

For most situations, there is no reason to change these settings. In some cases, after consultation with your Amperity representative, tuning Stitch configuration settings may be helpful.

To edit Stitch configuration settings

  1. From the Stitch tab, click Settings.

  2. In the Stitch Settings dialog box, under Stitch configuration, review the configuration settings.

  3. Make your changes.

    Note

    Stitch settings should be configured using a sandbox. Verify Stitch results, Stitch QA results, the values in all standard database tables, and the behavior of downstream workflows within Amperity to ensure that changes to Stitch settings have the desired effects.

  4. Click Save.

  5. From the Stitch tab, click Run. Carefully review the results to ensure that your changes had the desired effect.

Configuration options

The types of configuration changes some organizations make include:

Warning

Stitch configuration is represented as a block of Clojure code exposed via the Amperity UI:

{:amperity.stitch.settings/blocking-strategies
 #{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk},
 :amperity.stitch.settings/classifier :general-ordinal-fk-priority,
 :amperity.stitch.settings/clustering-algorithm :hierarchical,
 :amperity.stitch.settings/enable-low-cardinality-alerts? false,
 :amperity.stitch.settings/enable-stable-id? true,
 :amperity.stitch.settings/force? false,
 :amperity.stitch.settings/harvest-feature-profiles? false,
 :amperity.stitch.settings/ignore-jitter? true,
 :amperity.stitch.settings/metrics-partition-period 100,
 :amperity.stitch.settings/output-partitions 128,
 :amperity.stitch.settings/parallelism 2,
 :amperity.stitch.settings/pre-processing-profile default,
 :amperity.stitch.settings/samples-per-feature-signature 3,
 :amperity.stitch.settings/skip-scores-output? false,
 :amperity.stitch.settings/skip-unified-changes? false,
 :amperity.stitch.settings/soft-trivial-dupe-size-threshold 10,
 :amperity.stitch.settings/stable-id-partition-count 1,
 :amperity.stitch.settings/supersized-cluster-min-size 64,
 :amperity.stitch.settings/supersized-partition-max-depth 4,
 :amperity.stitch.settings/threshold 3,
 :amperity.stitch.settings/unified-changes-recorded-days 30,
 :amperity.stitch.settings/use-uuid-key-ranges? true,
 :amperity.stitch.settings/write-clustering-table? false}

Many of these settings are configurable and are described in more detail below.

Automatic bad value detection

Amperity is configured to automatically apply blocklists to email addresses, phone numbers, and physical addresses when a bad value is discovered at a frequency that exceeds the defined threshold.

The default configuration is:

:amperity.stitch.settings/badvalues-config [
{:threshold 20, :proxy "given-name", :semantic "email"}
{:threshold 20, :proxy "given-name", :semantic "phone"}
{:threshold 40, :proxy "given-name", :semantic "address"}]

How the default configuration works:

  1. An email address is added to the bad-values blocklist when the same email address is associated with more than 20 distinct given names.

  2. A phone number is added to the bad-values blocklist when the same phone number is associated with more than 20 distinct given names.

  3. A physical address is added to the bad-values blocklist when the same physical address is associated with more than 40 distinct given names.

The threshold values can be increased or decreased as needed for your tenant.
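The default rules above can be sketched as a count of distinct proxy values per semantic value. This is a minimal illustration in Python and is not Amperity's implementation: the function name find_bad_values, the record layout (plain dictionaries keyed by semantic name), and the sample email addresses are assumptions made for the example.

```python
from collections import defaultdict

# Mirrors the default badvalues-config shown above.
DEFAULT_RULES = [
    {"threshold": 20, "proxy": "given-name", "semantic": "email"},
    {"threshold": 20, "proxy": "given-name", "semantic": "phone"},
    {"threshold": 40, "proxy": "given-name", "semantic": "address"},
]


def find_bad_values(records, rules=DEFAULT_RULES):
    """Return {semantic: set of values} that exceed their rule's threshold.

    Hypothetical helper: `records` is a list of dicts keyed by semantic name.
    """
    bad = defaultdict(set)
    for rule in rules:
        proxies_by_value = defaultdict(set)
        for record in records:
            value = record.get(rule["semantic"])
            proxy = record.get(rule["proxy"])
            if value and proxy:
                proxies_by_value[value].add(proxy)
        for value, proxies in proxies_by_value.items():
            # "More than" the threshold, matching the description above.
            if len(proxies) > rule["threshold"]:
                bad[rule["semantic"]].add(value)
    return dict(bad)


# Hypothetical data: 21 distinct given names share one email address,
# and 20 distinct given names share another.
records = [{"given-name": f"NAME{i}", "email": "front-desk@example.com"} for i in range(21)]
records += [{"given-name": f"NAME{i}", "email": "shared@example.com"} for i in range(20)]
bad_values = find_bad_values(records)
```

In this sketch only the address shared by 21 given names is flagged, because the rule applies to values that exceed the threshold, not values that equal it.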

You can add blocklist items using the following syntax:

{:threshold 00, :proxy "semantic-name", :semantic "semantic-name"}

Disable automatic bad value detection

You can disable automatic bad-value blocklists by editing the configuration to be empty square brackets:

:amperity.stitch.settings/badvalues-config []

Blocking strategy

Blocking is a non-trivial step for record linking in the Stitch process. An overly generous blocking strategy may result in a high recall rate (too many pairs being evaluated) along with negative system performance. An overly conservative blocking strategy may result in a low recall rate (too few pairs being evaluated). Individual blocking keys may be conservative or generous. The combination of blocking keys is what creates the ideal recall rate without compromising the performance of Amperity.

The default blocking strategy:

:stitch/blocking-strategies #{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk}

Note

The order in which blocking strategies are listed does not matter. For example:

:dnf1 :dnf3 :dnf4 :dnf5

and:

:dnf1 :dnf4 :dnf5 :dnf3

will be processed in the same way and will return the same results.

In nearly all cases for all customers, the default blocking strategy should provide a reasonable recall rate. Each individual blocking strategy looks at various combinations of PII data:

Strategy

Key

:address

Non-default. This blocking strategy groups values associated with the full address using the address and address2 semantics.

:company

Non-default. This blocking strategy groups values associated with the company semantic.

:dnf1

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first character in surname, and birthdate.

:dnf2

Non-default, use carefully. This blocking strategy groups values associated with the following semantics: the full given-name and email.

:dnf3

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and postal.

:dnf4

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and address.

:dnf5

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and phone.

:dnf6

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and the local part of an email address in email.

:dnf7

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and company.

:dnf8

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and PO box values that are derived from address.

:email

Default. This blocking strategy groups values associated with the following semantics: the full email address in email.

:fk

Default. This blocking strategy groups values associated with foreign keys.

:login-partial

Non-default. This blocking strategy groups values associated with cleaned email addresses derived from email. This is used for low-threshold email address matching.

:login-trimmed

Non-default. This blocking strategy groups values associated with the first five characters of an email address derived from email. This is used for low-threshold email address matching.

:name

Non-default, use carefully. This blocking strategy groups values associated with the following semantics: given-name and surname. The order of given-name and surname is sorted lexicographically. The blocking key for JOHN SMITH and SMITH JOHN is JOHN:SMITH.

:phone

Non-default. This blocking strategy groups values associated with the phone semantic.
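A few of the strategies above can be sketched in Python to show how blocking keys group records into candidate pairs. This is an illustration under stated assumptions, not Amperity's implementation: the helper names blocking_keys and candidate_pairs, the uppercase normalization, the use of ":" as a separator for every key, and the sample records are all hypothetical. Only :dnf1, :dnf3, :email, and :name are modeled.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(record, strategies):
    """Return {strategy: key} for each strategy that can be computed."""
    given = (record.get("given-name") or "").upper()
    surname = (record.get("surname") or "").upper()
    builders = {
        "dnf1": lambda: (given[:3], surname[:1], record.get("birthdate")),
        "dnf3": lambda: (given[:3], surname[:3], record.get("postal")),
        "email": lambda: ((record.get("email") or "").upper(),),
        # Names are sorted so that JOHN SMITH and SMITH JOHN share a key.
        "name": lambda: tuple(sorted([given, surname])),
    }
    keys = {}
    for strategy in strategies:
        parts = builders[strategy]()
        if all(parts):  # skip a strategy when any component is missing
            keys[strategy] = ":".join(parts)
    return keys


def candidate_pairs(records, strategies):
    """Return index pairs of records that share at least one blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        for strategy, key in blocking_keys(record, strategies).items():
            blocks[(strategy, key)].append(index)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


name_key_a = blocking_keys({"given-name": "John", "surname": "Smith"}, ["name"])
name_key_b = blocking_keys({"given-name": "Smith", "surname": "John"}, ["name"])

records = [
    {"given-name": "Jonathan", "surname": "Smith", "postal": "98101"},
    {"given-name": "Jon", "surname": "Smithe", "postal": "98101"},
    {"given-name": "Mary", "surname": "Jones", "postal": "98101"},
]
pairs = candidate_pairs(records, ["dnf3"])
```

Because the pairs are collected into a set, the order in which strategies are listed does not change the result, and only the first two sample records are paired for pairwise comparison.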

Case-sensitive foreign keys

Values associated with foreign keys are case-insensitive by default. To configure values for particular foreign keys to be case-sensitive, add the following configuration setting to Stitch:

:amperity.stitch.settings/case-sensitive-fks #{"fk-name"}

where fk-name is the name of the foreign key for which values will be treated as case-sensitive.
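The comparison rule can be sketched in Python as follows. The helper name fk_values_match and the foreign key name fk-loyalty are hypothetical; the sketch only illustrates the described behavior and is not Amperity's implementation.

```python
def fk_values_match(fk_name, left, right, case_sensitive_fks=frozenset()):
    """Compare two values for one foreign key (hypothetical helper)."""
    if fk_name in case_sensitive_fks:
        return left == right
    # Default behavior: ignore case when comparing foreign key values.
    return left.casefold() == right.casefold()


default_match = fk_values_match("fk-loyalty", "ab-123", "AB-123")
strict_match = fk_values_match(
    "fk-loyalty", "ab-123", "AB-123", case_sensitive_fks={"fk-loyalty"}
)
```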

Clustering algorithm

Warning

Use :hierarchical clustering unless instructed differently.

The configuration setting for the clustering algorithm is:

:stitch/clustering-algorithm :hierarchical

The default value for the Stitch clustering algorithm is :hierarchical, which applies hierarchical clustering. This value should not be changed without careful consideration. The other configuration value is :nil, which uses connected components directly.

Days of recorded history

You can configure the number of days of history that are stored for the Unified_Changes_PKs and Unified_Changes_Clusters tables. The default is thirty days.

:amperity.stitch.settings/unified-changes-recorded-days 30,

Note

Changing this setting will not recreate history that has already been dropped.

Email address patterns

Many email addresses are not useful for identity resolution. Some of them are generic, such as info@some-domain.com, and are often associated with a place of business and are never associated with a unique individual. Other email addresses are bogus: they were entered only because an email address was required, such as 123@some-domain.com.

The following values associated with the email semantic are ignored by Stitch when performing identity resolution:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 0000000000

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • ADMIN@

  • BOOKING@

  • CLIENT@

  • CLIENTS@

  • CONFIRMATION@

  • CONFIRMATIONS@

  • CONTACT@

  • CUSTOMERSERVICE

  • CUSTOMERSERVICE@

  • CUSTOMERSERVICES

  • CUSTOMERSERVICES@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • EMAIL@

  • @EMAIL.TST

  • EXAMPLE@

  • FAKENAME@

  • GUEST@

  • GUESTS@

  • HELP@

  • HELPS@

  • HOTELHELP@

  • HOTELPARTNER@

  • HOTELPARTNERS@

  • INFO@

  • JUNK@

  • MAIL@

  • ME@

  • N@A

  • NAME@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • ONLINERESERVATION

  • ONLINERESERVATION@

  • ONLINERESERVATIONS

  • ONLINERESERVATIONS@

  • OPERATION@

  • OPERATIONS@

  • QUERIES@

  • QUERY@

  • REFUSED@

  • RES@

  • RESERVAS

  • RESERVATION@

  • RESERVATIONS@

  • ROOMRESERVATION@

  • ROOMRESERVATIONS@

  • SAMPLE@

  • SAMPLES@

  • SERVICE@

  • SHOP@

  • TEST@

  • TESTING@

  • TESTEMAIL@

  • TRAVEL@

  • TRAVELS

  • VENDOR@

  • VENDORS@

  • XXX@

A subset of these values is always ignored, regardless of the pre-processing profile that is configured.

Stitch may be configured to allow certain generic email addresses to be available to Stitch as part of identity resolution when the pre-processing-profile configuration setting is set to:

pre-processing-profile :allow-business-email

When this setting is updated, only the following email address patterns are ignored by Stitch:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • FAKENAME@

  • JUNK@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • REFUSED@

  • XXX@

Use a bad-values blocklist to configure Amperity to continue ignoring any of the email address patterns that were removed from the default list of ignored email patterns.
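The effect of the :allow-business-email profile on ignored patterns can be sketched in Python. This is an illustration only: the matching rules are assumptions (a pattern that ends with "@" is compared to the start of the address, a pattern that starts with "@" is compared to the end), only a small subset of the patterns listed above is included, and the helper name is_ignored_email is hypothetical.

```python
# Subsets of the lists above; the full lists are longer.
ALWAYS_IGNORED = {"@NOEMAIL.COM", "123@", "NOREPLY@", "XXX@"}
GENERIC_BUSINESS = {"INFO@", "ADMIN@", "RESERVATIONS@"}


def is_ignored_email(email, allow_business_email=False):
    """Return True when the address matches an ignored pattern (assumed rules)."""
    value = email.strip().upper()
    patterns = set(ALWAYS_IGNORED)
    if not allow_business_email:
        patterns |= GENERIC_BUSINESS
    for pattern in patterns:
        if pattern.startswith("@") and value.endswith(pattern):
            return True
        if pattern.endswith("@") and value.startswith(pattern):
            return True
    return False


default_info = is_ignored_email("info@some-domain.com")
business_info = is_ignored_email("info@some-domain.com", allow_business_email=True)
business_bogus = is_ignored_email("123@some-domain.com", allow_business_email=True)
```

With these assumptions, info@some-domain.com is ignored by default and allowed under the business email profile, while 123@some-domain.com is ignored in both cases.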

Empty tables

You can configure Stitch to accept empty tables. For example, some CCPA and GDPR workflows run daily, but do not always contain data, such as when zero customers make a data subject access request (DSAR).

Use the following setting to configure Stitch to ignore empty tables:

:amperity.stitch.settings/allowed-empty-tables #{"Table:Name"}

For example:

:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests"}

Stitch may be configured to ignore more than one empty table:

:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests", "CCPA:DeleteRequests"}

Force Stitch to run

Stitch will run when domain tables contain updates to data that is used by Stitch for identity resolution. Stitch will not run when updates are not present. To force Stitch to run, change the following setting from false to true:

:amperity.stitch.settings/force? false

Ignore jitter

Jitter tracks changes to Amperity IDs across Stitch runs. You can disable tracking these changes in certain situations, for example if you are using a tenant for training purposes or if you are loading a very large and complex set of data to an established tenant. In these situations a high jitter rate is common and expected.

You should leave this setting as false in your production tenant as often as possible. To ignore jitter, change the following setting from false to true:

:amperity.stitch.settings/ignore-jitter? false

Matching strategy

Amperity is configured by default to prioritize foreign key matching over separation key unmatching.

The matching strategy classifier tells Stitch how to apply the results of the blocking strategies, including which groups to analyze and the order in which that analysis should take place, when foreign keys and separation keys are present.

The default behavior prioritizes foreign keys over separation keys.

Foreign key matching

When foreign key matching is the priority, Amperity scores record pairs in the following order:

  1. Does the record pair contain identical foreign key values?

  2. If true, assign score 5.0. Stop.

  3. If false, does the record pair contain conflicting separation key values?

  4. If true, assign score 0.0. Stop.

  5. If false, use pairwise comparison scoring.

Foreign key matching priority.

Separation key unmatching

A separation key (sk) is used for deterministic unmatching of records.

By default, Amperity derives separation keys for sk-given-name and sk-generational-suffix. You may configure Amperity to prioritize separation keys over foreign keys.

When separation key unmatching is the priority, Amperity scores record pairs in the following order:

  1. Does the record pair contain conflicting separation key values?

  2. If true, assign score 0.0. Stop.

  3. If false, does the record pair contain identical foreign key values?

  4. If true, assign score 5.0. Stop.

  5. If false, use pairwise comparison scoring.

Separation key matching priority.
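The two scoring orders can be sketched in Python as one function that changes only the order of its checks. This is a minimal illustration, not Amperity's implementation: the record layout (dictionaries with "fk" and "sk" maps), the helper names, the foreign key name fk-customer, and the stand-in pairwise scorer are all assumptions.

```python
def has_identical_fk(left, right):
    """True when both records share a foreign key with the same value."""
    return any(
        name in right["fk"] and right["fk"][name] == value
        for name, value in left["fk"].items()
    )


def has_conflicting_sk(left, right):
    """True when both records share a separation key with different values."""
    return any(
        name in right["sk"] and right["sk"][name] != value
        for name, value in left["sk"].items()
    )


def score_pair(left, right, classifier, pairwise):
    """Apply the deterministic checks in classifier order, then fall back."""
    fk_check = (has_identical_fk, 5.0)
    sk_check = (has_conflicting_sk, 0.0)
    if classifier == "general-ordinal-fk-priority":
        checks = [fk_check, sk_check]
    elif classifier == "general-ordinal-sk-priority":
        checks = [sk_check, fk_check]
    else:
        raise ValueError(f"unknown classifier: {classifier}")
    for check, score in checks:
        if check(left, right):
            return score  # the first check that applies stops the evaluation
    return pairwise(left, right)


# Hypothetical pair: same foreign key value, conflicting generational suffix.
a = {"fk": {"fk-customer": "C-100"}, "sk": {"sk-generational-suffix": "JR"}}
b = {"fk": {"fk-customer": "C-100"}, "sk": {"sk-generational-suffix": "SR"}}
stub_pairwise = lambda left, right: 2.0  # stand-in for pairwise comparison scoring

fk_first = score_pair(a, b, "general-ordinal-fk-priority", stub_pairwise)
sk_first = score_pair(a, b, "general-ordinal-sk-priority", stub_pairwise)
```

For this pair, the foreign key priority classifier returns 5.0 and the separation key priority classifier returns 0.0, which is why a change to the classifier should be validated carefully.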

To change the matching strategy classifier

Update the configuration setting for model selection from:

:stitch/classifier :general-ordinal-fk-priority

to:

:stitch/classifier :general-ordinal-sk-priority

Warning

This value should be changed only after careful consideration. If changed, be sure to validate these results carefully to ensure that any changes to pairwise comparison scoring had the desired outcome.

Matching thresholds

Setting the threshold is a key step in tuning Stitch results for Amperity. The recommended threshold setting is 3.

Based on the precision and recall rate that was observed in the initial Stitch results, the matching threshold can be tuned to achieve better results by changing the weakest match category that can be included in Stitch results.

In general, a lower threshold will lead to more matches and fuzzier matched pairs, whereas a higher threshold will lead to fewer matches and more precise matched pairs. For the default ordinal classifier, five different thresholds can be chosen, and they are defined as follows:

----------- ---------------
 Threshold   Weakest Match
----------- ---------------
 1.0         Weak
 2.0         Moderate
 3.0         High
 4.0         Excellent
 5.0         Exact
----------- ---------------

For each threshold, all matches that are equal to or stronger than the weakest match category are clustered together. For example, a threshold of 3.0 ensures that any pair that belongs to the High, Excellent, or Exact match category is stitched into a single cluster, assuming all pairs pass through the blocking phase.

The configuration setting for thresholds is:

:amperity.stitch.settings/threshold 3
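The threshold behaves like a filter over scored pairs. The following Python sketch shows that behavior; the function name pairs_to_cluster and the sample scores are hypothetical.

```python
def pairs_to_cluster(scored_pairs, threshold=3):
    """Keep pairs whose score is equal to or stronger than the threshold."""
    return [(left, right) for left, right, score in scored_pairs if score >= threshold]


# Hypothetical scored pairs: (record, record, score).
scored = [
    ("r-1", "r-2", 5.0),  # Exact
    ("r-2", "r-3", 3.0),  # High
    ("r-3", "r-4", 2.0),  # Moderate
]
kept_default = pairs_to_cluster(scored)               # threshold 3
kept_lower = pairs_to_cluster(scored, threshold=2)    # more, fuzzier matches
kept_higher = pairs_to_cluster(scored, threshold=5)   # fewer, more precise matches
```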

One-to-one Stitch

Stitch may be configured for one-to-one Stitch. This mode assigns an Amperity ID to each of your customers that are identified by a unique (and consistent) customer key that is input to Amperity. (This configuration option is sometimes referred to as “deterministic Stitch”.)

Important

If Stitch has already been run in standard mode and one-to-one mode is then enabled, there will be 100% jitter. Amperity IDs are not expected to be the same between standard and one-to-one Stitch runs.

Use the following steps to configure your tenant for one-to-one Stitch:

  1. Ensure that each table that is made available to Stitch applies the ck semantic tag to the field that contains the existing customer ID.

    Tip

    You may apply the ck semantic tag from a feed or from a custom domain table.

  2. Apply all other semantic tags – customer profile, foreign key, primary key, orders and items – to the correct fields in all of your data sources. These tags will have no effect when running one-to-one Stitch, but are required for Amp360 (customer profiles and transactions) and AmpIQ (segment insights and predictive modeling).

  3. To configure Amperity for one-to-one Stitch, open the Stitch tab, and then click Settings. In the list of settings, add the following configuration setting:

    :amperity.stitch.settings/one-to-one? true,
    
  4. Run Stitch.

    When the run is complete, each unique customer ID will be associated with an Amperity ID.

Outcome of one-to-one Stitch

The following table describes the changes you will see in your tenant after it is configured for one-to-one Stitch.

Tab

Changes

Stitch

The overview page will show a 0.0% deduplication rate. The Amperity ID will align to the total source IDs provided by the ck.

The Data Explorer disables the tabs for Cluster Graph and Pairwise Comparison. These tabs are not available when one-to-one Stitch mode is configured.

Note

The Amperity ID that is generated in one-to-one Stitch mode is based on customer keys (and not on stable clusters of customer records).

Customer 360

The Unified_Scores table is not generated.

The following fields are removed from the Unified_Coalesced table: component_id, is_supersized, rep_ds, rep_pk, and supersized_id.

Fields related to the bad-values blocklist are not available, including has_blv, blv_address, blv_email, blv_given_name, blv_phone, and blv_surname.

All standard tables will contain a ck field.

The Stitch QA database template is not needed.

Queries

The Stitch QA queries template is not needed.

Preprocessing profiles

Stitch can be configured to use non-default preprocessing profiles for the following use cases:

  1. Allow business email addresses

  2. Australian phone numbers

  3. Clean foreign keys

  4. Default

  5. Normalize gender

  6. Skip derived gender

  7. Prioritize gender in source data

  8. Prioritize derived gender

Important

Use a non-default preprocessing profile in specific situations only.

The default configuration setting for preprocessing profiles is:

:amperity.stitch.settings/pre-processing-profiles #{ default }

To allow business email addresses, use:

:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email }

To use Australian phone numbers, use:

:amperity.stitch.settings/pre-processing-profiles #{ :australian-customer }

To clean foreign keys (trim whitespace and update them to uppercase), use:

:amperity.stitch.settings/pre-processing-profiles #{ :clean-fk }

To configure gender processing preferences, use:

:amperity.stitch.settings/pre-processing-profiles #{ :normalize-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :skip-derive-gender }
:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-src-gender }

- or -

:amperity.stitch.settings/pre-processing-profiles #{ :prioritize-derived-gender }

To configure more than one preprocessing profile, use:

:amperity.stitch.settings/pre-processing-profiles #{ :allow-business-email :clean-fk }

Stable ID assignment

An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.

As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retains a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.

Stable ID assignment can be a resource-intensive process, in particular when:

  1. Adding data sources that contain large numbers of rows (100+ million rows) of customer records.

  2. Updating existing data sources with large numbers of rows on a periodic (monthly, quarterly, etc.) basis.

  3. Data contains a very large number of duplicate values, such as 400k+ instances of an email address that is associated to a common business process.

You can configure the stable ID assignment process in the following ways:

  1. Disable stable IDs

  2. Increase the number of partitions that are available to stable ID assignment

  3. Skip building the Unified_Changes tables (temporarily)

Disable stable IDs

You can disable stable ID assignment to improve the performance of Amperity during the initial configuration phase for your tenant. (Note that some tenants, because of the way they use Amperity, do not require stable ID assignment and may choose to configure this setting to be false.)

:amperity.stitch.settings/enable-stable-id? false,

Important

Be sure to reset enable-stable-id? to true before you start to analyze and validate production-quality Stitch results or to perform any level of Stitch QA.

Increase partitions

When large differences are present between clusters of records for the current and previous Stitch runs, you can increase the number of partitions that are used by Stitch during stable ID assignment. This situation can occur when data sources provide periodic updates, such as monthly, quarterly, etc., that contain a very large number of rows.

By default, stable ID assignment is based on the edges that exist between current and previous clusters of customer records, weighted by the number of shared primary keys. The shared primary keys are sorted in descending order, after which ties are broken by sorting the cluster IDs in ascending order. Edges are removed when higher-ranked edges are associated to identical clusters of customer records. Edges that survive this ranking are then used to map current cluster IDs to previous cluster IDs. Changes to stable ID assignment are captured in the Unified_Changes_Clusters and Unified_Changes_PKs tables.
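The ranking described above can be sketched in Python as a greedy matching over weighted edges. This is one reading of the description and not Amperity's implementation: the function name assign_stable_ids, the input layout (cluster ID mapped to a set of primary keys), and the interpretation that an edge is dropped once either of its clusters has been claimed by a higher-ranked edge are assumptions.

```python
from collections import Counter


def assign_stable_ids(previous, current):
    """Map current cluster IDs to previous cluster IDs.

    previous, current: {cluster_id: set of primary keys}.
    Assumes each primary key belongs to exactly one previous cluster.
    """
    prev_by_pk = {pk: cid for cid, pks in previous.items() for pk in pks}

    # Edge weight = number of primary keys shared by a current/previous pair.
    weights = Counter()
    for cur_id, pks in current.items():
        for pk in pks:
            if pk in prev_by_pk:
                weights[(cur_id, prev_by_pk[pk])] += 1

    # Shared primary keys descending, then cluster IDs ascending to break ties.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))

    mapping, used_prev = {}, set()
    for (cur_id, prev_id), _ in ranked:
        if cur_id in mapping or prev_id in used_prev:
            continue  # a higher-ranked edge already claimed this cluster
        mapping[cur_id] = prev_id
        used_prev.add(prev_id)
    return mapping


# Hypothetical clusters from two Stitch runs.
previous = {"A": {"pk1", "pk2", "pk3"}, "B": {"pk4"}}
current = {"x": {"pk1", "pk2"}, "y": {"pk3", "pk4", "pk5"}, "z": {"pk6"}}
mapping = assign_stable_ids(previous, current)
```

In this example, cluster x keeps the ID of A, cluster y keeps the ID of B because its edge to A ranks below the edge from x, and cluster z has no surviving edge, so it would receive a new ID. The edge table grows with the number of differences between runs, which is consistent with the memory pressure described for very large changes.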

In some cases, the differences between the current and previous clusters of customer records are very large, requiring access to a large amount of memory to complete the stable ID assignment process. In this type of situation Stitch processes may take a very long time, or even appear to be stuck; in some cases the Stitch job runs out of memory and needs to be re-run. In these situations, do the following:

  1. Look for unusual values, such as a large set of identical email addresses, that appear in the Unified_Coalesced table. This can sometimes be the cause of slow stable ID assignment. Mitigate the presence of these unusual values, and then run Stitch again.

  2. Increase the number of partitions that are available to Stitch during the stable ID assignment process. You can increase the value of stable-id-partition-count to a value between 2 and 10 to improve the performance of Stitch during stable ID assignment. This setting should be used temporarily, but for some tenants it may need to be left at a non-default value. For example:

    :amperity.stitch.settings/stable-id-partition-count 6
    

Skip unified changes

The Unified_Changes_Clusters and Unified_Changes_PKs tables may be resource intensive because they contain all of the changes to clusters and primary keys between the current and previous Stitch runs.

  • The Unified_Changes_Clusters table contains a history of changes to cluster graphs, relative to the previous Stitch run.

  • The Unified_Changes_PKs table contains a history of changes to primary keys, relative to the previous Stitch run.

The default configuration for Stitch always builds these tables, but you can configure Stitch to skip building them when the following setting is set to true:

:amperity.stitch.settings/skip-unified-changes? false,

Skip unified scores

The Unified_Scores table records all of the pairwise comparison scores and match categories for all groups of records, and then for each group of records all of the pairwise scores that are present between records within that group.

The Unified_Scores table may be resource intensive. The default configuration for Stitch always builds this table, but you can configure Stitch to skip building it when the following setting is set to true:

:amperity.stitch.settings/skip-scores-output? false,

Stitch reports

A Stitch report shows cluster graphs for individuals associated with the Amperity ID. You can configure the Stitch report to include or exclude specific Amperity IDs. Ensuring that certain Amperity IDs are included (or excluded) can help improve the quality of the Stitch report. The Amperity IDs that are included will appear first in the series of individuals shown when exploring Amperity IDs.

Stitch may be configured to include or exclude specific Amperity IDs in the Stitch report.

  • Use :amperity.stitch-report.example/exclude to define a list of Amperity IDs to be excluded from the report.

  • Use :amperity.stitch-report.example/include to define a list of Amperity IDs to be included in the report.

These settings are available from the Stitch tab. Click Configure, add the list of Amperity IDs to be included or excluded under Stitch report configuration, and then run Stitch.

For example:

{
  :amperity.stitch-report.example/exclude
  [
    "123a456b-789c-012d-345e-678fgh901ijk"
    "124a456b-789c-012d-345e-678fgh901ijk"
    "125a456b-789c-012d-345e-678fgh901ijk"
    "126a456b-789c-012d-345e-678fgh901ijk"
    ...
  ],
  :amperity.stitch-report.example/include
  [
    "723a456b-789c-012d-345e-678fgh901ijk"
    "724a456b-789c-012d-345e-678fgh901ijk"
    "725a456b-789c-012d-345e-678fgh901ijk"
    "726a456b-789c-012d-345e-678fgh901ijk"
    ...
  ],
}

Supersized clusters

A supersized cluster is a cluster of records that is discovered during the Stitch process that has more than 64 matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.

A supersized cluster is created when multiple transitive connections are present across clusters of records. For example, individuals named Mary Johnson and Jeffrey Johnson with the following records:

  1. Mary Johnson, maryjohnson@gmail.com, 50 1st Avenue, New York, NY, with 50 connected records.

  2. Jeffrey Johnson, jeffjohnson@gmail.com, 50 1st Avenue, New York, NY, with 25 connected records.

  3. Mary Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 17 connected records.

  4. Jeffrey Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 8 connected records.

These records block together in the following ways:

  • Records 1 and 3 block together on name and address.

  • Records 2 and 4 block together on name and address.

  • Records 3 and 4 block together on email.

All four groups of records transitively connect into a single connected cluster with a size of 100.

Amperity defines a supersized cluster as any cluster with 64 (or more) connections. To change this threshold to a higher or lower value, update the following Stitch configuration setting:

:amperity.stitch.settings/supersized-cluster-min-size 64
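The transitive connection in the example above can be reproduced with a small union-find sketch in Python. The function name connected_cluster_sizes is hypothetical, and the comparison uses "64 or more", following the name of the supersized-cluster-min-size setting.

```python
def connected_cluster_sizes(group_sizes, links):
    """Return the sizes of the connected clusters, largest first.

    group_sizes: {group: number of connected records}
    links: pairs of groups that block together
    """
    parent = {group: group for group in group_sizes}

    def find(group):
        while parent[group] != group:
            parent[group] = parent[parent[group]]  # path halving
            group = parent[group]
        return group

    for left, right in links:
        parent[find(left)] = find(right)

    totals = {}
    for group, size in group_sizes.items():
        root = find(group)
        totals[root] = totals.get(root, 0) + size
    return sorted(totals.values(), reverse=True)


# The four groups of records from the example above.
group_sizes = {1: 50, 2: 25, 3: 17, 4: 8}
links = [(1, 3), (2, 4), (3, 4)]  # name and address, name and address, email

SUPERSIZED_MIN_SIZE = 64
sizes = connected_cluster_sizes(group_sizes, links)
supersized = [size for size in sizes if size >= SUPERSIZED_MIN_SIZE]
```

All four groups collapse into one cluster of 100 records. Without the shared email address that links groups 3 and 4, the same data forms two clusters of 67 and 33 records.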

Trivial duplicates

A trivial duplicate is a set of nearly-identical records that share enough matching PII to clearly identify a single unique individual. Trivial duplicates are identified by Stitch early in the identity resolution process. Only one of these records is passed downstream for additional Stitch processing; the other records – the trivial duplicates – are not.

The following table represents a set of records that contain enough matching PII to associate all of these records to a single unique individual:

------- ----------- ---------- ----------------- ------ ------------------------
 pk      FirstName   LastName   Address           ...    Email
------- ----------- ---------- ----------------- ------ ------------------------
 r-1     Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-4     Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-7     Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-12    Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-78    Justin      Currie     123 West Elm St   ...    doi210898r09@appco.com
 r-87    Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-204   Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-377   Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-589   Justin      Currie     123 West Elm St   ...    ffjo19290190@gmail.com
 r-829   Justin      Currie     123 West Elm St   ...    vnbb11095412@gmail.com
 r-911   Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------

However, not all of the PII is identical. The email addresses do not match across all of the records. The trivial duplication process will collapse the identical records down, leaving the following set of records:

------- ----------- ---------- ----------------- ------ ------------------------
 pk      FirstName   LastName   Address           ...    Email
------- ----------- ---------- ----------------- ------ ------------------------
 r-1     Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-4     Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-78    Justin      Currie     123 West Elm St   ...    doi210898r09@appco.com
 r-589   Justin      Currie     123 West Elm St   ...    ffjo19290190@gmail.com
 r-829   Justin      Currie     123 West Elm St   ...    vnbb11095412@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------

The complete set of records (including trivial duplicates) will be available in the Unified_Coalesced table. The collapsed records will be available in the Unified_Preprocessed_Raw table.
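Collapsing identical records can be sketched in Python by grouping on the PII fields and keeping the first record in each group. This is an illustration only: the function name collapse_trivial_duplicates, the field names, and the sample records are hypothetical, and choosing the first record seen as the representative is an assumption.

```python
def collapse_trivial_duplicates(records, pii_fields):
    """Keep one representative per set of records with identical PII."""
    representatives = {}
    for record in records:
        signature = tuple(record.get(field) for field in pii_fields)
        representatives.setdefault(signature, record)  # keep the first record seen
    return list(representatives.values())


# Hypothetical records: the first and third are identical except for pk.
records = [
    {"pk": "a-1", "given-name": "Pat", "surname": "Lee", "email": "pat@example.com"},
    {"pk": "a-2", "given-name": "Pat", "surname": "Lee", "email": "plee@example.com"},
    {"pk": "a-3", "given-name": "Pat", "surname": "Lee", "email": "pat@example.com"},
]
collapsed = collapse_trivial_duplicates(records, ["given-name", "surname", "email"])
collapsed_pks = [record["pk"] for record in collapsed]
```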

A qualified trivial duplicate is a set of records that have enough matching PII to score 3.0 (or greater) and that were grouped together.

Qualified trivial duplicates are treated as a single record by downstream Stitch processes and are truncated from the Unified_Preprocessed_Raw table.

What are the rep_ds and rep_pk columns?

Use the rep_pk and rep_ds columns in the Unified_Coalesced table to help with situations where it’s necessary to understand why two records were not clustered together.

The rep_pk column is an identifier that represents the first grouping of records done by Stitch. This grouping is based on identical semantic patterns.

The rep_ds column shows the data source that is associated with the rep_pk column.

The combination of rep_ds and rep_pk represents qualified trivial duplicates that were discovered by Stitch early in the identity resolution process.

The configuration setting for trivial duplicates is:

:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10

An increase to this value will decrease the likelihood that multiple records with trivial differences are collapsible into a single record for the identity resolution process. This setting should only be tuned after understanding the quality of downstream data, including stitched records.

Semantic exclusions

Semantic exclusions for trivial duplicates may be specified. When excluded, for the purpose of identifying clusters of records, the values associated with that semantic will be ignored. This should only be done for limited scenarios where certain types of data are known to be of lower quality.

To define an exclusion, add the following configuration setting for Stitch:

:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"semantic_name"}

where semantic_name is the name of a semantic, such as email. (This value is nil by default.)

The following table represents a set of records that contain enough PII to identify a single unique individual:

------- ----------- ---------- ----------------- ------ ------------------------
 pk      FirstName   LastName   Address           ...    Email
------- ----------- ---------- ----------------- ------ ------------------------
 r-1     Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-4     Justin      Currie     123 West Elm St   ...    vnbb1109fe12@gmail.com
 r-7     Justin      Currie     123 West Elm St   ...    vnbb11ss0912@gmail.com
 r-12    Justin      Currie     123 West Elm St   ...    fio190912380@cgg.com
 r-78    Justin      Currie     123 West Elm St   ...    doi210898r09@appco.com
 r-87    Justin      Currie     123 West Elm St   ...    vnbb41090912@gmail.com
 r-204   Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-377   Justin      Currie     123 West Elm St   ...    vnbb12290912@gmail.com
 r-589   Justin      Currie     123 West Elm St   ...    dfjo19290190@gmail.com
 r-829   Justin      Currie     123 West Elm St   ...    vnbb11095412@gmail.com
 r-911   Justin      Currie     123 West Elm St   ...    vnbb11450912@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------

All of the information is identical, except for variations in email addresses that cause each record to be unique. Because the email addresses are not identical, Amperity will not collapse them into a single record for identity resolution purposes even when it is obvious that all of these records are the same individual.

The maximum allowed number of records with trivial duplicates is set to 10 by default. When the number of records with trivial duplicates is greater than this value, each individual record will be treated as a single record. Since the previous example shows eleven records for Justin Currie, each with a unique value for the email column, eleven individual records would be created.

Semantic exclusions define a threshold over which records like these will be collapsed into a trivial duplicate. For example, let’s say you have defined a semantic exclusion for email and left the size threshold at 10.

For each unique combination of PII (excluding email addresses), the distinct email addresses that are associated with that unique combination of PII are counted. If there are more than 10 distinct email addresses, those records are collapsed into a trivial duplicate.
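The counting rule described here can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not how Stitch implements it: the function name `find_collapsible_groups`, the record layout, and the lowercase field names are assumptions made for the example. The data mirrors the eleven Justin Currie records shown earlier.

```python
from collections import defaultdict

def find_collapsible_groups(records, excluded_semantics, size_threshold):
    """Hypothetical helper: group records on every field except the primary
    key and the excluded semantics, then return the primary keys of each
    group whose count of distinct excluded values exceeds the threshold."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(sorted(
            (field, value) for field, value in record.items()
            if field != "pk" and field not in excluded_semantics
        ))
        groups[key].append(record)

    collapsible = []
    for members in groups.values():
        # Count distinct values of the excluded semantics within the group.
        distinct = {
            tuple(member.get(s) for s in sorted(excluded_semantics))
            for member in members
        }
        if len(distinct) > size_threshold:  # strictly greater than
            collapsible.append([member["pk"] for member in members])
    return collapsible

# The eleven records from the table above (field names are assumed).
emails = [
    ("r-1", "ack192390190@gmail.com"), ("r-4", "vnbb1109fe12@gmail.com"),
    ("r-7", "vnbb11ss0912@gmail.com"), ("r-12", "fio190912380@cgg.com"),
    ("r-78", "doi210898r09@appco.com"), ("r-87", "vnbb41090912@gmail.com"),
    ("r-204", "vnbb11090912@gmail.com"), ("r-377", "vnbb12290912@gmail.com"),
    ("r-589", "dfjo19290190@gmail.com"), ("r-829", "vnbb11095412@gmail.com"),
    ("r-911", "vnbb11450912@gmail.com"),
]
records = [
    {"pk": pk, "given_name": "Justin", "surname": "Currie",
     "address": "123 West Elm St", "email": email}
    for pk, email in emails
]

collapsible = find_collapsible_groups(records, {"email"}, 10)
print(len(collapsible[0]))  # 11 distinct emails exceed the threshold of 10
```

With only ten of these records the group has exactly 10 distinct email addresses, which is not more than the threshold, so the sketch returns no collapsible groups.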

Warning

Semantic exclusions should be applied very carefully. Use a sandbox to configure and apply a semantic exclusion, and then carefully review and validate that all downstream processes are not adversely affected by the change prior to applying a semantic exclusion to a production environment.

Example: Email addresses

A long-running promotion for a free food item results in a large number of email addresses associated with the same first name, last name, and phone number. This results in a large number of nearly identical records, each with a unique email address. You can use semantic exclusions to define a threshold over which records like these are collapsed into a trivial duplicate.

Configure Stitch to define a semantic exclusion for email addresses:

:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"email"}

and then define the threshold:

:amperity.stitch.settings/soft-trivial-dupe-size-threshold 25

For each unique combination of PII (excluding email addresses), the distinct email addresses that are associated with that unique combination of PII are counted. If there are more than 25 distinct email addresses, those records are collapsed into a trivial duplicate.
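To see how the choice of threshold changes the outcome, the following sketch compares thresholds of 10 and 25 against made-up promotion data. Everything here is hypothetical and exists only for illustration: the helper names, the second customer, the phone numbers, and the generated email addresses. It models the rule as described above and does not reflect Stitch internals.

```python
from collections import defaultdict

def distinct_emails_per_identity(records):
    """Map each (first name, last name, phone) combination to its set of
    distinct email addresses."""
    emails = defaultdict(set)
    for r in records:
        emails[(r["first"], r["last"], r["phone"])].add(r["email"])
    return emails

def collapsed_identities(records, size_threshold):
    """Return the combinations whose distinct email count exceeds the threshold."""
    counts = distinct_emails_per_identity(records)
    return sorted(
        identity for identity, emails in counts.items()
        if len(emails) > size_threshold
    )

def make_records(first, last, phone, count):
    """Generate `count` records that differ only in email address."""
    return [
        {"first": first, "last": last, "phone": phone,
         "email": f"promo{i}@example.com"}
        for i in range(count)
    ]

# Hypothetical data: one customer with 30 promotion emails, another with 12.
records = (
    make_records("Justin", "Currie", "555-0100", 30)
    + make_records("Dana", "Lee", "555-0101", 12)
)

at_10 = collapsed_identities(records, 10)  # both combinations exceed 10
at_25 = collapsed_identities(records, 25)  # only the 30-email combination exceeds 25
print(len(at_10), len(at_25))  # 2 1
```

Raising the threshold from 10 to 25 leaves the 12-email combination uncollapsed, which matches the earlier statement that increasing the value makes collapsing less likely. Running a comparison like this against sandbox data is one way to gauge how many records a candidate threshold would affect before changing the setting.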