Configure Stitch

Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.

Stitch configuration settings

Stitch configuration is defined by a list of settings in the Settings dialog box.

Caution

For most situations, there is no reason to change these settings. In some cases, after consultation with your Amperity representative, tuning Stitch configuration settings may be helpful.

To edit Stitch configuration settings

  1. From the Stitch tab, click Settings.

  2. In the Stitch Settings dialog box, under Stitch configuration, review the configuration settings.

  3. Make your changes.

    Note

    Stitch settings should be configured using a sandbox. Verify Stitch results, Stitch QA results, the values in all standard database tables, and the behavior of downstream workflows within Amperity to ensure that changes to Stitch settings have the desired effects.

  4. Click Save.

  5. From the Stitch tab, click Run. Carefully review the results to ensure that your changes had the desired effect.

Configuration options

The types of configuration changes some organizations make include:

Warning

Stitch configuration is represented as a block of Clojure code exposed via the Amperity UI:

{:amperity.stitch.settings/blocking-strategies
 #{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk},
 :amperity.stitch.settings/classifier :general-ordinal-fk-priority,
 :amperity.stitch.settings/clustering-algorithm :hierarchical,
 :amperity.stitch.settings/enable-low-cardinality-alerts? false,
 :amperity.stitch.settings/enable-stable-id? true,
 :amperity.stitch.settings/harvest-feature-profiles? false,
 :amperity.stitch.settings/ignore-jitter? true,
 :amperity.stitch.settings/metrics-partition-period 100,
 :amperity.stitch.settings/output-partitions 128,
 :amperity.stitch.settings/parallelism 2,
 :amperity.stitch.settings/pre-processing-profile default,
 :amperity.stitch.settings/samples-per-feature-signature 3,
 :amperity.stitch.settings/skip-scores-output? false,
 :amperity.stitch.settings/skip-unified-changes? false,
 :amperity.stitch.settings/soft-trivial-dupe-size-threshold 10,
 :amperity.stitch.settings/stable-id-partition-count 1,
 :amperity.stitch.settings/supersized-cluster-min-size 500,
 :amperity.stitch.settings/supersized-partition-max-depth 4,
 :amperity.stitch.settings/threshold 3,
 :amperity.stitch.settings/unified-changes-recorded-days 30,
 :amperity.stitch.settings/use-uuid-key-ranges? true,
 :amperity.stitch.settings/write-clustering-table? false}

Many of these settings are configurable and are described in more detail below.

One-to-one Stitch

Stitch may be configured for one-to-one Stitch. This mode assigns an Amperity ID to each customer who is identified by a unique (and consistent) customer key that is input to Amperity. (This configuration option is sometimes referred to as “deterministic Stitch”.)

Important

If Stitch has already run in standard mode and one-to-one mode is then enabled, there will be 100% jitter. Amperity IDs are not expected to be the same between standard and one-to-one Stitch runs.

Use the following steps to configure your tenant for one-to-one Stitch:

  1. Ensure that each table that is made available to Stitch applies the ck semantic tag to the field that contains the existing customer ID.

    Tip

    You may apply the ck semantic tag from a feed or from a custom domain table.

  2. Apply all other semantic tags (customer profile, foreign key, primary key, orders, and items) to the correct fields in all of your data sources. These tags have no effect when running one-to-one Stitch, but are required by Amp360 (customer profiles and transactions) and AmpIQ (segment insights and predictive modeling).

  3. To configure Amperity for one-to-one Stitch, open the Stitch tab, and then click Settings. In the list of settings, add the following configuration setting:

    :amperity.stitch.settings/one-to-one? true,
    
  4. Run Stitch.

    When the run is complete, each unique customer ID will be associated with an Amperity ID.

Outcome of one-to-one Stitch

The following table describes the changes you will see in your tenant after it is configured for one-to-one Stitch.

Tab

Changes

Stitch

The overview page will show a 0.0% deduplication rate. The number of Amperity IDs will align with the total number of source IDs provided by the ck semantic tag.

The Data Explorer disables the tabs for Cluster Graph and Pairwise Comparison. These tabs are not available when one-to-one Stitch mode is configured.

Note

The Amperity ID that is generated in one-to-one Stitch mode is based on customer keys (and not on stable clusters of customer records).

Customer 360

The Unified_Scores table is not generated.

The following fields are removed from the Unified_Coalesced table: component_id, is_supersized, rep_ds, rep_pk, and supersized_id.

Fields related to the bad-values blocklist are not available, including has_blv, blv_address, blv_email, blv_given_name, blv_phone, and blv_surname.

All standard tables will contain a ck field.

The Stitch QA database template is not needed.

Queries

The Stitch QA queries template is not needed.

Blocking strategy

Blocking is a non-trivial step for record linking in the Stitch process. An overly generous blocking strategy may result in a high recall rate (too many pairs being evaluated) along with negative system performance. An overly conservative blocking strategy may result in a low recall rate (too few pairs being evaluated). Individual blocking keys may be conservative or generous. The combination of blocking keys is what creates the ideal recall rate without compromising the performance of Amperity.

The default blocking strategy:

:amperity.stitch.settings/blocking-strategies #{:dnf1 :dnf3 :dnf4 :dnf5 :dnf6 :dnf7 :dnf8 :email :fk}

Note

The order in which blocking strategies are listed does not matter. For example:

:dnf1 :dnf3 :dnf4 :dnf5

and:

:dnf1 :dnf4 :dnf5 :dnf3

will be processed in the same way and will return the same results.

In nearly all cases, the default blocking strategy should provide a reasonable recall rate. Each individual blocking strategy looks at various combinations of PII data:

Strategy

Key

:address

Non-default. This blocking strategy groups values associated with the full address using the address and address2 semantics.

:company

Non-default. This blocking strategy groups values associated with the company semantic.

:dnf1

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first character in surname, and birthdate.

:dnf2

Non-default, use carefully. This blocking strategy groups values associated with the following semantics: the full given-name and email.

:dnf3

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and postal.

:dnf4

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and address.

:dnf5

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and phone.

:dnf6

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and the local part of an email address in email.

:dnf7

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name and company.

:dnf8

Default. This blocking strategy groups values associated with the following semantics: the first three characters in given-name, the first three characters in surname, and PO box values that are derived from address.

:email

Default. This blocking strategy groups values associated with the following semantics: the full email address in email.

:fk

Default. This blocking strategy groups values associated with foreign keys.

:login-partial

Non-default. This blocking strategy groups values associated with cleaned email addresses derived from email. This is used for low-threshold email address matching.

:login-trimmed

Non-default. This blocking strategy groups values associated with the first five characters of an email address derived from email. This is used for low-threshold email address matching.

:name

Non-default, use carefully. This blocking strategy groups values associated with the following semantics: given-name and surname. The order of given-name and surname is sorted lexicographically. The blocking key for JOHN SMITH and SMITH JOHN is JOHN:SMITH.

:phone

Non-default. This blocking strategy groups values associated with the phone semantic.
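The following Python sketch illustrates how blocking keys of this kind could be derived and used to group records. It is not Amperity's implementation. The function names, the uppercase normalization, and the ":" separator for :dnf3 are assumptions made for illustration; only the JOHN:SMITH key for the :name strategy comes from the description above.

```python
from collections import defaultdict


def name_key(given_name: str, surname: str) -> str:
    """:name strategy: sort both names so that word order does not matter."""
    parts = sorted([given_name.strip().upper(), surname.strip().upper()])
    return ":".join(parts)


def dnf3_key(given_name: str, surname: str, postal: str) -> str:
    """:dnf3 strategy: first three characters of given-name and surname, plus postal.

    The ":" separator and uppercase normalization are assumptions.
    """
    return ":".join([
        given_name.strip().upper()[:3],
        surname.strip().upper()[:3],
        postal.strip(),
    ])


def block_records(records, key_fn):
    """Group record IDs by blocking key; only records sharing a key form pairs."""
    blocks = defaultdict(list)
    for record_id, fields in records.items():
        blocks[key_fn(**fields)].append(record_id)
    return dict(blocks)


# Hypothetical records, for illustration only.
records = {
    "r-1": {"given_name": "John", "surname": "Smith", "postal": "98101"},
    "r-2": {"given_name": "Johnny", "surname": "Smithson", "postal": "98101"},
    "r-3": {"given_name": "Jane", "surname": "Smith", "postal": "98101"},
}
dnf3_blocks = block_records(records, dnf3_key)
```

In this sketch, r-1 and r-2 share a :dnf3 key and would be compared as a pair, while r-3 lands in its own block. A more generous key (fewer characters) produces larger blocks and more pairs to evaluate.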

Case-sensitive foreign keys

Values associated with foreign keys are case-insensitive by default. To configure values for particular foreign keys to be case-sensitive, add the following configuration setting to Stitch:

:amperity.stitch.settings/case-sensitive-fks #{"fk-name"}

where fk-name is the name of the foreign key for which values will be treated as case-sensitive.

Clustering algorithm

Warning

Use :hierarchical clustering unless instructed differently.

The configuration setting for the clustering algorithm is:

:amperity.stitch.settings/clustering-algorithm :hierarchical

The default value for the Stitch clustering algorithm is :hierarchical, which applies hierarchical clustering. This value should not be changed without careful consideration. The other configuration value is :nil, which uses connected components directly.

Email address patterns

Many email addresses are not useful for identity resolution. Some are generic, such as info@some-domain.com, and are often associated with a place of business and are never associated with a unique individual. Others are bogus, such as 123@some-domain.com: fake values that were entered only because an email address was required.

The following values associated with the email semantic are ignored by Stitch when performing identity resolution:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 0000000000

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • ADMIN@

  • BOOKING@

  • CLIENT@

  • CLIENTS@

  • CONFIRMATION@

  • CONFIRMATIONS@

  • CONTACT@

  • CUSTOMERSERVICE

  • CUSTOMERSERVICE@

  • CUSTOMERSERVICES

  • CUSTOMERSERVICES@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • EMAIL@

  • @EMAIL.TST

  • EXAMPLE@

  • FAKENAME@

  • GUEST@

  • GUESTS@

  • HELP@

  • HELPS@

  • HOTELHELP@

  • HOTELPARTNER@

  • HOTELPARTNERS@

  • INFO@

  • JUNK@

  • MAIL@

  • ME@

  • N@A

  • NAME@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • ONLINERESERVATION

  • ONLINERESERVATION@

  • ONLINERESERVATIONS

  • ONLINERESERVATIONS@

  • OPERATION@

  • OPERATIONS@

  • QUERIES@

  • QUERY@

  • REFUSED@

  • RES@

  • RESERVAS

  • RESERVATION@

  • RESERVATIONS@

  • ROOMRESERVATION@

  • ROOMRESERVATIONS@

  • SAMPLE@

  • SAMPLES@

  • SERVICE@

  • SHOP@

  • TEST@

  • TESTING@

  • TESTEMAIL@

  • TRAVEL@

  • TRAVELS

  • VENDOR@

  • VENDORS@

  • XXX@

The values in bold are always ignored.

Stitch may be configured to allow certain generic email addresses to be available to Stitch as part of identity resolution when the pre-processing-profile configuration setting is set to:

pre-processing-profile :allow-business-email

When this setting is updated, only the following email address patterns are ignored by Stitch:

  • @NOEMAIL.COM

  • @NOMAIL.COM

  • 123@

  • 1234@

  • 99@

  • ABC@

  • ABC123@

  • DECLINE@

  • DECLINED@

  • DENIED@

  • FAKENAME@

  • JUNK@

  • NO@

  • NOEMAIL@

  • NOMAIL@

  • NONE@

  • NONENONE@

  • NOREPLY@

  • NOTHANKS@

  • NOTHANKYOU@

  • REFUSED@

  • XXX@

Use a bad-values blocklist to configure Amperity to continue ignoring any of the email address patterns that were removed from the default list of ignored email patterns.

Empty tables

You can configure Stitch to accept empty tables. For example, some CCPA and GDPR workflows run daily, but may not contain data on a given day, such as when zero customers make a data subject access request (DSAR).

Use the following setting to configure Stitch to ignore empty tables:

:amperity.stitch.settings/allowed-empty-tables #{"Table:Name"}

For example:

:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests"}

Stitch may be configured to ignore more than one empty table:

:amperity.stitch.settings/allowed-empty-tables #{"CCPA:ViewRequests", "CCPA:DeleteRequests"}

Matching strategy

Amperity is configured by default to prioritize foreign key matching over separation key unmatching.

The matching strategy classifier tells Amperity how to apply the results of the blocking strategies, including which groups to analyze and the order in which that analysis should take place, when foreign keys and separation keys are present.

The default behavior prioritizes foreign keys over separation keys.

Foreign key matching

When foreign key matching is the priority, Amperity scores record pairs in the following order:

  1. Does the record pair contain identical foreign key values?

  2. If true, assign score 5.0. Stop.

  3. If false, does the record pair contain conflicting separation key values?

  4. If true, assign score 0.0. Stop.

  5. If false, use pairwise comparison scoring.

Figure: Foreign key matching priority.

Separation key unmatching

A separation key (sk) is used for deterministic unmatching of records.

By default, Amperity derives separation keys for sk-given-name and sk-generational-suffix. You may configure Amperity to prioritize separation keys over foreign keys.

When separation key unmatching is the priority, Amperity scores record pairs in the following order:

  1. Does the record pair contain conflicting separation key values?

  2. If true, assign score 0.0. Stop.

  3. If false, does the record pair contain identical foreign key values?

  4. If true, assign score 5.0. Stop.

  5. If false, use pairwise comparison scoring.

Figure: Separation key matching priority.
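The two scoring orders described above differ only in which deterministic check runs first. The Python sketch below makes that difference concrete. It is an illustration, not Amperity's code: the function name, the boolean inputs, and the pairwise_score callable are hypothetical, while the scores 5.0 and 0.0 and the two classifier names come from this page.

```python
from typing import Callable

FK_PRIORITY = "general-ordinal-fk-priority"
SK_PRIORITY = "general-ordinal-sk-priority"


def score_pair(same_fk: bool, sk_conflict: bool, classifier: str,
               pairwise_score: Callable[[], float]) -> float:
    """Score one record pair using the ordering selected by the classifier."""
    fk_check = (same_fk, 5.0)      # identical foreign key values
    sk_check = (sk_conflict, 0.0)  # conflicting separation key values
    if classifier == FK_PRIORITY:
        checks = [fk_check, sk_check]
    elif classifier == SK_PRIORITY:
        checks = [sk_check, fk_check]
    else:
        raise ValueError(f"unknown classifier: {classifier}")
    for condition, score in checks:
        if condition:
            return score  # the first check that applies wins; stop here
    # Neither deterministic rule applied, so fall back to pairwise scoring.
    return pairwise_score()
```

The classifier choice only changes the outcome for pairs that have both an identical foreign key and a conflicting separation key: such a pair scores 5.0 under foreign key priority and 0.0 under separation key priority.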

To change the matching strategy classifier

Update the configuration setting for model selection from:

:amperity.stitch.settings/classifier :general-ordinal-fk-priority

to:

:amperity.stitch.settings/classifier :general-ordinal-sk-priority

Warning

This value should be changed only after careful consideration. If changed, be sure to validate these results carefully to ensure that any changes to pairwise comparison scoring had the desired outcome.

Matching thresholds

Setting the threshold is a key step in tuning Stitch results for Amperity. The recommended threshold setting is 3.

Based on the precision and recall rates observed in the initial Stitch results, you can tune the Stitch threshold to achieve better results by setting the weakest match category to be included in Stitch results.

In general, a lower threshold will lead to more matches and fuzzier matched pairs, whereas a higher threshold will lead to fewer matches and more precise matched pairs. For the default ordinal classifier, five different thresholds can be chosen, and they are defined as follows:

----------- ---------------
 Threshold   Weakest Match
----------- ---------------
 1.0         Weak
 2.0         Moderate
 3.0         High
 4.0         Excellent
 5.0         Exact
----------- ---------------

For each threshold, all matches that are equal to or stronger than the weakest match type are clustered together.

The configuration setting for thresholds is:

:amperity.stitch.settings/threshold 3

For example, a threshold of 3 ensures that any pair that belongs to the High, Excellent, or Exact match types is stitched into a single cluster, assuming all pairs pass through the blocking phase.
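As a rough illustration of how a threshold selects match types, the following Python sketch lists the types that are at or above a given threshold. The names MATCH_TYPES, included_match_types, and is_stitched are hypothetical; the scores and labels are taken from the threshold table on this page.

```python
from typing import List

# Threshold values and the weakest match type each one admits.
MATCH_TYPES = {1.0: "Weak", 2.0: "Moderate", 3.0: "High", 4.0: "Excellent", 5.0: "Exact"}


def included_match_types(threshold: float) -> List[str]:
    """Return the match types that are equal to or stronger than the threshold."""
    return [label for score, label in sorted(MATCH_TYPES.items()) if score >= threshold]


def is_stitched(pair_score: float, threshold: float = 3.0) -> bool:
    """A pair is eligible for clustering when its score meets the threshold."""
    return pair_score >= threshold
```

Lowering the threshold to 2 would add Moderate matches to the clusters, and raising it to 4 would drop High matches.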

Preprocessing profiles

Stitch can be configured to use one of the following preprocessing profiles:

  1. Allow business email addresses

  2. Australian phone numbers

  3. Clean foreign keys

  4. Default

Important

Use a non-default preprocessing profile in specific situations only.

The configuration setting for preprocessing profiles is:

:amperity.stitch.settings/pre-processing-profile default,

To allow business email addresses, change this setting to:

:amperity.stitch.settings/pre-processing-profile allow-business-email,

Record history days

This setting defines how many days of changes are recorded in the Unified_Changes table. The default is thirty days.

:amperity.stitch.settings/unified-changes-recorded-days 30,

Note

Changing this setting will not recreate history that has already been dropped.

Stable ID assignment

An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.

As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retains a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.

Stable ID assignment can be a resource-intensive process, in particular when:

  1. Adding data sources that contain large numbers of rows (100+ million rows) of customer records.

  2. Updating existing data sources with large numbers of rows on a periodic (monthly, quarterly, etc.) basis.

  3. Data contains a very large number of duplicate values, such as 400k+ instances of an email address that is associated to a common business process.

You can configure the stable ID assignment process in the following ways:

  1. Disable stable IDs

  2. Increase the number of partitions that are available to stable ID assignment

  3. Skip building the Unified_Changes table (temporarily)

Disable stable IDs

You can disable stable ID assignment to improve the performance of Amperity during the initial configuration phase for your tenant. (Note that some tenants, because of the way they use Amperity, do not require stable ID assignment and may choose to configure this setting to be false.)

:amperity.stitch.settings/enable-stable-id? false,

Important

Be sure to reset enable-stable-id? to true before you start to analyze and validate production-quality Stitch results or to perform any level of Stitch QA.

Increase partitions

When large differences are present between clusters of records for the current and previous Stitch runs, you can increase the number of partitions that are used by Stitch during stable ID assignment. This situation can occur when data sources provide periodic updates, such as monthly or quarterly, that contain a very large number of rows.

By default, stable ID assignment is based on the edges that exist between current and previous clusters of customer records, weighted by the number of shared primary keys. The shared primary keys are sorted in descending order, after which ties are broken by sorting the cluster IDs in ascending order. Edges are removed when higher-ranked edges are associated to identical clusters of customer records. Edges that survive this ranking are then used to map current cluster IDs to previous cluster IDs. Changes to stable ID assignment are captured in the Unified_Changes_Clusters and Unified_Changes_PKs tables.
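The edge-ranking procedure described above can be sketched in Python as a greedy assignment. This is an interpretation of the paragraph, not Amperity's implementation: the function name, the dictionary-based inputs, and the reading of "edges are removed" as "skip an edge when either of its clusters is already matched by a higher-ranked edge" are all assumptions.

```python
from collections import Counter


def map_stable_ids(previous, current):
    """Map current cluster IDs to previous cluster IDs.

    previous, current: dict of cluster ID -> set of primary keys.
    """
    pk_to_previous = {pk: cid for cid, pks in previous.items() for pk in pks}

    # Weight each (current, previous) edge by the number of shared primary keys.
    weights = Counter()
    for current_id, pks in current.items():
        for pk in pks:
            previous_id = pk_to_previous.get(pk)
            if previous_id is not None:
                weights[(current_id, previous_id)] += 1

    # Most shared keys first; ties broken by cluster IDs in ascending order.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0][0], item[0][1]))

    mapping, used_previous = {}, set()
    for (current_id, previous_id), _shared in ranked:
        if current_id in mapping or previous_id in used_previous:
            continue  # a higher-ranked edge already claimed one of these clusters
        mapping[current_id] = previous_id
        used_previous.add(previous_id)
    return mapping


# Hypothetical clusters, for illustration only.
previous = {"A": {1, 2, 3}, "B": {4, 5}}
current = {"x": {1, 2, 4}, "y": {3, 5, 6}, "z": {7}}
mapping = map_stable_ids(previous, current)
```

In this example, cluster x keeps the ID of A (two shared keys), y falls through to B, and z shares no keys with any previous cluster, so it would receive a new ID. The number of candidate edges grows with the amount of change between runs, which is why large periodic updates make this step expensive.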

In some cases, the differences between the current and previous clusters of customer records are very large, requiring access to a large amount of memory to complete the stable ID assignment process. In this type of situation, Stitch processes may take a very long time or even appear to be stuck. In some cases, the Stitch job runs out of memory and must be re-run. In these situations, do the following:

  1. Look for unusual values, such as a large set of identical email addresses, that appear in the Unified_Coalesced table. This can sometimes be the cause of slow stable ID assignment. Mitigate the presence of these unusual values, and then run Stitch again.

  2. Increase the number of partitions that are available to Stitch during the stable ID assignment process. You can increase the value of stable-id-partition-count to a value between 2 and 10 to improve the performance of Stitch during stable ID assignment. This setting should be used temporarily, but for some tenants it may need to be left at a non-default value. For example:

    :amperity.stitch.settings/stable-id-partition-count 6
    

Skip unified changes

The Unified_Changes_Clusters and Unified_Changes_PKs tables may be resource intensive because they contain all of the changes to clusters and primary keys between the current and previous Stitch runs.

  • The Unified_Changes_Clusters table contains a history of changes to cluster graphs, relative to the previous Stitch run.

  • The Unified_Changes_PKs table contains a history of changes to primary keys, relative to the previous Stitch run.

The default configuration for Stitch always builds these tables, but you can configure Stitch to skip building them by changing the following setting from false to true:

:amperity.stitch.settings/skip-unified-changes? true,

Stitch reports

A Stitch report shows cluster graphs for individuals associated with the Amperity ID. You can configure the Stitch report to include or exclude specific Amperity IDs. Ensuring that certain Amperity IDs are included (or excluded) can help improve the quality of the Stitch report. The Amperity IDs that are included will appear first in the series of individuals shown when exploring Amperity IDs.

The following settings control which Amperity IDs appear in the Stitch report:

  • Use :amperity.stitch-report.example/exclude to define a list of Amperity IDs to be excluded from the report.

  • Use :amperity.stitch-report.example/include to define a list of Amperity IDs to be included in the report.

These settings are available from the Stitch tab. Click Configure, add the list of Amperity IDs to be included or excluded under Stitch report configuration, and then run Stitch.

For example:

{
  :amperity.stitch-report.example/exclude
  [
    "123a456b-789c-012d-345e-678fgh901ijk"
    "124a456b-789c-012d-345e-678fgh901ijk"
    "125a456b-789c-012d-345e-678fgh901ijk"
    "126a456b-789c-012d-345e-678fgh901ijk"
    ...
  ],
  :amperity.stitch-report.example/include
  [
    "723a456b-789c-012d-345e-678fgh901ijk"
    "724a456b-789c-012d-345e-678fgh901ijk"
    "725a456b-789c-012d-345e-678fgh901ijk"
    "726a456b-789c-012d-345e-678fgh901ijk"
    ...
  ],
}

Supersized clusters

A supersized cluster is a cluster of records that is discovered during the Stitch process that has more than ~100 matching records. A cluster with more than ~100 records is more often a signal to abandon further analysis than a signal that the cluster is of interest.

A supersized cluster is created when multiple transitive connections are present. For example, a couple named Mary Johnson and Jeffrey Johnson with the following records:

  1. Mary Johnson, maryjohnson@gmail.com, 50 1st Avenue, New York, NY, with 50 connected records.

  2. Jeffrey Johnson, jeffjohnson@gmail.com, 50 1st Avenue, New York, NY, with 25 connected records.

  3. Mary Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 17 connected records.

  4. Jeffrey Johnson, mjohnson50@gmail.com, 50 1st Avenue, New York, NY, with 8 connected records.

These records block together in the following ways:

  • Records 1 and 3 block together on name and address.

  • Records 2 and 4 block together on name and address.

  • Records 3 and 4 block together on email.

All four groups of records transitively connect into a single connected cluster with a size of 100.
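The transitive effect in this example can be reproduced with a small union-find (disjoint set) sketch in Python. This is a generic connected-components illustration, not Amperity's clustering code. The function name is hypothetical, and the record group numbers, sizes, and links are taken from the example above.

```python
def connected_cluster_sizes(sizes, links):
    """Merge linked record groups and return the size of each connected cluster.

    sizes: dict of group number -> number of connected records in that group.
    links: pairs of group numbers that block together.
    """
    parent = {group: group for group in sizes}

    def find(group):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[group] != group:
            parent[group] = parent[parent[group]]
            group = parent[group]
        return group

    for a, b in links:
        parent[find(a)] = find(b)  # union the two clusters

    totals = {}
    for group, size in sizes.items():
        root = find(group)
        totals[root] = totals.get(root, 0) + size
    return sorted(totals.values(), reverse=True)


sizes = {1: 50, 2: 25, 3: 17, 4: 8}
links = [(1, 3), (2, 4), (3, 4)]  # name+address, name+address, email
cluster_sizes = connected_cluster_sizes(sizes, links)
```

With all three links, the result is a single cluster of 100 records. Without the shared email link between records 3 and 4, the same data would form two clusters of 67 and 33 records, which shows how one shared value can join two otherwise separate individuals.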

Amperity defines a supersized cluster as any cluster with 64 (or more) connections. To change this threshold to a higher or lower value, update the following Stitch configuration setting:

:amperity.stitch.settings/supersized-cluster-min-size 64

Trivial duplicates

A trivial duplicate is a set of nearly-identical records that are identified by Stitch early in the identity resolution process. Only one of the nearly-identical records is passed to downstream Stitch processes.

For example, the following records are nearly identical:

------- ----------- ---------- ----------------- ------ ------------------------
 pk      FirstName   LastName   Address           ...    Email
------- ----------- ---------- ----------------- ------ ------------------------
 r-1     Justin      Currie     123 West Elm St   ...    ack192390190@gmail.com
 r-4     Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-7     Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-12    Justin      Currie     123 West Elm St   ...    fio190912380@cgg.com
 r-78    Justin      Currie     123 West Elm St   ...    doi210898r09@appco.com
 r-87    Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-204   Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-377   Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-589   Justin      Currie     123 West Elm St   ...    dfjo19290190@gmail.com
 r-829   Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
 r-911   Justin      Currie     123 West Elm St   ...    vnbb11090912@gmail.com
------- ----------- ---------- ----------------- ------ ------------------------

All of the information is identical (including the truncated City, State, and Postal columns), except for variations in email addresses. Because the records are not identical, Amperity will not collapse them into a single record for identity resolution purposes. But when too many records with trivial differences are present, the identity resolution process will not collapse records even when it is obvious that all of these records are the same individual.

The maximum allowed number of records with trivial duplicates is set to 10 by default. When the number of records with trivial duplicates is greater than this value, each individual record will be treated as a single record. Since the previous example shows eleven records for Justin Currie that differ only in the email column, eleven individual records would be created.

The configuration setting for trivial duplicates is:

:amperity.stitch.settings/soft-trivial-dupe-size-threshold 10

An increase to this value will decrease the likelihood that multiple records with trivial differences are collapsable into a single record for the identity resolution process. This setting should only be tuned after understanding the quality of downstream data, including stitched records.

Semantic exclusions

Semantic exclusions for trivial duplicates may be specified. When excluded, for the purpose of identifying clusters of records, the values associated with that semantic will be ignored. This should only be done for limited scenarios where certain types of data are known to be of lower quality.

To define an exclusion, add the following configuration setting for Stitch:

:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"semantic_name"}

where semantic_name is the name of a semantic, such as email. (This value is nil by default.)

Warning

Semantic exclusions should be applied very carefully. Use a sandbox to configure and apply a semantic exclusion, and then carefully review and validate that all downstream processes are not adversely affected by the change prior to applying a semantic change to a production environment.

Example: Email addresses

A long-running promotion for a free food item results in a large number of email addresses associated with the same first name, last name, and phone number. This results in a large number of nearly-identical records, each with a unique email address. You can use semantic exclusions to define a threshold over which records like this are collapsed into a trivial duplicate.

Configure Stitch to define a semantic exclusion for email addresses:

:amperity.stitch.settings/soft-trivial-dupe-semantic-exclusions #{"email"}

and then define the threshold:

:amperity.stitch.settings/soft-trivial-dupe-size-threshold 25

For each unique combination of PII (excluding email addresses), the distinct email addresses that are associated with that unique combination of PII are compared. If there are more than 25 distinct email addresses, those records are collapsed into a trivial duplicate.
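The comparison described in this example can be sketched in Python as a group-and-count step. This only illustrates the counting logic as worded in the example. The function name, the record layout, and the sample data are hypothetical, and the sketch does not model what Stitch does with the groups it finds.

```python
from collections import defaultdict


def groups_over_threshold(records, excluded, threshold):
    """Count distinct excluded-semantic values per unique combination of other PII.

    Returns only the combinations whose distinct count is greater than threshold.
    """
    distinct_values = defaultdict(set)
    for record in records:
        # The grouping key is every semantic except the excluded ones.
        pii_key = tuple(sorted(
            (name, value) for name, value in record.items() if name not in excluded
        ))
        excluded_value = tuple(record.get(name) for name in sorted(excluded))
        distinct_values[pii_key].add(excluded_value)
    return {
        key: len(values)
        for key, values in distinct_values.items()
        if len(values) > threshold
    }


# Hypothetical data: 30 promotion sign-ups that differ only by email address.
promo = [
    {"given-name": "Pat", "surname": "Lee", "phone": "555-0100", "email": f"pat{i}@example.com"}
    for i in range(30)
]
other = [
    {"given-name": "Sam", "surname": "Roy", "phone": "555-0101", "email": "sam@example.com"},
    {"given-name": "Sam", "surname": "Roy", "phone": "555-0101", "email": "sam.roy@example.com"},
]
flagged = groups_over_threshold(promo + other, excluded={"email"}, threshold=25)
```

Only the promotion group, with 30 distinct email addresses, exceeds the threshold of 25. The second individual, with two email addresses, is left alone.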