Stitch labels

Stitch labels identify when distinct customer records were incorrectly merged together (overclustered) or when records for a single customer were incorrectly split apart (underclustered).

Over-clustering

An overcluster, or a false positive, occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution and should be investigated to understand why it occurred.

Precision is the ratio of true positives (correct matches) to the total number of true positives and false positives (incorrect matches). A decrease in overclustering will increase precision.

Under-clustering

An undercluster, or a false negative, occurs when records that belong to the same customer are incorrectly split from a cluster of records. Each undercluster affects the recall of identity resolution and should be investigated to understand why it occurred.

Recall is the ratio of true positives to the total number of true positives and false negatives (incorrect splits). An increase in underclustering will decrease recall.
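The two ratios can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function names and the counts are hypothetical and are not part of Stitch or of any Amperity API.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of matches made that were correct matches."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of matches that should have been made that actually were made."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


# Hypothetical counts: 90 correct matches, 10 overclusters, 30 underclusters.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))  # 0.9
print(recall(tp, fn))     # 0.75
```

With these made-up counts, removing overclusters (lowering the false positives) raises precision, and adding underclusters (raising the false negatives) lowers recall.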

Table schema

A Stitch labels table is defined in a CSV file. The file must contain the following fields:

  • row_id A sequential ordering of rows in the Stitch labels table, starting at 1.

  • label_id A string that must match the label_id for another row in the Stitch labels table. This string does not need to be unique, but it must be unique enough to allow the label_id / partition_id pairs to be uniquely identifiable within the table.

  • partition_id An integer (1 or 2) that identifies if two rows in the Stitch labels table should match (1 and 1) or be split (1 and 2).

  • datasource The name of the domain table from which the customer record originated.

  • semantic The semantic tag associated with the value. The recommended semantic is pk, as this is the Amperity ID for individual customer records. The pk value must be unique within the Stitch labels table.

    Note

    You may use other semantic values for Stitch labels, such as email or a foreign key, as long as the associated blocking strategy is configured for Stitch.

    1. The :email blocking strategy must be configured before using email as a semantic value.

    2. The :fk blocking strategy must be configured before using a foreign key as a semantic value.

    These blocking strategies are configured by default, but should be verified before using email addresses or foreign keys as semantic values for Stitch labels.

  • value The value for the field in the domain table.

For example:

row_id,label_id,partition_id,datasource,semantic,value
1,JCurrie,1,Table:One,email,justin.c@email.com
2,JCurrie,2,Table:Two,email,j.currie@mail.com
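Before uploading, it can help to check a file against the rules listed above. The following is a minimal sketch of such a check, assuming the file is read into a string. The function name validate_stitch_labels and the wording of the messages are hypothetical, and this is not a tool provided by Amperity.

```python
import csv
import io
from collections import Counter

REQUIRED_FIELDS = ["row_id", "label_id", "partition_id", "datasource", "semantic", "value"]


def validate_stitch_labels(csv_text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return [f"missing fields: {', '.join(missing)}"]

    rows = list(reader)
    problems = []
    label_counts = Counter(row["label_id"] for row in rows)
    pk_counts = Counter(row["value"] for row in rows if row["semantic"] == "pk")

    for expected, row in enumerate(rows, start=1):
        # row_id is a sequential ordering of rows, starting at 1.
        if row["row_id"] != str(expected):
            problems.append(f"row {expected}: row_id should be {expected}")
        # partition_id must be 1 or 2.
        if row["partition_id"] not in ("1", "2"):
            problems.append(f"row {expected}: partition_id must be 1 or 2")
        # Every label_id must match the label_id of another row.
        if label_counts[row["label_id"]] < 2:
            problems.append(f"row {expected}: label_id has no matching row")

    # pk values must be unique within the table.
    for value, count in pk_counts.items():
        if count > 1:
            problems.append(f"pk value {value!r} appears {count} times")
    return problems


example = """row_id,label_id,partition_id,datasource,semantic,value
1,JCurrie,1,Table:One,email,justin.c@email.com
2,JCurrie,2,Table:Two,email,j.currie@mail.com
"""
print(validate_stitch_labels(example))  # []
```

The check covers only the structural rules stated in this topic. It does not confirm that a datasource exists or that a blocking strategy is configured.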

A Stitch labels table is intended to fine-tune results. If too many entries exist in a Stitch labels table, the solution may be to tune Stitch itself before adding more entries to the labels table. You may use more than one Stitch labels table.

Match vs. split

The combination of label_id and partition_id identifies which rows in the Stitch labels table should match or be split.

For example:

-------- ---------- -------------- -------------- ----------- --------------------
 row_id   label_id   partition_id   datasource     semantic    value
-------- ---------- -------------- -------------- ----------- --------------------
 1        TSmith     1              Table:One      pk          123a-456b-789c
 2        TSmith     1              Table:Two      pk          456a-789b-123c
-------- ---------- -------------- -------------- ----------- --------------------

This tells Stitch that TSmith, despite having different values in different data sources, should be matched to the same customer record.

In contrast, the following table tells Stitch that the two JCurrie rows, which share a label_id but have different partition_id values, should be split into two customer records:

-------- ---------- -------------- -------------- ----------- --------------------
 row_id   label_id   partition_id   datasource     semantic    value
-------- ---------- -------------- -------------- ----------- --------------------
 1        JCurrie    1              Table:One      email       justin.c@email.com
 2        JCurrie    2              Table:Two      email       j.currie@mail.com
-------- ---------- -------------- -------------- ----------- --------------------

How Stitch labels work

The Stitch labels table does not require all of the possible combinations of semantic values to be specified. If any two rows in the Stitch labels table indicate that a customer record should be merged or split, it does not matter whether other semantic values match (or do not match) elsewhere. Stitch will force the outcome to be what the Stitch labels table indicates.

  • Records labeled with the same label_id and the same partition_id will be merged into the same cluster. All records associated with these two records will be merged into the same cluster.

  • Records labeled with the same label_id and a different partition_id will be split into different clusters based on the partition_id. Other records associated with these two records may be split (or may not be split), depending on the outcome of the Stitch clustering analysis for each individual customer record.

A Stitch labels table can have as many rows as required, but each individual row in the table must have a label_id that matches another label_id in another row in the table. More than one Stitch labels table may be used.
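The pairing rule can be illustrated with a short Python sketch that reads label rows and reports the outcome each pair requests. It only interprets the table and does not reproduce Stitch clustering. The function name label_outcomes is hypothetical, and the sample rows combine the two tables shown earlier.

```python
from collections import defaultdict
from itertools import combinations


def label_outcomes(rows):
    """Return (label_id, value_a, value_b, action) for each pair of rows sharing a label_id."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label_id"]].append(row)

    outcomes = []
    for label_id, group in by_label.items():
        for a, b in combinations(group, 2):
            # Same partition_id requests a merge; different partition_id requests a split.
            action = "merge" if a["partition_id"] == b["partition_id"] else "split"
            outcomes.append((label_id, a["value"], b["value"], action))
    return outcomes


rows = [
    {"label_id": "TSmith", "partition_id": 1, "value": "123a-456b-789c"},
    {"label_id": "TSmith", "partition_id": 1, "value": "456a-789b-123c"},
    {"label_id": "JCurrie", "partition_id": 1, "value": "justin.c@email.com"},
    {"label_id": "JCurrie", "partition_id": 2, "value": "j.currie@mail.com"},
]
for outcome in label_outcomes(rows):
    print(outcome)
# ('TSmith', '123a-456b-789c', '456a-789b-123c', 'merge')
# ('JCurrie', 'justin.c@email.com', 'j.currie@mail.com', 'split')
```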

Add a Stitch labels table

A Stitch labels table is a CSV file that is maintained as a local file, and then uploaded as a feed to Amperity.

  1. Use the split clusters query in the Stitch QA folder in the Queries page to look for examples of overclustering and underclustering.

  2. Use the SQL Query Editor to run a query similar to:

    SELECT
      *
    FROM Unified_Coalesced
    WHERE amperity_id IN ('123a-456b-789c','234d-567e-891f')
    

    where '123a-456b-789c' and '234d-567e-891f' represent the pair of Amperity IDs in the overcluster or undercluster.

    This query will return all of the rows associated with those Amperity IDs. Examine the results to understand if the customer records were merged or split correctly.

    Tip

    Use the Unified Preprocessed Raw table instead of the Unified Coalesced table to compare normalized values used by Stitch instead of the values in the source tables.

  3. Add instances of incorrectly merged and/or split customer records to a CSV file with the correct schema for Stitch labels.

  4. Ingest the CSV file as a feed.

  5. Select the row_id field as the primary key.

  6. Add semantic values to each field that matches the name of the column in the CSV file, with the exception of the semantic column, which must be associated with a profile (PII) semantic.

    The semantic tags for Stitch labels are: sl/label-id, sl/partition-id, sl/datasource, sl/semantic, and sl/value.

    For example, the label_id column should be assigned the sl/label-id semantic and a row with email should be assigned the sl/semantic semantic.

  7. Activate the feed and examine the results in the domain table.

  8. Run Stitch.

  9. Run the Customer 360.

  10. Re-run the split clusters Stitch QA query and verify that the customer records were merged or split correctly.
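The step that builds the local CSV file could be scripted rather than typed by hand. The sketch below assumes one pair of records per label. The helper names label_pair and to_csv are hypothetical, and uploading the file as a feed is still done in Amperity as described in the steps above.

```python
import csv
import io

FIELDS = ["row_id", "label_id", "partition_id", "datasource", "semantic", "value"]


def label_pair(label_id, first, second, merge, semantic="pk"):
    """Build two label rows; first and second are (datasource, value) tuples."""
    # Same partition_id (1 and 1) merges the pair; different (1 and 2) splits it.
    partitions = (1, 1) if merge else (1, 2)
    return [
        {"label_id": label_id, "partition_id": p, "datasource": ds,
         "semantic": semantic, "value": value}
        for p, (ds, value) in zip(partitions, (first, second))
    ]


def to_csv(rows):
    """Serialize label rows as CSV text, numbering row_id sequentially from 1."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row_id, row in enumerate(rows, start=1):
        writer.writerow({"row_id": row_id, **row})
    return buffer.getvalue()


rows = label_pair(
    "AdamAda",
    ("Table:One", "123a-456b-789c"),
    ("Table:Two", "234d-567e-891f"),
    merge=False,
)
print(to_csv(rows))
```

Numbering row_id at write time keeps the sequence intact when pairs are added or removed between uploads.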

Examples

The following examples show overclustering and underclustering, and how to apply the desired outcome using a Stitch labels table.

Note

This topic uses examples similar to those in the Stitch nicknames topic to show how to use Stitch labels instead of nicknames to help Stitch evaluate records so they are grouped correctly.

If the match/mismatch is only due to issues with given names, you should consider using nicknames instead of labels to resolve the issue.

Name conflict

Teaninau and Teeyon are phonetically similar names, but they are not obvious nicknames. After examining the merged customer records, it is unclear whether these individuals should be part of the same customer record. They may be related, as there are some shared details. If the customer records are merged, and you are sure they should not be part of the same customer record, add an entry to the Stitch labels table to ensure that Teaninau and Teeyon are always split into two customer records:

row_id,label_id,partition_id,datasource,semantic,value
1,TeaninauTeeyon,1,Table:One,pk,123a-456b-789c
2,TeaninauTeeyon,2,Table:Two,pk,234d-567e-891f

If the customer records are split, and you are sure they should be part of the same customer record, add an entry to the Stitch labels table to ensure that Teaninau and Teeyon are always merged into a single customer record:

row_id,label_id,partition_id,datasource,semantic,value
1,TeaninauTeeyon,1,Table:One,pk,123a-456b-789c
2,TeaninauTeeyon,1,Table:Two,pk,234d-567e-891f

Gender mismatch from typo

Adam and Ada are not the same name and it is unlikely that Ada is a nickname for Adam. Add an entry to the Stitch labels table to ensure that Adam and Ada are always split into two customer records:

row_id,label_id,partition_id,datasource,semantic,value
1,AdamAda,1,Table:One,pk,123a-456b-789c
2,AdamAda,2,Table:Two,pk,234d-567e-891f

Likely nickname

Ty and Tylian were split into two customer records, but after examining the split customer records and noticing they share other details (email address and phone number), it’s very likely that Ty is a nickname for Tylian. Add an entry to the Stitch labels table to ensure that Ty and Tylian are always merged into a single customer record:

row_id,label_id,partition_id,datasource,semantic,value
1,TyTylian,1,Table:One,pk,123a-456b-789c
2,TyTylian,1,Table:Two,pk,234d-567e-891f