Stitch

Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.

Stitch tab

The Stitch tab shows detailed results of the Stitch process, which takes customer data, exctracts customer records, and then compares record pairs using over 40 different machine learning models. Each record pair is given a score, which represents the strength of the match. Amperity creates clusters of records based on the connection between pairs, and then gives each cluster a unique Amperity ID.

Stitch tab

Run Stitch

A Stitch run can be started manually by clicking the Run button at the top of the Stitch tab. A Stitch run may be started automatically by an upstream process, such as after successful domain table updates that were initiated by courier group automation.

Select domain tables

The Stitch process only runs against selected domain tables. A domain table is made available to Stitch from its feed settings (via the Make available to Stitch configuration setting). A domain table that is made available to Stitch must also be selected from the list of domain tables in the Stitch Settings dialog box.

The tables that are selected to be part of the Stitch run are processed and compared for identity resolution, which will result in assignment of Amperity IDs to each unique individual that is discovered across all of the domain table data included in the Stitch run.

To add tables to the Stitch run

  1. From the Stitch tab, click Settings.

  2. In the Stitch Settings dialog box, under Table, select from the list of tables which ones are to be included in a Stitch run, and then click Save.

To start the Stitch run

  1. From the Stitch tab, click Run.

About the Stitch run

A Stitch run takes a certain amount of time, depending on the size of the data analyzed and the number of potential pairs in the data. In general, you should expect to wait at least 20 minutes (but as much as 2 hours) for the Stitch run to complete. You can navigate to other areas of Amperity and do other tasks while waiting for the notification to run and complete successfully. Stitch has two notifications: the Stitch run, and then the Stitch report. The UI will refresh with updates after the Stitch report.

Important

Before starting the Stitch run, verify that all tables that should be analyzed by Stitch have been made available to Stitch via the Feed Editor, that all processes that load data to Amperity (including couriers, feeds, and domain tables) have finished processing, and that all domain tables are selected.

Explore Stitch results

The Stitch tab shows the outcome of the Stitch process, including the number of unique Amperity IDs in customer data and a series of charts that highlight the connectivity between data sources.

The Stitch overview provides a high-level set of statistics about the current state of stitched records in Amperity, including deduplication rates and a way to explore Amperity IDs in the customer 360 database by stitched records, cluster graphs, pairwise connections, and by data source.

Explore by Amperity ID

An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.

As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retaining a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.

Note

The Amperity ID is a universally unique identifier (UUID) that is represented by 36 characters spread across five groups separated by hyphens: 8-4-4-4-12.

For example:

123e4567-e89b-12d3-a456-426614174000

Amperity IDs do not replace primary and foreign keys already assigned in customer data; Amperity IDs exist alongside primary and foreign keys within the customer 360 profile and act as key for finding clusters of unique customer records.

To explore by Amperity ID

  1. From the Stitch tab, click the Explore Amperity IDs button.

  2. This opens the Data Explorer to the Stitched Records tab.

  3. Use the left and right arrows surrounding the full name to view additional records.

  4. Click through each record and each tab. When finished exploring, click Close.

Explore by data source

The Stitched Sources section of the Stitch tab shows a comparison of domain tables and the record pairs identified both within each data source and across all data sources. This is presented as an UpSet Plot chart with links to the underlying data sources via the Data Explorer.

The following diagram shows the components of the UpSet plot chart, inclusive of the distribution of Amperity IDs across all data sources, and then for each data source, an individual breakdown of how that data source compares to all other data sources. (An UpSet plot chart will have a row for each data source. This diagram shows the first two only.)

An UpSet plot chart, located within the Stitch tab in Amperity.

Each individual stitched data source can be explored from the UpSet plot. The UpSet plot includes a source-by-source breakdown of stitched data. For each record, a View source link is available. This opens the Data Explorer and displays a Schema for the data source that shows the name of the field as it is defined in customer data, the data type, the Amperity semantic applied to the field, and sample data. A Sample shows 100 records from that data source, where each of the fields defined in the customer data source are presented as columns of data.

To explore by data source

  1. From the Stitch tab, under Stitched Sources, review the UpSet plot chart.

  2. Click the database table name for any database table in the Upset plot chart to view more information about that data source.

  3. Click the View source link to open the Data Explorer for that table, in which you can review the schema and view sample data.

  4. When finished exploring, click Close.

Explore previous Stitch results

You can explore previous Stitch results from the Stitch tab. From the Stitch tab, next to the Stitch page title, select a result from the drop-down menu to view the Stitch results for that Stitch run.

Explore semantics

A semantic is a way to apply a common understanding to individual points of data across multiple data sources, even when data sources have different schemas, naming conventions, and levels of data quality. Assigning a semantic tag to individual columns in customer data is an important prerequisite to the Stitch process.

The Semantics link at the top of the Stitch tab opens a dialog box that lists the configured semantics made available to Stitch from domain tables. This list is broken down by domain table, and then by semantic. For each semantic, it lists the semantic, the data type (string, date, integer, and so on), and the name of the field as defined in customer data.

To explore semantics

  1. From the Stitch tab, click the Semantics link. This opens the Stitch Tools dialog box.

  2. Each table that contains stitched records is listed in the dialog box.

  3. For each table, a list of semantic names, their types, and the fields to which they are associated is listed.

  4. When finished, click Close.

View cluster graph

Clustering is the process of deciding which records are included in a customer profile. A matching threshold defines the minimum threshold at which two records can be matched, and then included in a cluster. Lower quality matches may be included, but only as a transitive connection. Distinct customer profiles emerge as a cluster of record pairs.

A cluster graph is one of the outcomes of the Stitch process. It is a visual representation of every pairwise connection in a cluster of records. Each pair can be explored in more detail.

The Cluster Graph tab in the Data Explorer shows a graph with a line relationship between each stitched record, along with a detailed breakdown of PII similarities (and differences) for each pair of stitched records in the cluster graph.

The data explorer, showing the cluster graph.

To view the cluster graph

  1. From the Stitch tab, click the Explore Amperity IDs button.

  2. This opens the Data Explorer to the Stitched Records tab.

  3. Click the Cluster Graph tab.

  4. In the cluster graph, select individual lines to view the details for that pair of records. The columns on the right shows the fields in the records that are associated with PII semantics. Compare the values on each side to see how closely these two records match.

  5. Use the left and right arrows surrounding the full name to view additional cluster graphs for additional record clusters.

  6. When finished exploring, click Close.

View deduplication rate

The deduplication rate represents the total number of unique individuals within a customer data set. This rate measures the difference between the total number of original identifiers in customer data and the total number of Amperity IDs that were assigned to unique individuals.

Example

A tenant has three sources of customer records represented by tables 1, 2, and 3. In the Stitch report the:

  • Total number of records is 314.1k

  • Total number of clusters is 212.0k

  • Overall deduplication rate is 32.5%

  • Individual deduplication rates for three customer records are 7.7%, 6.6%, and 0%

How is this possible? Let’s walk through it.

The overall deduplication rate (32.5%) represents the total number of records relative to the number of Amperity IDs. There can be a low deduplication rate on individual tables, but high connectivity between tables.

An UpSet plot chart has a row for each table. In this case, the row for table 1 shows shows 117k source IDs and 108k Amperity IDs. This represents a 7.7% deduplication rate.

Deduplication rates for customer records.

Next compare the overlap between customer records 1 and 3 by hovering over customer record 1. The hover box shows there are more than 69k records shared between tables 1 and 3. This is a significant amount of overlap between two tables and is the primary contributor to the 32.5% overall deduplication rate.

Deduplication rate, explained

The deduplication rate is the reduction that occurs when the total number of Amperity IDs are compared to the original source IDs provided in customer data. For example:

  1. Total records: 314.1k. The sum of all records from all tables.

  2. Total clusters: 212.0k. The sum of all clusters from all tables.

  3. Records in table 1: 117k. The sum of all records in table 1.

  4. Clusters in table 1: 108k. The sum of all clusters in table 1.

The overall deduplication rate is 32.5%:

100 * [(314.1k - 212.9k) / 314.1k] = 32.5%

The deduplication rate for table 1 is 7.7%:

100 * [(117k - 108k) / 117k] = 7.7%

Important

Deduplication rate depends! The previous example shows deduplication rate for a database that does not use customer keys:

(total customer records - Amperity IDs) /  (total customer records)

When a database uses customer keys, the math to determine deduplication rate is the same, but the starting point is the customer keys.

(total customer key records - Amperity IDs) /  (total customer key records)

To view deduplication rate

  1. From the Stitch tab, under Overview, review the total source IDs and total Amperity IDs.

  2. The deduplication rate is shown on the right as a percentage.

View pairwise connections

A pairwise connection is a pair of matching records within a block that have an initial score above threshold. Each pairwise connection within a block is scored, after which all pairwise connections that scored above threshold represent a single, unique individual.

The Pairwise Connections tab in the Data Explorer shows a breakdown of stitched record pairs by score.

The data explorer, showing pairwise connections.

A score is assigned to every pairwise connection. The score is measured in two parts, separated by a period.

The first part–the record pair score–correlates to the match category, which is a machine learning classifier that is applied by Amperity to individual record pairs. The record pair score corresponds to the classification: 5 for exact matches, 4 for excellent matches, 3 for high matches, 2 for moderate matches, 1 for weak matches, and 0 for no conflicts.

The second part–the record pair strength–is used by Stitch to help determine the quality of the record pair score. This value appears in the Stitch report as a two decimal number. A record pair strength by itself is not a direct indicator of the quality of a pairwise connection score.

To view pairwise connections

  1. From the Stitch tab, click the Explore Amperity IDs button.

  2. This opens the Data Explorer to the Stitched Records tab.

  3. Click the Pairwise Connections tab.

  4. Pairwise connections are broken into categories by strength of connection: exact matches, excellent matches, high matches, moderate matches, weak matches, and no matches. Click each category to view individual records by strength of connection.

  5. Use the left and right arrows surrounding the full name to view additional pairwise connections for additional record clusters.

  6. When finished exploring, click Close.

View Stitch metrics

You can view metrics for changes to records and Amperity IDs that may have occurred between Stitch runs. Click the Stitch Metrics link in the notifications pane to open the Stitch Metrics dialog box.

The Stitch Metrics dialog box.

This dialog box identifies the tenant, the time at which the job started, the ID for the Stitch report, and the Stitch ID, and then shows the following details about this Stitch run:

  • The collapsed ID count refers to the number of records present after nearly-identical records were removed.

  • The related pairs count shows number of unique record pairs that were identified by a blocking strategy.

  • The filtered related pairs count shows the number of unique record pairs that scored above the matching category threshold.

The table contains a row for each data source that was made available to this Stitch run, along with columns for each row that show:

  • The number of Amperity IDs in the current Stitch run.

  • The number of Amperity IDs in the previous Stitch run.

  • The number of Amperity IDs in the current Stitch run that were not in the previous Stitch run.

  • The number of Amperity IDs that were in the previous Stitch run, but are not in the current Stitch run.

  • The number of distinct cluster transitions. A cluster transition occurs when records move from one cluster to another.

  • The number of distinct cluster transitions, including those from records added to or deleted from the dataset.

View stitched records

A stitched record is a unique output of the Stitch process that associates the Amperity ID to an individual customer record.

The Stitched Records tab in the Data Explorer shows a table with a row for each of the individual records that share the same Amperity ID.

The data explorer, showing stitched records.

To view stitched records

  1. From the Stitch tab, click the Explore Amperity IDs button.

  2. This opens the Data Explorer to the Stitched Records tab.

  3. All of the records that are associated with the same Amperity ID are listed. Columns show the differences beween each record.

  4. When finished exploring, click Close.