How Stitch Works

Stitch uses patented algorithms to evaluate massive volumes of data to discover the hidden connections in your customer records that identify unique individuals. Stitch outputs a unified collection of data that assigns a unique identifier to each unique individual that is discovered within your customer records.

This is how Amperity finds your customers in your data.

Amperity uses a series of patented innovations to ensure that identity resolution against your customer data is accurate and that the output of the Stitch process represents a true unified view of your customers.

This topic is an introduction to how Stitch works. Read Entity Matching in the Wild: a Consistent and Versatile Framework to Unify Data in Industrial Applications for a detailed explanation of how Amperity provides a consistent, reliable, and stable customer ID.

Stages of identity resolution

Identity resolution is a critical step in understanding who your customers are. Stitch is the component within Amperity that performs identity resolution by comparing all of your customer data, identifying groups of customer records that belong together, and then identifying the unique customer profiles that represent each of your unique, individual customers.

The stages of identity resolution are:

  1. Semantic tagging

  2. Preprocessing data

  3. Union of tables

  4. Blocking

  5. Initial scoring

  6. Pairwise comparison

  7. Stable ID assignment

Semantic tags

A semantic tag is a way to apply a common understanding to individual points of data across multiple data sources, even when data sources have different schemas, naming conventions, and levels of data quality. Assigning a semantic tag to individual columns in customer data is an important prerequisite to the Stitch process.

Extract, load, transform (ELT)

An important benefit of semantic tagging is that raw data can be provided directly to Amperity, which avoids a traditional (and more expensive) extract, transform, and load (ETL) process. Amperity can extract, load, and then transform raw data from any number of large datasets.

A semantic tag standardizes profile (PII), transaction, and other important customer details across all columns in all data tables.

What semantic tags does Stitch rely on?

Stitch relies on the following semantic tags to be applied to customer records:

  • given-name (first name) and surname (last name). In some cases, a full-name is inferred (if not available).

  • Other important profile details, such as birthdate, email, and phone.

  • The address, address2, city, state, and postal tags are combined to represent a complete physical address.

  • Other location details, such as country and company.

  • Additional profile details, when available, such as gender, generational-suffix (Jr., Sr., III, etc.), and title.

Stitch uses foreign keys to associate individual customers to their interactions with your brands.

Semantic tags must be defined for every feed that will provide profile data to Stitch. This ensures that data from rich sources of profile data are brought into Amperity in a consistent manner, which improves the outcome of the Stitch process.

Semantic tagging works like this:

  1. A field in the customer’s system named “fname” stores an individual’s given name.

  2. A field in the customer’s system named “lname” stores the same individual’s last name.

  3. A field in the customer’s system named “primary-phone” stores a phone number.

  4. A field in the customer’s system named “date” stores an individual’s birthdate.

  5. And so on.

For those fields, the feed should apply semantic tags like this:

Input Field      Semantic Tag
-------------    ------------
fname            given-name
lname            surname
primary-phone    phone
date             birthdate

This same pattern is applied to every customer data source that is brought into Amperity and it results in every single semantically-tagged field being analyzed by Amperity during the Stitch process in exactly the same way.
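The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not how feeds are configured in Amperity: the `FEED_SEMANTICS` dictionary, the `apply_semantics` function, and the sample row (including the `store_id` column) are hypothetical.

```python
# Hypothetical mapping from one feed's column names to semantic tags.
FEED_SEMANTICS = {
    "fname": "given-name",
    "lname": "surname",
    "primary-phone": "phone",
    "date": "birthdate",
}


def apply_semantics(row, semantics):
    """Return the row re-keyed by semantic tag; untagged columns are dropped."""
    return {semantics[col]: value for col, value in row.items() if col in semantics}


# Sample values are invented for illustration.
row = {"fname": "Justin", "lname": "Currie", "primary-phone": "333-444-5678",
       "date": "1980-01-01", "store_id": "42"}
tagged = apply_semantics(row, FEED_SEMANTICS)
print(tagged)
```

Once every source is keyed by the same tags, downstream steps can treat a given name from any feed in the same way, regardless of what the source column was called.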

Amperity has built-in semantic tags for personally-identifiable information (PII), transactions, and behaviors. In addition, custom semantic tagging may be applied to fields when adding them can help identify unique individuals across massive data sets.

Preprocess data

Data is preprocessed into a consistent data pattern before it is combined into a virtual table for use with record matching, blocking, and pairwise comparison. Amperity preprocesses all values in all fields to which semantic tags for profile data were applied.

Note

Preprocessing data ensures that Stitch has access to consistent formatting of data for matching purposes. Preprocessed data is written to the Unified_Preprocessed_Raw data table. Amperity does not assert that preprocessed data values are better than the original values in the customer’s data.

Addresses

Amperity preprocesses addresses by converting common abbreviations to complete words, removing periods and commas (., ,), and converting all characters to UPPERCASE.

Original value             Preprocessed to
-----------------------    ----------------------------
123 W. Elm St.             123 WEST ELM STREET
123 West Elm St.
123 W Elm St
123 West Elm Street

44 holiday dr.             44 HOLIDAY DRIVE
44 Holiday Dr.
44 Holiday Drive

1000 1st Ave. Ste. 1960    1000 FIRST AVENUE SUITE 1960

555 Puget Ave              555 PUGET AVENUE PO BOX 555
P.O. Box 555
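A minimal sketch of this kind of address normalization, assuming a small abbreviation table that covers only the examples in this section. The `ABBREVIATIONS` dictionary and the `preprocess_address` function are hypothetical. A real abbreviation list would be much larger and would need to handle ambiguous tokens (for example, "St" as "Saint").

```python
import re

# Assumed abbreviation table, limited to the examples in this section.
ABBREVIATIONS = {
    "W": "WEST", "ST": "STREET", "DR": "DRIVE", "AVE": "AVENUE",
    "STE": "SUITE", "1ST": "FIRST",
}


def preprocess_address(value):
    # Remove periods and commas, then convert to uppercase.
    cleaned = re.sub(r"[.,]", "", value).upper()
    # Expand each abbreviated token to its complete word.
    return " ".join(ABBREVIATIONS.get(token, token) for token in cleaned.split())


for original in ["123 W. Elm St.", "44 holiday dr.", "1000 1st Ave. Ste. 1960"]:
    print(original, "->", preprocess_address(original))
```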

Phone numbers

Amperity preprocesses phone numbers by removing parentheses, hyphens, and spaces, consolidating every phone number to a numeric string. As the third example shows, a leading +1 country code is also removed.

Original value       Preprocessed to
-----------------    ---------------
(333)-444-5678       3334445678
415 290 5727         4152905727
+1 (978) 425 6779    9784256779
222 4455             2224455
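The phone number examples can be reproduced with a short sketch. The `preprocess_phone` function is hypothetical, and the rule for dropping a leading country code (an 11-digit number that starts with 1) is an assumption inferred from the third example, not a documented rule.

```python
import re


def preprocess_phone(value):
    digits = re.sub(r"\D", "", value)  # drop everything that is not a digit
    # Assumption: an 11-digit number starting with 1 carries a +1 country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


for original in ["(333)-444-5678", "415 290 5727", "+1 (978) 425 6779", "222 4455"]:
    print(original, "->", preprocess_phone(original))
```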

Email addresses

Amperity preprocesses email addresses by keeping only the local username and domain, separated by @, and converting all characters to UPPERCASE.

Important

All email addresses are validated against a common list of local username patterns that typically indicate junk email addresses, such as test@, no@, reservation@, and so on. When an email address matches one of these patterns, that value is preprocessed to NULL.

Original value              Preprocessed to
------------------------    --------------------
derek+1234@amperity.com     DEREK@AMPERITY.COM
test@goaway.com             NULL
gary.smith+123@gmail.com    GARY.SMITH@GMAIL.COM
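A minimal sketch of the email examples, assuming that a "+suffix" in the username is discarded and that junk addresses are detected by comparing the username to a short list. The `JUNK_LOCAL_PARTS` set contains only the patterns named in this section, and `preprocess_email` is a hypothetical function. Python's `None` stands in for NULL.

```python
# Assumed junk patterns, taken from the examples in this section.
JUNK_LOCAL_PARTS = {"TEST", "NO", "RESERVATION"}


def preprocess_email(value):
    local, _, domain = value.strip().partition("@")
    local = local.split("+", 1)[0].upper()  # drop any "+suffix" from the username
    if not local or not domain or local in JUNK_LOCAL_PARTS:
        return None  # stands in for NULL
    return f"{local}@{domain.upper()}"


for original in ["derek+1234@amperity.com", "test@goaway.com", "gary.smith+123@gmail.com"]:
    print(original, "->", preprocess_email(original))
```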

Note

Field values that were ignored during preprocessing are available as output of Stitch from the Unified_Preprocessed_Raw table. Field values that were ignored due to blocklisting are available as output of Stitch from the Unified_Coalesced table.

Union of tables

All records from all tables that contain customer profile data are merged into a single virtual table that aligns all of the data that is associated with all defined semantic types.

For example, a customer has data sources for online transactions, in-store transactions, loyalty programs, clickstream data for a mobile application, and so on.

Semantic tags are applied consistently across all of these data sources. Every email address, physical address, phone number, and first and last name is associated with profile semantics. Every order, item, purchase amount, discount amount, and return is associated with transaction semantics.

It is OK if each row does not contain a value for each column. The alignment itself is what is necessary to make this data usable by Stitch for downstream processing and identity resolution.

The following example shows a couple of rows from a few tables, with the aligned (and preprocessed) semantic values and with empty cells where the data source did not provide a value. Imagine this for all of your customer data: hundreds of millions of records, with fields in the virtual table that span your complete set of customer data.

Source       Surname  Address          Email            Postal  Loyalty ID
-----------  -------  ---------------  ---------------  ------  ----------
Loyalty      SMITH    123 MAIN STREET  JOHN@MAIL.COM    98101   A-12345-a
Loyalty      JONES    10 SOUTH LANE    JONES@GMAIL.COM  10101   B-23456-b
In-store     SMITH                                      98101   A-12345-a
Online       SMITH    123 MAIN STREET  JOHN@MAIL.COM    98101   A-12345-a
Online       JONES    10 SOUTH LANE    JONES@GMAIL.COM  10101   B-23456-b
Clickstream                                                     A-12345-a
Clickstream                                                     B-23456-b
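The alignment shown in the example above can be sketched as a union over dictionaries, where any column that a source does not provide is filled with `None`. The `COLUMNS` list and the `union_tables` function are hypothetical names used only for this illustration.

```python
# Shared columns of the virtual table, named after the example above.
COLUMNS = ["source", "surname", "address", "email", "postal", "loyalty_id"]


def union_tables(tables):
    """Align rows from every source to COLUMNS, filling gaps with None."""
    unified = []
    for source, rows in tables.items():
        for row in rows:
            aligned = {column: row.get(column) for column in COLUMNS}
            aligned["source"] = source
            unified.append(aligned)
    return unified


tables = {
    "In-store": [{"surname": "SMITH", "postal": "98101", "loyalty_id": "A-12345-a"}],
    "Clickstream": [{"loyalty_id": "A-12345-a"}, {"loyalty_id": "B-23456-b"}],
}
virtual_table = union_tables(tables)
for row in virtual_table:
    print(row)
```

Every row in the result has the same keys, which is the property that matters for downstream processing. Whether a given cell has a value does not.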

Blocking

Blocking is a process that uses simple rules to divide massive sets of data records into small blocks that are rapidly processed and offer higher probabilities of discovering matching records.

Note

Blocking is a non-trivial step for record linking in the Stitch process. An overly generous blocking strategy may result in a high recall rate (too many pairs being evaluated) along with negative system performance. An overly conservative blocking strategy may result in a low recall rate (too few pairs being evaluated). Individual blocking keys may be conservative or generous. The combination of blocking keys is what creates the ideal recall rate without compromising the performance of Amperity.

A blocking strategy acts like a filter against a very large data set. Each blocking strategy applies its filter and all records that match are grouped together into a block. Each record that matches a blocking strategy produces a blocking key.

A blocking key is a specific outcome of a blocking strategy. For example, a blocking strategy for email has a blocking key similar to customer@domain.com.

A block is a group of records that match the characteristics defined by the blocking strategy.

Blocks are created by comparing all records against all blocking strategies. When a record contains values that match a blocking strategy these values are combined into a single string value, also referred to as a blocking key.

For example, a blocking strategy that matches:

  • given-name(3)

  • surname(3)

  • postal

results in a blocking key string value similar to Jus:Cur:98101. This operation is similar to the following SELECT statement:

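SELECT
  l.pk AS left_pk
  ,r.pk AS right_pk
FROM unified_semantic_data AS l
JOIN unified_semantic_data AS r
-- Compare the first three characters of each name, plus the full postal code
ON SUBSTR(l.given_name, 1, 3) = SUBSTR(r.given_name, 1, 3)
AND SUBSTR(l.surname, 1, 3) = SUBSTR(r.surname, 1, 3)
AND l.postal = r.postal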
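The same blocking strategy can be sketched in Python by building the blocking key string for each record and grouping records that share a key. The `blocking_key` and `build_blocks` functions and the sample records are hypothetical. Skipping records that are missing one of the three values is an assumption of this sketch.

```python
from collections import defaultdict


def blocking_key(record):
    """Build a given-name(3):surname(3):postal key, or None if a value is missing."""
    parts = (record.get("given_name"), record.get("surname"), record.get("postal"))
    if not all(parts):
        return None
    given, surname, postal = parts
    return f"{given[:3]}:{surname[:3]}:{postal}"


def build_blocks(records):
    """Group record keys by blocking key."""
    blocks = defaultdict(list)
    for record in records:
        key = blocking_key(record)
        if key is not None:
            blocks[key].append(record["pk"])
    return dict(blocks)


# Sample records are invented for illustration.
records = [
    {"pk": 1, "given_name": "Justin", "surname": "Currie", "postal": "98101"},
    {"pk": 2, "given_name": "Justine", "surname": "Curry", "postal": "98101"},
    {"pk": 3, "given_name": "Maria", "surname": "Lopez", "postal": "10101"},
    {"pk": 4, "given_name": "Justin", "surname": "Currie", "postal": None},
]
blocks = build_blocks(records)
print(blocks)
```

Grouping by key avoids comparing every record to every other record, which is the point of blocking: only records inside the same block become candidate pairs.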

The following sections step through a series of diagrams that describe how blocking works.

Potential blocks

The blocking process starts with no matches between records.

The start of the blocking process contains zero matching records.

Each of these individual dots represents an individual record that can potentially match other records. In the following diagrams, dots are highlighted and lines are added between them to indicate that at least one blocking key match has been discovered by Stitch.

Given name, surname, birthdate

The blocking process steps through each blocking strategy, with each blocking strategy defining specific matching patterns against which all records are compared. As records are analyzed and matched to patterns, the matching strings are grouped together for later comparison.

Blocking by given-name(3), surname(1), and birthdate.

This example shows an important blocking strategy that groups values associated with the following semantics:

  • The first three characters in given-name

  • The first character in surname

  • birthdate.

Given name, surname, zip code

A record can match more than one blocking key. Some of the records highlighted in this example were also matched in the previous example.

Blocking by given-name(3), surname(3), and postal code.

This example shows another important blocking strategy that groups values associated with the following semantics:

  • The first three characters in given-name

  • The first three characters in surname

  • postal.

First 5 characters in email

As each blocking strategy is applied, more groups of records are identified.

Many email addresses are not useful for identity resolution. Some of them are generic, such as info@some-domain.com, and are often associated with a place of business and should never be associated with a unique individual. Other email addresses are bogus: they were entered only to satisfy a requirement to provide an email address and are fake, such as 123@some-domain.com.

Amperity uses a list of known “bad” email patterns, such as admin@, contact@, guest@, no@, none@, and then uses the list to exclude from Stitch results any email address that matches a pattern in the list. (This step is done during preprocessing, not blocking.)

Blocking by email(5).

This example shows additional record matches discovered after comparing the first five characters in email addresses across records.

Blocking complete

When finished, the blocking process has unioned all of the matching blocking keys together into distinct groups of records.

The end of the blocking process has identified groups of records that are ready for pairwise comparison.

These groups of records will be scored, first in an initial scoring pass that quickly filters out matching pairs that score below threshold, and then in a detailed pass that compares each record in a group to all of the other records in that group.
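One common way to union blocks from several blocking strategies into distinct groups is a union-find (disjoint set) structure. The following is a minimal sketch under that assumption and is not a description of how Stitch is implemented. The `group_records` function, the blocking keys, and the record numbers are hypothetical.

```python
def group_records(blocks_by_strategy):
    """Union every block so records that share any blocking key end up together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for blocks in blocks_by_strategy:
        for members in blocks.values():
            first = find(members[0])
            for other in members[1:]:
                parent[find(other)] = first

    groups = {}
    for record in parent:
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


# Blocks from three strategies; a record can appear under more than one key.
name_birthdate = {"Jus:C:1980-01-01": [1, 2]}
name_postal = {"Jus:Cur:98101": [2, 3], "Mar:Lop:10101": [4, 5]}
email = {"MARIA": [5, 6]}
groups = group_records([name_birthdate, name_postal, email])
print(groups)
```

Record 2 appears in two blocks, which links records 1, 2, and 3 into one group even though records 1 and 3 never share a blocking key.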

Initial scoring

Each of the matching pairs that was directly identified during blocking is scored. Matching pairs that score below threshold are filtered out, which creates smaller groups of records and also new groups of records, depending on which matching pairs are filtered out.

The following example shows several matching pairs scoring below threshold, using larger dots to indicate which matching pairs scored below threshold.

  • Four groups of records show matching pairs scoring below threshold in a way that will split each of them into two groups of records.

  • One group of records shows three scores below threshold, one that does not affect the number of records in the group (because other scores for that record were above threshold) and one that is removed entirely.

Matching pairs discovered during blocking are quickly scored.

The remaining matching pairs that scored above threshold remain in groups. The following example shows most groups getting smaller, but also four new groups identified.

Fewer records in groups, but also new groups.

Note

Initial scoring uses the same scoring method as pairwise comparison, with exact, excellent, and high scores being “above threshold” and moderate, weak, and no conflict scores being “below threshold”. The individual scoring methods are covered in greater detail in the following section about pairwise comparisons.
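The effect of filtering can be sketched by treating matching pairs as edges in a graph: dropping the edges that score below threshold may split a group in two or remove a record entirely. The `regroup` function and the sample scores are hypothetical. The numeric scores follow the match categories described in the next section (5 for exact through 0 for no conflict), with high (3) as the threshold.

```python
def regroup(pairs, threshold=3):
    """Drop pairs scoring below threshold, then return the groups that remain."""
    neighbors = {}
    for (a, b), score in pairs.items():
        if score >= threshold:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    groups, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Walk every record reachable through above-threshold pairs.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# One blocked group of five records; the 3-4 pair scores moderate (2).
pairs = {(1, 2): 5, (2, 3): 4, (3, 4): 2, (4, 5): 3}
print(regroup(pairs))
```

In this sketch, the single below-threshold pair between records 3 and 4 splits one group of five records into two smaller groups.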

Pairwise comparison

A pairwise comparison is a process that compares, and then scores, all of the possible connections between all records in a group of records.

A pairwise connection is a pair of matching records within a block that have an initial score above threshold. Each pairwise connection within a block is scored, after which all pairwise connections that scored above threshold represent a single, unique individual.

Note

Pairwise comparison uses the same scoring method as initial scoring, but expands scoring to include records with transitive connections.

A transitive connection exists between individual records when any two records share a strong match to an intermediate record, but do not have a strong match to each other. For example: record 1 matches record 2, record 3 matches record 2, and records 1 and 3 do not match each other, but they have a transitive connection because both match record 2.
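The definition of a transitive connection can be sketched as a small function that finds every pair of records that share a strongly matching neighbor but lack a direct strong match. The `transitive_connections` function is hypothetical and is only meant to make the definition concrete.

```python
from itertools import combinations


def transitive_connections(strong_pairs):
    """Return pairs that lack a direct strong match but share a matching record."""
    direct = {frozenset(pair) for pair in strong_pairs}
    neighbors = {}
    for a, b in strong_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    found = set()
    for linked in neighbors.values():
        # Any two records linked to the same intermediate record are candidates.
        for a, b in combinations(sorted(linked), 2):
            if frozenset((a, b)) not in direct:
                found.add((a, b))
    return sorted(found)


# Record 1 matches record 2, and record 3 matches record 2.
print(transitive_connections([(1, 2), (3, 2)]))
```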

Let’s walk through the process of pairwise scoring using a single block of records:

A single block of records ready for pairwise comparison scoring.

A score is assigned to every pairwise connection. The score is measured in two parts, separated by a period.

The first part, the record pair score, correlates to the match category, which is assigned by a machine learning classifier that Amperity applies to individual record pairs. The record pair score corresponds to the classification: 5 for exact matches, 4 for excellent matches, 3 for high matches, 2 for moderate matches, 1 for weak matches, and 0 for no conflicts.

The second part, the record pair strength, is used by Stitch to help determine the quality of the record pair score. This value appears in the Stitch report as a two-decimal number. A record pair strength by itself is not a direct indicator of the quality of a pairwise connection score.

The following thresholds are available:

Threshold    Match category
---------    --------------
5            Exact
4            Excellent
3            High
2            Moderate
1            Weak

By default, only record pairs with a pairwise comparison score of exact, excellent or high are kept.
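A minimal sketch of reading a two-part score, assuming it is available as a string such as "4.87" in which the digit before the period is the record pair score and the digits after it are the record pair strength. That string format, the `MATCH_CATEGORIES` dictionary, and the `read_score` function are assumptions made for illustration.

```python
MATCH_CATEGORIES = {5: "exact", 4: "excellent", 3: "high",
                    2: "moderate", 1: "weak", 0: "no conflict"}


def read_score(score, threshold=3):
    """Split a score such as "4.87" into its category, strength, and keep flag."""
    pair_score, _, strength = score.partition(".")
    category = MATCH_CATEGORIES[int(pair_score)]
    # Only the record pair score is compared to the threshold.
    return category, strength, int(pair_score) >= threshold


print(read_score("4.87"))
print(read_score("2.95"))
```

The second call shows that a high strength does not rescue a pair whose record pair score falls below threshold.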

Important

Records are scored based on a number of features, including:

  • String matching patterns, such as Levenshtein and Jaro–Winkler distances, and Jaccard similarity

  • Commonality statistics that focus on name distributions

  • Name matching, including for nicknames, combined with addresses and phone numbers

  • Lookup tables

Comparisons are made across a broad set of categories, including names, birthdates, email addresses, physical locations, and phone numbers.
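Two of the string matching measures named above can be sketched with standard textbook implementations. These show what each measure computes and are not Amperity's scoring code. Computing Jaccard similarity over whitespace-separated tokens is a choice made for this sketch, and the sample strings are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Share of distinct tokens that two strings have in common."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("CURRIE", "CURRY"))
print(jaccard("123 MAIN STREET", "123 MAIN ST"))
```

A low edit distance or a high token overlap suggests that two values are close variants of each other, which is one kind of signal a scoring model can combine with the other features in this list.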

Category

Comparisons

Names

  • How popular is the name?

  • How closely do the names match?

  • Do the names match on first or last, but not first and last?

  • Are there any obvious conflicts?

  • Are the names unlike each other?

Birthdates

  • How closely do the birthdates match?

  • Are the birthdates unlike each other?

Email addresses

  • Do the usernames match exactly?

  • Are there common or uncommon usernames that match, despite having different domains?

  • Does one side of the email address (username or domain) have an exact match and the other side have an approximate match?

Physical locations

  • How closely do the addresses match?

  • Do the addresses have the same zip code or city?

  • Are there any obvious conflicts?

  • Are the addresses unlike each other?

Phone numbers

  • Do the phone numbers match exactly?

Potential connections

The pairwise comparison process goes beyond initial scoring to compare (and then score) all of the possible connections between all of the records that belong to the same group.

This section uses a group of eight records to show how pairwise comparisons work. A line between two records indicates the match category that their comparison produced.

The start of the pairwise comparison process contains zero connections.

This example shows the start of the pairwise comparison process and zero connections.

Exact matches

An exact match score is applied to records in which all profile data matches or when a foreign key is present for both records in the pair and the associated values are identical.

An exact pairwise comparison scoring match between two records.

This example shows an exact match between two records.

Excellent matches

An excellent match score is applied to records that, even with certain types of profile data not matching, are an obvious match.

An excellent pairwise comparison scoring match between two records

This example shows an excellent match between two records.

High matches

A high match score is applied to records that, even with some profile data not matching, after some deductive reasoning, appear to be records that match.

A high pairwise comparison scoring match between two records

This example shows a high match between two records. The last names and zip codes are exact matches. The first names do not match, but do share a common nickname. The email addresses do not match, but are identical before the @ symbol.

Moderate matches

A moderate match score is applied to records that have weak or fuzzy matches between highly unique customer attributes, such as email, phone, and address.

A moderate pairwise comparison scoring match between two records.

This example shows a moderate match between two records.

Weak matches

A weak match score is applied to records that match on non-unique customer attributes, such as name, state, and zip code, but cannot be easily associated with the same unique individual.

A weak pairwise comparison scoring match between two records.

This example shows a weak match between two records.

No conflicts

A no conflict score is applied to records in which core profile data between records does not match or when a separation key is present for both records in the pair and the associated values are conflicting.

No conflicts between two records.

This example shows no conflicts between two records.

All connections

After pairwise comparisons are completed and scored, the connections that scored below threshold (moderate, weak, and no conflict) are dropped. What remains is a group of records that identifies a unique person and to which an Amperity ID is assigned.

All of the pairwise comparisons that scored above threshold.

This example shows all of the pairwise comparisons that scored above threshold (exact, excellent, and high). (The score at which record pairs fall below threshold is configurable. Moderate is the default threshold at which record pairs are dropped.)

Hierarchical comparison

A hierarchical comparison is a step in the Stitch process that occurs after pairwise scoring to closely examine each group of records to identify edge cases, such as married couples with overlapping profile (PII) data or children with the same name as a parent who live at the same address.

A hierarchical comparison that identifies enough conflicting data will allow Amperity to assert that a group of records should be split into two (or more) groups of records.

A hierarchical comparison identifies a cluster as actually being two individuals, and then splits them.

This example shows a group of records that has been identified by hierarchical comparison to represent two individuals, after which they are split into two groups of records.

Stable ID assignment

An Amperity ID is a patented unique identifier that is assigned to clusters of customer records. A single Amperity ID represents a single individual. Unlike other systems, the Amperity ID is reassessed every day for the most comprehensive view of your customers.

Important

Stable ID assignment is about minimizing unnecessary changes in Amperity ID assignment to customer records over time.

As new data is input to Amperity, the Stitch process identifies when new or changed data applies to existing clusters of customer records, and then updates those records, maintains the cluster, and retains a stable Amperity ID assignment. A new Amperity ID is only created when new individuals are identified.

A cluster with three unique individuals, each of which was assigned an Amperity ID.

This example shows three unique clusters of records, each of which was assigned an Amperity ID.

Note

In some cases, the Amperity ID that is assigned to a cluster does change. This is referred to as jitter and it occurs when new data forces the reassignment of the Amperity ID. For example, consider a single cluster of records for a customer named Frank Janson. Amperity is then provided new data that allows Stitch to identify that there are really two Frank Jansons: Frank Janson Sr. and Frank Janson Jr. Stitch results will show jitter when the Amperity ID assignment is updated to reflect the correct association of customer records.
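The idea of keeping IDs stable can be illustrated with a simple heuristic: each new cluster reuses the ID of the previous cluster it overlaps most, and a new ID is created only when no previous ID is available. This sketch is an assumption made for illustration and does not describe Amperity's patented method. The `assign_stable_ids` function, the record keys, and the "amp-001" ID are hypothetical.

```python
import uuid


def assign_stable_ids(previous, clusters):
    """previous maps ID -> set of record keys; clusters is a list of sets."""
    assigned, used = {}, set()
    # Larger clusters pick first, so a split leaves the old ID with the bigger half.
    for cluster in sorted(clusters, key=len, reverse=True):
        candidates = [(len(cluster & records), old_id)
                      for old_id, records in previous.items()
                      if old_id not in used and cluster & records]
        if candidates:
            _, chosen = max(candidates)  # reuse the ID with the largest overlap
        else:
            chosen = str(uuid.uuid4())   # a newly identified individual gets a new ID
        used.add(chosen)
        assigned[chosen] = cluster
    return assigned


# One existing cluster is split in two after new data (record r4) arrives.
previous = {"amp-001": {"r1", "r2", "r3"}}
clusters = [{"r3"}, {"r1", "r2", "r4"}]
result = assign_stable_ids(previous, clusters)
print(result)
```

In this sketch the larger cluster keeps "amp-001" and the split-off record receives a new ID, which is the kind of change that would appear as jitter.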

Your data, your customers

Amperity accurately identifies all of your unique customers in your data. All of your unique customers are assigned an Amperity ID.

This is how Amperity finds your customers in your data.

Use the Amperity ID to understand how your customers have interacted with your brands and to determine the best ways your company can identify your best and most valuable customers and continue to engage with them.