Unified Preprocessed Raw table

The Unified Preprocessed Raw table is an output of Stitch. Use this table as part of Stitch QA to view data exactly how values were used by Stitch after preprocessing and blocklisting. This table contains one row for each trivial duplicate. Only the semantics used by Stitch are shown. Semantic values represent pre-processing values and/or bad values that were removed (and replaced by NULL).

The Unified Preprocessed Raw table is automatically added to the Stitch QA database as a passthrough table when it is generated using the “Stitch QA” database template.

Use with Stitch QA

Use the Unified Preprocessed Raw table as part of the Stitch QA process to:

  • Ensure semantic tags are selected and visible to the Queries page

  • Investigate relationships between records, especially when records appear to be assigned (not assigned) to an incorrect (correct) Amperity ID, which defines a single cluster of records

  • Investigate blocklisted values

Tip

The Unified Coalesced and Unified Preprocessed Raw tables are frequently used together during Stitch QA because many investigation patterns will compare the values in one table to the values in the other.

Profile semantics and foreign keys

Use the Customer 360 page to verify that rows with profile semantic tags and foreign keys are selected to be available to the Queries page. Has the list of profile and/or foreign key semantics changed since the last time you have performed Stitch QA? This will occur when a data source is added that requires a new foreign key or in a situation where one of the less frequently used profile semantic tags is applied to a new data source.

Open the table in the Database Explorer and verify that all rows in the table that are associated with a semantic tag or a foreign key have a checkmark in the left column. Rows without a checkmark will not make the associated FIELD available to the Queries page. If rows do not have a checkmark, edit the table and apply the checkmark, save the table, activate the Stitch QA database, and then run the database to refresh the table.

Relationships between records

Use the Queries page to investigate unexpected relationships between individual records that share the same Amperity ID and to investigate why records that look like they should belong to the same cluster of records were split.

Use a combination of the Unified Coalesced and Unified Preprocessed Raw tables to review the data that is associated with these records to help understand why groups of records were matched or why certain records were assigned to a cluster.

In some cases, you may need to investigate and compare entire clusters to better understand why (or why not) these groups of records were maintained (or split) during hierarchical comparison, a step in the Stitch process that occurs after pairwise comparison and scoring.

Global blocklist values

The Unified Coalesced table provides additional data about individual records, such as blocklisted values, component IDs, or if a record belongs to a supersized cluster.

In some cases, groups of records may have values removed by the global bad-values blocklist instead of the tenant-specific bad-values blocklist.

Compare records in the Unified Coalesced table to records in the Unified Preprocessed Raw table to identify which values were removed.

Use with Customer 360

Important

The Unified Preprocessed Raw table should not be used outside of the Customer 360 page. It should not be used to send data to downstream or with workflows.

Cleaned vs. uncleaned data

You can compare cleaned data to uncleaned data to improve the quality of data in the customer 360 database.

  • The Unified Preprocessed Raw table contains a row from every table that is associated with an Amperity ID and contains the results of Stitch processing. Values in this table are “cleaned” data.

  • The Unified Coalesced table contains the data exactly as it was processed by Stitch. Values in this table are “uncleaned data”.

Use a query similar to the following to join fields in the Unified Preprocessed Raw table to the Unified Coalesced table, and then compare the results.

SELECT
  upr.amperity_id AS "upr_amperity_id"
  ,uc.has_blv
  ,uc.address AS "uc_address"
  ,upr.address AS "upr_address"
  -- ,uc.birthdate AS "uc_birthdate"
  -- ,upr.birthdate AS "upr_birthdate"
  ,uc.city AS "uc_city"
  ,upr.city AS "upr_city"
  ,uc.email AS "uc_email"
  ,upr.email AS "upr_email"
  ,upr.login_trimmed AS "upr_login_trimmed"
  -- ,uc.gender AS "uc_gender"
  -- ,upr.gender AS "upr_gender"
  ,uc.given_name AS "uc_given_name"
  ,upr.given_name AS "upr_given_name"
  ,uc.phone AS "uc_phone"
  ,upr.phone AS "upr_phone"
  ,upr.po_box AS "upr_po_box"
  ,uc.postal AS "uc_postal"
  ,upr.postal AS "upr_postal"
  ,uc.generational_suffix AS "uc_generational_suffix"
  ,upr.sk_generational_suffix AS "upr_sk_generational_suffix"
  ,upr.sk_given_name AS "upr_sk_given_name"
  ,uc.state AS "uc_state"
  ,upr.state AS "upr_state"
  ,uc.surname AS "uc_surname"
  ,upr.surname AS "upr_surname"
  ,upr.datasource AS "upr_datasource"
  ,upr.pk AS "pk"
  ,upr.supersized_id AS "upr_supersized_id"
  -- ,upr.component_id AS "upr_component_id"
  -- ,upr.int_id AS "upr_int_id"
FROM Unified_Coalesced uc
JOIN Unified_Preprocessed_Raw upr
ON uc.rep_ds=upr.datasource AND uc.rep_pk=upr.pk
WHERE UPPER(uc.address) <> upr.address
-- WHERE UPPER(uc.given_name) <> up.given_name
-- WHERE UPPER(uc.gender) <> upr.gender
-- WHERE UPPER(uc.postal) <> upr.postal
-- WHERE UPPER(uc.email) <> upr.email
-- WHERE UPPER(uc.city) <> upr.city
-- WHERE UPPER(uc.phone) <> upr.phone
-- WHERE uc.amperity_id = 'amperity-id'
LIMIT 1000

Compare semantic values

Use this query to compare raw values and cleaned values for any set of semantics that were used by Stitch.

Within the SELECT statement, semantics are grouped in pairs. For example:

,uc.phone AS "uc_phone"
,upr.phone AS "upr_phone"

Comment and uncomment semantic pairs as needed, depending on the semantic values you want to compare.

Explore cleaned values

Use the WHERE clauses to explore only cleaned values. Comment and uncomment this group of clauses, as necessary:

WHERE UPPER(uc.address) <> upr.address
-- WHERE UPPER(uc.given_name) <> up.given_name
-- WHERE UPPER(uc.gender) <> upr.gender
-- WHERE UPPER(uc.postal) <> upr.postal
-- WHERE UPPER(uc.email) <> upr.email
-- WHERE UPPER(uc.city) <> upr.city
-- WHERE UPPER(uc.phone) <> upr.phone

Filter by Amperity IDs

Filter by Amperity IDs to take a closer look a specific group of records, such as for overclustering and underclustering.

Use a series of WHERE clauses to compare specific Amperity IDs:

-- WHERE uc.amperity_id = 'amperity-id-1'
-- WHERE uc.amperity_id = 'amperity-id-2'
-- WHERE uc.amperity_id = 'amperity-id-3'

This can help identify situations where the bad-values blocklist isn’t blocking specific values. Compare the values in the Unified Preprocessed Raw table to the values in the Unified Coalesced table and verify that values were correctly added to the blv_ columns.

Column reference

The Unified Preprocessed Raw table contains the following columns:

Column name

Data type

Description

Amperity ID

String

The unique identifier that is assigned to clusters of customer records that all represent the same individual. The Amperity ID does not replace primary and foreign keys, but exists alongside them within unified profiles.

Note

The Amperity ID is a universally unique identifier (UUID) that is represented by 36 characters spread across five groups separated by hyphens: 8-4-4-4-12.

For example:

123e4567-e89b-12d3-a456-426614174000

Address

String

The address that is associated with the location of a customer, such as “123 Main Street”.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

Birthdate

Date

The date of birth that is associated with a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

City

String

The city that is associated with the location of a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

Component ID

Integer

An identifier that represents a set of records that are transitively connected with a score above threshold as an outcome of blocking and initial scoring. Records that share a component ID, but have different Amperity IDs, were split during hierarchical comparison.

Tip

Use this field during the Stitch QA process to help identify why certain records were grouped (or not grouped) together.

Also in: Unified Preprocessed Raw

Datasource

String

The name of the data source from which this customer record originated.

Tip

The combination of PK and Datasource uniquely identifies a row in the Unified Preprocessed Raw table, which correlates to a single row in a domain table.

Email

String

The email address that is associated with a customer. A customer may have more than one email address.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

FK [Name]

String

The fk-[namespace] semantic tag identifies a field as a foreign key. A foreign key semantic tag must be namespaced. For example: fk-customer, fk-interaction, fk-audience, or fk-brand.

A column is added for each foreign key that is defined in the Sources page.

Note

If foreign keys are linked together by a trivial duplicate they will appear in the Unified Preprocessed Raw table as a comma-separated list.

Also in: Unified Coalesced, Unified Customer

Gender

String

The gender that is associated with a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

Given Name

String

The first name that is associated with a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

Int ID

Integer

An identifier that represents a unique combination of datasource and pk (primary key) and their position in the Stitch record sort order after Stitch has identified (and removed) trivial duplicates.

Important

This identifier is based on Stitch record sort order and is determined each time Stitch runs. Records will not always have the same position in the record sort order; you should not expect this ID to be stable over time.

Tip

Use this field during the Stitch QA process to help identify why certain records were grouped (or not grouped) together.

Login Trimmed

String

Phone

String

The phone number that is associated with a customer. A customer may have more than one phone number.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

PK

String

The combination of data source and primary key allows Amperity to uniquely identify every row in every data table across the entirety of customer data input to Amperity.

Tip

The combination of PK and Datasource uniquely identifies a row in the Unified Preprocessed Raw table, which correlates to a single row in a domain table.

PO Box

String

Additional address information, such as an apartment number or a post office box, that is associated with the location of a customer, such as “Apt #9”.

Postal

String

The zip code or postal code that is associated with the location of a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

SK Generational Suffix

String

SK Given Name

String

State

String

The state or province that is associated with the location of a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer

Supersized ID

Integer

The identifier for a supersized record.

Indicates when a rough heuristic is applied to the first grouping of records (rep_pk) to partition supersized records into smaller components. Supersized records occur when more than 500 groups are associated with the first grouping of records.

Also in: Unified Coalesced

Surname

String

The last name that is associated with a customer.

Also in: Customer 360, Merged Customers, Unified Coalesced, Unified Customer