Unified_Preprocessed_Raw Table

The Unified_Preprocessed_Raw table is an output of Stitch. Use this table as part of Stitch QA to view data exactly how values were used by Stitch after preprocessing and blocklisting. This table contains one row for each trivial duplicate. Only the semantics used by Stitch are shown. Semantic values represent pre-processing values and/or bad values that were removed (and replaced by NULL).

The Unified_Preprocessed_Raw table is automatically added to the Stitch QA database as a passthrough table when it is generated using the “Stitch QA” database template.

Use with Stitch QA

Use the Unified_Preprocessed_Raw table as part of the Stitch QA process to:

  • Ensure semantic tags are selected and visible to the Queries tab

  • Investigate relationships between records, especially when records appear to be assigned (not assigned) to an incorrect (correct) Amperity ID, which defines a single cluster of records

  • Investigate blocklisted values

Tip

The Unified_Coalesced and Unified_Preprocessed_Raw tables are frequently used together during Stitch QA because many investigation patterns will compare the values in one table to the values in the other.

Profile semantics and foreign keys

Use the Customer 360 tab to verify that rows with profile semantic tags and foreign keys are selected to be available to the Queries tab. Has the list of profile and/or foreign key semantics changed since the last time you have performed Stitch QA? This will occur when a data source is added that requires a new foreign key or in a situation where one of the less frequently used profile semantic tags is applied to a new data source.

Open the table in the Database Explorer and verify that all rows in the table that are associated with a semantic tag or a foreign key have a checkmark in the left column. Rows without a checkmark will not make the associated FIELD available to the Queries tab. If rows do not have a checkmark, edit the table and apply the checkmark, save the table, activate the Stitch QA database, and then run the database to refresh the table.

Relationships between records

Use the Queries tab to investigate unexpected relationships between individual records that share the same Amperity ID and to investigate why records that look like they should belong to the same cluster of records were split.

Use a combination of the Unified_Coalesced and Unified_Preprocessed_Raw tables to review the data that is associated with these records to help understand why groups of records were matched or why certain records were assigned to a cluster.

In some cases, you may need to investigate and compare entire clusters to better understand why (or why not) these groups of records were maintained (or split) during hierarchical comparison, a step in the Stitch process that occurs after pairwise comparison and scoring.

Global blocklist values

The Unified_Coalesced table provides additional data about individual records, such as blocklisted values, component IDs, or if a record belongs to a supersized cluster.

In some cases, groups of records may have values removed by the global bad-values blocklist instead of the tenant-specific bad-values blocklist.

Compare records in the Unified_Coalesced table to records in the Unified_Preprocessed_Raw table to identify which values were removed.

Cleaned vs. uncleaned data

You can compare cleaned data to uncleaned data to improve the quality of data in the customer 360 database.

  • The Unified_Preprocessed_Raw table contains a row from every table that is associated with an Amperity ID and contains the results of Stitch processing. Values in this table are “cleaned” data.

  • The Unified_Coalesced table contains the data exactly as it was processed by Stitch. Values in this table are “uncleaned data”.

Use a query similar to the following to join fields in the Unified_Preprocessed_Raw table to the Unified_Coalesced table, and then compare the results.

SELECT
  upr.amperity_id AS "upr_amperity_id"
  ,uc.has_blv
  ,uc.address AS "uc_address"
  ,upr.address AS "upr_address"
  -- ,uc.birthdate AS "uc_birthdate"
  -- ,upr.birthdate AS "upr_birthdate"
  ,uc.city AS "uc_city"
  ,upr.city AS "upr_city"
  ,uc.email AS "uc_email"
  ,upr.email AS "upr_email"
  ,upr.login_trimmed AS "upr_login_trimmed"
  -- ,uc.gender AS "uc_gender"
  -- ,upr.gender AS "upr_gender"
  ,uc.given_name AS "uc_given_name"
  ,upr.given_name AS "upr_given_name"
  ,uc.phone AS "uc_phone"
  ,upr.phone AS "upr_phone"
  ,upr.po_box AS "upr_po_box"
  ,uc.postal AS "uc_postal"
  ,upr.postal AS "upr_postal"
  ,uc.generational_suffix AS "uc_generational_suffix"
  ,upr.sk_generational_suffix AS "upr_sk_generational_suffix"
  ,upr.sk_given_name AS "upr_sk_given_name"
  ,uc.state AS "uc_state"
  ,upr.state AS "upr_state"
  ,uc.surname AS "uc_surname"
  ,upr.surname AS "upr_surname"
  ,upr.datasource AS "upr_datasource"
  ,upr.pk AS "pk"
  ,upr.supersized_id AS "upr_supersized_id"
  -- ,upr.component_id AS "upr_component_id"
  -- ,upr.int_id AS "upr_int_id"
FROM Unified_Coalesced uc
JOIN Unified_Preprocessed_Raw upr
ON uc.rep_ds=upr.datasource AND uc.rep_pk=upr.pk
WHERE UPPER(uc.address) <> upr.address
-- WHERE UPPER(uc.given_name) <> up.given_name
-- WHERE UPPER(uc.gender) <> upr.gender
-- WHERE UPPER(uc.postal) <> upr.postal
-- WHERE UPPER(uc.email) <> upr.email
-- WHERE UPPER(uc.city) <> upr.city
-- WHERE UPPER(uc.phone) <> upr.phone
-- WHERE uc.amperity_id = 'amperity-id'
LIMIT 1000

Compare semantic values

Use this query to compare raw values and cleaned values for any set of semantics that were used by Stitch.

Within the SELECT statement, semantics are grouped in pairs. For example:

,uc.phone AS "uc_phone"
,upr.phone AS "upr_phone"

Comment and uncomment semantic pairs as needed, depending on the semantic values you want to compare.

Explore cleaned values

Use the WHERE clauses to explore only cleaned values. Comment and uncomment this group of clauses, as necessary:

WHERE UPPER(uc.address) <> upr.address
-- WHERE UPPER(uc.given_name) <> up.given_name
-- WHERE UPPER(uc.gender) <> upr.gender
-- WHERE UPPER(uc.postal) <> upr.postal
-- WHERE UPPER(uc.email) <> upr.email
-- WHERE UPPER(uc.city) <> upr.city
-- WHERE UPPER(uc.phone) <> upr.phone

Filter by Amperity IDs

Filter by Amperity IDs to take a closer look a specific group of records, such as for overclustering and underclustering.

Use a series of WHERE clauses to compare specific Amperity IDs:

-- WHERE uc.amperity_id = 'amperity-id-1'
-- WHERE uc.amperity_id = 'amperity-id-2'
-- WHERE uc.amperity_id = 'amperity-id-3'

This can help identify situations where the bad-values blocklist isn’t blocking specific values. Compare the values in the Unified_Preprocessed_Raw table to the values in the Unified_Coalesced table and verify that values were correctly added to the blv_ columns.

Column reference

The Unified_Preprocessed_Raw table contains the following columns:

Column Name

Data type

Description

amperity_id

String

The unique identifier that is assigned to clusters of customer records that all represent the same individual. The Amperity ID does not replace primary and foreign keys, but exists alongside them within unified profiles.

Note

The Amperity ID is a universally unique identifier (UUID) that is represented by 36 characters spread across five groups separated by hyphens: 8-4-4-4-12.

For example:

123e4567-e89b-12d3-a456-426614174000

address

String

The address that is associated with the location of an individual customer record. For example: 123 Main Street.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

birthdate

Date

The date of birth that is associated with an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

city

String

The city that is associated with the location of an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

component_id

Integer

An identifier that represents a set of records that are transitively connected with a score above threshold as an outcome of blocking and initial scoring. Records that share a component ID, but have different Amperity IDs, were split during hierarchical comparison.

Tip

Use this field during the Stitch QA process to help identify why certain records were grouped (or not grouped) together.

Also in: Unified_Preprocessed_Raw

datasource

String

The name of the data source from which this customer record originated.

Tip

The combination of pk and datasource uniquely identifies a row in the Unified_Preprocessed_Raw table, which correlates to a single row in a domain table.

email

String

The email address that is associated with an individual customer record. A customer record may be associated with multiple email addresses.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

fk_[name]

String

The fk-[namespace] semantic tag identifies a field as a foreign key. A foreign key semantic tag must be namespaced. For example: fk-customer, fk-interaction, fk-audience, or fk-brand.

A column is added for each foreign key that is defined in the Sources tab.

Note

If foreign keys are linked together by a trivial duplicate they will appear in the Unified_Preprocessed_Raw table as a comma-separated list.

Also in: Unified_Coalesced, Unified_Customer

gender

String

The gender that is associated with an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

given_name

String

The first name that is associated with an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

int_id

Integer

An identifier that represents a unique combination of datasource and pk (primary key) and their position in the Stitch record sort order after Stitch has identified (and removed) trivial duplicates.

Important

This identifier is based on Stitch record sort order and is determined each time Stitch runs. Records will not always have the same position in the record sort order; you should not expect this ID to be stable over time.

Tip

Use this field during the Stitch QA process to help identify why certain records were grouped (or not grouped) together.

login_trimmed

String

phone

String

The phone number that is associated with an individual customer record. A customer record may be associated with multiple phone numbers.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

pk

String

The combination of data source and primary key allows Amperity to uniquely identify every row in every data table across the entirety of customer data input to Amperity.

Tip

The combination of pk and datasource uniquely identifies a row in the Unified_Preprocessed_Raw table, which correlates to a single row in a domain table.

po_box

String

Additional address information, such as an apartment number or a post office box, that is associated with the location of an individual customer record. For example: Apt #9.

postal

String

The zip code or postal code that is associated with the location of an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

sk_generational_suffix

String

sk_given_name

String

state

String

The state or province that is associated with the location of an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer

supersized_id

Integer

The identifier for a supersized record.

Indicates when a rough heuristic is applied to the first grouping of records (rep_pk) to partition supersized records into smaller components. Supersized records occur when more than 500 groups are associated with the first grouping of records.

Also in: Unified_Coalesced

surname

String

The last name that is associated with an individual customer record.

Also in: Customer_360, Merged_Customers, Unified_Coalesced, Unified_Customer