Unmatched semantic values

Use this query to investigate situations where unique semantic values are equal, but Amperity IDs are not.

An undercluster, or a false negative, occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution and should be investigated to understand why it occurred.

When to use

  1. Run this query to discover pairs of records that contain matching semantic values, but with different assigned Amperity IDs. For example: the same email address, but two different Amperity IDs. This query helps verify that Stitch correctly split records, and then also helps to discover underclustering.

  2. Run with email as the semantic value (typically ~10 minutes).

  3. Run with phone as the semantic value (typically ~10 minutes).

  4. Run with address as the semantic value (typically ~10 minutes).

  5. If other individual semantic values are useful for your tenant, run with those semantic values.

  6. Run for each semantic value with various JOIN conditions, such as given names matching or given names plus surnames matching (typically ~15 minutes).

  7. Run for each semantic with varying semantic frequency limit values (typically ~15-30 minutes).

  8. Look for examples of underclustering.

  9. Look for values that should be added to the bad-values blocklist.

  10. Look for examples of records that were missed during blocking.

  11. Look for values that may have been cleared during preprocessing.

Configure query

  1. From the Queries page, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

    Note

    These steps describe how to complete this template to find unmatched semantic values for email addresses only. You may modify this template in the SQL Query Editor directly prior to running it to support other semantics (i.e. changing email to phone everywhere) or you could create additional queries, one per semantic, so that each query may remain fully customized and tuned for that use case.

  2. Update the list of matching fields to contain any additional semantic or custom fields required by your tenant:

    SELECT DISTINCT
      ,t1.amperity_id AS amp_id_a
      ,t2.amperity_id AS amp_id_b
      ,t1.email AS email_a
      ,t2.email AS email_b
      ,t1.given_name AS given_name_a
      ,t2.given_name AS given_name_b
      ,t1.surname AS surname_a
      ,t2.surname AS surname_b
      -- ADD MATCHING PAIRS FOR SEMANTICS
    

    For example, to add address:

    SELECT DISTINCT
      ,t1.amperity_id AS amp_id_a
      ,t2.amperity_id AS amp_id_b
      ,t1.email AS email_a
      ,t2.email AS email_b
      ,t1.given_name AS given_name_a
      ,t2.given_name AS given_name_b
      ,t1.surname AS surname_a
      ,t2.surname AS surname_b
      ,t1.address AS address_a
      ,t2.address AS address_b
    
  3. Update the matching records for all foreign keys:

    -- ,t1.fk_field_name AS fk_field_name_a
    -- ,t2.fk_field_name AS fk_field_name_b
    

    For example:

    ,t1.fk_customer_id AS fk_customer_id_a
    ,t2.fk_customer_id AS fk_customer_id_b
    ,t1.fk_campaign AS fk_campaign_a
    ,t2.fk_campaign AS fk_campaign_b
    
  4. Update the high-frequency value–“100”–for this semantic:

    FROM
      Unified_Coalesced AS t1
      LEFT JOIN (
        SELECT email, TRUE AS high_freq_email
        FROM Unified_Coalesced
        GROUP BY email
        HAVING COUNT(email) > 100 ) AS hf
        ON LOWER(t1.email) = LOWER(hf.email)
      INNER JOIN
    
  5. Update the INNER JOIN for any required additional filtering to be done prior to returning matching conditions:

    INNER JOIN
      Unified_Coalesced AS t2
      ON t1.amperity_id <> t2.amperity_id
      AND LOWER(t1.email) = LOWER(t2.email)
      AND hf.high_freq_email IS NULL
      -- MORE JOIN CONDITIONS MAY BE ADDED
      -- AND LOWER(t1.matching_field) = LOWER(t2.matching_field)
    

    For example, to require given name and surname to match when email addresses do not match:

    INNER JOIN
      Unified_Coalesced AS t2
      ON t1.amperity_id <> t2.amperity_id
      AND LOWER(t1.email) = LOWER(t2.email)
      AND hf.high_freq_email IS NULL
      AND LOWER(t1.given_name) = LOWER(t2.given_name)
      AND LOWER(t1.surname) = LOWER(t2.surname)
    
  6. If using the bad-values blocklist you may apply those blocklist values to the results of this query:

    -- EXCLUDE ENTIRE ROW
    -- AND t1.has_blv IS NULL
    -- EXCLUDE BY FIELD
    -- AND t1.blv_surname IS NULL
    -- AND t1.blv_given_name IS NULL
    -- AND t1.blv_email IS NULL
    -- AND t1.blv_phone IS NULL
    -- AND t1.blv_address IS NULL
    

    Uncomment the following line to exclude on the entire row:

    -- EXCLUDE ENTIRE ROW
    AND t1.has_blv IS NULL
    -- EXCLUDE BY FIELD
    -- AND t1.blv_surname IS NULL
    -- AND t1.blv_given_name IS NULL
    -- AND t1.blv_email IS NULL
    -- AND t1.blv_phone IS NULL
    -- AND t1.blv_address IS NULL
    

    Uncomment the following line to exclude on email addresses only:

    -- EXCLUDE ENTIRE ROW
    -- AND t1.has_blv IS NULL
    -- EXCLUDE BY FIELD
    -- AND t1.blv_surname IS NULL
    -- AND t1.blv_given_name IS NULL
    AND t1.blv_email IS NULL
    -- AND t1.blv_phone IS NULL
    -- AND t1.blv_address IS NULL
    

    Uncomment the appropriate row (or rows) for this query’s use case.

  7. Click Run Query and debug any issues that may arise.

  8. Click Activate.