Many Amperity IDs

Use this query to explore cases where semantic values are associated with many different Amperity IDs, which may result in underclustering.

The results of this query will help:

  • Determine whether Amperity has incorrectly split entities.

  • Understand the cardinality of semantic values.

  • Identify values to add to the bad-values blocklist.

An undercluster, or a false negative, occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution and should be investigated to understand why it occurred.

When to use

  1. Run this query to identify cases where semantic values are associated with many Amperity IDs.

  2. Run with email as the semantic.

  3. Run with phone as the semantic.

  4. Run with address as the semantic.

  5. Run with other semantics, as necessary.

  6. Examine the results (~30 minutes). Look for examples of underclustering and for values that can be added to the bad-values blocklist.

Configure query

  1. From the Queries page, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

  2. Run the query three times, once each for the following semantics: email, phone, and address. Update the highlighted lines prior to running the query.

    For email:

    SELECT
      LOWER(email),
      COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      email IS NOT NULL
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100
    

    For phone:

    SELECT
      LOWER(phone),
      COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      phone IS NOT NULL
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100
    

    For address:

    SELECT
      LOWER(address),
      COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      address IS NOT NULL
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100
    

    Run for other semantics, as necessary.

  3. If you need to run the query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:

    /*
    SELECT
      -- UPDATE FOR SEMANTICS YOU WANT TO USE
      LOWER(semantic)
      ,LOWER(semantic)
      ,COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      -- UPDATE FOR SEMANTICS YOU WANT TO USE
      semantic IS NOT NULL
      AND semantic IS NOT NULL
    GROUP BY 1
    -- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS
    ORDER BY 2,3 DESC
    LIMIT 100
    */
    

    Update the list of semantic values in both the SELECT statement and the WHERE clause, and then specify the ordering for combined semantics.

    For example, to run this statement for email, phone, and address:

    SELECT
      LOWER(email)
      ,LOWER(phone)
      ,LOWER(address)
      ,COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      -- UPDATE FOR SEMANTICS YOU WANT TO USE
      email IS NOT NULL
      AND phone IS NOT NULL
      AND address IS NOT NULL
    GROUP BY 1
    -- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS
    ORDER BY 2,3 DESC
    LIMIT 100
    
  4. Keep track of values that indicate underclustering and values that should be added to the bad-values blocklist.

  5. Use the following query to use those values to find the related Amperity IDs:

    SELECT
      *
    FROM Unified_Coalesced
    WHERE lower(email) = 'value'
    

    Replace email with the correct semantic and value with one of the values retured by the other SELECT statements in this query.

    Tip

    You can create a separate query that contains only this SELECT statement instead of commenting out the SELECT statements in this query that find a single semantic or a combination of semantics.

  6. Click Run Query and debug any issues that may arise.

  7. Click Activate.