Many semantic values

Use this query to explore cases where a single Amperity ID is associated with many different semantic values or associated with combinations of semantic values that are generally good identifiers.

For example, email, given-name, and surname. Use this query to determine if Amperity has incorrectly clustered different entities together into supersized clusters or by overclustering.

A supersized cluster is a cluster of records with more than 64 matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.

An overcluster–a false positive–occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution. Investigate to understand why they occur.

When to use

  1. Run this query to identify cases where a single Amperity ID is associated with many semantic values.

  2. Run with email as the semantic.

  3. Run with phone as the semantic.

  4. Run with given-name and surname as the semantic.

  5. Run with other semantics, as necessary.

  6. Examine the results (~30 minutes). Look for examples of supersized clusters or for examples of overclustering.

Configure query

  1. From the Queries page, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

  2. You can run this query against multiple semantics. For example, email:

     1SELECT
     2  LOWER(email),
     3  COUNT(DISTINCT (amperity_id)) num_amperity_ids
     4FROM
     5  Unified_Coalesced
     6WHERE
     7  email IS NOT NULL
     8GROUP BY 1
     9ORDER BY 2 DESC
    10LIMIT 100
    

    Run for other semantics, as necessary.

  3. You can run this query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:

     1/*
     2SELECT
     3  amperity_id
     4  -- UPDATE FOR SEMANTICS YOU WANT TO USE
     5  ,COUNT(DISTINCT LOWER(semantic)) num_semantic
     6  ,COUNT(DISTINCT LOWER(semantic)) num_semantic
     7FROM
     8  Unified_Coalesced
     9GROUP BY 1
    10ORDER BY 2,3 DESC
    11LIMIT 100
    12*/
    

    For example, to run this statement for given-name and surname:

    1SELECT
    2  amperity_id
    3  ,COUNT(DISTINCT LOWER(given_name)) num_given_name
    4  ,COUNT(DISTINCT LOWER(surname)) num_surname
    5FROM
    6  Unified_Coalesced
    7GROUP BY 1
    8ORDER BY 2,3 DESC
    9LIMIT 100
    
  4. Click Run Query and debug any issues that may arise.

  5. Click Activate.