Many semantic values

Use this query to explore cases where a single Amperity ID is associated with many different semantic values or associated with combinations of semantic values that are generally good identifiers.

For example, email, given-name, and surname. Use this query to determine if Amperity has incorrectly clustered different entities together into supersized clusters or by overclustering.

A supersized cluster is a cluster of records that is discovered during the Stitch process that has more than 64 matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.

An overcluster, or a false positive, occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution and should be investigated to understand why it occurred.

When to use

  1. Run this query to identify cases where a single Amperity ID is associated with many semantic values.

  2. Run with email as the semantic.

  3. Run with phone as the semantic.

  4. Run with given-name and surname as the semantic.

  5. Run with other semantics, as necessary.

  6. Examine the results (~30 minutes). Look for examples of supersized clusters or for examples of overclustering.

Configure query

  1. From the Queries page, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

  2. You can run this query against multiple semantics. For example, email:

    SELECT
      LOWER(email),
      COUNT(DISTINCT (amperity_id)) num_amperity_ids
    FROM
      Unified_Coalesced
    WHERE
      email IS NOT NULL
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100
    

    Run for other semantics, as necessary.

  3. You can run this query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:

    /*
    SELECT
      amperity_id
      -- UPDATE FOR SEMANTICS YOU WANT TO USE
      ,COUNT(DISTINCT LOWER(semantic)) num_semantic
      ,COUNT(DISTINCT LOWER(semantic)) num_semantic
    FROM
      Unified_Coalesced
    GROUP BY 1
    ORDER BY 2,3 DESC
    LIMIT 100
    */
    

    For example, to run this statement for given-name and surname:

    SELECT
      amperity_id
      ,COUNT(DISTINCT LOWER(given_name)) num_given_name
      ,COUNT(DISTINCT LOWER(surname)) num_surname
    FROM
      Unified_Coalesced
    GROUP BY 1
    ORDER BY 2,3 DESC
    LIMIT 100
    
  4. Click Run Query and debug any issues that may arise.

  5. Click Activate.