Many semantic values¶
Use this query to explore cases where a single Amperity ID is associated with many different semantic values or associated with combinations of semantic values that are generally good identifiers.
For example, email, given-name, and surname. Use this query to determine if Amperity has incorrectly clustered different entities together into supersized clusters or by overclustering.
A supersized cluster is a cluster of records with more than 64 matching records. A supersized cluster does not typically represent a unique individual and is not worthy of further analysis.
An overcluster–a false positive–occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution. Investigate to understand why they occur.
When to use¶
Run this query to identify cases where a single Amperity ID is associated with many semantic values.
Run with email as the semantic.
Run with phone as the semantic.
Run with given-name and surname as the semantic.
Run with other semantics, as necessary.
Examine the results (~30 minutes). Look for examples of supersized clusters or for examples of overclustering.
Configure query¶
From the Queries page, open the Stitch QA folder, and then select this query.
Tip
Add the Stitch QA queries template folder if it does not already exist.
You can run this query against multiple semantics. For example, email:
1SELECT 2 LOWER(email), 3 COUNT(DISTINCT (amperity_id)) num_amperity_ids 4FROM 5 Unified_Coalesced 6WHERE 7 email IS NOT NULL 8GROUP BY 1 9ORDER BY 2 DESC 10LIMIT 100
Run for other semantics, as necessary.
You can run this query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:
1/* 2SELECT 3 amperity_id 4 -- UPDATE FOR SEMANTICS YOU WANT TO USE 5 ,COUNT(DISTINCT LOWER(semantic)) num_semantic 6 ,COUNT(DISTINCT LOWER(semantic)) num_semantic 7FROM 8 Unified_Coalesced 9GROUP BY 1 10ORDER BY 2,3 DESC 11LIMIT 100 12*/
For example, to run this statement for given-name and surname:
1SELECT 2 amperity_id 3 ,COUNT(DISTINCT LOWER(given_name)) num_given_name 4 ,COUNT(DISTINCT LOWER(surname)) num_surname 5FROM 6 Unified_Coalesced 7GROUP BY 1 8ORDER BY 2,3 DESC 9LIMIT 100
Click Run Query and debug any issues that may arise.
Click Activate.