Many Amperity IDs¶
Use this query to explore cases where semantic values are associated with many different Amperity IDs, which may result in underclustering.
The results of this query will help:
Determine whether Amperity has incorrectly split entities.
Understand the cardinality of semantic values.
Identify values to add to the bad-values blocklist.
An undercluster–a false negative–occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution. Investigate to understand why they occur.
When to use¶
Run this query to identify cases where semantic values are associated with many Amperity IDs.
Run with email as the semantic.
Run with phone as the semantic.
Run with address as the semantic.
Run with other semantics, as necessary.
Examine the results (~30 minutes). Look for examples of underclustering and for values that can be added to the bad-values blocklist.
Configure query¶
From the Queries page, open the Stitch QA folder, and then select this query.
Tip
Add the Stitch QA queries template folder if it does not already exist.
Run the query three times, once each for the following semantics: email, phone, and address. Update the highlighted lines prior to running the query.
For email:
1SELECT 2 LOWER(email), 3 COUNT(DISTINCT (amperity_id)) num_amperity_ids 4FROM 5 Unified_Coalesced 6WHERE 7 email IS NOT NULL 8GROUP BY 1 9ORDER BY 2 DESC 10LIMIT 100
For phone:
1SELECT 2 LOWER(phone), 3 COUNT(DISTINCT (amperity_id)) num_amperity_ids 4FROM 5 Unified_Coalesced 6WHERE 7 phone IS NOT NULL 8GROUP BY 1 9ORDER BY 2 DESC 10LIMIT 100
For address:
1SELECT 2 LOWER(address), 3 COUNT(DISTINCT (amperity_id)) num_amperity_ids 4FROM 5 Unified_Coalesced 6WHERE 7 address IS NOT NULL 8GROUP BY 1 9ORDER BY 2 DESC 10LIMIT 100
Run for other semantics, as necessary.
If you need to run the query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:
1/* 2SELECT 3 -- UPDATE FOR SEMANTICS YOU WANT TO USE 4 LOWER(semantic) 5 ,LOWER(semantic) 6 ,COUNT(DISTINCT (amperity_id)) num_amperity_ids 7FROM 8 Unified_Coalesced 9WHERE 10 -- UPDATE FOR SEMANTICS YOU WANT TO USE 11 semantic IS NOT NULL 12 AND semantic IS NOT NULL 13GROUP BY 1 14-- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS 15ORDER BY 2,3 DESC 16LIMIT 100 17*/
Update the list of semantic values in both the SELECT statement and the WHERE clause, and then specify the ordering for combined semantics.
For example, to run this statement for email, phone, and address:
1SELECT 2 LOWER(email) 3 ,LOWER(phone) 4 ,LOWER(address) 5 ,COUNT(DISTINCT (amperity_id)) num_amperity_ids 6FROM 7 Unified_Coalesced 8WHERE 9 -- UPDATE FOR SEMANTICS YOU WANT TO USE 10 email IS NOT NULL 11 AND phone IS NOT NULL 12 AND address IS NOT NULL 13GROUP BY 1 14-- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS 15ORDER BY 2,3 DESC 16LIMIT 100
Keep track of values that indicate underclustering and values that should be added to the bad-values blocklist.
Use the following query to use those values to find the related Amperity IDs:
1SELECT 2 * 3FROM Unified_Coalesced 4WHERE lower(email) = 'value'
Replace email with the correct semantic and value with one of the values retured by the other SELECT statements in this query.
Tip
You can create a separate query that contains only this SELECT statement instead of commenting out the SELECT statements in this query that find a single semantic or a combination of semantics.
Click Run Query and debug any issues that may arise.
Click Activate.