Many Amperity IDs¶
Use this query to explore cases where semantic values are associated with many different Amperity IDs, which may result in underclustering.
The results of this query will help:
Determine whether Amperity has incorrectly split entities.
Understand the cardinality of semantic values.
Identify values to add to the bad-values blocklist.
An undercluster, or a false negative, occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution and should be investigated to understand why it occurred.
When to use¶
Run this query to identify cases where semantic values are associated with many Amperity IDs.
Run with email as the semantic.
Run with phone as the semantic.
Run with address as the semantic.
Run with other semantics, as necessary.
Examine the results (~30 minutes). Look for examples of underclustering and for values that can be added to the bad-values blocklist.
Configure query¶
From the Queries page, open the Stitch QA folder, and then select this query.
Tip
Add the Stitch QA queries template folder if it does not already exist.
Run the query three times, once each for the following semantics: email, phone, and address. Update the highlighted lines prior to running the query.
For email:
SELECT LOWER(email), COUNT(DISTINCT (amperity_id)) num_amperity_ids FROM Unified_Coalesced WHERE email IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 100
For phone:
SELECT LOWER(phone), COUNT(DISTINCT (amperity_id)) num_amperity_ids FROM Unified_Coalesced WHERE phone IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 100
For address:
SELECT LOWER(address), COUNT(DISTINCT (amperity_id)) num_amperity_ids FROM Unified_Coalesced WHERE address IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 100
Run for other semantics, as necessary.
If you need to run the query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:
/* SELECT -- UPDATE FOR SEMANTICS YOU WANT TO USE LOWER(semantic) ,LOWER(semantic) ,COUNT(DISTINCT (amperity_id)) num_amperity_ids FROM Unified_Coalesced WHERE -- UPDATE FOR SEMANTICS YOU WANT TO USE semantic IS NOT NULL AND semantic IS NOT NULL GROUP BY 1 -- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS ORDER BY 2,3 DESC LIMIT 100 */
Update the list of semantic values in both the SELECT statement and the WHERE clause, and then specify the ordering for combined semantics.
For example, to run this statement for email, phone, and address:
SELECT LOWER(email) ,LOWER(phone) ,LOWER(address) ,COUNT(DISTINCT (amperity_id)) num_amperity_ids FROM Unified_Coalesced WHERE -- UPDATE FOR SEMANTICS YOU WANT TO USE email IS NOT NULL AND phone IS NOT NULL AND address IS NOT NULL GROUP BY 1 -- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS ORDER BY 2,3 DESC LIMIT 100
Keep track of values that indicate underclustering and values that should be added to the bad-values blocklist.
Use the following query to use those values to find the related Amperity IDs:
SELECT * FROM Unified_Coalesced WHERE lower(email) = 'value'
Replace email with the correct semantic and value with one of the values retured by the other SELECT statements in this query.
Tip
You can create a separate query that contains only this SELECT statement instead of commenting out the SELECT statements in this query that find a single semantic or a combination of semantics.
Click Run Query and debug any issues that may arise.
Click Activate.