Many Amperity IDs¶

Use this query to explore cases where semantic values are associated with many different Amperity IDs, which may result in underclustering.

The results of this query will help:

Determine whether Amperity has incorrectly split entities.
Understand the cardinality of semantic values.
Identify values to add to the bad-values blocklist.

An undercluster–a false negative–occurs when distinct records are incorrectly split from a cluster of records. Each undercluster affects the precision of identity resolution. Investigate to understand why they occur.

When to use¶

Run this query to identify cases where semantic values are associated with many Amperity IDs.
Run with email as the semantic.
Run with phone as the semantic.
Run with address as the semantic.
Run with other semantics, as necessary.
Examine the results. Look for examples of underclustering and for values that can be added to the bad-values blocklist.

Configure query¶

From the Queries page, open the Stitch QA folder, and then select this query.

Tip

Add the Stitch QA queries template folder if it does not already exist.

Run the query three times, once each for the following semantics: email, phone, and address. Update the highlighted lines prior to running the query.

For email:

SELECT
  LOWER(email),
  COUNT(DISTINCT (amperity_id)) num_amperity_ids
FROM Unified_Coalesced
WHERE email IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

For phone:

SELECT
  LOWER(phone),
  COUNT(DISTINCT (amperity_id)) num_amperity_ids
FROM Unified_Coalesced
WHERE phone IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

For address:

SELECT
  LOWER(address),
  COUNT(DISTINCT (amperity_id)) num_amperity_ids
FROM Unified_Coalesced
WHERE address IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

Run for other semantics, as necessary.

If you need to run the query against multiple semantics, comment out the SELECT statement used for single semantics and then uncomment the following SELECT statement:

/*
SELECT
  -- UPDATE FOR SEMANTICS YOU WANT TO USE
  LOWER(semantic)
  ,LOWER(semantic)
  ,COUNT(DISTINCT (amperity_id)) num_amperity_ids
FROM Unified_Coalesced
WHERE
  -- UPDATE FOR SEMANTICS YOU WANT TO USE
  semantic IS NOT NULL
  AND semantic IS NOT NULL
GROUP BY 1
-- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS
ORDER BY 2,3 DESC
LIMIT 100
*/

Update the list of semantic values in both the SELECT statement and the WHERE clause, and then specify the ordering for combined semantics.

For example, to run this statement for email, phone, and address:

SELECT
  LOWER(email)
  ,LOWER(phone)
  ,LOWER(address)
  ,COUNT(DISTINCT (amperity_id)) num_amperity_ids
FROM Unified_Coalesced
WHERE email IS NOT NULL
-- UPDATE FOR SEMANTICS YOU WANT TO USE
AND phone IS NOT NULL
AND address IS NOT NULL
GROUP BY 1
-- UPDATE ORDERING FOR NUMBER OF COMBINED SEMANTICS
ORDER BY 2,3 DESC
LIMIT 100

Keep track of values that indicate underclustering and values that should be added to the bad-values blocklist.
Use the following query to use those values to find the related Amperity IDs:
```
1SELECT
2  *
3FROM Unified_Coalesced
4WHERE lower(email) = 'value'
```
Replace email with the correct semantic and value with one of the values retured by the other SELECT statements in this query.

Tip

You can create a separate query that contains only this SELECT statement instead of commenting out the SELECT statements in this query that find a single semantic or a combination of semantics.
Click Run Query and debug any issues that may arise.
Click Activate.

Choose topic collections

Many Amperity IDs¶

When to use¶

Configure query¶