Deduplication keys per Amperity ID

Use this query to return a rough distribution of deduplication keys to Amperity IDs. The deduplication key may be set to a foreign key, a customer key, or a concatenated value.

There are no obvious “good” or “bad” results, but it’s useful to take a closer look at all clusters that return 2 (or more) deduplication keys, as this shows evidence of records that have been connected, but were previously independent, and may indicate overclustering.

  • Foreign keys are interesting because Amperity deterministically matches on foreign keys in most situations. When there are 2 (or more) deduplication keys related to foreign keys, this is an indication of records that were connected beyond what the customer could have connected deterministically.

  • Customer keys are interesting because they show the before and after of entity distribution.

  • Concatenated values, such as datasource + fk can help focus the results on a per-source level.

An overcluster, or a false positive, occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution and should be investigated to understand why it occurred.

Note

This query is really two queries.

  • The first query returns the distribution of deduplication keys by Amperity ID with one row for the number of Amperity IDs that have 2 (or more) deduplication keys.

  • The second query returns a list of Amperity IDs along with the number of deduplication keys associated to each Amperity ID, sorted from most to least.

When to use

  1. Run this query when foreign keys or customer keys are present in the data.

  2. If customer keys are present, use ck as the deduplication key.

  3. Run this query for each data source.

  4. Investigate clusters with 2 (or more) deduplication keys (~30-60 minutes).

  5. Look for examples of overclustering.

Configure query

  1. From the Queries page, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

  2. Specify the deduplication key to use for the query. The default deduplication key is a concatenation of data source and primary key:

    COUNT(distinct CONCAT(datasource, pk)) AS dk_count
    

    The deduplication key can be updated to be any combination of foreign key, customer key, primary key, and data source. For example, for a foreign key named fk-campaign you could use:

    COUNT(distinct fk_campaign) as dk_count
    

    or you could concatenate the foreign key and data source:

    COUNT(distinct CONCAT(datasource, fk_campaign)) as dk_count
    

    If customer keys are present, you could concatenate the customer key to data sources and primary keys:

    COUNT(distinct CONCAT(datasource, COALESCE(ck, pk))) AS dk_count
    
  3. By default, the deduplication key count is 2 (or more). You can adjust this value higher:

    SELECT
      *
    FROM
      dk_distribution
    -- UPDATE BELOW FOR DESIRED dk_count
    WHERE dk_count > 1
    ORDER BY dk_count DESC
    
  4. Click Run Segment to return a list of Amperity IDs, along with the number of deduplication keys associated to each Amperity ID.

  5. To return the distribution of deduplication keys by Amperity ID, first use /* and */ to comment out the size investigation SELECT statement:

    /*SELECT
      *
    FROM
      dk_distribution
    -- UPDATE BELOW FOR DESIRED dk_count
    WHERE dk_count > 1
    ORDER BY dk_count DESC*/
    

    then remove the comments from the distribution SELECT statement, and then click Run Segment.

  6. Click Activate.