Cluster Size Distribution

Use this query to understand the overall distribution of clusters. A good Stitch result will have an overwhelming majority of small clusters; however, there can be a long-tail of supersized clusters. The query defaults to using primary keys, but should be configured to use customer keys when they are available.

A supersized cluster is a cluster of records that is discovered during the Stitch process that has more than ~100 matching records. When a cluster has more than ~100 records, this is more often an indicator for abandonment of continued analysis than one of an indicator of interest for further analysis.

Note

The presence of supersized clusters may indicate overclustering.

An overcluster, or a false positive, occurs when distinct records are incorrectly added to a cluster of records. Each overcluster affects the precision of identity resolution and should be investigated to understand why it occurred.

When to use

  1. Run this query to identify if distinct entities are clustered together.

  2. Review the cluster size distribution value specified in the query, and then adjust this size limit as appropriate.

  3. Look for supersize clusters or other indicators of overclustering.

    Important

    The presence of clusters that exist outside the 1-20 range is an indicator of overclustering. If a lot of clusters are present at the high end of this range, it’s important to investigate those cases to identify if distinct entities were clustered together.

  4. If present, investigate each cluster individually.

Configure query

  1. From the Queries tab, open the Stitch QA folder, and then select this query.

    Tip

    Add the Stitch QA queries template folder if it does not already exist.

  2. When customer keys are available, update the common table expression to use both customer and primary keys, instead of just primary keys. Change the following:

    SELECT DISTINCT amperity_id,
      pk
    FROM Unified_Coalesced) GROUP BY amperity_id
    

    to:

     SELECT DISTINCT amperity_id,
       CONCAT(datasource, COALESCE(ck, pk)) AS pk
     FROM Unified_Coalesced) GROUP BY amperity_id
    
  3. You may adjust the query to count clusters of any size. The default cluster sizes are:

    WHEN cluster_size = 1 THEN '1'
    WHEN cluster_size = 2 THEN '2'
    WHEN cluster_size = 3 THEN '3'
    WHEN cluster_size = 4 THEN '4'
    WHEN cluster_size = 5 THEN '5'
    WHEN cluster_size > 5 AND cluster_size <= 10 THEN '6-10'
    WHEN cluster_size > 10 AND cluster_size <= 20 THEN '11-20'
    WHEN cluster_size > 20 AND cluster_size <= 63 THEN '21-63'
    -- Supersized clusters
    WHEN cluster_size > 63 AND cluster_size <= 100 THEN '64-100'
    WHEN cluster_size > 100 AND cluster_size <= 1000 THEN '101-1000'
    WHEN cluster_size > 1000 AND cluster_size <= 10000 THEN '1001-10000'
    WHEN cluster_size > 10000 THEN '10001+'
    

    For example, to use cluster sizes of 6, 7, 8, 9, update to:

    WHEN cluster_size = 1 THEN '1'
    WHEN cluster_size = 2 THEN '2'
    WHEN cluster_size = 3 THEN '3'
    WHEN cluster_size = 4 THEN '4'
    WHEN cluster_size = 5 THEN '5'
    WHEN cluster_size = 6 THEN '6'
    WHEN cluster_size = 7 THEN '7'
    WHEN cluster_size = 8 THEN '8'
    WHEN cluster_size = 9 THEN '9'
    WHEN cluster_size > 5 AND cluster_size <= 10 THEN '6-10'
    WHEN cluster_size > 10 AND cluster_size <= 20 THEN '11-20'
    WHEN cluster_size > 20 AND cluster_size <= 63 THEN '21-63'
    -- Supersized clusters
    WHEN cluster_size > 63 AND cluster_size <= 100 THEN '64-100'
    WHEN cluster_size > 100 AND cluster_size <= 1000 THEN '101-1000'
    WHEN cluster_size > 1000 AND cluster_size <= 10000 THEN '1001-10000'
    WHEN cluster_size > 10000 THEN '10001+'
    

    or to use a cluster size of 5000:

    WHEN cluster_size = 1 THEN '1'
    WHEN cluster_size = 2 THEN '2'
    WHEN cluster_size = 3 THEN '3'
    WHEN cluster_size = 4 THEN '4'
    WHEN cluster_size = 5 THEN '5'
    WHEN cluster_size > 5 AND cluster_size <= 10 THEN '6-10'
    WHEN cluster_size > 10 AND cluster_size <= 20 THEN '11-20'
    WHEN cluster_size > 20 AND cluster_size <= 63 THEN '21-63'
    -- Supersized clusters
    WHEN cluster_size > 63 AND cluster_size <= 100 THEN '64-100'
    WHEN cluster_size > 100 AND cluster_size <= 1000 THEN '101-1000'
    WHEN cluster_size > 1000 AND cluster_size <= 5000 THEN '1001-5000'
    WHEN cluster_size > 5000 AND cluster_size <= 10000 THEN '5001-10000'
    WHEN cluster_size > 10000 THEN '10001+'
    

    Note

    If you update the cluster sizes, make the same updates to the CASE statement below that is used for ordering rows correctly.

  4. Run the query, and then review the cluster size distribution.

  5. The threshold for supersized clusters is specified in the CASE statements to be greater than 63. Uncomment the following SELECT statement:

    SELECT *
    FROM cluster_sizes
    WHERE cluster_size > 63
    ORDER BY cluster_size DESC
    

    and then re-run the query to return only supersized clusters.

    Tip

    Depending on the type of investigation you want to perform, re-commenting this part of the query may be helpful.

  6. Click Run Query and debug any issues that may arise.

  7. Click Activate.