Gender Prediction

Important

In many cases, product affinity or other behaviors provide greater insight for personalization than a name alone. Improper use of gender prediction can create a negative customer experience for any brand. Understand how this attribute will be used before implementation, and work closely with partners to ensure that this generated data is used responsibly.

Gender prediction can be a helpful part of the effort to apply personalization to marketing campaigns, email lists, and websites. When gender is known, it can be used as a signal for tailoring communications, recommendations, and product lists based on observed preferences that are common to people within that gender.

Important

Gender prediction must be enabled for use in Amperity. Contact your support representative via the Amperity Support Portal (or send email to support@amperity.com) to request adding gender prediction capabilities to your tenant.

Warning

When used carefully, gender prediction carries a low downside risk from false positives. However, gender prediction should not be used for 1:1 personalization, especially for the purpose of predicting pronouns (he, him, she, her, they, them), because the benefits of correctly predicting gender are, in most cases, outweighed by the high downside risk of being wrong.

Configure Gender Prediction

Stitch does not output gender prediction automatically. This functionality can be added by using existing data tagged with the given-name semantic. First, add a feed that contains data usable for predicting gender, and then update the customer 360 database to use SQL to associate predictions with an Amperity ID.

To configure Amperity for gender prediction

  1. Download the gender_name_ratios.csv file.

  2. Add a feed named “Gender” with a new source named “Predictions”. Upload the gender_name_ratios.csv file.

  3. Assign the primary key to the given_name column. (Do not make this table available to Stitch or apply any semantic tags to fields.)

  4. Activate the feed.

  5. Run Stitch.

  6. From the Customer 360 tab, edit the customer 360 database.

  7. Click Add Table, and then name the table Predictions_Gender.

  8. Choose SQL as the build mode. Add the following SELECT statement:

    SELECT
      mc.amperity_id,
      ratios.predicted_gender
    FROM Merged_Customers AS mc
    LEFT JOIN Predictions_Gender AS ratios
      ON UPPER(COALESCE(mc.given_name, split(mc.full_name,' ')[0])) = ratios.given_name
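
The join key in the SELECT statement above can be read as: use given_name when it is present, otherwise take the first space-separated token of full_name, and then uppercase the result to match the uppercase names in the file. The following is a minimal Python sketch of that logic for checking values outside of Amperity. The function name join_key is hypothetical, and None stands in for SQL NULL.

```python
from typing import Optional


def join_key(given_name: Optional[str], full_name: Optional[str]) -> Optional[str]:
    """Mirror UPPER(COALESCE(given_name, split(full_name, ' ')[0])).

    Hypothetical helper for illustration only. None plays the role of NULL.
    """
    if given_name is not None:
        # COALESCE returns the first non-NULL argument.
        name = given_name
    elif full_name is not None:
        # First token of the full name, split on a single space.
        name = full_name.split(" ")[0]
    else:
        # Both inputs are NULL, so the key is NULL and the join finds no match.
        return None
    return name.upper()


print(join_key("Therese", "Therese Example"))
print(join_key(None, "emilia example"))
print(join_key(None, None))
```

Because the query uses a LEFT JOIN, a record whose key does not match any name in the file is still returned, with predicted_gender left empty. Note that COALESCE only skips NULL values: an empty given_name is used as-is and does not fall back to full_name.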
    

Accuracy Threshold

The default accuracy threshold for gender prediction is ~95%. This means that every name in the file has at least a 20:1 likelihood of being associated with a specific gender. If a use case requires greater accuracy, add a custom threshold to the query:

WITH ratios AS (SELECT * FROM Predictions_Gender WHERE gender_name_ratio >= 100)

SELECT
  mc.amperity_id,
  ratios.predicted_gender
FROM Merged_Customers AS mc
LEFT JOIN ratios
  ON UPPER(COALESCE(mc.given_name, split(mc.full_name,' ')[0])) = ratios.given_name

where 100 is a 100:1 ratio, which represents a ~99% accuracy threshold for gender prediction.
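
The relationship between a ratio and an accuracy percentage follows from simple arithmetic: an N:1 ratio means that N out of every N + 1 records for a name belong to the predicted gender. The following is a minimal Python sketch of that conversion in both directions. The function names are hypothetical and are not part of Amperity.

```python
def ratio_to_accuracy(ratio: float) -> float:
    """Share of records that match the predicted gender for an N:1 ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio / (ratio + 1.0)


def accuracy_to_ratio(accuracy: float) -> float:
    """Smallest ratio that reaches the requested accuracy (0 < accuracy < 1)."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be between 0 and 1, exclusive")
    return accuracy / (1.0 - accuracy)


print(f"{ratio_to_accuracy(20):.1%}")   # 95.2%
print(f"{ratio_to_accuracy(100):.1%}")  # 99.0%
```

To choose a threshold for a different target, pass the target accuracy to accuracy_to_ratio and use the result as the gender_name_ratio value in the WHERE clause.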

Gender Name Ratios

The data in the gender_name_ratios.csv file comes from United States Social Security Administration records for the popularity and frequency of baby names from 1880 to 2018 <https://www.ssa.gov/oact/babynames/limits.html>.

These records describe more than 351 million baby names, along with their associated gender. These records were used to generate the gender_name_ratios.csv file, which is similar to:

given_name,predicted_gender,gender_name_ratio,male_count,female_count
EMILIA,F,7178.6,5,35893
THERESE,F,7025.0,5,35125
AILEEN,F,6969.8,5,34849
...
LINDSEY,F,20.2,7710,156111
MORRISON,M,20.2,1496,74
ROLLA,M,20.1,1306,65

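The most important column is gender_name_ratio, which is the ratio of the count for the predicted gender to the count for the other gender for that given_name. For example, EMILIA has 35893 female records and 5 male records, which is a ratio of 7178.6.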

The following filters were applied to this data set prior to generation:

  1. Only names with a gender name ratio greater than 20 were included. This ensures that any prediction has a ~95% chance of being correct based on the given name.

  2. Only names with at least 1000 male or female examples were included, which filters out very uncommon names.
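
The two filters above can be sketched in Python to show how a row in the file relates to the underlying counts. This is an illustration under stated assumptions, not the script used to generate gender_name_ratios.csv: the function name build_rows is hypothetical, "at least 1000 male or female examples" is interpreted as the larger of the two counts, and names with no records for the other gender are skipped to avoid dividing by zero (the source does not say how the real file treats them).

```python
MIN_RATIO = 20.0   # filter 1: ratio must be greater than 20
MIN_COUNT = 1000   # filter 2: at least 1000 examples (assumed: of the majority gender)


def build_rows(counts):
    """Turn {name: (male_count, female_count)} into filtered ratio rows.

    Returns tuples shaped like the CSV columns:
    (given_name, predicted_gender, gender_name_ratio, male_count, female_count)
    """
    rows = []
    for name, (male, female) in counts.items():
        major, minor = max(male, female), min(male, female)
        if major < MIN_COUNT:
            continue  # very uncommon name
        if minor == 0:
            continue  # assumption: skipped here, ratio is undefined
        ratio = major / minor
        if ratio <= MIN_RATIO:
            continue  # below the ~95% threshold
        gender = "M" if male > female else "F"
        rows.append((name.upper(), gender, round(ratio, 1), male, female))
    # Highest ratio first, matching the order of the sample rows.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# The first two entries reuse counts from the sample rows above.
# The last two are made-up values that exercise each filter.
sample_counts = {
    "Emilia": (5, 35893),
    "Rolla": (1306, 65),
    "Ambiguous": (6000, 4000),  # hypothetical: ratio 1.5, fails filter 1
    "Rare": (500, 2),           # hypothetical: fails filter 2
}

for row in build_rows(sample_counts):
    print(",".join(str(value) for value in row))
```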