Genderprediction - Notebook by Abhishek Zope (abhishekzope686)

Updated 4 years ago

1. Sound it out!

Grey and Gray. Colour and Color. Words like these have caused many heated arguments between Brits and Americans. Accents (and jokes) aside, many words are pronounced the same way but spelled differently. While it is easy for us to recognize their equivalence, a basic string comparison will fail to equate two such strings.

Names are a more extreme case than ordinary words because people have more freedom in choosing how to spell them. Tradition governs the spelling of names to some extent, which limits the number of variations of any given English name. But once we consider global names and their associated English spellings, the number of possible spellings grows quickly.

One way to tackle this challenge is to write a program that checks if two strings sound the same, instead of checking for equivalence in spellings. We'll do that here using fuzzy name matching.

import fuzzy

# NYSIIS gives names that sound alike the same phonetic code.
fuzzy.nysiis('Tufoule') == fuzzy.nysiis('Tufool')

True
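
To make the idea concrete without any third-party package, here is a minimal sketch of a different and simpler phonetic algorithm, American Soundex, written in plain Python. It is a stand-in for illustration only and is not the NYSIIS encoding that fuzzy.nysiis computes; the function names soundex and sounds_alike are our own.

```python
# Consonant groups used by American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def sounds_alike(a, b):
    """Compare two strings by phonetic code instead of by spelling."""
    return soundex(a) == soundex(b)


print("Grey" == "Gray")                  # False
print(sounds_alike("Grey", "Gray"))      # True
print(sounds_alike("Colour", "Color"))   # True
```

Soundex and NYSIIS follow different rules, so the two can disagree on some pairs of names.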

2. Authoring the authors

The New York Times publishes a weekly list of best-selling books in different genres, and has done so since the 1930s. We'll focus on Children's Picture Books and analyze the gender distribution of authors to see whether it has changed over time. We'll begin by reading in the data on the best-selling authors from 2008 to 2017.

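import pandas as pd

# Read the yearly best-seller data (semicolon-delimited file).
author_df = pd.read_csv('datasets/nytkids_yearly.csv', delimiter=';')

# Take the first whitespace-separated token of each author name as the first name.
first_name = []
for name in author_df['Author']:
    first_name.append(name.split()[0])
author_df['first_name'] = first_name

author_df.head()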

3. It's time to bring on the phonics... again!

When we were young children, we were taught to read using phonics: sounding out the letters that make up words. So let's relive history and do that again, this time using Python. We will now create a new column or list that contains the phonetic equivalent of every first name that we just extracted.
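
A minimal sketch of that step, using a plain list so it runs without pandas. The helper phonetic_codes and the encoder crude_sound_key are hypothetical names of our own. The crude key is not NYSIIS; it is only a placeholder so the example runs, and in the notebook fuzzy.nysiis would be passed as the encoder instead.

```python
def crude_sound_key(name):
    """Placeholder encoder (NOT NYSIIS): keep the first letter, drop later
    vowels, and collapse consecutive duplicate letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for ch in letters[1:]:
        if ch in "AEIOU":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def phonetic_codes(names, encode):
    """Apply a phonetic encoder to every name, preserving order."""
    return [encode(name) for name in names]


# Made-up sample names, for illustration only.
first_names = ["Grey", "Gray", "Anna", "Ana", "Mo"]
codes = phonetic_codes(first_names, crude_sound_key)
print(codes)
```

Because the result has one code per name, in the same order, it could be assigned to a new DataFrame column in the same way first_name was assigned earlier.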

To make sure we're on the right track, let's compare the number of unique values in the first_name column with the number of unique values in the NYSIIS-coded column. Because names that sound alike map to the same code, the number of unique NYSIIS codes should be less than or equal to the number of unique first names.
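
The check itself can be sketched with plain Python sets. The helper unique_counts is a hypothetical name, and the sample names and codes below are made-up placeholders, not real NYSIIS output.

```python
def unique_counts(names, codes):
    """Return (number of unique names, number of unique codes)."""
    if len(names) != len(codes):
        raise ValueError("names and codes must have the same length")
    return len(set(names)), len(set(codes))


# Placeholder data for illustration only; the codes are invented labels.
names = ["Jon", "John", "Sara", "Sarah", "Jon"]
codes = ["CODE_A", "CODE_A", "CODE_B", "CODE_B", "CODE_A"]

n_names, n_codes = unique_counts(names, codes)
print(n_names, n_codes)  # 4 2

# Each name gets exactly one code, so the codes can never outnumber the names.
assert n_codes <= n_names
```

With a DataFrame, calling nunique() on each of the two columns should give the same two counts, provided there are no missing values.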