This doesn't make sense

Hello, folks. I hope you’re doing great.

For the Course Project, I decided to analyze Football (soccer) matches from 1872 to 2020. I’m trying to answer two specific questions: if being home team have some effect on winning (people often said the fans are the 13th player) and if altitude (as in elevation, in meters above the sea level) does play a role or not when two teams compete agains each other (say Bolivia vs Uruguay). Anyway…

For the Elevation question, I’m using the dataset ‘results.cvs’ (footbal matches) and ‘global_data’ (cities and elevation) and trying to merge both dataframes using .merge using ‘city’. Ideally, I would like to extract the altitude of a city from global_data and asign it to the same city in results_df.

Thing is when I merge both DF, I end up having MORE rows than originally had.

Results_df, +41.000 records = date, home_team, away_team, home_score, away_score, tournament city, country, neutral.
City_data_df, +13.000 records = city, continent, elevation
Merged_df = 56.000 records (WHAAAAT)

Even if i drop rows based on NaNs in the date colum, I end up having almos 4.000 more records than I should.

Can you look it up my notebook and tell me what I’m doing wrong? I’ve googled and stackoverflowed but I can’t find the reason why this is happening.

Thanks in advance and sorry for the clickbaity title :slight_smile:

The reason for the extra rows is because there are cities with the same name on different continents in the city_data.csv file. If you run the following code in your notebook, you will notice that the number of rows doubles due to two cities named London in the dataset.

test_df = results_df.loc[results_df['city'] == 'London']
merged_df = test_df.merge(city_data_df, on="city")

The solution is to add a continent column to the results dataframe and then merge using the combined city and continent columns which you can see with the following example.

test_df = results_df.loc[results_df['city'] == 'London']
test_df['Continente'] = "Europe"
merged_df = test_df.merge(city_data_df, on=['city', 'Continente'])

Hey, man. Thanks for your time!
I’m gonna test it and see how it goes

1 Like