Pandas Merge not working as expected

left_df.shape is (975, 2), columns: date, cases
right_df.shape is (863, 2), columns: date, deaths

outer_merge = pd.merge(left_df, right_df, on='date', how='outer', indicator=True)

outer_merge.shape gives (3519, 4).
Why is this? Shouldn't it give 975 + 863, i.e. 1838 rows? Can anyone explain? Thanks in advance.

  1. Are the dates in the same format? (A quick check is sketched below.)
  2. Are you sure you want to use an outer join?
  3. You may need to remove on='date'. That is probably why you have 4 columns (date_x, date_y, cases and deaths - just a guess).
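
For the first point, a minimal sketch of the check, using the frame names from the question:

```python
# Confirm both key columns have the same dtype before merging;
# a string column on one side and datetime64 on the other will not line up
print(left_df["date"].dtype, right_df["date"].dtype)

# Peek at a few values to spot format mismatches (times, timezones, etc.)
print(left_df["date"].head())
print(right_df["date"].head())
```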

covid_cases_df.info() is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    975 non-null    datetime64[ns]
 1   cases   975 non-null    int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 15.4 KB

covid_deaths_df.info() is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 863 entries, 0 to 862
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    863 non-null    datetime64[ns]
 1   deaths  862 non-null    float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 13.6 KB

outer_merge = pd.merge(covid_cases_df, covid_deaths_df, on='date', how='outer', indicator=True)
outer_merge.shape is (3519, 4).

How do I merge the 2 datasets on date as the common column?

left_df.merge(right_df)

Just try this. I don't see why you're trying to use an outer join or why you need an indicator.

To be honest, an outer join doesn't make sense here. You want the cases merged based on the date, so the dates have to be present in both dataframes.

I want to join the cases_df and death_df into one, based on the common column 'date'.
So if there is a case entry on a date, the merge has to add the death entry for that date.

I tried this: merged_df = covid_cases_df.merge(covid_deaths_df)

Int64Index: 3446 entries, 0 to 3445
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    3446 non-null   datetime64[ns]
 1   cases   3446 non-null   int64
 2   deaths  3442 non-null   float64
dtypes: datetime64[ns](1), float64(1), int64(1)

Can you explain why I am getting 3446 rows?

Trying to comprehend what’s going on, because this number doesn’t make sense, given the 975 and 863 rows from both dataframes.

With an inner join (as in the code you've posted) you should get at most 863 rows (if all dates from the deaths dataframe are present in cases). If some dates were missing, you would get an even smaller dataframe.

Are you sure they aren’t modified somewhere?
Try checking them before doing merge.

Suspicion:
I have a feeling that the date column contains not only dates, but also times at which the number of cases/deaths was recorded.
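
One way to test both suspicions (a minimal sketch, assuming covid_cases_df as in the question; the same checks apply to the deaths frame):

```python
# True if any timestamp carries a non-midnight time component
print((covid_cases_df["date"] != covid_cases_df["date"].dt.normalize()).any())

# Number of repeated dates -- duplicates on the key multiply rows in a merge
print(covid_cases_df["date"].duplicated().sum())
```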

Exactly. This is what I am confused about as the merged dataset numbers don’t make sense. I will try looking at the dataset again. Appreciate your help.

covid_cases_df['date'].describe() gives:

count                     975
unique                    269
top       2020-04-13 00:00:00
freq                        4
first     2020-01-16 00:00:00
last      2020-10-10 00:00:00
Name: date, dtype: object

I have converted the date column to datetime dtype, and df.info() shows the dtype is now datetime64 instead of object. So why does the describe() output still show the dtype as object, and why can I see the time 00:00:00? How do I work around this? Any advice?

Hi @rl2000, the answer is in your question itself: describe() returns count (number of entries), unique (number of unique values), top, freq, first, and last as a pandas Series, and those values are of mixed types, so they cannot be expressed as datetime or any other single dtype.

As a rule of thumb, whichever column dtype you apply describe() to, the resulting Series has dtype object, as is evident from the mixed values it holds.
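
A minimal sketch illustrating the point (the exact fields describe() reports depend on the pandas version, but the mixed-type result is the same):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-16", "2020-04-13", "2020-04-13"]))

print(dates.dtype)    # datetime64[ns] -- the column itself is datetime
summary = dates.describe()
print(summary.dtype)  # object -- the summary mixes integer counts and Timestamps
```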


@rl2000, @Sebgolos interesting question, but the answer lies in how a full outer join works. For every date value in left_df that is also present in right_df, a row is added; if a date value is not present in right_df, a row is still added, with the deaths column set to null. Then, for every date value in right_df that is not present in left_df, a row is added with the cases column set to null.

From this we can conclude that the extra rows in your dataframe are due to non-matching date values present in left_df and right_df. We can also find those non-matching date values.

You can check the number of null values in a column by chaining isnull().value_counts() on the cases and deaths columns of the outer_merge dataframe.

Hint: outer_merge['cases'].isnull().value_counts()
outer_merge['deaths'].isnull().value_counts()
The null value count is the number shown beside True in the output of the hints above.
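
Since the merge was done with indicator=True, the _merge column gives the same breakdown directly; a quick check:

```python
# indicator=True adds a categorical _merge column with values
# 'left_only', 'right_only' and 'both'; its counts show how many rows
# came from only one side vs. matched in both frames
print(outer_merge["_merge"].value_counts())
```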


@maheshvarma-dommaraj
Even with an outer join this wouldn't give over 3k rows.

With fully disjoint sets (each dataframe containing only dates that aren't present in the other), it would still give just 1838 rows.
The new dataframe would contain all dates from cases with an empty deaths column (NA everywhere) and all dates from deaths with an empty cases column. That gives 975 + 863 rows.
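
A minimal sketch of that worst case, with made-up dates:

```python
import pandas as pd

a = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "cases": [1, 2]})
b = pd.DataFrame({"date": ["2020-02-01"], "deaths": [9]})

out = a.merge(b, on="date", how="outer")
print(len(out))  # 3 == len(a) + len(b): no shared dates, no repeated dates
```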

@maheshvarma-dommaraj Thanks. Makes sense. Is this what you are saying:
merged_df = df1.merge(df2)

As I am a visual person I have tried to articulate your reply below:

index   df1.key   df2.match
  0       1         1   <-- match
  1       2         8
  2       3         9
  3       NaN       1   <-- also matches key 1

So index 0, key 1, will be repeated as 2 rows.
It will be repeated for every match it finds in df2.
And hence I am getting so many rows on merging.
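
That intuition is easy to confirm with a tiny made-up example: a key that appears m times on the left and n times on the right produces m x n rows in the merge.

```python
import pandas as pd

df1 = pd.DataFrame({"date": ["d1", "d1", "d2"], "cases": [5, 6, 7]})
df2 = pd.DataFrame({"date": ["d1", "d1"], "deaths": [1, 2]})

merged = df1.merge(df2, on="date")
print(len(merged))  # 4 rows: "d1" appears 2 x 2 = 4 times, "d2" has no match
```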

The docs say for left joins: "left: use only keys from left frame".
Therefore, the actual number of rows in the merged df is not necessarily the same as the number of rows in the left df. @Sebgolos, not sure if I am right, but this is why I think my rows have increased.

There won’t be duplicated rows if the date is present in both dataframes.

I highly doubt something like this will happen:

| date       | cases | deaths |
|------------|-------|--------|
| 2020-03-01 | 5     | NA     |
| 2020-03-01 | NA    | 7      |
| 2020-03-01 | 5     | 7      |
| 2020-03-02 | NA    | 5      |
| 2020-03-03 | 10    | NA     |

If there’s a row with the same date then it will merge into one:

| date       | cases | deaths |
|------------|-------|--------|
| 2020-03-01 | 5     | 7      |
| 2020-03-02 | NA    | 5      |
| 2020-03-03 | 10    | NA     |

The problem may arise when your dates have times too (because it’s datetime64 type).
In that case 2020-03-01 12:01:56 and 2020-03-01 12:01:57 are considered to be 2 different values.
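
If that turns out to be the problem, a sketch of one workaround (assuming the columns are already datetime64) is to strip the time component before merging:

```python
# .dt.normalize() keeps the datetime64 dtype but sets every time to midnight,
# so readings taken at different times on the same day can match
covid_cases_df["date"] = covid_cases_df["date"].dt.normalize()
covid_deaths_df["date"] = covid_deaths_df["date"].dt.normalize()
```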

@Sebgolos if the sets are disjoint, yes, you are correct.
And yes, the row count should stay below 1838 when the dataframes are grouped by the date column so that each date appears only once, but in the problem above I don't think that is the case.
When the sets are not disjoint and the dataframes are not grouped by date, a single date may appear many times in a dataframe, and then the answer is no: a date from left_df can match many dates in right_df, and for every match a row is created.
I could provide more clarity if I had the dataframe @rl2000 is using; sharing it would be great.
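
If the goal is one row per date, a sketch of that grouping step (assuming that summing the multiple entries per date is the right aggregation for this dataset, which only @rl2000 can confirm):

```python
# Collapse duplicate dates first so the subsequent merge is one-to-one
cases = covid_cases_df.groupby("date", as_index=False)["cases"].sum()
deaths = covid_deaths_df.groupby("date", as_index=False)["deaths"].sum()

merged = cases.merge(deaths, on="date", how="outer")
print(len(merged))  # bounded by the number of unique dates across both frames
```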

@rl2000 can you share the link to the notebook containing the code and dataset?

@maheshvarma-dommaraj @Sebgolos Here is the link to the dataset. It's on Kaggle; to access it you have to sign in to Kaggle. Not sure if there is another way to share my notebook here.
https://www.kaggle.com/rupall/covid-uk-new