I am using the Solar Power Generation Dataset.
I am trying to merge two CSV files using pandas (and save the output so it can be cached), but pandas uses too much RAM (>30 GB), and the saved output takes too much storage (a ~30 GB CSV). This is repeatable, and it also takes a very long time to execute. The files I am merging are the solar plant’s generation data and its weather sensor data.
The input files are each smaller than 6 MB.
Warning: reproducing this realistically requires a local machine with more than 64 GB of RAM and more than 65 GB of free storage.
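import pandas as pd

# Load the generation data and the weather sensor data for Plant 1
temp1 = pd.read_csv(r"data/solar-power-generation/Plant_1_Generation_Data.csv")
temp2 = pd.read_csv(r"data/solar-power-generation/Plant_1_Weather_Sensor_Data.csv").drop(columns=["SOURCE_KEY"])

# Merge on PLANT_ID, then drop the key column
merged = pd.merge(temp1, temp2, on="PLANT_ID").drop(columns="PLANT_ID")

# Save the merged result so it can be cached
merged.to_csv(r"data/solar-power-generation/Plant_1_Merged.csv")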
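
One way to check how large the merge output will be before running it is to count how often each key value occurs on both sides. The sketch below is only an illustration on toy data: the helper name `estimated_merge_rows` and the columns `gen_value` and `temp_value` are hypothetical and not part of the dataset. It relies on the fact that an inner merge emits, for every key value present in both frames, (matching rows on the left) × (matching rows on the right) output rows.

```python
import pandas as pd


def estimated_merge_rows(left, right, key):
    """Predict the row count of an inner merge on `key` without performing it.

    Hypothetical helper, written for illustration only.
    """
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # Keys missing from one side contribute 0 rows; keys present on both
    # sides contribute (left occurrences) * (right occurrences) rows.
    return int(left_counts.mul(right_counts, fill_value=0).sum())


# Toy frames (made-up values) in which every row shares the same PLANT_ID
gen = pd.DataFrame({"PLANT_ID": [1] * 4, "gen_value": [0.0, 1.5, 2.5, 3.0]})
weather = pd.DataFrame({"PLANT_ID": [1] * 3, "temp_value": [25.0, 26.0, 27.0]})

predicted = estimated_merge_rows(gen, weather, "PLANT_ID")
actual = len(pd.merge(gen, weather, on="PLANT_ID"))
print(predicted, actual)  # 12 12
```

Calling the same helper on `temp1` and `temp2` with `"PLANT_ID"` as the key would give the row count to expect from the merge above. If `PLANT_ID` holds the same value on every row of both files, that count is `len(temp1) * len(temp2)`, which would be consistent with small inputs producing a very large output.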