In [ ]:
!pip install jovian --upgrade --quiet

US Accidents Exploratory Data Analysis

This dataset is used to analyze and explore accidents across the US, with the aim of helping prevent accidents in the near future.

Select a large real-world dataset from Kaggle

In [ ]:
pip install opendatasets --upgrade
Collecting opendatasets
  Downloading https://files.pythonhosted.org/packages/18/99/aaa3ebec81dc347302e730e0daff61735ed2f3e736129553fb3f9bf67ed3/opendatasets-0.1.10-py3-none-any.whl
Requirement already satisfied, skipping upgrade: kaggle in /usr/local/lib/python3.7/dist-packages (from opendatasets) (1.5.10)
Requirement already satisfied, skipping upgrade: click in /usr/local/lib/python3.7/dist-packages (from opendatasets) (7.1.2)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.7/dist-packages (from opendatasets) (4.41.1)
Requirement already satisfied, skipping upgrade: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (1.24.3)
Requirement already satisfied, skipping upgrade: requests in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (2.23.0)
Requirement already satisfied, skipping upgrade: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (2.8.1)
Requirement already satisfied, skipping upgrade: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (1.15.0)
Requirement already satisfied, skipping upgrade: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (2020.12.5)
Requirement already satisfied, skipping upgrade: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle->opendatasets) (4.0.1)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle->opendatasets) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle->opendatasets) (2.10)
Requirement already satisfied, skipping upgrade: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle->opendatasets) (1.3)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.10
In [ ]:
import opendatasets as od
download_url = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'
od.download(download_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: ahamedbasha786
Your Kaggle Key: ··········
4%|▍ | 13.0M/299M [00:00<00:02, 132MB/s]
Downloading us-accidents.zip to ./us-accidents
100%|██████████| 299M/299M [00:01<00:00, 170MB/s]
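opendatasets extracts the archive into the ./us-accidents folder. A small sketch (not part of the original run) to confirm the name of the extracted file:

In [ ]:
import os
# List the files extracted from the downloaded archive
os.listdir('./us-accidents')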
In [ ]:
data_filename = './us-accidents/US_Accidents_Dec20.csv'

Perform data preparation & cleaning using Pandas & Numpy

  1. Load Files using Pandas
  2. Look at some information about Data
  3. Fix missing and incorrect data
In [ ]:
import pandas as pd
df = pd.read_csv(data_filename)
In [ ]:
df
Out[]:
In [ ]:
df.columns
Out[]:
Index(['ID', 'Source', 'TMC', 'Severity', 'Start_Time', 'End_Time',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)',
       'Description', 'Number', 'Street', 'Side', 'City', 'County', 'State',
       'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp',
       'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)',
       'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')
In [ ]:
len(df)
Out[]:
4232541
In [ ]:
len(df.columns)
Out[]:
49
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4232541 entries, 0 to 4232540
Data columns (total 49 columns):
 #   Column                 Dtype
---  ------                 -----
 0   ID                     object
 1   Source                 object
 2   TMC                    float64
 3   Severity               int64
 4   Start_Time             object
 5   End_Time               object
 6   Start_Lat              float64
 7   Start_Lng              float64
 8   End_Lat                float64
 9   End_Lng                float64
 10  Distance(mi)           float64
 11  Description            object
 12  Number                 float64
 13  Street                 object
 14  Side                   object
 15  City                   object
 16  County                 object
 17  State                  object
 18  Zipcode                object
 19  Country                object
 20  Timezone               object
 21  Airport_Code           object
 22  Weather_Timestamp      object
 23  Temperature(F)         float64
 24  Wind_Chill(F)          float64
 25  Humidity(%)            float64
 26  Pressure(in)           float64
 27  Visibility(mi)         float64
 28  Wind_Direction         object
 29  Wind_Speed(mph)        float64
 30  Precipitation(in)      float64
 31  Weather_Condition      object
 32  Amenity                bool
 33  Bump                   bool
 34  Crossing               bool
 35  Give_Way               bool
 36  Junction               bool
 37  No_Exit                bool
 38  Railway                bool
 39  Roundabout             bool
 40  Station                bool
 41  Stop                   bool
 42  Traffic_Calming        bool
 43  Traffic_Signal         bool
 44  Turning_Loop           bool
 45  Sunrise_Sunset         object
 46  Civil_Twilight         object
 47  Nautical_Twilight      object
 48  Astronomical_Twilight  object
dtypes: bool(13), float64(14), int64(1), object(21)
memory usage: 1.2+ GB
In [ ]:
df.describe()
Out[]:
In [ ]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_df = df.select_dtypes(include=numerics)
len(numeric_df.columns)  # count how many numeric columns the DataFrame has
Out[]:
15
In [ ]:
df.isna()
Out[]:
In [ ]:
df.isna().sum()
Out[]:
ID                             0
Source                         0
TMC                      1516064
Severity                       0
Start_Time                     0
End_Time                       0
Start_Lat                      0
Start_Lng                      0
End_Lat                  2716477
End_Lng                  2716477
Distance(mi)                   0
Description                    2
Number                   2687949
Street                         0
Side                           0
City                         137
County                         0
State                          0
Zipcode                     1292
Country                        0
Timezone                    4615
Airport_Code                8973
Weather_Timestamp          62644
Temperature(F)             89900
Wind_Chill(F)            1896001
Humidity(%)                95467
Pressure(in)               76384
Visibility(mi)             98668
Wind_Direction             83611
Wind_Speed(mph)           479326
Precipitation(in)        2065589
Weather_Condition          98383
Amenity                        0
Bump                           0
Crossing                       0
Give_Way                       0
Junction                       0
No_Exit                        0
Railway                        0
Roundabout                     0
Station                        0
Stop                           0
Traffic_Calming                0
Traffic_Signal                 0
Turning_Loop                   0
Sunrise_Sunset               141
Civil_Twilight               141
Nautical_Twilight            141
Astronomical_Twilight        141
dtype: int64
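A large share of some columns is missing (Number, Precipitation(in), Wind_Chill(F) and End_Lat/End_Lng in particular). One possible way to act on step 3 above, as a rough sketch — the 40% threshold and the variable names here are assumptions, not choices made in this notebook:

In [ ]:
# Fraction of missing values per column, largest first
missing_pct = (df.isna().sum() / len(df)).sort_values(ascending=False)

# One possible cleaning rule (assumed threshold): drop columns that are
# missing in more than 40% of rows rather than trying to impute them
cols_to_drop = missing_pct[missing_pct > 0.4].index
df_cleaned = df.drop(columns=cols_to_drop)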
In [ ]:
 
In [ ]:
 

Perform exploratory analysis & visualization using Matplotlib & Seaborn

Columns we will analyze:

  1. City
  2. Start Time
  3. Start Lat
  4. Start Lng
  5. Temperature
  6. Weather Condition

Since we are going to use only these, let's remove the unnecessary columns (one way to do this is sketched below).
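A minimal sketch of that column selection, keeping only the analysis columns listed above (the rest of the notebook still works on the full DataFrame; the variable names here are illustrative):

In [ ]:
# Keep only the columns used in the analysis below
columns_of_interest = ['City', 'Start_Time', 'Start_Lat', 'Start_Lng',
                       'Temperature(F)', 'Weather_Condition']
df_selected = df[columns_of_interest]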

In [ ]:
df.columns
Out[]:
Index(['ID', 'Source', 'TMC', 'Severity', 'Start_Time', 'End_Time',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)',
       'Description', 'Number', 'Street', 'Side', 'City', 'County', 'State',
       'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp',
       'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)',
       'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')
In [ ]:
df.City
Out[]:
0                Dayton
1          Reynoldsburg
2          Williamsburg
3                Dayton
4                Dayton
               ...     
4232536       Riverside
4232537       San Diego
4232538          Orange
4232539     Culver City
4232540        Highland
Name: City, Length: 4232541, dtype: object
In [ ]:
df.City.value_counts()
Out[]:
Houston        114905
Los Angeles     92701
Charlotte       88887
Dallas          77303
Austin          70538
                ...  
Suquamish           1
Pe Ell              1
Pennville           1
Morse               1
Mc Clelland         1
Name: City, Length: 12250, dtype: int64
In [ ]:
df.Weather_Condition
Out[]:
0             Light Rain
1             Light Rain
2               Overcast
3          Mostly Cloudy
4          Mostly Cloudy
               ...      
4232536             Fair
4232537             Fair
4232538    Partly Cloudy
4232539             Fair
4232540             Fair
Name: Weather_Condition, Length: 4232541, dtype: object
In [ ]:
weather = df.Weather_Condition
import seaborn as sns
sns.set_theme()
# Create a visualization: the 10 most common weather conditions at the time of an accident
# (the original relplot call was copied from the seaborn docs and referred to columns of the
# example tips dataset, which do not exist here, so we plot counts of this column instead)
sns.countplot(y=weather, order=weather.value_counts()[:10].index)
In [ ]:
import seaborn as sns
In [ ]:
sns.set_style("darkgrid")
In [ ]:
top_accident_cities = df.City.value_counts()
top_accident_cities[:5]
Out[]:
Houston        114905
Los Angeles     92701
Charlotte       88887
Dallas          77303
Austin          70538
Name: City, dtype: int64
In [ ]:
top5 = top_accident_cities[:5]
top5.plot(kind='barh')
Out[]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc460f0250>
Notebook Image
In [ ]:
sns.set_style("darkgrid")
sns.distplot(top_accident_cities)  # distribution of accident counts across all cities
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Out[]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc44c64350>
Notebook Image

Start Time

In [ ]:
df.Start_Time
Out[]:
0          2016-02-08 05:46:00
1          2016-02-08 06:07:59
2          2016-02-08 06:49:27
3          2016-02-08 07:23:34
4          2016-02-08 07:39:07
                  ...         
4232536    2019-08-23 18:03:25
4232537    2019-08-23 19:11:30
4232538    2019-08-23 19:00:21
4232539    2019-08-23 19:00:21
4232540    2019-08-23 18:52:06
Name: Start_Time, Length: 4232541, dtype: object
In [ ]:
df.Start_Time = pd.to_datetime(df.Start_Time)
In [ ]:
df.Start_Time[0]
Out[]:
Timestamp('2016-02-08 05:46:00')
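As an aside, the same conversion can be done while loading the file, since pandas can parse date columns at read time. A sketch (this re-reads the full CSV, so it is shown only as an option):

In [ ]:
# Alternative: parse Start_Time while reading the CSV instead of converting afterwards
df_parsed = pd.read_csv(data_filename, parse_dates=['Start_Time'])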
In [ ]:
df.Start_Time.dt.hour
Out[]:
0           5
1           6
2           6
3           7
4           7
           ..
4232536    18
4232537    19
4232538    19
4232539    19
4232540    18
Name: Start_Time, Length: 4232541, dtype: int64
In [ ]:
sns.histplot(df.Start_Time.dt.hour, bins=24)
Out[]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc39a6c210>
Notebook Image
In [ ]:
sns.distplot(df.Start_Time.dt.dayofweek, bins=7, kde=False, norm_hist=True)
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Out[]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc39a5b150>
Notebook Image
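The same idea extends to other parts of the timestamp; for example, the monthly pattern (a sketch along the same lines, not run in the original notebook):

In [ ]:
# Distribution of accidents across months of the year
sns.histplot(df.Start_Time.dt.month, bins=12)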
In [ ]:
df.columns
Out[]:
Index(['ID', 'Source', 'TMC', 'Severity', 'Start_Time', 'End_Time',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)',
       'Description', 'Number', 'Street', 'Side', 'City', 'County', 'State',
       'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp',
       'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)',
       'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')
In [ ]:
# Column names that contain parentheses cannot be accessed as attributes;
# use bracket notation instead
df['Temperature(F)']
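The temperature column also feeds into Question 2 below (warmer vs. colder areas). A simple sketch of its distribution — the bin count is an arbitrary choice:

In [ ]:
# Distribution of temperature (F) at the time of the accident
sns.histplot(df['Temperature(F)'].dropna(), bins=40)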

Summarize your inferences & write a conclusion

In [ ]:
 
In [ ]:
 

Questions and Answers

  1. How many numeric columns do we have?
  2. Do accidents occur more in warmer or colder areas?
  3. Which top 5 states have the most accidents? (a sketch follows this list)
  4. Which top 5 cities have the most accidents?
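Question 3 is not answered by the cells below (they cover the cities of Question 4), so here is a sketch of how the state-level counts could be obtained in the same way:

In [ ]:
# Answer sketch for Question 3: top 5 states by number of accidents
top_accident_states = df.State.value_counts()
top_accident_states[:5]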
In [ ]:
 
In [ ]:
# Answer for Question 4: top 5 cities with the most accidents
top_accident_cities = df.City.value_counts()
top_accident_cities[:5]
Out[]:
Houston        114905
Los Angeles     92701
Charlotte       88887
Dallas          77303
Austin          70538
Name: City, dtype: int64
In [ ]:
top5 = top_accident_cities[:5]
top5.plot(kind='barh')
Out[]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc6e8f41d0>
Notebook Image
In [80]:
import jovian
In [ ]:
jovian.commit()
[jovian] Detected Colab notebook...
[jovian] Please enter your API key ( from https://jovian.ai/ ): API KEY: ··········
[jovian] Error: The current API key is invalid or expired.
[jovian] Please enter your API key ( from https://jovian.ai/ ): API KEY:
In [ ]: