TL;DR
This tutorial covers handling missing data in pandas using methods like fillna, interpolate, and dropna. It explains how to replace missing values with specific values, carry forward or backward values from other rows or columns, and perform linear interpolation. Additionally, it demonstrates how to drop rows with missing values based on different criteria and how to reindex a DataFrame to include missing dates.
- Replacing missing values using fillna with constant values or column-specific values.
- Using forward fill (ffill) and backward fill (bfill) to propagate values.
- Interpolating missing values using linear and time-based methods.
- Dropping rows with missing values using dropna based on different conditions.
- Reindexing a DataFrame to include missing dates.
Introduction [0:00]
The video introduces the problem of missing data in datasets, particularly when importing data from sources like the internet. It highlights a CSV file containing New York City weather data with missing values and missing dates. The tutorial aims to demonstrate how to handle these missing values in pandas using methods such as fillna, interpolate, and dropna.
Convert string column into the date type [2:30]
The video explains how to convert a string column representing dates into a datetime column using the parse_dates argument when reading the CSV file with pd.read_csv(). This conversion is crucial for performing time-based operations and analysis. The presenter shows how to verify the conversion by checking the data type of the column, ensuring it is a timestamp.
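The date conversion described above can be sketched as follows. The inline CSV text, the column names (day, temperature, windspeed, event), and the values are illustrative assumptions, not the contents of the file used in the video.

```python
import io

import pandas as pd

# Hypothetical stand-in for the weather CSV; names and values are made up.
csv_text = """day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-04,,9,Sunny
2017-01-05,28,,Snow
"""

# parse_dates asks read_csv to convert the listed columns to datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

# Verify the conversion: the dtype is a datetime64 variant and each
# element is a pandas Timestamp rather than a string.
print(df["day"].dtype)
print(type(df["day"][0]))
```

Without parse_dates, the same column would be read as plain strings, and date arithmetic or time-based interpolation would not work on it.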
Use date as an index of dataframe using set_index() method [3:15]
The video details how to set the 'Day' column as the index of the DataFrame using the set_index() method. Setting the date as the index is useful for time series analysis and makes it easier to perform operations based on dates. The inplace=True argument is used to modify the DataFrame directly.
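A minimal sketch of this step, assuming a small hand-built DataFrame in place of the weather data (the lowercase column name day and the values are assumptions):

```python
import pandas as pd

# Hypothetical data standing in for the weather DataFrame.
df = pd.DataFrame({
    "day": pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
    "temperature": [32.0, None, 28.0],
})

# inplace=True modifies df directly instead of returning a new DataFrame.
df.set_index("day", inplace=True)

# The index is now a DatetimeIndex, so rows can be selected by date.
print(df.index)
print(df.loc["2017-01-05"])
```

An equivalent form without inplace is df = df.set_index("day"), which some readers may find easier to follow because the assignment is explicit.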
Use fillna() method in dataframe [4:10]
The video explains how to use the fillna() method to replace missing values (NaN) in a DataFrame with a specified value. It demonstrates replacing all NaN values with zero. The presenter also shows how to replace missing values with different values for different columns by passing a dictionary to the fillna() method, where keys are column names and values are the replacement values.
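Both forms of fillna() described above can be sketched like this. The column names and the replacement string "no event" are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical weather-like data with gaps in every column.
df = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],
    "windspeed": [6.0, 9.0, np.nan],
    "event": ["Rain", None, "Snow"],
})

# One constant for every missing value. Shown on the numeric columns
# only, since a 0 in a text column such as event is rarely meaningful.
zero_filled = df[["temperature", "windspeed"]].fillna(0)

# A dictionary maps each column name to its own replacement value.
per_column = df.fillna({
    "temperature": 0,
    "windspeed": 0,
    "event": "no event",
})

print(zero_filled)
print(per_column)
```

fillna() returns a new DataFrame by default, so df itself still contains the original missing values after these calls.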
Use fillna(method="ffill") method in dataframe [7:35]
The video explains how to use the fillna() method with the method="ffill" argument to perform forward fill. Forward fill propagates the last valid value forward to fill the missing values. This method is useful when you want to carry forward the previous day's value to fill the missing data.
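A short sketch of forward fill on made-up temperature readings. Recent pandas releases deprecate the method= keyword of fillna(), so this example uses the dedicated ffill() method, which performs the same operation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with a two-day gap.
temps = pd.Series(
    [32.0, np.nan, np.nan, 28.0],
    index=pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]
    ),
    name="temperature",
)

# Each NaN takes the last valid value that appears before it.
filled = temps.ffill()
print(filled)
```

Here both missing days receive 32.0, the reading from 2017-01-01.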
Use fillna(method="bfill") method in dataframe [8:57]
The video explains how to use the fillna() method with the method="bfill" argument to perform backward fill. Backward fill propagates the next valid value backward to fill the missing values. It is the opposite of forward fill: the next day's value is copied into the missing entry.
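A matching sketch for backward fill, again with made-up values and using bfill() in place of the deprecated method= keyword. The trailing NaN is included to show that a gap with no later valid value is left unfilled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures; the last day has no later reading.
temps = pd.Series(
    [32.0, np.nan, 28.0, np.nan],
    index=pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]
    ),
    name="temperature",
)

# Each NaN takes the next valid value that appears after it.
filled = temps.bfill()
print(filled)
```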
"axis" parameter in fillna() method in dataframe [9:56]
The video explains the use of the axis parameter in the fillna() method. With axis="columns", the fill operation runs horizontally: during a forward fill, each missing value is copied from the previous column in the same row. This is useful when you want to fill missing values based on the values in other columns of the same row.
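A small sketch of a horizontal forward fill. The humidity column is a hypothetical addition, and copying a windspeed into it has no physical meaning; the example only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with gaps in different positions.
df = pd.DataFrame({
    "temperature": [32.0, np.nan],
    "windspeed": [6.0, 7.0],
    "humidity": [np.nan, 40.0],
})

# With axis="columns", each NaN is filled from the column to its left.
row_filled = df.ffill(axis="columns")
print(row_filled)
```

The missing humidity in the first row becomes 6.0, while the missing temperature in the second row stays NaN because there is no column to its left to copy from.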
"limit" parameter in fillna() method in dataframe [11:18]
The video explains the limit parameter in the fillna() method, which caps how many consecutive NaN values are filled during a forward or backward fill. For example, with limit=1 only the first NaN in each run of consecutive NaNs is filled, and the rest of the run stays missing.
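The effect of limit can be sketched on a made-up series with a two-value gap:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two consecutive missing values.
temps = pd.Series([32.0, np.nan, np.nan, 28.0], name="temperature")

# limit=1 fills at most one NaN per gap when filling forward.
limited = temps.ffill(limit=1)
print(limited)
```

The first NaN becomes 32.0 and the second remains NaN, which can be a reasonable choice when a value should not be carried too far from where it was observed.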
interpolate() to do interpolation in dataframe [13:46]
The video introduces the interpolate() method for estimating missing values using interpolation techniques. It demonstrates linear interpolation, which calculates intermediate values based on the values around the missing data points. For quantities that change gradually, such as temperature, this usually gives a more plausible estimate than filling with a constant value.
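A minimal sketch of the default linear interpolation, using made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value between two known ones.
temps = pd.Series([32.0, np.nan, 28.0], name="temperature")

# The default method is linear: the gap is filled on the straight line
# between its neighbours, treating the rows as equally spaced.
estimated = temps.interpolate()
print(estimated)
```

The missing value becomes 30.0, the midpoint of 32.0 and 28.0.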
interpolate() method "time" [15:34]
The video explains how to use the method="time" argument in the interpolate() method. This method considers the time distance between the data points when performing interpolation, providing a more accurate estimate when the data is time-dependent. It ensures that the interpolated values are closer to the values of nearer dates.
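The difference between the two methods can be sketched on made-up readings with unevenly spaced dates. method="time" needs a DatetimeIndex, which is one reason the date column is set as the index.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the missing date is 3 days after the first
# reading and only 1 day before the last one.
temps = pd.Series(
    [20.0, np.nan, 28.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
    name="temperature",
)

# Linear interpolation ignores the index and picks the midpoint.
by_position = temps.interpolate(method="linear")

# Time interpolation weights by elapsed time, so the estimate lands
# closer to the nearer date's value.
by_time = temps.interpolate(method="time")

print(by_position.iloc[1], by_time.iloc[1])
```

With these numbers the linear estimate is 24.0, while the time-based estimate is 26.0 because 2017-01-04 is three quarters of the way from the first date to the last.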
dropna() method: drop all the rows which have "na" in dataframe [16:50]
The video introduces the dropna() method, which removes rows containing missing values. By default, it drops any row that has at least one NaN value. This method is useful when you want to remove rows with incomplete data from your DataFrame.
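The default behaviour can be sketched on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one partly missing, one fully missing.
df = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan],
    "windspeed": [6.0, 9.0, np.nan],
})

# By default, any row containing at least one NaN is removed.
complete_rows = df.dropna()
print(complete_rows)
```

Only the first row survives. dropna() returns a new DataFrame, so df is unchanged unless the result is assigned back.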
"how" parameter in dropna() method [17:50]
The video explains the how parameter in the dropna() method. Setting how="all" will only drop rows where all values are NaN. This is useful when you want to remove rows that are completely empty but preserve rows with at least some valid data.
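A sketch of how="all" on the same kind of made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one partly missing, one fully missing.
df = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan],
    "windspeed": [6.0, 9.0, np.nan],
})

# how="all" removes a row only when every value in it is NaN.
not_empty = df.dropna(how="all")
print(not_empty)
```

The partly missing row is kept, and only the completely empty last row is dropped.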
"thresh" parameter in dropna() method [18:33]
The video explains the thresh parameter in the dropna() method, which sets a minimum number of non-NaN values required to keep a row. For example, thresh=1 will keep rows that have at least one non-NaN value. This allows you to control the dropping process based on the number of valid values in each row.
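A sketch of thresh with two different thresholds, again on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 2, 1, and 0 valid values respectively.
df = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan],
    "windspeed": [6.0, 9.0, np.nan],
})

# thresh=1 keeps rows with at least one non-NaN value.
at_least_one = df.dropna(thresh=1)

# thresh=2 keeps rows with at least two non-NaN values.
at_least_two = df.dropna(thresh=2)

print(at_least_one)
print(at_least_two)
```

With two columns, thresh=1 gives the same result as how="all", and thresh=2 gives the same result as the default dropna().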