pandas_utils
Utility functions for interacting with pandas.
Module
Functions
append_dataframe_to_csv
def append_dataframe_to_csv(csv_file: Union[str, os.PathLike], df: pd.DataFrame) -> pathlib.Path:
Append or write a dataframe to a CSV file.
Handles appending a dataframe to an already existing CSV file that may contain differing columns.
Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.
Arguments
csv_file
: The CSV file path to append/write to.
df
: The dataframe to append.
Returns
The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
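The column handling described above can be illustrated with plain pandas. This is a minimal sketch, not the module's implementation: `append_with_column_union` is a hypothetical name, it assumes the dataframe index is not written to the file, it loads the existing CSV into memory when the columns differ, and it does not attempt the fallback to a new file when the target is inaccessible.

```python
import os
import tempfile
from pathlib import Path
from typing import Union

import pandas as pd


def append_with_column_union(csv_file: Union[str, os.PathLike], df: pd.DataFrame) -> Path:
    """Hypothetical helper: append df to csv_file, widening the columns if needed."""
    path = Path(csv_file)
    if not path.exists():
        # Nothing to append to, so write a fresh file with a header.
        df.to_csv(path, index=False)
        return path

    # Read only the header row to learn the existing columns.
    existing_cols = list(pd.read_csv(path, nrows=0).columns)
    if existing_cols == list(df.columns):
        # Same columns in the same order: a plain append is enough.
        df.to_csv(path, mode="a", header=False, index=False)
        return path

    # Columns differ: keep the existing order, then add any new columns at the end.
    all_cols = existing_cols + [c for c in df.columns if c not in existing_cols]
    existing = pd.read_csv(path)
    combined = pd.concat(
        [existing.reindex(columns=all_cols), df.reindex(columns=all_cols)],
        ignore_index=True,
    )
    combined.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "records.csv"
    append_with_column_union(target, pd.DataFrame({"a": [1], "b": [2]}))
    append_with_column_union(target, pd.DataFrame({"b": [3], "c": [4]}))
    result = pd.read_csv(target)

print(result)
```

Rows that lack a column are left empty in the file, so they read back as missing values.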
calculate_age
def calculate_age(dob: pd.Timestamp | datetime | date, comparison_date: Optional[pd.Timestamp | datetime | date] = None) -> int:
Given a date of birth, calculate age at a target date.
If no target date is supplied, use today.
Arguments
dob
: Date of birth (should be a pandas Timestamp or a Python datetime/date).
comparison_date
: The date to calculate age at. Defaults to today.
Returns
The age at the target date.
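One common way to compute an age in whole years is to subtract the birth year from the target year and take off one more year if the birthday has not yet occurred. The sketch below shows that approach using only the standard library. `age_on` is a hypothetical name, and the module's own calculation may differ, for example in how it treats 29 February birthdays.

```python
from datetime import date, datetime
from typing import Optional, Union


def age_on(
    dob: Union[date, datetime],
    comparison_date: Optional[Union[date, datetime]] = None,
) -> int:
    """Hypothetical helper: whole years between dob and comparison_date."""
    target = comparison_date if comparison_date is not None else date.today()
    # The birthday has happened this year if (month, day) of the target
    # is on or after (month, day) of the date of birth.
    had_birthday = (target.month, target.day) >= (dob.month, dob.day)
    return target.year - dob.year - (0 if had_birthday else 1)


print(age_on(date(1990, 6, 15), date(2024, 6, 14)))  # 33 (day before the birthday)
print(age_on(date(1990, 6, 15), date(2024, 6, 15)))  # 34 (on the birthday)
```

Because `pd.Timestamp` subclasses `datetime`, the same function accepts pandas Timestamps. In this sketch a 29 February birthday is counted from 1 March in non-leap years.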
calculate_ages
def calculate_ages(dobs: pd.Series[pd.Timestamp | datetime | date] | TimestampSeries, comparison_date: Optional[pd.Timestamp | datetime | date] = None) -> pd.Series[int]:
Given a series of dates of birth, calculate ages at a target date.
If no target date is supplied, use today.
Arguments
dobs
: Series of dates of birth (should be pandas Timestamps or Python datetimes/dates).
comparison_date
: The date to calculate age at. Defaults to today.
Returns
A series of the ages at the target date.
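The same year-difference calculation can be vectorised over a series with the `.dt` accessor. The following is a sketch under assumptions rather than the module's code: `ages_on` is a hypothetical name, and it assumes the series contains no missing dates.

```python
from datetime import date

import pandas as pd


def ages_on(dobs: pd.Series, comparison_date=None) -> pd.Series:
    """Hypothetical helper: whole-year ages for every date of birth in dobs."""
    # Normalise Timestamps, datetimes, and dates to a datetime64 series.
    dobs = pd.to_datetime(dobs)
    if comparison_date is not None:
        target = pd.Timestamp(comparison_date)
    else:
        target = pd.Timestamp.today()
    # True where the birthday falls later in the year than the target date.
    before_birthday = (dobs.dt.month > target.month) | (
        (dobs.dt.month == target.month) & (dobs.dt.day > target.day)
    )
    return (target.year - dobs.dt.year - before_birthday.astype(int)).astype(int)


ages = ages_on(pd.Series([date(1990, 6, 15), date(2000, 1, 1)]), date(2024, 6, 14))
print(ages.tolist())  # [33, 24]
```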
conditional_dataframe_yielder
def conditional_dataframe_yielder(dfs: Iterable[pd.DataFrame], condition: Callable[[pd.DataFrame], pd.DataFrame], reset_index: bool = True) -> collections.abc.Generator:
Create a generator that conditionally yields rows from a set of dataframes.
This replicates the standard .loc conditional indexing that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.
Arguments
dfs
: An iterable of dataframes to conditionally yield rows from.
condition
: A callable that takes in a dataframe, applies a condition, and returns the edited/filtered dataframe.
reset_index
: Whether the index of the yielded dataframes should be reset. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
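The chunk-wise filtering and the running index described above can be sketched as a small generator. This is an illustration only: `filter_chunks` is a hypothetical name, and details such as whether empty results are yielded are assumptions, not documented behaviour.

```python
from typing import Callable, Iterable, Iterator

import pandas as pd


def filter_chunks(
    dfs: Iterable[pd.DataFrame],
    condition: Callable[[pd.DataFrame], pd.DataFrame],
    reset_index: bool = True,
) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: apply condition to each chunk and yield the result."""
    next_start = 0  # first index value to use for the next yielded chunk
    for df in dfs:
        filtered = condition(df)
        if reset_index:
            # Continue the integer index where the previous chunk stopped.
            new_index = pd.RangeIndex(next_start, next_start + len(filtered))
            filtered = filtered.set_axis(new_index, axis=0)
            next_start += len(filtered)
        yield filtered


chunks = [pd.DataFrame({"age": [10, 40]}), pd.DataFrame({"age": [55, 5, 70]})]
adults = list(filter_chunks(chunks, lambda df: df.loc[df["age"] > 18]))
print([list(chunk.index) for chunk in adults])  # [[0], [1, 2]]
```

In practice `dfs` would typically be the iterator returned by `pd.read_csv(..., chunksize=...)`, so only one chunk is held in memory at a time.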
dataframe_iterable_join
def dataframe_iterable_join(joiners: Iterable[pd.DataFrame], joinee: pd.DataFrame, reset_joiners_index: bool = False) -> collections.abc.Generator:
Performs a dataframe join against a collection of dataframes.
This replicates the standard .join() method that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.
For each joiner in joiners, this is equivalent to:
joiner.join(joinee)
Arguments
joiners
: The collection of dataframes that should be joined against the joinee.
joinee
: The single dataframe that the others should be joined against.
reset_joiners_index
: Whether the index of the joiners dataframes should be reset as they are processed. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
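A chunk-wise join of this kind can be sketched as a generator that calls `DataFrame.join` on each chunk in turn. `join_chunks` is a hypothetical name and this is only an approximation of the behaviour described above. It uses the default left join on the index.

```python
from typing import Iterable, Iterator

import pandas as pd


def join_chunks(
    joiners: Iterable[pd.DataFrame],
    joinee: pd.DataFrame,
    reset_joiners_index: bool = False,
) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: yield joiner.join(joinee) for each joiner."""
    next_start = 0  # first index value to use for the next joiner
    for joiner in joiners:
        if reset_joiners_index:
            # Give the chunks one continuous integer index before joining.
            new_index = pd.RangeIndex(next_start, next_start + len(joiner))
            joiner = joiner.set_axis(new_index, axis=0)
            next_start += len(joiner)
        yield joiner.join(joinee)


joinee = pd.DataFrame({"b": [10, 20, 30]})
# Both chunks start at index 0, as independently built dataframes would.
joiners = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
joined = pd.concat(join_chunks(joiners, joinee, reset_joiners_index=True))
print(joined)
```

With the running index, the second chunk is matched against row 2 of the joinee instead of row 0, which is what a join against the unchunked dataframe would have done.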
rewrite_csv_with_new_columns
def rewrite_csv_with_new_columns(csv_file: Union[str, os.PathLike], new_column_index: pd.Index) -> pathlib.Path:
Rewrite an existing dataframe CSV with a new set of columns.
This is useful when new columns need to be added to the CSV file. The function reads the original CSV in chunks, changes the column index, and writes the result to a new file. Once the new file has been fully written, it replaces the original CSV.
Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.
Arguments
csv_file
: The CSV file to rewrite.
new_column_index
: A pandas Index representing the new set of columns to use.
Returns
The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
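The read-in-chunks, write-to-a-new-file, then replace sequence described above can be sketched with `pd.read_csv(chunksize=...)`, a temporary file, and `os.replace`. This is a minimal sketch and not the module's implementation: `rewrite_with_columns` and its `chunksize` parameter are hypothetical, and the fallback to a different path when the target is inaccessible is not covered.

```python
import os
import tempfile
from pathlib import Path
from typing import Union

import pandas as pd


def rewrite_with_columns(
    csv_file: Union[str, os.PathLike],
    new_columns: pd.Index,
    chunksize: int = 10_000,
) -> Path:
    """Hypothetical helper: rewrite csv_file so that it has exactly new_columns."""
    path = Path(csv_file)
    # Write to a temporary file in the same directory so the final
    # os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".csv", dir=str(path.parent))
    os.close(fd)

    first = True
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # reindex adds missing columns (as empty values) and drops unlisted ones.
        chunk.reindex(columns=new_columns).to_csv(
            tmp_name, mode="w" if first else "a", header=first, index=False
        )
        first = False
    if first:
        # No chunks were read: still write the new header.
        pd.DataFrame(columns=new_columns).to_csv(tmp_name, index=False)

    os.replace(tmp_name, path)  # swap the new file in for the original
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "records.csv"
    pd.DataFrame({"a": [1, 3], "b": [2, 4]}).to_csv(target, index=False)
    rewrite_with_columns(target, pd.Index(["a", "b", "c"]), chunksize=1)
    rewritten = pd.read_csv(target)

print(rewritten)
```

Writing the complete new file before replacing the original means a failure part-way through leaves the original CSV untouched.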