pandas_utils

Utility functions for interacting with pandas.

Module

Functions

append_dataframe_to_csv

def append_dataframe_to_csv(csv_file: Union[str, os.PathLike], df: pd.DataFrame) -> pathlib.Path:

Append or write a dataframe to a CSV file.

Handles appending a dataframe to an already existing CSV file that may contain differing columns.

Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.

Arguments

  • csv_file: The CSV file path to append/write to.
  • df: The dataframe to append.

Returns: The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
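As a rough illustration of appending when the columns differ, the sketch below widens the header before appending. It is not the library's implementation: the name `append_with_column_union` is hypothetical, the row index is assumed not to be written, and the fallback to a different file when the target is inaccessible is left out.

```python
from pathlib import Path

import pandas as pd


def append_with_column_union(csv_file, df: pd.DataFrame) -> Path:
    """Hypothetical sketch: append df to csv_file, widening the header if needed."""
    path = Path(csv_file)
    if not path.exists():
        # Nothing to append to: write a fresh file with a header row.
        df.to_csv(path, index=False)
        return path

    # Read only the header row to learn the existing columns.
    existing_cols = list(pd.read_csv(path, nrows=0).columns)
    # Existing columns first, then any columns that only appear in df.
    all_cols = existing_cols + [c for c in df.columns if c not in existing_cols]

    if len(all_cols) > len(existing_cols):
        # New columns: rewrite the existing rows under the wider header.
        # (Loads the whole file; acceptable for a sketch, not for large CSVs.)
        pd.read_csv(path).reindex(columns=all_cols).to_csv(path, index=False)

    # Align df to the file's column order and append without a header row.
    df.reindex(columns=all_cols).to_csv(path, mode="a", header=False, index=False)
    return path
```

Columns missing from either side end up as empty cells, which pandas reads back as NaN.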

calculate_age

def calculate_age(dob: pd.Timestamp | datetime | date, comparison_date: Optional[pd.Timestamp | datetime | date] = None) -> int:

Given a date of birth, calculate age at a target date.

If no target date is supplied, use today.

Arguments

  • dob: Date of birth (should be a pandas Timestamp or a Python datetime/date).
  • comparison_date: The date to calculate age at. Defaults to today.

Returns: The age at the target date.
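One common way to compute an age in whole years is to subtract the birth year and then take one off if the birthday has not yet occurred in the target year. The sketch below uses only the standard library. The name `age_at` is hypothetical, and treating a 29 February birthday as not reached until 1 March in non-leap years is an assumption, not documented behaviour of `calculate_age`.

```python
from datetime import date, datetime
from typing import Optional


def age_at(dob: date, comparison_date: Optional[date] = None) -> int:
    """Hypothetical sketch: whole years from dob to comparison_date (today if omitted)."""
    if comparison_date is None:
        comparison_date = date.today()
    # datetime (and pandas Timestamp) subclass date; reduce both to plain dates.
    if isinstance(dob, datetime):
        dob = dob.date()
    if isinstance(comparison_date, datetime):
        comparison_date = comparison_date.date()

    # Has the birthday already happened in the comparison year?
    had_birthday = (comparison_date.month, comparison_date.day) >= (dob.month, dob.day)
    return comparison_date.year - dob.year - (0 if had_birthday else 1)


print(age_at(date(2000, 6, 15), date(2020, 6, 14)))  # 19 (day before birthday)
print(age_at(date(2000, 6, 15), date(2020, 6, 15)))  # 20 (on the birthday)
```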

calculate_ages

def calculate_ages(dobs: pd.Series[pd.Timestamp | datetime | date] | TimestampSeries, comparison_date: Optional[pd.Timestamp | datetime | date] = None) -> pd.Series[int]:

Given a series of dates of birth, calculate ages at a target date.

If no target date is supplied, use today.

Arguments

  • dobs: Series of dates of birth (should be pandas Timestamps or Python datetimes/dates).
  • comparison_date: The date to calculate age at. Defaults to today.

Returns: A series of the ages at the target date.
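The same birthday comparison can be vectorised over a whole series. The following is a minimal sketch, assuming the dates of birth can be converted with `pd.to_datetime`. The name `ages_at` is hypothetical and this is not necessarily how `calculate_ages` is implemented.

```python
from datetime import date

import pandas as pd


def ages_at(dobs: pd.Series, comparison_date=None) -> pd.Series:
    """Hypothetical sketch: whole-year ages for a series of dates of birth."""
    target = pd.Timestamp(comparison_date) if comparison_date is not None else pd.Timestamp.today()
    dobs = pd.to_datetime(dobs)

    # True where the birthday has already occurred in the target year.
    had_birthday = (dobs.dt.month < target.month) | (
        (dobs.dt.month == target.month) & (dobs.dt.day <= target.day)
    )
    # Subtract one year for anyone whose birthday is still to come.
    return (target.year - dobs.dt.year - (~had_birthday).astype(int)).astype(int)


dobs = pd.Series([date(2000, 6, 15), date(2000, 6, 16)])
print(ages_at(dobs, date(2020, 6, 15)).tolist())  # [20, 19]
```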

conditional_dataframe_yielder

def conditional_dataframe_yielder(dfs: Iterable[pd.DataFrame], condition: Callable[[pd.DataFrame], pd.DataFrame], reset_index: bool = True) -> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:

Create a generator that conditionally yields rows from a set of dataframes.

This replicates the standard .loc conditional indexing that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.

Arguments

  • dfs: An iterable of dataframes to conditionally yield rows from.
  • condition: A callable that takes in a dataframe, applies a condition, and returns the edited/filtered dataframe.
  • reset_index: Whether the index of the yielded dataframes should be reset. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
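To make the running-index behaviour concrete, here is a minimal sketch of a chunk-wise filter. The name `filter_chunks` is hypothetical, and whether the real function also yields empty chunks is not stated in this reference, so the sketch simply yields every filtered chunk.

```python
from typing import Callable, Iterable, Iterator

import pandas as pd


def filter_chunks(
    dfs: Iterable[pd.DataFrame],
    condition: Callable[[pd.DataFrame], pd.DataFrame],
    reset_index: bool = True,
) -> Iterator[pd.DataFrame]:
    """Hypothetical sketch: apply condition to each chunk, keeping one running index."""
    next_index = 0
    for df in dfs:
        filtered = condition(df)
        if reset_index:
            # Continue the integer index where the previous yielded chunk stopped.
            new_index = pd.RangeIndex(next_index, next_index + len(filtered))
            filtered = filtered.set_axis(new_index, axis=0)
            next_index += len(filtered)
        yield filtered


chunks = [pd.DataFrame({"x": [1, 5, 7]}), pd.DataFrame({"x": [8, 2, 9]})]
for part in filter_chunks(chunks, lambda df: df.loc[df["x"] > 4]):
    print(list(part.index), part["x"].tolist())
# [0, 1] [5, 7]
# [2, 3] [8, 9]
```

In practice `dfs` would often be the iterator returned by `pd.read_csv(..., chunksize=...)`.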

dataframe_iterable_join

def dataframe_iterable_join(joiners: Iterable[pd.DataFrame], joinee: pd.DataFrame, reset_joiners_index: bool = False) -> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:

Performs a dataframe join against a collection of dataframes.

This replicates the standard .join() method that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.

For each joiner dataframe in joiners, this is equivalent to:

joiner.join(joinee)

Arguments

  • joiners: The collection of dataframes that should be joined against the joinee.
  • joinee: The single dataframe that the others should be joined against.
  • reset_joiners_index: Whether the index of the joiners dataframes should be reset as they are processed. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
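A chunk-wise join along these lines could look like the sketch below. The name `join_chunks` is hypothetical, and the sketch assumes that the index reset happens before each join, so the reset index is what gets matched against the joinee. The reference does not spell out that ordering.

```python
from typing import Iterable, Iterator

import pandas as pd


def join_chunks(
    joiners: Iterable[pd.DataFrame],
    joinee: pd.DataFrame,
    reset_joiners_index: bool = False,
) -> Iterator[pd.DataFrame]:
    """Hypothetical sketch: yield joiner.join(joinee) for each chunk."""
    next_index = 0
    for joiner in joiners:
        if reset_joiners_index:
            # Assumption: give the chunk a running integer index before joining.
            new_index = pd.RangeIndex(next_index, next_index + len(joiner))
            joiner = joiner.set_axis(new_index, axis=0)
            next_index += len(joiner)
        # Default DataFrame.join: left join on the index.
        yield joiner.join(joinee)


joinee = pd.DataFrame({"b": [10, 20, 30, 40]})
chunks = [
    pd.DataFrame({"a": [1, 2]}, index=[0, 1]),
    pd.DataFrame({"a": [3, 4]}, index=[2, 3]),
]
print(pd.concat(join_chunks(chunks, joinee))["b"].tolist())  # [10, 20, 30, 40]
```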

rewrite_csv_with_new_columns

def rewrite_csv_with_new_columns(csv_file: Union[str, os.PathLike], new_column_index: pd.Index) -> pathlib.Path:

Rewrite an existing dataframe CSV with a new set of columns.

This is useful when new columns need to be added to the CSV file. The function reads the original CSV in chunks, changes the column index, and writes the result to a new file. Once the new file is fully written, it replaces the original CSV.

Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.

Arguments

  • csv_file: The CSV file to rewrite.
  • new_column_index: A pandas Index representing the new set of columns to use.

Returns: The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
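The read-in-chunks, write-to-a-new-file, then replace pattern described above can be sketched as follows. The name `rewrite_with_columns` and the `chunksize` parameter are hypothetical, the row index is assumed not to be written, and the fallback path for an inaccessible target is omitted.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def rewrite_with_columns(csv_file, new_columns: pd.Index, chunksize: int = 10_000) -> Path:
    """Hypothetical sketch: rewrite csv_file so its columns match new_columns."""
    path = Path(csv_file)
    # Write to a temporary file in the same directory so the final replace
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".csv", dir=path.parent)
    os.close(fd)
    try:
        # Start with a header-only file, then append each re-columned chunk.
        pd.DataFrame(columns=new_columns).to_csv(tmp_name, index=False)
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunk.reindex(columns=new_columns).to_csv(
                tmp_name, mode="a", header=False, index=False
            )
        # Swap the new file in only after it has been fully written.
        os.replace(tmp_name, path)
    except Exception:
        os.remove(tmp_name)
        raise
    return path
```

Because the original is only replaced at the end, a failure partway through leaves the original CSV untouched.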