datasplitters
Classes for splitting data.
Classes
DatasetSplitter
class DatasetSplitter():
Parent class for different types of dataset splits.
Subclasses
PercentageSplitter
SplitterDefinedInData
Static methods
create
def create( splitter_name: str, **kwargs: Any,) ‑> DatasetSplitter:
Create a DatasetSplitter of the requested type.
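A name-based factory like this is often built on a registry keyed by each class's `splitter_name()`. The following is a minimal sketch of that pattern, not the library's implementation: `_REGISTRY`, `register`, `SketchSplitter`, `SketchPercentageSplitter`, and the name `"percentage"` are all hypothetical stand-ins.

```python
from typing import Any

# Hypothetical registry; the real library may resolve names differently.
_REGISTRY: dict[str, type] = {}


def register(cls: type) -> type:
    """Record a splitter class under the name it reports."""
    _REGISTRY[cls.splitter_name()] = cls
    return cls


class SketchSplitter:
    """Stand-in for the parent class, not the real DatasetSplitter."""

    @classmethod
    def splitter_name(cls) -> str:
        raise NotImplementedError

    @staticmethod
    def create(splitter_name: str, **kwargs: Any) -> "SketchSplitter":
        # Look up the class by name and forward the keyword arguments to it.
        try:
            splitter_cls = _REGISTRY[splitter_name]
        except KeyError:
            raise ValueError(f"Unknown splitter: {splitter_name!r}") from None
        return splitter_cls(**kwargs)


@register
class SketchPercentageSplitter(SketchSplitter):
    def __init__(
        self, validation_percentage: int = 10, test_percentage: int = 10
    ) -> None:
        self.validation_percentage = validation_percentage
        self.test_percentage = test_percentage

    @classmethod
    def splitter_name(cls) -> str:
        return "percentage"  # assumed name, for illustration only
```

With this sketch, `SketchSplitter.create("percentage", validation_percentage=20)` returns a `SketchPercentageSplitter` whose remaining arguments keep their defaults.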
splitter_name
def splitter_name() ‑> str:
Returns string name for splitter type.
Methods
create_dataset_splits
def create_dataset_splits(self, data: pd.DataFrame) ‑> tuple:
Returns indices for data sets.
get_filenames
def get_filenames( self, datasource: FileSystemIterableSource, split: DataSplit,) ‑> list[str]:
Returns a list of filenames for a given split.
Only used for file system sources.
Arguments
datasource
: A FileSystemIterableSource object.
split
: The relevant split to return filenames for.
Returns A list of filenames.
PercentageSplitter
class PercentageSplitter( validation_percentage: int = 10, test_percentage: int = 10, shuffle: bool = True, time_series_sort_by: Optional[Union[list[str], str]] = None,):
Splits data into sets based on percentages.
By default, 80% of the data is used for training and 10% each for validation and testing.
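The arithmetic behind the percentages can be sketched as follows. `split_sizes` is a hypothetical helper, and rounding down with the remainder going to training is an assumption; the library may round differently.

```python
def split_sizes(
    n_rows: int, validation_percentage: int = 10, test_percentage: int = 10
) -> tuple[int, int, int]:
    """Return (train, validation, test) row counts for n_rows rows."""
    if validation_percentage < 0 or test_percentage < 0:
        raise ValueError("Percentages must be non-negative")
    if validation_percentage + test_percentage > 100:
        raise ValueError("Validation and test percentages exceed 100")
    # Assumption: validation and test sizes are rounded down.
    n_validation = n_rows * validation_percentage // 100
    n_test = n_rows * test_percentage // 100
    # Training takes whatever is left, so rounding never drops a row.
    n_train = n_rows - n_validation - n_test
    return n_train, n_validation, n_test
```

For 1000 rows and the default percentages, this gives 800 training rows, 100 validation rows, and 100 test rows.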
Arguments
validation_percentage
: The percentage of data to be used for validation. Defaults to 10.
test_percentage
: The percentage of data to be used for testing. Defaults to 10.
time_series_sort_by
: A string or list of strings used to sort time series data. The strings should correspond to feature names in the dataset. The dataframe is sorted by the values of those features so that the validation and test sets come after the training set, which removes potential bias during training and evaluation. Defaults to None.
shuffle
: A bool indicating whether to shuffle the data for the splits. Defaults to True.
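How these arguments might interact can be sketched in plain Python. `percentage_split_indices` and its `seed` parameter are hypothetical, rows are modelled as a list of dicts rather than a dataframe, and skipping the shuffle when a sort key is given is an assumption made for this sketch rather than confirmed library behaviour.

```python
import random
from typing import Optional, Sequence, Union


def percentage_split_indices(
    rows: Sequence[dict],
    validation_percentage: int = 10,
    test_percentage: int = 10,
    shuffle: bool = True,
    time_series_sort_by: Optional[Union[list[str], str]] = None,
    seed: Optional[int] = None,  # hypothetical, for repeatable shuffles
) -> tuple[list[int], list[int], list[int]]:
    """Return (train, validation, test) lists of row indices."""
    indices = list(range(len(rows)))
    if time_series_sort_by is not None:
        if isinstance(time_series_sort_by, str):
            keys = [time_series_sort_by]
        else:
            keys = list(time_series_sort_by)
        # Order rows by the sort features so later rows land in validation/test.
        indices.sort(key=lambda i: tuple(rows[i][k] for k in keys))
    elif shuffle:
        # Assumption: shuffling applies only when no sort key is given.
        random.Random(seed).shuffle(indices)
    n_validation = len(indices) * validation_percentage // 100
    n_test = len(indices) * test_percentage // 100
    n_train = len(indices) - n_validation - n_test
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]
    return train, validation, test
```

When a sort key is given, every training row in this sketch precedes every validation row, which in turn precedes every test row.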
Ancestors
DatasetSplitter
Variables
- static shuffle : bool
- static test_percentage : int
- static time_series_sort_by : Optional[Union[list[str], str]]
- static validation_percentage : int
Static methods
create
def create( splitter_name: str, **kwargs: Any,) ‑> DatasetSplitter:
Inherited from: DatasetSplitter.create
Create a DatasetSplitter of the requested type.
splitter_name
def splitter_name() ‑> str:
Class method for splitter name.
Returns The string name for splitter type.
Methods
create_dataset_splits
def create_dataset_splits(self, data: pd.DataFrame) ‑> tuple:
Create splits in dataset for training, validation and test sets.
Arguments
data
: The dataframe type object to be split.
Returns A tuple of arrays, each containing the indices from the data to be used for training, validation, and testing, respectively.
get_filenames
def get_filenames( self, datasource: FileSystemIterableSource, split: DataSplit,) ‑> list[str]:
Inherited from: DatasetSplitter.get_filenames
Returns a list of filenames for a given split.
Only used for file system sources.
Arguments
datasource
: A FileSystemIterableSource object.
split
: The relevant split to return filenames for.
Returns A list of filenames.
SplitterDefinedInData
class SplitterDefinedInData( column_name: str = 'BITFOUNT_SPLIT_CATEGORY', training_set_label: str = 'TRAIN', validation_set_label: str = 'VALIDATION', test_set_label: str = 'TEST', infer_data_split_labels: bool = False,):
Splits data into sets based on a value in each row.
The splitting is done based on the values in a user-specified column.
Arguments
column_name
: The name of the column that contains the labels for splitting. Defaults to "BITFOUNT_SPLIT_CATEGORY".
training_set_label
: The label for the data points to be included in the training set. Defaults to "TRAIN".
validation_set_label
: The label for the data points to be included in the validation set. Defaults to "VALIDATION".
test_set_label
: The label for the data points to be included in the test set. Defaults to "TEST".
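Label-based splitting can be sketched in plain Python as below. `split_by_label` is a hypothetical helper that models rows as a list of dicts rather than a dataframe; ignoring rows with an unrecognised label is an assumption of this sketch, and the `infer_data_split_labels` option is not modelled.

```python
from typing import Sequence


def split_by_label(
    rows: Sequence[dict],
    column_name: str = "BITFOUNT_SPLIT_CATEGORY",
    training_set_label: str = "TRAIN",
    validation_set_label: str = "VALIDATION",
    test_set_label: str = "TEST",
) -> tuple[list[int], list[int], list[int]]:
    """Return (train, validation, test) lists of row indices."""
    buckets: dict[str, list[int]] = {
        training_set_label: [],
        validation_set_label: [],
        test_set_label: [],
    }
    for index, row in enumerate(rows):
        label = row[column_name]
        if label in buckets:
            buckets[label].append(index)
        # Assumption: rows with any other label are left out of every split.
    return (
        buckets[training_set_label],
        buckets[validation_set_label],
        buckets[test_set_label],
    )
```

Each returned list holds row positions, matching the index-based return value described for `create_dataset_splits`.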
Ancestors
DatasetSplitter
Variables
- static column_name : str
- static infer_data_split_labels : bool
- static test_set_label : str
- static training_set_label : str
- static validation_set_label : str
Static methods
create
def create( splitter_name: str, **kwargs: Any,) ‑> DatasetSplitter:
Inherited from: DatasetSplitter.create
Create a DatasetSplitter of the requested type.
splitter_name
def splitter_name() ‑> str:
Class method for splitter name.
Returns The string name for splitter type.
Methods
create_dataset_splits
def create_dataset_splits(self, data: pd.DataFrame) ‑> tuple:
Create splits in dataset for training, validation and test sets.
Arguments
data
: The dataframe type object to be split.
Returns A tuple of arrays, each containing the indices from the data to be used for training, validation, and testing, respectively.
get_filenames
def get_filenames( self, datasource: FileSystemIterableSource, split: DataSplit,) ‑> list[str]:
Inherited from: DatasetSplitter.get_filenames
Returns a list of filenames for a given split.
Only used for file system sources.
Arguments
datasource
: A FileSystemIterableSource object.
split
: The relevant split to return filenames for.
Returns A list of filenames.