csv_source

Module containing CSVSource class.

CSVSource class handles loading of CSV data.

Classes

CSVSource

class CSVSource(path: Union[os.PathLike, AnyUrl, str], read_csv_kwargs: Optional[dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

Data source for loading csv files.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
  • ignore_cols: A column name or list of column names to exclude from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • path: The path or URL to the csv file.
  • read_csv_kwargs: Additional arguments to be passed as a dictionary to pandas.read_csv. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
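
As a rough illustration of the read_csv_kwargs and ignore_cols arguments, the sketch below uses pandas directly rather than CSVSource itself. The in-memory CSV, the column names, and the idea that ignored columns are simply dropped after reading are all assumptions made for the example; the class may handle them differently.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the file at `path`.
csv_text = "id;age;notes\n1;34;a\n2;51;b\n"

# Example of what might be passed as `read_csv_kwargs`: any keyword
# arguments accepted by pandas.read_csv.
read_csv_kwargs = {"sep": ";", "dtype": {"id": "int64"}}

# Example of what might be passed as `ignore_cols`.
ignore_cols = ["notes"]

df = pd.read_csv(io.StringIO(csv_text), **read_csv_kwargs)

# Assumption: ignored columns are removed from the loaded frame.
df = df.drop(columns=ignore_cols)

print(list(df.columns))
```

Passing a dictionary keeps the constructor signature small while still exposing pandas options such as the delimiter or per-column dtypes.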

Attributes

  • data: A DataFrame-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, and validation sets.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.
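
    The idea of a format-only hash can be sketched as follows. This is not the library's algorithm, which is not described here; the helper name schema_hash, the use of SHA-256, and the JSON serialisation are all assumptions chosen for illustration.

```python
import hashlib
import json


def schema_hash(dtypes: dict[str, str]) -> str:
    """Hash column names and type names only, never cell values."""
    # Sorting makes the result independent of column order (an assumption).
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical schema: column name -> type name.
before = schema_hash({"id": "int64", "age": "int64"})

# Adding rows does not change names or types, so the input to the hash,
# and therefore the hash itself, stays the same.
same_schema = schema_hash({"id": "int64", "age": "int64"})

# Changing a column's type changes the hash.
changed = schema_hash({"id": "int64", "age": "float64"})

print(before == same_schema, before == changed)
```

    Because row contents never enter the function, the digest only changes when the schema does.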

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, so it is not necessarily True even if the DataSource inherits from IterableSource.

Methods


get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

BaseSource.get_column :

Get a single column from dataset.

Used to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> collections.abc.Iterable:

Inherited from:

BaseSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Loads and returns data from CSV dataset.

Returns: A DataFrame-type object which contains the data.

Raises

  • DataSourceError: If the CSV file cannot be opened.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self) -> list:

Inherited from:

BaseSource.get_project_db_sqlite_columns :

Implement this method to get the required columns.

This is used by the "run on new data only" feature to add data to the task table in the project database.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self) -> str:

Inherited from:

BaseSource.get_project_db_sqlite_create_table_query :

Implement this method to return the required columns and types.

This is used by the "run on new data only" feature. The returned string should be in a format that can follow a "CREATE TABLE" statement, and it is used to create the task table in the project database.
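
To make the expected string format concrete, here is a minimal sketch using the standard-library sqlite3 module. The column specification, the column names, and the table name task_table are hypothetical; the values CSVSource actually returns are not documented in this section.

```python
import sqlite3

# Hypothetical column specification: the kind of string that can follow
# "CREATE TABLE <name>" once it is wrapped in parentheses.
column_spec = '"record_id" INTEGER PRIMARY KEY, "record_hash" TEXT'

conn = sqlite3.connect(":memory:")

# "task_table" is a made-up name used only for this example.
conn.execute(f'CREATE TABLE "task_table" ({column_spec})')

# Read the column names back; this is the kind of list that a companion
# "columns" method could return.
columns = [row[1] for row in conn.execute("PRAGMA table_info(task_table)")]
conn.close()

print(columns)
```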

get_values

def get_values(self, col_names: list[str], **kwargs: Any) -> dict:

Get distinct values from columns in CSV dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • **kwargs: Additional keyword arguments.

Returns: A mapping from each requested column name to a series of that column's distinct values.
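
The shape of the returned mapping can be sketched with pandas alone. The helper distinct_values and the sample data are hypothetical, and the container holding each column's values (here whatever Series.unique returns) may differ from what get_values produces.

```python
import io

import pandas as pd

# Hypothetical CSV data for the example.
df = pd.read_csv(io.StringIO("city,plan\nOslo,free\nOslo,pro\nBergen,free\n"))


def distinct_values(frame: pd.DataFrame, col_names: list[str]) -> dict:
    """Map each requested column name to its distinct values."""
    # Series.unique keeps values in order of first appearance.
    return {col: frame[col].unique() for col in col_names}


values = distinct_values(df, ["city", "plan"])

for name, distinct in values.items():
    print(name, list(distinct))
```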

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

BaseSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If data format is not supported.