csv_source
Module containing CSVSource class.
CSVSource class handles loading of CSV data.
Classes
CSVSource
class CSVSource( path: Union[os.PathLike, AnyUrl, str], read_csv_kwargs: Optional[dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
Data source for loading csv files.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
path
: The path or URL to the csv file.
read_csv_kwargs
: Additional arguments to be passed as a dictionary to pandas.read_csv. Defaults to None.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
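The ignore_cols argument accepts either a single column name or a sequence of names. The following is a minimal sketch of how such an argument could be normalised and applied, using only the standard library csv module. The helper read_rows is hypothetical and is not part of CSVSource, which reads files through pandas.read_csv.

```python
import csv
import io
from typing import Optional, Sequence, Union


def read_rows(
    text: str,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
) -> list[dict[str, str]]:
    """Parse CSV text and drop the ignored columns (hypothetical helper)."""
    # Accept a single column name or a sequence of names.
    if ignore_cols is None:
        ignored: set[str] = set()
    elif isinstance(ignore_cols, str):
        ignored = {ignore_cols}
    else:
        ignored = set(ignore_cols)
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: val for col, val in row.items() if col not in ignored}
        for row in reader
    ]


CSV_TEXT = "id,name,age\n1,Ann,34\n2,Bob,29\n"
rows = read_rows(CSV_TEXT, ignore_cols="id")
```

Checking for str before treating the value as a sequence matters here, because a plain string is itself a sequence of characters.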
Attributes
data
: A DataFrame-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Ancestors
Variables
data : pandas.core.frame.DataFrame
A property containing the underlying dataframe if the data has been loaded.
Raises
DataNotLoadedError
: If the data has not been loaded yet.
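A property that raises until the data has been loaded can be sketched as below. This is an illustration of the pattern only: LazySource is a hypothetical class, and DataNotLoadedError is redefined here as a stand-in for the library's exception of the same name.

```python
from typing import Optional


class DataNotLoadedError(Exception):
    """Stand-in for the library's exception of the same name."""


class LazySource:
    """Hypothetical source that holds no data until load_data() is called."""

    def __init__(self) -> None:
        self._data: Optional[list[dict[str, int]]] = None

    def load_data(self) -> None:
        # Placeholder rows; a real source would read them from a file.
        self._data = [{"x": 1}, {"x": 2}]

    @property
    def data(self) -> list[dict[str, int]]:
        # Fail loudly rather than returning None when nothing is loaded.
        if self._data is None:
            raise DataNotLoadedError("Call load_data() before accessing data.")
        return self._data
```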
hash : str
The hash associated with this BaseSource.
This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
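The idea of hashing the format of the data rather than its content can be sketched as follows. This is not the library's algorithm: the function schema_hash, the use of SHA-256, and the choice to ignore column order are all assumptions made for illustration.

```python
import hashlib
import json


def schema_hash(dtypes: dict[str, str]) -> str:
    """Hash column names and type names only (hypothetical helper)."""
    # Sort by column name so that column order does not affect the digest
    # (an assumption of this sketch).
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two datasets with the same columns and types share a hash,
# no matter how many rows each one holds.
before = schema_hash({"age": "int64", "name": "object"})
after_more_rows = schema_hash({"name": "object", "age": "int64"})
changed_type = schema_hash({"age": "float64", "name": "object"})
```

Because no row values enter the digest, appending rows leaves the hash unchanged, while changing a column's type produces a different one.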
is_initialised : bool
Checks if BaseSource was initialised.
is_task_running : bool
Returns True if a task is running.
iterable : bool
This returns False if the DataSource does not subclass IterableSource.
However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
Methods
get_column
def get_column( self: BaseSource, col_name: str, *args: Any, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Get a single column from dataset.
Used to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> collections.abc.Iterable:
Inherited from:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:
Loads and returns data from CSV dataset.
Returns: A DataFrame-type object which contains the data.
Raises
DataSourceError
: If the CSV file cannot be opened.
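Wrapping a low-level file error in a data-source-specific exception might look like the sketch below. It uses the standard library csv module instead of pandas; read_csv_rows is a hypothetical helper, and DataSourceError is redefined here as a stand-in for the library's exception.

```python
import csv


class DataSourceError(Exception):
    """Stand-in for the library's exception of the same name."""


def read_csv_rows(path: str) -> list[dict[str, str]]:
    """Read a CSV file into a list of row dictionaries (hypothetical helper)."""
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    except OSError as e:
        # Missing files and permission problems are both OSError subclasses.
        raise DataSourceError(f"Unable to open CSV file: {path}") from e
```

Chaining with `from e` keeps the original error available in the traceback.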
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from dataset.
get_project_db_sqlite_columns
def get_project_db_sqlite_columns(self) ‑> list:
Inherited from: BaseSource.get_project_db_sqlite_columns
Implement this method to get the required columns.
This is used by the "run on new data only" feature to add data to the task table in the project database.
get_project_db_sqlite_create_table_query
def get_project_db_sqlite_create_table_query(self) ‑> str:
Inherited from: BaseSource.get_project_db_sqlite_create_table_query
Implement this method to return the required columns and types.
This is used by the "run on new data only" feature. This should be in the format that can be used after a "CREATE TABLE" statement and is used to create the task table in the project database.
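A string "in the format that can be used after a CREATE TABLE statement" could be consumed as in the sketch below. The column definitions in COLUMNS_SQL, the table name my_task, and the helper create_task_table are all hypothetical; the actual columns returned by a given source are not documented here.

```python
import sqlite3

# Hypothetical column definitions, standing in for the string that
# get_project_db_sqlite_create_table_query() would return.
COLUMNS_SQL = '("rowID" INTEGER PRIMARY KEY, "datapoint_hash" TEXT)'


def create_task_table(
    con: sqlite3.Connection, table_name: str, columns_sql: str
) -> None:
    """Create a table from a column-definition string (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table_name.replace('"', '""') + '"'
    con.execute(f"CREATE TABLE IF NOT EXISTS {quoted} {columns_sql}")


con = sqlite3.connect(":memory:")
create_task_table(con, "my_task", COLUMNS_SQL)
# PRAGMA table_info yields one row per column; index 1 is the column name.
column_names = [row[1] for row in con.execute('PRAGMA table_info("my_task")')]
```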
get_values
def get_values(self, col_names: list[str], **kwargs: Any) ‑> dict:
Get distinct values from columns in CSV dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
**kwargs
: Additional keyword arguments.
Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.
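The shape of the returned mapping can be illustrated with plain Python rows. This is a sketch of the idea only: distinct_values is a hypothetical helper that returns lists, whereas the method described above works on the loaded CSV data and returns a series per column.

```python
from typing import Any


def distinct_values(
    rows: list[dict[str, Any]], col_names: list[str]
) -> dict[str, list[Any]]:
    """Map each requested column to its distinct values (hypothetical helper)."""
    result: dict[str, list[Any]] = {}
    for col in col_names:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        result[col] = list(dict.fromkeys(row[col] for row in rows))
    return result


rows = [
    {"city": "Leeds", "plan": "free"},
    {"city": "York", "plan": "free"},
    {"city": "Leeds", "plan": "pro"},
]
values = distinct_values(rows, ["city", "plan"])
```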
load_data
def load_data(self, **kwargs: Any) ‑> None:
Inherited from:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.