Data Ingestion

Upload datasets to Modela for automatic processing

The first step towards developing your machine learning model is preparing data that will be used in training. The resources in the Data category provide you with standard processes for defining the schema of incoming datasets, automatically generating an exploratory data analysis, transforming data, and more.

Data Sources

Before storing data for analysis and training, Modela requires an exact definition of the file format and the columns present in a dataset. The Data Source resource provides an interface for automatically generating a schema from a dataset composed of one or more columns. You can also attach validation rules that will reject a dataset from storage if it fails its validation tests. Create a Data Source by navigating to the Data Sources tab on the Data Product sidebar.

File Format

Define the file format, which can be either a flat file (CSV, Excel) or a database table (in which case the file format is ignored).

When specifying the CSV or Excel file type, you can set additional parameters that determine the format of the file and the exact location of the data within it. Adjust these parameters to match the format of the raw data.
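
These format parameters map closely to the options of a typical CSV reader. As a rough analog, the sketch below uses pandas; the parameter names (`sep`, `decimal`, `header`) are pandas' own, not Modela's:

```python
# Reading a semicolon-delimited CSV that uses a comma as the
# decimal mark -- the kind of detail the file-format parameters
# must match for ingestion to succeed.
import io
import pandas as pd

raw = io.StringIO("id;amount;label\n1;10,5;yes\n2;3,2;no\n")

df = pd.read_csv(
    raw,
    sep=";",      # column delimiter
    decimal=",",  # decimal mark used in numeric fields
    header=0,     # row containing the column names
)
print(df.dtypes)
```

If these parameters disagree with the raw data (for example, the wrong delimiter), columns are misparsed rather than rejected, which is why they should be checked against a sample of the file.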

Infer Schema

Generate a schema by uploading your data and selecting Infer. This will produce a table with the columns of your dataset.

Each column is assigned a type inferred from the contents of the file. Examine these types to ensure that they align with your dataset. You must specify the target column, which is the column containing the prediction label. Additionally, each column has toggleable settings for the following:

  • Ignore: Indicate if the column will be ignored and not used in training
  • Protected: Indicate if the column contains information protected in the real world (e.g. age, gender, race)
  • PPI: Indicate if the column contains protected personal information
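
To make the inference step concrete, here is a generic sketch of how a type can be guessed from a column's contents. This is an illustrative analog, not Modela's actual inference logic:

```python
# Guess the narrowest type that fits every value in a column.
import pandas as pd

def guess_type(series: pd.Series) -> str:
    """Return 'integer', 'float', or 'string' based on the values."""
    try:
        numeric = pd.to_numeric(series)
    except (ValueError, TypeError):
        return "string"
    return "integer" if (numeric % 1 == 0).all() else "float"

df = pd.DataFrame({
    "age": ["34", "27", "45"],
    "income": ["52000.5", "48000.0", "61000.2"],
    "churned": ["yes", "no", "no"],  # would be marked as the target
})
schema = {col: guess_type(df[col]) for col in df.columns}
print(schema)  # {'age': 'integer', 'income': 'float', 'churned': 'string'}
```

Because inference only sees the values present in the sample, a numeric-looking identifier or a sparse text column can be guessed wrong, which is why reviewing the inferred types matters.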

Validation Rules

A common production requirement is that ingested training data meet a certain level of quality. If you want to add validation rules to your data, read the data validation article.

Datasets

After creating your Data Source, your dataset is ready for storage and analysis within a Dataset resource. Creating a Dataset launches a series of Kubernetes Jobs that fully ingest it into the system:

  • Validate: Check all of the validation rules from the data source to ensure quality.
  • Profile: Generate a statistical analysis and full exploratory data analysis of the dataset.
  • Report: Generate a PDF report about the dataset.
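
The three phases above can be sketched as a simple pipeline. In Modela each phase runs as a separate Kubernetes Job; here they are plain functions, with stand-in logic, for illustration only:

```python
# Validate -> Profile -> Report, sketched as sequential steps.
import pandas as pd

def validate(df: pd.DataFrame) -> bool:
    # Stand-in rule: non-empty data with no fully-null columns.
    return len(df) > 0 and not df.isna().all().any()

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # Basic per-column statistical summary.
    return df.describe(include="all")

def report(profile_df: pd.DataFrame) -> str:
    # Stand-in for PDF generation: render the profile as text.
    return profile_df.to_string()

df = pd.DataFrame({"age": [34, 27, 45], "churned": ["yes", "no", "no"]})
if validate(df):
    print(report(profile(df)))
```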

Create a Dataset

Navigate to Datasets in the Data Product sidebar, and select Add Dataset.

Configure the standard resource metadata and select the Data Source relevant to your Dataset. Enabling Fast Mode skips the validation, profiling, and reporting phases so the Dataset completes and becomes available for training sooner.

Upload Data

Select the remote location of your dataset with SQL, or upload a file.

After uploading, you can optionally set sampling parameters and submit the dataset for ingestion.
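
Sampling simply ingests a random subset of the rows instead of the full file. A rough analog with pandas, where the parameter names (`frac`, `random_state`) are pandas' and not Modela's:

```python
# Keep a 10% random sample of the rows, seeded for reproducibility.
import pandas as pd

df = pd.DataFrame({"x": range(100)})
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 10 rows
```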

View Profile

Once you have submitted your dataset and it has been profiled, you can view the dataset profile by editing the resource and navigating to the Profile tab. Dataset statistics, data types, column statistics, and visualizations are loaded automatically.
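
The column statistics shown in the profile are of the familiar descriptive kind. The sketch below approximates them with pandas; the real profile also includes visualizations and data-quality metrics that this does not reproduce:

```python
# Per-column descriptive statistics of the sort a profile surfaces.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 27, 45, 31],
    "income": [52000.5, 48000.0, 61000.2, 50500.0],
})
stats = df.agg(["count", "mean", "min", "max"])
print(stats)
```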