Data Ingestion
The first step towards developing a machine learning model is preparing the data it will be trained on. The resources in the Data category provide standard processes for defining the schema of incoming datasets, automatically generating an exploratory data analysis, transforming data, and more.
Data Sources
Before storing data for analysis and training, Modela requires an exact definition of the file format and the columns present in a dataset. The Data Source resource provides an interface for automatically generating a schema from a dataset composed of one or more columns. It can also attach validation rules that reject a dataset from storage if it fails validation checks. Create a Data Source by navigating to the Data Sources tab on the Data Product sidebar.
File Format
When specifying CSV or Excel file types, you can set additional parameters that determine the format of the file and the exact location of the data. Adjust these parameters to match the format of your raw data.
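To see what these format parameters control, here is an illustrative sketch using Python's standard-library CSV reader. The delimiter and header options mirror the kind of settings a CSV Data Source asks for; the sample data and option names are assumptions, not Modela's API.

```python
import csv
import io

# Illustrative only: a semicolon-delimited file whose first row holds
# the column names. The reader options correspond to the format
# parameters a Data Source would need (delimiter, header row, etc.).
raw = io.StringIO("id;amount;label\n1;10.5;yes\n2;3.2;no\n")

reader = csv.DictReader(raw, delimiter=";")
rows = list(reader)

print(reader.fieldnames)   # ['id', 'amount', 'label']
print(rows[0]["amount"])   # '10.5'
```

If the delimiter were misconfigured (e.g. a comma here), the reader would see a single malformed column, which is why these parameters must match the raw data exactly.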
Infer Schema
Each column is assigned a type inferred from the contents of the file. Examine these types to ensure they match your dataset. You must specify the target column, which is the column that contains the prediction label. Additionally, each column has toggleable settings for the following:
- Ignore: Indicates whether the column is excluded from training
- Protected: Indicates whether the column contains a legally protected attribute (e.g. age, gender, race)
- PPI: Indicates whether the column contains protected personal information
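The idea behind schema inference can be sketched as follows: guess each column's type from sample values, then attach the per-column flags described above. The function and field names here are hypothetical, not Modela's schema format.

```python
def guess_type(values):
    """Guess a column type from sample string values (illustrative)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"

# Sample columns as raw strings, as they would appear in a CSV file.
columns = {
    "age":    ["23", "41", "35"],
    "income": ["51000.5", "72000.0", "48500.25"],
    "churn":  ["yes", "no", "yes"],
}

schema = [
    {
        "name": name,
        "type": guess_type(vals),
        "target": name == "churn",   # the column holding the prediction label
        "ignore": False,
        "protected": name == "age",  # legally protected attribute
        "ppi": False,
    }
    for name, vals in columns.items()
]
```

This also shows why you should review inferred types: a numeric-looking identifier column, for example, would be guessed as an integer even though it should be treated as a category.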
Validation Rules
A common production requirement is that ingested training data meet a certain level of quality. To add validation rules to your data, see the data validation article.
Datasets
After creating your Data Source, your dataset is ready for storage and analysis within a Dataset resource. Creating a Dataset launches a series of Kubernetes Jobs that fully ingest it into the system:
- Validate: Check all of the validation rules from the data source to ensure quality.
- Profile: Generate a statistical analysis and full exploratory data analysis of the dataset.
- Report: Generate a PDF report about the dataset.
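The phase sequence above can be sketched as a simple pipeline. The function below is a stand-in, not the platform's implementation; in Modela each phase runs as its own Kubernetes Job, and Fast Mode (described below under Create a Dataset) skips all three phases.

```python
# Stand-in sketch of the ingestion sequence; illustrative only.
def run_ingestion(dataset, fast_mode=False):
    phases = [] if fast_mode else ["Validate", "Profile", "Report"]
    completed = []
    for phase in phases:
        # a real controller would launch a Kubernetes Job and wait here
        completed.append(phase)
    return completed + ["Ready"]

run_ingestion("my-dataset")                  # ['Validate', 'Profile', 'Report', 'Ready']
run_ingestion("my-dataset", fast_mode=True)  # ['Ready']
```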
Create a Dataset
Configure the standard resource metadata and select the Data Source relevant to your Dataset. Enabling Fast Mode skips the validation, profiling, and reporting phases so the Dataset completes and becomes available for training sooner.
Upload Data
After uploading, you can optionally set the sampling parameters and submit the dataset for ingestion.
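Sampling draws a subset of the uploaded rows before ingestion, which is useful for large files. A minimal sketch of the idea, with hypothetical parameter names (`fraction`, `seed`) that are not Modela's exact settings:

```python
import random

def sample_rows(rows, fraction=0.1, seed=42):
    """Take a reproducible random sample of the rows (illustrative)."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

subset = sample_rows(list(range(100)), fraction=0.2)
len(subset)  # 20
```

Seeding the generator means resubmitting the same data with the same parameters yields the same sample.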
View Profile
Once you have submitted your dataset and it has been profiled, you can view the dataset profile by editing the resource and navigating to the Profile tab. Dataset statistics, data types, column statistics, and visualizations are loaded automatically.
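As a rough idea of what the column statistics contain, here is a toy per-column summary; the real profile also includes data types and visualizations, and this function is an illustration, not Modela's profiler.

```python
import statistics

def profile_column(values):
    """Compute toy summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

profile = profile_column([10.0, 12.0, 11.0, 13.0])
# profile["count"] == 4, profile["mean"] == 11.5
```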