Lesson 2: Data Preparation


Data Preparation

Data preparation is a vital part of any data science project because real-world data is typically messy and needs some cleaning and preparation. Data preparation involves transforming a dataset to make it ready for analysis. Data preparation is sometimes called preprocessing.


Data Preparation Tasks

There are different tasks involved in data preparation:

Data Cleaning

Remove noise and handle inconsistencies: Eliminate inaccurate or irrelevant data points, correct errors, and ensure data uniformity. I usually check the unique values of each variable that will be needed for analysis downstream to ensure that those values are acceptable.

For example, you may sometimes find values in a gender column spelled as Male and other times as male. Such values need to be normalized. If you find a negative value in an income column, that is a value that needs to be handled, as such values can throw off the results of your analysis.
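As a quick illustration, here is a minimal pandas sketch (the gender and income columns below are made-up example data) that checks unique values, normalizes inconsistent spellings, and marks negative incomes as missing:

```python
import pandas as pd
import numpy as np

# Hypothetical example data with inconsistent gender labels and a bad income value
df = pd.DataFrame({
    "gender": ["Male", "male", "FEMALE", "Female"],
    "income": [52000, 61000, -4500, 73000],
})

# Inspect the unique values before cleaning
print(df["gender"].unique())   # ['Male' 'male' 'FEMALE' 'Female']

# Normalize the categorical labels to a single consistent spelling
df["gender"] = df["gender"].str.strip().str.lower()

# Treat impossible negative incomes as missing so they can be handled later
df.loc[df["income"] < 0, "income"] = np.nan
print(df)
```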

Filtering: Selectively remove or keep certain data points based on specific criteria. It is often used to clean a dataset by removing irrelevant, redundant, or noisy data, or to focus on specific subsets of the data that are more relevant to the analysis or model. Filtering can be based on a condition, a time window, a threshold, or particular features.
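For instance, a minimal sketch of condition-based and threshold-based filtering with pandas (the column names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [15, 34, 52, 78, 29],
    "visits": [1, 0, 12, 3, 250],
})

# Condition-based filtering: keep only adults
adults = df[df["age"] >= 18]

# Threshold-based filtering: drop an extreme outlier in visit counts
typical = df[df["visits"] <= df["visits"].quantile(0.95)]

print(adults)
print(typical)
```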

Check and handle missing values: Impute missing data or handle them appropriately. Missing data can distort the distribution of features in a dataset if not handled correctly. If missing values are not correctly treated, the results of the data analysis can be misleading. Hence, handling missing values in a dataset is crucial for ensuring the quality and accuracy of your analysis or model.

Missing values can be handled in various ways, such as dropping columns or rows, or through imputation. Eliminating rows or columns with missing data can reduce the sample size of your data and can skew your data if the missingness mechanism is not MCAR (missing completely at random). Imputation methods for handling missing data include the following (a short code sketch of two of them appears after the list):

  • Simple imputation: Replace missing values with estimates such as mean, median or mode. Use simple imputation if the missingness mechanism is MCAR and if at least 70-80% of the data for the feature in question is present. This ensures that the imputed mean, median or mode is representative of the distribution of that feature.
  • Forward/backward fill: Use the previous (forward fill) or subsequent (backward fill) available value to fill missing entries. These imputation methods are commonly used in time series data preparation.
  • Linear interpolation: Estimate missing values based on the linear relationship between the nearest known data points.
  • Model-based: Use models to predict and fill in missing values.
  • Indicator variables: Add an indicator variable to represent missingness, especially for a categorical column.
  • Multiple imputation: Create multiple imputed datasets, analyze them separately, and combine the results to account for uncertainty in missing data.
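The snippet below is a minimal sketch of the first two approaches (simple mean imputation and forward fill) using pandas; the columns are illustrative only:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 73000],
    "temperature": [21.5, np.nan, 23.1, 22.8, np.nan],
})

# Simple imputation: replace missing income with the column mean
# (reasonable when missingness is MCAR and most of the column is present)
df["income"] = df["income"].fillna(df["income"].mean())

# Forward fill: carry the last observed temperature forward
# (commonly used for time series data)
df["temperature"] = df["temperature"].ffill()

print(df)
```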

Data Transformation

Scaling: Scaling is a technique used to adjust the range and distribution of numerical features in a dataset. These methods are crucial for machine learning models, especially those that rely on distance-based algorithms like k-NN. Scaling ensures that features are on a comparable scale to improve model performance and prevent bias toward certain features with larger numerical ranges.
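As a rough sketch, both min-max scaling and standard scaling are available in scikit-learn (the age and income features below are assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age in years, income in dollars
X = np.array([[25, 40000], [37, 85000], [52, 120000], [61, 62000]], dtype=float)

# Min-max scaling squeezes each feature into the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standard scaling centers each feature at 0 with unit variance
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```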

Categorical encoding: Converting categorical data into numerical representations (e.g., one-hot encoding, label encoding) is necessary because categorical values are typically represented as text, and algorithms require numerical formats for processing.
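A minimal sketch of both encodings, using pandas for one-hot encoding and scikit-learn for label encoding (the color column is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code
df["color_label"] = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(df)
```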

Binning: Dividing continuous data into discrete intervals, ranges, or “bins.” For example, continuous age data can be converted into age groups such as children, teens, and young adults.
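For instance, pandas' cut function can convert an age column into labeled groups like these (the exact bin edges are assumptions):

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 35, 13, 19])

# Bin continuous ages into discrete, labeled groups
age_groups = pd.cut(
    ages,
    bins=[0, 12, 17, 29, 120],
    labels=["children", "teens", "young adults", "adults"],
)
print(age_groups)
```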

Other transformations: Other transformations may include converting data from long to wide format or vice versa. Pivoting also transforms a dataset, as well as grouping and aggregating data through a group-by mechanism. Sometimes, columns, especially categorical ones, can be concatenated.
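A small sketch of these reshaping operations in pandas (the sales table is purely illustrative):

```python
import pandas as pd

wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan_sales": [100, 150],
    "feb_sales": [120, 130],
})

# Wide to long format
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long back to wide with a pivot
back_to_wide = long.pivot(index="store", columns="month", values="sales")

# Group and aggregate: total sales per store
totals = long.groupby("store")["sales"].sum()

print(long)
print(back_to_wide)
print(totals)
```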

Feature Engineering

Feature selection: Reducing the number of features by selecting the most relevant features. The goal of feature selection is to improve the performance of machine learning models by removing irrelevant, redundant, or noisy features. Feature selection reduces the complexity of the model, hence can reduce overfitting, improve accuracy, and increase computational efficiency.
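One common approach, shown here as a rough sketch with scikit-learn's SelectKBest, keeps only the k features most strongly associated with the target:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (150, 4) -> (150, 2)
```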

Feature extraction: Reducing the number of features by creating new features from existing ones without losing important information from the original data. For example, principal component analysis (PCA) is a type of feature extraction.
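A minimal PCA sketch with scikit-learn, reducing the four Iris features to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```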

Feature generation: Creating new features from existing ones to improve model performance. For example, a date column can be used to extract the year, month, day, hour, etc. Lag and rolling aggregates can be derived from numerical columns, especially when preparing time series data for traditional machine learning modeling.
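A minimal sketch of date-part, lag, and rolling features with pandas (the date and sales columns are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "sales": [10, 12, 9, 15, 14, 18],
})

# Date parts extracted from a datetime column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek

# Lag feature: yesterday's sales
df["sales_lag_1"] = df["sales"].shift(1)

# Rolling aggregate: 3-day moving average of sales
df["sales_roll_mean_3"] = df["sales"].rolling(window=3).mean()

print(df)
```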

Data Splitting

Data splitting is the process of dividing a dataset into separate subsets for different purposes, primarily for training, validating, and testing machine learning models. When training a machine learning model, either a two-way or a three-way split can be used:

A two-way split: This refers to splitting the data into training and test sets. The typical train-test split ratios are 80:20, 70:30, or 60:40. As a rule of thumb, ensure that the ratio you select provides enough data for training.

A three-way split: This refers to splitting the data into training, validation and test sets. The typical train-validation-test split ratios are 60:20:20, 70:15:15 or 80:10:10.

A training set is used to train the machine learning model, a validation set is used to optimize or tune the model hyperparameters, and the test set is used to evaluate the final model’s performance after training and validation.
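As a sketch, scikit-learn's train_test_split can produce both a two-way and a three-way split; the 70:15:15 three-way ratio below matches one of the examples above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two-way split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Three-way split (70:15:15): first hold out 30%, then split that half and half
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))   # 105 22 23
```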

Summary

To build an effective data science or machine learning model, it is crucial to follow a systematic approach that includes key steps such as data cleaning, transformation, feature engineering, and data splitting. Data cleaning ensures that the dataset is free from errors and inconsistencies, allowing for more reliable analysis. Transformation helps in reshaping the data to a format that is better suited for modeling.

Feature engineering plays a vital role in creating meaningful features that enhance model performance, while proper data splitting ensures that the model is trained and evaluated on distinct datasets, preventing overfitting and ensuring its generalizability. By carefully applying these processes, data scientists can create robust models that are capable of making good predictions and delivering valuable insights.