Lesson 1: Introduction to Data Science

What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, algorithms, and tools from other fields such as statistics and computing to extract knowledge and insights from data for decision making. Data science is a systematic process of extracting meaning from data.

Data science involves data collection, data preparation, exploratory data analysis, model development, and model evaluation, followed by interpreting the results, communicating them, and putting the resulting insights to use in decision-making.
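To make these phases concrete, here is a minimal sketch that walks through them on a tiny, made-up dataset using only the Python standard library. The data, the variable names, and the choice of a straight-line model are assumptions for illustration; a real project would use real data and usually a dedicated library.

```python
from statistics import mean

# 1. Data collection: toy (hours_studied, exam_score) records, invented for
#    illustration. None marks a missing value.
raw = [(1, 52), (2, 54), (3, None), (4, 58), (5, 60), (6, 62), (7, 64), (8, 66)]

# 2. Data preparation: drop records with missing values.
clean = [(x, y) for x, y in raw if y is not None]

# 3. Exploratory data analysis: a simple summary statistic.
avg_score = mean(y for _, y in clean)

# 4. Model development: fit y = slope * x + intercept by ordinary least
#    squares on a training split.
train, test = clean[:5], clean[5:]


def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(train)

# 5. Model evaluation: mean absolute error on the held-out records.
mae = mean(abs((slope * x + intercept) - y) for x, y in test)

# 6. Communication: report the findings in plain terms.
print(f"Average score: {avg_score:.1f}")
print(f"Each extra hour of study adds about {slope:.1f} points (test MAE: {mae:.2f})")
```

Each numbered comment corresponds to one phase named above, so the script can be read as a miniature version of the whole process.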

The Data Science Lifecycle

The diagram below illustrates the full lifecycle and high-level workflow of a data science project.

It is worth noting that data science projects fall into two main categories: analytics projects and Artificial Intelligence (AI) or Machine Learning (ML) projects. Analytics projects do not require you to train a model or use a pretrained one. An AI/ML project requires you to train a machine learning model or use a pretrained model.

Though the lifecycle illustrated above is more specific to a machine learning project, any type of data science project can still go through a similar lifecycle. For example, an analytics project with the goal of creating a dashboard for business users or stakeholders can still undergo development, deployment, maintenance, and monitoring. An AI project that uses a pretrained large language model (LLM) to extract structured data from PDF documents can be developed, deployed to production, monitored, and maintained.

Also, most data scientists focus only on the development aspect of data science projects. However, the lifecycle of an end-to-end data science project spans development, deployment, and maintenance/monitoring. Implementing an end-to-end data science project usually requires collaboration among data professionals in various roles.

Data Teams

The following data professionals are usually involved in the data science lifecycle:

Data Engineer: A Data Engineer designs and manages the infrastructure and data storage technologies, such as databases. Data engineers create efficient data pipelines for data collection and curate datasets such that the data is readily accessible to data scientists and analysts. Data engineers are responsible for creating ETL pipelines that extract data from diverse sources, transform it to make it usable, and load it into target systems like data warehouses. The data engineer ensures the data is ready and available for data ingestion.
Data Analyst: A Data Analyst transforms data into data outputs such as summary tables, reports, or dashboards that provide actionable insights for decision making.
Data Scientist: A Data Scientist prepares and explores datasets of interest and uses the prepared data to build predictive models. A crucial aspect of a Data Scientist’s role is to conduct experiments during model evaluation to find the optimal model for the task at hand. Data Scientists can also develop analytical projects that do not require model building. For example, as a Data Scientist, I developed an automated Python application that extracts department dashboards from Tableau and emails the dashboards as image files to the respective department managers.
Machine Learning Engineer: Machine Learning or MLOps Engineers package and operationalize (productionize) machine learning projects or models. That is, Machine Learning Engineers deploy machine learning models to production in an automated, reliable, and scalable way. They set up systems to monitor and maintain the model in production, ensuring that the model is automatically retrained if its performance in production drops below an acceptable value. The pipeline is also automated such that, if the retrained model outperforms the model in production, the retrained model is automatically deployed. Sometimes, Infrastructure or IT Engineers assist in deploying models to production.
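The ETL pipelines mentioned under the Data Engineer role can be sketched in a few lines. The example below is a minimal illustration, not a production design: the CSV content, the column names, the orders table, and the in-memory SQLite database standing in for a data warehouse are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from files, APIs,
# or operational databases.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, cast types, and skip rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # unusable record: the amount is missing
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping extract, transform, and load as separate functions mirrors the three stages, so each one can be tested or replaced independently.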

Data Professionals in Various Phases of the Data Science Lifecycle

Different data professionals are therefore involved in different phases of a data science project, as follows:

Pre-development phase: Data Engineer
Development phase: Data Analyst and Data Scientist
Deployment phase: ML Engineers
Monitoring and maintenance phase: ML Engineers and IT managers

Note that a “full-stack” data scientist is involved in the entire end-to-end lifecycle of the data science project, encompassing development, deployment, monitoring, and maintenance. As a data scientist or aspiring data scientist, it is crucial to develop the skills necessary for operationalizing data science or machine learning projects.

While research data scientists can primarily focus on experimenting with algorithms without the immediate need for operationalizing models, applied data scientists will likely find it less beneficial to develop models that do not ultimately reach production. Therefore, gaining production deployment skills is essential for applied data scientists, unless they work within a team that includes individuals specialized in ML deployment.

The lifecycle in the diagram above is also called the MLOps lifecycle. MLOps stands for Machine Learning Operations, which refers to the automation of the development, deployment, monitoring, and maintenance of machine learning models. MLOps applies the principles of DevOps (Development and Operations) in software engineering to machine learning projects.
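The automated monitoring and maintenance that MLOps aims for, as described earlier for Machine Learning Engineers, can be sketched as a single decision function. This is a simplified illustration under stated assumptions: the ModelVersion class, the maintain function, the accuracy metric, and the threshold value are hypothetical names and numbers, and the retrain callback stands in for a real training pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelVersion:
    name: str
    accuracy: float  # score on recent production data (hypothetical metric)


def maintain(
    production: ModelVersion,
    retrain: Callable[[], ModelVersion],
    threshold: float,
) -> ModelVersion:
    """Return the model version that should serve after one monitoring check."""
    if production.accuracy >= threshold:
        return production  # performance is acceptable; nothing to do
    candidate = retrain()  # performance dropped; trigger retraining
    # Deploy the retrained model only if it outperforms the one in production.
    return candidate if candidate.accuracy > production.accuracy else production


current = ModelVersion("v1", accuracy=0.72)
serving = maintain(
    current,
    retrain=lambda: ModelVersion("v2", accuracy=0.85),  # stand-in for training
    threshold=0.80,
)
print(serving.name)
```

In a real system this check would run on a schedule or in response to monitoring alerts, and deployment would involve more than returning an object, but the decision logic follows the same two conditions.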

Machine Learning Tools

Machine learning platforms such as MLflow and Kubeflow allow you to create data science or ML pipelines and manage the machine learning lifecycle. Cloud platform services such as Amazon SageMaker, Google Vertex AI, and Azure Machine Learning can also be used to develop, operationalize, monitor, and maintain machine learning models. Additionally, popular frameworks for building machine learning models include TensorFlow, PyTorch, Scikit-learn, Keras, and XGBoost.

Summary

Data science is an interdisciplinary field that uses scientific methods, algorithms, and tools from areas like statistics and computing to extract insights from data for decision-making. A data science project goes through various phases, including data collection, data preparation, data exploration, model development and evaluation, model deployment, and monitoring and maintenance.

Key roles in data science teams include data engineers, data analysts, data scientists, and machine learning engineers. Data engineers handle infrastructure, analysts generate insights, scientists build models, and ML engineers deploy and maintain models. The MLOps lifecycle focuses on efficient development, deployment, monitoring and maintenance of machine learning models using tools like MLflow, Kubeflow, and cloud platforms such as Amazon SageMaker and Azure Machine Learning.