Lesson 1: Introduction to Data Science
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, algorithms, and tools from other fields such as statistics and computing to extract knowledge and insights from data for decision making. Data science is a systematic process of extracting meaning from data.
Data science involves data collection, data preparation, exploratory data analysis, model development and evaluation, and the extraction of valuable insights from analytic or model results for decision making. It also covers the interpretation, communication, and consumption of those results.
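To make the steps above concrete, here is a minimal sketch of the workflow using only the Python standard library. The dataset, the column names (spend, sales), and the helper functions are all hypothetical and exist only for illustration; a real project would typically use libraries such as pandas and scikit-learn.

```python
import statistics

# Data collection: hypothetical records (made-up values for illustration).
raw_records = [
    {"spend": 1.0, "sales": 3.1},
    {"spend": 2.0, "sales": 4.9},
    {"spend": None, "sales": 6.0},  # incomplete record
    {"spend": 3.0, "sales": 7.2},
    {"spend": 4.0, "sales": 8.8},
    {"spend": 5.0, "sales": 11.1},
    {"spend": 6.0, "sales": 13.0},
]


def prepare(records):
    """Data preparation: drop records that contain missing values."""
    return [r for r in records if None not in r.values()]


def explore(records):
    """Exploratory data analysis: summary statistics for each column."""
    summary = {}
    for col in ("spend", "sales"):
        values = [r[col] for r in records]
        summary[col] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return summary


def fit_line(xs, ys):
    """Model development: least-squares fit of y = a + b * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mean_absolute_error(model, xs, ys):
    """Model evaluation: average absolute prediction error."""
    a, b = model
    return statistics.mean([abs(y - (a + b * x)) for x, y in zip(xs, ys)])


clean = prepare(raw_records)
summary = explore(clean)

# Hold out the last two records to evaluate the model on unseen data.
train, test = clean[:4], clean[4:]
model = fit_line([r["spend"] for r in train], [r["sales"] for r in train])
mae = mean_absolute_error(
    model, [r["spend"] for r in test], [r["sales"] for r in test]
)

# Communication: report the results in a form a stakeholder can read.
print(f"rows kept: {len(clean)}")
print(f"mean sales: {summary['sales']['mean']:.2f}")
print(f"slope: {model[1]:.2f}, test MAE: {mae:.2f}")
```

Each function corresponds to one phase, which is why the phases can be developed and tested separately before being chained together.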
The Data Science Lifecycle
The diagram below illustrates the full lifecycle and the high-level workflow of a data science project.
It is worth noting that data science projects fall into two main categories: analytic projects and Artificial Intelligence (AI) or Machine Learning (ML) projects. An analytic project does not require you to train a model or use a pretrained one. An AI/ML project requires you to either train a machine learning model or use a pretrained model.
Though the lifecycle illustrated above is more specific to a machine learning project, any type of data science project can still go through a similar lifecycle. For example, an analytics project with the goal of creating a dashboard for business users or stakeholders can still undergo development, deployment, maintenance, and monitoring. An AI project that uses a pretrained large language model (LLM) to extract structured data from PDF documents can be developed, deployed to production, monitored, and maintained.
Also, most data scientists focus only on the development aspect of data science projects. However, the lifecycle of an end-to-end data science project spans development, deployment, and maintenance/monitoring. Implementing an end-to-end data science project usually requires collaboration among data professionals across various roles.
Data Teams
The following data professionals are usually involved in the data science lifecycle:
Data Engineers: They design and manage the infrastructure and data storage technologies, such as databases. They create efficient data pipelines for data collection and curate datasets that are readily accessible to data scientists and analysts. The data engineer ensures the data is ready and available for data ingestion.
Data Analyst: A data analyst transforms data into outputs such as summary tables, reports, or dashboards that provide actionable insights for decision making.
Data Scientist: A Data Scientist prepares and explores datasets of interest and uses the prepared data to build predictive models. A crucial aspect of a Data Scientist’s role is to conduct experiments during model evaluation to find the optimal model for the task at hand. Data Scientists can also develop analytical projects that do not require model building. For example, as a Data Scientist, I developed an automated Python application that extracts department dashboards from Tableau and emails each dashboard as an image file to the respective department manager.
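The idea of running experiments to find the best model can be sketched as follows. This is an illustrative example only: the data values, the two candidate models, and the choice of RMSE as the metric are assumptions, not a prescribed method. It fits each candidate on training data, scores it on held-out validation data, and keeps the one with the lowest error.

```python
import statistics

# Hypothetical training and validation data (made-up values).
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
valid_x, valid_y = [7, 8], [14.0, 16.1]


def fit_mean_baseline(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = statistics.mean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: least-squares line y = a + b * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def rmse(predict, xs, ys):
    """Root mean squared error of a prediction function on (xs, ys)."""
    return (sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


# Each experiment: fit on the training data, score on the validation data.
candidates = {"mean_baseline": fit_mean_baseline, "linear": fit_linear}
scores = {
    name: rmse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}

# Select the candidate with the lowest validation error.
best_name = min(scores, key=scores.get)
print(scores)
print("selected model:", best_name)
```

Including a trivial baseline among the candidates is a common sanity check: a model that cannot beat it is not worth deploying.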
Machine Learning Engineer: Machine Learning or MLOps Engineers package and operationalize (productionize) machine learning projects or models. That is, a Machine Learning Engineer deploys machine learning models to production in an automated, reliable, and scalable way. They also set up systems to monitor and maintain the model in production, ensuring that the model is automatically retrained if its performance in production drops below an acceptable threshold. Sometimes, infrastructure engineers also deploy models to production.
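The monitoring rule described above (retrain when production performance drops below an acceptable value) can be sketched in a few lines. The class name PerformanceMonitor, the 0.8 accuracy threshold, and the window size of 5 are hypothetical choices for illustration; production systems usually rely on a dedicated monitoring service and a scheduled or event-driven retraining pipeline.

```python
from collections import deque


class PerformanceMonitor:
    """Tracks recent prediction outcomes and flags when retraining is due."""

    def __init__(self, threshold=0.8, window=5):
        self.threshold = threshold            # minimum acceptable accuracy
        self.outcomes = deque(maxlen=window)  # rolling window of hits/misses

    def record(self, prediction, actual):
        """Store whether a production prediction matched the true label."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = PerformanceMonitor(threshold=0.8, window=5)

# Hypothetical (prediction, actual) pairs observed in production.
for prediction, actual in [(1, 1), (0, 0), (1, 1), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)

if monitor.needs_retraining():
    # In a real system this would trigger a retraining pipeline.
    print("accuracy below threshold: trigger retraining")
```

Waiting until the window is full avoids triggering a retraining job from the first one or two unlucky predictions.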
Note that a “full-stack” data scientist is involved in the entire end-to-end lifecycle of the data science project, encompassing development, deployment, monitoring, and maintenance. As a data scientist or aspiring data scientist, it is crucial to develop the skills necessary for operationalizing data science or machine learning projects.
While research data scientists can focus primarily on experimenting with algorithms without an immediate need to operationalize models, applied data scientists gain little from developing models that never reach production. Therefore, production deployment skills are essential for applied data scientists, unless they work within a team that includes people who specialize in ML deployment.
The lifecycle in the diagram above is also called the MLOps lifecycle. MLOps stands for Machine Learning Operations, which refers to the automation of the development, deployment, monitoring, and maintenance of machine learning models. MLOps applies the principles of DevOps (Development and Operations) in software engineering to machine learning projects.
Machine Learning Tools
Tools such as MLflow and Kubeflow allow you to create data science or ML pipelines and manage the machine learning lifecycle. Cloud platform services such as Amazon SageMaker, Google Vertex AI, and Azure Machine Learning can also be used to develop, operationalize, monitor, and maintain machine learning models. Additionally, popular frameworks for building machine learning models include TensorFlow, PyTorch, scikit-learn, Keras, and XGBoost.
Summary
Data science is an interdisciplinary field that uses scientific methods, algorithms, and tools from areas like statistics and computing to extract insights from data for decision making. A data science project goes through several phases, including data collection, data preparation, data exploration, model development and evaluation, model deployment, monitoring, and maintenance.
Key roles in data science teams include data engineers, data analysts, data scientists, and machine learning engineers. Data engineers handle infrastructure, analysts generate insights, scientists build models, and ML engineers deploy and maintain models. The MLOps lifecycle focuses on efficient development, deployment, monitoring and maintenance of machine learning models using tools like MLflow, Kubeflow, and cloud platforms such as Amazon SageMaker and Azure Machine Learning.