The Data Science Workflow: From Raw Data to Actionable Insights

What Is the Data Science Workflow?

The data science workflow is the end-to-end process of turning raw data into actionable business insights. While every project has its own nuances, the underlying structure is remarkably consistent across industries. The most widely adopted framework for this process is CRISP-DM (Cross-Industry Standard Process for Data Mining), originally published in 1999 and still named as their main methodology by roughly 40% of respondents in KDnuggets polls.

CRISP-DM defines six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment. These phases are not strictly linear — data science is inherently iterative, and teams frequently loop back to earlier stages as they discover new patterns, encounter data quality issues, or refine their understanding of the problem. The strength of CRISP-DM is that it anchors every technical decision to a business objective, preventing the common trap of building sophisticated models that solve the wrong problem.

The UK data science market has grown rapidly, and understanding this structured workflow is essential for anyone entering the field or working alongside data teams. The numbers below illustrate the scale of the opportunity.

Data Science Roles Advertised in the UK (2025): 72,000+
Average UK Data Scientist Salary: £52,000
Teams Using CRISP-DM Globally: 40%
Project Time Spent on Data Preparation: 80%

Stage 1: Business Understanding and Problem Definition

Every successful data science project begins not with data, but with a clearly defined business problem. This is the most critical stage of the entire workflow and the one most frequently rushed. A technically brilliant model that answers the wrong question is worthless. Business understanding means translating a vague organisational need — "we want to reduce customer churn" or "we need to forecast demand better" — into a precise, measurable data science problem.

During this phase, you work closely with stakeholders to establish the project scope, define success criteria, and identify constraints. What decision will the model output inform? Who will act on the results? What does "good enough" look like? What data is available, and what are the legal and ethical boundaries around its use? In the UK context, this means considering UK GDPR, the Data Protection Act 2018, and sector-specific regulations from the outset, not as an afterthought.

Stakeholder alignment is essential. Data scientists and business leaders often speak different languages. The data scientist thinks in terms of features, algorithms, and evaluation metrics. The business leader thinks in terms of revenue, risk, and operational efficiency. Bridging this gap requires translating business objectives into measurable technical targets — for example, converting "reduce churn" into "build a classification model that identifies customers with a greater than 70% probability of leaving within 90 days, achieving at least 0.85 precision on the positive class."
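As a rough illustration of how a target like this can be checked once a model exists, the sketch below computes precision among the customers scored above the 70% threshold. The function name, the labels, and the probabilities are hypothetical; only the 0.70 threshold and the 0.85 precision target come from the example above.

```python
def precision_at_threshold(y_true, churn_prob, threshold=0.70):
    """Precision on the positive class among customers scored above threshold."""
    flagged = [label for label, prob in zip(y_true, churn_prob) if prob > threshold]
    if not flagged:
        return None  # no customer flagged, so precision is undefined
    return sum(flagged) / len(flagged)


# Hypothetical holdout set: 1 means the customer left within 90 days.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
churn_prob = [0.92, 0.81, 0.75, 0.88, 0.40, 0.10, 0.65, 0.30]

precision = precision_at_threshold(y_true, churn_prob)
meets_target = precision is not None and precision >= 0.85
print(f"precision={precision:.2f}, meets 0.85 target: {meets_target}")
```

Writing the success criterion as an executable check makes it harder for the definition of "good enough" to drift during the project.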

Key Questions to Ask Before Starting Any Data Science Project

What specific business decision will this project inform? If you cannot answer this clearly, the project is not ready to start.
How is this decision currently made? Understanding the baseline helps you measure improvement.
What does success look like, and how will we measure it? Define quantitative success metrics before writing any code.
What data is available, and what are the access constraints? Identify data sources, ownership, and any UK GDPR implications early.
Who are the stakeholders, and how will they consume the output? A dashboard, an API, a report, or an automated trigger — the delivery format shapes the entire project.
What is the timeline and budget? A two-week proof of concept requires a fundamentally different approach from a six-month production system.
 

Stage 2: Data Collection and Preparation

Data preparation is where data scientists spend the vast majority of their time. Surveys consistently report that 60% to 80% of a data science project is consumed by collecting, cleaning, and transforming data. This is not glamorous work, but it is the foundation upon which everything else depends. A model trained on dirty, incomplete, or biased data will produce dirty, incomplete, or biased results — no matter how sophisticated the algorithm.

Data collection involves identifying and accessing the raw data sources required for the project. These may include relational databases (SQL Server, PostgreSQL), APIs (REST or GraphQL endpoints), flat files (CSV, JSON, Excel), cloud data warehouses (BigQuery, Snowflake, Amazon Redshift), real-time streaming sources (Apache Kafka, AWS Kinesis), or web scraping. In the UK, web scraping must comply with the Computer Misuse Act 1990 and the website's terms of service, and any personal data collected is subject to UK GDPR.

Once data is collected, the preparation phase begins. This is a systematic process of profiling, cleaning, transforming, and validating the data until it is fit for analysis. The table below outlines the most common data quality issues and the standard approaches to resolving them.

Common Data Quality Issues and Fixes

Issue	Description	Standard Fix
Missing values	Fields with null, empty, or placeholder entries (e.g., "N/A", -999)	Impute with mean/median/mode, use predictive imputation, or drop rows if data loss is acceptable
Duplicate records	Same entity appearing multiple times due to merge errors or ETL bugs	Deduplicate using unique identifiers; apply fuzzy matching for near-duplicates
Inconsistent formatting	Dates in mixed formats (DD/MM/YYYY vs MM/DD/YYYY), inconsistent casing, varied units	Standardise to ISO 8601 for dates, apply consistent casing, convert all values to a single unit
Outliers	Values far outside the expected range (e.g., age of 999, negative salary)	Investigate root cause; apply IQR method, Z-score filtering, or domain-specific business rules
Incorrect data types	Numeric fields stored as strings, dates stored as integers, boolean fields with mixed types	Cast to correct types; validate ranges and constraints after conversion
Encoding issues	Garbled characters from mixed UTF-8 and Latin-1 encoding, especially in names and addresses containing accented characters	Detect encoding with chardet; convert all sources to UTF-8 before merging
Class imbalance	Target variable heavily skewed (e.g., 98% non-fraud, 2% fraud)	Apply SMOTE, undersampling, class weighting, or collect additional minority-class samples
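Several of the fixes in the table can be sketched in a few lines. The example below uses only the Python standard library so that it runs anywhere; a real pipeline would more likely use pandas. The rows, the placeholder list, and the column names are all hypothetical, and the order of the steps is one reasonable choice rather than a prescribed one.

```python
import statistics

PLACEHOLDERS = {"", "N/A", "-999"}  # values treated as missing (assumed list)

# Hypothetical extract: ages arrive as strings, with a duplicate and a bad value.
raw_rows = [
    {"customer_id": "C1", "age": "34"},
    {"customer_id": "C2", "age": "N/A"},
    {"customer_id": "C2", "age": "N/A"},  # duplicate record
    {"customer_id": "C3", "age": "41"},
    {"customer_id": "C4", "age": "999"},  # implausible value
    {"customer_id": "C5", "age": "29"},
    {"customer_id": "C6", "age": "38"},
    {"customer_id": "C7", "age": "45"},
    {"customer_id": "C8", "age": "31"},
    {"customer_id": "C9", "age": "36"},
]


def clean(rows):
    # 1. Deduplicate on the unique identifier, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        if row["customer_id"] not in seen:
            seen.add(row["customer_id"])
            unique.append(dict(row))

    # 2. Cast strings to integers; placeholders become None.
    for row in unique:
        text = row["age"].strip()
        row["age"] = None if text in PLACEHOLDERS else int(text)

    # 3. Flag outliers with the IQR rule (1.5 * IQR beyond the quartiles).
    known = [row["age"] for row in unique if row["age"] is not None]
    q1, _, q3 = statistics.quantiles(known, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    for row in unique:
        row["age_outlier"] = row["age"] is not None and not low <= row["age"] <= high

    # 4. Impute missing ages with the median of the non-outlier values.
    typical = [row["age"] for row in unique
               if row["age"] is not None and not row["age_outlier"]]
    median_age = statistics.median(typical)
    for row in unique:
        row["age_imputed"] = row["age"] is None
        if row["age"] is None:
            row["age"] = median_age
    return unique


cleaned = clean(raw_rows)
```

Outliers are flagged rather than deleted here, which keeps the root-cause investigation possible, and the `age_imputed` flag records which values were filled in so that later stages can account for it.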

UK-Specific Data Considerations

Working with UK data introduces specific formatting requirements. UK postcodes follow a complex alphanumeric pattern (e.g., SW1A 1AA, EC2R 8AH) that requires a dedicated validation regex. Dates in the UK follow the DD/MM/YYYY convention, which swaps the day and month relative to the US MM/DD/YYYY format and is a common source of parsing errors when working with international datasets: 03/04/2025 is 3 April in the UK but March 4 in the US. Currency values should be handled as integer pence to avoid floating-point precision issues, and VAT calculations must account for the current standard rate of 20%, the reduced rate of 5%, and zero-rated items.
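A minimal sketch of these three conventions follows. The postcode pattern is a simplified shape check written for this example, not an official validation rule, and the half-up rounding in `add_vat` is an assumption; all function names are hypothetical.

```python
import re
from datetime import datetime

# Simplified pattern for the general shape of a UK postcode (an assumption:
# it does not confirm that the postcode actually exists).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def is_plausible_postcode(text):
    """Check the shape of a postcode after trimming and upper-casing."""
    return bool(POSTCODE_RE.match(text.strip().upper()))


def uk_date_to_iso(text):
    """Parse DD/MM/YYYY explicitly and return an ISO 8601 date string."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


def add_vat(net_pence, rate_percent=20):
    """Return (vat_pence, gross_pence) using integer arithmetic only.

    Rounds half up to the nearest penny, which is an assumed rounding rule.
    """
    vat_pence = (net_pence * rate_percent + 50) // 100
    return vat_pence, net_pence + vat_pence


print(is_plausible_postcode("sw1a 1aa"))
print(uk_date_to_iso("03/04/2025"))
print(add_vat(1999))
```

Passing an explicit format string to `strptime` avoids relying on a parser's guess about whether the day or the month comes first.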

When working with personal data, UK GDPR requires a lawful basis for processing, and you must implement data minimisation — collecting only what is necessary for the stated purpose. Pseudonymisation and anonymisation techniques should be applied as early as possible in the pipeline. The ICO (Information Commissioner's Office) provides detailed guidance on using personal data for analytics and machine learning.
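One common way to pseudonymise a direct identifier is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library; the key, the record, and the field names are hypothetical, and it is an illustration of the idea rather than a statement of what the ICO requires.

```python
import hashlib
import hmac

# Hypothetical secret: in practice it would be loaded from a secrets manager
# and stored separately from the analytics dataset.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(identifier, key=SECRET_KEY):
    """Replace a direct identifier with a keyed SHA-256 token (HMAC)."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


record = {"email": "jane.doe@example.com", "postcode": "SW1A 1AA", "orders": 7}

# Data minimisation: keep only the fields the analysis needs.
analytics_row = {
    "customer_token": pseudonymise(record["email"]),
    "postcode_outward": record["postcode"].split()[0],  # coarser location only
    "orders": record["orders"],
}
```

The same email always maps to the same token, so records can still be joined. Note that pseudonymised data generally remains personal data under UK GDPR while re-identification is possible, so this reduces risk without removing the compliance obligations.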

Stage 3: Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of systematically investigating the structure, patterns, and anomalies in your data before building any models. Coined by statistician John Tukey in the 1970s, EDA is fundamentally about letting the data speak before imposing assumptions. It is the stage where you develop intuition about what the data contains, what relationships exist between variables, and what surprises might derail your modelling efforts later.

EDA serves multiple purposes: it validates the data cleaning work from the previous stage, reveals the distributions and relationships that will inform feature engineering, identifies potential data leakage, highlights multicollinearity, and generates hypotheses about predictive features. Skipping or rushing EDA is one of the most common mistakes in data science projects — it leads to models built on incorrect assumptions and features that do not behave as expected in production.

The core techniques of EDA fall into two categories: statistical summaries and visual exploration. Both are essential, and they complement each other. A correlation coefficient tells you the strength and direction of a linear relationship; a scatter plot shows you whether the relationship is actually linear or whether the correlation is being driven by a cluster of outliers.

Key EDA Techniques

Technique	What It Reveals	Tools
Descriptive statistics	Central tendency, spread, and shape of each variable (mean, median, standard deviation, skewness, kurtosis)	pandas .describe(), NumPy, R summary()
Correlation matrix	Linear relationships between numeric features; identifies multicollinearity and redundant features	pandas .corr(), seaborn heatmap, corrplot in R
Distribution plots	Shape of individual variables — normal, skewed, bimodal, or uniform distributions	Histograms, KDE plots, Q-Q plots (matplotlib, seaborn, ggplot2)
Box plots	Median, quartiles, and outliers for numeric variables; comparison across categories	seaborn boxplot, matplotlib, plotly
Pair plots	Pairwise relationships between all numeric features simultaneously; clusters and separability	seaborn pairplot, GGally::ggpairs in R
Value counts and bar charts	Frequency distribution of categorical variables; class balance of the target variable	pandas .value_counts(), matplotlib bar charts
Missing value heatmap	Patterns in missing data — whether missingness is random or systematic	missingno library, seaborn heatmap
Time series decomposition	Trend, seasonality, and residual components for temporal data	statsmodels seasonal_decompose, prophet

EDA Is Not Optional — It Is Where You Build Understanding

Many data scientists, particularly those early in their careers, are tempted to skip EDA and jump straight to modelling. This is a costly mistake. EDA is where you discover that your target variable is heavily imbalanced, that two features have a correlation of 0.98 (meaning one is largely redundant and can usually be dropped), that a date column contains values from the year 1900 (indicating a default placeholder), or that a categorical feature has 5,000 unique values (requiring a careful encoding strategy). Every hour invested in EDA saves multiple hours of debugging models that do not perform as expected.
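The four discoveries described above can each be turned into a quick check. The sketch below uses only the standard library and a tiny hypothetical dataset; with pandas, `.value_counts()`, `.corr()`, and `.nunique()` would cover the same ground. The 0.98 threshold is taken from the example above, and every column name is made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated(columns, threshold=0.98):
    """Return feature pairs whose absolute correlation meets the threshold."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) >= threshold]


# Hypothetical dataset, stored column by column.
numeric = {
    "order_value": [20.0, 35.0, 50.0, 65.0, 80.0, 95.0],
    "order_value_inc_vat": [24.0, 42.0, 60.0, 78.0, 96.0, 114.0],
    "days_since_purchase": [5.0, 40.0, 12.0, 90.0, 3.0, 30.0],
}
churned = [0, 0, 0, 0, 0, 1]
signup_year = [2019, 1900, 2021, 2020, 1900, 2022]
product_code = ["A1", "B7", "C3", "D9", "E2", "F5"]

class_counts = Counter(churned)                       # class balance of the target
minority_share = min(class_counts.values()) / len(churned)
redundant_pairs = highly_correlated(numeric)          # near-duplicate features
placeholder_years = signup_year.count(1900)           # suspicious default dates
cardinality = len(set(product_code))                  # unique category count
```

Here the VAT-inclusive column is a fixed multiple of the net column, so the pair is flagged as redundant, while the unrelated recency feature is not.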

Stage 4: Feature Engineering and Model Building

Feature engineering is the art and science of creating input variables that make machine learning algorithms work effectively. Raw data rarely comes in a form that algorithms can use directly. Dates need to be decomposed into day-of-week, month, and quarter. Text needs to be vectorised. Categorical variables need to be encoded. Numeric features may need to be scaled, binned, or transformed. The quality of your features has a far greater impact on model performance than the choice of algorithm — as the saying goes, "better data beats better algorithms."

Feature Engineering Techniques

The most common feature engineering techniques include one-hot encoding for nominal categorical variables (e.g., converting "region" into binary columns for each UK region), ordinal encoding for ordered categories (e.g., education level), label encoding for binary categories, feature scaling (standardisation or min-max normalisation) for algorithms sensitive to magnitude (SVM, KNN, neural networks), and log or Box-Cox transformations for heavily skewed numeric features.
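The techniques listed above are short enough to write out by hand, which can help show what library transformers are doing. This is a plain-Python sketch with hypothetical data and function names; in practice a library such as scikit-learn would normally be used, fitted on the training split only.

```python
import math
import statistics


def one_hot(values, prefix):
    """One binary column per category, for nominal variables."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


def ordinal(values, order):
    """Map ordered categories to integers using an explicit ranking."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw columns.
regions = ["London", "Wales", "Scotland", "London"]
education = ["GCSE", "Degree", "A-level", "GCSE"]
order_values = [12.0, 15.0, 18.0, 950.0]  # heavily right-skewed

region_columns = one_hot(regions, "region")
education_rank = ordinal(education, ["GCSE", "A-level", "Degree"])
scaled = standardise(order_values)
normalised = min_max(order_values)
logged = [math.log1p(v) for v in order_values]  # log transform compresses the tail
```

The ordinal ranking is passed in explicitly because the correct order is domain knowledge that cannot be inferred from the strings themselves.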

Domain-specific feature creation often provides the greatest lift. For a UK e-commerce churn model, raw transaction data might yield features such as: days since last purchase, average order value over 90 days, number of returns as a percentage of orders, weekend vs weekday purchase ratio, and whether the customer has ever used a discount code. These engineered features capture behavioural patterns that raw columns cannot.
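The features named above could be derived from an order history roughly as follows. The record layout (`order_date`, `value_pence`, `returned`, `used_discount`) is an assumed schema for illustration, and the weekend measure is expressed as a share of all orders rather than a ratio so that customers with no weekday orders do not cause a division by zero.

```python
from datetime import date


def churn_features(orders, as_of):
    """Build behavioural features for one customer from their order history."""
    last_order = max(o["order_date"] for o in orders)
    recent = [o for o in orders if 0 <= (as_of - o["order_date"]).days <= 90]
    weekend = sum(1 for o in orders if o["order_date"].weekday() >= 5)
    return {
        "days_since_last_purchase": (as_of - last_order).days,
        "avg_order_value_90d_pence": (
            sum(o["value_pence"] for o in recent) / len(recent) if recent else 0.0
        ),
        "return_rate": sum(o["returned"] for o in orders) / len(orders),
        "weekend_share": weekend / len(orders),
        "ever_used_discount": any(o["used_discount"] for o in orders),
    }


# Hypothetical order history for a single customer.
orders = [
    {"order_date": date(2025, 6, 21), "value_pence": 4000,
     "returned": False, "used_discount": False},
    {"order_date": date(2025, 5, 14), "value_pence": 6000,
     "returned": True, "used_discount": True},
    {"order_date": date(2025, 1, 10), "value_pence": 2000,
     "returned": False, "used_discount": False},
]

features = churn_features(orders, as_of=date(2025, 6, 30))
```

Passing `as_of` explicitly, instead of calling `date.today()` inside the function, keeps the features reproducible and makes it harder to leak information from after the prediction date.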

Choosing the Right Algorithm

Once features are prepared, the next step is selecting and training a model. The choice of algorithm depends on the problem type (regression, classification, clustering, time series), the size and dimensionality of the data, interpretability requirements, and computational constraints. The table below compares the most commonly used algorithms across data science teams.

Common Machine Learning Algorithms Compared

Algorithm	Problem Type	Strengths	Limitations
Linear Regression	Regression	Simple, fast, highly interpretable; good baseline	Assumes linear relationships; sensitive to outliers
Logistic Regression	Classification	Interpretable coefficients; probability outputs; fast training	Assumes linear decision boundary; struggles with complex patterns
Decision Trees	Regression and classification	Easy to interpret and visualise; handles non-linear data well	Prone to overfitting; unstable with small data changes
Random Forest	Regression and classification	Robust to overfitting; can handle missing values in some implementations; feature importance ranking	Slower to train; less interpretable than single trees
Gradient Boosting (XGBoost, LightGBM)	Regression and classification	State-of-the-art tabular performance; handles imbalanced data well	Requires careful hyperparameter tuning; risk of overfitting on small datasets
Support Vector Machines	Classification	Effective in high-dimensional spaces; robust with clear margins	Computationally expensive on large datasets; requires feature scaling
K-Nearest Neighbours	Regression and classification	Simple, no training phase; effective for small datasets	Slow at prediction time; sensitive to irrelevant features and scale
K-Means Clustering	Clustering	Fast, scalable; easy to interpret cluster assignments	Must specify K in advance; assumes spherical clusters
Neural Networks	Regression and classification	Captures highly complex non-linear patterns; state-of-the-art for unstructured data	Requires large datasets; computationally expensive; low interpretability

In practice, many UK data science teams follow a progressive complexity approach: start with a simple, interpretable baseline model (logistic regression or a decision tree), establish benchmark performance, and only move to more complex models if the business value justifies the additional complexity and reduced interpretability. This is especially important in regulated UK sectors such as financial services, where the FCA expects firms to be able to explain model decisions, and in healthcare, where the MHRA and NHS England (which absorbed NHS Digital in 2023) expect auditability.

Stage 5: Model Evaluation and Deployment

Building a model that performs well on training data is the beginning, not the end. The true test of any data science model is how well it generalises to unseen data and how effectively it integrates into business operations. Model evaluation is the rigorous process of measuring performance using appropriate metrics, validating generalisability, and testing the model under conditions that simulate real-world deployment.

Evaluation Metrics

The choice of evaluation metric must align with the business objective, not with what looks best on paper. For classification problems, accuracy is often misleading — a model that predicts "no fraud" for every transaction achieves 98% accuracy on a dataset where only 2% of transactions are fraudulent, but it is completely useless. Instead, you should evaluate precision (of the cases the model flagged as positive, how many were correct?), recall (of all actual positives, how many did the model catch?), and the F1-score (the harmonic mean of precision and recall). For regression problems, the key metrics are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.
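The fraud example can be reproduced directly. The sketch below computes the metrics from first principles using only the standard library; the function names and the sample predictions are hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen for this example.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Undefined ratios (nothing flagged, or no positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R-squared for numeric predictions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mae": sum(abs(e) for e in errors) / len(errors),
            "rmse": math.sqrt(ss_res / len(errors)),
            "r2": 1 - ss_res / ss_tot}


# 1,000 transactions, of which 2% are fraudulent.
y_true = [1] * 20 + [0] * 980
always_no_fraud = classification_metrics(y_true, [0] * 1000)
catches_half = classification_metrics(y_true, [1] * 10 + [0] * 990)
```

The always-negative model scores 98% accuracy with zero recall, while a model that catches half the fraud with no false alarms is only one percentage point more accurate but clearly far more useful.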

Cross-validation is the standard technique for estimating how a model will perform on unseen data. K-fold cross-validation splits the dataset into K subsets, trains the model K times (each time holding out a different subset for testing), and averages the results. Stratified K-fold preserves the class distribution in each fold, which is critical for imbalanced datasets. Time-series data requires a different approach — walk-forward validation — where the training set always precedes the test set chronologically.
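The difference between the two splitting schemes is easiest to see in the indices they produce. The generators below are a simplified sketch with hypothetical names: the K-fold version uses contiguous folds without shuffling or stratification, and the walk-forward version uses an expanding training window. scikit-learn offers `KFold`, `StratifiedKFold`, and `TimeSeriesSplit` for production use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def walk_forward_indices(n_samples, n_splits, test_size):
    """Yield (train, test) index lists where training always precedes testing."""
    if n_splits * test_size >= n_samples:
        raise ValueError("not enough samples to leave any training data")
    for split in range(n_splits):
        test_start = n_samples - (n_splits - split) * test_size
        train = list(range(test_start))  # expanding window of past data
        test = list(range(test_start, test_start + test_size))
        yield train, test


for train, test in walk_forward_indices(n_samples=10, n_splits=3, test_size=2):
    print(f"train up to index {train[-1]}, test on {test}")
```

For non-temporal data the rows would normally be shuffled before K-fold splitting; for time series the original order must be kept so that no future observation leaks into training.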

Moving From Evaluation to Production

Deployment is where many data science projects stall. A model in a Jupyter notebook is not delivering business value — it needs to be packaged, deployed, monitored, and maintained. Common deployment patterns include REST APIs (using Flask, FastAPI, or Django REST Framework), batch prediction pipelines (Airflow, Prefect, or cloud-native schedulers), embedded models in existing applications, and real-time scoring via streaming platforms.

Model drift is a critical post-deployment concern. The real world changes — customer behaviour shifts, market conditions evolve, and the statistical relationships your model learned from historical data may degrade over time. Monitoring for data drift (changes in input feature distributions) and concept drift (changes in the relationship between features and the target) is essential. UK financial services firms are expected by the FCA to maintain ongoing model risk management, including regular back-testing and revalidation schedules.
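One widely used data drift measure is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. The sketch below is a minimal version with equal-width bins; the function name, bin count, and sample data are assumptions, and the often-quoted alert levels of about 0.1 and 0.25 are rules of thumb rather than a regulatory standard.

```python
import math


def population_stability_index(expected, actual, n_bins=5, eps=1e-6):
    """PSI between a baseline sample and a new sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # out-of-range values go to edge bins
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Hypothetical feature values at training time and after a shift in production.
baseline = list(range(100))
shifted = [v + 40 for v in baseline]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

PSI covers input drift only. Detecting concept drift additionally requires comparing predictions against outcomes once the true labels become available.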

Model Deployment Checklist for UK Data Science Teams

Model validation: Cross-validation results consistent with holdout test set performance; no evidence of data leakage.
Bias and fairness audit: Model tested for discriminatory outcomes across protected characteristics under the Equality Act 2010.
UK GDPR compliance: Data processing lawful basis documented; automated decision-making subject to Article 22 review; data subject rights accommodated.
Interpretability: SHAP values, LIME explanations, or feature importance rankings available for stakeholder review.
Monitoring pipeline: Automated alerts for data drift, concept drift, and performance degradation configured.
Rollback plan: Previous model version retained and deployable within minutes if the new model underperforms.
Documentation: Model card documenting training data, features, known limitations, evaluation results, and intended use cases.
A/B testing plan: New model tested against the current baseline on a subset of live traffic before full rollout.
 

The Iterative Nature of the Workflow

It is important to emphasise that the data science workflow is not a one-pass, linear process. Evaluation results frequently send you back to earlier stages. A model with poor recall might indicate that key features are missing from the data (back to data collection), that the target variable is poorly defined (back to business understanding), or that feature engineering did not capture the right patterns (back to feature engineering). CRISP-DM explicitly acknowledges this iterative nature — the arrows in the framework point both forwards and backwards between stages.

Mature data science teams build this iteration into their project timelines. They plan for multiple modelling cycles, budget time for data quality issues that will inevitably surface, and treat the first model as a learning exercise rather than the final deliverable. The most successful UK data science projects are those where the team spends more time understanding the problem and preparing the data than they spend on the modelling itself.

Start Your Data Science Career

The data science workflow is a structured, learnable process — and mastering it opens doors to one of the UK's fastest-growing and highest-paying career paths. Our accredited Data Science course takes you from foundational statistics through to model deployment, with hands-on projects using real-world datasets and industry-standard tools.

