5 Scikit-learn Pipeline Tricks to Supercharge Your Workflow - MachineLearningMastery.com

https://machinelearningmastery.com/5-scikit-learn-pipeline-tricks-to-supercharge-your-workflow/

![](https://machinelearningmastery.com/wp-content/uploads/2025/08/mlm-ipc-5-scikit-learn-pipeline-tricks-supercharge.png)

Image by Editor | ChatGPT

## Introduction

Pipelines are among the most underrated yet powerful features of scikit-learn, and a great ally for building effective, modular machine learning workflows. They streamline the entire process, from data preparation and feature engineering to modeling, fine-tuning, and validation, while mitigating the risk of data leakage, making the code reproducible, and keeping it cleaner and easier to maintain.

In this article, we walk through five pipeline tricks, each illustrated with a concise intermediate- to advanced-level use case, to level up your ongoing machine learning projects.

## Initial Setup

The following code is shared by several of the examples below, so run these preparatory steps first. Most examples use the popular Titanic survival dataset:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml

# Load the dataset (Titanic survival)
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data[["pclass", "sex", "age", "fare"]]
y = titanic.target == "1"

# Split the dataset into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Select features by type
num_features = ["age", "fare"]
cat_features = ["pclass", "sex"]
```

Time to get started with the core of this hands-on article!

## 1. ColumnTransformer to Handle Mixed Data Types

In the first example, we instantiate a ColumnTransformer object to define a robust data preprocessing pipeline. This class applies different transformations to different feature subsets in a unified manner, which makes it straightforward to process mixed data types and missing values without burdensome operations or repetitive code.

```python
# Instantiate a ColumnTransformer for robust data preprocessing
preprocessor = ColumnTransformer([
    # Numerical features: impute with the median, then standardize
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    # Categorical features: impute with the mode, then one-hot encode
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features)
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy:", pipe.score(X_test, y_test))
```

This approach handles numerical features, categorical features, and missing values together, integrating them into one overall pipeline before training the logistic regression classifier.

## 2. Feature Engineering with a Custom Transformer

Custom transformers in scikit-learn let us define our own feature-level transformation steps (be it feature engineering or preprocessing) and inject them directly into a pipeline. The usual route is to inherit from BaseEstimator and TransformerMixin: we define our own fit and transform methods and, in exchange, TransformerMixin provides fit_transform() for free.

The following code defines a custom transformer class through inheritance. It maps the "age" numerical feature to a binary (0 or 1) value indicating whether the passenger is an adult. We then incorporate this custom transformer into a ColumnTransformer like the one defined in the previous use case.

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that derives a binary "is_adult" feature from age
class IsAdult(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless: nothing to learn from the training data
        return self

    def transform(self, X):
        # Missing ages are filled with 0, so they are treated as non-adults
        return (X["age"].fillna(0) >= 18).astype(int).to_frame("is_adult")

# Extended ColumnTransformer that incorporates the custom transformer
extended = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features),
    ("is_adult", IsAdult(), ["age"])
])

pipe = Pipeline([
    ("preprocessor", extended),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy with custom feature:", pipe.score(X_test, y_test))
```

## 3. Hyperparameter Tuning Across the Entire Pipeline

Hyperparameter tuning, that is, finding the best configuration among many options, is not limited to the machine learning model's own settings. It can also cover choices made in earlier preprocessing steps, as the following example shows:

```python
from sklearn.svm import SVC

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", SVC())
])

param_grid = {
    # Preprocessing choice: path is step__transformer__inner-step__parameter
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    # Model hyperparameters
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"]
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best score:", search.best_score_)
```

Note that preprocessor is the ColumnTransformer defined in example 1. The key point is that the search grid includes a hyperparameter belonging to a preprocessing step, namely the missing-value imputation strategy.
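
The double-underscore parameter name used in the grid can look opaque, so here is a minimal sketch of where it comes from and how to compare the preprocessing options after the search. To run offline, it assumes a small synthetic stand-in for the Titanic columns (`X_demo` and `y_demo` are hypothetical and not part of the original example), and it simplifies the categorical branch to a bare OneHotEncoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: values are made up for illustration)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "fare": rng.exponential(30, n),
    "pclass": rng.integers(1, 4, n),
    "sex": rng.choice(["male", "female"], n),
})
X_demo.loc[rng.choice(n, 30, replace=False), "age"] = np.nan  # inject missing ages
y_demo = ((X_demo["sex"] == "female") | (X_demo["fare"] > 40)).to_numpy()

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pclass", "sex"]),
])
pipe = Pipeline([("preprocessor", preprocessor), ("model", SVC())])

# Each "__" descends one level:
# pipeline step -> ColumnTransformer entry -> inner pipeline step -> parameter
key = "preprocessor__num__imputer__strategy"
print(key in pipe.get_params())  # valid grid keys are exactly the names listed here

search = GridSearchCV(pipe, {key: ["mean", "median"], "model__C": [0.1, 1, 10]}, cv=3)
search.fit(X_demo, y_demo)

# Average the cross-validated score per imputation strategy
results = pd.DataFrame(search.cv_results_)
by_strategy = results.groupby("param_" + key)["mean_test_score"].mean()
print(by_strategy)
```

Listing `pipe.get_params()` is a quick way to discover valid grid keys when a pipeline is deeply nested, and grouping `cv_results_` by one parameter gives a rough view of how much that preprocessing choice matters on its own.
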
In other words, candidate models differ not only in the hyperparameters of the model itself but also in specific settings of the preprocessing steps. With the grid above, that means 2 × 3 × 2 = 12 combinations, each evaluated with 3-fold cross-validation.

## 4. Integrating Feature Selection into a Pipeline

Another powerful technique, especially for datasets with many features, is to perform feature selection inside the pipeline to keep the final model simpler. This example adds a SelectKBest step to the overarching pipeline so that only the highest-scoring preprocessed features reach the model (once again, we use the same preprocessor instance defined in earlier examples):

```python
from sklearn.feature_selection import SelectKBest, f_classif

pipe = Pipeline([
    ("preprocessor", preprocessor),
    # Keep the 5 preprocessed features with the highest ANOVA F-scores
    ("feature_selection", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```

The SelectKBest class requires a scoring function to determine the top-k features to retain for model training. Another possible choice is score_func=chi2, which applies a chi-squared test and is useful when categorical features dominate. Note that chi2 requires non-negative feature values, so it cannot be applied directly to standardized numerical columns such as those produced by the StandardScaler in this preprocessor.

## 5. Stacked Pipelines

Our last example shows how to stack multiple pipelines to build an ensemble machine learning solution. Pipelines are a great way to design highly customizable ensembles in which different models, sometimes with distinct preprocessing steps, need to be trained and combined without risking inconsistencies in how the data is handled.

In the example below, two "overarching" pipelines are defined: one preprocesses the data and trains a logistic regression classifier, and the other applies the same preprocessing (for simplicity; it could have been different) but trains a decision tree instead.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

# Base pipeline 1: preprocessing + logistic regression
log_reg_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("logreg", LogisticRegression(max_iter=1000))
])

# Base pipeline 2: same preprocessing + decision tree
tree_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("tree", DecisionTreeClassifier(max_depth=5))
])

# The final estimator learns how to combine the base pipelines' predictions
stack = StackingClassifier(
    estimators=[("lr", log_reg_pipe), ("dt", tree_pipe)],
    final_estimator=LogisticRegression()
)

stack.fit(X_train, y_train)
print("Stacked accuracy:", stack.score(X_test, y_test))
```

The two pipelines are then stacked using the StackingClassifier class. This class trains a final estimator on cross-validated predictions from the base models to learn how best to combine them, which often (though not always) yields a stronger, more generalizable model.

## Wrapping Up

This article showed five examples of what we can do with scikit-learn pipelines to make our machine learning workflows more effective, customizable, and, in some cases, better-performing. From preprocessing pipelines for mixed data types to hyperparameter tuning that extends into the preprocessing steps, these tricks can help take your machine learning modeling projects to the next level.