A comprehensive guide to building end-to-end ML pipelines using scikit-learn, featuring custom transformers and modular architecture.
In my Data Service Engineering course (COMP9321) at UNSW, I developed a complete machine learning pipeline that achieved a 95/100 score. This project demonstrates best practices in ML engineering, from data preprocessing to model deployment.
The project required building a robust ML pipeline that could:
I adopted a modular approach with clear separation of concerns:
Built reusable transformers following scikit-learn conventions:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_columns=None):
        self.feature_columns = feature_columns

    def fit(self, X, y=None):
        # Learn transformation parameters (nothing to learn in this skeleton)
        return self

    def transform(self, X):
        # Apply transformation
        # Placeholder: copy the input; replace with the actual transformation
        X_transformed = X.copy()
        return X_transformed
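To make the skeleton concrete, here is a minimal sketch of one possible transformer built on the same pattern. The class name `LogFeatureTransformer`, the `log1p` transformation, and the toy DataFrame are illustrative assumptions, not the transformers used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogFeatureTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: apply log1p to selected DataFrame columns."""

    def __init__(self, feature_columns=None):
        self.feature_columns = feature_columns

    def fit(self, X, y=None):
        # Resolve the target columns at fit time; default to every column.
        self.columns_ = list(
            X.columns if self.feature_columns is None else self.feature_columns
        )
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X_transformed = X.copy()
        X_transformed[self.columns_] = np.log1p(X_transformed[self.columns_])
        return X_transformed


# Toy data, made up for illustration
df = pd.DataFrame({"income": [0.0, 9.0, 99.0], "age": [20, 30, 40]})
out = LogFeatureTransformer(feature_columns=["income"]).fit_transform(df)
```

Storing constructor arguments unchanged and saving learned state in attributes with a trailing underscore (`columns_`) follows the scikit-learn convention, so `get_params`, `set_params`, and cloning inside a search work as expected.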
Implemented a comprehensive pipeline with multiple stages:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
    ('custom', custom_transformer, custom_features)
])

# Complete ML pipeline
ml_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', classifier)
])
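The snippet above leaves the individual transformers, the feature lists, the selector, and the classifier undefined. The following is a minimal runnable sketch that fills them with assumed stand-ins: median imputation plus `StandardScaler` for numeric columns, `OneHotEncoder` for categorical columns, `SelectKBest` as the selector, and a random forest as the classifier. The toy data is made up, and the custom transformer branch is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (illustrative only)
X = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 29, 38, 60],
    "income": [40.0, 52.0, 61.0, np.nan, 75.0, 43.0, 58.0, 90.0],
    "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
})
y = np.array([0, 0, 0, 1, 1, 0, 1, 1])

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric branch: fill missing values, then standardize
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical branch: one-hot encode, ignoring categories unseen during fit
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

ml_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selector", SelectKBest(f_classif, k=3)),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

ml_pipeline.fit(X, y)
predictions = ml_pipeline.predict(X)
```

Because preprocessing lives inside the pipeline, imputation statistics and scaling parameters are learned only from the data passed to `fit`, which avoids leaking information from validation folds during cross-validation.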
Systematic comparison of multiple algorithms:
Implemented advanced feature engineering:
Numerical Features:
Categorical Features:
Text Features:
Developed a custom feature selection method that adapts to different data types:
class AdaptiveFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, selection_method='auto', k=10):
        self.selection_method = selection_method
        self.k = k

    def fit(self, X, y):
        if self.selection_method == 'auto':
            # Automatically choose best method based on data characteristics.
            # _select_best_method (not shown) must return a fitted selector,
            # because transform() calls it directly.
            self.method_ = self._select_best_method(X, y)
        return self

    def transform(self, X):
        return self.method_.transform(X)
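The selection heuristic itself is not shown above. As a sketch under stated assumptions, the version below picks `chi2` when every value is non-negative (a requirement of that test) and falls back to the ANOVA F-test otherwise. This rule, the `_SCORE_FUNCS` mapping, and the explicit `'chi2'`/`'f_classif'` options are hypothetical and stand in for whatever the original `_select_best_method` did.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Hypothetical mapping of explicit method names to scoring functions
_SCORE_FUNCS = {"chi2": chi2, "f_classif": f_classif}


class AdaptiveFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, selection_method="auto", k=10):
        self.selection_method = selection_method
        self.k = k

    def _select_best_method(self, X, y):
        # Assumed heuristic: chi2 requires non-negative features,
        # otherwise use the ANOVA F-test.
        score_func = chi2 if np.min(X) >= 0 else f_classif
        return SelectKBest(score_func, k=self.k)

    def fit(self, X, y):
        X = np.asarray(X)
        if self.selection_method == "auto":
            selector = self._select_best_method(X, y)
        elif self.selection_method in _SCORE_FUNCS:
            selector = SelectKBest(_SCORE_FUNCS[self.selection_method], k=self.k)
        else:
            raise ValueError(f"unknown selection_method: {self.selection_method!r}")
        # Fit here so that transform() can be called right after fit()
        self.method_ = selector.fit(X, y)
        return self

    def transform(self, X):
        return self.method_.transform(np.asarray(X))


X, y = make_classification(n_samples=100, n_features=8, random_state=0)
mixed_sign = AdaptiveFeatureSelector(k=3).fit(X, y)           # data has negatives
non_negative = AdaptiveFeatureSelector(k=3).fit(np.abs(X), y)  # all values >= 0
X_selected = mixed_sign.transform(X)
```

Raising on an unknown `selection_method` also closes the gap in the skeleton, where any value other than `'auto'` would leave `method_` unset and make `transform` fail.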
Implemented stratified cross-validation with custom scoring:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer

def custom_scoring_function(y_true, y_pred):
    # Custom metric combining multiple objectives
    # (the computation of combined_score is omitted here)
    return combined_score

cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
custom_scorer = make_scorer(custom_scoring_function)
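The exact metric is not given above. As an illustration only, the sketch below assumes an equal-weight average of macro F1 and balanced accuracy, and evaluates it with `cross_val_score` on synthetic data. The weights, the two component metrics, and the logistic regression model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score


def custom_scoring_function(y_true, y_pred):
    # Assumed combination: equal-weight mean of macro F1 and balanced accuracy
    f1 = f1_score(y_true, y_pred, average="macro")
    balanced_acc = balanced_accuracy_score(y_true, y_pred)
    combined_score = 0.5 * f1 + 0.5 * balanced_acc
    return combined_score


X, y = make_classification(n_samples=200, n_features=10, random_state=42)

cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
custom_scorer = make_scorer(custom_scoring_function)  # higher is better by default

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=cv_strategy, scoring=custom_scorer,
)
```

`make_scorer` treats larger values as better unless `greater_is_better=False` is passed, so a loss-style metric would need that flag.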
Systematic hyperparameter tuning using grid search and random search:
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'classifier__n_estimators': [100, 200, 500],
    'classifier__max_depth': [3, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10],
    # Assumes the 'scaler' step is a custom transformer with a `method` parameter;
    # scikit-learn's built-in scalers do not expose one
    'preprocessor__num__scaler__method': ['standard', 'minmax', 'robust']
}

random_search = RandomizedSearchCV(
    ml_pipeline,
    param_distributions,
    n_iter=100,
    cv=cv_strategy,
    scoring=custom_scorer,
    n_jobs=-1,
    random_state=42  # make the sampled configurations reproducible
)
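The `scaler__method` key above presumably targets a custom scaler step that is not shown. With only built-in scalers, one common alternative is to treat the whole step as a hyperparameter and let the search swap in different scaler objects. The sketch below shows that variant on synthetic data; the simplified two-step pipeline, the reduced value lists, and the `accuracy` scoring are assumptions chosen to keep it small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

param_distributions = {
    # Replace the entire 'scaler' step with one of these estimators
    "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
    "classifier__n_estimators": [20, 50],
    "classifier__max_depth": [3, 5, None],
    "classifier__min_samples_split": [2, 5, 10],
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,  # kept small so the sketch runs quickly
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X, y)
best_params = random_search.best_params_
```

Parameter names use the `step__parameter` convention, and a bare step name such as `"scaler"` sets the step itself. When every entry is a list, the search samples from the grid without replacement, so `n_iter` should not exceed the number of distinct combinations.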
Built a Flask-based API for model serving:
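import joblib
import pandas as pd
from flask import Flask, request
from flask_restx import Api, Resource

app = Flask(__name__)
api = Api(app)

# Load the serialized pipeline once at startup
model = joblib.load('model_pipeline.pkl')

@api.route('/predict')
class PredictionResource(Resource):
    def post(self):
        # Assumes the request body is a JSON list of records: [{"feature": value, ...}]
        data = pd.DataFrame(request.get_json())
        prediction = model.predict(data)
        return {'prediction': prediction.tolist()}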
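A pipeline trained on a DataFrame expects the same columns at prediction time, so the JSON payload usually needs checking before it reaches `predict`. The helper below is a hypothetical sketch of that step; the function name `payload_to_frame`, the accepted payload shapes, and the error messages are assumptions rather than part of the original API.

```python
import pandas as pd


def payload_to_frame(payload, expected_columns):
    """Convert a JSON payload (one record or a list of records) to a DataFrame."""
    records = payload if isinstance(payload, list) else [payload]
    if not records or not all(isinstance(record, dict) for record in records):
        raise ValueError("payload must be a JSON object or a non-empty list of objects")

    frame = pd.DataFrame(records)
    missing = [col for col in expected_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"missing features: {missing}")

    # Keep only the expected columns, in the order used during training
    return frame[list(expected_columns)]


frame = payload_to_frame(
    {"age": 30, "income": 50.0, "extra": 1},
    expected_columns=["income", "age"],
)
```

Inside the endpoint, a `ValueError` raised here could be turned into a 400 response so that malformed requests do not surface as server errors.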
Implemented robust model serialization:
import joblib

# Save complete pipeline
joblib.dump(ml_pipeline, 'model_pipeline.pkl')

# Load for inference
loaded_pipeline = joblib.load('model_pipeline.pkl')
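A quick way to check serialization is a round trip: save the fitted pipeline, load it back, and confirm that both objects predict identically. The sketch below does this on synthetic data with a small stand-in pipeline and a temporary directory, all of which are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pipeline.pkl")
    joblib.dump(pipeline, path)          # save the fitted pipeline
    loaded_pipeline = joblib.load(path)  # load it back for inference

# The reloaded pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(pipeline.predict(X), loaded_pipeline.predict(X))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources, and loading is most reliable with the same scikit-learn version that was used for saving.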
Problem: Inconsistent data formats and missing values. Solution: Robust preprocessing pipeline with automatic data type detection.
Problem: Manual feature engineering was time-consuming. Solution: Automated feature generation with validation.
Problem: Complex models were difficult to explain. Solution: SHAP values and feature importance analysis.
Problem: Pipeline needed to handle varying data sizes. Solution: Memory-efficient processing and batch prediction.
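The batch prediction mentioned in the last item can be sketched as a generator that feeds the model one slice at a time, so only a single chunk of predictions is materialized per step. The function name `predict_in_batches`, the batch size, and the synthetic data are assumptions; the project's actual implementation is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def predict_in_batches(model, X, batch_size=1000):
    """Yield predictions for consecutive slices of X."""
    for start in range(0, len(X), batch_size):
        yield model.predict(X[start:start + batch_size])


X, y = make_classification(n_samples=250, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Concatenated here only to compare against a single full-size call;
# a streaming consumer would process each batch as it arrives.
batched = np.concatenate(list(predict_in_batches(model, X, batch_size=64)))
```

The same pattern works with a pandas DataFrame if the slice is taken with `X.iloc[start:start + batch_size]`.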
This machine learning pipeline project demonstrated the importance of a systematic approach to ML engineering. Key takeaways include:
Technical Excellence:
Problem-Solving Skills:
Best Practices:
The 95/100 score reflected not just model performance but also the overall engineering quality and adherence to best practices. This project serves as a foundation for more advanced ML engineering challenges.
This project showcases the intersection of data science and software engineering, demonstrating how proper ML engineering practices lead to robust, scalable solutions.