From Notebook to AI-Augmented MLOps: Predicting Retail Customer Churn in 3 Phases π
By Yves Denis Deffo
Introduction π§
You’ve trained a model. It works great on your laptop. You ship it. Six months later, nobody’s maintained it, the predictions are garbage, and your data scientist has moved on. Sound familiar?

Every ML team at some point. Don’t be this dog. πΆπ₯
That’s exactly the problem this project tackles β head on, in three progressive phases. We’re building a customer churn prediction system for retail, starting from a messy Jupyter notebook and ending with an autonomous AI agent that monitors drift and retrains the model without you lifting a finger.
Here’s the full source: website-projects-articles / mlops / retail-churn-prediction
Let’s go. π
The Big Picture πΊοΈ
Three phases, each leveling up the maturity of the system:
| Phase | Where | What’s new |
|---|---|---|
| 1 β Local π₯οΈ | Laptop | MLflow, DVC, FastAPI, Docker |
| 2 β Cloud βοΈ | AWS | ECS, Terraform, CI/CD, drift alerts |
| 3 β AI Agent π€ | AWS + Claude | Autonomous drift triage and retraining |

Phase 1 β Getting Out of Notebook Hell π₯οΈ
The Dataset
We’re using the classic Telco Customer Churn dataset β 19 features describing retail customers (contract type, monthly charges, tenure, internet service, etc.) with a binary Churn label.
The first move is breaking the notebook monolith into proper modules:
phase-1/src/
βββ preprocess.py # feature engineering
βββ train.py # CLI training pipeline
βββ evaluate.py # metrics
βββ api/main.py # FastAPI prediction service
π οΈ The Training Pipeline
train.py is a CLI script β no notebook kernels dying mid-run, no hidden state, no “just re-run cells 3 through 7”:
# stratified split to preserve churn ratio
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# preprocessor fitted on TRAINING data only β no leakage π―
preprocessor = build_preprocessor()
preprocessor.fit(X_train)
model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
model.fit(preprocessor.transform(X_train), y_train)
The preprocessor handles:
StandardScaleron numerical features (tenure, MonthlyCharges, TotalCharges)OneHotEncoderon 15 categorical features (gender, Contract, PaymentMethod, etc.)
β The Quality Gate
This is the part most teams skip and then regret:
if roc_auc < 0.78:
print(f"β Quality gate failed: ROC-AUC {roc_auc:.4f} < 0.78")
sys.exit(1)
If your model doesn’t clear ROC-AUC β₯ 0.78, the pipeline exits with an error. CI fails. Nothing deploys. This one line saves you from shipping a garbage model at 3am.
π¦ MLflow + DVC
- MLflow tracks every run locally (
mlruns/): params, ROC-AUC, F1, precision, recall. Browse them withmake mlflowβ UI on port 5001. - DVC versions the data and model artifacts so you can reproduce any previous run.
π The FastAPI Service
@app.post("/predict")
async def predict(customer: CustomerFeatures):
prob = model.predict_proba(X)[0][1]
label = "Churned" if prob > 0.5 else "Retained"
risk = "high" if prob > 0.7 else "medium" if prob > 0.4 else "low"
return {"probability": prob, "label": label, "risk_category": risk}
Three-tier risk output β high/medium/low β because “73% churn probability” means more to a business analyst when you call it high risk.
Run it all with Docker Compose: make serve.
Phase 2 β Taking It to AWS βοΈ
Local works great. But you can’t share localhost:8000 with your stakeholders. Time to go cloud.

What if I told you… your laptop was never the deploy target. π
ποΈ The Infrastructure (Terraform)
The whole AWS setup is declarative Terraform β 11 modules, ~$25/month:

Key decisions worth calling out:
- ECS Fargate, not SageMaker β SageMaker is powerful but pricey. FastAPI on Fargate gives you the same HTTP endpoint for a fraction of the cost. πΈ
- RDS t4g.micro β ARM-based, tiny, cheap. PostgreSQL 16 storing every prediction as JSONB for later drift analysis.
- SSM Parameter Store for secrets β
DATABASE_URLnever touches environment variables directly. ECS fetches it at runtime. - GitHub OIDC β no long-lived credentials β the GitHub Actions role assumes an AWS role via OIDC. Zero static access keys to rotate or accidentally commit. π
π CI/CD in 3 Jobs
# .github/workflows/ci-cd.yml
jobs:
evaluate: # train + quality gate + push artifacts to S3
build: # docker build + push to ECR (tagged with commit SHA)
deploy: # update ECS task definition + wait for stability
The SHA tag is clutch β every deploy is traceable to the exact commit that produced it. latest is there for convenience but SHA is what actually matters.
π Drift Monitoring with Evidently
Once predictions start logging to PostgreSQL, drift_report.py compares the live distribution against the training baseline using Evidently:
# load training baseline vs recent predictions from DB
reference = load_reference()
current = load_current() # queries PostgreSQL
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
When drift is detected, an SNS message fires to Slack: “X% of features drifted, recommend retraining.”
Manual. Better than nothing. But a human still has to decide what to do. That changes in Phase 3. π
Phase 3 β The AI Agent Takes the Wheel π€
Drift alert fires. You get the Slack ping. You open the drift report. You squint at the p-values. You decide whether to retrain. You pick hyperparameters. You run the pipeline. You check the new metrics.
What if Claude just… did all of that for you?

Sorry manual ops, I’ve moved on. π
π§ What is an Agent?
A plain language model reads text and outputs text. An agent is a language model that can also take actions β call tools, get results, decide what to do next, repeat.
Claude doesn’t run the tools itself. It decides which tool to call with what arguments. Python runs the actual function and feeds the result back. Claude reads it, thinks, decides the next tool call. The loop continues until Claude says it’s done.
π§ The Four Tools
TOOLS = [
get_drift_summary, # parse drift_report.json β feature drift scores
get_current_model_metrics, # query MLflow β ROC-AUC, F1, hyperparams
run_training, # subprocess: train.py with chosen n_estimators, max_depth
save_remediation_report # write markdown audit log
]
π The Agentic Loop
messages = [initial_task]
while True:
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
tools=TOOLS,
messages=messages
)
if response.stop_reason == "end_turn":
return response.content[0].text # done β
if response.stop_reason == "tool_use":
results = execute_tools(response)
messages.append(tool_results)
continue # next turn
π― The Decision Framework
Claude’s system prompt gives it explicit thresholds:
| Drift Share | Action |
|---|---|
| < 20% | Monitor only, no retraining |
| 20β40% | Evaluate β retrain only if metrics improve |
| > 40% | Retrain π¨ |
Hyperparameter selection logic is baked in too:
- Numerical features drifted β increase
n_estimators(200β300) - Categorical features drifted β increase
max_depth(10β12) - Both β
n_estimators=200, max_depth=10as starting point
π¬ A Real Agent Run (42% Drift)
Here’s what a typical execution looks like β 5 turns, zero human decisions:

The model that gets deployed is traceable to a specific drift event, specific hyperparameters, and a specific reasoning chain. Every run writes a markdown audit report automatically. π
π‘οΈ Safety
The agent isn’t running unsupervised in production. It runs when you trigger make agent β it handles the triage and execution, but the deploy-to-prod step still has a human gate. Up to 2 retries if the quality gate fails, never exceeds 10 turns total.
Results & Key Lessons π
Running all three phases end to end:
- β ROC-AUC maintained β₯ 0.78 across all phases
- β Zero manual operational steps by Phase 3
- β AWS bill under $30/month (Fargate + t4g.micro RDS + S3)
- β LLM-based drift diagnosis correct in 4/5 test scenarios
The biggest lessons:
1. Quality gates are non-negotiable. That sys.exit(1) is worth more than any amount of manual review.
2. Fit your preprocessor on training data only. Obvious in hindsight, catastrophic if missed.
3. Don’t over-engineer early. Phase 1 is deliberately simple. Phase 2 adds only what Phase 1 can’t do. Phase 3 adds only what Phase 2 can’t do.
4. AI agents aren’t magic β they’re a loop. The power comes from giving Claude good tools and a clear decision framework. The model doesn’t need to be creative; it needs to be consistent.

That feeling when your MLOps setup actually works and the model retrains itself at 3am. πͺ
Conclusion π―
We went from a Jupyter notebook to a production-grade, self-healing ML system in three deliberate phases β no shortcuts, no magic. Each phase is independently useful. You don’t need Phase 3 to benefit from Phase 1.
If you’re managing ML models in production without drift monitoring, you’re flying blind. If you have drift monitoring but it’s just Slack pings, you’re flying with your eyes closed. Phase 3 opens them. π
Grab the full source, start with Phase 1, and level up at your own pace: π github.com/yvesDenis/website-projects-articles/tree/master/mlops/retail-churn-prediction