Thursday, 26 June 2025
MLflow
What Is MLflow?
MLflow is an open-source platform for managing the complete machine learning lifecycle, including:
• Experiment tracking
• Model versioning & registry
• Reproducible runs
• Model deployment
Originally developed by Databricks, MLflow works with any ML library (e.g., PyTorch, TensorFlow, Hugging Face), and integrates well with SageMaker, Kubernetes, AzureML, and other MLOps platforms.
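A minimal sketch of what tracking looks like in practice (assuming MLflow is installed with pip install mlflow; the experiment name is just an example, and runs default to a local ./mlruns directory):
import mlflow

# Runs are stored in ./mlruns by default; use mlflow.set_tracking_uri() to point at a server.
mlflow.set_experiment("quickstart-demo")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 5e-5)
    mlflow.log_metric("train_loss", 0.42)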
________________________________________
MLflow Components (4 Key Modules)
Component | Description
Tracking | Logs experiments: parameters, metrics, artifacts
Projects | Defines reusable and shareable ML code (optional)
Model Registry | Manages versions of ML models and lifecycle stages like "Staging" and "Production"
Model Serving | Deploys models locally, on SageMaker, Azure, or Databricks
________________________________________
✅ Why Use MLflow? (Advantages)
1. Track Every Experiment Run
• Record hyperparameters, datasets used, metrics (BLEU, F1), models trained, and source code.
• Know which model version worked best.
2. Compare Performance Across Runs
• UI or CLI lets you sort by test accuracy, BLEU, F1, etc.
• Easily choose the top-performing run.
3. Ensure Reproducibility
• MLflow logs the Python environment, Git SHA, and dependencies.
• Anyone can rerun the experiment later with the same results (a sketch follows below).
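A hedged sketch of making that explicit: MLflow also records the Git commit automatically when training is launched from inside a Git repository, but you can log the dependency file and SHA yourself (the requirements.txt path is illustrative):
import subprocess
import mlflow

with mlflow.start_run():
    # Attach the exact dependency list used for this run
    mlflow.log_artifact("requirements.txt")
    # Record the current Git SHA as a tag for easy lookup later
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("git_sha", git_sha)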
4. Register & Manage Models
• Assign versions and lifecycle stages: “Staging”, “Production”, “Archived”.
• Automate deployment with CI/CD pipelines.
5. Deploy Anywhere
• You can deploy to:
o Local Flask server (mlflow models serve)
o AWS SageMaker
o Azure ML
o Kubernetes
________________________________________
What Are Experiments in MLflow?
An Experiment in MLflow is a logical group of runs for a specific task or problem.
✅ Example:
If you are working on a GenAI model for insurance claim triage, your experiment might be:
mlflow.set_experiment("insurance-claims-genai")
Then every mlflow.start_run() is logged under that experiment.
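For example, a minimal run under that experiment might look like this (a sketch; the tracking URI assumes a local MLflow server, and run_name is optional):
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # assumed local tracking server
mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run(run_name="flan-t5-small-baseline"):
    mlflow.log_param("model_name", "flan-t5-small")
    mlflow.log_metric("test_bleu", 0.82)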
________________________________________
What's Logged in an Experiment?
Type | Examples
Parameters | learning_rate, model_name, max_tokens
Metrics | train_loss, test_bleu, accuracy
Artifacts | model weights, tokenizer, confusion matrix PNGs
Source Code | Git SHA, environment, Python packages
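A sketch of how each of these types is logged (the file name and values are illustrative):
import mlflow

with mlflow.start_run():
    # Parameters: logged once per run
    mlflow.log_params({"learning_rate": 5e-5, "model_name": "flan-t5-base", "max_tokens": 512})
    # Metrics: can be logged per step to build training curves
    for step, loss in enumerate([1.2, 0.9, 0.7]):
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_metric("test_bleu", 0.86)
    # Artifacts: any file, e.g. a confusion matrix image or tokenizer files
    mlflow.log_artifact("confusion_matrix.png")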
________________________________________
Visualizing Experiments in MLflow UI
You can:
• See a list of runs under an experiment
• Compare BLEU scores, loss curves
• Click each run to view logs, artifacts, and code
Useful when trying different (a looping sketch follows this list):
• Prompt engineering strategies
• Model sizes (flan-t5, llama2, etc.)
• Learning rates
• Knowledge base grounding quality
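One common pattern is to loop over a small grid of such configurations and log each one as its own run (a sketch; train_and_evaluate is a placeholder for your own training code and the values are made up):
import mlflow

configs = [
    {"model_name": "flan-t5-small", "learning_rate": 5e-5, "prompt_style": "simple"},
    {"model_name": "flan-t5-base", "learning_rate": 3e-5, "prompt_style": "instruction"},
]

mlflow.set_experiment("insurance-claims-genai")
for cfg in configs:
    with mlflow.start_run():
        mlflow.log_params(cfg)
        bleu = train_and_evaluate(**cfg)  # placeholder training/evaluation function
        mlflow.log_metric("test_bleu", bleu)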
________________________________________
Real Example (From Your Use Case)
Imagine you're training multiple GenAI models to process insurance claim documents:
with mlflow.start_run():
    mlflow.log_param("model_name", "flan-t5-small")
    mlflow.log_param("epochs", 3)
    mlflow.log_metric("test_bleu", 0.89)
    mlflow.pytorch.log_model(model, "model")
You run this 5 times with different models. Later, you:
• Open MLflow UI
• Sort runs by test_bleu
• Pick the highest BLEU score
• Register and deploy that model (a scripted version of this selection is sketched below)
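That selection can also be scripted with mlflow.search_runs, which returns a pandas DataFrame of runs (a sketch; the experiment_names argument requires a reasonably recent MLflow version):
import mlflow

runs = mlflow.search_runs(
    experiment_names=["insurance-claims-genai"],
    order_by=["metrics.test_bleu DESC"],
)
best = runs.iloc[0]
print(best["run_id"], best["params.model_name"], best["metrics.test_bleu"])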
________________________________________
Summary
Feature | What It Does | Why It Matters
Experiment Tracking | Logs params, metrics, artifacts | Reproducibility, tuning
Model Registry | Versioned model management | CI/CD, production rollout
UI & Comparison | Side-by-side metrics for multiple runs | Better decision-making
Deployment | One-click to SageMaker, local, etc. | Fast, scalable inference
Use Case Fit | Fine-tuning LLMs, GenAI, RAG, NLP, CV | Especially helpful in complex workflows
What Is an MLflow Experiment?
An MLflow experiment is a collection of runs (training executions) where you test different model configurations to find the best one.
Each experiment:
• Tracks hyperparameters, metrics, artifacts, and code version
• Helps you compare performance across different runs
• Can be visualized in the MLflow UI or queried via API
________________________________________
✅ Insurance GenAI Use Case – MLflow Experiment Examples
✅ Experiment Name: "insurance-claims-genai"
You're trying to generate claim triage summaries using different models, prompt styles, or training settings.
________________________________________
Example 1: Compare Model Architectures
Goal: Evaluate which base model performs best for summarizing claims
Run | model_name | test_bleu | Status
Run 1 | flan-t5-small | 0.82 | OK
Run 2 | flan-t5-base | 0.86 | Better
Run 3 | llama2-7b | 0.91 | Best
mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("model_name", "flan-t5-base")
    mlflow.log_param("epochs", 3)
    mlflow.log_param("learning_rate", 5e-5)
    mlflow.log_metric("test_bleu", 0.86)
    mlflow.pytorch.log_model(model, "model")
________________________________________
Example 2: Evaluate Prompt Engineering Variants
Goal: Test how different prompt templates affect model output
Run | prompt_template | test_bleu
A | "question: ... context: ..." | 0.78
B | "You are a claim adjuster. Q: ..." | 0.83
C | "Claim details: ... What should I do?" | 0.87
mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("prompt_template", "Claim details: ... What should I do?")
    mlflow.log_param("context_window", 512)
    mlflow.log_metric("test_bleu", 0.87)
    mlflow.pytorch.log_model(model, "model")
________________________________________
Example 3: Evaluate Retrieval Effectiveness (RAG)
Goal: Compare the impact of Knowledge Base documents on RAG quality
Run | kb_documents | top_k | test_bleu
Run A | Auto policy only | 5 | 0.75
Run B | Auto + Fraud detection | 5 | 0.81
Run C | Full policy + SOPs | 10 | 0.88
mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("knowledge_base", "Auto + Fraud")
    mlflow.log_param("top_k_chunks", 5)
    mlflow.log_metric("test_bleu", 0.81)
________________________________________
Example 4: Track Data Version and Token Limit
Goal: Check how the model performs on different data snapshots and token limits
Run | data_version | max_tokens | test_bleu
A | v1-2024-12 | 256 | 0.79
B | v2-2025-01 | 512 | 0.85
C | v2-2025-01 | 1024 | 0.84
mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("data_version", "v2-2025-01")
    mlflow.log_param("max_tokens", 512)
    mlflow.log_metric("test_bleu", 0.85)
________________________________________
How You Use These Experiments
Once you've logged 10–20 of these runs:
• Open the MLflow UI
• Go to experiment "insurance-claims-genai"
• Sort or filter by test_bleu or model_name
• Click the best run → Register it → Deploy to SageMaker
________________________________________
Pro Interview Tip
If they ask "How do you know your GenAI model is improving?", answer:
“We track multiple experiments in MLflow, comparing metrics like BLEU, latency, and token usage across different model variants and prompt styles. It gives us visibility into what improves performance and what doesn’t. Once we identify the best run, we register and promote the model into staging or production.”
1. We Track Multiple Experiments in MLflow
You define a single experiment called:
mlflow.set_experiment("insurance-claims-genai")
You run training with different model architectures, prompts, or dataset versions.
________________________________________
Example 1: Run with flan-t5-small and prompt style A
with mlflow.start_run():
    mlflow.log_param("model", "flan-t5-small")
    mlflow.log_param("prompt_style", "simple_question_context")
    mlflow.log_param("max_tokens", 512)

    # BLEU and latency results from the test set
    mlflow.log_metric("test_bleu", 0.78)
    mlflow.log_metric("avg_latency_ms", 180)
    mlflow.log_metric("avg_token_usage", 140)

    mlflow.pytorch.log_model(model, "model")
________________________________________
Example 2: Run with flan-t5-base and prompt style B
with mlflow.start_run():
    mlflow.log_param("model", "flan-t5-base")
    mlflow.log_param("prompt_style", "instruction_following")
    mlflow.log_param("max_tokens", 512)
    mlflow.log_metric("test_bleu", 0.85)
    mlflow.log_metric("avg_latency_ms", 210)
    mlflow.log_metric("avg_token_usage", 175)
    mlflow.pytorch.log_model(model, "model")
________________________________________
2. We Compare Metrics like BLEU, Latency, and Token Usage
After 5–10 runs like these, you open the MLflow Tracking UI and compare the results:
Run | Model | Prompt Style | BLEU | Latency (ms) | Token Usage
1 | flan-t5-small | simple_question | 0.78 | 180 | 140
2 | flan-t5-base | instruction_following | 0.85 | 210 | 175
3 | llama2-7b | context_augmented | 0.91 | 250 | 210
You can sort by BLEU score, filter by model name, or plot BLEU vs. latency.
This helps you make data-driven decisions:
• Do bigger models improve accuracy?
• Does a certain prompt style reduce latency?
• Are higher token usage costs justified?
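Those questions can also be answered outside the UI by pulling the runs into pandas and plotting, e.g. BLEU against latency (a sketch using matplotlib; the metric names match the ones logged above):
import mlflow
import matplotlib.pyplot as plt

runs = mlflow.search_runs(experiment_names=["insurance-claims-genai"])
plt.scatter(runs["metrics.avg_latency_ms"], runs["metrics.test_bleu"])
plt.xlabel("avg latency (ms)")
plt.ylabel("test BLEU")
plt.title("BLEU vs. latency across runs")
plt.show()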
________________________________________
3. Once We Identify the Best Run...
Say Run #3 (llama2-7b) gives you the best BLEU (0.91) with acceptable latency.
You then promote this run.
Register It:
from mlflow import register_model

register_model(
    model_uri="runs:/abc1234567890/model",  # run ID from MLflow
    name="insurance-claims-triage-model"
)
Set the Model to "Production":
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="insurance-claims-triage-model",
    version="3",
    stage="Production"
)
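Once a version is in "Production", downstream code can load it by stage instead of by run ID (a sketch; the exact input format depends on the flavor and signature the model was logged with):
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/insurance-claims-triage-model/Production")
prediction = model.predict(["Claim details: rear-end collision, no injuries. What should I do?"])
print(prediction)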
________________________________________
4. Deploy or Serve It
Once promoted, you can:
• Deploy to SageMaker using:
mlflow.sagemaker.deploy(
    app_name="insurance-claims-genai",
    model_uri="models:/insurance-claims-triage-model/Production",
    region_name="us-east-1"
)
• Or serve locally for testing:
mlflow models serve -m models:/insurance-claims-triage-model/Production -p 5000
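Once the local server is running, you can send requests to its /invocations endpoint; recent MLflow versions accept a JSON payload such as "inputs" or "dataframe_split" (a sketch using requests, with an illustrative claim text):
import requests

payload = {"inputs": ["Claim details: rear-end collision, no injuries. What should I do?"]}
resp = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(resp.json())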
________________________________________
Final Outcome:
• You ran 5+ variants (different models/prompt/data)
• Tracked everything in MLflow
• Compared metrics like BLEU, latency, token cost
• Promoted the best run to production
• Exposed it via API Gateway + Lambda
________________________________________
✅ Visual Flow:
Train (Run 1-5) ──▶ Log to MLflow ──▶ Compare BLEU/latency ──▶ Register Best ──▶ Promote to Prod ──▶ Deploy to SageMaker ──▶ API Gateway
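The API Gateway + Lambda step at the end could look roughly like this (a sketch; assumes a proxy integration, a claim_text field in the request body, and a SageMaker endpoint named after the app_name used above):
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    body = json.loads(event["body"])  # API Gateway proxy integration passes the raw request body
    response = runtime.invoke_endpoint(
        EndpointName="insurance-claims-genai",  # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": [body["claim_text"]]}),
    )
    result = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": result}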
________________________________________