Thursday, 26 June 2025

MLflow

📦 What Is MLflow?

MLflow is an open-source platform for managing the complete machine learning lifecycle, including:
• Experiment tracking
• Model versioning & registry
• Reproducible runs
• Model deployment

Originally developed by Databricks, MLflow works with any ML library (e.g., PyTorch, TensorFlow, Hugging Face) and integrates well with SageMaker, Kubernetes, Azure ML, and other MLOps platforms.
________________________________________
🔄 MLflow Components (4 Key Modules)

Component | Description
✅ Tracking | Logs experiments: parameters, metrics, artifacts
📦 Projects | Defines reusable and shareable ML code (optional)
🏷️ Model Registry | Manages model versions and lifecycle stages such as "Staging" and "Production"
🚀 Model Serving | Deploys models locally, on SageMaker, Azure, or Databricks
________________________________________
✅ Why Use MLflow? (Advantages)

🔍 1. Track Every Experiment Run
• Record hyperparameters, datasets used, metrics (BLEU, F1), models trained, and source code.
• Know which model version worked best.

📊 2. Compare Performance Across Runs
• The UI or CLI lets you sort runs by test accuracy, BLEU, F1, etc.
• Easily choose the top-performing run.

🔄 3. Ensure Reproducibility
• MLflow logs the Python environment, Git SHA, and dependencies.
• Anyone can rerun the experiment later with the same results.

🧰 4. Register & Manage Models
• Assign versions and lifecycle stages: "Staging", "Production", "Archived".
• Automate deployment with CI/CD pipelines.

🚀 5. Deploy Anywhere
• You can deploy to:
  o A local Flask server (mlflow models serve)
  o AWS SageMaker
  o Azure ML
  o Kubernetes
________________________________________
📘 What Are Experiments in MLflow?

An experiment in MLflow is a logical group of runs for a specific task or problem.

✅ Example: If you are working on a GenAI model for insurance claim triage, your experiment might be:

mlflow.set_experiment("insurance-claims-genai")

Then every mlflow.start_run() is logged under that experiment.
________________________________________
💡 What's Logged in an Experiment?

Type | Examples
Parameters | learning_rate, model_name, max_tokens
Metrics | train_loss, test_bleu, accuracy
Artifacts | model weights, tokenizer, confusion matrix PNGs
Source Code | Git SHA, environment, Python packages
________________________________________
📈 Visualizing Experiments in the MLflow UI

You can:
• See a list of runs under an experiment
• Compare BLEU scores and loss curves
• Click each run to view logs, artifacts, and code

📝 Useful when trying different:
• Prompt engineering strategies
• Model sizes (flan-t5, llama2, etc.)
• Learning rates
• Knowledge base grounding quality
________________________________________
👨‍💻 Real Example (From Your Use Case)

Imagine you're training multiple GenAI models to process insurance claim documents:

with mlflow.start_run():
    mlflow.log_param("model_name", "flan-t5-small")
    mlflow.log_param("epochs", 3)
    mlflow.log_metric("test_bleu", 0.89)
    mlflow.pytorch.log_model(model, "model")

You run this 5 times with different models. Later, you:
• Open the MLflow UI
• Sort runs by test_bleu
• Pick the run with the highest BLEU score
• Register and deploy that model
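Instead of sorting runs by hand in the UI, the same "pick the run with the best BLEU" step can be scripted. Here is a minimal sketch, assuming the runs above were logged to the "insurance-claims-genai" experiment on your configured tracking server (column names follow MLflow's params./metrics. prefix convention):

import mlflow

# Look up the experiment and pull its runs into a pandas DataFrame,
# sorted so the highest test_bleu comes first.
experiment = mlflow.get_experiment_by_name("insurance-claims-genai")
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.test_bleu DESC"],
)

best = runs.iloc[0]
print("Best run:", best["run_id"])
print("Model:", best["params.model_name"], "BLEU:", best["metrics.test_bleu"])

The run_id printed here is the same ID you would later pass when registering the model.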
________________________________________
📌 Summary

Feature | What It Does | Why It Matters
🧪 Experiment Tracking | Logs params, metrics, artifacts | Reproducibility, tuning
📋 Model Registry | Versioned model management | CI/CD, production rollout
📊 UI & Comparison | Side-by-side metrics for multiple runs | Better decision-making
🌐 Deployment | One-click deployment to SageMaker, local, etc. | Fast, scalable inference
🧠 Use Case Fit | Fine-tuning LLMs, GenAI, RAG, NLP, CV | Especially helpful in complex workflows
________________________________________
🧪 What Is an MLflow Experiment?

An MLflow experiment is a collection of runs (training executions) where you test different model configurations to find the best one.

Each experiment:
• Tracks hyperparameters, metrics, artifacts, and code version
• Helps you compare performance across different runs
• Can be visualized in the MLflow UI or queried via the API
________________________________________
✅ Insurance GenAI Use Case – MLflow Experiment Examples

✅ Experiment Name: "insurance-claims-genai"

You're trying to generate claim triage summaries using different models, prompt styles, or training settings.
________________________________________
🔍 Example 1: Compare Model Architectures

🔍 Goal: Evaluate which base model performs best for summarizing claims

Run | model_name | test_bleu | Status
Run 1 | flan-t5-small | 0.82 | OK
Run 2 | flan-t5-base | 0.86 | Better
Run 3 | llama2-7b | 0.91 | Best

mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("model_name", "flan-t5-base")
    mlflow.log_param("epochs", 3)
    mlflow.log_param("learning_rate", 5e-5)
    mlflow.log_metric("test_bleu", 0.86)
    mlflow.pytorch.log_model(model, "model")
________________________________________
🔍 Example 2: Evaluate Prompt Engineering Variants

🔍 Goal: Test how different prompt templates affect model output

Run | prompt_template | test_bleu
A | "question: ... context: ..." | 0.78
B | "You are a claim adjuster. Q: ..." | 0.83
C | "Claim details: ... What should I do?" | 0.87

mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("prompt_template", "Claim details: ... What should I do?")
    mlflow.log_param("context_window", 512)
    mlflow.log_metric("test_bleu", 0.87)
    mlflow.pytorch.log_model(model, "model")
________________________________________
🔍 Example 3: Evaluate Retrieval Effectiveness (RAG)

🔍 Goal: Compare the impact of knowledge base documents on RAG quality

Run | kb_documents | top_k | test_bleu
Run A | Auto policy only | 5 | 0.75
Run B | Auto + Fraud detection | 5 | 0.81
Run C | Full policy + SOPs | 10 | 0.88

mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("knowledge_base", "Auto + Fraud")
    mlflow.log_param("top_k_chunks", 5)
    mlflow.log_metric("test_bleu", 0.81)
________________________________________
🔍 Example 4: Track Data Version and Token Limit

🔍 Goal: Check how the model performs on different data snapshots and token limits

Run | data_version | max_tokens | test_bleu
A | v1-2024-12 | 256 | 0.79
B | v2-2025-01 | 512 | 0.85
C | v2-2025-01 | 1024 | 0.84

mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("data_version", "v2-2025-01")
    mlflow.log_param("max_tokens", 512)
    mlflow.log_metric("test_bleu", 0.85)
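The four examples above log prompt templates, model names, and data versions as parameters. Longer items — the full prompt file, sample generated summaries, evaluation plots — are better attached to the same run as artifacts. A minimal sketch (the prompt text, file paths, and sample output below are illustrative, not taken from the original runs):

import mlflow

mlflow.set_experiment("insurance-claims-genai")

with mlflow.start_run():
    mlflow.log_param("model_name", "flan-t5-base")
    mlflow.log_metric("test_bleu", 0.86)

    # Attach the full prompt template as a text artifact on this run.
    mlflow.log_text(
        "Claim details: {claim_text}\nWhat should I do?",
        artifact_file="prompts/triage_prompt.txt",
    )

    # Attach a sample generation for manual review in the UI.
    mlflow.log_dict(
        {"claim_id": "C-1001", "generated_summary": "Likely auto glass claim; route to fast-track team."},
        artifact_file="samples/example_output.json",
    )

These artifacts then show up under the run in the MLflow UI, right next to the logged model.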
________________________________________
📈 How You Use These Experiments

Once you've logged 10–20 of these runs:
• Open the MLflow UI
• Go to the "insurance-claims-genai" experiment
• Sort or filter by test_bleu or model_name
• Click the best run → register it → deploy to SageMaker
________________________________________
🧠 Pro Interview Tip

If they ask "How do you know your GenAI model is improving?", answer:

“We track multiple experiments in MLflow, comparing metrics like BLEU, latency, and token usage across different model variants and prompt styles. It gives us visibility into what improves performance and what doesn’t. Once we identify the best run, we register and promote the model into staging or production.”

🔹 1. We Track Multiple Experiments in MLflow

You define a single experiment:

mlflow.set_experiment("insurance-claims-genai")

You then run training with different model architectures, prompts, or dataset versions.
________________________________________
🧪 Example 1: Run with flan-t5-small and prompt style A

with mlflow.start_run():
    mlflow.log_param("model", "flan-t5-small")
    mlflow.log_param("prompt_style", "simple_question_context")
    mlflow.log_param("max_tokens", 512)

    # BLEU and latency results from the test set
    mlflow.log_metric("test_bleu", 0.78)
    mlflow.log_metric("avg_latency_ms", 180)
    mlflow.log_metric("avg_token_usage", 140)

    mlflow.pytorch.log_model(model, "model")
________________________________________
🧪 Example 2: Run with flan-t5-base and prompt style B

with mlflow.start_run():
    mlflow.log_param("model", "flan-t5-base")
    mlflow.log_param("prompt_style", "instruction_following")
    mlflow.log_param("max_tokens", 512)

    mlflow.log_metric("test_bleu", 0.85)
    mlflow.log_metric("avg_latency_ms", 210)
    mlflow.log_metric("avg_token_usage", 175)

    mlflow.pytorch.log_model(model, "model")
________________________________________
📊 2. We Compare Metrics like BLEU, Latency, and Token Usage

After 5–10 runs like these, you open the MLflow Tracking UI and compare the results:

Run | Model | Prompt Style | BLEU | Latency (ms) | Token Usage
1 | flan-t5-small | simple_question | 0.78 | 180 | 140
2 | flan-t5-base | instruction_following | 0.85 | 210 | 175
3 | llama2-7b | context_augmented | 0.91 | 250 | 210

You can sort by BLEU score, filter by model name, or plot BLEU vs. latency. This helps you make data-driven decisions:
• Do bigger models improve accuracy?
• Does a certain prompt style reduce latency?
• Are higher token usage costs justified?
________________________________________
🚀 3. Once We Identify the Best Run...

Say Run #3 (llama2-7b) gives you the best BLEU (0.91) with acceptable latency. You then promote this run.

📌 Register it:

from mlflow import register_model

register_model(
    model_uri="runs:/abc1234567890/model",  # run ID from MLflow
    name="insurance-claims-triage-model"
)

📌 Set the model version to "Production":

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="insurance-claims-triage-model",
    version="3",
    stage="Production"
)
________________________________________
🌐 4. Deploy or Serve It

Once promoted, you can:

• Deploy to SageMaker:

mlflow.sagemaker.deploy(
    app_name="insurance-claims-genai",
    model_uri="models:/insurance-claims-triage-model/Production",
    region_name="us-east-1"
)

• Or serve locally for testing:

mlflow models serve -m models:/insurance-claims-triage-model/Production -p 5000
________________________________________
📈 Final Outcome
• You ran 5+ variants (different models, prompts, data)
• Tracked everything in MLflow
• Compared metrics like BLEU, latency, and token cost
• Promoted the best run to production
• Exposed it via API Gateway + Lambda
________________________________________
✅ Visual Flow:

Train (Runs 1–5) ──▶ Log to MLflow ──▶ Compare BLEU/latency ──▶ Register Best ──▶ Promote to Prod ──▶ Deploy to SageMaker ──▶ API Gateway
________________________________________
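The last bullet of the final outcome ("exposed it via API Gateway + Lambda") usually comes down to a small handler that calls the SageMaker endpoint created by mlflow.sagemaker.deploy. A minimal sketch, assuming the endpoint kept the app_name used above; the claim text is made up, and the exact payload shape depends on your MLflow version and the signature the model was logged with:

import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Hypothetical request body — adjust to match the scoring format
# expected by your MLflow model server version.
payload = {"inputs": ["Claim details: rear-end collision, minor bumper damage. What should I do?"]}

response = runtime.invoke_endpoint(
    EndpointName="insurance-claims-genai",  # app_name passed to mlflow.sagemaker.deploy
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(response["Body"].read().decode("utf-8"))

Wrap this in a Lambda handler and put API Gateway in front of it, and you have the serving path shown in the visual flow above.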
