Lab 9: Monitoring & Logging for Production AI Systems

Learn production monitoring concepts, explore AWS CloudWatch for deployed Lambda functions, and enhance your Prometheus dashboard with detailed health checks and metrics.

Lab Overview

What You'll Do: Understand the 4 Golden Signals, explore AWS CloudWatch metrics and logs for Lambda, set up CloudWatch alarms, and enhance your local Prometheus dashboard with production-grade monitoring features

Lab Collaborators:

  • Edward Lampoh - Software Developer & Collaborator
  • Oluwafemi Adebayo, PhD - Academic Professor & Collaborator

Prerequisites Required
Complete Labs 1-8 before starting

Before starting Lab 9, ensure you have:

  • Lab 2 completed with basic Prometheus monitoring
  • Lab 8 completed with Lambda function deployed
  • Flask MLOps service running locally
  • AWS account access with Lambda function from Lab 8

🔍 Quick Test

# Verify local service is running
curl http://localhost:5001/health

# Should return healthy status

Part A: Understanding Production Monitoring

Learn the fundamentals of monitoring production AI systems

1. The 4 Golden Signals of Monitoring

Google's Site Reliability Engineering team identified 4 key metrics every production system should monitor:

1. Latency

What: How long it takes to service a request
Example: Your Lambda function takes 150ms to respond

2. Traffic

What: How much demand is on your system
Example: 1,000 AI chat requests per day

3. Errors

What: Rate of failed requests
Example: 2% of requests fail due to database timeouts

4. Saturation

What: How "full" your service is
Example: Lambda using 450MB of 512MB memory limit
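
These signals map naturally onto Prometheus metric types, which you already use from Lab 2. Below is a minimal sketch of that mapping; the metric and function names are illustrative, not from the lab's actual service:

import time
from prometheus_client import Counter, Gauge, Histogram

# 1. Latency: how long requests take
REQUEST_LATENCY = Histogram("request_latency_seconds", "Time to service a request")
# 2. Traffic: how many requests arrive
REQUEST_COUNT = Counter("requests_total", "Total requests received")
# 3. Errors: how many requests fail
REQUEST_ERRORS = Counter("request_errors_total", "Total failed requests")
# 4. Saturation: how "full" the service is
MEMORY_USED_MB = Gauge("memory_used_mb", "Resident memory in MB")

def handle_request():
    REQUEST_COUNT.inc()
    start = time.time()
    try:
        ...  # actual request handling goes here
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.time() - start)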

2. Monitoring vs Logging vs Tracing

📊 Monitoring (Metrics)

What: Numeric measurements over time
Example: Request count, response time, error rate
Tools: Prometheus, CloudWatch Metrics

📝 Logging (Events)

What: Text records of what happened
Example: "User john@example.com requested appointment at 2pm"
Tools: CloudWatch Logs, application logs

🔍 Tracing (Request Flow)

What: Following a single request through multiple services
Example: User request → API Gateway → Lambda → Database
Tools: AWS X-Ray, Jaeger

3. Why Monitor Production Systems?

Monitoring helps you:

  • Detect issues early: Know when error rates spike before users complain
  • Understand usage patterns: When are peak hours? Which features are used most?
  • Optimize costs: See where you're spending money (API calls, compute time)
  • Plan capacity: Know when to scale up resources
  • Debug production problems: Trace issues to specific components

⚠️ Without Monitoring:

  • You won't know when your service is down until users report it
  • Debugging production issues is like flying blind
  • Unexpected AWS bills because you didn't track usage
  • No data to optimize performance or user experience

Part B: Explore AWS CloudWatch for Lambda

Learn how AWS automatically monitors your Lambda functions

1. What is AWS CloudWatch?

CloudWatch is AWS's monitoring and logging service that automatically tracks all Lambda functions with zero configuration needed.

What CloudWatch Provides:

  • Metrics: Invocations, duration, errors, throttles
  • Logs: Everything your Lambda prints (console.log, print statements)
  • Alarms: Notifications when metrics exceed thresholds
  • Dashboards: Visual graphs of your metrics

2. View Lambda Metrics in CloudWatch

Access CloudWatch Metrics:

  1. Sign in to AWS Console
  2. Go to Lambda → Your function (mlops-service-lambda)
  3. Click the "Monitor" tab
  4. You'll see automatic metrics graphs:

📈 Invocations

Number of times your Lambda was called (traffic signal)

⏱️ Duration

How long each request took (latency signal)
Note: First request may show 1-3 seconds (cold start), then ~100-200ms

❌ Errors

Failed invocations (error signal)

🚫 Throttles

Requests rejected due to concurrency limits (saturation signal)
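
The console is all you need for this lab, but the same metrics can also be fetched programmatically. A minimal boto3 sketch (region and function name are assumptions from Lab 8; adjust to yours):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # your region

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",  # also: "Duration", "Errors", "Throttles"
    Dimensions=[{"Name": "FunctionName", "Value": "mlops-service-lambda"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,                # one datapoint per 5 minutes
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])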

3. View Lambda Logs in CloudWatch

Access CloudWatch Logs:

  1. In Lambda "Monitor" tab, click "View CloudWatch logs"
  2. You'll see log streams - one per Lambda execution environment
  3. Click on the latest log stream
  4. View your application logs:

Example Log Output:

START RequestId: abc123 Version: $LATEST
2024-01-15T10:30:00.000Z INFO Received request to /health
2024-01-15T10:30:00.050Z INFO Database connection healthy
2024-01-15T10:30:00.100Z INFO Prometheus metrics updated
END RequestId: abc123
REPORT RequestId: abc123 Duration: 150.00 ms Billed Duration: 150 ms Memory Size: 128 MB Max Memory Used: 89 MB

💡 Understanding Cold Starts:

When Lambda hasn't been used for ~10-15 minutes, AWS pauses it. The first request after idle takes 1-3 seconds to "warm up" (load code, start Python). This is normal and expected. Subsequent requests are fast (~100-200ms).

4. Test and Observe Metrics

Generate some traffic to your Lambda:

# Replace with your API Gateway URL from Lab 8
API_URL="https://YOUR_API_GATEWAY_URL"

# Send 10 test requests
for i in {1..10}; do
  curl -s $API_URL/health
  sleep 1
done

Then refresh CloudWatch to see:

  • Invocations count increased by 10
  • Duration showing ~100-200ms per request
  • First request may show higher duration (cold start)
  • Logs showing your requests

Part C: Set Up CloudWatch Alarm for Errors

Get notified when your Lambda function error rate is too high

1. Why Set Up Alarms?

CloudWatch Alarms proactively notify you when metrics exceed thresholds. Instead of constantly checking dashboards, AWS sends you an email/SMS when something is wrong.

Example Scenarios:

  • Error rate > 10%: Something is broken, investigate immediately
  • Duration > 5 seconds: Function is slow, may timeout
  • Throttles > 0: Too many concurrent requests, need more capacity

2. Create SNS Topic for Notifications

SNS (Simple Notification Service) sends alarm notifications via email/SMS.

Create SNS Topic:

  1. Go to AWS Console → Search "SNS" → Open SNS
  2. Click "Topics" in left sidebar → "Create topic"
  3. Configure:
    • Type: Standard
    • Name: mlops-lambda-alerts
  4. Click "Create topic"

Create Email Subscription:

  1. In the topic page, click "Create subscription"
  2. Configure:
    • Protocol: Email
    • Endpoint: Your email address
  3. Click "Create subscription"
  4. Check your email for confirmation link and click it
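
If you prefer scripting these console steps, a rough boto3 equivalent looks like this (the email address and region are placeholders; the confirmation email still has to be clicked):

import boto3

sns = boto3.client("sns", region_name="us-east-1")  # your region

# Create the topic (idempotent: returns the existing ARN if it already exists)
topic = sns.create_topic(Name="mlops-lambda-alerts")

# Subscribe an email address; AWS still sends a confirmation link to click
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",  # placeholder - use your address
)
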
3. Create Error Rate Alarm

Create CloudWatch Alarm:

  1. Go to CloudWatch → Alarms → "Create alarm"
  2. Click "Select metric"
  3. Navigate: Lambda → By Function Name
  4. Select your function (mlops-service-lambda)
  5. Check the Errors metric → "Select metric"
  6. Configure alarm conditions:
    • Statistic: Sum
    • Period: 5 minutes
    • Threshold type: Static
    • Whenever Errors is: Greater than 5
  7. Click "Next"

Configure notifications:

  1. Select "In alarm" state trigger
  2. Select your SNS topic: mlops-lambda-alerts
  3. Click "Next"

Name the alarm:

  1. Alarm name: mlops-lambda-high-errors
  2. Description: "Alert when Lambda errors exceed 5 in 5 minutes"
  3. Click "Next" → Review → "Create alarm"

✅ What You've Done:

Now if your Lambda function has more than 5 errors within a 5-minute window, you'll get an email notification automatically! This is proactive monitoring - you know about problems before users complain.

4. Test Alarm (Optional)

To test the alarm, you would need to intentionally cause errors. For this lab, it's not necessary - just knowing the alarm exists is enough.

💡 How to Test (Advanced):

  • Temporarily break your Lambda (e.g., wrong DATABASE_URL)
  • Send 6+ requests to trigger errors
  • Wait 5 minutes for alarm to trigger
  • Receive email notification
  • Fix the issue and alarm returns to "OK" state

Part D: Enhance Prometheus Dashboard with Production Features

Add detailed health checks, error tracking, and uptime monitoring to your local service

1. What We're Adding

Lab 2 set up basic Prometheus monitoring. Now we'll add production-grade features:

  • Detailed health endpoint: Service status, uptime, system resources
  • Request tracking: Total requests, failed requests, error rates
  • Uptime monitoring: How long service has been running
  • System resources: Memory and CPU usage
  • Enhanced dashboard: 10 comprehensive metrics cards

2. Update Flask Dependencies

Add psutil for system monitoring:

Update mlops-service/requirements.txt:

# Existing dependencies remain...

# Additional utilities
python-dotenv==1.0.0
psutil==5.9.8  # NEW: For system monitoring

Install the new dependency:

cd mlops-service
pip install psutil==5.9.8

3. Code Enhancements Already Implemented

Your instructor has already enhanced mlops-service/app.py with production monitoring features:

✅ Global Request Tracking

Variables to track: TOTAL_REQUESTS, FAILED_REQUESTS, SERVICE_START_TIME

✅ Helper Functions

get_uptime_seconds() - Calculate service uptime
get_memory_usage_mb() - Get memory usage via psutil
calculate_error_rate() - Calculate error percentage
check_database_connection() - Verify database access

✅ New /health/detailed Endpoint

Returns comprehensive health info: status, uptime, system resources, request stats, error rates, health checks

✅ Enhanced /track Endpoint

Now increments TOTAL_REQUESTS and tracks FAILED_REQUESTS on errors

✅ Structured Logging

Formatted log output with timestamps, log levels, and structured messages
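
The actual implementation already lives in mlops-service/app.py; for orientation, here is a minimal sketch of what helpers like these could look like (names taken from the list above, details assumed):

import os
import time
import psutil

SERVICE_START_TIME = time.time()
TOTAL_REQUESTS = 0
FAILED_REQUESTS = 0

def get_uptime_seconds():
    # Seconds since this process started serving requests
    return time.time() - SERVICE_START_TIME

def get_memory_usage_mb():
    # Resident memory of the current process, in megabytes
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

def calculate_error_rate():
    # Failed requests as a percentage of all requests; 0% when no traffic yet
    if TOTAL_REQUESTS == 0:
        return 0.0
    return (FAILED_REQUESTS / TOTAL_REQUESTS) * 100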

4. Test the Enhanced Endpoints

Start your local Flask service:

cd mlops-service
python app.py

Test the detailed health endpoint:

curl http://localhost:5001/health/detailed | jq

Expected Output:

{
  "status": "healthy",
  "service": "mlops-service-prometheus",
  "version": "1.0.0",
  "timestamp": "2024-01-15T10:30:00.000000",
  "environment": "development",
  "uptime": {
    "seconds": 3600.5,
    "hours": 1.0,
    "days": 0.04,
    "started_at": "2024-01-15T09:30:00.000000"
  },
  "system": {
    "memory_usage_mb": 125.4,
    "cpu_percent": 2.5,
    "process_id": 12345
  },
  "requests": {
    "total": 150,
    "failed": 3,
    "error_rate_percent": 2.0,
    "last_request": "2024-01-15T10:29:00.000000"
  },
  "health_checks": {
    "database_connected": true,
    "prometheus_enabled": true,
    "metrics_tracking": "active"
  }
}

5. View Enhanced Dashboard

The dashboard has been enhanced to display 10 comprehensive metrics:

Open the dashboard:

open http://localhost:5001/

10 Metrics Displayed:

1. Service Status - Healthy/Unhealthy with environment
2. Service Uptime - Hours/days since start
3. Total Requests - All tracked requests
4. Error Rate - Color-coded: green <5%, yellow 5-10%, red >10%
5. Memory Usage - MB used with CPU%
6. Database Status - Connected/Disconnected
7. Total Tokens - AI tokens consumed
8. Total API Cost - USD spent on AI
9. Appointments - Booking requests
10. Human Handoffs - Escalations needed

💡 Dashboard Features:

  • Auto-refreshes every 10 seconds
  • Manual refresh button for immediate updates
  • Color-coded error rate indicators (green/yellow/red)
  • Links to all monitoring endpoints

Part E: Production Monitoring Best Practices

Learn industry best practices for production AI systems

1. Setting Alert Thresholds

✅ Good Alert Thresholds

  • Error rate > 5-10%: Something is likely broken
  • Response time > 3 seconds: User experience degrading
  • Memory usage > 80%: Risk of crashes

❌ Bad Alert Thresholds

  • Alert on every error: Too noisy, you'll ignore alerts
  • Alert on 50% error rate: Too late, many users affected
  • No alerts at all: You won't know when things break

💡 Rule of Thumb:

Set thresholds that indicate actionable problems - not so sensitive that you get false alarms, not so loose that real issues go unnoticed.

2. Track AI API Costs

AI API calls can get expensive quickly. Always monitor:

  • Total tokens used: Shows API usage volume
  • Cost per request: Helps optimize prompts
  • Daily/weekly trends: Spot unexpected spikes

⚠️ Example Cost Issue:

A student accidentally created an infinite loop of AI calls. Without monitoring, they didn't notice until their AWS bill was $500. Monitoring could have caught this in hours, not days.

3. Use Appropriate Log Levels

DEBUG

Detailed info for debugging (only in development)

INFO

Normal operations (e.g., "Request received", "Metric updated")

WARNING

Something unexpected but not critical (e.g., "Retry attempt 2 of 3")

ERROR

Failures that need attention (e.g., "Database connection failed")
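
In Python, these levels map directly onto the standard logging module. A minimal setup sketch:

import logging

# Timestamped, leveled output - DEBUG in development, INFO and above in production
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("mlops-service")

logger.debug("Raw model payload: ...")      # development-only detail (suppressed at INFO level)
logger.info("Request received on /track")   # normal operation
logger.warning("Retry attempt 2 of 3")      # unexpected but handled
logger.error("Database connection failed")  # needs attention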

4. Log and Metric Retention

CloudWatch logs and metrics cost money to store. Set retention policies:

  • Logs: Keep 7-30 days (good for debugging recent issues)
  • Metrics: CloudWatch keeps high-resolution for 3 hours, then aggregates
  • Alarms: Keep indefinitely (minimal cost)
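
Log retention can be set in the CloudWatch console or programmatically. A boto3 sketch (the log group name assumes the Lambda from Lab 8):

import boto3

logs = boto3.client("logs", region_name="us-east-1")  # your region

# Default for Lambda log groups is "Never expire"; cap it at 14 days instead
logs.put_retention_policy(
    logGroupName="/aws/lambda/mlops-service-lambda",  # assumes Lab 8's function name
    retentionInDays=14,  # allowed values include 1, 3, 5, 7, 14, 30, 60, 90, ...
)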

💡 Cost Optimization:

For this course project, CloudWatch costs are negligible. But in production with high traffic, log retention settings can save hundreds of dollars per month.

Part F: CloudWatch vs Prometheus Comparison

Understand when to use each monitoring approach

CloudWatch vs Prometheus

AWS CloudWatch

✅ Pros:

  • Automatic for AWS services
  • No setup required for Lambda/EC2
  • Built-in alarms and notifications
  • Integrated with entire AWS ecosystem

❌ Cons:

  • AWS-specific (vendor lock-in)
  • Costs more for high-volume metrics
  • Less flexible than Prometheus

Prometheus

✅ Pros:

  • Open-source and free
  • Works anywhere (AWS, GCP, on-premise)
  • Powerful query language (PromQL)
  • Custom metrics easy to add

❌ Cons:

  • Requires manual setup
  • Need to manage Prometheus server
  • No built-in alerting (need Alertmanager)

When to Use Each

Use CloudWatch when:

  • Using AWS services (Lambda, EC2, RDS)
  • Want zero-setup monitoring
  • Need AWS-integrated alarms
  • Small to medium scale

Use Prometheus when:

  • Multi-cloud or on-premise deployment
  • Need custom application metrics
  • Want full control over monitoring
  • Large scale with complex queries

Use Both (Our Approach):

  • CloudWatch: Monitor deployed Lambda/EC2
  • Prometheus: Track application-specific metrics (AI costs, appointments, handoffs)
  • Best of both worlds: AWS infrastructure + custom business metrics

Troubleshooting

Can't see Lambda metrics in CloudWatch:

  • Make sure you've invoked the Lambda at least once
  • Check you're in the correct AWS region
  • Wait a few minutes for metrics to appear

SNS subscription not confirmed:

  • Check spam folder for confirmation email
  • Try creating subscription again with different email

CloudWatch alarm not triggering:

  • Verify alarm is in "OK" state (not "Insufficient data")
  • Check threshold is actually exceeded
  • Wait full evaluation period (5 minutes)

/health/detailed endpoint returns error:

  • Ensure psutil is installed: pip install psutil==5.9.8
  • Restart Flask service: python app.py
  • Check application logs for specific errors

Dashboard shows 0 for all metrics:

  • Service may have just started (metrics will populate with usage)
  • Try sending test requests: curl http://localhost:5001/health
  • Check browser console for fetch errors

Error rate showing incorrectly:

  • Error rate is calculated as (failed / total) * 100
  • If total requests is 0, error rate shows 0%
  • Send some requests to populate statistics

Lab 9 Summary - What You Learned

Congratulations! You've learned production monitoring concepts and enhanced your MLOps service with production-grade observability. Here's what you accomplished:

✅ Monitoring Concepts Learned

  • 4 Golden Signals: Latency, Traffic, Errors, Saturation
  • Monitoring Types: Metrics, Logs, and Traces
  • CloudWatch: AWS's automatic monitoring for Lambda
  • Alarms: Proactive notifications when thresholds are exceeded
  • Production Best Practices: Alert thresholds, log levels, retention

🚀 What You Built

  • CloudWatch Exploration: Viewed Lambda metrics and logs
  • CloudWatch Alarm: Email notifications for high error rates
  • Enhanced Health Endpoint: Detailed service status with uptime and resources
  • Request Tracking: Total requests, failed requests, error rates
  • Improved Dashboard: 10 comprehensive metrics with color-coded indicators

🔑 Key Takeaways

  • Monitor everything: You can't fix what you don't measure
  • CloudWatch is automatic: AWS Lambda monitoring requires zero setup
  • Set smart thresholds: Not too noisy, not too loose
  • Track AI costs: Prevent surprise bills from runaway API usage
  • Cold starts are normal: First Lambda request after idle takes 1-3 seconds
  • Use both CloudWatch and Prometheus: AWS infrastructure + custom metrics

📝 Test Your Knowledge

Complete the Lab 9 quiz to test your understanding of monitoring and logging for production AI systems.

Take Lab 9 Quiz →

📸 Quiz Submission Checklist:

  • Complete all 5 multiple-choice questions
  • Take a screenshot of your results page showing:
    • Your name
    • Your score (aim for 4/5 or 5/5)
    • Session ID
    • Timestamp
  • Submit screenshot as proof of completion