Lab 9: Monitoring & Logging for Production AI Systems

Learn production monitoring concepts, explore AWS CloudWatch for deployed Lambda functions, and enhance your Prometheus dashboard with detailed health checks and metrics.

Lab Overview

What You'll Do: Understand the 4 Golden Signals, explore AWS CloudWatch metrics and logs for Lambda, set up CloudWatch alarms, and enhance your local Prometheus dashboard with production-grade monitoring features

Lab Collaborators:

  • Edward Lampoh - Software Developer & Collaborator
  • Oluwafemi Adebayo, PhD - Academic Professor & Collaborator

Prerequisites Required
Complete Labs 1-8 before starting

Before starting Lab 9, ensure you have:

  • Lab 2 completed with basic Prometheus monitoring
  • Lab 8 completed with Lambda function deployed
  • Flask MLOps service running locally
  • AWS account access with Lambda function from Lab 8

🔍 Quick Test

# Verify local service is running
curl http://localhost:5001/health

# Should return healthy status

Part A: Understanding Production Monitoring

Learn the fundamentals of monitoring production AI systems

1. The 4 Golden Signals of Monitoring

Google's Site Reliability Engineering team identified 4 key metrics every production system should monitor:

1. Latency

What: How long it takes to service a request
Example: Your Lambda function takes 150ms to respond

2. Traffic

What: How much demand is on your system
Example: 1,000 AI chat requests per day

3. Errors

What: Rate of failed requests
Example: 2% of requests fail due to database timeouts

4. Saturation

What: How "full" your service is
Example: Lambda using 450MB of 512MB memory limit
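
These signals map naturally onto Prometheus metric types, which you already use from Lab 2. Below is a minimal sketch of that mapping; the metric and function names are illustrative, not from the lab's actual service:

import time
from prometheus_client import Counter, Gauge, Histogram

# 1. Latency: how long requests take
REQUEST_LATENCY = Histogram("request_latency_seconds", "Time to service a request")
# 2. Traffic: how many requests arrive
REQUEST_COUNT = Counter("requests_total", "Total requests received")
# 3. Errors: how many requests fail
REQUEST_ERRORS = Counter("request_errors_total", "Total failed requests")
# 4. Saturation: how "full" the service is
MEMORY_USED_MB = Gauge("memory_used_mb", "Resident memory in MB")

def handle_request():
    REQUEST_COUNT.inc()
    start = time.time()
    try:
        ...  # actual request handling goes here
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.time() - start)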

2. Monitoring vs Logging vs Tracing

📊 Monitoring (Metrics)

What: Numeric measurements over time
Example: Request count, response time, error rate
Tools: Prometheus, CloudWatch Metrics

📝 Logging (Events)

What: Text records of what happened
Example: "User john@example.com requested appointment at 2pm"
Tools: CloudWatch Logs, application logs

🔍 Tracing (Request Flow)

What: Following a single request through multiple services
Example: User request → API Gateway → Lambda → Database
Tools: AWS X-Ray, Jaeger

3. Why Monitor Production Systems?

Monitoring helps you:

  • Detect issues early: Know when error rates spike before users complain
  • Understand usage patterns: When are peak hours? Which features are used most?
  • Optimize costs: See where you're spending money (API calls, compute time)
  • Plan capacity: Know when to scale up resources
  • Debug production problems: Trace issues to specific components

⚠️ Without Monitoring:

  • You won't know when your service is down until users report it
  • Debugging production issues is like flying blind
  • Unexpected AWS bills because you didn't track usage
  • No data to optimize performance or user experience

Part B: Explore AWS CloudWatch for Lambda

Learn how AWS automatically monitors your Lambda functions

1. What is AWS CloudWatch?

CloudWatch is AWS's monitoring and logging service that automatically tracks all Lambda functions with zero configuration needed.

What CloudWatch Provides:

  • Metrics: Invocations, duration, errors, throttles
  • Logs: Everything your Lambda prints (console.log, print statements)
  • Alarms: Notifications when metrics exceed thresholds
  • Dashboards: Visual graphs of your metrics

2. View Lambda Metrics in CloudWatch

Access CloudWatch Metrics:

  1. Sign in to AWS Console
  2. Go to Lambda → Your function (mlops-service-lambda)
  3. Click the "Monitor" tab
  4. You'll see automatic metrics graphs:

📈 Invocations

Number of times your Lambda was called (traffic signal)

⏱️ Duration

How long each request took (latency signal)
Note: First request may show 1-3 seconds (cold start), then ~100-200ms

❌ Errors

Failed invocations (error signal)

🚫 Throttles

Requests rejected due to concurrency limits (saturation signal)
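
The console is all you need for this lab, but the same metrics can also be fetched programmatically. A minimal boto3 sketch (region and function name are assumptions from Lab 8; adjust to yours):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # your region

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",  # also: "Duration", "Errors", "Throttles"
    Dimensions=[{"Name": "FunctionName", "Value": "mlops-service-lambda"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,                # one datapoint per 5 minutes
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])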

3. View Lambda Logs in CloudWatch

Access CloudWatch Logs:

  1. In Lambda "Monitor" tab, click "View CloudWatch logs"
  2. You'll see log streams - one per Lambda execution environment
  3. Click on the latest log stream
  4. View your application logs:

Example Log Output:

START RequestId: abc123 Version: $LATEST
2024-01-15T10:30:00.000Z INFO Received request to /health
2024-01-15T10:30:00.050Z INFO Database connection healthy
2024-01-15T10:30:00.100Z INFO Prometheus metrics updated
END RequestId: abc123
REPORT RequestId: abc123 Duration: 150.00 ms Billed Duration: 150 ms Memory Size: 128 MB Max Memory Used: 89 MB

💡 Understanding Cold Starts:

When Lambda hasn't been used for ~10-15 minutes, AWS pauses it. The first request after idle takes 1-3 seconds to "warm up" (load code, start Python). This is normal and expected. Subsequent requests are fast (~100-200ms).

4. Test and Observe Metrics

Generate some traffic to your Lambda:

# Replace with your API Gateway URL from Lab 8
API_URL="https://YOUR_API_GATEWAY_URL"

# Send 10 test requests
for i in {1..10}; do
  curl -s $API_URL/health
  sleep 1
done

Then refresh CloudWatch to see:

  • Invocations count increased by 10
  • Duration showing ~100-200ms per request
  • First request may show higher duration (cold start)
  • Logs showing your requests

Part C: Set Up CloudWatch Alarm for Errors

Get notified when your Lambda function error rate is too high

1. Why Set Up Alarms?

CloudWatch Alarms proactively notify you when metrics exceed thresholds. Instead of constantly checking dashboards, AWS sends you an email/SMS when something is wrong.

Example Scenarios:

  • Error rate > 10%: Something is broken, investigate immediately
  • Duration > 5 seconds: Function is slow, may timeout
  • Throttles > 0: Too many concurrent requests, need more capacity

2. Create SNS Topic for Notifications

SNS (Simple Notification Service) sends alarm notifications via email/SMS.

Create SNS Topic:

  1. Go to AWS Console → Search "SNS" → Open SNS
  2. Click "Topics" in left sidebar → "Create topic"
  3. Configure:
    • Type: Standard
    • Name: mlops-lambda-alerts
  4. Click "Create topic"

Create Email Subscription:

  1. In the topic page, click "Create subscription"
  2. Configure:
    • Protocol: Email
    • Endpoint: Your email address
  3. Click "Create subscription"
  4. Check your email for confirmation link and click it
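
If you prefer scripting these console steps, a rough boto3 equivalent looks like this (the email address and region are placeholders; the confirmation email still has to be clicked):

import boto3

sns = boto3.client("sns", region_name="us-east-1")  # your region

# Create the topic (idempotent: returns the existing ARN if it already exists)
topic = sns.create_topic(Name="mlops-lambda-alerts")

# Subscribe an email address; AWS still sends a confirmation link to click
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",  # placeholder - use your address
)
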
3. Create Error Rate Alarm

Create CloudWatch Alarm:

  1. Go to CloudWatch → Alarms → "Create alarm"
  2. Click "Select metric"
  3. Navigate: Lambda → By Function Name
  4. Select your function (mlops-service-lambda)
  5. Check the Errors metric → "Select metric"
  6. Configure alarm conditions:
    • Statistic: Sum
    • Period: 5 minutes
    • Threshold type: Static
    • Whenever Errors is: Greater than 5
  7. Click "Next"

Configure notifications:

  1. Select "In alarm" state trigger
  2. Select your SNS topic: mlops-lambda-alerts
  3. Click "Next"

Name the alarm:

  1. Alarm name: mlops-lambda-high-errors
  2. Description: "Alert when Lambda errors exceed 5 in 5 minutes"
  3. Click "Next" → Review → "Create alarm"

✅ What You've Done:

Now if your Lambda function has more than 5 errors within a 5-minute window, you'll get an email notification automatically! This is proactive monitoring - you know about problems before users complain.

4. Test Alarm (Optional)

To test the alarm, you would need to intentionally cause errors. For this lab, it's not necessary - just knowing the alarm exists is enough.

💡 How to Test (Advanced):

  • Temporarily break your Lambda (e.g., wrong DATABASE_URL)
  • Send 6+ requests to trigger errors
  • Wait 5 minutes for alarm to trigger
  • Receive email notification
  • Fix the issue and alarm returns to "OK" state

Part D: Enhance Prometheus Dashboard with Production Features

Add detailed health checks, error tracking, and uptime monitoring to your local service

1. What We're Adding

Lab 2 set up basic Prometheus monitoring. Now we'll add production-grade features:

  • Detailed health endpoint: Service status, uptime, system resources
  • Request tracking: Total requests, failed requests, error rates
  • Uptime monitoring: How long service has been running
  • System resources: Memory and CPU usage
  • Enhanced dashboard: 10 comprehensive metrics cards

2. Update Flask Dependencies

Add psutil for system monitoring:

Update mlops-service/requirements.txt:

# Existing dependencies remain...

# Additional utilities
python-dotenv==1.0.0
psutil==5.9.8  # NEW: For system monitoring

Install the new dependency:

cd mlops-service
pip install psutil==5.9.8

3. Code Enhancements Already Implemented

Your instructor has already enhanced mlops-service/app.py with production monitoring features:

✅ Global Request Tracking

Variables to track: TOTAL_REQUESTS, FAILED_REQUESTS, SERVICE_START_TIME

✅ Helper Functions

get_uptime_seconds() - Calculate service uptime
get_memory_usage_mb() - Get memory usage via psutil
calculate_error_rate() - Calculate error percentage
check_database_connection() - Verify database access

✅ New /health/detailed Endpoint

Returns comprehensive health info: status, uptime, system resources, request stats, error rates, health checks

✅ Enhanced /track Endpoint

Now increments TOTAL_REQUESTS and tracks FAILED_REQUESTS on errors

✅ Structured Logging

Formatted log output with timestamps, log levels, and structured messages
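
The actual implementation already lives in mlops-service/app.py; for orientation, here is a minimal sketch of what helpers like these could look like (names taken from the list above, details assumed):

import os
import time
import psutil

SERVICE_START_TIME = time.time()
TOTAL_REQUESTS = 0
FAILED_REQUESTS = 0

def get_uptime_seconds():
    # Seconds since this process started serving requests
    return time.time() - SERVICE_START_TIME

def get_memory_usage_mb():
    # Resident memory of the current process, in megabytes
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

def calculate_error_rate():
    # Failed requests as a percentage of all requests; 0% when no traffic yet
    if TOTAL_REQUESTS == 0:
        return 0.0
    return (FAILED_REQUESTS / TOTAL_REQUESTS) * 100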

4. Test the Enhanced Endpoints

Start your local Flask service:

cd mlops-service
python app.py

Test the detailed health endpoint:

curl http://localhost:5001/health/detailed | jq

Expected Output:

{
  "status": "healthy",
  "service": "mlops-service-prometheus",
  "version": "1.0.0",
  "timestamp": "2024-01-15T10:30:00.000000",
  "environment": "development",
  "uptime": {
    "seconds": 3600.5,
    "hours": 1.0,
    "days": 0.04,
    "started_at": "2024-01-15T09:30:00.000000"
  },
  "system": {
    "memory_usage_mb": 125.4,
    "cpu_percent": 2.5,
    "process_id": 12345
  },
  "requests": {
    "total": 150,
    "failed": 3,
    "error_rate_percent": 2.0,
    "last_request": "2024-01-15T10:29:00.000000"
  },
  "health_checks": {
    "database_connected": true,
    "prometheus_enabled": true,
    "metrics_tracking": "active"
  }
}

5. View Enhanced Dashboard

The dashboard has been enhanced to display 10 comprehensive metrics:

Open the dashboard:

open http://localhost:5001/

10 Metrics Displayed:

1. Service Status - Healthy/Unhealthy with environment
2. Service Uptime - Hours/days since start
3. Total Requests - All tracked requests
4. Error Rate - Color-coded: green <5%, yellow 5-10%, red >10%
5. Memory Usage - MB used with CPU%
6. Database Status - Connected/Disconnected
7. Total Tokens - AI tokens consumed
8. Total API Cost - USD spent on AI
9. Appointments - Booking requests
10. Human Handoffs - Escalations needed

💡 Dashboard Features:

  • Auto-refreshes every 10 seconds
  • Manual refresh button for immediate updates
  • Color-coded error rate indicators (green/yellow/red)
  • Links to all monitoring endpoints

Part E: Production Monitoring Best Practices

Learn industry best practices for production AI systems

1. Setting Alert Thresholds

✅ Good Alert Thresholds

  • Error rate > 5-10%: Something is likely broken
  • Response time > 3 seconds: User experience degrading
  • Memory usage > 80%: Risk of crashes

❌ Bad Alert Thresholds

  • Alert on every error: Too noisy, you'll ignore alerts
  • Alert on 50% error rate: Too late, many users affected
  • No alerts at all: You won't know when things break

💡 Rule of Thumb:

Set thresholds that indicate actionable problems - not so sensitive that you get false alarms, not so loose that real issues go unnoticed.

2. Track AI API Costs

AI API calls can get expensive quickly. Always monitor:

  • Total tokens used: Shows API usage volume
  • Cost per request: Helps optimize prompts
  • Daily/weekly trends: Spot unexpected spikes

⚠️ Example Cost Issue:

A student accidentally created an infinite loop of AI calls. Without monitoring, they didn't notice until their AWS bill was $500. Monitoring could have caught this in hours, not days.

3. Use Appropriate Log Levels

DEBUG

Detailed info for debugging (only in development)

INFO

Normal operations (e.g., "Request received", "Metric updated")

WARNING

Something unexpected but not critical (e.g., "Retry attempt 2 of 3")

ERROR

Failures that need attention (e.g., "Database connection failed")
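
In Python, these levels map directly onto the standard logging module. A minimal setup sketch:

import logging

# Timestamped, leveled output - DEBUG in development, INFO and above in production
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("mlops-service")

logger.debug("Raw model payload: ...")      # development-only detail (suppressed at INFO level)
logger.info("Request received on /track")   # normal operation
logger.warning("Retry attempt 2 of 3")      # unexpected but handled
logger.error("Database connection failed")  # needs attention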

4. Log and Metric Retention

CloudWatch logs and metrics cost money to store. Set retention policies:

  • Logs: Keep 7-30 days (good for debugging recent issues)
  • Metrics: CloudWatch keeps high-resolution for 3 hours, then aggregates
  • Alarms: Keep indefinitely (minimal cost)
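
Log retention can be set in the CloudWatch console or programmatically. A boto3 sketch (the log group name assumes the Lambda from Lab 8):

import boto3

logs = boto3.client("logs", region_name="us-east-1")  # your region

# Default for Lambda log groups is "Never expire"; cap it at 14 days instead
logs.put_retention_policy(
    logGroupName="/aws/lambda/mlops-service-lambda",  # assumes Lab 8's function name
    retentionInDays=14,  # allowed values include 1, 3, 5, 7, 14, 30, 60, 90, ...
)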

💡 Cost Optimization:

For this course project, CloudWatch costs are negligible. But in production with high traffic, log retention settings can save hundreds of dollars per month.

Part F: CloudWatch vs Prometheus Comparison

Understand when to use each monitoring approach

CloudWatch vs Prometheus

AWS CloudWatch

✅ Pros:

  • Automatic for AWS services
  • No setup required for Lambda/EC2
  • Built-in alarms and notifications
  • Integrated with entire AWS ecosystem

❌ Cons:

  • AWS-specific (vendor lock-in)
  • Costs more for high-volume metrics
  • Less flexible than Prometheus

Prometheus

✅ Pros:

  • Open-source and free
  • Works anywhere (AWS, GCP, on-premise)
  • Powerful query language (PromQL)
  • Custom metrics easy to add

❌ Cons:

  • Requires manual setup
  • Need to manage Prometheus server
  • No built-in alerting (need Alertmanager)

When to Use Each

Use CloudWatch when:

  • Using AWS services (Lambda, EC2, RDS)
  • Want zero-setup monitoring
  • Need AWS-integrated alarms
  • Small to medium scale

Use Prometheus when:

  • Multi-cloud or on-premise deployment
  • Need custom application metrics
  • Want full control over monitoring
  • Large scale with complex queries

Use Both (Our Approach):

  • CloudWatch: Monitor deployed Lambda/EC2
  • Prometheus: Track application-specific metrics (AI costs, appointments, handoffs)
  • Best of both worlds: AWS infrastructure + custom business metrics

Troubleshooting

Can't see Lambda metrics in CloudWatch:

  • Make sure you've invoked the Lambda at least once
  • Check you're in the correct AWS region
  • Wait a few minutes for metrics to appear

SNS subscription not confirmed:

  • Check spam folder for confirmation email
  • Try creating subscription again with different email

CloudWatch alarm not triggering:

  • Verify alarm is in "OK" state (not "Insufficient data")
  • Check threshold is actually exceeded
  • Wait full evaluation period (5 minutes)

/health/detailed endpoint returns error:

  • Ensure psutil is installed: pip install psutil==5.9.8
  • Restart Flask service: python app.py
  • Check application logs for specific errors

Dashboard shows 0 for all metrics:

  • Service may have just started (metrics will populate with usage)
  • Try sending test requests: curl http://localhost:5001/health
  • Check browser console for fetch errors

Error rate showing incorrectly:

  • Error rate is calculated as (failed / total) * 100
  • If total requests is 0, error rate shows 0%
  • Send some requests to populate statistics

Lab 9 Summary - What You Learned

Congratulations! You've learned production monitoring concepts and enhanced your MLOps service with production-grade observability. Here's what you accomplished:

✅ Monitoring Concepts Learned

  • 4 Golden Signals: Latency, Traffic, Errors, Saturation
  • Monitoring Types: Metrics, Logs, and Traces
  • CloudWatch: AWS's automatic monitoring for Lambda
  • Alarms: Proactive notifications when thresholds are exceeded
  • Production Best Practices: Alert thresholds, log levels, retention

🚀 What You Built

  • CloudWatch Exploration: Viewed Lambda metrics and logs
  • CloudWatch Alarm: Email notifications for high error rates
  • Enhanced Health Endpoint: Detailed service status with uptime and resources
  • Request Tracking: Total requests, failed requests, error rates
  • Improved Dashboard: 10 comprehensive metrics with color-coded indicators

🔑 Key Takeaways

  • Monitor everything: You can't fix what you don't measure
  • CloudWatch is automatic: AWS Lambda monitoring requires zero setup
  • Set smart thresholds: Not too noisy, not too loose
  • Track AI costs: Prevent surprise bills from runaway API usage
  • Cold starts are normal: First Lambda request after idle takes 1-3 seconds
  • Use both CloudWatch and Prometheus: AWS infrastructure + custom metrics

📝 Test Your Knowledge

Complete the Lab 9 quiz to test your understanding of monitoring and logging for production AI systems.

Take Lab 9 Quiz →

📸 Quiz Submission Checklist:

  • Complete all 5 multiple-choice questions
  • Take a screenshot of your results page showing:
    • Your name
    • Your score (aim for 4/5 or 5/5)
    • Session ID
    • Timestamp
  • Submit screenshot as proof of completion