Lab 9: Monitoring & Logging for Production AI Systems
Learn production monitoring concepts, explore AWS CloudWatch for deployed Lambda functions, and enhance your Prometheus dashboard with detailed health checks and metrics.
What You'll Do: Understand the 4 Golden Signals, explore AWS CloudWatch metrics and logs for Lambda, set up CloudWatch alarms, and enhance your local Prometheus dashboard with production-grade monitoring features
Lab Collaborators:
- Edward Lampoh - Software Developer & Collaborator
- Oluwafemi Adebayo, PhD - Academic Professor & Collaborator
Before starting Lab 9, ensure you have:
- Lab 2 completed with basic Prometheus monitoring
- Lab 8 completed with Lambda function deployed
- Flask MLOps service running locally
- AWS account access with Lambda function from Lab 8
🔍 Quick Test
# Verify local service is running
curl http://localhost:5001/health
# Should return healthy status
Part A: Understanding Production Monitoring
Learn the fundamentals of monitoring production AI systems
Google's Site Reliability Engineering team identified 4 key metrics every production system should monitor:
1. Latency
What: How long it takes to service a request
Example: Your Lambda function takes 150ms to respond
2. Traffic
What: How much demand is on your system
Example: 1,000 AI chat requests per day
3. Errors
What: Rate of failed requests
Example: 2% of requests fail due to database timeouts
4. Saturation
What: How "full" your service is
Example: Lambda using 450MB of 512MB memory limit
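To make these signals concrete, here is a minimal sketch of how all four could be captured with the prometheus_client library (the same library behind your Lab 2 dashboard). The metric names are illustrative assumptions, not the metrics defined in app.py:
# Sketch: the 4 Golden Signals as Prometheus metrics (names are illustrative)
import psutil  # installed later in this lab (Part D)
from prometheus_client import Counter, Gauge, Histogram

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency")  # latency
REQUEST_COUNT = Counter("requests_total", "Total requests received")       # traffic
REQUEST_ERRORS = Counter("request_errors_total", "Total failed requests")  # errors
MEMORY_USAGE = Gauge("memory_usage_bytes", "Resident memory in bytes")     # saturation

def handle_request():
    REQUEST_COUNT.inc()                # traffic: count every request
    with REQUEST_LATENCY.time():       # latency: time the actual work
        try:
            pass                       # ... the real request handling goes here
        except Exception:
            REQUEST_ERRORS.inc()       # errors: count failures
            raise
    MEMORY_USAGE.set(psutil.Process().memory_info().rss)  # saturation: memory in use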
📊 Monitoring (Metrics)
What: Numeric measurements over time
Example: Request count, response time, error rate
Tools: Prometheus, CloudWatch Metrics
📝 Logging (Events)
What: Text records of what happened
Example: "User john@example.com requested appointment at 2pm"
Tools: CloudWatch Logs, application logs
🔍 Tracing (Request Flow)
What: Following a single request through multiple services
Example: User request → API Gateway → Lambda → Database
Tools: AWS X-Ray, Jaeger
Monitoring helps you:
- Detect issues early: Know when error rates spike before users complain
- Understand usage patterns: When are peak hours? Which features are used most?
- Optimize costs: See where you're spending money (API calls, compute time)
- Plan capacity: Know when to scale up resources
- Debug production problems: Trace issues to specific components
⚠️ Without Monitoring:
- You won't know when your service is down until users report it
- Debugging production issues is like flying blind
- Unexpected AWS bills because you didn't track usage
- No data to optimize performance or user experience
Part B: Explore AWS CloudWatch for Lambda
Learn how AWS automatically monitors your Lambda functions
CloudWatch is AWS's monitoring and logging service that automatically tracks all Lambda functions with zero configuration needed.
What CloudWatch Provides:
- Metrics: Invocations, duration, errors, throttles
- Logs: Everything your Lambda prints (console.log, print statements)
- Alarms: Notifications when metrics exceed thresholds
- Dashboards: Visual graphs of your metrics
Access CloudWatch Metrics:
- Sign in to AWS Console
- Go to Lambda → Your function (mlops-service-lambda)
- Click the "Monitor" tab
- You'll see automatic metrics graphs:
📈 Invocations
Number of times your Lambda was called (traffic signal)
⏱️ Duration
How long each request took (latency signal)
Note: First request may show 1-3 seconds (cold start), then ~100-200ms
❌ Errors
Failed invocations (error signal)
🚫 Throttles
Requests rejected due to concurrency limits (saturation signal)
Access CloudWatch Logs:
- In Lambda "Monitor" tab, click "View CloudWatch logs"
- You'll see log streams - one per Lambda execution environment
- Click on the latest log stream
- View your application logs:
Example Log Output:
START RequestId: abc123 Version: $LATEST
2024-01-15T10:30:00.000Z INFO Received request to /health
2024-01-15T10:30:00.050Z INFO Database connection healthy
2024-01-15T10:30:00.100Z INFO Prometheus metrics updated
END RequestId: abc123
REPORT RequestId: abc123 Duration: 150.00 ms Memory Size: 128 MB Max Memory Used: 89 MB
💡 Understanding Cold Starts:
When Lambda hasn't been used for ~10-15 minutes, AWS pauses it. The first request after idle takes 1-3 seconds to "warm up" (load code, start Python). This is normal and expected. Subsequent requests are fast (~100-200ms).
Generate some traffic to your Lambda:
# Replace with your API Gateway URL from Lab 8
API_URL="https://YOUR_API_GATEWAY_URL"
# Send 10 test requests
for i in {1..10}; do
curl -s $API_URL/health
sleep 1
done
Then refresh CloudWatch to see:
- Invocations count increased by 10
- Duration showing ~100-200ms per request
- First request may show higher duration (cold start)
- Logs showing your requests
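You can also pull the same metrics programmatically. Here is a hedged boto3 sketch that fetches the Invocations metric; it assumes your AWS credentials and region are already configured and that your function is named mlops-service-lambda as in Lab 8:
# Sketch: fetch Lambda invocation counts from CloudWatch with boto3
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",  # also try: Duration, Errors, Throttles
    Dimensions=[{"Name": "FunctionName", "Value": "mlops-service-lambda"}],
    StartTime=now - timedelta(hours=1),  # last hour of data
    EndTime=now,
    Period=300,                          # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])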
Part C: Set Up CloudWatch Alarm for Errors
Get notified when your Lambda function error rate is too high
CloudWatch Alarms proactively notify you when metrics exceed thresholds. Instead of constantly checking dashboards, AWS sends you an email/SMS when something is wrong.
Example Scenarios:
- Error rate > 10%: Something is broken, investigate immediately
- Duration > 5 seconds: Function is slow, may timeout
- Throttles > 0: Too many concurrent requests, need more capacity
SNS (Simple Notification Service) sends alarm notifications via email/SMS.
Create SNS Topic:
- Go to AWS Console → Search "SNS" → Open SNS
- Click "Topics" in left sidebar → "Create topic"
- Configure:
- Type: Standard
- Name: mlops-lambda-alerts
- Click "Create topic"
Create Email Subscription:
- In the topic page, click "Create subscription"
- Configure:
- Protocol: Email
- Endpoint: Your email address
- Click "Create subscription"
- Check your email for confirmation link and click it
Create CloudWatch Alarm:
- Go to CloudWatch → Alarms → "Create alarm"
- Click "Select metric"
- Navigate: Lambda → By Function Name
- Select your function (mlops-service-lambda)
- Check the Errors metric → "Select metric"
- Configure alarm conditions:
- Statistic: Sum
- Period: 5 minutes
- Threshold type: Static
- Whenever Errors is: Greater than 5
- Click "Next"
Configure notifications:
- Select "In alarm" state trigger
- Select your SNS topic: mlops-lambda-alerts
- Click "Next"
Name the alarm:
- Alarm name: mlops-lambda-high-errors
- Description: "Alert when Lambda errors exceed 5 in 5 minutes"
- Click "Next" → Review → "Create alarm"
✅ What You've Done:
Now if your Lambda function has more than 5 errors within a 5-minute window, you'll get an email notification automatically! This is proactive monitoring - you know about problems before users complain.
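If you prefer scripting to console clicks, a rough boto3 equivalent of the steps above looks like this; the topic ARN is a placeholder you would copy from the SNS console:
# Sketch: create the same alarm with boto3 instead of the console
import boto3

TOPIC_ARN = "arn:aws:sns:REGION:ACCOUNT_ID:mlops-lambda-alerts"  # placeholder

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="mlops-lambda-high-errors",
    AlarmDescription="Alert when Lambda errors exceed 5 in 5 minutes",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "mlops-service-lambda"}],
    Statistic="Sum",
    Period=300,            # 5-minute window
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[TOPIC_ARN],  # notify the SNS topic when the alarm fires
)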
To test the alarm, you would need to intentionally cause errors. For this lab, it's not necessary - just knowing the alarm exists is enough.
💡 How to Test (Advanced):
- Temporarily break your Lambda (e.g., wrong DATABASE_URL)
- Send 6+ requests to trigger errors
- Wait 5 minutes for alarm to trigger
- Receive email notification
- Fix the issue and alarm returns to "OK" state
Part D: Enhance Prometheus Dashboard with Production Features
Add detailed health checks, error tracking, and uptime monitoring to your local service
Lab 2 set up basic Prometheus monitoring. Now we'll add production-grade features:
- Detailed health endpoint: Service status, uptime, system resources
- Request tracking: Total requests, failed requests, error rates
- Uptime monitoring: How long service has been running
- System resources: Memory and CPU usage
- Enhanced dashboard: 10 comprehensive metrics cards
Add psutil for system monitoring:
Update mlops-service/requirements.txt:
# Existing dependencies remain...
# Additional utilities
python-dotenv==1.0.0
psutil==5.9.8 # NEW: For system monitoring
Install the new dependency:
cd mlops-service
pip install psutil==5.9.8
Your instructor has already enhanced mlops-service/app.py with production monitoring features:
✅ Global Request Tracking
Variables to track: TOTAL_REQUESTS, FAILED_REQUESTS, SERVICE_START_TIME
✅ Helper Functions
• get_uptime_seconds() - Calculate service uptime
• get_memory_usage_mb() - Get memory usage via psutil
• calculate_error_rate() - Calculate error percentage
• check_database_connection() - Verify database access
✅ New /health/detailed Endpoint
Returns comprehensive health info: status, uptime, system resources, request stats, error rates, health checks
✅ Enhanced /track Endpoint
Now increments TOTAL_REQUESTS and tracks FAILED_REQUESTS on errors
✅ Structured Logging
Formatted log output with timestamps, log levels, and structured messages
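For intuition, here is a simplified sketch of what these helpers might look like; it is not the exact code, so treat app.py as the source of truth:
# Simplified sketch of the monitoring helpers (see app.py for the real code)
import os
import time
import psutil

SERVICE_START_TIME = time.time()
TOTAL_REQUESTS = 0
FAILED_REQUESTS = 0

def get_uptime_seconds() -> float:
    """Seconds since the service process started."""
    return time.time() - SERVICE_START_TIME

def get_memory_usage_mb() -> float:
    """Resident memory of this process in megabytes, via psutil."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

def calculate_error_rate() -> float:
    """Failed requests as a percentage of total (0.0 if no traffic yet)."""
    if TOTAL_REQUESTS == 0:
        return 0.0
    return (FAILED_REQUESTS / TOTAL_REQUESTS) * 100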
Review mlops-service/app.py to see the implementation.
Start your local Flask service:
cd mlops-service
python app.py
Test the detailed health endpoint:
curl http://localhost:5001/health/detailed | jq
Expected Output:
{
"status": "healthy",
"service": "mlops-service-prometheus",
"version": "1.0.0",
"timestamp": "2024-01-15T10:30:00.000000",
"environment": "development",
"uptime": {
"seconds": 3600.5,
"hours": 1.0,
"days": 0.04,
"started_at": "2024-01-15T09:30:00.000000"
},
"system": {
"memory_usage_mb": 125.4,
"cpu_percent": 2.5,
"process_id": 12345
},
"requests": {
"total": 150,
"failed": 3,
"error_rate_percent": 2.0,
"last_request": "2024-01-15T10:29:00.000000"
},
"health_checks": {
"database_connected": true,
"prometheus_enabled": true,
"metrics_tracking": "active"
}
}
The dashboard has been enhanced to display 10 comprehensive metrics:
Open the dashboard:
open http://localhost:5001/
💡 Dashboard Features:
- Auto-refreshes every 10 seconds
- Manual refresh button for immediate updates
- Color-coded error rate indicators (green/yellow/red)
- Links to all monitoring endpoints
Part E: Production Monitoring Best Practices
Learn industry best practices for production AI systems
✅ Good Alert Thresholds
- Error rate > 5-10%: Something is likely broken
- Response time > 3 seconds: User experience degrading
- Memory usage > 80%: Risk of crashes
❌ Bad Alert Thresholds
- Alert on every error: Too noisy, you'll ignore alerts
- Alert on 50% error rate: Too late, many users affected
- No alerts at all: You won't know when things break
💡 Rule of Thumb:
Set thresholds that indicate actionable problems - not so sensitive that you get false alarms, not so loose that real issues go unnoticed.
AI API calls can get expensive quickly. Always monitor:
- Total tokens used: Shows API usage volume
- Cost per request: Helps optimize prompts
- Daily/weekly trends: Spot unexpected spikes
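One way to wire cost tracking into your existing Prometheus setup is a pair of counters. The sketch below is illustrative: the metric names and the per-token price are assumptions, not values from your app.py:
# Illustrative sketch: track AI token usage and estimated cost as counters
from prometheus_client import Counter

AI_TOKENS_USED = Counter("ai_tokens_total", "Total tokens consumed by AI calls")
AI_COST_DOLLARS = Counter("ai_cost_dollars_total", "Estimated AI spend in dollars")

COST_PER_1K_TOKENS = 0.002  # assumed example price - check your provider's pricing

def record_ai_usage(tokens: int) -> None:
    """Record one AI call's token usage and estimated cost."""
    AI_TOKENS_USED.inc(tokens)
    AI_COST_DOLLARS.inc(tokens / 1000 * COST_PER_1K_TOKENS)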
⚠️ Example Cost Issue:
A student accidentally created an infinite loop of AI calls. Without monitoring, they didn't notice until their AWS bill was $500. Monitoring could have caught this in hours, not days.
Use appropriate log levels:
- DEBUG: Detailed info for debugging (only in development)
- INFO: Normal operations (e.g., "Request received", "Metric updated")
- WARNING: Something unexpected but not critical (e.g., "Retry attempt 2 of 3")
- ERROR: Failures that need attention (e.g., "Database connection failed")
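These levels map directly onto Python's standard logging module. Here is a minimal sketch of this kind of setup (the exact format string is an assumption, not necessarily the one in app.py):
# Sketch: configuring Python logging with the levels above
import logging

logging.basicConfig(
    level=logging.INFO,  # show INFO and above; switch to DEBUG only in development
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("mlops-service")

logger.debug("Raw request payload: ...")        # development-only detail
logger.info("Request received at /health")      # normal operation
logger.warning("Retry attempt 2 of 3")          # unexpected but recoverable
logger.error("Database connection failed")      # needs attention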
CloudWatch logs and metrics cost money to store. Set retention policies:
- Logs: Keep 7-30 days (good for debugging recent issues)
- Metrics: CloudWatch keeps high-resolution for 3 hours, then aggregates
- Alarms: Keep indefinitely (minimal cost)
💡 Cost Optimization:
For this course project, CloudWatch costs are negligible. But in production with high traffic, log retention settings can save hundreds of dollars per month.
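Retention can also be set programmatically. Here is a hedged boto3 sketch, assuming the standard /aws/lambda/<function-name> log group naming that Lambda uses:
# Sketch: set a 14-day retention policy on the Lambda's log group
import boto3

logs = boto3.client("logs")
logs.put_retention_policy(
    logGroupName="/aws/lambda/mlops-service-lambda",  # Lambda's default log group name
    retentionInDays=14,  # valid values include 1, 3, 5, 7, 14, 30, 60, 90, ...
)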
Part F: CloudWatch vs Prometheus Comparison
Understand when to use each monitoring approach
AWS CloudWatch
✅ Pros:
- Automatic for AWS services
- No setup required for Lambda/EC2
- Built-in alarms and notifications
- Integrated with entire AWS ecosystem
❌ Cons:
- AWS-specific (vendor lock-in)
- Costs more for high-volume metrics
- Less flexible than Prometheus
Prometheus
✅ Pros:
- Open-source and free
- Works anywhere (AWS, GCP, on-premise)
- Powerful query language (PromQL)
- Custom metrics easy to add
❌ Cons:
- Requires manual setup
- Need to manage Prometheus server
- No built-in alerting (need Alertmanager)
Use CloudWatch when:
- Using AWS services (Lambda, EC2, RDS)
- Want zero-setup monitoring
- Need AWS-integrated alarms
- Small to medium scale
Use Prometheus when:
- Multi-cloud or on-premise deployment
- Need custom application metrics
- Want full control over monitoring
- Large scale with complex queries
Use Both (Our Approach):
- CloudWatch: Monitor deployed Lambda/EC2
- Prometheus: Track application-specific metrics (AI costs, appointments, handoffs)
- Best of both worlds: AWS infrastructure + custom business metrics
Can't see Lambda metrics in CloudWatch:
- Make sure you've invoked the Lambda at least once
- Check you're in the correct AWS region
- Wait a few minutes for metrics to appear
SNS subscription not confirmed:
- Check spam folder for confirmation email
- Try creating subscription again with different email
CloudWatch alarm not triggering:
- Verify alarm is in "OK" state (not "Insufficient data")
- Check threshold is actually exceeded
- Wait full evaluation period (5 minutes)
/health/detailed endpoint returns error:
- Ensure psutil is installed: pip install psutil==5.9.8
- Restart Flask service: python app.py
- Check application logs for specific errors
Dashboard shows 0 for all metrics:
- Service may have just started (metrics will populate with usage)
- Try sending test requests: curl http://localhost:5001/health
- Check browser console for fetch errors
Error rate showing incorrectly:
- Error rate is calculated as (failed / total) * 100
- If total requests is 0, error rate shows 0%
- Send some requests to populate statistics
Congratulations! You've learned production monitoring concepts and enhanced your MLOps service with production-grade observability. Here's what you accomplished:
✅ Monitoring Concepts Learned
- 4 Golden Signals: Latency, Traffic, Errors, Saturation
- Monitoring Types: Metrics, Logs, and Traces
- CloudWatch: AWS's automatic monitoring for Lambda
- Alarms: Proactive notifications when thresholds are exceeded
- Production Best Practices: Alert thresholds, log levels, retention
🚀 What You Built
- CloudWatch Exploration: Viewed Lambda metrics and logs
- CloudWatch Alarm: Email notifications for high error rates
- Enhanced Health Endpoint: Detailed service status with uptime and resources
- Request Tracking: Total requests, failed requests, error rates
- Improved Dashboard: 10 comprehensive metrics with color-coded indicators
🔑 Key Takeaways
- Monitor everything: You can't fix what you don't measure
- CloudWatch is automatic: AWS Lambda monitoring requires zero setup
- Set smart thresholds: Not too noisy, not too loose
- Track AI costs: Prevent surprise bills from runaway API usage
- Cold starts are normal: First Lambda request after idle takes 1-3 seconds
- Use both CloudWatch and Prometheus: AWS infrastructure + custom metrics
📝 Test Your Knowledge
Complete the Lab 9 quiz to test your understanding of monitoring and logging for production AI systems.
📸 Quiz Submission Checklist:
- Complete all 5 multiple-choice questions
- Take a screenshot of your results page showing:
- Your name
- Your score (aim for 4/5 or 5/5)
- Session ID
- Timestamp
- Submit screenshot as proof of completion