Monitoring & Operations
Monitoring Infrastructure
GCP Cloud Monitoring
Dashboard Metrics:
- API response times
- Error rates
- Database query performance
- Server CPU/Memory usage
- Request throughput
Logging
GCP Cloud Logging:
- Application logs (all API calls)
- Error traces and stack traces
- Audit logs (authentication, authorization)
- System logs
Log Levels:
DEBUG: Detailed diagnostic informationINFO: General operational eventsWARN: Warning conditionsERROR: Error conditionsFATAL: Critical failures
Access Logs:
# View recent error logs
gcloud logging read "severity=ERROR" --limit 50
# Search for specific request
gcloud logging read "textPayload~'request-id-123'" --limit 10
Health Checks
Backend Health Endpoint
GET /health
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T10:00:00Z",
"checks": {
"database": "connected",
"firebase": "healthy",
"cache": "operational"
}
}
Uptime Monitoring
- Uptime Robot for external monitoring
- Alert on >5 consecutive failures
- SMS/email notifications
Alerts & Notifications
Critical Alerts
- Error Rate >1%: Immediate page-on-call
- API Response >5s: Page on-call within 5 min
- Database Down: Immediate page on-call
- Storage Full: Email alert
Warning Alerts
- Error Rate >0.5%: Slack notification
- API Response >2s: Slack notification
- High CPU Usage >80%: Slack notification
Alert Configuration
Managed in GCP Cloud Monitoring console with escalation policies.
Incident Response
Incident Severity Levels
| Level | Impact | Response Time | Resolution Target |
|---|---|---|---|
| Critical | All users affected | 5 minutes | 1 hour |
| High | Some users affected | 15 minutes | 4 hours |
| Medium | Feature degraded | 1 hour | 8 hours |
| Low | Minor issue | 24 hours | Next business day |
Incident Response Procedure
- Detect: Automated alerts trigger
- Acknowledge: On-call engineer confirms
- Assess: Determine impact and severity
- Mitigate: Implement temporary fix
- Resolve: Apply permanent solution
- Document: Post-incident review
Runbooks
Database Connection Failure:
- Check database status in Turso console
- Verify firewall rules
- Check auth token validity
- Restart application if needed
High CPU Usage:
- Check active queries in Cloud Profiler
- Identify heavy operations
- Increase CPU allocation if needed
- Optimize slow queries
Storage Issues:
- Check Firebase Storage quota
- Clean up old proctoring images
- Implement retention policy
- Archive old data
Backup & Recovery
Backup Strategy
Database Backups:
- Automatic daily backups via Turso
- 30-day retention policy
- Point-in-time recovery available
- Geographic redundancy
Application Backups:
- Git repository as source of truth
- Deployment artifacts in GCP Artifact Registry
- Docker images with version tags
Restore Procedures
Database Restore:
# List available backups
turso db list-backups my-db
# Restore from specific backup
turso db restore-backup my-db backup-id
Application Rollback:
# View deployment history
gcloud run revisions list --service=exam-portal-backend
# Rollback to previous version
gcloud run services update-traffic exam-portal-backend --to-revisions=PREVIOUS=100
Maintenance Windows
Scheduled Maintenance
- Weekly: Database optimization (Sunday 2 AM UTC)
- Monthly: Security patches (First Saturday of month)
- Quarterly: Major system updates
Notifications
- Email alerts 48 hours before
- In-app notifications 24 hours before
- Status page updates
Performance Tuning
Database Optimization
-- Analyze query performance
EXPLAIN QUERY PLAN
SELECT * FROM attempts
WHERE student_id = ? AND created_at > ?;
-- Rebuild indexes if fragmented
REINDEX idx_attempts_student_id;
Server Optimization
- Monitor and adjust Cloud Run memory allocation
- Review slow query logs
- Implement caching where appropriate
- Load test before major changes
Compliance & Audit
Audit Trail
- All admin actions logged
- Authentication events tracked
- Data access monitoring
- Change logs maintained
Compliance Reports
- Monthly: Access logs audit
- Quarterly: Security assessment
- Annually: SOC 2 audit preparation
Disaster Recovery Plan
Recovery Time Objective (RTO)
- Database: < 1 hour
- Application: < 30 minutes
- Full Service: < 2 hours
Recovery Point Objective (RPO)
- Database: < 1 hour (last backup)
- Application Code: Real-time (Git)
DR Testing
- Quarterly: Database restore drill
- Quarterly: Application failover test
- Documentation updated after each drill