logging-monitoring
Install
mkdir -p .claude/skills/logging-monitoring && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16629" && unzip -o skill.zip -d .claude/skills/logging-monitoring && rm skill.zip
Installs to .claude/skills/logging-monitoring
Activation
Implement observability patterns including structured logging, log levels, correlation IDs, metrics, and distributed tracing. Use when adding structured logging, implementing correlation IDs for request tracing, configuring metrics collection, setting up distributed tracing, or designing alerting rules.
About this skill
Logging & Monitoring
Purpose: Implement observability for production systems.
Goal: Structured logs, correlation across requests, actionable metrics.
Note: For implementation, see C# Development or Python Development.
When to Use This Skill
- Adding structured logging to applications
- Implementing request correlation IDs
- Configuring metrics collection
- Setting up distributed tracing (OpenTelemetry)
- Designing alerting rules and health checks
Prerequisites
- Logging framework installed
- Monitoring platform access
Decision Tree
Observability concern?
+- What to log?
|  +- Request start/end -> INFO with correlation ID
|  +- Expected errors -> WARN (validation, not-found)
|  +- Unexpected errors -> ERROR with stack trace
|  +- Debug details -> DEBUG (disabled in production)
+- What NOT to log?
|  +- PII, passwords, tokens, credit cards -> NEVER
+- Metrics needed?
|  +- RED metrics: Rate, Errors, Duration (for services)
|  +- USE metrics: Utilization, Saturation, Errors (for resources)
+- Distributed tracing?
|  +- OpenTelemetry for cross-service correlation
+- Alerting?
   +- SLO-based: alert on error budget burn rate
   +- Avoid alert fatigue: page only for actionable issues
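The SLO-based alerting branch comes down to a small piece of arithmetic. The following is a minimal sketch, assuming a request-based SLO; the function names, the `threshold` parameter, and the sample numbers are hypothetical and not part of this skill. Burn rate is the observed error rate divided by the error rate the SLO allows, so a value of 1.0 means the budget is being spent exactly as fast as the SLO permits.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def should_page(rate: float, threshold: float) -> bool:
    """Page only when the budget burns faster than a chosen threshold.

    The threshold is an assumption to tune per service and window length.
    """
    return rate >= threshold
```

For example, 50 failures out of 10,000 requests against a 99% SLO gives a burn rate of roughly 0.5, which would not page under a threshold of 1.5.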
Structured Logging
Concept
Log structured data (key-value pairs) instead of plain text for better searchability and analysis.
[FAIL] Unstructured (hard to parse):
"User user@example.com logged in from 192.168.1.1 at 2024-01-15 10:30:00"
[PASS] Structured (machine-readable):
{
  "event": "user_login",
  "user_email": "user@example.com",
  "ip_address": "192.168.1.1",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO"
}
Benefits
- Searchable: Query by any field
- Filterable: Show only errors, specific users, etc.
- Aggregatable: Count events, calculate averages
- Parseable: Tools can process automatically
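One way to produce entries like the structured example above is a JSON formatter on top of Python's standard `logging` module. This is a minimal sketch, not a prescribed implementation: `JsonFormatter`, `build_logger`, and the `fields` key passed through `extra` are hypothetical names chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # "fields" is a hypothetical key carried via logging's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call such as `build_logger("auth").info("user_login", extra={"fields": {"ip_address": "192.168.1.1"}})` then emits one JSON object per event, which a log platform can index by field.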
Log Levels
Standard Levels
| Level | When to Use | Example |
|---|---|---|
| TRACE | Very detailed debugging | "Entering function with params: {x: 1, y: 2}" |
| DEBUG | Debugging information | "Cache hit for key: user_123" |
| INFO | Normal operations | "User logged in", "Order created" |
| WARN | Unexpected but recoverable | "Retry attempt 2 of 3", "Rate limit approaching" |
| ERROR | Failures requiring attention | "Payment failed", "Database connection lost" |
| FATAL | Application cannot continue | "Out of memory", "Configuration invalid" |
Level Configuration by Environment
Development: DEBUG or TRACE
- See detailed information for debugging
Staging: INFO
- Normal operations plus warnings/errors
Production: INFO (or WARN)
- Reduce noise, focus on significant events
- Keep ERROR/FATAL always enabled
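The per-environment defaults above can be resolved at startup. The sketch below assumes two hypothetical environment variables, `APP_ENV` and `LOG_LEVEL`, with `LOG_LEVEL` acting as an explicit override; neither name comes from this skill. Python's standard `logging` module has no TRACE level, so development maps to DEBUG here.

```python
import logging
import os

# Defaults taken from the environment list above.
DEFAULT_LEVELS = {"development": "DEBUG", "staging": "INFO", "production": "INFO"}

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_level(env=None) -> int:
    """Pick a log level from LOG_LEVEL, else from APP_ENV, else INFO."""
    env = os.environ if env is None else env
    app_env = env.get("APP_ENV", "production")
    name = env.get("LOG_LEVEL") or DEFAULT_LEVELS.get(app_env, "INFO")
    # Unknown names fall back to INFO rather than silencing the logger.
    return _LEVELS.get(name.upper(), logging.INFO)
```

Calling `logging.getLogger().setLevel(resolve_level())` once at startup applies the result. Because ERROR sits above every level the function can return, errors stay enabled in all environments.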
Core Rules
| Practice | Description |
|---|---|
| Structured logging | JSON format with key-value pairs |
| Correlation IDs | Trace requests across services |
| Appropriate levels | DEBUG in dev, INFO+ in prod |
| No sensitive data | Never log passwords, tokens, PII |
| Context in errors | Include what, why, and how to fix |
| Meaningful metrics | Track rate, errors, duration |
| Health checks | Liveness + readiness endpoints |
| Actionable alerts | Include runbooks, reduce noise |
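The correlation ID rule can be sketched with the standard library alone. This is a hedged illustration, assuming the ID is stored per request in a `contextvars.ContextVar` and stamped onto every record by a logging filter; `start_request` and `CorrelationFilter` are hypothetical names, and how the incoming ID is read from a request header depends on the web framework in use.

```python
import contextvars
import logging
import uuid

# Holds the ID for the request currently being handled; "-" means none is set.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True  # never drop the record, only annotate it


def start_request(incoming_id=None) -> str:
    """Reuse the caller's ID if one was sent, otherwise generate a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid
```

Adding the filter to a handler and including `%(correlation_id)s` in its format string makes the ID appear on each line. Passing the same value on outgoing calls lets logs from different services be joined on it.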
Anti-Patterns
- Log and Forget: Writing logs but never querying or reviewing them -> Set up dashboards and alerts on ERROR/FATAL; review logs in incident postmortems
- PII in Logs: Logging email addresses, passwords, tokens, or credit card numbers -> Scrub sensitive fields before logging; use allowlists for loggable fields
- Unstructured Strings: Logging plain text messages that are hard to parse or search -> Use structured logging (JSON key-value pairs) for all log entries
- Missing Correlation: Logs from different services with no shared request ID -> Propagate W3C trace context or a correlation ID header across all service calls
- Alert Fatigue: Alerting on every warning or non-actionable metric -> Page only on SLO budget burn rate; group related alerts; include runbook links
- Debug in Production: Running production with DEBUG or TRACE level enabled -> Use INFO or WARN in production; enable DEBUG temporarily and only on specific components
- Metric Overload: Tracking hundreds of custom metrics with no clear purpose -> Focus on RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources
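The allowlist approach named in the PII anti-pattern can be sketched as follows. The field names in `LOGGABLE_FIELDS` and the `[REDACTED]` marker are assumptions for illustration; a real allowlist would be defined per application.

```python
# Hypothetical allowlist: only these keys may reach the log output unchanged.
LOGGABLE_FIELDS = frozenset({"event", "user_id", "status", "duration_ms"})


def scrub(fields: dict, allowed=LOGGABLE_FIELDS) -> dict:
    """Redact every value whose key is not explicitly allowed.

    Keys are kept so that a reader can see a field was present,
    but the value itself never reaches the log.
    """
    return {
        key: (value if key in allowed else "[REDACTED]")
        for key, value in fields.items()
    }
```

An allowlist fails safe: a newly added field is redacted until someone decides it is safe to log, whereas a denylist would leak it by default.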
Observability Tools
| Category | Tools |
|---|---|
| Logging | ELK Stack, Splunk, Datadog Logs, CloudWatch Logs |
| Metrics | Prometheus + Grafana, Datadog, New Relic, CloudWatch |
| Tracing | Jaeger, Zipkin, Datadog APM, Application Insights |
| All-in-One | Datadog, New Relic, Dynatrace, Elastic Observability |
See Also: Error Handling - C# Development - Python Development
Troubleshooting
| Issue | Solution |
|---|---|
| Logs not appearing in monitoring platform | Check log level configuration, verify sink/exporter endpoint |
| Correlation IDs missing across services | Propagate W3C trace context headers in all HTTP calls |
| Alert fatigue from too many notifications | Set meaningful thresholds, group related alerts, add alert suppression windows |