Version: 1.0.0
Monitoring & Logging
Guidance for Node.js services on Azure Container Apps (ACA) using Azure Monitor and workspace-based Application Insights.
Goals
- End-to-end observability (metrics, logs, traces) for every service.
- Fast detection: alert on errors, latency, resource pressure, and restarts.
- Actionability: standard queries, dashboards, and runbooks.
Stack
- Compute: Azure Container Apps.
- Observability: Azure Monitor + Log Analytics workspace + Application Insights (workspace-based).
- Visualization: Azure Monitor Workbooks; optional Azure Managed Grafana.
- Tracing: OpenTelemetry with Azure Monitor exporter.
What to monitor
- Availability: request rate, 4xx/5xx ratio, p95/p99 latency.
- Performance: CPU %, memory working set, container restarts, scale events.
- Reliability: dependency failures (DB, Service Bus, HTTP downstream), retry counts, queue backlog (see the custom-metric sketch after this list).
- Security: auth failures, permission denials, unexpected public endpoints.
- Platform: ACA revision health, ingress errors.
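Retry counts and queue backlog are not collected automatically; they have to be reported from the app. A minimal sketch using the Application Insights Node SDK's trackMetric and trackDependency, assuming the SDK has already been initialized as in the minimal example below (the metric and dependency names are illustrative):
import appInsights from 'applicationinsights';
// Call these only after appInsights.setup(...).start() has run (see the minimal example below).
export function reportQueueBacklog(depth) {
  // Surface queue depth as a custom metric so it can drive the backlog alert.
  appInsights.defaultClient.trackMetric({ name: 'queue_backlog', value: depth });
}
export function recordDownstreamCall(name, durationMs, success, resultCode) {
  // Record a downstream call so dependency failures show up alongside auto-collected telemetry.
  appInsights.defaultClient.trackDependency({
    target: name,
    name,
    data: name,
    duration: durationMs,
    resultCode,
    success,
    dependencyTypeName: 'HTTP',
  });
}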
Setup (platform)
- Create a Log Analytics workspace (same region as ACA).
- Create an Application Insights resource linked to that workspace.
- In ACA Environment > Diagnostic settings, send ContainerAppConsoleLogs and ContainerAppSystemLogs to the workspace.
- For each Container App:
  - Set env vars: APPLICATIONINSIGHTS_CONNECTION_STRING, OTEL_SERVICE_NAME, LOG_LEVEL (info/warn/error), NODE_ENV.
  - Health probes: /healthz (liveness), /readyz (readiness) with fast responses (see the probe sketch after this list).
  - Scale rules: CPU, RPS, or queue depth (Service Bus/Event Hub) as appropriate.
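Probe endpoints should answer quickly and avoid heavy dependency checks. A minimal Express sketch, assuming the app flips a readiness flag once startup work is done (isReady is illustrative):
import express from 'express';
const app = express();
// Illustrative flag: flip it once startup work (config load, DB connections) has finished.
let isReady = false;
// Liveness: respond immediately so ACA does not restart a healthy but busy container.
app.get('/healthz', (_req, res) => res.status(200).send('ok'));
// Readiness: return 503 until the app can serve traffic, keeping it out of ingress rotation.
app.get('/readyz', (_req, res) =>
  isReady ? res.status(200).send('ready') : res.status(503).send('starting')
);
app.listen(process.env.PORT || 3000, () => {
  isReady = true;
});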
Setup (Node.js app)
- Dependencies: applicationinsights, pino; optionally @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and @azure/monitor-opentelemetry-exporter for tracing.
- Structure logs as single-line JSON; include the service name and correlation IDs (see the middleware sketch after the minimal example below).
Minimal logging + metrics
import appInsights from 'applicationinsights';
import pino from 'pino';
import express from 'express';
// Initialize Application Insights as early as possible so auto-collection covers all requests.
appInsights
  .setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoCollectRequests(true)
  .setAutoCollectPerformance(true)
  .setAutoCollectExceptions(true)
  .setSendLiveMetrics(true)
  .setDistributedTracingMode(appInsights.DistributedTracingModes.AI_AND_W3C)
  .start();
// Single-line JSON logs; the service field makes cross-app queries in Log Analytics easier.
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: { service: 'appvity-api' },
  messageKey: 'msg',
});
const app = express();
// Liveness probe: fast response, no dependency checks.
app.get('/healthz', (_req, res) => res.status(200).send('ok'));
app.listen(process.env.PORT || 3000, () => {
  logger.info({ msg: 'service-started', port: process.env.PORT || 3000 });
});
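To get correlation IDs into every log line, a small middleware can attach a child logger per request. This builds on the example above; the x-request-id header and /orders route are illustrative:
import crypto from 'node:crypto';
// Attach a correlation ID and a child logger to each request.
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();
  res.setHeader('x-request-id', requestId);
  req.log = logger.child({ requestId });
  next();
});
// Handlers log through req.log so every line carries the correlation ID.
app.get('/orders', (req, res) => {
  req.log.info({ msg: 'orders-requested' });
  res.json([]);
});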
Distributed tracing (OpenTelemetry)
// Start the SDK before importing instrumented modules (e.g., via a separate entry file or --require/--import)
// so the auto-instrumentations can patch express, http, and other clients.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { AzureMonitorTraceExporter } from '@azure/monitor-opentelemetry-exporter';
const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || 'appvity-api',
  // Export spans directly to the workspace-based Application Insights resource.
  traceExporter: new AzureMonitorTraceExporter({
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
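Auto-instrumentation covers inbound HTTP and common clients; business operations can be wrapped in manual spans with the OpenTelemetry API (the tracer, span name, and attribute below are illustrative):
import { trace } from '@opentelemetry/api';
const tracer = trace.getTracer('appvity-api');
// Open a span around a business operation; record failures and always end the span.
async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      // ...actual work goes here...
      return 'done';
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}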
Standard alerts (examples)
- 5xx rate > 2% for 5m.
- p95 latency > 800 ms for 5m.
- CPU > 80% for 5m or memory > 80% for 5m.
- Container restarts > 3 in 15m.
- Queue backlog (Service Bus) > threshold tied to SLA.
- No logs ingested in 10m (heartbeat).
Useful KQL queries
Requests (error rate):
requests
| where timestamp > ago(1h)
| summarize total = count(), errors = countif(success == false) by bin(timestamp, 5m)
| extend error_rate = 100.0 * errors / total
Latency (p95):
requests
| where timestamp > ago(1h)
| summarize p95 = percentile(duration, 95) by bin(timestamp, 5m)
Container logs (by app, level):
ContainerAppConsoleLogs
| where TimeGenerated > ago(1h)
| where ContainerAppName == "appvity-api"
| extend level = tostring(parse_json(Log).level)
| summarize count() by level
Exceptions:
exceptions
| where timestamp > ago(1h)
| summarize count() by type, outerMessage, bin(timestamp, 10m)
Dashboards
- Azure Monitor Workbook: latency, error rate, restarts, scale events, dependency failures.
- Optional Grafana (managed): import Azure Monitor and Log Analytics data sources for shared views.
Retention and cost
- Metrics: keep at least 30 days; logs 30–90 days depending on compliance.
- Use sampling in Application Insights when traffic is high (e.g., 20–50%); see the sampling sketch after this list.
- Prefer structured logs; avoid large payloads and secrets.
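With the classic Node SDK from the minimal example above, fixed-rate sampling is a single setting applied after start(); the 30% value is only an example:
import appInsights from 'applicationinsights';
// After appInsights.setup(...).start(), keep roughly 30% of telemetry to control ingestion cost.
appInsights.defaultClient.config.samplingPercentage = 30;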
Runbooks (quick actions)
- High 5xx/latency: check latest revision, dependency health, recent deploys; roll back if needed.
- CPU/memory pressure: inspect scale rules; increase min replicas or tune concurrency.
- No logs ingested: verify diagnostic settings to workspace and ACA permissions.
- Repeated restarts: check probes, startup latency, and configuration/secret changes.
Local development
- Set APPLICATIONINSIGHTS_CONNECTION_STRING to a non-prod resource.
- Keep LOG_LEVEL=debug locally; use info or higher in production.
- Exercise /healthz and a sample request flow to ensure traces and logs appear in the workspace (see the sketch below).
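A quick way to generate that traffic locally is a small script against the running service; BASE_URL and the /orders route are illustrative (requires Node 18+ for global fetch):
// smoke-test.mjs: hit the health probe and one sample route to produce logs and traces.
const base = process.env.BASE_URL || 'http://localhost:3000';
const health = await fetch(`${base}/healthz`);
console.log('healthz:', health.status);
const sample = await fetch(`${base}/orders`);
console.log('sample request:', sample.status);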