Monitoring, Alerting and Incident Response

📋 Table of Contents
  1. Monitoring and Alerting
  2. Common Mistakes

Monitoring transforms production from a black box into an observable system. When a seller reports “my listing isn’t showing up,” Application Insights’ distributed tracing lets you find every request that touched that listing (the create command, the publish event, the search queries) and identify where the failure occurred, with timestamps, error messages, and SQL queries. This works only if the listing ID is recorded with the telemetry, for example as a custom dimension. Custom metrics (listing_published, contact_request_sent) let you track business health in addition to technical health.

Monitoring and Alerting

// ── Application Insights custom metrics ───────────────────────────────────
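// Install: dotnet add package Microsoft.ApplicationInsights.AspNetCore
//          dotnet add package Microsoft.ApplicationInsights.WorkerService   (only needed for non-HTTP worker services)

// In Program.cs:
// builder.Services.AddApplicationInsightsTelemetry();

// Usings for the handlers below (INotificationHandler<T> is assumed to come from MediatR):
using MediatR;
using Microsoft.ApplicationInsights;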

// In domain event handlers — track business events:
public class TrackListingPublishedHandler
    : INotificationHandler<ListingPublishedEvent>
{
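    private readonly TelemetryClient _telemetry;

    // TelemetryClient is registered by AddApplicationInsightsTelemetry() and injected here.
    public TrackListingPublishedHandler(TelemetryClient telemetry)
        => _telemetry = telemetry;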

    public Task Handle(ListingPublishedEvent evt, CancellationToken ct)
    {
        _telemetry.TrackEvent("ListingPublished", new Dictionary<string, string>
        {
            ["listingId"] = evt.ListingId.ToString(),
            ["category"]  = evt.Category.ToString(),
            ["city"]      = evt.City,
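            // Invariant culture keeps the decimal separator stable across server locales.
            ["price"]     = evt.Price.Amount.ToString("F2", System.Globalization.CultureInfo.InvariantCulture),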
        });

        _telemetry.GetMetric("listings.published.count").TrackValue(1);

        return Task.CompletedTask;
    }
}

// In the search query handler — track search patterns:
public class TrackSearchQueryHandler
    : INotificationHandler<SearchExecutedEvent>
{
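    private readonly TelemetryClient _telemetry;

    // TelemetryClient is registered by AddApplicationInsightsTelemetry() and injected here.
    public TrackSearchQueryHandler(TelemetryClient telemetry)
        => _telemetry = telemetry;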

    public Task Handle(SearchExecutedEvent evt, CancellationToken ct)
    {
        _telemetry.TrackEvent("ListingSearched", new Dictionary<string, string>
        {
            ["keyword"]      = evt.Keyword ?? "(none)",
            ["category"]     = evt.Category?.ToString() ?? "(all)",
            ["city"]         = evt.City ?? "(all)",
            ["resultCount"]  = evt.ResultCount.ToString(),
            ["hasResults"]   = (evt.ResultCount > 0).ToString(),
        });

        return Task.CompletedTask;
    }
}

// ── Azure Monitor alert rules (via Azure Portal or Bicep) ─────────────────
// Alert 1: High API error rate
// Condition: requests/failed > 5% of total requests for 5 minutes
// Action: Send email + PagerDuty webhook
// Severity: 2 (Warning)

// Alert 2: Database DTU saturation
// Condition: dtu_consumption_percent > 80% for 10 minutes
// Action: Trigger scale-up to GP_Gen5_4 (via Azure Automation runbook)
// Severity: 3 (Informational)

// Alert 3: SignalR disconnection spike
// Condition: signalr/connection_count drops > 20% in 2 minutes
// Action: Investigation email
// Severity: 3

// ── Structured log queries in Azure Monitor Logs (KQL) ────────────────────
// Find all operations for a specific listing:
// traces
// | where customDimensions.listingId == "abc-123-guid"
// | order by timestamp desc
// | project timestamp, message, severityLevel, customDimensions

// Find all failed contact requests in the last hour:
// customEvents
// | where name == "ContactRequestSent"
// | where timestamp > ago(1h)
// | where customDimensions.failed == "true"
// | summarize count() by bin(timestamp, 5m)

// Find slow search queries (> 500ms):
// dependencies
// | where type == "SQL"
// | where duration > 500
// | where data contains "Listings"
// | order by duration desc
// | project timestamp, duration, data

Note: Custom events such as TrackEvent("ListingPublished") give you business-level monitoring alongside technical monitoring. The technical dashboard shows requests per second, error rates, and response times. The business dashboard shows listings published per day, contact requests per hour, and premium upgrades per week. When a business metric drops unexpectedly (for example, listings published falls by 50%), it may indicate a bug that the technical metrics don’t capture, such as an API that returns 200 while the domain event is never fired. Business metrics are the ground truth.
Tip: Set up Azure Monitor availability tests (synthetic monitoring) to ping /api/health from multiple global regions every 5 minutes. If the health check fails from any region, an alert fires before any user reports an issue. Configure the test to fire an alert if the endpoint takes more than 3 seconds to respond — latency spikes often precede full outages. The global multi-region testing catches regional Azure issues that only affect some users.
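
The availability test needs an endpoint to call. A minimal sketch of exposing /api/health with ASP.NET Core's built-in health checks might look like the following; it assumes the .NET 6+ minimal hosting model and registers no dependency checks, so it only proves the process can answer HTTP requests.

```csharp
// Program.cs - minimal sketch, assuming ASP.NET Core on .NET 6 or later.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Registers the health check service. With no checks added, the endpoint
// reports Healthy whenever the application is able to respond.
builder.Services.AddHealthChecks();

var app = builder.Build();

// The path matches the URL the availability test calls.
app.MapHealthChecks("/api/health");

app.Run();
```

For the availability test to catch database or cache outages as well, dependency checks would have to be added to the AddHealthChecks() registration; which checks to add depends on the application.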
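Warning: Logging too much data in Application Insights can be expensive because it charges per GB ingested. Use sampling to reduce costs while keeping the data statistically representative. The ASP.NET Core SDK enables adaptive sampling by default (EnableAdaptiveSampling = true in ApplicationInsightsServiceOptions); it adjusts the sampling rate automatically to stay under a target number of telemetry items per second, whereas fixed-rate sampling keeps a set percentage (10-25% of requests is a common production range). TrackEvent and TrackException calls are sampled like any other telemetry unless the Event and Exception types are excluded from sampling, so exclude them if critical errors and domain events must always be recorded. Metrics recorded with GetMetric are pre-aggregated and are not affected by sampling. Regular request telemetry can safely be sampled.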
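
One way to keep every custom event and exception while still sampling request telemetry is to exclude those types from adaptive sampling. The sketch below follows the pattern in the Application Insights sampling documentation for ASP.NET Core; the value of 5 items per second is only an illustrative choice, not a recommendation for this application.

```csharp
// Program.cs - sketch of adaptive sampling with excluded telemetry types.
using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Turn off the default adaptive sampling so it can be re-added with exclusions.
builder.Services.AddApplicationInsightsTelemetry(new ApplicationInsightsServiceOptions
{
    EnableAdaptiveSampling = false
});

builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;

    // Sample requests, dependencies and traces, but never drop
    // custom events (TrackEvent) or exceptions (TrackException).
    chain.UseAdaptiveSampling(
        maxTelemetryItemsPerSecond: 5,          // illustrative target rate
        excludedTypes: "Event;Exception");

    chain.Build();
});

var app = builder.Build();
app.Run();
```

Excluding types raises ingestion volume, so the exclusion list is best kept to telemetry that alerts or business dashboards depend on.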

Common Mistakes

Mistake 1 — Only technical monitoring, no business metrics (invisible business failures)

❌ Wrong — monitoring only HTTP error rates; publishing a listing returns 200 but the event isn’t dispatched; sellers never notified; no alert.

✅ Correct — TrackEvent for all business operations; alert when listing_published drops significantly from baseline.

Mistake 2 — Alerting on every error (alert fatigue)

❌ Wrong — alert on any single 500 error; 3 alerts per hour for transient timeouts; team ignores alerts after first week.

✅ Correct — alert on error rate threshold (>5% for 5+ minutes); transient errors don’t page; sustained issues do.
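
The difference between the two rules can be sketched in a few lines. In practice Azure Monitor evaluates the alert condition; the MinuteBucket and ErrorRateRule types below are hypothetical and exist only to illustrate why a sustained-rate rule ignores a single transient spike.

```csharp
using System.Collections.Generic;
using System.Linq;

// One minute of request counts (hypothetical shape, not an Azure Monitor type).
public readonly record struct MinuteBucket(int Total, int Failed);

public static class ErrorRateRule
{
    // Returns true only when each of the most recent `windowMinutes` buckets
    // has a failure rate above `threshold`. Buckets with no traffic count as
    // not exceeding the threshold, so an idle service never alerts.
    public static bool ShouldAlert(
        IReadOnlyList<MinuteBucket> buckets,
        double threshold = 0.05,
        int windowMinutes = 5)
    {
        if (buckets.Count < windowMinutes)
        {
            return false; // not enough history to call the problem sustained
        }

        return buckets
            .Skip(buckets.Count - windowMinutes)
            .All(b => b.Total > 0 && (double)b.Failed / b.Total > threshold);
    }
}
```

With these defaults, five buckets in which only one minute has a 50% failure rate do not trigger the rule, while five consecutive minutes at 10% do.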

🧠 Test Yourself

Application Insights shows 5xx errors spiking at 14:32. A KQL query shows that all of the failed requests trace back to a SqlException: Timeout expired thrown by a search query. The database DTU alert didn’t fire. What is the most likely cause?