Journal

Writing down the things I learned. To share them with others and my future self.

05 Feb 2023

Monitoring nginx with mtail

I use Prometheus to monitor my infrastructure, and until now the nginx serving this blog wasn’t monitored properly. Prometheus is a pull-based monitoring system, scraping the monitored components via HTTP requests. Sadly, nginx doesn’t provide a native /metrics endpoint serving metrics in the OpenMetrics format. Hence, we need a Prometheus exporter. There is an nginx exporter provided by NGINX Inc., nginx-prometheus-exporter. This exporter only provides basic information like the active and total served connections.
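
Setting that exporter up roughly looks like this: expose nginx’s stub_status page locally and point nginx-prometheus-exporter at it. The port and paths below are placeholders, not my actual configuration; by default the exporter serves its metrics on :9113/metrics.

# nginx: expose stub_status on localhost only (assumed port)
server {
    listen 127.0.0.1:8080;
    location /stub_status {
        stub_status;
    }
}

# run the exporter and point it at stub_status
nginx-prometheus-exporter -nginx.scrape-uri=http://127.0.0.1:8080/stub_status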

To get more insights, you need to run NGINX Plus. Another option is to analyze nginx’s logs and generate Prometheus metrics from them. I can recommend mtail for this task. It’s reliable and performant.
Matt Probert has written a good introduction as a GitHub Gist. Based on this example, I built my mtail configuration. The main difference is that I use Prometheus histograms to analyze the distribution of the request size and timing. Moreover, I use fewer labels to reduce the size of the Prometheus datastore. Prometheus generates a new time series in the database for every combination of label values. In particular, the nginx_uri label used in the GitHub Gist will explode the number of time series if the URI contains an object ID, e.g. in a REST API. The Prometheus documentation advises against using user-provided IDs in labels. Also, nginx_port and nginx_content_type are irrelevant for my monitoring.
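
For context, running mtail and scraping it only takes a one-liner plus a scrape job. This is just a sketch with assumed paths; mtail exposes the generated metrics on :3903/metrics by default.

# run mtail with the programs in /etc/mtail against the nginx access log (paths are assumptions)
mtail --progs /etc/mtail --logs /var/log/nginx/access.log

# prometheus scrape config for the mtail endpoint
scrape_configs:
  - job_name: 'mtail'
    static_configs:
      - targets: ['localhost:3903']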

This is my current nginx.mtail program, based on the example provided by Matt Probert:

# we stick with nginx var names here and prefix the exported metric with nginx_ by using "as"
# "by" allows us to specify our labelling/grouping
counter request_count             by nginx_host, nginx_method, nginx_status as "nginx_request_count"
histogram request_length_bytes    by nginx_host, nginx_method, nginx_status buckets 1024, 2048, 4096, 10000, 30000, 60000, 120000, 240000 as "nginx_request_length_bytes"
histogram bytes_sent              by nginx_host, nginx_method, nginx_status buckets 1024, 2048, 4096, 10000, 30000, 60000, 120000, 240000 as "nginx_bytes_sent"
counter body_bytes_sent           by nginx_host, nginx_method, nginx_status as "nginx_body_bytes_sent"
histogram request_time_hist       by nginx_host, nginx_method, nginx_status buckets 1, 10, 50, 100, 200, 400, 800, 1000, 2000 as "nginx_request_time_milliseconds"
histogram upstream_connect_time   by nginx_host, nginx_method, nginx_status buckets 1, 10, 50, 100, 200, 400, 800, 1000, 2000 as "nginx_upstream_connect_time_milliseconds"
histogram upstream_header_time    by nginx_host, nginx_method, nginx_status buckets 1, 10, 50, 100, 200, 400, 800, 1000, 2000 as "nginx_upstream_header_time_milliseconds"
histogram upstream_response_time  by nginx_host, nginx_method, nginx_status buckets 1, 10, 50, 100, 200, 400, 800, 1000, 2000 as "nginx_upstream_response_time_milliseconds"
counter nginx_log_nomatch_count

# the following pattern matches exactly the tab-separated fields in our custom access_log format
# if you want to customize...this pattern should be updated to match changes you make in the nginx log format.
/^/ +
/(?P<msec>\d+)\.\d+\t/ +       # settime() needs just the seconds so we exclude the .milliseconds part
/(?P<host>\S+)\t/ +
/(?P<server_port>\S+)\t/ +
/(?P<request_method>\S+)\t/ +
/(?P<uri>\S+)\t/ +
/(?P<content_type>\S+)\t/ +
/(?P<status>\S+)\t/ +
/(?P<request_length>\d+)\t/ +
/(?P<bytes_sent>\d+)\t/ +
/(?P<body_bytes_sent>\d+)\t/ +
/(?P<request_time>\d+\.\d+)\t/ +
/(?P<upstream_connect_time>\S+)\t/ +
/(?P<upstream_header_time>\S+)\t/ +
/(?P<upstream_response_time>\S+)/ +
/$/ {
    settime($msec)
    request_count[$host][$request_method][$status]++
    request_length_bytes[$host][$request_method][$status] = $request_length
    bytes_sent[$host][$request_method][$status] = $bytes_sent
    body_bytes_sent[$host][$request_method][$status] += $body_bytes_sent

    # Especially, for static files, the request time is often below a millisecond, hence 0. We set them to 1, to
    # count them in the histogram
    $request_time*1000.0 < 1 {
      request_time_hist[$host][$request_method][$status] = 1.0
    } else {
      request_time_hist[$host][$request_method][$status] = $request_time*1000.0
    }
    # mtail is happier doing counters with floats/ints.
    # nginx logs '-' when there isn't a value which
    # we check for - and skip updating these counters if found.
    # otherwise we cast the string to a float and increment the counter.
    $upstream_connect_time != "-" {
        upstream_connect_time[$host][$request_method][$status] = float($upstream_connect_time)*1000
    }
    $upstream_header_time != "-" {
        upstream_header_time[$host][$request_method][$status] = float($upstream_header_time)*1000
    }
    $upstream_response_time != "-" {
        upstream_response_time[$host][$request_method][$status] = float($upstream_response_time)*1000
    }
} else {
    # our pattern doesn't match.
    # in this example since we have a single program with a single log file...it should always match
    # or we have a bug.
    # can use this metric to detect when our parser is failing.
    nginx_log_nomatch_count++
}

I use the same nginx access log format as Matt Probert has shown in his Gist:

log_format prometheus_log '$msec\t$host\t$server_port\t$request_method\t$uri\t$content_type\t$status\t$request_length\t$bytes_sent\t$body_bytes_sent\t$request_time\t$upstream_connect_time\t$upstream_header_time\t$upstream_response_time';
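
To actually write logs in this format, the access_log directive just has to reference it; the path here is a placeholder:

access_log /var/log/nginx/access.log prometheus_log;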

The dashboard I currently use looks like this:

Screenshot of Grafana dashboard for nginx
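
The panels are built from PromQL queries along these lines, a sketch that assumes mtail exposes the histograms as standard Prometheus histograms with _bucket/_count/_sum series:

# requests per second, per vhost and status
sum by (nginx_host, nginx_status) (rate(nginx_request_count[5m]))

# 95th percentile request time in milliseconds
histogram_quantile(0.95,
  sum by (le, nginx_host) (rate(nginx_request_time_milliseconds_bucket[5m])))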

The Grafana export can be found here