The following blog was written by our Timeplus Community member, Saïd Abiola, who originally published it to the Timeplus Proton /examples folder on GitHub.
In my last blog post, I showed how Timeplus Proton can be used to analyze web traffic in production (taking my personal blog as an example), using just the log files generated by an Nginx web server.
With SQL-based observability gaining popularity as an alternative to the ELK stack, a major tool in this trend is ClickHouse, known for its fast ingestion and query speeds for log handling. One example is Uber's log analytics platform: when it was based on ELK, ingestion ran at ~25.5K docs per second; now, a single ClickHouse node can handle 300K logs per second. Timeplus Proton extends ClickHouse's excellent log handling and historical data processing capabilities to streaming data.
Although I explored both real-time and historical analysis, the results weren't visualized. This article remedies that by showing how to build rich visualizations with open-source tools like Grafana.
Prerequisites
In addition to Timeplus Proton, the previous article relied on several other technologies: two AWS products, which require access to an AWS account, and two services that need to have been running in production in order to have meaningful access log data to experiment with.
This article builds on the same analysis as the previous article, but is quicker to get started with: you only need Docker and a free IPinfo account (see Environment Setup below).
Timeplus Proton has a handy feature, RANDOM STREAM, which we will use to generate all the access log data needed for experimentation in this article.
Before I dive into the article proper, I will first revisit the format used by Nginx to record access log data so that the data we'll generate will mimic its shape and properties.
The Shape of Nginx's Access Log Data
In the previous article, we briefly touched on the Nginx access logs but did not dwell on where they are stored on disk or how the data they contain should be parsed.
Nginx Access Log Data
The access logs for Nginx are typically written to the file /var/log/nginx/access.log.
Here's an excerpt from my blog's access logs:
161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"
51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
179.43.177.xxx - - [26/Jun/2023:06:45:28 +0000] "GET / HTTP/1.1" 301 60 "-" "Hello World"
87.236.176.xxx - - [26/Jun/2023:07:58:36 +0000] "GET / HTTP/1.1" 200 6364 "https://www.ayewo.com/" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
87.236.176.xxx - - [26/Jun/2023:07:58:38 +0000] "GET /favicon.png HTTP/2.0" 200 7636 "-" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
93.126.72.xxx - - [26/Jun/2023:08:29:38 +0000] "GET /how-to-install-rsync-on-windows/ HTTP/2.0" 200 6394 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
The first line in the access log is from the IP address 161.35.230.x.
We'll look at the Nginx access log format to understand what each field on that line means.
Nginx Access Log Format
Nginx's access log format is specified in /etc/nginx/nginx.conf and it broadly looks like this:
http {
    log_format compression '$remote_addr - $remote_user [$time_local] '
                           '"$request" $status $body_bytes_sent '
                           '"$http_referer" "$http_user_agent"';

    server {
        access_log /var/log/nginx/access.log;
        ...
    }
}
Here's the first line we saw earlier, but this time annotated to show individual fields from the log format.
We can see that the log format uses a single space as the separator between fields.
To parse that line reliably, we can use this Python regex:
import re

# Nginx escapes unprintable bytes as literal \xNN text in its logs, so a
# raw string reproduces the line exactly as it appears on disk.
line = r'161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"'
malicious_pattern = re.compile(r'(?P<remote_addr>\S+) (?P<rfc1413_ident>\S+) (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"')
match = malicious_pattern.match(line)
The results of parsing the first line are shown in the table below:
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 161.35.230.x |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:33:53 +0000 |
| 5 | $request | \x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00 |
| 6 | $status | 400 |
| 7 | $body_bytes_sent | 182 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | - |
The line parses into a total of 9 fields, which will be represented as Timeplus columns later.
It is clear from the contents of the 5th field, i.e. $request, that this was a maliciously crafted request: it didn't specify a valid HTTP method (e.g. GET, POST or OPTIONS), which is why the server responded with an HTTP $status code of 400 (Bad Request).
Let's use the Python regex to parse the second line:
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $request | POST / HTTP/1.1 |
| 6 | $status | 301 |
| 7 | $body_bytes_sent | 57 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |
Unlike the first line, the $request field parsed from the second line contains a request string that is valid under the HTTP protocol.
Here's the second line again annotated to show the $request field and its 3 parts (in orange):
The $request field also uses a single space as its delimiter and can be split into three parts:
POST: indicates the HTTP verb or method
/: indicates the request URI (Uniform Resource Identifier) or path
HTTP/1.1: indicates the HTTP protocol version in use for the request
One of the analyses we will run later on Timeplus Proton is a fast aggregation query showing the top requested pages, as indicated by the $path field in the access logs. We could store each $request field in a single Timeplus column named request, but it is far more convenient to split the $request field into 3 separate columns named http_method, path and http_version.
If we update our Python regex as follows:
import re

# Same raw-string trick as before to reproduce the log line verbatim.
line = r'51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"'
# The single "request" group is replaced by http_method, path and http_version.
pattern = re.compile(r'(?P<remote_addr>\S+) (?P<rfc1413_ident>\S+) (?P<remote_user>\S+) \[(?P<time_local>\S+ \+[0-9]{4})\] "(?P<http_method>\S+) (?P<path>\S+) (?P<http_version>\S+)" (?P<status>[0-9]+) (?P<body_bytes_sent>\S+) "(?P<http_referer>.*)" "(?P<http_user_agent>.*)"\s*\Z')
match = pattern.match(line)
We can use the updated Python regex to parse the 2nd line into 11 fields (the 1st line was parsed into 9 fields):
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $http_method | POST |
| 6 | $path | / |
| 7 | $http_version | HTTP/1.1 |
| 8 | $status | 301 |
| 9 | $body_bytes_sent | 57 |
| 10 | $http_referer | - |
| 11 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |
There are several malformed requests in the access logs similar to the one we saw on the 1st line. Many of them will not parse reliably with our updated Python regex, so we will add an additional database column called malicious_request to store such malformed $requests in full rather than attempting to split them.
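To see why storing them whole is handy, here's a quick sketch: once captured, the malformed requests can be counted (or inspected) with a simple predicate, assuming the stream and column names introduced below:

-- A sketch: count the malformed requests captured verbatim in the
-- malicious_request column of the historical stream.
SELECT count() AS malicious_count
FROM table(nginx_historical_access_log)
WHERE NOT empty(malicious_request);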
This extra column is why we had a total of 12 Timeplus columns in the nginx_historical_access_log stream created in the previous article.
Below is a summary of how individual fields from an Nginx access log map to Timeplus columns, followed by a sketch of the corresponding DDL:
| # | Field | Column | Data Type | Remarks |
|---|-------|--------|-----------|---------|
| 1 | $remote_addr | remote_ip | ipv4 | - |
| 2 | - | rfc1413_ident | string | - |
| 3 | $remote_user | remote_user | string | - |
| 4 | $time_local | date_time | datetime64 | - |
| 5 | $request | http_method | string | 1st part of splitting a $request |
| 6 | $request | path | string | 2nd part of splitting a $request |
| 7 | $request | http_version | string | 3rd part of splitting a $request |
| 8 | $request | malicious_request | string | Stores any $request that fails to parse correctly |
| 9 | $status | status | int | - |
| 10 | $body_bytes_sent | size | int | - |
| 11 | $http_referer | referer | string | - |
| 12 | $http_user_agent | user_agent | string | - |
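For reference, here is a rough sketch of what that CREATE STREAM DDL could look like, assuming the column names and types from the table above; the exact DDL lives in the previous article:

-- A sketch only, assuming the column mapping above; the exact DDL is in
-- the previous article. Precision 3 for datetime64 is an assumption based
-- on the millisecond timestamps in the query outputs later on.
CREATE STREAM nginx_historical_access_log (
    remote_ip ipv4,
    rfc1413_ident string,
    remote_user string,
    date_time datetime64(3),
    http_method string,
    path string,
    http_version string,
    malicious_request string,
    status int,
    size int,
    referer string,
    user_agent string
);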
With that out of the way, let's dive into this article proper!
Docker Environment
The docker-compose.yaml file defines 3 containers:
a Timeplus Proton container;
a Grafana container;
a third container that seeds the Timeplus Proton container with access log data.
DDL Overview
The 3rd container uses the proton client to seed the Timeplus Proton database with access log data from an SQL file. That SQL file contains the DDL we'll walk through next.
Generating Data to Mimic Nginx Access Log Data
The CREATE RANDOM STREAM DDL creates a random stream named nginx_access_log, with each of its columns assigned a randomly generated default value.
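Here is a sketch of what that DDL looks like, reconstructed from the column summary below. The array literals are abbreviated (the actual file samples 11 paths, 4 referers and 14 user agents, and a longer malicious byte sequence), so treat this as illustrative rather than the exact statement in the repo:

-- A sketch reconstructed from the column summary below; the abbreviated
-- arrays and the malicious byte sequence are placeholders, and the exact
-- values live in the SQL file in the repo.
CREATE RANDOM STREAM nginx_access_log (
    remote_ip ipv4 default random_in_type('ipv4'),
    rfc1413_ident string default '-',
    remote_user string default '-',
    date_time datetime64(3) default random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day),
    http_verb string default ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand() % 5],
    path string default ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt'][rand() % 5],
    http_ver string default ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand() % 3],
    status int default [200, 301, 302, 304, 400, 404][rand() % 6],
    size int default rand() % 5000000,
    referer string default ['-', 'https://ayewo.com/', 'https://google.com/'][rand() % 3],
    user_agent string default ['Plenary/4.6.2', 'NetNewsWire (RSS Reader; https://netnewswire.com/)'][rand() % 2],
    malicious_request string default if(rand() % 100 < 5, '\x16...', '')
);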
The table below goes over each of the 12 Timeplus columns and the database functions used to generate test data.
| # | Column | Generated Data |
|---|--------|----------------|
| 1 | remote_ip | random_in_type('ipv4'): returns a random ipv4 representing a user's IP address. |
| 2 | rfc1413_ident | Defaults to '-' but should be a string conforming to RFC 1413. Currently unused. |
| 3 | remote_user | Defaults to '-' since most blog traffic is from unauthenticated users. Currently unused. |
| 4 | date_time | random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day): returns a random datetime64 within a 1-year window starting 2023-06-17, i.e. [2023-06-17, 2024-06-17). |
| 5 | http_verb | ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand()%5]: uses rand() to pick a random index in [0, 5) from this 5-element array, which samples 5 of the 39 registered HTTP verbs. |
| 6 | path | ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt', ...][rand()%11]: uses rand() to pick a random index in [0, 11) from this 11-element array of sample URL subpaths. |
| 7 | http_ver | ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand()%3]: similar to path. |
| 8 | status | [200, 301, 302, 304, 400, 404][rand()%6]: similar to path. |
| 9 | size | rand()%5000000: uses rand() to generate a random number of bytes sent, up to ~5 MB. |
| 10 | referer | ['-', 'https://ayewo.com/', '...', 'https://google.com/'][rand()%4]: similar to path. |
| 11 | user_agent | ['...', '...', ...][rand()%14]: similar to path. |
| 12 | malicious_request | if(rand()%100 < 5, '\x16\...', ''): returns a malformed byte sequence for roughly 5% of rows. |
Random Streams
There is an important note about the design of random streams in Timeplus Proton mentioned in the RANDOM STREAM documentation that is worth highlighting here:
Note: When you run a Timeplus SQL query with a random stream, the data will be generated and analyzed by the query engine. Depending on the query, all generated data or aggregated states can be kept in memory during the query time. If you are not querying the random stream, no data is generated or kept in memory.
Essentially, each time you run a query like SELECT remote_ip, rfc1413_ident, remote_user, date_time, http_verb, path, http_ver, status, size, referer, user_agent, malicious_request FROM nginx_access_log LIMIT 1;, you'll get a different (random) result from the stream.
For instance, running that query yields the following output:
SELECT
remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
path, http_ver, status, size, referer, user_agent,
malicious_request
FROM
nginx_access_log
LIMIT 1
Query id: d4b7f92c-bcd8-42ec-94ef-655eae86861b
┌─remote_ip─────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 181.12.95.134 │ - │ - │ 2024-05-02 00:00:00.000 │ │
└───────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer─┬─user_agent────┬─malicious_request─┐
│ /sitemap.xml │ │ 302 │ 21345 │ - │ Plenary/4.6.2 │ │
└──────────────┴──────────┴────────┴───────┴─────────┴───────────────┴───────────────────┘
1 row in set. Elapsed: 0.012 sec.
Notice that http_verb and http_ver came back empty in that first run: array indexing in Proton (as in ClickHouse) is 1-based, so an expression like ['GET', ...][rand()%5] occasionally evaluates to index 0, which returns the element type's default value, i.e. an empty string. Running the query again yields a different output:
SELECT
remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
path, http_ver, status, size, referer, user_agent,
malicious_request
FROM
nginx_access_log
LIMIT 1
Query id: 11d071e4-6028-47da-826d-9db42176f8e5
┌─remote_ip──────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 204.25.223.210 │ - │ - │ 2024-05-15 00:00:00.000 │ POST │
└────────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer────────────┐
│ /favicon.ico │ HTTP/1.1 │ 301 │ 32062 │ https://ayewo.com/ │
└──────────────┴──────────┴────────┴───────┴────────────────────┘
┌─user_agent─────────────────────────────────────────┬─malicious_request─┐
│ NetNewsWire (RSS Reader; https://netnewswire.com/) │ │
└────────────────────────────────────────────────────┴───────────────────┘
1 row in set. Elapsed: 0.013 sec.
Due to this design decision, we will create a snapshot from the random stream to serve as our test data. This snapshot will represent our access log data so we can reliably geolocate the latitude and longitude of each (randomly generated) IP address before attempting to visualize them in Grafana.
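Conceptually, the snapshot is just a bounded query over the random stream. Below is a minimal sketch of the idea, assuming the column names used so far; the actual seeding script instead routes the rows through a CSV file so the IP addresses can be geolocated in bulk first:

-- A sketch only: materialize 1,000 randomly generated rows as fixed test
-- data. The real seeding flow exports to CSV, geolocates the IPs via
-- IPinfo, then imports the enriched results.
INSERT INTO nginx_historical_access_log
    (remote_ip, rfc1413_ident, remote_user, date_time, http_method,
     path, http_version, status, size, referer, user_agent, malicious_request)
SELECT
    remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
    path, http_ver, status, size, referer, user_agent, malicious_request
FROM nginx_access_log
LIMIT 1000;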
Environment Setup
1. Clone the Timeplus Proton repo locally:
git clone https://github.com/timeplus-io/proton.git
2. Change to the examples/nginx-grafana subfolder:
cd proton/examples/nginx-grafana
3. Sign up for a free IPinfo account and create an Access Token. Your Access Token is a 14-character alphanumeric string that will be read from the IPINFO_KEY environment variable to authenticate the bulk lookup of IP addresses to their respective geolocations.
export IPINFO_KEY=xxxxxxxxxxxxxx
4. Now start up all 3 containers using the docker-compose.yaml file inside the proton/examples/nginx-grafana subfolder:
docker compose up
Your console should show output similar to the one below:
Attaching to nginx-grafana-grafana-1, nginx-grafana-nginx-access-log-data-generator-1, nginx-grafana-proton-1
nginx-grafana-proton-1 | Processing configuration file '/etc/proton-server/config.yaml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/data-paths.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/docker_related_config.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/logger.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/openssl.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/user-directories.xml'.
nginx-grafana-proton-1 | Logging information to /var/log/proton-server/proton-server.log
nginx-grafana-proton-1 | Logging errors to /var/log/proton-server/proton-server.err.log
nginx-grafana-nginx-access-log-data-generator-1 | Using the following config for IPinfo:
nginx-grafana-nginx-access-log-data-generator-1 | {"cache_enabled":true,"token":"xxxxxxxxxxxxxx","open_browser":true}
nginx-grafana-nginx-access-log-data-generator-1 | Extracting only IP addresses from the 1st column in /tmp/nginx_export.csv to /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1 | Removing duplicate IP addresses from /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1 | Combined all IP addresses into a single file: /tmp/nginx_export.ipinfo.
nginx-grafana-nginx-access-log-data-generator-1 | removed '/tmp/nginx_export.ipinfo.tmp'
nginx-grafana-nginx-access-log-data-generator-1 | Extracting the rest of the data from /tmp/nginx_export.csv to /tmp/nginx_import.csv ...
nginx-grafana-nginx-access-log-data-generator-1 | renamed '/tmp/nginx_export.csv' -> '/tmp/nginx_import.csv'
nginx-grafana-nginx-access-log-data-generator-1 | Geo-locating all the IP addresses in bulk using the IPInfo API (https://ipinfo.io):
nginx-grafana-nginx-access-log-data-generator-1 | Total IP addresses that will be looked up in bulk using /tmp/nginx_export.ipinfo: 1000.
nginx-grafana-nginx-access-log-data-generator-1 | Geo-lookup of 1000 IP addresses written to file: /tmp/nginx_import.ipinfo.csv.
nginx-grafana-nginx-access-log-data-generator-1 | Complete!
nginx-grafana-nginx-access-log-data-generator-1 exited with code 0
...
5. Open http://localhost:3000 in your web browser:
6. On the bottom left of the screen, click on "Nginx Access Logs Analysis" to open the dashboard that has been automatically set up for you:
7. Congrats! You've successfully set up the "Nginx Access Logs Analysis" dashboard in Grafana:
Quick Rundown of the Grafana Dashboard
There are a total of 5 panels in the Grafana dashboard.
Panel 1: Geodistribution of Web Requests
This panel uses a geomap to plot the geographic origin of web requests.
This is the query backing this panel:
SELECT
    date_time_string,
    ipv4_num_to_string_class_c(nal.remote_ip) AS remote_ip,
    nip.country_name AS country,
    nip.country_flag_emoji AS emoji,
    nip.loc AS loc,
    geohash_encode(to_float64(split_by_char(',', loc)[2]), to_float64(split_by_char(',', loc)[1])) AS geohash
FROM table(nginx_historical_access_log) AS nal
JOIN table(nginx_ipinfo) AS nip ON nal.remote_ip = nip.ip
WHERE NOT empty(loc)
GROUP BY remote_ip, date_time_string, country, emoji, loc
HAVING count() > 0
ORDER BY date_time_string DESC;

Note the argument order in geohash_encode(): IPinfo's loc field is a "latitude,longitude" string, while geohash_encode() expects longitude first, which is why element [2] is passed before element [1].
Panel 2: Geolocation
This panel uses a table to display the data shown in the geomap in tabular form.
The query backing this panel is identical to the query for the geomap.
Panel 3: Top Traffic Sources
This panel uses a bar chart to display the countries that send the most traffic.
This is the query backing this panel:
SELECT
    count() AS page_views,
    nip.country_name AS country,
    nip.country_flag_emoji AS emoji
FROM table(nginx_historical_access_log) AS nal
JOIN table(nginx_ipinfo) AS nip ON nal.remote_ip = nip.ip
GROUP BY country, emoji
HAVING count() > 0
ORDER BY page_views DESC;
Panel 4: Page Traffic
This panel also uses a table, this time to display the web pages that receive the most traffic from visitors.
This is the query backing this panel:
SELECT path, count() as request_count
FROM table(nginx_historical_access_log)
WHERE (path NOT LIKE '%.js%')
AND (path NOT LIKE '%.css%')
AND (path NOT LIKE '%.png%')
AND (path NOT LIKE '%.gif%')
AND (path NOT LIKE '%.jpg%')
GROUP BY path
ORDER BY count() DESC;
Panel 5: HTTP Status Codes
This panel also uses a bar chart, this time to visualize an aggregation of the HTTP status codes.
This is the query backing this panel:
SELECT
multi_if(status >= 400, '40x - Error', status > 300, '30x - Notice', '200 - OK') AS status_code,
count() as frequency
FROM table(nginx_historical_access_log)
GROUP BY status_code;

Note that multi_if() evaluates its conditions in order, so statuses of 400 and above are labeled '40x - Error' before the status > 300 condition gets a chance to match them.
Summary
This post showed how to use the RANDOM STREAM feature of Timeplus Proton to generate Nginx access log data for some quick experimentation.
It also highlights the use of Timeplus Proton as a data source for visualizing queries in Grafana, made possible by Timeplus's open-source Grafana plugin.
My next blog post will explore the same analysis covered here using a different visualization tool: Metabase.
About the Author
Saïd Abiola is a Senior Software Engineer drawn to systems, networks and concurrent programming. See more of his writing on his blog: Misleading Metaphors.