
Visualizing Nginx Access Logs using Timeplus and Grafana

The following blog was written by our Timeplus Community member, Saïd Abiola, who originally published it to the Timeplus Proton /examples folder on GitHub.


 

In my last blog post, I showed how Timeplus Proton can be used to analyze web traffic in production (taking my personal blog as an example), using just the log files generated by an Nginx web server.


SQL-based observability is gaining popularity as an alternative to the ELK stack, and a major tool in this trend is ClickHouse, known for its fast ingestion and query speeds when handling logs. One example is Uber's log analytics platform: when it was based on ELK, ingestion ran at ~25.5K docs per second; now, a single ClickHouse node can handle 300K logs per second. Timeplus Proton extends ClickHouse's excellent log-handling and historical data processing capabilities with streaming data.


Although that post explored both real-time and historical analysis, the results weren't visualized. This article remedies that by showing how to build rich visualizations with open-source tools like Grafana.



 

Prerequisites


In addition to Timeplus Proton, the previous article used the following technologies:


The first two products require access to an AWS account, and the last two need to have been running in production, in order to have meaningful access log data to experiment with.


This article builds on the same analysis as the previous article but is much quicker to get started with, as you only need:

  • a local clone of the Timeplus Proton repo (for its examples/nginx-grafana folder);

  • Docker, with Docker Compose, to run the containers described later in this article;

  • a free IPinfo account, to geolocate the randomly generated IP addresses.


Timeplus Proton has a handy feature, RANDOM STREAM, which we will use to generate all the access log data needed for experimentation in this article.


Before I dive into the article proper, I will first revisit the format used by Nginx to record access log data so that the data we'll generate will mimic its shape and properties.


 

The Shape of Nginx's Access Log Data


In the previous article, we briefly touched on the Nginx access logs but did not dwell on where they are stored on disk or how the data they contain should be parsed.



Nginx Access Log Data


The access logs for Nginx are typically written to the file /var/log/nginx/access.log.

Here's an excerpt from my blog's access logs:

161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"
51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
179.43.177.xxx - - [26/Jun/2023:06:45:28 +0000] "GET / HTTP/1.1" 301 60 "-" "Hello World"
87.236.176.xxx - - [26/Jun/2023:07:58:36 +0000] "GET / HTTP/1.1" 200 6364 "https://www.ayewo.com/" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
87.236.176.xxx - - [26/Jun/2023:07:58:38 +0000] "GET /favicon.png HTTP/2.0" 200 7636 "-" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
93.126.72.xxx - - [26/Jun/2023:08:29:38 +0000] "GET /how-to-install-rsync-on-windows/ HTTP/2.0" 200 6394 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"

The first line in the access log is from the IP address 161.35.230.x.

We'll look at the Nginx access log format to understand what each field on that line means.



Nginx Access Log Format


Nginx's access log format is specified in /etc/nginx/nginx.conf and it broadly looks like this:

http {
	log_format compression '$remote_addr - $remote_user [$time_local] '
		'"$request" $status $body_bytes_sent '
		'"$http_referer" "$http_user_agent"';

	server {
		access_log /var/log/nginx/access.log compression;
		...
	}
}

Here's the first line we saw earlier, but this time annotated to show individual fields from the log format.


We can see that the log format uses a single space as the separator between fields.


To parse that line reliably, we can use this Python regex:

import re

# use a raw string so the escaped bytes (\x00, \x0E, ...) stay as literal text, exactly as Nginx writes them to the log file
line = r'161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"'

malicious_pattern = re.compile(r'(?P<remote_addr>\S+).(?P<rfc1413_ident>\S+).(?P<remote_user>\S+).\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"')
match = malicious_pattern.match(line)
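
To double-check the parse, you can inspect the match object's named groups with the standard groupdict() method; the keys line up with the fields in the table below:

print(match.groupdict())
# {'remote_addr': '161.35.230.x', 'rfc1413_ident': '-', 'remote_user': '-', ...}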

The results of parsing the first line are shown in the table below:

| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 161.35.230.x |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:33:53 +0000 |
| 5 | $request | \x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00 |
| 6 | $status | 400 |
| 7 | $body_bytes_sent | 182 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | - |


The line has a total of 9 fields, which will be represented as Timeplus columns later.


It is clear from looking at the contents of the 5th field, i.e. $request, that this was a maliciously crafted request. The $request didn't specify a valid HTTP method (e.g. GET, POST or OPTIONS), which is why the server responded with an HTTP $status code of 400 (Bad Request).


Let's use the Python regex to parse the second line:

| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $request | POST / HTTP/1.1 |
| 6 | $status | 301 |
| 7 | $body_bytes_sent | 57 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |


Unlike the first line, the $request field parsed from the second line contains a request string that is valid under the HTTP protocol.


Here's the second line again annotated to show the $request field and its 3 parts (in orange):


The $request field also uses a single space as a delimiter and can be split into three parts:

  • POST: indicates the HTTP verb or method

  • /: indicates the request URI (Uniform Resource Identifier) or path

  • HTTP/1.1: indicates the HTTP protocol version in use for the request


One of the analyses we will run later on Timeplus Proton is a fast aggregation query that shows the top requested pages, as indicated by the $path field in the access logs. We could store each $request in a single Timeplus column named request, but it is far more convenient to split the $request field into 3 separate columns named http_method, path and http_version.
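
To make the payoff concrete, here is roughly the kind of aggregation a dedicated path column enables; the full query, as used in the Grafana dashboard, appears later in this article:

SELECT path, count() AS request_count
FROM table(nginx_historical_access_log)
GROUP BY path
ORDER BY request_count DESC;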


Let's update our Python regex as follows:

import re

line = '51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"'

pattern = re.compile(r'(?P<remote_addr>\S+).(?P<rfc1413_ident>\S+).(?P<remote_user>\S+).\[(?P<time_local>\S+ \+[0-9]{4})]."(?P<http_method>\S+) (?P<path>\S+) (?P<http_version>\S+)" (?P<status>[0-9]+) (?P<body_bytes_sent>\S+) "(?P<http_referer>.*)" "(?P<http_user_agent>.*)"\s*\Z')
match = pattern.match(line)

We can use the updated Python regex to parse the 2nd line into 11 fields (the 1st line was parsed into 9 fields):

| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $http_method | POST |
| 6 | $path | / |
| 7 | $http_version | HTTP/1.1 |
| 8 | $status | 301 |
| 9 | $body_bytes_sent | 57 |
| 10 | $http_referer | - |
| 11 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |


There are several malformed requests in the access logs similar to the one we saw on the 1st line. Many of them will not be parsed reliably using our updated Python regex, so we will add an additional database column called malicious_request to store such malformed $requests in full rather than attempt to split them.
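
To make that fallback concrete, here is a minimal Python sketch (not the code used by the repo) that combines the two regexes defined above: the strict 11-field pattern is tried first, and anything it cannot split is stored whole in malicious_request:

def parse_line(line):
    # try the strict 11-field pattern first
    m = pattern.match(line)
    if m:
        return {**m.groupdict(), 'malicious_request': ''}
    # fall back to the lenient 9-field pattern and keep the raw $request intact
    m = malicious_pattern.match(line)
    if m:
        fields = m.groupdict()
        fields['malicious_request'] = fields.pop('request')
        fields.update(http_method='', path='', http_version='')
        return fields
    return None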


In fact, this was why we had a total of 12 Timeplus columns in the nginx_historical_access_log stream created in the previous article.


Below is a summary of how individual fields from an Nginx access log will map to Timeplus columns:

| # | Field | Column | Data Type | Remarks |
|---|-------|--------|-----------|---------|
| 1 | $remote_addr | remote_ip | ipv4 | - |
| 2 | - | rfc1413_ident | string | - |
| 3 | $remote_user | remote_user | string | - |
| 4 | $time_local | date_time | datetime64 | - |
| 5 | $request | http_method | string | 1st part of splitting a $request |
| 6 | $request | path | string | 2nd part of splitting a $request |
| 7 | $request | http_version | string | 3rd part of splitting a $request |
| 8 | $request | malicious_request | string | Stores any $request that fails to parse correctly |
| 9 | $status | status | int | - |
| 10 | $body_bytes_sent | size | int | - |
| 11 | $http_referer | referer | string | - |
| 12 | $http_user_agent | user_agent | string | - |


With that out of the way, let's dive into this article proper!


 

Docker Environment


The docker-compose.yaml file defines 3 containers:

  • a Timeplus Proton container;

  • a Grafana container;

  • a third container that seeds the Timeplus Proton container with access log data.



DDL Overview


The 3rd container uses the proton client to seed the Timeplus Proton database with access log data from an SQL file.


The SQL file contains the DDL summarized below.



Generating Data to Mimic Nginx Access Log Data


On line 3 of the SQL file, the CREATE RANDOM STREAM DDL is used to create a random stream named nginx_access_log.


Each column is randomly assigned a default value, as you can see on lines 4 - 30 of the file.
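
Since the SQL file itself isn't reproduced here, the sketch below gives a rough idea of what that CREATE RANDOM STREAM DDL could look like. It is assembled from the generator expressions listed in the table that follows, so treat the column types, array contents and defaults as illustrative; the repo's SQL file is the source of truth:

CREATE RANDOM STREAM nginx_access_log
(
	remote_ip ipv4 DEFAULT random_in_type('ipv4'),
	rfc1413_ident string DEFAULT '-',
	remote_user string DEFAULT '-',
	date_time datetime64 DEFAULT random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day),
	http_method string DEFAULT ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand() % 5],
	path string DEFAULT ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt'][rand() % 5], -- trimmed; the repo uses 11 sub-paths
	http_version string DEFAULT ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand() % 3],
	status int DEFAULT [200, 301, 302, 304, 400, 404][rand() % 6],
	size int DEFAULT rand() % 5000000,
	referer string DEFAULT ['-', 'https://ayewo.com/', 'https://google.com/'][rand() % 3], -- trimmed; the repo uses 4 referrers
	user_agent string DEFAULT ['Mozilla/5.0', 'NetNewsWire (RSS Reader; https://netnewswire.com/)'][rand() % 2], -- trimmed; the repo uses 14 user agents
	malicious_request string DEFAULT if(rand() % 100 < 5, '\x16', '') -- payload truncated here; see the repo for the full value
);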


The table below goes over each of the 12 Timeplus columns and the database functions used to generate test data.


| # | Column | Generated Data |
|---|--------|----------------|
| 1 | remote_ip | random_in_type('ipv4'): returns a random ipv4 representing a user's IP address. |
| 2 | rfc1413_ident | defaults to '-' but should be a string conforming to RFC 1413. Currently unused. |
| 3 | remote_user | defaults to '-' since most blog traffic is from unauthenticated users. Currently unused. |
| 4 | date_time | random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day): returns a random datetime64 within a 1-year interval starting at 2023-06-17, i.e. in [2023-06-17, 2024-06-17). |
| 5 | http_method | ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand()%5]: uses rand() to return a random index between [0, 5) in this 5-element array. The array samples 5 of the 39 HTTP verbs. |
| 6 | path | ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt', ...][rand()%11]: uses rand() to return a random index between [0, 11) in this 11-element array of sample URL subpaths. |
| 7 | http_version | ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand()%3]: similar to path. |
| 8 | status | [200, 301, 302, 304, 400, 404][rand()%6]: similar to path. |
| 9 | size | rand()%5000000: uses rand() to generate a random number of bytes sent, up to ~5 MB. |
| 10 | referer | ['-', 'https://ayewo.com/', '...', 'https://google.com/'][rand()%4]: similar to path. |
| 11 | user_agent | ['...', '...', ...][rand()%14]: similar to path. |
| 12 | malicious_request | if(rand()%100 < 5, '\x16\...', ''): returns a malformed sequence whenever rand()%100 is < 5. |


Random Streams


There is an important note about the design of random streams in Timeplus Proton mentioned in the RANDOM STREAM documentation that is worth highlighting here:


Note: When you run a Timeplus SQL query with a random stream, the data will be generated and analyzed by the query engine. Depending on the query, all generated data or aggregated states can be kept in memory during the query time. If you are not querying the random stream, no data is generated or kept in memory.


Essentially, each time you run a query like SELECT remote_ip, rfc1413_ident, remote_user, date_time, http_verb, path, http_ver, status, size, referer, user_agent, malicious_request FROM nginx_access_log LIMIT 1;, you'll get a different (random) result from the stream.


For instance, running that query yields the following output:

SELECT 
	remote_ip, rfc1413_ident, remote_user, date_time, http_verb, 
	path, http_ver, status, size, referer, user_agent, 
	malicious_request
FROM
	nginx_access_log
LIMIT 1

Query id: d4b7f92c-bcd8-42ec-94ef-655eae86861b

┌─remote_ip─────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 181.12.95.134 │ -             │ -           │ 2024-05-02 00:00:00.000 │           │ 
└───────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer─┬─user_agent────┬─malicious_request─┐
│ /sitemap.xml │          │    302 │ 21345 │ -       │ Plenary/4.6.2 │                   │
└──────────────┴──────────┴────────┴───────┴─────────┴───────────────┴───────────────────┘

1 row in set. Elapsed: 0.012 sec.


Running it again yields this output:


SELECT
	remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
 	path, http_ver, status, size, referer, user_agent, 
	malicious_request
FROM
	nginx_access_log
LIMIT 1

Query id: 11d071e4-6028-47da-826d-9db42176f8e5

┌─remote_ip──────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 204.25.223.210 │ -             │ -           │ 2024-05-15 00:00:00.000 │ POST      │ 
└────────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer────────────┐
│ /favicon.ico │ HTTP/1.1 │    301 │ 32062 │ https://ayewo.com/ │ 
└──────────────┴──────────┴────────┴───────┴────────────────────┘
┌─user_agent─────────────────────────────────────────┬─malicious_request─┐
│ NetNewsWire (RSS Reader; https://netnewswire.com/) │                   │
└────────────────────────────────────────────────────┴───────────────────┘

1 row in set. Elapsed: 0.013 sec.

Due to this design decision, we will create a snapshot from the random stream to serve as our test data. This snapshot will represent our access log data so we can reliably geolocate the latitude and longitude of each (randomly generated) IP address before attempting to visualize them in Grafana.
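
The seed container takes care of this automatically, but conceptually the snapshot boils down to pulling a bounded number of rows out of the random stream, e.g. something like the following (a sketch only; the repo's seed script may take the snapshot differently, and the file path and row count here simply mirror the console output shown in the next section):

SELECT * FROM nginx_access_log
LIMIT 1000
INTO OUTFILE '/tmp/nginx_export.csv'
FORMAT CSV;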



Environment Setup


1. Clone the Timeplus Proton repo locally:

git clone https://github.com/timeplus-io/proton.git

2. Change to the examples/nginx-grafana subfolder:

cd proton/examples/nginx-grafana

3. Sign up for a free IPinfo account and create an Access Token. Your Access Token is a 14-character alphanumeric string that will be read from the environment variable IPINFO_KEY and used to authenticate the bulk lookup of IP addresses to their respective geolocations.

export IPINFO_KEY=xxxxxxxxxxxxxx

4. Now start up all 3 containers using the docker-compose.yaml file inside the proton/examples/nginx-grafana subfolder:

docker compose up

Your console should show output similar to the one below:


Attaching to nginx-grafana-grafana-1, nginx-grafana-nginx-access-log-data-generator-1, nginx-grafana-proton-1
nginx-grafana-proton-1                           | Processing configuration file '/etc/proton-server/config.yaml'.
nginx-grafana-proton-1                           | Merging configuration file '/etc/proton-server/config.d/data-paths.xml'.
nginx-grafana-proton-1                           | Merging configuration file '/etc/proton-server/config.d/docker_related_config.xml'.
nginx-grafana-proton-1                           | Merging configuration file '/etc/proton-server/config.d/logger.xml'.
nginx-grafana-proton-1                           | Merging configuration file '/etc/proton-server/config.d/openssl.xml'.
nginx-grafana-proton-1                           | Merging configuration file '/etc/proton-server/config.d/user-directories.xml'.
nginx-grafana-proton-1                           | Logging information to /var/log/proton-server/proton-server.log
nginx-grafana-proton-1                           | Logging errors to /var/log/proton-server/proton-server.err.log
nginx-grafana-nginx-access-log-data-generator-1  | Using the following config for IPinfo:
nginx-grafana-nginx-access-log-data-generator-1  | {"cache_enabled":true,"token":"xxxxxxxxxxxxxx","open_browser":true}
nginx-grafana-nginx-access-log-data-generator-1  | Extracting only IP addresses from the 1st column in /tmp/nginx_export.csv to /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1  | Removing duplicate IP addresses from /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1  | Combined all IP addresses into a single file: /tmp/nginx_export.ipinfo.
nginx-grafana-nginx-access-log-data-generator-1  | removed '/tmp/nginx_export.ipinfo.tmp'
nginx-grafana-nginx-access-log-data-generator-1  | Extracting the rest of the data from /tmp/nginx_export.csv to /tmp/nginx_import.csv ...
nginx-grafana-nginx-access-log-data-generator-1  | renamed '/tmp/nginx_export.csv' -> '/tmp/nginx_import.csv'
nginx-grafana-nginx-access-log-data-generator-1  | Geo-locating all the IP addresses in bulk using the IPInfo API (https://ipinfo.io):
nginx-grafana-nginx-access-log-data-generator-1  |    Total IP addresses that will be looked up in bulk using /tmp/nginx_export.ipinfo: 1000.
nginx-grafana-nginx-access-log-data-generator-1  |    Geo-lookup of 1000 IP addresses written to file: /tmp/nginx_import.ipinfo.csv.
nginx-grafana-nginx-access-log-data-generator-1  |    Complete!
nginx-grafana-nginx-access-log-data-generator-1 exited with code 0
...

5. Open http://localhost:3000 in your web browser:


6. On the bottom left of the screen, click on "Nginx Access Logs Analysis" to open the dashboard that has been automatically set up for you:


7. Congrats! You've successfully set up the "Nginx Access Logs Analysis" dashboard in Grafana:


 

Quick Rundown of the Grafana Dashboard


There are a total of 5 panels in the Grafana dashboard.


Panel 1: Geodistribution of Web Requests


This panel uses a geomap to plot the geographic origin of web requests.


This is the query backing this panel:

SELECT
	date_time_string,
	ipv4_num_to_string_class_c(nal.remote_ip) as remote_ip,
	nip.country_name as country,
	nip.country_flag_emoji as emoji,
	nip.loc as loc,
	geohash_encode(to_float64(split_by_char(',', loc)[2]), to_float64(split_by_char(',', loc)[1])) as geohash
FROM table(nginx_historical_access_log) as nal
JOIN table(nginx_ipinfo) as nip ON nal.remote_ip = nip.ip
WHERE
	NOT empty(loc)
GROUP BY remote_ip, date_time_string, country, emoji, loc
HAVING count() > 0
ORDER BY date_time_string DESC;


Panel 2: Geolocation


This panel uses a table to display the same data shown in the geomap, in tabular form.


The query backing this panel is identical to the query for the geomap.



Panel 3: Top Traffic Sources


This panel uses a bar-chart to display the countries that send the most traffic.


This is the query backing this panel:

SELECT 
	count() as page_views, nip.country_name as country, 
	nip.country_flag_emoji as emoji 
FROM table(nginx_historical_access_log) as nal 
JOIN table(nginx_ipinfo) as nip ON nal.remote_ip = nip.ip 
GROUP BY country, emoji 
HAVING count() > 0 
ORDER BY page_views DESC;


Panel 4: Page Traffic


This panel also uses a table, this time to display the web pages that receive the most traffic from visitors.


This is the query backing this panel:

SELECT path, count() as request_count
FROM table(nginx_historical_access_log)
WHERE (path NOT LIKE '%.js%')
    AND (path NOT LIKE '%.css%')
    AND (path NOT LIKE '%.png%')
    AND (path NOT LIKE '%.gif%')
    AND (path NOT LIKE '%.jpg')
GROUP BY path
ORDER BY count() DESC;


Panel 5: HTTP Status Codes


This panel also uses a bar-chart to visualize an aggregation of the HTTP status codes.


This is the query backing this panel:

SELECT
    multi_if(status >= 400, '40x - Error', status > 300, '30x - Notice', '200 - OK') AS status_code,
    count() as frequency
FROM table(nginx_historical_access_log)
GROUP BY status_code;

 

Summary


This post showed how to use the RANDOM STREAM feature of Timeplus Proton to generate Nginx access log data for some quick experimentation.


It also highlighted the use of Timeplus Proton as a data source for visualizing queries in Grafana, made possible by Timeplus's open-source Grafana plugin.


My next blog post will explore the same analysis covered here using a different visualization tool: Metabase.


 

About the Author

Saïd Abiola is a Senior Software Engineer drawn to systems, networks and concurrent programming. See more of his writing on his blog: Misleading Metaphors.
