The following blog was written by our Timeplus Community member, Saïd Abiola, who originally published it to the Timeplus Proton /examples folder on GitHub.
In my last blog post, I showed how Timeplus Proton can be used to analyze web traffic in production (taking my personal blog as an example), using just the log files generated by an Nginx web server.
With SQL-based observability gaining popularity as an alternative to the ELK stack, a major tool in this trend is ClickHouse, known for its fast ingestion and query speeds for log handling. One example is Uber's log analytics platform: when it was based on ELK, ingestion ran at ~25.5K docs per second; now, a single ClickHouse node can handle 300K logs per second. Timeplus Proton extends ClickHouse's excellent log handling and historical data processing capabilities to streaming data.
Although I explored both real-time and historical analysis, the results weren't visualized. This article remedies that by showing how to build rich visualizations with open-source tools like Grafana.
Prerequisites
In addition to Timeplus Proton, the previous article relied on several other technologies: two AWS products, which require access to an AWS account, and two services that need to have been running in production in order to have meaningful access log data to experiment with.
This article builds on the same analysis as the previous article, but is quicker to get started with: you only need Docker and a free IPinfo account (see Environment Setup below).
Timeplus Proton has a handy feature, RANDOM STREAM, which we will use to generate all the access log data needed for experimentation in this article.
Before I dive into the article proper, I will first revisit the format used by Nginx to record access log data so that the data we'll generate will mimic its shape and properties.
The Shape of Nginx's Access Log Data
In the previous article, we briefly touched on the Nginx access logs but did not dwell on where they are stored on disk or how the data they contain should be parsed.
Nginx Access Log Data
The access logs for Nginx are typically written to the file /var/log/nginx/access.log.
Here's an excerpt from my blog's access logs:
161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"
51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
179.43.177.xxx - - [26/Jun/2023:06:45:28 +0000] "GET / HTTP/1.1" 301 60 "-" "Hello World"
87.236.176.xxx - - [26/Jun/2023:07:58:36 +0000] "GET / HTTP/1.1" 200 6364 "https://www.ayewo.com/" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
87.236.176.xxx - - [26/Jun/2023:07:58:38 +0000] "GET /favicon.png HTTP/2.0" 200 7636 "-" "Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)"
93.126.72.xxx - - [26/Jun/2023:08:29:38 +0000] "GET /how-to-install-rsync-on-windows/ HTTP/2.0" 200 6394 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
The first line in the access log is from the IP address 161.35.230.x.
We'll look at the Nginx access log format to understand what each field on that line means.
Nginx Access Log Format
Nginx's access log format is specified in /etc/nginx/nginx.conf and it broadly looks like this:
http {
    log_format compression '$remote_addr - $remote_user [$time_local] '
                           '"$request" $status $body_bytes_sent '
                           '"$http_referer" "$http_user_agent"';

    server {
        access_log /var/log/nginx/access.log;
        ...
    }
}
Here's the first line we saw earlier, but this time annotated to show individual fields from the log format.
We can see that the log format uses a single space as the separator between fields.
To parse that line reliably, we can use this Python regex:
import re

# Nginx escapes unprintable bytes as literal \xNN text in its logs, so a
# raw string reproduces the line exactly as it appears on disk.
line = r'161.35.230.x - - [26/Jun/2023:06:33:53 +0000] "\x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00" 400 182 "-" "-"'
malicious_pattern = re.compile(r'(?P<remote_addr>\S+) (?P<rfc1413_ident>\S+) (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"')
match = malicious_pattern.match(line)
The results of parsing the first line are shown in the table below:
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 161.35.230.x |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:33:53 +0000 |
| 5 | $request | \x00\x0E8uON\x85J\xCF\xC5\x93\x00\x00\x00\x00\x00 |
| 6 | $status | 400 |
| 7 | $body_bytes_sent | 182 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | - |
The line parses into a total of 9 fields, which will be represented as Timeplus columns later.
It is clear from the contents of the 5th field, i.e. $request, that this was a maliciously crafted request: it didn't specify a valid HTTP method (e.g. GET, POST or OPTIONS), which is why the server responded with an HTTP $status code of 400 (Bad Request).
Let's use the Python regex to parse the second line:
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $request | POST / HTTP/1.1 |
| 6 | $status | 301 |
| 7 | $body_bytes_sent | 57 |
| 8 | $http_referer | - |
| 9 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |
Unlike the first line, the $request field parsed from the second line contains a request string that is valid under the HTTP protocol.
Here's the second line again annotated to show the $request field and its 3 parts (in orange):
The $request field also uses a single space as its delimiter and can be split into three parts:
POST: indicates the HTTP verb or method
/: indicates the request URI (Uniform Resource Identifier) or path
HTTP/1.1: indicates the HTTP protocol version in use for the request
One of the analyses we will run later on Timeplus Proton is a fast aggregation query showing the top requested pages, as indicated by the $path field in the access logs. We could store each $request field in a single Timeplus column named request, but it is far more convenient to split the $request field into 3 separate columns named http_method, path and http_version.
If we update our Python regex as follows:
import re

# Same raw-string trick as before to reproduce the log line verbatim.
line = r'51.79.29.xx - - [26/Jun/2023:06:37:04 +0000] "POST / HTTP/1.1" 301 57 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"'
# The single "request" group is replaced by http_method, path and http_version.
pattern = re.compile(r'(?P<remote_addr>\S+) (?P<rfc1413_ident>\S+) (?P<remote_user>\S+) \[(?P<time_local>\S+ \+[0-9]{4})\] "(?P<http_method>\S+) (?P<path>\S+) (?P<http_version>\S+)" (?P<status>[0-9]+) (?P<body_bytes_sent>\S+) "(?P<http_referer>.*)" "(?P<http_user_agent>.*)"\s*\Z')
match = pattern.match(line)
We can use the updated Python regex to parse the 2nd line into 11 fields (the 1st line was parsed into 9 fields):
| # | Field | Value |
|---|-------|-------|
| 1 | $remote_addr | 51.79.29.xx |
| 2 | $rfc1413_ident | - |
| 3 | $remote_user | - |
| 4 | $time_local | 26/Jun/2023:06:37:04 +0000 |
| 5 | $http_method | POST |
| 6 | $path | / |
| 7 | $http_version | HTTP/1.1 |
| 8 | $status | 301 |
| 9 | $body_bytes_sent | 57 |
| 10 | $http_referer | - |
| 11 | $http_user_agent | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 |
There are several malformed requests in the access logs similar to the one we saw on the 1st line. Many of them will not parse reliably with our updated Python regex, so we will add an additional database column called malicious_request to store such malformed $requests in full rather than attempting to split them.
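To see why storing them whole is handy, here's a quick sketch: once captured, the malformed requests can be counted (or inspected) with a simple predicate, assuming the stream and column names introduced below:

-- A sketch: count the malformed requests captured verbatim in the
-- malicious_request column of the historical stream.
SELECT count() AS malicious_count
FROM table(nginx_historical_access_log)
WHERE NOT empty(malicious_request);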
This extra column is why we had a total of 12 Timeplus columns in the nginx_historical_access_log stream created in the previous article.
Below is a summary of how individual fields from an Nginx access log map to Timeplus columns, followed by a sketch of the corresponding DDL:
| # | Field | Column | Data Type | Remarks |
|---|-------|--------|-----------|---------|
| 1 | $remote_addr | remote_ip | ipv4 | - |
| 2 | - | rfc1413_ident | string | - |
| 3 | $remote_user | remote_user | string | - |
| 4 | $time_local | date_time | datetime64 | - |
| 5 | $request | http_method | string | 1st part of splitting a $request |
| 6 | $request | path | string | 2nd part of splitting a $request |
| 7 | $request | http_version | string | 3rd part of splitting a $request |
| 8 | $request | malicious_request | string | Stores any $request that fails to parse correctly |
| 9 | $status | status | int | - |
| 10 | $body_bytes_sent | size | int | - |
| 11 | $http_referer | referer | string | - |
| 12 | $http_user_agent | user_agent | string | - |
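For reference, here is a rough sketch of what that CREATE STREAM DDL could look like, assuming the column names and types from the table above; the exact DDL lives in the previous article:

-- A sketch only, assuming the column mapping above; the exact DDL is in
-- the previous article. Precision 3 for datetime64 is an assumption based
-- on the millisecond timestamps in the query outputs later on.
CREATE STREAM nginx_historical_access_log (
    remote_ip ipv4,
    rfc1413_ident string,
    remote_user string,
    date_time datetime64(3),
    http_method string,
    path string,
    http_version string,
    malicious_request string,
    status int,
    size int,
    referer string,
    user_agent string
);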
With that out of the way, let's dive into this article proper!
Docker Environment
The docker-compose.yaml file defines 3 containers:
a Timeplus Proton container;
a Grafana container;
a third container that seeds the Timeplus Proton container with access log data.
DDL Overview
The 3rd container uses the proton client to seed the Timeplus Proton database with access log data from an SQL file. That SQL file contains the DDL we'll walk through next.
Generating Data to Mimic Nginx Access Log Data
The CREATE RANDOM STREAM DDL creates a random stream named nginx_access_log, with each of its columns assigned a randomly generated default value.
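Here is a sketch of what that DDL looks like, reconstructed from the column summary below. The array literals are abbreviated (the actual file samples 11 paths, 4 referers and 14 user agents, and a longer malicious byte sequence), so treat this as illustrative rather than the exact statement in the repo:

-- A sketch reconstructed from the column summary below; the abbreviated
-- arrays and the malicious byte sequence are placeholders, and the exact
-- values live in the SQL file in the repo.
CREATE RANDOM STREAM nginx_access_log (
    remote_ip ipv4 default random_in_type('ipv4'),
    rfc1413_ident string default '-',
    remote_user string default '-',
    date_time datetime64(3) default random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day),
    http_verb string default ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand() % 5],
    path string default ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt'][rand() % 5],
    http_ver string default ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand() % 3],
    status int default [200, 301, 302, 304, 400, 404][rand() % 6],
    size int default rand() % 5000000,
    referer string default ['-', 'https://ayewo.com/', 'https://google.com/'][rand() % 3],
    user_agent string default ['Plenary/4.6.2', 'NetNewsWire (RSS Reader; https://netnewswire.com/)'][rand() % 2],
    malicious_request string default if(rand() % 100 < 5, '\x16...', '')
);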
The table below goes over each of the 12 Timeplus columns and the database functions used to generate test data.
| # | Column | Generated Data |
|---|--------|----------------|
| 1 | remote_ip | random_in_type('ipv4'): returns a random ipv4 representing a user's IP address. |
| 2 | rfc1413_ident | Defaults to '-' but should be a string conforming to RFC 1413. Currently unused. |
| 3 | remote_user | Defaults to '-' since most blog traffic is from unauthenticated users. Currently unused. |
| 4 | date_time | random_in_type('datetime64', 365, y -> to_time('2023-6-17') + interval y day): returns a random datetime64 within a 1-year window starting 2023-06-17, i.e. [2023-06-17, 2024-06-17). |
| 5 | http_verb | ['GET', 'POST', 'PUT', 'DELETE', 'HEAD'][rand()%5]: uses rand() to pick a random index in [0, 5) from this 5-element array, which samples 5 of the 39 registered HTTP verbs. |
| 6 | path | ['/rss/', '/', '/sitemap.xml', '/favicon.ico', '/robots.txt', ...][rand()%11]: uses rand() to pick a random index in [0, 11) from this 11-element array of sample URL subpaths. |
| 7 | http_ver | ['HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0'][rand()%3]: similar to path. |
| 8 | status | [200, 301, 302, 304, 400, 404][rand()%6]: similar to path. |
| 9 | size | rand()%5000000: uses rand() to generate a random number of bytes sent, up to ~5 MB. |
| 10 | referer | ['-', 'https://ayewo.com/', '...', 'https://google.com/'][rand()%4]: similar to path. |
| 11 | user_agent | ['...', '...', ...][rand()%14]: similar to path. |
| 12 | malicious_request | if(rand()%100 < 5, '\x16\...', ''): returns a malformed byte sequence for roughly 5% of rows. |
Random Streams
There is an important note about the design of random streams in Timeplus Proton mentioned in the RANDOM STREAM documentation that is worth highlighting here:
Note: When you run a Timeplus SQL query with a random stream, the data will be generated and analyzed by the query engine. Depending on the query, all generated data or aggregated states can be kept in memory during the query time. If you are not querying the random stream, no data is generated or kept in memory.
Essentially, each time you run a query like SELECT remote_ip, rfc1413_ident, remote_user, date_time, http_verb, path, http_ver, status, size, referer, user_agent, malicious_request FROM nginx_access_log LIMIT 1;, you'll get a different (random) result from the stream.
For instance, running that query yields the following output:
SELECT
remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
path, http_ver, status, size, referer, user_agent,
malicious_request
FROM
nginx_access_log
LIMIT 1
Query id: d4b7f92c-bcd8-42ec-94ef-655eae86861b
┌─remote_ip─────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 181.12.95.134 │ - │ - │ 2024-05-02 00:00:00.000 │ │
└───────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer─┬─user_agent────┬─malicious_request─┐
│ /sitemap.xml │ │ 302 │ 21345 │ - │ Plenary/4.6.2 │ │
└──────────────┴──────────┴────────┴───────┴─────────┴───────────────┴───────────────────┘
1 row in set. Elapsed: 0.012 sec.
Notice that http_verb and http_ver came back empty in that first run: array indexing in Proton (as in ClickHouse) is 1-based, so an expression like ['GET', ...][rand()%5] occasionally evaluates to index 0, which returns the element type's default value, i.e. an empty string. Running the query again yields a different output:
SELECT
remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
path, http_ver, status, size, referer, user_agent,
malicious_request
FROM
nginx_access_log
LIMIT 1
Query id: 11d071e4-6028-47da-826d-9db42176f8e5
┌─remote_ip──────┬─rfc1413_ident─┬─remote_user─┬───────────────date_time─┬─http_verb─┐
│ 204.25.223.210 │ - │ - │ 2024-05-15 00:00:00.000 │ POST │
└────────────────┴───────────────┴─────────────┴─────────────────────────┴───────────┘
┌─path─────────┬─http_ver─┬─status─┬──size─┬─referer────────────┐
│ /favicon.ico │ HTTP/1.1 │ 301 │ 32062 │ https://ayewo.com/ │
└──────────────┴──────────┴────────┴───────┴────────────────────┘
┌─user_agent─────────────────────────────────────────┬─malicious_request─┐
│ NetNewsWire (RSS Reader; https://netnewswire.com/) │ │
└────────────────────────────────────────────────────┴───────────────────┘
1 row in set. Elapsed: 0.013 sec.
Due to this design decision, we will create a snapshot from the random stream to serve as our test data. This snapshot will represent our access log data so we can reliably geolocate the latitude and longitude of each (randomly generated) IP address before attempting to visualize them in Grafana.
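Conceptually, the snapshot is just a bounded query over the random stream. Below is a minimal sketch of the idea, assuming the column names used so far; the actual seeding script instead routes the rows through a CSV file so the IP addresses can be geolocated in bulk first:

-- A sketch only: materialize 1,000 randomly generated rows as fixed test
-- data. The real seeding flow exports to CSV, geolocates the IPs via
-- IPinfo, then imports the enriched results.
INSERT INTO nginx_historical_access_log
    (remote_ip, rfc1413_ident, remote_user, date_time, http_method,
     path, http_version, status, size, referer, user_agent, malicious_request)
SELECT
    remote_ip, rfc1413_ident, remote_user, date_time, http_verb,
    path, http_ver, status, size, referer, user_agent, malicious_request
FROM nginx_access_log
LIMIT 1000;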
Environment Setup
1. Clone the Timeplus Proton repo locally:
git clone https://github.com/timeplus-io/proton.git
2. Change to the examples/nginx-grafana subfolder:
cd proton/examples/nginx-grafana
3. Sign up for a free IPinfo account and create an Access Token. Your Access Token is a 14-character alphanumeric string that will be read from the IPINFO_KEY environment variable to authenticate the bulk lookup of IP addresses to their respective geolocations.
export IPINFO_KEY=xxxxxxxxxxxxxx
4. Now start up all 3 containers using the docker-compose.yaml file inside the proton/examples/nginx-grafana subfolder:
docker compose up
Your console should show output similar to the one below:
Attaching to nginx-grafana-grafana-1, nginx-grafana-nginx-access-log-data-generator-1, nginx-grafana-proton-1
nginx-grafana-proton-1 | Processing configuration file '/etc/proton-server/config.yaml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/data-paths.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/docker_related_config.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/logger.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/openssl.xml'.
nginx-grafana-proton-1 | Merging configuration file '/etc/proton-server/config.d/user-directories.xml'.
nginx-grafana-proton-1 | Logging information to /var/log/proton-server/proton-server.log
nginx-grafana-proton-1 | Logging errors to /var/log/proton-server/proton-server.err.log
nginx-grafana-nginx-access-log-data-generator-1 | Using the following config for IPinfo:
nginx-grafana-nginx-access-log-data-generator-1 | {"cache_enabled":true,"token":"xxxxxxxxxxxxxx","open_browser":true}
nginx-grafana-nginx-access-log-data-generator-1 | Extracting only IP addresses from the 1st column in /tmp/nginx_export.csv to /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1 | Removing duplicate IP addresses from /tmp/nginx_export.ipinfo.tmp ...
nginx-grafana-nginx-access-log-data-generator-1 | Combined all IP addresses into a single file: /tmp/nginx_export.ipinfo.
nginx-grafana-nginx-access-log-data-generator-1 | removed '/tmp/nginx_export.ipinfo.tmp'
nginx-grafana-nginx-access-log-data-generator-1 | Extracting the rest of the data from /tmp/nginx_export.csv to /tmp/nginx_import.csv ...
nginx-grafana-nginx-access-log-data-generator-1 | renamed '/tmp/nginx_export.csv' -> '/tmp/nginx_import.csv'
nginx-grafana-nginx-access-log-data-generator-1 | Geo-locating all the IP addresses in bulk using the IPInfo API (https://ipinfo.io):
nginx-grafana-nginx-access-log-data-generator-1 | Total IP addresses that will be looked up in bulk using /tmp/nginx_export.ipinfo: 1000.
nginx-grafana-nginx-access-log-data-generator-1 | Geo-lookup of 1000 IP addresses written to file: /tmp/nginx_import.ipinfo.csv.
nginx-grafana-nginx-access-log-data-generator-1 | Complete!
nginx-grafana-nginx-access-log-data-generator-1 exited with code 0
...
5. Open http://localhost:3000 in your web browser:
6. On the bottom left of the screen, click on "Nginx Access Logs Analysis" to open the dashboard that has been automatically set up for you:
7. Congrats! You've successfully set up the "Nginx Access Logs Analysis" dashboard in Grafana:
Quick Rundown of the Grafana Dashboard
There are a total of 5 panels in the Grafana dashboard.
Panel 1: Geodistribution of Web Requests
This panel uses a geomap to plot the geographic origin of web requests.
This is the query backing this panel:
SELECT
    date_time_string,
    ipv4_num_to_string_class_c(nal.remote_ip) AS remote_ip,
    nip.country_name AS country,
    nip.country_flag_emoji AS emoji,
    nip.loc AS loc,
    geohash_encode(to_float64(split_by_char(',', loc)[2]), to_float64(split_by_char(',', loc)[1])) AS geohash
FROM table(nginx_historical_access_log) AS nal
JOIN table(nginx_ipinfo) AS nip ON nal.remote_ip = nip.ip
WHERE NOT empty(loc)
GROUP BY remote_ip, date_time_string, country, emoji, loc
HAVING count() > 0
ORDER BY date_time_string DESC;

Note the argument order in geohash_encode(): IPinfo's loc field is a "latitude,longitude" string, while geohash_encode() expects longitude first, which is why element [2] is passed before element [1].
Panel 2: Geolocation
This panel uses a table to display the data shown in the geomap in tabular form.
The query backing this panel is identical to the query for the geomap.
Panel 3: Top Traffic Sources
This panel uses a bar chart to display the countries that send the most traffic.
This is the query backing this panel:
SELECT
    count() AS page_views,
    nip.country_name AS country,
    nip.country_flag_emoji AS emoji
FROM table(nginx_historical_access_log) AS nal
JOIN table(nginx_ipinfo) AS nip ON nal.remote_ip = nip.ip
GROUP BY country, emoji
HAVING count() > 0
ORDER BY page_views DESC;
Panel 4: Page Traffic
This panel also uses a table, this time to display the web pages that receive the most traffic from visitors.
This is the query backing this panel:
SELECT path, count() as request_count
FROM table(nginx_historical_access_log)
WHERE (path NOT LIKE '%.js%')
AND (path NOT LIKE '%.css%')
AND (path NOT LIKE '%.png%')
AND (path NOT LIKE '%.gif%')
AND (path NOT LIKE '%.jpg%')
GROUP BY path
ORDER BY count() DESC;
Panel 5: HTTP Status Codes
This panel also uses a bar chart, this time to visualize an aggregation of the HTTP status codes.
This is the query backing this panel:
SELECT
multi_if(status >= 400, '40x - Error', status > 300, '30x - Notice', '200 - OK') AS status_code,
count() as frequency
FROM table(nginx_historical_access_log)
GROUP BY status_code;

Note that multi_if() evaluates its conditions in order, so statuses of 400 and above are labeled '40x - Error' before the status > 300 condition gets a chance to match them.
Summary
This post showed how to use the RANDOM STREAM feature of Timeplus Proton to generate Nginx access log data for some quick experimentation.
It also highlights the use of Timeplus Proton as a data source for visualizing queries in Grafana, made possible by Timeplus's open-source Grafana plugin.
My next blog post will explore the same analysis covered here using a different visualization tool: Metabase.
About the Author
Saïd Abiola is a Senior Software Engineer drawn to systems, networks and concurrent programming. See more of his writing on his blog: Misleading Metaphors.