Server Side Analytics with GoAccess
When I updated my blog about a month ago, my primary goals were to simplify and modernize things. I updated to the latest version of Hugo, which powers my blog, added automatic deployment of new posts using GitHub Actions, and moved away from my previous hosting provider, Firebase, which was far more than I needed to host my simple, static site.
One of those simplifications was removing Google Analytics from my site. I, like many others, have become more and more aware of the privacy trade-offs that I’ve often overlooked in technology. Once I examined my need for Google Analytics, it became clear that I hardly ever used it, and I decided that it wasn’t worth pumping information about the behavioral patterns of my readers into Google’s big data machine.
I’m certainly not anti-analytics, but I find it important to understand and weigh the trade-offs of any technology. Since this is a hobby site rather than an ad-powered revenue generator, there isn’t much benefit to such a complex and invasive tracking apparatus.
While I don’t need everything that Google Analytics offers, it is definitely helpful to have an idea of which content people find valuable. It’s also something I have to report to the Google Developer Expert program. But even with these requirements, there’s a clear difference between analytics that track my readers and analytics that track which content on my site gets accessed.
Server Side Analytics⌗
The solution to getting this sort of information without violating my readers’ trust is Server Side Analytics. With server side analytics, reports are generated from the server access logs rather than from a snippet of JavaScript that runs in each user’s browser. This means you can easily track how many times a page is requested, and even some high-level information like which OS your readers use, but you aren’t helping any company track specific users around the whole internet.
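For a sense of what the server actually sees, here’s a made-up access log entry in the common “combined” format (not a line from my real logs). It contains everything a server side report is built from: the requesting IP, timestamp, page requested, response status, referrer, and user agent.

203.0.113.7 - - [10/Jun/2020:13:55:36 +0000] "GET /blog/goaccess/ HTTP/1.1" 200 5316 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64) Firefox/77.0"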
My first thought was to switch to serving my site through CloudFlare’s CDN and use their analytics feature to understand how many visitors accessed my site. I quickly realized, however, that their analytics dashboards were limited to site-level reports and wouldn’t show me access numbers for specific pages.
While setting up and testing CloudFlare analytics, I realized that I should be able to get all the information I need from server access logs. This thought led me to the GoAccess project, a visual web log analyzer that creates reports similar to those of Google Analytics purely by analyzing your web server’s access logs.
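In its simplest form, GoAccess just takes a log file and a log format and spits out a report. Here’s a minimal example (the file names are illustrative) that turns a standard combined-format access log into a self-contained HTML report:

# Parse a combined-format access log and write an HTML report.
goaccess access.log --log-format=COMBINED -o report.html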
GoAccess seemed like the perfect solution for my non-invasive, server side analytics needs. Unfortunately, it wouldn’t work with my existing host, GitHub Pages, since GitHub doesn’t make the access logs available. From some past work experience, I knew that I could easily host a static website using Amazon S3 and the CloudFront CDN, and have the CDN access logs saved to an S3 bucket.
After quickly confirming that Hugo sites are absurdly easy to deploy to Amazon Web Services (AWS), I spent an evening in the AWS console setting up the buckets and CDN to host my site. I then updated my GitHub Actions workflow to deploy there instead of GitHub Pages, and went to bed.
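The deploy step itself is short. Here’s a sketch of what such a workflow step might run; the bucket name and distribution ID are placeholders, not my actual values.

# Build the site, push the output to S3, and invalidate the CDN cache.
hugo --minify
aws s3 sync public/ "s3://example-site-bucket" --delete
aws cloudfront create-invalidation --distribution-id EXAMPLE123ABC --paths "/*"

You may have seen the result: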
Got a ssl certificate error when trying to access: NET::ERR_CERT_COMMON_NAME_INVALID
— Black Lives Matter - Grégory Lureau (@glureau) June 10, 2020
A quick SSL certificate update and things were back to normal. :)
The next step was to get GoAccess processing the log files and displaying them somewhere.
GoAccess and CloudFront⌗
While GoAccess has support for processing logs in real time and serving them via its own HTTP server, CloudFront doesn’t provide realtime logs. Instead, it periodically drops gzipped log files into an S3 bucket, roughly every 5 minutes in my case. That makes running a full server, complete with streaming WebSocket support, overkill.
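Since the logs arrive as a pile of gzipped files, processing them is just a matter of downloading them, decompressing them, and feeding them to GoAccess, which has a built-in CloudFront log format. A rough sketch of that loop (the bucket name comes from my setup below; the paths are illustrative):

# Pull down any new log files, then build a report from all of them.
aws s3 sync "s3://ryanharter.com-access-logs" /logs
zcat /logs/*.gz | goaccess - --log-format=CLOUDFRONT --date-format=CLOUDFRONT --time-format=CLOUDFRONT -o /config/report.html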
What I decided on instead was a simple Docker container that would sync the logs and process them as a cron job. Over the last couple of weeks, with some testing help from Jake Wharton, I created a Docker container to do just that.
The container automatically syncs log files from S3 and processes them using GoAccess; I run it with Docker Compose on my home server. Here’s the configuration I use to process my logs:
analytics:
  container_name: analytics
  image: rharter/goaccess-cloudfront
  restart: unless-stopped
  environment:
    - PUID=${PUID}
    - PGID=${PGID}
    - TZ=${TZ}
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - BUCKET=ryanharter.com-access-logs
    - CRON=*/30 * * * *
    - PRUNE=60
  volumes:
    - ${USERDIR}/docker/analytics/logs:/logs:rw
    - ${USERDIR}/docker/analytics:/config:rw
This configuration runs every 30 minutes, syncing log files from my log bucket, processing them with GoAccess, serving the resulting report with nginx, and pruning log files that are more than 60 days old.
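The PRUNE option is what keeps the local log directory from growing forever. Conceptually it amounts to something like the following (a sketch, not the container’s exact implementation):

# Delete synced log files older than 60 days.
find /logs -name '*.gz' -mtime +60 -delete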
You can check out the repository for more details and options. Since pasting the configuration above, I’ve even updated mine to run a script that uploads the processed report to the same S3 bucket I use to host my website, making my analytics publicly available on my site.
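That upload script is nothing fancy; something along these lines would do it (the report path, bucket, and prefix here are assumptions, not my exact setup):

# Copy the generated report into the website bucket so it’s served publicly.
aws s3 cp /config/report.html "s3://example-site-bucket/analytics/index.html" --content-type text/html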
Conclusion⌗
While I’m still tweaking my configuration to give me just the right results, this has proved to be a fun exercise. Not only did I get to use my home server, but I was able to get privacy-respecting analytics for my website while keeping it proudly JavaScript free.