Private Git Servers, for the Paranoid Scientist

Data-compliant version control, with off-site backup.

diy
tech
research
Published

February 28, 2026

There is a principle in software engineering that says you should version control everything. The idea is that if you track every change you make to your code, you can always go back to a version that worked. This is obviously correct, and although I have been using it for years in writing (Part One has 3,859 commits at time of writing) and software, I have been studiously ignoring it for my statistics work.

The reason is that I use GitHub, which is Microsoft-owned, and I usually deal with data that comes with a variety of riders on where it can be stored, how it can be transferred, and who can access it. I am naturally distrustful1 of the protean terms-of-service agreements of cloud providers, and have for the last couple of years kept all of my research work confined to a Nextcloud instance which runs on my home server. This provides version control of a kind (it keeps old document versions), but it is an imperfect solution: reverting changes is clunky and unreliable, and only to be attempted in cases of extreme duress.

1 Justifiably.

I chose to revisit this yesterday because, over the last year, I have been working on a project with some novel data. The data format is still being worked out, which means every data dump I receive is formatted slightly differently, and the data cleaning scripts have needed revision every three months. This has led to a mess forming in the project directory, and at some point the digital squalor becomes too much to be easily swept under the bed. I am too paranoid about inadvertently uploading data to GitHub for it to be used for this work, and syncing a git history over Nextcloud feels like a recipe for data corruption.

Over the last couple of years running a home server I have gotten much more comfortable with the shell and ssh, been blown away by the convenience of Docker containers, and discovered that there is a certain pleasure in running your own infrastructure. Pleasure, and the constant low-level fear of a mission-critical bug at 11pm on a Sunday.2

2 One of the biggest surprises I have had is that it has actually required very little effort to maintain, and nothing has ever seriously broken. Setting it up took the better part of a weekend, I log in every couple of weeks to update the docker containers, and I’ve only had three container-breaking issues over the 2 years I have run it. Two of these came after Nextcloud updates, and were fixed by logging in to the web GUI. It has been a remarkably stable and incredibly useful investment.

The Platform Decision

The self-hosted git space is crowded in the way that open-source software tends to be crowded: many options, with one obviously correct answer that you’ll identify after reading a half-dozen Reddit threads:

  1. GitLab is the enterprise option. It is very powerful and requires, by some accounts, more RAM than a small hospital. For a single user backing up their R scripts, this is more than is required.
  2. Gogs is small and fast, and also had a critical vulnerability added to CISA’s Known Exploited Vulnerabilities catalogue in January 2026. Moving on.
  3. Gitea and Forgejo are essentially the same software — Forgejo is a community fork created when Gitea was acquired by a for-profit company in 2022, in the way that these things tend to go.3 Forgejo runs in Docker (tick), uses SQLite (tick) for a single user, consumes around 150MB of RAM (tick), name from Esperanto (cross), and provides a web interface that looks enough like GitHub that I felt at home (tick). It is the obvious correct answer.

3 The open-source software lifecycle: a project is created by volunteers, becomes popular, gets acquired, community forks it, the fork becomes popular, repeat. It is a kind of institutional metabolism.

Installation

My home server runs Unraid, which manages Docker containers through a reasonably civilised web interface. Forgejo has a Community Applications template, which means installation is largely a matter of clicking things and making two decisions:

  1. Port mapping.
    Forgejo wants port 22 for SSH, which Unraid is already using for itself. Map the container’s SSH to port 222 instead. This will matter later.

  2. Storage.
    I mapped the data directly to a new share at /mnt/user/git/. I tend to commit chunks of work at a time (a new script, say), with potentially breaking updates committed separately. You could also point the share at the cache pool directly if you are a frequent committer.

Remote access remains handled by Tailscale, which I already had running on the server. Tailscale creates a private encrypted network between your devices, which means Forgejo is accessible from anywhere without being accessible to anyone else. Point Forgejo’s ROOT_URL at the Tailscale hostname and the web interface, clone URLs, and everything else work correctly.
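In Forgejo’s app.ini (its configuration keys are inherited from Gitea; the hostname below is a placeholder for your own tailnet name), that looks something like:

```ini
[server]
; Hostname Tailscale assigns to the server (placeholder)
DOMAIN   = your-server.your-tailnet.ts.net
; Base URL used for links and HTTP clone URLs in the web interface
ROOT_URL = http://your-server.your-tailnet.ts.net:3000/
; Port advertised in generated SSH clone URLs -- the host-side
; mapping, not the container's internal port 22
SSH_PORT = 222
```

The same values can be set via environment variables in the Docker template; either way, the point is that the clone URLs Forgejo generates match how you actually reach it.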

Setting up SSH Authentication

SSH authentication requires that your public key be registered with the Forgejo server. This is not complicated, but it is a step that Git clients running on top of GitHub have quietly handled for years, and so it is easy to forget it exists.4

4 GitHub Desktop, specifically, has a “Publish repository” button that creates the remote and pushes in a single step, handling authentication invisibly. This is very convenient right up until you’re working with a different git host and discover you don’t know how any of it works.

The sequence is:

# Check if you already have a key
cat ~/.ssh/id_ed25519.pub

# If not, generate one
ssh-keygen -t ed25519 -C "someString"

Copy the output of cat, go to Forgejo Settings → SSH/GPG Keys → Add Key, paste it in. Done. Subsequent connections authenticate without prompting for credentials.
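If you would rather not remember the non-standard port every time, an entry in ~/.ssh/config will do it for you (the hostname is a placeholder for your own Tailscale name):

```
Host forgejo
    HostName your-server.your-tailnet.ts.net
    Port 222
    User git
    IdentityFile ~/.ssh/id_ed25519
```

With this in place, `git clone forgejo:username/repo-name.git` should work without spelling out the full ssh:// URL.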

Connecting Positron

I use the GitHub Desktop GUI for handling git on my public projects. This isn’t an option here. Conveniently, Positron, which has become my data science IDE of choice5, has the same built-in git integration as VS Code, which is to say it can talk to any standard git remote, including Forgejo.

5 Positron is basically VS Code cosplaying as a modern version of RStudio. As I’ve spent the last few months doing more general coding and less statistics, I’ve spent more time with VS Code, and now that I am back in a research phase I have found that the move to Positron removes some of the pain points of RStudio.

There is not, as far as I can see, a Positron workflow for pushing an existing local repository to Forgejo. This means setup is a bit more tedious than with the GitHub GUI, but basically involves:

  1. Create an empty repository in the Forgejo web interface, and copy the address (e.g. ssh://git@your-server.ts.net:222/username/repo-name.git). This has to be done for every project (this is the tedious part).6
  2. Use the “New folder from Git…” option in Positron and paste the SSH or HTTP link to the Forgejo remote.

6 The SSH URL contains one footgun: note the :222 rather than the standard :22. This is the port remapping from the installation step, now coming back to require attention. Forgejo’s web interface generates the correct clone URL automatically, provided you told it during setup that your SSH port is 222.
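The same setup can also be done from the command line: create the empty repository in the web UI as above, then point the existing project at it. A sketch of the sequence, using a throwaway local bare repository as a stand-in for the Forgejo remote so it runs self-contained (in practice REMOTE would be the ssh:// URL from step 1):

```shell
# Stand-in for the empty Forgejo repo; in practice this would be
# ssh://git@your-server.ts.net:222/username/repo-name.git
REMOTE="$(mktemp -d)/repo-name.git"
git init --quiet --bare "$REMOTE"

# An existing local project that has never been pushed anywhere
PROJECT="$(mktemp -d)/project"
mkdir -p "$PROJECT"
cd "$PROJECT"
git init --quiet -b main
echo 'print("hello")' > clean_data.R
git add clean_data.R
git -c user.name=me -c user.email=me@example.com \
    commit --quiet -m "Initial commit"

# Point the project at the new remote and push everything
git remote add origin "$REMOTE"
git push --quiet -u origin main
```

After the push, the web interface shows the full history, and Positron’s Source Control panel treats it like any other cloned repo.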

After this setup, Positron’s Source Control panel handles subsequent commits and syncs through the GUI.

An Offsite Copy, Because Paranoia Is Justified

Having a private git server is good. Having a backup on a machine in a different building is better. Git repositories are small, disk space is cheap, and the cost of being wrong about “unlikely” is losing years of work.

Accordingly, I have a headless Ubuntu7 machine living elsewhere that keeps a separate backup of my Nextcloud. I wanted to set it up to clone every repository from Forgejo automatically, including any repos created since the last run. The machine phones home daily using a Dead Man’s Snitch, and ideally this is as much as I’ll ever have to do with it.

7 I chose Ubuntu because it supports ZFS out-of-the-box, which gives strong native protection against drive failure.

The approach uses Forgejo’s API to enumerate all repositories, then mirrors each one:

#!/bin/bash
FORGEJO_URL="http://server.tailnet.ts.net:3000"
USERNAME="username"
TOKEN="your-token"
BACKUP_DIR="/home/username/git-backups"
LOG_FILE="/home/username/scripts/logs/git-backup.log"

mkdir -p "$BACKUP_DIR"
echo "[$(date)] Starting git backup" >> "$LOG_FILE"

page=1
while true; do
    repos=$(curl -s -H "Authorization: token $TOKEN" \
        "$FORGEJO_URL/api/v1/repos/search?limit=50&page=$page" \
        | grep -o '"ssh_url":"[^"]*"' | sed 's/"ssh_url":"//;s/"//')

    [ -z "$repos" ] && break

    while IFS= read -r ssh_url; do
        [ -z "$ssh_url" ] && continue
        repo_name=$(basename "$ssh_url" .git)
        repo_path="$BACKUP_DIR/$repo_name"

        if [ -d "$repo_path" ]; then
            echo "[$(date)] Updating $repo_name" >> "$LOG_FILE"
            git -C "$repo_path" fetch --all >> "$LOG_FILE" 2>&1 \
                || echo "[$(date)] ERROR: failed to fetch $repo_name" >> "$LOG_FILE"
        else
            echo "[$(date)] Cloning $repo_name" >> "$LOG_FILE"
            git clone "$ssh_url" "$repo_path" >> "$LOG_FILE" 2>&1 \
                || echo "[$(date)] ERROR: failed to clone $repo_name" >> "$LOG_FILE"
        fi
    done <<< "$repos"

    ((page++))
done

echo "[$(date)] Backup complete" >> "$LOG_FILE"
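One caveat with plain clone-plus-fetch: `git fetch --all` updates the backup’s remote-tracking refs, but its local branches stay wherever they were at clone time, and branches deleted on the server are never pruned. The more usual shape for a backup is a `--mirror` clone, which copies every ref verbatim. A self-contained sketch (a throwaway local repository stands in for a Forgejo ssh_url, and this is an alternative shape rather than a drop-in patch to the script above):

```shell
# Throwaway source repo standing in for a Forgejo remote
SRC="$(mktemp -d)/project"
mkdir -p "$SRC"
git -C "$SRC" init --quiet -b main
git -C "$SRC" -c user.name=me -c user.email=me@example.com \
    commit --quiet --allow-empty -m "Initial commit"
git -C "$SRC" branch experiment

# First run: a bare, exact copy of every ref (branches, tags, the lot)
MIRROR="$(mktemp -d)/project.git"
git clone --quiet --mirror "$SRC" "$MIRROR"

# Subsequent runs: re-sync, pruning refs deleted upstream
git -C "$MIRROR" fetch --quiet --prune origin

# List everything the backup is holding
git -C "$MIRROR" for-each-ref --format='%(refname:short)'
```

A mirror is bare (no working tree), which is fine for a backup you only intend to clone from in an emergency.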

The script is scheduled to run every 4 hours via cron:

0 */4 * * * /home/username/scripts/git-backup.sh

This whole setup was surprisingly simple, and took less than an hour. The result is a git server that runs on known hardware, backed up in a known place, and accessible only on a network I manage. Turns out paranoia is cheap.