Frans Veldman f95517d3a1 Some rework of the Readme file.

2026-06-12 08:36:42 +00:00

SearxNG-Captcha

Instructions and code to add a 90-day captcha cookie to SearxNG. The captcha engine can be self-hosted. This setup almost completely eliminated bot abuse of my instance.

Introduction

My SearxNG instance on searx.thefloatinglab.world suffered so badly from bots that it became unusable. Looking around at fellow SearxNG instances, I learned that they all suffer from the same problem.

Despite deploying botlists and limiters and manually blocking the most obvious bots, it remained an endless battle. Most of the time my instance was useless. So either I had to give up on this project, or find a way to block the bots and let the genuine users through. A captcha system might not be popular, but on the other hand, a useless site is, well, pretty useless. By combining the captcha system with a cookie, the user has to solve the captcha only once every 90 days. A small price to pay for access to a wonderful instance!

Features

No modification of the SearxNG code is necessary; the captcha system runs entirely within Nginx.
The captcha, once solved, stays valid for 90 days.
No puzzles to solve, just a confirmation click.
It is possible and encouraged to self-host the captcha system, so no information leaks to the outside world.
The privacy and security of SearxNG are maintained if a self-hosted captcha system is used.
Optional automatic reporting to AbuseIPDB.
Optionally, Cloudflare Turnstile can be used as captcha provider instead.
Everything is script-based; no compilation is necessary.
In an emergency, all existing cookies can be invalidated at once.

Installation

This guide is not a complete walkthrough. I show you my code and how I have set it up. Some tweaking might be necessary, and the locations of files on your system might differ from mine. I have not attempted to add subdirectories to this repository, so you have to download the files and place them manually in the correct locations.

Prerequisites

Nginx is used for the reverse proxying.
Lua and some dependencies need to be installed (apt install lua).
It is recommended to self-host the captcha engine, see: github.com/tiagozip/cap.
A site/secret key pair and a URL/API key for the captcha engine.
Optionally, an API key for AbuseIPDB.

00-captcha-init.conf

This file resides on my system in "/etc/nginx/conf.d".

It configures Lua and also creates an extended log format. This log format is optional, but it allows you to see in the log file whether someone got past the captcha. Most bots will not progress beyond the "challenge" state.

captcha.env

This file is in my /etc/nginx directory.

The COOKIE_SECRET must be generated with "openssl rand -hex 32"

COOKIE_SECRET = <generate your own key>
ABUSEIPDB_API_KEY = <Optional! Obtain a key at abuseipdb for automated bot reporting, or leave empty for no bot reporting>
CAP_API_URL = https://captcha.thefloatinglab.world # self-hosted site
CAP_SITE_KEY = 1a9933aa22 # Example, change this!
CAP_SECRET_KEY = sk-TF8Gn4KKMSC0h46j83AqZWNnga6nlc5v4hoHwn7nE # Example, change this!
# Leave the CAP entries empty to use the Turnstile captcha.
TURNSTILE_SITE_KEY = 0x4AAAAAADisco1ig4Qu4hPJ # Example, change this!
TURNSTILE_SECRET_KEY = 0x4AAAAAADisca-OEq9hnPskVM6G57pTXsM # Example, change this!
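On a machine without the openssl command line tool, a secret of the same shape can be produced with Python's standard library. This is only a convenience sketch; the variable name cookie_secret is illustrative, and the openssl command above remains the documented way.

```python
import secrets

# 32 random bytes rendered as 64 hexadecimal characters,
# the same shape of output as `openssl rand -hex 32`.
cookie_secret = secrets.token_hex(32)

# Print a line that can be pasted into captcha.env.
print(f"COOKIE_SECRET = {cookie_secret}")
```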

captcha.conf

This file is in my /etc/nginx/snippets directory.

Do not modify this file.

captcha.lua

This file is in my /etc/nginx/lua directory.

This file contains the core of the code.

SearxNG vhost

You have to modify your nginx searxng vhost file to run the captcha.

Add the optional line "access_log /var/log/nginx/searx.access.log ts;" for enhanced logging features.
Add the required line "include snippets/captcha.conf;".
Add the line "access_by_lua_block { require("captcha").guard() }" in every "location" block that needs protection.
Add "location" blocks for locations that should not be protected, such as the ones used for health checks/monitoring.
Duplicate your root location "/" to "/searxng/" to catch the bots that enter from there.

You likely have no "location" for "/searxng/stats" yet, but if you use a health checker or monitor bot on the /stats directory, you can add it without the reference to the captcha system, so it remains accessible without the need for solving the captcha first.

Most bots search via "/?q=", but some also use "/searxng/?q=", so both locations should be listed here.

    access_log /var/log/nginx/searx.access.log ts; # <-- Optional! The "ts" suffix indicates the extended log format so captcha status is shown.

    include snippets/captcha.conf; # <-- REQUIRED!

    # Add this location if you want to keep /searxng/stats captcha free. A reason to do this is that you might have it checked by a monitor bot (uptime).
    location = /searxng/stats {
        proxy_pass http://127.0.0.1:8886;
    }

    # You need to mention this location specifically to catch the bots that do not search via the root but via /searxng.
    location /searxng/ {
        access_by_lua_block { require("captcha").guard() } # <-- Add this!
        proxy_pass http://127.0.0.1:8886;
    }

    location / {
        access_by_lua_block { require("captcha").guard() } # <-- Add this!
        proxy_pass http://127.0.0.1:8886;
    }

Things worth knowing

Single-use tokens + your 90-day cookie. Both providers issue tokens that are good for one verify call, after which your cookie carries the user. The cookie is provider-agnostic, so an existing __ts_verified cookie continues to work after you switch providers — if the same COOKIE_SECRET is still in the env file. Rotating that secret invalidates all passes regardless of who issued them.
I'm not affiliated in any way with the CAP self hosted captcha provider, but it looks like a sound project to me. You can fall back on Cloudflare Turnstile if you have more confidence in them, but beware that they do some logging and analysis which partly defeats the purpose of SearxNG.
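To illustrate why a signed cookie is provider-agnostic and why rotating COOKIE_SECRET invalidates every pass at once, here is a minimal sketch in Python. It is not the code in captcha.lua: the cookie layout (expiry timestamp, a dot, then an HMAC-SHA256 signature) and the function names mint_cookie and verify_cookie are assumptions made for this example only.

```python
import hashlib
import hmac
import time
from typing import Optional

NINETY_DAYS = 90 * 24 * 3600  # cookie lifetime in seconds


def _sign(secret: str, expiry: str) -> str:
    """HMAC-SHA256 over the expiry timestamp, hex encoded."""
    return hmac.new(secret.encode(), expiry.encode(), hashlib.sha256).hexdigest()


def mint_cookie(secret: str, now: Optional[int] = None) -> str:
    """Return '<expiry>.<signature>' (hypothetical layout, not the real one)."""
    issued = int(time.time()) if now is None else now
    expiry = str(issued + NINETY_DAYS)
    return f"{expiry}.{_sign(secret, expiry)}"


def verify_cookie(secret: str, cookie: str, now: Optional[int] = None) -> bool:
    """Accept only a correctly signed cookie that has not expired yet."""
    current = int(time.time()) if now is None else now
    expiry, sep, sig = cookie.partition(".")
    if not sep or not (expiry.isascii() and expiry.isdigit()):
        return False
    expected = _sign(secret, expiry)
    # Constant-time comparison; nothing in the cookie names the captcha provider.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    return int(expiry) > current


cookie = mint_cookie("old-secret", now=1_000_000)
print(verify_cookie("old-secret", cookie, now=1_000_000))  # True
print(verify_cookie("new-secret", cookie, now=1_000_000))  # False
```

Because the signature depends only on the secret and the expiry, it does not matter which provider verified the user, and changing the secret makes every previously issued signature fail.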

Logging

You will not see everything in your logs! Bots are immediately redirected to the captcha system, before an entry in the nginx log is made. Many bots are not even capable of properly handling this redirection, so they never make it to the captcha and vanish without leaving a trail.
You will see a sharp decline in bots. This is not a malfunction but the intention. Some bots learn quickly, and getting listed in AbuseIPDB doesn't encourage them. It looks like they are coded to detect reporting, or some bot owners might receive automated notifications if they get listed, but one way or the other, they avoid sites that put and keep them on public blacklists.

Self hosted CAP

Token format / field name. Cap auto-injects a hidden cap-token field on solve (default name), and the widget docs note tokens are single-use — so don't be alarmed if reloading the verify endpoint with a stale POST body fails.
Server-to-server reachability. The verify endpoint runs in your Nginx workers, which now make an outbound HTTPS call to your self hosted captcha provider. If you've firewalled outbound or are running both Nginx and Cap on the same host, you may save a hop by setting CAP_API_URL=http://127.0.0.1:3000 (Cap's default port). The parse_url helper handles plain http:// automatically — no TLS handshake performed in that case. The widget still needs the public URL though, so you'd typically keep the public HTTPS URL for the widget and only flip to localhost for verification by setting them separately. If you want that split, easiest is to add a fourth env var like CAP_VERIFY_URL that defaults to CAP_API_URL — I leave it as an exercise to the reader.
TLS trust. The same lua_ssl_trusted_certificate directive that lets us reach Cloudflare also covers your Cap instance, assuming it's using a publicly-trusted cert (Let's Encrypt etc.). If Cap is on a private CA, point that directive at a bundle that includes your CA.
Falling back to Turnstile. Comment out or delete the CAP_SITE_KEY line in /etc/nginx/captcha.env and restart. Provider auto-flips back to Turnstile.
CSP, if you have one. Cap loads its widget script from cdn.jsdelivr.net and its WASM from the same CDN by default. If you've added a strict CSP, you'll need script-src 'self' cdn.jsdelivr.net 'wasm-unsafe-eval'. For pinning to a specific version, replace cap-widget in the script src with cap-widget@ — check the latest release on the project's GitHub.
Performance & UX. Cap's PoW is invisible-style: the user clicks one checkbox, then watches a brief spinner. Solve time depends on the client device (Cap reports a default-difficulty solve at roughly 2–3s on modern hardware) — much snappier than image puzzles, but slightly more "interactive" than Turnstile's typical zero-click case.
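The provider fallback and the suggested verify/widget URL split can be sketched as follows. This is an illustration in Python, not the logic in captcha.lua: the function select_provider and its return keys are made up for the example, and CAP_VERIFY_URL is the hypothetical extra variable proposed above, not something the shipped code reads.

```python
from typing import Dict


def select_provider(env: Dict[str, str]) -> Dict[str, str]:
    """Pick the captcha provider from already-parsed env values."""
    site_key = env.get("CAP_SITE_KEY", "").strip()
    if not site_key:
        # An empty or missing CAP_SITE_KEY flips back to Turnstile.
        return {"provider": "turnstile"}
    api_url = env.get("CAP_API_URL", "").strip()
    # Hypothetical CAP_VERIFY_URL: used for server-to-server verification,
    # defaulting to CAP_API_URL when it is unset or empty.
    verify_url = env.get("CAP_VERIFY_URL", "").strip() or api_url
    return {"provider": "cap", "widget_url": api_url, "verify_url": verify_url}


split = select_provider({
    "CAP_SITE_KEY": "1a9933aa22",
    "CAP_API_URL": "https://captcha.thefloatinglab.world",
    "CAP_VERIFY_URL": "http://127.0.0.1:3000",
})
print(split["widget_url"])  # https://captcha.thefloatinglab.world
print(split["verify_url"])  # http://127.0.0.1:3000
```

With this split the browser keeps loading the widget from the public HTTPS URL, while the verification call stays on localhost.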

AbuseIPDB

Reporting to AbuseIPDB is not just for the benefit of others; it benefits you too! Abusers have their own lists, and you might end up on their list of "sites to avoid because they report", which might carry over to other services on your site(s) as well.
A threshold of 10 per two hours is a reasonable default, but tweak WALKAWAY_THRESHOLD and WALKAWAY_TTL to taste. With WALKAWAY_TTL = 3600, the counter auto-expires after an hour of silence, so a slow trickle never builds up. I have my TTL set to two hours.
One report per IP per 15 minutes. The ts_reported:add() with REPORT_COOLDOWN makes sure you don't spam AbuseIPDB if a botnet member keeps hitting you. Free tier caps at 1000 reports/day; with this design you'd need ~700 distinct repeat-offender IPs/day to come close.
Behind a CDN / reverse proxy? ngx.var.remote_addr would be the proxy's IP, not the client's. Either configure ngx_http_realip_module (set_real_ip_from/real_ip_header X-Forwarded-For) so $remote_addr reflects the real client, or change the calls to ngx.var.http_x_forwarded_for (and parse out the first hop yourself). Don't ship to AbuseIPDB without verifying which IP you're sending — reporting your CDN's IP would be embarrassing.
Reset on solve is per-IP. If a real human eventually solves from the same NAT/IP, the counter clears for the whole address. Good for shared exits.
Don't report your own monitoring. If you have uptime checks (UptimeRobot, Pingdom, internal Prometheus blackbox) hitting an HTML path without ever solving the challenge, they'll trigger this. Either point them at a path that bypasses the gate, give them a static __ts_verified cookie minted manually with a far-future expiry, or whitelist their IPs in note_walkaway.
Tor and shared VPNs. Mass reports against shared exit nodes are controversial. If your searx instance is publicly listed, expect a chunk of legitimate traffic from Tor exits — consider raising the threshold to 25–50, or excluding well-known shared-exit ASNs.
Where the counters live. lua_shared_dict is per-Nginx-instance, in shared memory across all workers, and lost on restart. That's fine here — we don't need persistence; a bot will simply rebuild its score within an hour. The dict is sized 10 MB, which fits ~100k IPs comfortably.
Async timing. ngx.timer.at(0, …) returns immediately, so the user's HTTP response (the 200 challenge page or the 302 redirect) is never delayed by the AbuseIPDB call. The report happens in a background light-thread inside the same worker.
Audit trail. Every report logs to error.log at notice level with the IP and the count, so you can grep 'reported.*AbuseIPDB' /var/log/nginx/error.log | wc -l for a daily tally. If you want richer accounting (which paths the bot hit, user-agent, ASN), you can pass them through to the timer and stitch them into the comment field — AbuseIPDB shows the comment verbatim on the IP's public page.
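The counting rules described in this section (threshold, TTL, report cooldown, reset on solve) and the "first hop" parsing of X-Forwarded-For can be sketched in Python. This is an illustration only: the real logic lives in captcha.lua and uses lua_shared_dict, the class WalkawayTracker and the helper first_hop are invented for this example, and two behaviours are assumptions: that the TTL is refreshed on every hit, and that a report fires once the count reaches the threshold.

```python
import ipaddress
import time
from typing import Callable, Dict, Optional, Tuple

WALKAWAY_THRESHOLD = 10  # unsolved challenges before an IP is reported
WALKAWAY_TTL = 3600      # seconds of silence after which the counter expires
REPORT_COOLDOWN = 900    # at most one report per IP per 15 minutes


def first_hop(x_forwarded_for: str) -> Optional[str]:
    """Return the leftmost X-Forwarded-For entry if it is a valid IP address."""
    candidate = x_forwarded_for.split(",")[0].strip()
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None  # never report something that is not an IP address


class WalkawayTracker:
    """In-memory stand-in for the shared dicts; all state is lost on restart."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.counts: Dict[str, Tuple[int, float]] = {}  # ip -> (count, expires_at)
        self.reported: Dict[str, float] = {}            # ip -> cooldown_until

    def note_walkaway(self, ip: str) -> bool:
        """Count one unsolved challenge; return True when a report is due."""
        now = self.clock()
        count, expires_at = self.counts.get(ip, (0, 0.0))
        if expires_at <= now:
            count = 0  # counter expired after WALKAWAY_TTL of silence
        count += 1
        self.counts[ip] = (count, now + WALKAWAY_TTL)
        if count < WALKAWAY_THRESHOLD:
            return False
        if self.reported.get(ip, 0.0) > now:
            return False  # still inside the cooldown, do not report again
        self.reported[ip] = now + REPORT_COOLDOWN
        return True

    def note_solved(self, ip: str) -> None:
        """A solved captcha clears the counter for the whole address."""
        self.counts.pop(ip, None)


print(first_hop("203.0.113.7, 10.0.0.1"))  # 203.0.113.7
```

In this sketch the tenth unsolved challenge from one address returns True once, further hits within 15 minutes return False, and a solve from the same address starts the count from zero again.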

License

See the license file. The original of this project can be found at git.thefloatinglab.world/TheFloatingLab/SearxNG-Captcha, which is part of www.thefloatinglab.world.
