SearxNG-Captcha
Instructions and code to add a 90-day captcha cookie to SearxNG. The captcha engine can be self-hosted. This setup almost completely eliminated bot abuse of my instance.
Introduction
My SearxNG instance on searx.thefloatinglab.world suffered so badly from bots that it became unusable. Looking around at fellow SearxNG instances, I learned that they all suffer from the same problem.
Despite the deployment of botlists, limiters, and manual blocking of the most obvious bots, it remained an endless battle. A new problem is the rise of slow trickle bots. They don't hammer the site but launch a request once every 5 minutes or so. If you have 50 or so of these, you have a request every few seconds, enough to get the instance blocked by the upstream providers. And these slow trickle bots are indistinguishable from normal users... that is, until they hit a captcha...
My instance had become useless most of the time due to these bots. So either I had to give up on this project, or find a way to block the endless stream of bots and let the genuine users through. A captcha portal might not be popular, but on the other hand, a useless site is, well, pretty useless.
By combining the captcha system with a cookie, the user has to click the captcha only once every 90 days. A small price to pay for access to a wonderful instance!
Features
- No modification of the SearxNG code is necessary; the captcha system runs entirely within Nginx.
- The captcha, once clicked, stays valid for 90 days.
- No puzzles to solve, just a confirmation click.
- It is recommended to self host the open source CAP captcha system, so no information leaks to the outside world.
- The privacy and security of SearxNG are fully maintained if the self hosted open source CAP captcha system is used.
- Effective against slow trickle bots.
- Optional automatic reporting to AbuseIPDB.
- Optionally, Cloudflare Turnstile can be used as captcha provider instead.
- Everything is script based, no compilation is necessary.
- In an emergency, all existing cookies can be invalidated at once.
Installation
This guide is not a complete walkthrough. I show you my code and how I have done it. Some tweaking might be necessary, and the locations, ownerships and permissions of files on your system might be different from mine. I have not made an attempt to add subdirectories to this repository, so you have to download the files and place them manually in the correct locations.
Prerequisites
- Nginx is used for the reverse proxying
- Lua and some dependencies need to be installed (apt install lua)
- It is recommended to self host the open source captcha engine, see: github.com/tiagozip/cap.
- A site/secret key set and URL/API-key for the captcha engine.
- Optionally, an API key for AbuseIPDB.
00-captcha-init.conf
This file resides on my system in "/etc/nginx/conf.d".
It configures Lua and also creates an extended log format. This log format is optional, but it allows you to see in the log file whether someone got past the captcha. Bots will rarely progress beyond the "challenge" state, if at all.
captcha.env
This file is in my /etc/nginx directory.
The COOKIE_SECRET must be generated with "openssl rand -hex 32"
COOKIE_SECRET = <generate your own key>
ABUSEIPDB_API_KEY = <optional: obtain a key at abuseipdb for automated bot reporting, or leave empty for no bot reporting>
# Enter here the url of your self hosted CAP captcha provider.
CAP_API_URL = https://captcha.thefloatinglab.world
# Enter here your own keys:
CAP_SITE_KEY = 1a9933aa22
CAP_SECRET_KEY = sk-TF8Gn4KKMSC0h46j83AqZWNnga6nlc5v4hoHwn7nE
# Leave the CAP entries empty to use the Turnstile captcha instead.
TURNSTILE_SITE_KEY = 0x4AAAAAADisco1ig4Qu4hPJ
TURNSTILE_SECRET_KEY = 0x4AAAAAADisca-OEq9hnPskVM6G57pTXsM
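How the provider choice could follow from this file is sketched below. This is only an illustration, not the code in captcha.lua: the function names parse_env and pick_provider are hypothetical, and the sketch assumes the file consists of "KEY = value" lines plus "#" comment lines, as shown above.

```lua
-- Hypothetical sketch: read "KEY = value" lines and choose a captcha provider.
-- parse_env and pick_provider are illustrative names, not taken from captcha.lua.

local function parse_env(text)
  local env = {}
  for line in text:gmatch("[^\r\n]+") do
    -- Ignore comment lines; a commented-out key therefore counts as unset.
    if not line:match("^%s*#") then
      local key, value = line:match("^%s*([%w_]+)%s*=%s*(.-)%s*$")
      if key then env[key] = value end
    end
  end
  return env
end

local function pick_provider(env)
  -- CAP is used when its site key is present and non-empty,
  -- otherwise the sketch falls back to Turnstile.
  local site_key = env.CAP_SITE_KEY
  if site_key ~= nil and site_key ~= "" then
    return "cap"
  end
  return "turnstile"
end
```

With this logic, an empty, deleted, or commented-out CAP_SITE_KEY all lead to the Turnstile branch.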
captcha.conf
This file is in my /etc/nginx/snippets directory.
Do not modify this file.
captcha-whitelist.txt
This file is in my /etc/nginx directory.
This is an IP whitelist. IP addresses listed here are exempt from the captcha and cookie check. If you have any monitors (uptime) regularly probing the site, put them in this list. If you are on the public SearxNG instance list, it is a good idea to put the IP of check.searx.space here. If you have a fixed IP yourself, you could add it here if you don't want to have to click the captcha every three months.
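A whitelist lookup of this kind can be sketched in a few lines. The format is an assumption here (one address per line, "#" starting a comment, exact matches only, no CIDR ranges); check the file shipped with the project for the real layout. The names load_whitelist and is_whitelisted are hypothetical.

```lua
-- Hypothetical sketch of an exact-match IP whitelist.
-- Assumed format: one address per line, "#" starts a comment.

local function load_whitelist(text)
  local allowed = {}
  for line in text:gmatch("[^\r\n]+") do
    -- Strip a trailing comment, then trim surrounding whitespace.
    local entry = line:gsub("#.*$", ""):match("^%s*(.-)%s*$")
    if entry ~= "" then
      allowed[entry] = true
    end
  end
  return allowed
end

local function is_whitelisted(allowed, ip)
  return allowed[ip] == true
end
```

Inside Nginx the address to test would typically be ngx.var.remote_addr, and the table would be built once at startup rather than on every request.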
captcha.lua
This file is in my /etc/nginx/lua directory.
This file contains the core of the code.
SearxNG vhost
You have to modify your nginx searxng vhost file to run the captcha.
- Add the optional line "access_log /var/log/nginx/searx.access.log ts;" for enhanced logging features.
- Add the required line "include snippets/captcha.conf;"
- Add the line "access_by_lua_block { require("captcha").guard() }" in every "location" block that needs protection.
access_log /var/log/nginx/searx.access.log ts; # <-- Optional! The "ts" suffix indicates the extended log format so captcha status is shown.
include snippets/captcha.conf; # <-- REQUIRED!
location / {
    access_by_lua_block { require("captcha").guard() } # <-- Add this!
    proxy_pass http://127.0.0.1:8886;
}
Things worth knowing
- Single-use tokens + your 90-day cookie. Both providers issue tokens that are good for one verify call, after which your cookie carries the user. The cookie is provider-agnostic, so an existing __ts_verified cookie continues to work after you switch providers — if the same COOKIE_SECRET is still in the env file. Rotating that secret invalidates all passes regardless of who issued them.
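To illustrate why rotating COOKIE_SECRET invalidates every pass at once, here is a minimal sketch of a signed, expiring cookie value. The layout "expiry.signature" and the names make_cookie and verify_cookie are assumptions for illustration; the real __ts_verified cookie may be built differently. The HMAC function is passed in as a parameter so the sketch runs in plain Lua.

```lua
-- Hypothetical sketch of a signed cookie value: "<expiry>.<hex signature>".
-- The hmac argument is any function(secret, message) returning a byte string.

local function to_hex(s)
  return (s:gsub(".", function(c) return string.format("%02x", c:byte()) end))
end

local function make_cookie(secret, expires, hmac)
  local payload = string.format("%d", expires)
  return payload .. "." .. to_hex(hmac(secret, payload))
end

local function verify_cookie(value, secret, now, hmac)
  if type(value) ~= "string" then return false end
  local payload, sig = value:match("^(%d+)%.(%x+)$")
  if not payload then return false end
  -- A cookie signed with a different secret fails here.
  if to_hex(hmac(secret, payload)) ~= sig then return false end
  -- A correctly signed cookie is still rejected once it has expired.
  return tonumber(payload) > now
end

-- Inside OpenResty one could pass the built-in HMAC, for example:
--   make_cookie(secret, ngx.time() + 90 * 86400, ngx.hmac_sha1)
```

Because the signature depends only on the secret and the expiry, the captcha provider that issued the original token plays no role in later checks.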
Logging
- You will not see everything in your logs! Bots are immediately redirected to the captcha system, before an entry in the nginx log is made. Many bots are not even capable of properly interfacing with this redirection and simply never make it to the captcha, and vanish without leaving a trail.
- You will see a sharp decline in bots. This is not a malfunction but the intention. Some bots learn quickly, and getting listed in AbuseIPDB doesn't encourage them. It looks like they are coded to detect reporting, or some bot owners might receive automated notifications if they get listed, but one way or the other, they avoid sites that put and keep them on public blacklists.
Self hosted CAP captcha provider
- I'm not affiliated in any way with this CAP self hosted captcha provider, but it looks like a sound project to me. You can fall back on Cloudflare Turnstile if you have more confidence in them, but beware that Cloudflare does some logging and analysis which partly defeats the purpose of SearxNG.
- Token format / field name. Cap auto-injects a hidden cap-token field on solve (default name), and the widget docs note tokens are single-use — so don't be alarmed if reloading the verify endpoint with a stale POST body fails.
- Server-to-server reachability. The verify endpoint runs in your Nginx workers, which now make an outbound HTTPS call to your self hosted captcha provider. If you've firewalled outbound or are running both Nginx and Cap on the same host, you may save a hop by setting CAP_API_URL=http://127.0.0.1:3000 (Cap's default port). The parse_url helper handles plain http:// automatically — no TLS handshake performed in that case. The widget still needs the public URL though, so you'd typically keep the public HTTPS URL for the widget and only flip to localhost for verification by setting them separately. If you want that split, easiest is to add an additional env var like CAP_VERIFY_URL that defaults to CAP_API_URL — I leave it as an exercise to the reader.
- TLS trust. The same lua_ssl_trusted_certificate directive that lets us reach Cloudflare also covers your Cap instance, assuming it's using a publicly-trusted cert (Let's Encrypt etc.). If Cap is on a private CA, point that directive at a bundle that includes your CA.
- Falling back to Turnstile. Comment out or delete the CAP_SITE_KEY line in /etc/nginx/captcha.env and restart. Provider auto-flips back to Turnstile.
- CSP, if you have one. Cap loads its widget script from cdn.jsdelivr.net and its WASM from the same CDN by default. If you've added a strict CSP, you'll need script-src 'self' cdn.jsdelivr.net 'wasm-unsafe-eval'. For pinning to a specific version, replace cap-widget in the script src with cap-widget@ followed by the version number; check the latest release on the project's GitHub.
- Performance & UX. Cap's PoW is invisible-style: the user clicks one checkbox, then watches a brief spinner. Solve time depends on the client device (Cap reports a default-difficulty solve at roughly 2–3s on modern hardware) — much snappier than image puzzles, but slightly more "interactive" than Turnstile's typical zero-click case.
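The split between a public widget URL and a local verification URL mentioned above could look roughly like this. It is a sketch under assumptions: CAP_VERIFY_URL is the additional variable suggested in the text, verify_base_url is a hypothetical name, and this parse_url is a simplified stand-in written for the example, not the helper from captcha.lua.

```lua
-- Hypothetical sketch: CAP_VERIFY_URL falls back to CAP_API_URL when unset.
-- env is assumed to be a table of already-parsed captcha.env values.

local function verify_base_url(env)
  local url = env.CAP_VERIFY_URL
  if url == nil or url == "" then
    url = env.CAP_API_URL
  end
  return url
end

-- Simplified stand-in for a URL parser; handles http:// and https:// only.
local function parse_url(url)
  local scheme, host, port, path = url:match("^(https?)://([^:/]+):?(%d*)(.*)$")
  if not scheme then
    return nil, "unsupported URL: " .. tostring(url)
  end
  if path ~= "" and path:sub(1, 1) ~= "/" then
    return nil, "malformed URL: " .. url
  end
  return {
    scheme = scheme,
    host = host,
    -- Default ports: 443 for https, 80 for http.
    port = tonumber(port) or (scheme == "https" and 443 or 80),
    path = (path ~= "" and path) or "/",
    -- Only https needs a TLS handshake.
    tls = (scheme == "https"),
  }
end
```

The widget would keep using CAP_API_URL, so browsers still load it from the public HTTPS address while the server-side check stays on localhost.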
AbuseIPDB
- Reporting to AbuseIPDB is not just for others; it benefits you too! Abusers have their own lists, and you might end up on their lists of "sites to avoid because they report us", and the effects might carry over to other services on your site(s) as well. Having said that, I have submitted over a million automated reports to AbuseIPDB about bots that probe port 25, try to guess SSH credentials, etc., so the problem is huge. But every bit helps, and the resulting list was what I used before to weed out at least half of the offenders.
- A threshold of 10 "walk-aways" (not solving the captcha) per two hours is a reasonable default, but tweak WALKAWAY_THRESHOLD and WALKAWAY_TTL to taste. With WALKAWAY_TTL = 3600, the counter auto-expires after an hour of silence, so a slow trickle never builds up. I have my WALKAWAY_TTL set to two hours (7200 seconds).
- One report per IP per 15 minutes. The ts_reported:add() with REPORT_COOLDOWN makes sure you don't spam AbuseIPDB if a botnet member keeps hitting you. Free tier caps at 1000 reports/day; with this design you'd need ~700 distinct repeat-offender IPs/day to come close.
- Behind a CDN / reverse proxy? ngx.var.remote_addr would be the proxy's IP, not the client's. Either configure ngx_http_realip_module (set_real_ip_from/real_ip_header X-Forwarded-For) so $remote_addr reflects the real client, or change the calls to ngx.var.http_x_forwarded_for (and parse out the first hop yourself). Don't ship to AbuseIPDB without verifying which IP you're sending — reporting your CDN's IP would be embarrassing.
- Reset on solve is per-IP. If a real human eventually solves from the same NAT/IP, the counter clears for the whole address. Good for shared exits.
- Don't report your own monitoring. If you have uptime checks (UptimeRobot, Pingdom, internal Prometheus blackbox) hitting an HTML path without ever solving the challenge, they'll trigger this. Put their IPs in the provided captcha-whitelist.txt file.
- Tor and shared VPNs. Mass reports against shared exit nodes are controversial. If your searx instance is publicly listed, expect a chunk of legitimate traffic from Tor exits — consider raising the threshold to 25–50, or excluding well-known shared-exit ASNs.
- Where the counters live. lua_shared_dict is per-Nginx-instance, in shared memory across all workers, and lost on restart. That's fine here — we don't need persistence; a bot will simply rebuild its score within an hour. The dict is sized 10 MB, which fits ~100k IPs comfortably.
- Async timing. ngx.timer.at(0, …) returns immediately, so the user's HTTP response (the 200 challenge page or the 302 redirect) is never delayed by the AbuseIPDB call. The report happens in a background light-thread inside the same worker.
- Audit trail. Every report logs to error.log at notice level with the IP and the count, so you can grep 'reported.*AbuseIPDB' /var/log/nginx/error.log | wc -l for a daily tally. If you want richer accounting (which paths the bot hit, user-agent, ASN), you can pass them through to the timer and stitch them into the comment field — AbuseIPDB shows the comment verbatim on the IP's public page.
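The interplay of walk-away counter, report cooldown, and reset on solve can be sketched as follows. This is an illustration under assumptions, not the code from captcha.lua: new_dict is a small in-memory stand-in that mimics the add, incr and delete methods of an Nginx shared dict so the example runs in plain Lua, note_walkaway and note_solved are hypothetical names, and the counter here uses a fixed window that starts at the first walk-away.

```lua
-- Hypothetical sketch of threshold + cooldown logic.
-- new_dict mimics the add/incr/delete semantics of ngx.shared dicts.

local WALKAWAY_THRESHOLD = 10
local WALKAWAY_TTL = 7200     -- seconds (two hours)
local REPORT_COOLDOWN = 900   -- seconds (15 minutes)

local function new_dict(clock)
  local store, dict = {}, {}
  local function live(key)
    local e = store[key]
    if e and e.expires and e.expires <= clock() then
      store[key] = nil
      e = nil
    end
    return e
  end
  local function deadline(ttl)
    return ttl and ttl > 0 and (clock() + ttl) or nil
  end
  function dict:add(key, value, ttl)
    -- Fails when the key already exists, like the shared-dict add().
    if live(key) then return false, "exists" end
    store[key] = { value = value, expires = deadline(ttl) }
    return true
  end
  function dict:incr(key, step, init, init_ttl)
    local e = live(key)
    if not e then
      if init == nil then return nil, "not found" end
      e = { value = init, expires = deadline(init_ttl) }
      store[key] = e
    end
    e.value = e.value + step
    return e.value
  end
  function dict:delete(key)
    store[key] = nil
  end
  return dict
end

-- Returns true when this walk-away should trigger a report.
local function note_walkaway(counters, reported, ip)
  local count = counters:incr(ip, 1, 0, WALKAWAY_TTL)
  if not count or count < WALKAWAY_THRESHOLD then return false end
  -- add() succeeds only once per cooldown period, so it acts as a lock.
  return reported:add(ip, true, REPORT_COOLDOWN) == true
end

-- A solved captcha clears the counter for the whole address.
local function note_solved(counters, ip)
  counters:delete(ip)
end
```

In OpenResty the two tables would be real shared dicts such as ngx.shared.ts_reported, and a true result would schedule the actual report in the background with ngx.timer.at(0, ...).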
License
See the license file. The original of this project can be found at git.thefloatinglab.world/TheFloatingLab/SearxNG-Captcha which is part of www.thefloatinglab.world
