Added a whitelist capability

2026-06-13 04:58:41 +00:00 · 2026-06-13 04:58:41 +00:00 · f240382b16
commit f240382b16
parent b64f38d52a
4 changed files with 162 additions and 12 deletions
--- a/README.md
+++ b/README.md
@@ -5,7 +5,11 @@ Instructions and code to add a 90-day captcha cookie to SearxNG. The captcha eng
 # Introduction
 My SearxNG instance on [searx.thefloatinglab.world](https://searx.thefloatinglab.world) suffered so badly from bots that it became unusable. Looking around at fellow SearxNG instances, I learned that they all suffer from the same problem.

-Despite the deployment of botlists, limiters, and manually blocking the most obvious bots, it still remained an endless battle. Most of the time my instance was useless. So either I had to give up on this project, or find a way to block the bots and let the genuine users through. A captcha system might not be popular, but on the other hand, a useless site is well, pretty useless.
+Despite the deployment of botlists, limiters, and manually blocking the most obvious bots, it still remained an endless battle. 
+A new problem is the rise of slow trickle bots. They don't hammer the site but launch a request once per 5 minutes or so. If you have 50 or so of these, you have a request every few seconds, enough to get the instance blocked by the upstream providers. And these slow trickle bots are indistinguishable from normal users... that is, until they hit a captcha...
+
+Most of the time my instance was useless. So either I had to give up on this project, or find a way to block the bots and let the genuine users through. A captcha system might not be popular, but on the other hand, a useless site is, well, pretty useless.
+
 By combining the captcha system with a cookie, the user has to solve the captcha only once every 90 days. A small price to pay for access to a wonderful instance!

 # Features
@@ -13,7 +17,8 @@ By combining the captcha system with a cookie, only once in 90 days the user has
 - The captcha, once solved, stays valid for 90 days.
 - No puzzles to solve, just a confirmation click.
 - It is recommended to [self host the open source captcha system](https://github.com/tiagozip/cap), so no information leaks to the outside world.
-- The privacy and security of SearxNG are fully maintained if this self hosted captcha system is used.
+- The privacy and security of SearxNG are fully maintained if this self hosted open source captcha system is used.
+- Effective against slow trickle bots.
 - Optional automatic reporting to [AbuseIPDB](https://www.abuseipdb.com).
 - Optionally, Cloudflare Turnstile can be used as captcha provider instead.
 - Everything is script based, no compilation is necessary.
@@ -22,7 +27,7 @@ By combining the captcha system with a cookie, only once in 90 days the user has
 ![picture](https://git.thefloatinglab.world/TheFloatingLab/SearxNG-Captcha/raw/branch/main/captcha.jpg?raw=1)

 # Installation
-This guide is not a complete walk through. I show you my code and how I have done it. Some tweaking might be necessary, and the locations of files on your system might be different than on mine.
+This guide is not a complete walk through. I show you my code and how I have done it. Some tweaking might be necessary, and the locations, ownerships and permissions of files on your system might be different than on mine.
 I have not made an attempt to add subdirectories to this git, so you have to download the files and place them manually in the correct locations.

 ## Prerequisites
@@ -60,6 +65,14 @@ TURNSTILE_SECRET_KEY = 0x4AAAAAADisca-OEq9hnPskVM6G57pTXsM

 Do not modify this file.

+## captcha-whitelist.txt
+*This file is in my /etc/nginx directory.*
+
+This is an IP whitelist. IP addresses listed here are exempt from the captcha and cookie check.
+If you have any (uptime) monitors that regularly probe the site, put their IPs in this list.
+If you are on the public searxng instance list, it is a good idea to put the IP of check.searx.space here.
+If you have a fixed IP yourself, you could add it here if you don't want to have to click on the captcha every three months.
+
 ## captcha.lua
 *This file is in my /etc/nginx/lua directory.*

@@ -71,7 +84,6 @@ You have to modify your nginx searxng vhost file to run the captcha.
 - Add the optional line "access_log /var/log/nginx/searx.access.log ts;" for enhanced logging features.
 - Add the required line "include snippets/captcha.conf;"
 - Add the line "access_by_lua_block { require("captcha").guard() }" in every "Location" block that needs protection.
-- Add "Location" blocks for locations that should not be protected, such as the ones used for health checks/monitoring.
 - Duplicate your root location "/" to "/searxng/" to catch the bots that enter from there.

 You likely have no "location" for "/searxng/stats" yet, but if you use a health checker or monitor bot on the /stats directory, you can add it without the reference to the captcha system, so it remains accessible without the need for solving the captcha first.
@@ -83,11 +95,6 @@ Most bots search by using "/?q=" but some also from "/searxng/?q=". So both loca

    include snippets/captcha.conf; # <-- REQUIRED!

-    # Add this location if you want to keep /searxng/stats captcha free. A reason to do this is that you might have it checked by a monitor bot (uptime).
-    location = /searxng/stats {
-        proxy_pass http://127.0.0.1:8886;
-    }
-
    # You need to mention this location specifically to catch the bots that do not search via the root but via /searxng.
    location /searxng/ {
        access_by_lua_block { require("captcha").guard() } # <-- Add this!
@@ -116,12 +123,12 @@ Most bots search by using "/?q=" but some also from "/searxng/?q=". So both loca
 - Performance & UX. Cap's PoW is invisible-style: the user clicks one checkbox, then watches a brief spinner. Solve time depends on the client device (Cap reports a default-difficulty solve at roughly 2–3s on modern hardware) — much snappier than image puzzles, but slightly more "interactive" than Turnstile's typical zero-click case.

 ## AbuseIPDB
-- Reporting to [AbuseIPDB](https://www.abuseipdb.com) is not just for others but it benefits you too! Abusers have their own lists, and you might end up on their lists for "sites to avoid because they report" and it might carry over to other services on your site(s) as well.
-- Threshold of 10 per two hours is a reasonable default but tweak WALKAWAY_THRESHOLD and WALKAWAY_TTL to taste. With WALKAWAY_TTL = 3600, the counter auto-expires after an hour of silence, so a slow trickle never builds up. I have my treshold set on two hours.
+- Reporting to [AbuseIPDB](https://www.abuseipdb.com) is not just for others but it benefits you too! Abusers have their own lists, and you might end up on their lists for "sites to avoid 'cause them report us" and the effects might carry over to other services on your site(s) as well. Having said that, I have submitted *over a million* automated reports to AbuseIPDB about bots that probe port 25, try to guess SSH credentials, etc., so the problem is huge. But every bit helps, and the resulting list was what I used before to weed out at least half of the offenders.
+- A threshold of 10 "walk-aways" (not solving the captcha) per two hours is a reasonable default, but tweak WALKAWAY_THRESHOLD and WALKAWAY_TTL to taste. With WALKAWAY_TTL = 3600, the counter auto-expires after an hour of silence, so a slow trickle never builds up. I have my threshold set to two hours.
 - One report per IP per 15 minutes. The ts_reported:add() with REPORT_COOLDOWN makes sure you don't spam AbuseIPDB if a botnet member keeps hitting you. Free tier caps at 1000 reports/day; with this design you'd need ~700 distinct repeat-offender IPs/day to come close.
 - Behind a CDN / reverse proxy? ngx.var.remote_addr would be the proxy's IP, not the client's. Either configure ngx_http_realip_module (set_real_ip_from/real_ip_header X-Forwarded-For) so $remote_addr reflects the real client, or change the calls to ngx.var.http_x_forwarded_for (and parse out the first hop yourself). Don't ship to AbuseIPDB without verifying which IP you're sending — reporting your CDN's IP would be embarrassing.
 - Reset on solve is per-IP. If a real human eventually solves from the same NAT/IP, the counter clears for the whole address. Good for shared exits.
-- Don't report your own monitoring. If you have uptime checks (UptimeRobot, Pingdom, internal Prometheus blackbox) hitting an HTML path without ever solving the challenge, they'll trigger this. Either point them at a path that bypasses the gate, give them a static __ts_verified cookie minted manually with a far-future expiry, or whitelist their IPs in note_walkaway.
+- Don't report your own monitoring. If you have uptime checks (UptimeRobot, Pingdom, internal Prometheus blackbox) hitting an HTML path without ever solving the challenge, they'll trigger this. Put their IPs in the provided captcha-whitelist.txt file.
 - Tor and shared VPNs. Mass reports against shared exit nodes are controversial. If your searx instance is publicly listed, expect a chunk of legitimate traffic from Tor exits — consider raising the threshold to 25–50, or excluding well-known shared-exit ASNs.
 - Where the counters live. lua_shared_dict is per-Nginx-instance, in shared memory across all workers, and lost on restart. That's fine here — we don't need persistence; a bot will simply rebuild its score within an hour. The dict is sized 10 MB, which fits ~100k IPs comfortably.
 - Async timing. ngx.timer.at(0, …) returns immediately, so the user's HTTP response (the 200 challenge page or the 302 redirect) is never delayed by the AbuseIPDB call. The report happens in a background light-thread inside the same worker.
--- a/captcha-whitelist.txt
+++ b/captcha-whitelist.txt
@@ -0,0 +1,8 @@
+# IPs and CIDR blocks that bypass the captcha gate entirely.
+# One entry per line. Lines starting with # are comments.
+# Examples:
+#   1.2.3.4
+#   10.0.0.0/24
+#   2001:db8::/32
+127.0.0.1
+::1
--- a/captcha.jpg
+++ b/captcha.jpg
--- a/captcha.lua
+++ b/captcha.lua
@@ -34,6 +34,140 @@ do
    end
 end

+-- ---------- whitelist ----------
+local WHITELIST_FILE = "/etc/nginx/captcha-whitelist.txt"
+local wl4, wl6 = {}, {}   -- wl4: {{net=..., mask=...}, ...}; wl6: {{groups=..., bits=...}, ...}
+
+local function ip4_to_n(ip)
+    local a, b, c, d = ip:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
+    if not a then return nil end
+    a, b, c, d = tonumber(a), tonumber(b), tonumber(c), tonumber(d)
+    for _, n in ipairs({a, b, c, d}) do
+        if not n or n < 0 or n > 255 then return nil end
+    end
+    -- Build via arithmetic so this works on plain Lua 5.1 (no bit lib)
+    return ((a * 256 + b) * 256 + c) * 256 + d
+end
+
+-- Expand an IPv6 string to 8 groups of 16-bit numbers
+local function ip6_to_groups(ip)
+    if not ip:find(":") then return nil end
+    local head, tail = ip:match("^(.-)::(.*)$")
+    local h_parts, t_parts = {}, {}
+    if head then
+        for g in (head .. ":"):gmatch("([^:]*):") do h_parts[#h_parts+1] = g end
+        for g in (tail .. ":"):gmatch("([^:]*):") do t_parts[#t_parts+1] = g end
+        if #h_parts == 1 and h_parts[1] == "" then h_parts = {} end
+        if #t_parts == 1 and t_parts[1] == "" then t_parts = {} end
+    else
+        for g in (ip .. ":"):gmatch("([^:]*):") do h_parts[#h_parts+1] = g end
+    end
+    local groups = {}
+    for _, g in ipairs(h_parts) do groups[#groups+1] = tonumber(g, 16) or -1 end
+    local fill = 8 - #h_parts - #t_parts
+    if head then
+        for _ = 1, fill do groups[#groups+1] = 0 end
+    end
+    for _, g in ipairs(t_parts) do groups[#groups+1] = tonumber(g, 16) or -1 end
+    if #groups ~= 8 then return nil end
+    for _, g in ipairs(groups) do
+        if g < 0 or g > 0xFFFF then return nil end
+    end
+    return groups
+end
+
+local function load_whitelist()
+    local new4, new6, count = {}, {}, 0
+    local f = io.open(WHITELIST_FILE, "r")
+    if not f then
+        ngx.log(ngx.ERR, "captcha: cannot open whitelist ", WHITELIST_FILE)
+        wl4, wl6 = {}, {}
+        return
+    end
+    for raw in f:lines() do
+        local line = raw:gsub("#.*$", ""):match("^%s*(.-)%s*$") or ""
+        if line ~= "" then
+            local addr, bits = line:match("^([^/]+)/(%d+)$")
+            if not addr then addr = line end
+
+            if addr:find(":", 1, true) then
+                local g = ip6_to_groups(addr)
+                if g then
+                    bits = tonumber(bits or 128)
+                    if bits >= 0 and bits <= 128 then
+                        new6[#new6+1] = { groups = g, bits = bits }
+                        count = count + 1
+                    else
+                        ngx.log(ngx.ERR, "captcha: bad whitelist line: ", raw)
+                    end
+                else
+                    ngx.log(ngx.ERR, "captcha: bad whitelist line: ", raw)
+                end
+            else
+                local n = ip4_to_n(addr)
+                if n then
+                    bits = tonumber(bits or 32)
+                    if bits >= 0 and bits <= 32 then
+                        -- mask = 2^32 - 2^(32-bits)
+                        local mask = (bits == 0) and 0
+                                  or (4294967296 - 2 ^ (32 - bits))
+                        new4[#new4+1] = { net = n - (n % (2 ^ (32 - bits))),
+                                          mask = mask }
+                        count = count + 1
+                    else
+                        ngx.log(ngx.ERR, "captcha: bad whitelist line: ", raw)
+                    end
+                else
+                    ngx.log(ngx.ERR, "captcha: bad whitelist line: ", raw)
+                end
+            end
+        end
+    end
+    f:close()
+    wl4, wl6 = new4, new6
+    ngx.log(ngx.ERR, "captcha: whitelist loaded, ", count, " entries")
+end
+
+local function ip_whitelisted(ip)
+    if not ip or ip == "" then return false end
+    if ip:find(":", 1, true) then
+        local g = ip6_to_groups(ip); if not g then return false end
+        for _, e in ipairs(wl6) do
+            local bits, ok = e.bits, true
+            for i = 1, 8 do
+                if bits >= 16 then
+                    if g[i] ~= e.groups[i] then ok = false; break end
+                    bits = bits - 16
+                elseif bits > 0 then
+                    local shift = 16 - bits
+                    -- Partial group: compare only its top `bits` bits (arithmetic mask)
+                    if (g[i] - g[i] % (2 ^ shift)) ~= e.groups[i] - e.groups[i] % (2 ^ shift) then
+                        ok = false
+                    end
+                    break
+                else
+                    break
+                end
+            end
+            if ok then return true end
+        end
+        return false
+    else
+        local n = ip4_to_n(ip); if not n then return false end
+        for _, e in ipairs(wl4) do
+            -- n AND mask == net  →  network match.
+            -- Plain Lua 5.1 has no bitwise operators, so emulate the AND
+            -- arithmetically: a prefix of `bits` bits leaves a host part
+            -- of 2^(32-bits) = 2^32 - mask addresses, and subtracting
+            -- n's remainder modulo that block size clears the host bits.
+            local block = 4294967296 - e.mask
+            if (n - n % block) == e.net then return true end
+        end
+        return false
+    end
+end
+
+load_whitelist()

 local function provider()
    if cfg.cap_site_key ~= "" and cfg.cap_secret_key ~= "" and cfg.cap_url ~= "" then
@@ -293,6 +427,7 @@ end
 -- ---------- public: gate ----------
 function M.guard()
    if ngx.var.uri == VERIFY_PATH then return end
+    if ip_whitelisted(ngx.var.remote_addr) then return end
    if cookie_is_valid(ngx.var["cookie_" .. COOKIE_NAME]) then return end

    -- (re-add any monitoring bypasses you set up earlier here)
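
The IPv4 branch of the whitelist check avoids a bit library by doing the CIDR match with plain arithmetic. The standalone sketch below isolates that idea so it can be tried outside nginx. It assumes only a stock Lua interpreter (5.1 or later). The helpers `parse_cidr` and `matches` are hypothetical names made up for this illustration and are not part of captcha.lua.

```lua
-- Standalone sketch: arithmetic CIDR matching without bitwise operators.
-- parse_cidr() and matches() are illustrative helpers, not captcha.lua API.

-- Convert a dotted-quad string to a number in 0 .. 2^32-1, or nil if invalid.
local function ip4_to_n(ip)
    local a, b, c, d = ip:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
    if not a then return nil end
    a, b, c, d = tonumber(a), tonumber(b), tonumber(c), tonumber(d)
    if a > 255 or b > 255 or c > 255 or d > 255 then return nil end
    return ((a * 256 + b) * 256 + c) * 256 + d
end

-- Turn "10.0.0.0/24" (or a bare address) into { net = ..., block = ... }.
local function parse_cidr(entry)
    local addr, bits = entry:match("^([^/]+)/(%d+)$")
    if not addr then addr, bits = entry, 32 end
    bits = tonumber(bits)
    local n = ip4_to_n(addr)
    if not n or bits > 32 then return nil end
    local block = 2 ^ (32 - bits)   -- number of addresses covered by the prefix
    -- Subtracting the remainder clears the host bits (same effect as AND mask).
    return { net = n - n % block, block = block }
end

-- True if the address falls inside the parsed prefix.
local function matches(ip, cidr)
    local n = ip4_to_n(ip)
    if not n then return false end
    return n - n % cidr.block == cidr.net
end

local lan = parse_cidr("10.0.0.0/24")
print(matches("10.0.0.77", lan))   -- true
print(matches("10.0.1.1", lan))    -- false
```

Storing the block size instead of the mask is a choice made for this sketch only; the committed code keeps `mask` and derives the block size as `4294967296 - e.mask` at match time, which amounts to the same comparison.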