It reminds me of a level of Serious Sam – the one where 1,000s of those headless bomb-toting zombie-soldiers and screamers came pouring at you relentlessly, seemingly to infinity (and beyond).
It was a backdraft. Or the eye of the Zombie-Nado-Cane. When the bad-bots got some air around August 5th – hak4umz.net DDoS or DNS Amplification – fail2ban (and the servers) got burned.
Even the “eye-dee-keff-kuh-may” (TammyBelle’s God Mode Code for DOOM][ ) cheat didn’t help. fail2ban got clobbered: ‘already banned’ once every second in the log and no new bans happening, because bad requests were arriving hundreds or thousands of times per second from hundreds or thousands of bots.
Here is the Serious! problem when you put the fail2ban -vs- the entire globe death match together:
2013-08-13 12:40:37,730 fail2ban.actions: INFO [named-flood] <ip> already banned
2013-08-13 12:40:38,732 fail2ban.actions: INFO [named-flood] <ip> already banned
almost exactly one second apart, hundreds of times, and no new banning going on.
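A quick way to see that one-per-second pattern for yourself is to count the ‘already banned’ lines per jail per second. This is only a rough sketch: the regex is modeled on the two sample lines above, the function name is made up for illustration, and other fail2ban versions may format their log lines differently.

```python
import re
from collections import Counter

# Pattern modeled on the two sample log lines above (an assumption:
# other fail2ban versions may use a different layout).
ALREADY_BANNED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r"fail2ban\.actions: INFO \[(?P<jail>[^\]]+)\] \S+ already banned"
)


def already_banned_per_second(lines):
    """Count 'already banned' lines, keyed by (jail, whole-second timestamp)."""
    counts = Counter()
    for line in lines:
        match = ALREADY_BANNED.match(line)
        if match:
            counts[(match.group("jail"), match.group("ts"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2013-08-13 12:40:37,730 fail2ban.actions: INFO [named-flood] <ip> already banned",
        "2013-08-13 12:40:38,732 fail2ban.actions: INFO [named-flood] <ip> already banned",
    ]
    for (jail, second), n in sorted(already_banned_per_second(sample).items()):
        print(jail, second, n)
```

If nearly every second shows a count of exactly one while the attack is pouring in at hundreds per second, that fits the ‘lagged out’ picture.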
A: fail2ban appears to have a “one second pulse/parse” clock built in to it.
A: So, when 4,000 log entries appear in a log that fail2ban is reading within that one second, fail2ban ‘queues’ (or spools or fifo’s) those 4,000 entries into an internal list and tries to de-queue them one-per-second.
Easier math: (“let’s say”) there are 10 ‘fail regex’ entries pouring into your log per second. Working through the messages from the first second takes fail2ban about 10 seconds at one per second. By the time it gets done, there are roughly 90 more messages/fails waiting. So every second that goes by (in this low-number scenario) the backlog grows by another nine or so entries and never drains. The problem being fail2ban over-run by those headless bomb-toting zombies. The “real world” explanation: fail2ban lags out and becomes combat ineffective. In cop-talk the radio call from Officer fail2ban would be: “Extended”
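The ‘easier math’ can be checked with a tiny simulation. This is a toy model and nothing more: it assumes the one-entry-per-second de-queue guessed at above (that is a hypothesis about fail2ban, not confirmed behavior), and the function name is invented for illustration.

```python
def backlog_over_time(arrivals_per_sec, handled_per_sec, seconds):
    """Return the queue length at the end of each simulated second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrivals_per_sec               # new log entries arrive
        backlog -= min(backlog, handled_per_sec)  # the reader works off what it can
        history.append(backlog)
    return history


if __name__ == "__main__":
    # 10 entries in, 1 entry out, every second, for 10 seconds
    print(backlog_over_time(10, 1, 10))
```

With 10 in and 1 out the queue gains nine entries every second, so after 10 seconds it holds 90. The growth is steady rather than explosive, but it never shrinks while the flood lasts, which is all it takes to leave fail2ban hopelessly behind.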
Now, a “server admin” must consider – besides ‘shutdown -h now’ – is there a solution to the problem? First part of that: what – exactly – is the problem? More Q/A (logic/reasoning):
A: fail2ban says ‘already banned’ and is ‘lagged out’; can’t fight the good fight.
A: Too many log entries per second. fail2ban reads logs and ‘actions’ based on log entries.
Q: So, why don’t you server admins just limit the number of log entries? (Instead of trying to hyper-tune fail2ban, just give it less to do? Remember the old-old server used to say ‘…the previous message repeated ### times…’)
A: Why didn’t I think of that.
The old-old server was a Gentoo box dragged across the millennium boundary by makes and make-installs. It finally wore out this year (it still runs, it was just retired because it had done its duty). A little searching about ‘the previous message repeated’ and I was reminded that that is called rate-limiting. A modern CentOS-6-x86_64 install (not a bunch of custom compiled stuff on a 32-bit Gentoo) uses an ‘out of the box’ rsyslog and doesn’t say things like ‘…the previous message…’ The new stuff says:
imuxsock begins to drop messages from pid 1228 due to rate-limiting
Very little more searching finds:
The docs are a ‘little dated’ (2010) but the essentials are there to solve the problem (problem being ‘too many log entries for poor old fail2ban’).
vim /etc/rsyslog.conf (and add as the 2nd and 3rd uncommented lines):
#### 8.12.13 - try to slow the message floods so fail2ban won't die so much ####
$SystemLogRateLimitInterval 1
$SystemLogRateLimitBurst 5
[Esc]:wq (write and quit)
Now do a /etc/init.d/rsyslog restart or service rsyslog restart (reload does not work, I tried it) and…
Tah-dah! fail2ban can keep up with the log. Some of the abusers (fiery screaming zombies with tater-bombs) get by for a few seconds until the rate-limit/fail2ban get Serious!; but, real-world, they were getting by by the hundreds of thousands before this fix (while poor old fail2ban was over-run or lag-back-buffered).
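To get a feel for what an interval of 1 and a burst of 5 do to a flood, here is a simplified model of that kind of rate limit. It is an assumption-heavy sketch, not rsyslog’s actual code: it treats the limit as a plain fixed window (at most `burst` messages kept per `interval` seconds, the rest dropped), and the function name is made up for illustration.

```python
def rate_limit(timestamps, interval=1, burst=5):
    """Split ascending message timestamps (in seconds) into (kept, dropped).

    Simplified fixed-window model: each window lasts `interval` seconds and
    lets the first `burst` messages through; anything beyond that is dropped.
    """
    kept, dropped = [], []
    window_start = None
    count = 0
    for ts in timestamps:
        # Open a new window once the current one has run its course
        if window_start is None or ts - window_start >= interval:
            window_start = ts
            count = 0
        count += 1
        (kept if count <= burst else dropped).append(ts)
    return kept, dropped


if __name__ == "__main__":
    # 4,000 log entries spread evenly across a single second
    flood = [i / 4000 for i in range(4000)]
    kept, dropped = rate_limit(flood, interval=1, burst=5)
    print(len(kept), "kept,", len(dropped), "dropped")
```

Under this model a 4,000-entries-per-second flood shrinks to five entries per second reaching the log, which is a pace fail2ban can live with.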
It may not be ‘iddqd’ (god/degreelessness mode in ‘that other great fps’), but $SystemLogRateLimitInterval/$SystemLogRateLimitBurst is very close to TammyBelle’s “eye-dee-keff-kuh-may” (megaarmor, weapons and keys) for fail2ban.
Almost as good as Tangy-Bells for break-shishst.
Very happy, ammo added.
*** A minor success/victory ***
49 hours later… fail2ban chugging along ban/unban-ing, much smaller log files, no other services lagged out because of the packet attacks on port 53….