Profanity is an odd beast. Practically everyone knows it when they see it, but the more you try to break down exactly what makes certain words obscene and others not, the harder it becomes to explain. As fast as language evolves and changes, there’s only one thing you can be certain of: somewhere, there’s always someone working hard on new ways to offend other people.
About a month or so ago, I found myself in need of a Node.js profanity filter. Searching the NPM registry turned up a number of possible solutions, but none of them quite matched what I was looking for. I wanted something that wouldn’t replace words blindly, preferably by using tuned regular expressions to avoid censoring relatively innocent communication (a scenario that people refer to as the Scunthorpe problem). I mean, that’s pretty basic stuff, right? Tons of people have probably written code to do exactly that over the years. Someone must’ve gotten around to sharing their solution at some point.
Or so one would think. Instead, nearly everything I found seemed to involve brute-force approaches, usually by matching directly against an immense list of possible obscenities (possibly for reasons of optimization? I know that attacks against badly-written regexes can be a security issue…). The problem with that strategy is that people (or at least 4channers) are constantly generating new ways to spell or signal obscenities. There’s no way for any one programmer to anticipate every possible way of doing so, especially without running the risk of Scunthorpes. Besides, what’s considered offensive is highly dependent on context. What an elementary school or a site geared towards children might want to censor is different from what, say, a forum admin would want if the goal is just to keep the post titles on the front page from resembling a PornHub feed. How could a single list cover both situations? The problem’s difficult enough that many developers just prefer to throw up their hands and admit defeat.
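To make the difference concrete, here's a minimal sketch of the two strategies side by side. None of this is taken from any particular package: the word list, the function names, and the pattern are all made up for illustration, with "hell" standing in for something stronger.

```javascript
// Hypothetical one-word block list; "hell" is a mild stand-in.
const BLOCKLIST = ['hell'];

// Brute-force approach: flag the text if a listed word appears anywhere,
// even buried inside a longer, innocent word.
function naiveHasProfanity(text) {
  const lower = text.toLowerCase();
  return BLOCKLIST.some((word) => lower.includes(word));
}

// Tuned approach: the word has to stand on its own (word boundaries), but
// letters may be stretched ("heeelll") or swapped for a lookalike ("h3ll").
const TUNED = /\bh+[e3]+l{2,}\b/i;

function tunedHasProfanity(text) {
  return TUNED.test(text);
}

console.log(naiveHasProfanity('Hello, Michelle!')); // true (a Scunthorpe)
console.log(tunedHasProfanity('Hello, Michelle!')); // false
console.log(tunedHasProfanity('What the h3ll?'));   // true
```

A single pattern like this covers a family of spellings that a flat list would have to enumerate one by one, though it still only catches the variations its author thought of.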
The way I see it, the point of profanity filters isn’t to eliminate obscene language altogether. It’s to make using curse words harder, or at least difficult enough that 90% of people will accept being bowdlerized rather than putting in the effort to game the system. Sure, bored teenagers with sufficient free time can eventually figure their way past any possible logic one could put in place. That’s inevitable. But most people don’t want to bother putting in that kind of effort, especially if the filter works in a way that keeps the basic thrust of their message intact. If comic strips have taught me anything, it’s that it doesn’t take too much context for someone to figure out what four-letter word one should read in place of $#!%.
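As a rough illustration of that comic-strip approach, the sketch below swaps each matched word for a run of symbols of the same length, leaving the rest of the sentence alone. The helper names (`makeGrawlix`, `censor`) and the symbol set are assumptions of mine for the example, not any library's API.

```javascript
// Symbols to cycle through when building a replacement (an arbitrary choice).
const GRAWLIX_CHARS = '$#!%@&';

// Build a string of the requested length by cycling through the symbols.
function makeGrawlix(length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += GRAWLIX_CHARS[i % GRAWLIX_CHARS.length];
  }
  return out;
}

// Replace every match of `pattern` with a same-length run of symbols.
function censor(text, pattern) {
  // Rebuild the regex with the global flag so every match gets replaced.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const globalPattern = new RegExp(pattern.source, flags);
  // A replacer function's return value is inserted literally, so the "$"
  // in the output is not treated as a special replacement pattern.
  return text.replace(globalPattern, (match) => makeGrawlix(match.length));
}

console.log(censor('What the hell? Hello there.', /\bhell\b/i));
// What the $#!%? Hello there.
```

Keeping the replacement the same length as the original word is one simple way to leave the reader enough context to fill in the blank.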
So I wound up writing my own:
grawlix (npm | github)
It’s aimed at server-side Node for the moment (though, the way it’s written, I doubt it would be very hard to adapt for browser/client-side use). In the interest of time, I chose to limit the number of words I was targeting to a choice few (NSFW), key among them George Carlin’s "Seven Dirty Words." Rather than chase an exhaustive list, I decided to focus on making it easy for other devs to add their own filters as necessary, with the package acting as more of a replacement engine than anything. I eventually introduced a plugin system so as to take a more modular approach. The idea is to allow people to tailor their filters as much as possible to a given situation, picking and choosing what functionality they need to load in. As a proof of concept, I also released a plugin of my own aimed at ethnic and racial slurs:
grawlix-racism (npm | github.)
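For anyone unfamiliar with the general shape of a plugin-driven replacement engine, here's a bare-bones sketch of the idea. To be clear, this is not grawlix's actual API: `createEngine`, `use`, `clean`, and the plugin object layout are all hypothetical names invented for the example.

```javascript
// Hypothetical engine: it owns no word list of its own and only applies
// whatever filters have been loaded into it.
function createEngine() {
  const filters = []; // each filter: { name, pattern }

  return {
    // A "plugin" here is just an object carrying a list of filters.
    use(plugin) {
      filters.push(...plugin.filters);
      return this; // allow chaining
    },
    // Run the text through every loaded filter in order.
    clean(text, mask = '****') {
      return filters.reduce(
        (result, filter) => result.replace(filter.pattern, mask),
        text
      );
    },
  };
}

// A made-up plugin with two mild filters.
const mildWords = {
  name: 'mild-words',
  filters: [
    { name: 'hell', pattern: /\bhell\b/gi },
    { name: 'damn', pattern: /\bdamn\b/gi },
  ],
};

const engine = createEngine().use(mildWords);
console.log(engine.clean('Damn, what the hell?'));
// ****, what the ****?
```

The appeal of this arrangement is that a children's site and a forum front page can share the same engine while loading entirely different sets of filters.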
So how did I do? Hard to say. I haven’t gotten much in the way of feedback, which is a little frustrating. It doesn’t help that I’m positive I’m just reinventing the wheel here (though if anyone else has shared a similar open-source solution along these lines, I have yet to find it). As for the code, it’s possible that I went overboard in making things customizable, to the point that I may be overwhelming the user with options. But it’s hard for me to judge that on my own.
On the plus side, though, I really tested the hell out of this thing. These are the first NPM packages I’ve published where I’ve achieved 100% test coverage according to istanbul — and that’s not including all the extra regex and Scunthorpe checks I wrote. I also put a lot of effort into documenting things as much as I could. In addition to the standard README, I included five additional markdown files in a separate folder, with subjects ranging from word filters to output style configuration.
And that’s where things stand. A good learning experience for me, if nothing else. I just hope that it winds up coming in handy for someone at some point along the line. Even if it’s just to raid my regular expressions and chuck the rest.