I hate censorship as much as anyone, but as a web application developer, there are times when a banned-word list is necessary, especially if the application or site is geared towards younger users or corporate environments.
What This Script Does
The script takes a word or phrase and replaces any of the words you put in the $badwords array with asterisks.
- Case insensitive
- Looks for "leetspeak"-style combinations of foreign characters, numbers and symbols
- Uses regex, so your badword list stays short
- Uses asterisks as the replacement, but you can specify your own character
- Add as many bad words as you like
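To illustrate the leetspeak-matching idea, here's a minimal sketch of one way it can be done. These patterns and function names are my own illustration, not the actual contents of censor.function.php: each letter of a banned word expands into a character class of common substitutions, and the /i modifier handles case insensitivity.

```php
<?php
// Hypothetical substitution table -- a few common leetspeak stand-ins.
$leet = array(
    'a' => '[a@4]',
    'e' => '[e3]',
    'i' => '[i1!]',
    'o' => '[o0]',
    's' => '[s$5]',
);

// Build a case-insensitive regex for one bad word, expanding each
// letter into its substitution class.
function leetPattern($word, $leet) {
    $pattern = '';
    foreach (str_split($word) as $char) {
        $pattern .= isset($leet[$char]) ? $leet[$char] : preg_quote($char, '/');
    }
    return '/' . $pattern . '/i';
}

// Replace every match with the same number of masking characters.
function maskMatches($input, $word, $leet, $char = '*') {
    return preg_replace_callback(leetPattern($word, $leet), function ($m) use ($char) {
        return str_repeat($char, strlen($m[0]));
    }, $input);
}

echo maskMatches('You are such an A$5hat', 'asshat', $leet); // masks "A$5hat" with asterisks
```

The upside of this approach is exactly what the feature list describes: one entry in the bad-word list covers dozens of creative spellings.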
I haven't finished all of the leetspeak filters yet, but I have a good start so far. If you wish to twiddle those to add or remove patterns, they can be found in the censor.function.php file in the master branch.
How to Get It?
Just click on the tar.gz or .zip icons above to download the files. Or if you prefer, you can clone the repo:
$ git clone email@example.com:snipe/banbuilder.git
Check out the README in the repo to get started. It's as simple as including two files and invoking a function.
include('wordlist-regex.php');
include('censor.function.php');
$censored = censorString($input, $badwords);
This filter does not protect you against XSS or
SQL injection attacks, and never will, as that is not its purpose, and attempting to do so
could cause unpredictable results depending on how and where it's implemented. Read up on PDO,
mysql_real_escape_string(), the built-in
PHP sanitizing filters, and OWASP's guidelines
for data validation for more on this.
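Since PDO comes up above, here's a quick sketch of the prepared-statement pattern it enables. I'm using an in-memory SQLite database so the example is self-contained; the table and column names are placeholders, and in practice you'd swap in your own MySQL DSN and credentials.

```php
<?php
// Self-contained PDO example using in-memory SQLite.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (body TEXT)');

// A hostile-looking input -- bound parameters keep it out of the SQL
// string entirely, so it's stored as plain data.
$userComment = "Robert'); DROP TABLE comments;--";
$stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
$stmt->execute(array(':body' => $userComment));

echo $pdo->query('SELECT COUNT(*) FROM comments')->fetchColumn(); // one row, table intact
```

The point is that injection defense belongs at the database layer, not in a profanity filter.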
The Profanity Filter Conundrum
No banned-word list is going to be flawless. A G-rated list will block out the word "screw", but there are certainly legitimate uses for the word "screw".
The word "Dick" can be a crude reference to male genitalia, or to the nickname of a fellow named Richard. Context is the only way to tell the difference, and it's been argued that one cannot censor a language without actually comprehending it, since context is so critical.
If you put "ass" in your bad word array, legitimate words like "class" will be turned into "cl***", so choose your words wisely. This, and a lack of context-understanding, is a limitation of profanity filters in general and isn't unique to this one. It is possible to create a whitelist of words on top of your blacklist, to specify legitimate words that happen to contain an exactly matching swear word (like "assign", "classy", etc), but the creation and maintenance of that list would be impractical, and running every string through it could increase processing time considerably.
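One partial alternative to a whitelist, shown here as a sketch that is not part of banbuilder itself, is to anchor each pattern with \b word boundaries so only standalone occurrences get masked, at the cost of missing profanity embedded in longer strings:

```php
<?php
// Hypothetical helper: censor only whole-word matches, so "class"
// survives an "ass" entry while a standalone "ass" is still masked.
function censorWholeWords($input, array $badwords, $char = '*') {
    foreach ($badwords as $word) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        $input = preg_replace_callback($pattern, function ($m) use ($char) {
            return str_repeat($char, strlen($m[0]));
        }, $input);
    }
    return $input;
}

echo censorWholeWords('My class is first-class, you ass.', array('ass'));
// "class" and "first-class" survive; the standalone "ass" becomes "***"
```

It's a trade-off, not a fix: word boundaries eliminate the "cl***" problem but let "a$$hat"-style embeddings straight through, which is why no single strategy wins.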
"I want to stick my long-necked Giraffe up your fluffy white bunny"
In general, profanity filters just don't work. At least not the way we want them to.
"Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe."
- Jeff Atwood
And let's not forget that there are LOTS of ways to say horribly offensive, degrading and disrespectful things without ever using a single profane word. Check out this fantastic article on Habitat Chronicles for more.
But of course, there are times when we need to give a good best-effort to keep the obviously offensive stuff off of forums, leaderboards, and so on. And that's what this script does.
It perhaps goes without saying that someone who is really determined will find a way to post something awful, regardless of what profanity filter you use. You should know that walking in.
Your application and community management should be prepared with ways to address those issues quickly (for example, the ability to ban a repeat offender, or audience moderation such as content flags that remove an entry once it's been marked as offensive more than x times, etc.).