I was recently tasked with creating a simple dirty word filter in PHP that relies on our blacklist of about 600 words. The filter should simply replace each blacklisted word with a run of asterisks of the same length.
My first attempt just used str_replace, but I quickly realized that even dirty words embedded in clean words would get filtered. For example, words like “glass” would get filtered to “gl***”.
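The problem is easy to reproduce. Here is a minimal sketch, assuming “ass” is the blacklisted word behind the “glass” example; the variable names are only for illustration:

```php
<?php
// Naive substring replacement: every occurrence is replaced,
// even when it sits inside a longer, harmless word.
$needle = 'ass';                                  // assumed blacklist entry
$replacement = str_repeat('*', strlen($needle));  // "***"

$naive = str_replace($needle, $replacement, 'a glass of water');
echo $naive, "\n"; // prints "a gl*** of water"
```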
The solution? Regular expressions:
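// preg_quote() escapes any regex metacharacters (and the / delimiter) in the word.
$pattern = '/\b' . preg_quote($needle, '/') . '\b/i';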
$haystack = preg_replace($pattern, $replacement, $haystack);
The pattern is pretty simple, but not being very familiar with regular expressions, I had to do a couple of Google searches before I found what I needed.
Here’s a simple breakdown of the pattern string:
The pattern delimiter
The pattern string begins with a forward slash / and has a matching forward slash just before the trailing modifier. These two characters are the pattern delimiters, and PHP’s preg_* functions (preg_replace, preg_match, and so on) require them. A delimiter can be any character that is not alphanumeric, a backslash, or whitespace, so you can pick one that helps with readability:
$pattern = '/\bMatch me\b/';
$pattern = '@\bMatch me\b@';
$pattern = '#\bMatch me\b#';
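As a quick check that the delimiter choice does not change what is matched, here is a small sketch. The subject string and the example.com URL are made up for illustration. The second half shows where a different delimiter helps: a pattern that itself contains slashes.

```php
<?php
// The same pattern written with three different delimiters.
$patterns = ['/\bMatch me\b/', '@\bMatch me\b@', '#\bMatch me\b#'];
$results = [];
foreach ($patterns as $p) {
    // preg_match() returns 1 when the pattern matches, 0 when it does not.
    $results[] = preg_match($p, 'Please Match me now');
}

// With / as the delimiter, every literal slash has to be escaped.
$url = 'https://example.com/docs/regex'; // hypothetical subject
$withSlashes = preg_match('/^https:\/\/example\.com\/docs\//', $url);

// With # as the delimiter, the slashes can stay as they are.
$withHashes = preg_match('#^https://example\.com/docs/#', $url);
```

All three entries in $results come out the same, and the two URL patterns are equivalent; the second is simply easier to read.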
If you leave the delimiters off a pattern like this one, which then starts with a backslash, PHP will spit out this warning:
Warning: preg_replace() [function.preg-replace]: Delimiter must not be alphanumeric or backslash in
The word boundary \b
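The \b escape sequence matches at a word boundary: the position between a word character (a letter, digit, or underscore) and a non-word character, or the start or end of the string next to a word character. It matches a position rather than a character, which is what stops the filter from touching a dirty word embedded in a clean one. For a nice explanation of this metacharacter, take a look at Jan Goyvaerts’ excellent regular expression tutorial page.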
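A few test strings make the behaviour of \b concrete. This is only a sketch; the needle “ass” and the sample strings are assumptions chosen to mirror the “glass” example:

```php
<?php
$pattern = '/\bass\b/';

$hits = [
    // "ass" is preceded by the letter "l", so there is no boundary before it.
    'glass'    => preg_match($pattern, 'glass'),
    // The whole string is the word, bounded by the start and end of the string.
    'ass'      => preg_match($pattern, 'ass'),
    // A space before and punctuation after both count as boundaries.
    'You ass!' => preg_match($pattern, 'You ass!'),
    // The underscore is a word character, so there is no boundary after "ass".
    'ass_hat'  => preg_match($pattern, 'ass_hat'),
];

print_r($hits); // glass => 0, ass => 1, "You ass!" => 1, ass_hat => 0
```

The last case is worth noting for a word filter: because \b treats the underscore as part of a word, words joined by underscores are not caught.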
The ‘i’ at the end
The ‘i’ after the closing delimiter is a pattern modifier. It tells preg_replace that the match should be case insensitive, so “Match me”, “match me”, and “MATCH ME” are all matched.
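Putting the three pieces together, the whole filter could look roughly like the sketch below. The function name filter_dirty_words and the two-word blacklist are hypothetical stand-ins, not the real implementation or list. The sketch also assumes single-byte (ASCII) words, since strlen() counts bytes.

```php
<?php
// Hypothetical helper: masks every blacklisted word in $text with asterisks.
function filter_dirty_words(string $text, array $blacklist): string
{
    foreach ($blacklist as $needle) {
        // Delimiters: /.../   Word boundaries: \b...\b   Case-insensitive: i
        // preg_quote() escapes regex metacharacters and the / delimiter.
        $pattern = '/\b' . preg_quote($needle, '/') . '\b/i';

        // One asterisk per character of the blacklisted word.
        $replacement = str_repeat('*', strlen($needle));

        $text = preg_replace($pattern, $replacement, $text);
    }
    return $text;
}

$blacklist = ['darn', 'heck']; // stand-in for the real 600-word list
$filtered = filter_dirty_words('DARN, what the Heck is a darning needle?', $blacklist);
echo $filtered, "\n"; // prints "****, what the **** is a darning needle?"
```

“DARN” and “Heck” are masked regardless of case, while “darning” is left alone because no word boundary follows “darn” inside it. For a list with non-ASCII words, the u modifier and mb_strlen() would probably be needed instead.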