Monday, January 25, 2010

Regex Pitfall

I guess this was the first time ever that I had to search/match/replace with regular expressions across line boundaries.
Not the usual multi-line ("/m") operation; quite the contrary, I wanted the source string to be treated as one single line, regardless of any newline characters in it.

In other words: the dot (".") should also match a newline.

Not that easy, it turns out.
Quoting from the Regex Tutorial:
The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).
Changing this behavior is actually language-dependent.
Since I needed it in JavaScript (XUL/Thunderbird), I had to resort to [\s\S] instead of the dot:
JavaScript [does] not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
Weird. Unreadable. Requires a comment.
But it works.
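
For the record, here is a minimal sketch of the difference. The sample string and the variable names are made up for illustration; this is not the actual code from my extension.

```javascript
// Hypothetical sample text that spans two lines.
const text = "<b>first line\nsecond line</b>";

// The dot does not match "\n", so this pattern cannot bridge the line break.
const withDot = /<b>(.*)<\/b>/;

// [\s\S] means "whitespace or non-whitespace", i.e. any character,
// line breaks included.
const withClass = /<b>([\s\S]*)<\/b>/;

const dotMatch = withDot.exec(text);     // no match
const classMatch = withClass.exec(text); // matches across the newline

console.log(dotMatch === null);
console.log(classMatch[1]);
```

The comment next to [\s\S] is the part I would not leave out, since the class looks like a typo to anyone who has not seen the trick before.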

2 comments:

Steffen Jakob said...

http://xregexp.com/ supports the "s" flag. It's probably an overkill though to introduce a new library if the \s\S trick works for you.

Roman said...

Steffen,
true... too clumsy to include it... if there's such an easy workaround...