PHP

Link locator and regular expressions

Submitted by mimec on 2013-03-07

Remember the old joke about solving problems using regular expressions? It turns out it never gets out of date. I'm just putting together the markup processor for WebIssues, and since it also uses the link locator, I decided to take a closer look at it. The "link locator" is basically a small utility function which takes a piece of plain text, detects any URLs which appear in it and converts everything to HTML with links.

The heart of the link locator is the call to preg_split with an appropriate regular expression which matches any valid links. I've been using the simplest thing that I could come up with. It recognizes emails, URLs and issue identifiers. And identifier is straightforward; it consists of a "#" and one or more digits. But what makes an email address or URLs is much more difficult to define.

Initially I defined an email address as a sequence of non-whitespace characters starting and ending with a letter or digit and containing exactly one "@". It works, but gives false positives for meaningless strings like "a!@#$%^b". Looking for a better alternative I found this article. I decided to use a slightly modified version of the first regex, which allows the mailto: prefix and non-ASCII characters:

\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b

Finding the start of an URL is easy if we assume that it can only start with one of the following prefixes: http://, https://, ftp://, www. or ftp. The last two make it possible to skip the protocol for common addresses like www.mimec.org. But where exactly does the URL end? In the previous sentence, the final dot is clearly punctuation, not part of the URL, even though dot can also be a part of the URL. My original regex assumed that the URL must end with a letter, digit, or slash.

This also works in most cases, but it's not perfect. We can allow more characters at the end of the URL, but the really interesting case is handling parentheses. Consider those two examples:

  • Visit my website (www.mimec.org).
  • For more information, visit http://en.wikipedia.org/wiki/Tool_(band).

In the first sentence, the closing parenthesis is not part of the URL, but in the second it is. That's obvious to a human reader, but what about a machine? Fortunately someone already invented a regex which solves this problem. The final regular expression which I'm going to use looks like this (split into three lines for readability):

(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])

I added file:// and \\ prefixes (the latter is for UNC paths, like \\server\folder\file.doc) and added backslash as valid character. They are already recognized by the Desktop Client as requested by one of the users. There is no reason not to handle them in the Web Client as well. Even though most browsers block access to such URLs, they can still be copied and pasted more easily.

While testing the regular expressions I made another interesting observation. When using character classes such as "\w" to match against a UTF-8 string, make sure to include the "u" modifier in the expression, for example "/(\w+)/u". Otherwise the result may break the UTF-8 encoding. For example, the Polish letter "ć" is represented in UTF-8 encoding as two bytes, equivalent to ASCII characters "ć". The first one is a "word" character, and the second is not, so the regular expression running in ASCII mode would break the string in the middle of the multi-byte character. Even the innocent "\s" pattern matches the "\xA0" character which can be part of a multi-byte character, so be careful.

Note that it took a bit of googling until I found information about that "u" modifier. The PHP manual should be more specific about it. What's worse, it seems that it's not always supported, even in recent versions of PHP. Just search for "this version of PCRE is not compiled with PCRE_UTF8 support" and you will see what I mean. Well, nothing is perfect, and PHP certainly isn't...