Blog

Syntax highlighting with Prettify

In version 1.1 of WebIssues it will be possible to use the [code] tag in comments and descriptions. Text included in this tag will be displayed in a monospace font, with all formatting disabled. This is useful for including fragments of output, log files, etc., but it can also be used for code snippets; after all, WebIssues is issue tracking software. Developers generally like their code colored, so all kinds of editors and other development tools support syntax highlighting for various languages.

Of course, creating a syntax highlighter is a very complex task, especially given the vast number of programming languages with very different syntax. No wonder that one of the popular tools, SyntaxHighlighter, contains about 100 kilobytes of (partially minified) JavaScript code, slightly more than jQuery. Another example is GeSHi, a 200 kilobyte PHP class with 3 megabytes (!) of language definitions. But syntax highlighting is just decoration, not a key feature, so I want to avoid having to download tons of .js and .css files just to achieve this.

The problem is that these tools try to be much too thorough. I don't care if every single PHP function is highlighted, as long as the most important keywords are, along with comments, strings and fragments of HTML that are embedded into the PHP file (which in turn can contain embedded CSS and JavaScript). That's exactly what Google Code Prettify does. It is actively maintained by folks from Google, and it's used by Google Code itself and Stack Overflow, among others. I decided to use it as well.

The version which is included in the current development version of WebIssues is just 16 kilobytes of code. I removed a few unnecessary features and incorporated some of the additional languages into the main file. Currently supported languages include HTML and XML, C and C++, C#, Java, Bash, Python, Perl, Ruby, JavaScript, CSS, SQL, Visual Basic and PHP. I also packed the final script using the Closure Compiler (also from Google), which made the file almost four times smaller.

When I was looking for a syntax highlighter, I initially thought about doing it on the server side, using PHP code. It didn't occur to me that this could be done on the client using JavaScript. At first the idea seemed strange to me. However, it's actually great and can significantly reduce the server load. From the user's perspective it doesn't really matter. After all, have you ever noticed that Stack Overflow highlights code snippets on the fly using JavaScript?

There is yet another benefit of using client script instead of PHP: it is possible to highlight code also in the Desktop Client. Otherwise the entire mechanism would have to be reimplemented in C++. Version 1.0 of the Desktop Client displays issue details using QTextBrowser, which doesn't support JavaScript and has very limited support for HTML and CSS. But version 1.1 will use QtWebKit, the Qt port of the same engine which powers Chrome and Safari. The advantage is that issue details will have the same look and feel in both the Web Client and the Desktop Client, and obviously it's possible to embed Prettify. I found some minor issues with QtWebKit, probably worth a separate post, but generally, everything works very well.

Filed under: Blog

Link locator and regular expressions

Remember the old joke about solving problems using regular expressions? It turns out it never gets out of date. I'm just putting together the markup processor for WebIssues, and since it also uses the link locator, I decided to take a closer look at it. The "link locator" is basically a small utility function which takes a piece of plain text, detects any URLs which appear in it and converts everything to HTML with links.

The heart of the link locator is the call to preg_split with an appropriate regular expression which matches any valid links. I've been using the simplest thing that I could come up with. It recognizes emails, URLs and issue identifiers. An identifier is straightforward; it consists of a "#" and one or more digits. But what makes an email address or a URL is much more difficult to define.
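
The splitting idea can be sketched as follows. This is a minimal Python sketch, not the actual WebIssues code (which is PHP and calls preg_split). The pattern covers only issue identifiers, and the function name and the issue URL prefix are made up for illustration. When the pattern contains a capturing group, Python's re.split keeps the matched delimiters in the result, much like preg_split with the PREG_SPLIT_DELIM_CAPTURE flag, so plain text and links alternate in the returned list.

```python
import html
import re

# Assumption: only issue identifiers ("#" plus digits) are treated as links here.
LINK_RE = re.compile(r"(#\d+)")


def locate_links(text, issue_prefix="/issue/"):
    """Convert plain text to HTML, turning issue identifiers into links.

    issue_prefix is a hypothetical URL scheme, not the one WebIssues uses.
    """
    parts = LINK_RE.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd positions hold the captured matches, i.e. the links.
            out.append('<a href="%s%s">%s</a>' % (issue_prefix, part[1:], part))
        else:
            # Even positions hold the plain text between the links.
            out.append(html.escape(part))
    return "".join(out)


print(locate_links("Fixed in #12 & #7"))
```

Escaping the plain-text parts separately from the links is what makes it safe to emit the result as HTML.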

Initially I defined an email address as a sequence of non-whitespace characters starting and ending with a letter or digit and containing exactly one "@". It works, but gives false positives for meaningless strings like "a!@#$%^b". Looking for a better alternative I found this article. I decided to use a slightly modified version of the first regex, which allows the mailto: prefix and non-ASCII characters:

\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b
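
To see how this pattern behaves, here is a small Python sketch (WebIssues itself uses PHP). The pattern is the one quoted above; the case-insensitive flag and the sample addresses are assumptions of this example. In Python 3, "\w" already matches non-ASCII letters in text strings.

```python
import re

# The pattern from the text. IGNORECASE is an assumption of this sketch,
# so that upper-case top-level domains are accepted as well.
EMAIL_RE = re.compile(
    r"\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b",
    re.IGNORECASE,
)


def find_emails(text):
    """Return all substrings of text that look like email addresses."""
    return EMAIL_RE.findall(text)


# Made-up sample addresses, including one with non-ASCII letters.
print(find_emails("Write to mailto:john.doe@example.com or za\u017c\u00f3\u0142\u0107@example.pl."))
# The meaningless string mentioned above is no longer reported.
print(find_emails("a!@#$%^b"))
```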

Finding the start of a URL is easy if we assume that it can only start with one of the following prefixes: http://, https://, ftp://, www. or ftp. The last two make it possible to skip the protocol for common addresses like www.mimec.org. But where exactly does the URL end? In the previous sentence, the final dot is clearly punctuation, not part of the URL, even though a dot can also be a part of the URL. My original regex assumed that the URL must end with a letter, digit, or slash.

This also works in most cases, but it's not perfect. We can allow more characters at the end of the URL, but the really interesting case is handling parentheses. Consider these two examples:

  • Visit my website (www.mimec.org).
  • For more information, visit http://en.wikipedia.org/wiki/Tool_(band).

In the first sentence, the closing parenthesis is not part of the URL, but in the second it is. That's obvious to a human reader, but what about a machine? Fortunately someone already invented a regex which solves this problem. The final regular expression which I'm going to use looks like this (split into three lines for readability):

(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])

I added the file:// and \\ prefixes (the latter is for UNC paths, like \\server\folder\file.doc) and added the backslash as a valid character. They are already recognized by the Desktop Client, as requested by one of the users. There is no reason not to handle them in the Web Client as well. Even though most browsers block access to such URLs, they can still be copied and pasted more easily.
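
The two example sentences and a UNC path can be checked with a short Python sketch. The three pattern fragments are copied from the expression above and joined into one; the helper name is made up, and Python's re module stands in for PHP's PCRE functions here.

```python
import re

# The three lines of the expression above, concatenated into one pattern.
URL_RE = re.compile(
    r"(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)"
    r"(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*"
    r"(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])"
)


def find_urls(text):
    """Return all URLs (or UNC paths) found in text."""
    return URL_RE.findall(text)


# The closing parenthesis and the final dot are left out of the match.
print(find_urls("Visit my website (www.mimec.org)."))
# The balanced "(band)" is kept, the final dot is not.
print(find_urls("For more information, visit http://en.wikipedia.org/wiki/Tool_(band)."))
# A UNC path starting with two backslashes.
print(find_urls(r"See \\server\folder\file.doc."))
```

The trick is in the last group: a URL may end with a balanced pair of parentheses or with a character from a narrower set that excludes punctuation such as dots, commas and question marks.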

While testing the regular expressions I made another interesting observation. When using character classes such as "\w" to match against a UTF-8 string, make sure to include the "u" modifier in the expression, for example "/(\w+)/u". Otherwise the result may break the UTF-8 encoding. For example, the Polish letter "ć" is represented in UTF-8 encoding as two bytes, 0xC4 and 0x87, which look like "Ä‡" when read as single-byte characters. The first one is a "word" character, and the second is not, so the regular expression running without the "u" modifier would break the string in the middle of the multi-byte character. Even the innocent "\s" pattern matches the "\xA0" character which can be part of a multi-byte character, so be careful.
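
The effect can be reproduced outside PHP. The sketch below uses Python and simulates a byte-wise matcher by decoding the UTF-8 bytes as Latin-1, so that every byte becomes one character. That decoding step is an assumption of this example; PCRE's actual behaviour without the "u" modifier depends on the locale.

```python
import re

word = "\u0107ma"              # "ćma", starting with the Polish letter from the text
raw = word.encode("utf-8")     # the first letter becomes the bytes 0xC4 0x87

# Assumption: treat every byte as one character, as a single-byte matcher would.
bytewise = raw.decode("latin-1")


def words(text):
    """Return all runs of "word" characters in text."""
    return re.findall(r"\w+", text)


print(words(word))      # the whole word is one match
print(words(bytewise))  # the match is cut between the two bytes of the first letter

# The letter U+00E0 is encoded as 0xC3 0xA0; read byte by byte,
# the second byte looks like a non-breaking space and matches \s.
nbsp_case = "\u00e0".encode("utf-8").decode("latin-1")
print(re.search(r"\s", nbsp_case) is not None)
```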

Note that it took a bit of googling until I found information about that "u" modifier. The PHP manual should be more specific about it. What's worse, it seems that it's not always supported, even in recent versions of PHP. Just search for "this version of PCRE is not compiled with PCRE_UTF8 support" and you will see what I mean. Well, nothing is perfect, and PHP certainly isn't...

Filed under: Blog

Text formatting in WebIssues

Recently I wrote about version 1.1 of WebIssues and my plans to introduce issue descriptions and formatting of comments and descriptions. I also listed various markup languages which I considered using in WebIssues. But first, let's look at how text is handled in the current version of WebIssues.

WebIssues currently uses plain text for comments. All whitespace characters, including indentation and line breaks, are preserved, making it easy to paste fragments of code directly into comments without breaking their formatting. At the same time, WebIssues wraps long lines, making it possible to write long paragraphs of text which are displayed correctly regardless of the width of the window. This basically corresponds to the "white-space: pre-wrap" CSS style. In addition, external URLs and issue identifiers are automatically converted to links.

The key idea behind adding extended formatting options is to preserve compatibility with this "plain text" mode. It should be possible to edit an existing comment, enable formatting and add some markup to the existing text without breaking existing formatting. Now the problem is that most existing markup languages either ignore whitespace (for example HTML, unless wrapped in a <pre> block) or handle it in a specific way (for example, double line breaks are converted to paragraphs, indentation indicates a block of code, etc.). I'm not saying that they are wrong; this often makes sense when copying text from text files or plain text emails. However, I don't want to break the habits of existing users of WebIssues. I would like to treat spaces and line breaks identically, whether formatting is enabled or not. Thanks to this, pieces of code will not break, even if they are not marked using special tags. These tags will only be used for decorative purposes, for example, by using a different background color and enabling syntax highlighting.

There are generally two kinds of markup used in existing languages: punctuation (brackets, quotes, asterisks, etc.) and tags. Punctuation is used for inline formatting in various flavors of Wiki and languages such as Markdown or Textile, for example to indicate bold or italic text, though there is no single standard. It is also used for block formatting, for example a leading '>' may indicate a quote. HTML tags are commonly used by various languages, in addition to other format specifiers, though they are useful mostly for advanced formatting. Finally, various flavors of BBCode use custom tags, which are similar to HTML, but simpler. I decided to use a combination of punctuation for inline formatting and custom tags for block formatting. It's questionable whether yet another language should be invented when there are so many already, but I think it's going to be intuitive for everyone, and thanks to the embedded markItUp editor, there will be no need to remember it.

The following inline formatting tags will be supported: **bold**, __italic__, `monospace text` and [URL custom links]. The * and _ characters appear commonly in technical text, so they need to be doubled to avoid false positives. Link syntax is quite similar to Wiki external links, however internal links can be created the same way, for example: [#123 some issue]. In the future it will be possible to introduce real Wiki functionality (where names can be used instead of numeric identifiers).
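
To make the inline rules concrete, here is a minimal Python sketch of such a converter. It is only an illustration under several assumptions, not the WebIssues implementation: the function name, the generated HTML tags and the "/issue/" link prefix are all made up, and nesting of inline tags is not handled.

```python
import html
import re

# One alternative per inline tag; monospace comes first so that
# its content is never treated as bold or italic.
INLINE_RE = re.compile(
    r"`(?P<mono>[^`]+)`"
    r"|\*\*(?P<bold>.+?)\*\*"
    r"|__(?P<italic>.+?)__"
    r"|\[(?P<target>[^\s\]]+) (?P<label>[^\]]+)\]"
)


def render_inline(text, issue_prefix="/issue/"):
    """Convert inline markup to HTML (issue_prefix is a hypothetical URL scheme)."""

    def replace(match):
        if match.group("mono") is not None:
            return "<code>%s</code>" % match.group("mono")
        if match.group("bold") is not None:
            return "<strong>%s</strong>" % match.group("bold")
        if match.group("italic") is not None:
            return "<em>%s</em>" % match.group("italic")
        target = match.group("target")
        if target.startswith("#"):
            # Internal link to an issue, e.g. [#123 some issue].
            target = issue_prefix + target[1:]
        return '<a href="%s">%s</a>' % (target, match.group("label"))

    # Escape HTML first; the markup characters themselves are not affected.
    return INLINE_RE.sub(replace, html.escape(text))


print(render_inline("**bold**, __italic__ and `a_b * c`"))
print(render_inline("see [#123 some issue]"))
print(render_inline("2 * 3 and some_name"))
```

Single * and _ characters pass through untouched, which is the reason for doubling them. A real converter would also need to validate link targets before emitting them.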

Three different block level formatting tags will be supported. A [code][/code] pair will indicate a block of code, with optional syntax highlighting based on Google's Prettify. A [quote][/quote] pair will indicate a quote with an optional header. Finally, a [list][/list] pair will indicate a bullet list, where each item starts with one or more * (multiple asterisks indicate nested levels). Unlike automatic lists used by many markup languages, explicit tags will make it easy to clearly indicate where the list starts and ends. It will also be possible to freely mix and nest all three kinds of tags.
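
As an illustration of the list syntax, the following Python sketch turns the content of a [list] block into nested HTML lists. It is a rough sketch under stated assumptions: the function name is hypothetical, lines that do not start with an asterisk are simply ignored, and an item that skips a level is treated as being just one level deeper.

```python
import html
import re

# One or more asterisks, optional whitespace, then the item text.
ITEM_RE = re.compile(r"(\*+)\s*(.*)")


def render_list(body):
    """Convert the lines between [list] and [/list] to nested <ul> elements."""
    out = []
    depth = 0
    for line in body.splitlines():
        match = ITEM_RE.match(line.strip())
        if match is None:
            continue  # this sketch ignores lines that are not list items
        # Never descend more than one level at a time, to keep the HTML valid.
        level = min(len(match.group(1)), depth + 1)
        if level > depth:
            out.append("<ul>")
        else:
            # Close the previous item and any nested lists that have ended.
            out.append("</li>" + "</ul></li>" * (depth - level))
        depth = level
        out.append("<li>" + html.escape(match.group(2)))
    # Close everything that is still open.
    out.append("</li></ul>" * depth)
    return "".join(out)


print(render_list("* first\n** nested\n* second"))
```

Because the block has explicit start and end tags, the converter never has to guess where the list ends.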

I'm now playing with the prototype of the converter, and I may still make some minor changes, but so far I'm rather satisfied with the result. So when is version 1.1 going to be released? By the end of this year - that's all I can promise for now. I will probably release a beta version in a few months. But I still have my book and various other things to do, so don't expect miracles.

Filed under: Blog
Tags: WebIssues

WebIssues 1.1

Before I get to the main topic, just a short update on my novel :). I wrote a few more chapters and I have some new ideas, but I feel that I need a break. When I first started writing back in 2011, I was so involved that I could write all night, but now I have to get up earlier and generally I have too many things to do to be able to fully concentrate on this. So it's becoming a somewhat tedious process, quite like the programming that I was trying to escape from. However, the result is still quite good and I'm certainly not going to give it up.

Anyway, I can't ignore the fact that the idea of WebIssues 1.1 is growing in me. I've already had some items on the roadmap, but there are so many of them that I will have to split them into two releases. I'm probably going to postpone all improvements related to users, groups and permissions until version 1.2. The main improvement in version 1.1 will be issue descriptions. It's something that's clearly missing compared to other bug tracking systems. Of course, the first comment can act as a description, so the change is a bit cosmetic, but the ability to provide the description directly when creating an issue will certainly be an improvement. During the upgrade from version 1.0, the first comment will be automatically converted to a description.

Projects will now have a description as well. There will be a project summary page, which in the future may contain other useful information, such as statistics, recent issues, etc.

The last (but not least) improvement in this area will be the ability to use simple formatting in both comments and descriptions. And that's an interesting problem, because there are lots of different markup languages that can be used to add formatting to a piece of text. Each existing standard has its advantages and disadvantages:

HTML
Powerful and good for CMS, blogs, etc., but it's difficult for non-geeks to use. And it's even more difficult to display it correctly. A naive implementation opens the possibility for XSS attacks. Simple tools like kses still won't ensure that the markup is valid (e.g. they don't check for unbalanced tags). More advanced tools like htmlpurifier are simply monstrous.
Textile / Markdown
They are quite different, but based on a similar idea: make the source text look as natural as possible. I prefer the latter, although they both seem to make more sense for writing longer articles (especially technical) than simple descriptions and comments.
Wiki markup
The main problem is that there is simply no such thing as a standard Wiki syntax. Although there are many similarities, each implementation has its own flavor. Also note that adding true Wiki support to WebIssues (i.e. being able to create cross-links based on titles, not just issue IDs) would be an entirely different story.
BBCode
Simple and widely used (also with many different flavors). On the other hand, square brackets don't seem more intuitive than angle brackets used in HTML.

This is a broader topic and I will write more about it in a separate post. So far I'm leaning towards a subset of Wiki syntax with some modifications, but I have to think more about it. And don't even get me started on the so-called "WYSIWYG" editors. They are bloated and/or buggy and not 100% portable. I think I'm going to create something based on markItUp, which is small, simple and easy to customize.

Yet another area of functionality that sooner or later must be (and will be) added to WebIssues is support for inbound emails. Some thoughts have been circling around and a few different people have offered to help me implement this. If something gets done then I will include it in one of the next releases, but for now I can't promise anything.

Filed under: Blog
Tags: WebIssues

Writing

I want to write. Well, of course, I do; but I don't mean programs and technical documentation, but novels. My New Year's resolution is to finish the book that I started some time ago and get it published. Why this sudden change of mind? Just a few months ago I wanted to start a business based on WebIssues. I even managed to briefly draw the attention of the management of the company I work for to it. But their idea of investing very little in order to hopefully get some profit didn't make much sense. My own vision wasn't rejected outright, but considering all the political aspects that rule a corporation like this, and my complete lack of influence on these things, I can't realistically expect that this is ever going to happen.

Obviously, making a living from writing is an even more insane idea. It's a very demanding market, and in Poland also quite a narrow one. It also requires a lot of pure luck, probably even more than running a successful business. Not to mention that writing a novel requires huge amounts of time. But the real problem is that I'm really starting to hate programming. Commercial or open source, it's tedious, repetitive and rarely creative. And writing is not a new idea. I wrote some stories as a child. In high school I started writing a book with two friends; it didn't last long, but it was a lot of fun. But this time it's different, because I already have most of the plot in my mind, so the ideas are there waiting to be put on paper.

I already mentioned the novel I'm writing once or twice, but perhaps this time I will shed some more light on it. The idea came to my mind in summer 2011, while I was reading Neal Stephenson's Snow Crash, but it was also influenced by Lev Grossman's The Magicians which I read shortly before. It's basically a cyberpunk story, taking place largely in two different virtual worlds, but it also has some elements of contemporary fantasy and techno-thriller. The main characters are a few students of a school for young hackers, which is called the Academy of Magic, because in a virtual world, the boundaries between hacking and magic are blurred for the uninitiated. As my younger brother described it, when I told him about the novel today, it's like a "rolled pancake" :). I admit that mixing genres is risky, but if I do it well, maybe something interesting will come out of this.

And by the way, yesterday was my son's first birthday :). I must publish some new photos soon because I haven't done that in a while.

Filed under: Blog
Tags: personal