Ra Japhala in Bengali (bn) and Unicode 5.0

During the past two days, I have been trying to weed out the remaining Bengali-related bugs in Pango, and came up with three bug reports (427667, 427611, 427584). Of these, bug 427667 affects all Indic scripts and should be of interest to the entire Indic community. The rest are Bengali-specific.

While working on the issues, to my dismay, I noticed that the last version of the Unicode Standard book which is freely available online (as PDF documents) covers only version 4.0. The book that covers the latest version (5.0) has to be bought for around 50 USD (no idea how much shipping to Kolkata would cost).

The lack of documentation on 5.0 became very irritating while I was working on issue 427611, which is about getting Pango to render the sequence U+09B0 ZWNJ U+09CD U+09AF correctly. According to the Indic FAQ at the Unicode site, this sequence should be rendered as Ra-Japhala. However, after some discussion today on IRC with Runa-di and Rahul, it seems that Unicode 5.0 changes the recommended sequence for Ra-Japhala from U+09B0 ZWNJ U+09CD U+09AF to U+09B0 ZWJ U+09CD U+09AF. This seems to be a result of the acceptance of Public Review Issue #37, proposed by Peter Constable of Microsoft in July 2004. The logic put forward by PR-37 is absolutely fine, as far as my opinion goes, but the entire experience raises a couple of questions - namely:

  • Why hasn’t the Indic FAQ been updated to reflect the changes?
  • Why isn’t the Unicode 5.0 book freely available for download by developers?

A very good friend of mine has access to a (physical) copy of the book, and she mailed me a snippet from it:

“…Unicode Standard adopts the convention of placing the character U+200D ZWJ immediately after the ra to obtain the ra-yaphaala…”

So there you go. My work would have been much easier if I had access to that particular book - I spent almost an entire day trying to figure out which is the right way of representing Ra-Japhala :-(.
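For anyone who wants to see the difference at the code point level, here is a small Python sketch that spells out the two encodings discussed above. The constant names and the describe() helper are my own, made up for illustration; only the code points come from the discussion.

```python
import unicodedata

# Code points taken from the sequences discussed above.
RA, VIRAMA, YA = "\u09B0", "\u09CD", "\u09AF"
ZWJ, ZWNJ = "\u200D", "\u200C"

# Names below are illustrative only, not part of any library.
RA_JAPHALA_ZWJ = RA + ZWJ + VIRAMA + YA    # recommended as of Unicode 5.0
RA_JAPHALA_ZWNJ = RA + ZWNJ + VIRAMA + YA  # older form from the Indic FAQ


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, seq in (("ZWJ form", RA_JAPHALA_ZWJ),
                       ("ZWNJ form", RA_JAPHALA_ZWNJ)):
        print(label)
        for line in describe(seq):
            print("   " + line)
```

The two strings differ only in the second code point, which is exactly why it is so easy to pick the wrong one without the book at hand.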

However, I should probably also mention here that though the recommended way is to represent Ra Japhala as U+09B0 ZWJ U+09CD U+09AF, people seem to favour rendering the sequence U+09B0 ZWNJ U+09CD U+09AF as Ra Japhala as well (probably for backward compatibility reasons), so the patch against issue 427611 stands. I asked Jamil-bhai to get the following sequences tested on a Windows XP and a Windows Vista machine (since I don’t run any flavour of Windows at all). The sequences to be tested were:

  • U+09B0 ZWNJ U+09CD U+09AF (as a part of a larger word)
  • U+09B0 ZWNJ U+09CD U+09AF (standalone)
  • U+09B0 ZWJ U+09CD U+09AF (standalone)
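If you want to generate the same three test strings yourself, something along these lines should do. This is only a sketch: the joiner_used() helper and the sample word are assumptions of mine (the word is not the one that was actually tested), and the code has nothing to do with how Pango or Uniscribe shape text internally. It only checks which joiner a string uses.

```python
import re

# RA, then either ZWNJ (U+200C) or ZWJ (U+200D), then VIRAMA, then YA.
RA_JAPHALA = re.compile("\u09B0[\u200C\u200D]\u09CD\u09AF")


def joiner_used(text):
    """Return 'ZWJ' or 'ZWNJ' for the first Ra-Japhala sequence, else None."""
    match = RA_JAPHALA.search(text)
    if match is None:
        return None
    return "ZWJ" if match.group()[1] == "\u200D" else "ZWNJ"


# Hypothetical sample word: the ZWNJ sequence followed by AA-kar and BA.
IN_WORD_ZWNJ = "\u09B0\u200C\u09CD\u09AF\u09BE\u09AC"
STANDALONE_ZWNJ = "\u09B0\u200C\u09CD\u09AF"
STANDALONE_ZWJ = "\u09B0\u200D\u09CD\u09AF"

if __name__ == "__main__":
    for sample in (IN_WORD_ZWNJ, STANDALONE_ZWNJ, STANDALONE_ZWJ):
        print(ascii(sample), "->", joiner_used(sample))
```

Pasting the three strings into any text widget is then enough to compare renderers by eye.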

Windows XP does not render the third sequence correctly:

Ra Japhala in Windows XP

Windows Vista renders all three correctly:

Ra Japhala in Windows Vista

Pango (SVN trunk) with the patch applied renders all three correctly:

Ra Japhala in Pango

The difference between the two versions of Windows is probably caused by different versions of Uniscribe bundled with the OSs.

UPDATE: After writing this entry, I realize that I might be overreacting a bit - but somehow, I can’t help feeling a bit pissed off :-(.