Saturday, September 29, 2012

Unicode support for Assamese

There have been many discussions in the last several months about a perception that Assamese is not well supported in Unicode, the industry standard that underpins text input, processing, and display/printing of natural languages on computers and on the internet. In my opinion, this perception is misplaced. After studying the current Unicode standards, and drawing on prior experience with Indic text processing[1], I have concluded that Unicode provides clear support for the Assamese LANGUAGE, distinct from the Bengali language. There is an issue with the name of the script, as people have pointed out, but, as I will explain below, it is a relatively minor one. The rest of this post explains the details behind these conclusions.

On support for the script

In the discussions and writings on Assamese language support on the internet, I have noticed a lot of confusion about the distinction between "script" and "language". Although the difference is subtle, Unicode, being a computer-related standard, has to be very precise about it.

For example, I am currently typing in the Roman script to write in the English language. But, since "je parle un peu français", I have also used the same Roman script to write in the French language in the previous clause. Thus, two languages (English and French) use the same script (Roman).

Unicode specifies this distinction between scripts and languages as follows. It defines a CODE CHART for a script, and, separately, it defines a LOCALE for a language that uses that code chart. Thus, Unicode standardized the Latin code-chart (to represent an extended version of what we call the Roman script), and then defined English, French, and other languages as locales that use that code chart.

But why did Unicode choose the same code chart for those languages? It is because roughly 90% of the letters are the same in written English, French, Dutch, German, and other related languages. Noticing this, Unicode defined the Latin code chart to include ALL letters used in writing those languages. That way, fewer character codes are needed to cover all the letters used by all those languages combined. This reflects the economy of encoding that computer design always strives for, and it is a guiding principle that Unicode applies across its standards work.

So, it should not be surprising that Unicode applied the same principle to the Assamese and Bengali languages. The scripts used by Assamese and Bengali have some differences, but roughly 90% of their letters are the same. Thus, it made sense for Unicode to define a COMMON code chart that includes ALL letters used by either Assamese or Bengali.
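The shared-code-chart idea can be checked from the character data itself. The following is a minimal Python sketch using the standard `unicodedata` module. The helper name `script_prefixes` and the sample words are my own choices for illustration, and taking the first word of a character's name is only a rough stand-in for its script, not an official Unicode property lookup.

```python
import unicodedata

def script_prefixes(text):
    """Return the first word of each letter's Unicode name (a rough script hint)."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

# English and French words both draw on the Latin code chart.
print(script_prefixes("hello"))      # {'LATIN'}
print(script_prefixes("français"))   # {'LATIN'}

# An Assamese word (অসম) and a Bengali word (বাংলা) both draw on the
# code chart whose character names begin with BENGALI.
print(script_prefixes("\u0985\u09b8\u09ae"))              # {'BENGALI'}
print(script_prefixes("\u09ac\u09be\u0982\u09b2\u09be"))  # {'BENGALI'}
```

Combining vowel signs are skipped by `isalpha()`, so only the base letters are inspected here.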

So, there was some method to how Unicode chose a common code chart for the two languages. But did Unicode do everything right for that code chart? No: Unicode chose a poor name for it. By calling it the "Bengali" code chart, it hurt the sentiments of the Assamese people, because of the past history of Assamese/Bengali language relations, which I do not plan to cover in this technical discussion. Suffice it to say that, thanks to the efforts of several people to sensitize the standards body to the issue, Unicode is beginning to change its communication to state more explicitly that Assamese letters are included in the code chart, despite its name. Its web-site now lists the code chart as "Bengali and Assamese": http://www.unicode.org/charts/.

But there is still an open issue about the name in the actual standard itself. The code chart specification document still calls it the "Bengali" code chart, even though the web-site lists it as the code chart for "Bengali and Assamese". Can Unicode do more to make the name in the specification inclusive of both languages?

It certainly could, but the change would cause a lot of work for all software platforms that implement the current Unicode standard. These include Microsoft Windows and all the Unix variants and derivatives (FreeBSD, Linux, MacOS, Solaris, Android OS, and iOS, to name some of them), for which software has been written against the list of code-chart names in the Unicode standard (Latin, Arabic, Bengali, etc.). A change to any of those names would require software developers for all those platforms to modify existing source code and release updates affecting billions of computers, tablets, smart-phones, and other computing devices world-wide.

So, Unicode has to carefully weigh the benefit of such a name change for input, processing, or display of Assamese characters before making a decision. It turns out that a name change would bring no functional benefit for Assamese, because (1) all the character codes required to input, process, or display Assamese are already defined in the current code chart, and (2) language-specific characteristics such as alphabetical order are already standardized elsewhere in Unicode for Assamese. I say more about those characteristics in the next section, but it is important to understand that these are the reasons why the Unicode organization has not changed the name in the actual code chart, and has stopped at listing it as "Bengali and Assamese" on the web-site.

On support for the Assamese language in Unicode

Note that language characteristics such as alphabetical order differ even among closely related languages, so this is not a new problem for Unicode. The Swedish alphabet places three extra vowels at its end (..., X, Y, Z, Å, Ä, Ö). The Danish and Norwegian alphabets do the same, but with different written symbols and a different alphabetical order. For Swedish, Danish, and Norwegian, Unicode represents these differences outside the code chart, in the locale. Similarly, the different alphabetical orders of Assamese and Bengali are represented using the same mechanism: the locale.
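To see why alphabetical order cannot simply follow the code chart, here is a small Python sketch for the Swedish case. The names `SWEDISH_ALPHABET` and `swedish_key` and the sample words are hypothetical, written by hand for illustration. Real software would normally take this ordering from locale data rather than from a hard-coded string.

```python
# Swedish alphabet as described above: Å, Ä, Ö come after Z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Sort key: rank each letter by its position in the Swedish alphabet."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["öl", "zebra", "äpple", "ål", "apa"]

# Plain sorting compares code values, where Ä (U+00E4) precedes Å (U+00E5).
by_code_point = sorted(words)
# The locale-style key restores the alphabet's own order.
by_alphabet = sorted(words, key=swedish_key)

print(by_code_point)  # ['apa', 'zebra', 'äpple', 'ål', 'öl']
print(by_alphabet)    # ['apa', 'zebra', 'ål', 'äpple', 'öl']
```

The code chart stays the same in both cases. Only the ordering rule, which belongs to the language, changes.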

So, how does Unicode provide support for the Assamese LANGUAGE? The specification for locales in Unicode is called the Unicode Common Locale Data Repository (CLDR): http://cldr.unicode.org/. The repository defines the Assamese language by specifying many things, including:
- the name of the locale; short code is 'as' (note that the Bengali locale is different: 'bn')
- its code chart for script,
- its collation order (alphabetical order -- note that Bengali and Assamese have different collation requirements, especially since the Assamese 'ra' (ৰ, U+09F0) and 'wa' (ৱ, U+09F1) have larger code values than the other consonants)
- its date/time format, etc.
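The collation point in the list above can be illustrated with a short Python sketch. It assumes the traditional Assamese consonant order in which ৰ follows য and ৱ follows ল. The key function `assamese_key` is hypothetical and written by hand for this post. It is not copied from the CLDR collation file and covers only these two letters.

```python
import unicodedata

# Assamese-specific letters that sit near the end of the shared code chart.
ASSAMESE_RA = "\u09F0"  # ৰ
ASSAMESE_WA = "\u09F1"  # ৱ
YA, LA, SHA, HA = "\u09AF", "\u09B2", "\u09B6", "\u09B9"  # য ল শ হ

def assamese_key(word):
    """Hypothetical sort key: file ৰ right after য, and ৱ right after ল."""
    tailored = {ASSAMESE_RA: (ord(YA), 1), ASSAMESE_WA: (ord(LA), 1)}
    return [tailored.get(ch, (ord(ch), 0)) for ch in word]

letters = [HA, ASSAMESE_WA, LA, ASSAMESE_RA, YA, SHA]
by_code_point = sorted(letters)                  # raw code-value order
by_alphabet = sorted(letters, key=assamese_key)  # tailored order

for ch in (ASSAMESE_RA, ASSAMESE_WA):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(" ".join(by_code_point))  # য ল শ হ ৰ ৱ
print(" ".join(by_alphabet))    # য ৰ ল ৱ শ হ
```

Sorting by raw code value pushes ৰ and ৱ past হ, which is why the ordering has to come from the locale rather than from the code chart.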

These are specified in the CLDR core.zip file which can be downloaded from the repository link above. The files that represent the Assamese locale are as follows:
./common/casing/as.xml
./common/collation/as.xml
./common/main/as.xml
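If you have downloaded the archive, a few lines of Python can list the files for a given locale. This is only a sketch: the function name `locale_files` is my own, and it assumes the archive stores one `<locale>.xml` file per directory, as the paths above suggest.

```python
import zipfile

def locale_files(archive, locale):
    """List archive members whose file name is exactly <locale>.xml."""
    suffix = f"/{locale}.xml"
    with zipfile.ZipFile(archive) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(suffix))

# Example call, assuming core.zip sits in the current directory:
#   for path in locale_files("core.zip", "as"):
#       print(path)
```

Passing `"bn"` instead of `"as"` would list the corresponding Bengali locale files.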

I will do a follow-up post to describe the contents of each of these files and how they represent the characteristics of the Assamese language.

Summary

In summary, with the existing code chart (even though it is not well named), and the Assamese ('as') locale already defined by Unicode, everything needed for reading and writing in Assamese on the internet already exists[2].

References

[1] I have some knowledge in this area based on work I did in the mid-1980s on Indic language text processing. Here is a link to a peer-reviewed paper, written by me and colleagues, that describes the work: https://docs.google.com/open?id=0B7lCYS3yYAjhdEpsclFmVWlLQk0.

[2] There is currently a gap on the Internet in searching reliably in Assamese, but nothing is needed from the Unicode standard to fix it. The fixes are required in search engines like Google and Bing. I have described the problem, the solution, and a workaround in the write-up at this link: https://docs.google.com/document/d/1gcgX1hua22rpvFFtgG50PVJILq64fr3MwWoJU6o7VD0/edit.

[3] Here is a Firefox plug-in for the workaround to search in Assamese: http://mycroft.mozdev.org/search-engines.html?name=Google+%28as%29+Assamese+%28Adds+ৰ%29.

2 comments:

  1. Very thorough article.. I had almost forgot the difference between the "Script" and the "Language" before going through the article...

    Also it was nice to know Unicode named the fonts "Bengali and Assamese", rather than only "Bengali". I think many people do not know this. Kudos to all the people who took the effort..

    1. Very good article, stuck firmly to the realities. Yes, indeed Unicode committed some mistakes in the initial stages, and I believe the sentiment of the Assamese people is also justified. It hurts me as an Assamese when I see ৱ as BENGALI RA WITH LOWER DIAGONAL in the code chart. I understand that it will be difficult to change things now! But there is something they can do: at least declare the mistakes in the documents.
