Musings on technology: 2012

Tuesday, November 20, 2012

Computer/Internet Score for Natural Languages

To get objective data on how well computers and the internet support different natural languages, I compared that support for four languages: English, Dutch, Bangla, and Assamese. I defined nine elements of what it takes to support a natural language on computers and the internet, and then rated the level of support for each language in each element. Using that analysis, I assigned a score to each language. Assamese scored the lowest, so there is a lot of work left to do for Assamese to catch up with the other three languages. The detailed findings of this analysis are shown in the table at this link.

Saturday, October 13, 2012

Firefox plug-in to easily search for Assamese content

One of the things I mentioned in an earlier post is the need for Google and other search engines to make it easier to search for Assamese writing on the internet. In the meantime, we can use the workaround of typing 'ৰ' after the search term to narrow the results down to mostly Assamese content. To make the workaround easier to use, I made a simple plug-in for the popular Firefox browser.
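The workaround can also be sketched in a few lines of Python. This is only an illustration of the idea, not the plug-in's actual definition: the function name `assamese_search_url` and the Google search base URL used here are assumptions for the example.

```python
from urllib.parse import urlencode

# The Assamese letter 'ৰ' (U+09F0) that the workaround appends to the search term.
ASSAMESE_MARKER = "\u09F0"


def assamese_search_url(term, base="https://www.google.com/search"):
    """Build a search URL whose query is the term followed by 'ৰ'.

    The base URL is an assumption for this sketch; the plug-in may use
    a different search template.
    """
    query = f"{term} {ASSAMESE_MARKER}"
    # urlencode percent-encodes the query as UTF-8 and turns the space into '+'.
    return f"{base}?{urlencode({'q': query})}"


print(assamese_search_url("অসম"))
```

Pasting the printed URL into any browser should behave like typing the search term followed by 'ৰ' by hand.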

Here are the simple steps to install and use the plug-in in Firefox:

Go to the following link with your Firefox browser: http://mycroft.mozdev.org/search-engines.html?name=Google+%28as%29+Assamese+%28Adds+ৰ%29
On that page, you will see a link to the plug-in named "Google (as) Assamese (Adds ৰ)". Click on the link to the plug-in.
This will bring up a pop-up window named "Add Search Engine" and ask you whether you want to add "Google Assamese" to the list of search engines available on the search bar. Click on the "Add" button. This makes it one of the search engines available in Firefox, but it will not yet be selected for your searches (that is the next step).
Click on the dropdown menu in the Firefox search bar (down arrow on its left edge) and find "Google Assamese" in the list. It will have a rhino icon. Select that item.
Type any word in the search bar. You should see results mostly from websites with Assamese writing.

I look forward to feedback on the plug-in and its effectiveness in searching for web-pages with Assamese writing.

Additional notes:

To go back to searching with your previous search engine (for example, "Google"), simply go back to the search bar dropdown menu and pick that previous item from the list. All searches from the search bar after that will use that search engine.
To remove "Google Assamese" from Firefox, go to the search bar dropdown menu, click on "Manage Search Engines", select "Google Assamese" from the list, and click on "Remove". You can then select another search engine as shown above.
The plug-in format is such that it should also work on the Chrome browser, but I have not tested it yet on Chrome. I would appreciate someone trying it out and letting me know if it works.

Wednesday, October 3, 2012

My favorite web tool to type in Assamese

Pramukh IME, or Pramukh for short, is a simple web tool that lets you use the English keyboard to type in Assamese (and other Indian languages). It can be used on all computers (e.g. Windows and Mac computers) as well as on tablets and smartphones (e.g. Android, iPad, and iPhone).

Typing in Pramukh is phonetic. To enter Assamese letters, we type similar sounding letters on the English keyboard. For example, we type m to write ম and h to write হ, etc. This takes advantage of our familiarity with a standard English keyboard to type Assamese letters easily.

Pramukh does not need any software to be installed on your computer or your device. You can use Pramukh directly in an Internet browser (Internet Explorer, Chrome, Firefox, Safari, Edge, Opera, etc.) by going to its webpage (detailed in the section below).

Getting started with Pramukh for Assamese

To get started:

Go to the Pramukh IME webpage for Assamese: https://www.pramukhime.com/type/assamese

On that webpage, click on the text field. This is where you can type in English letters and see the corresponding Assamese letters.

Here is an example Assamese sentence you can try with Pramukh:

আহক আমি সকলোৱে অসমীয়াত লিখোঁ

To type the above Assamese sentence, you will need to type the following English letters on your keyboard:
ahok ami xokolOwe oxomIyat likhO.n

As you type the above English letters, you will see the corresponding Assamese letters appear in the text field on the Pramukh web page.

Finally, you can use regular copy/paste on your computer, phone, or tablet to copy the Assamese text from that text field to where you want it, e.g., a Word document, a web-page, or an email message.

Getting help for Assamese letters on Pramukh

You can bring up the Help page in Pramukh by clicking on the 'অ?' icon on that page. This icon is next to the language selector on the web-page (see Figure 1 above). Note that the 'অ?' icon becomes visible only after you select Assamese as the language as shown earlier on this page.

This Help page contains a clear chart showing the English letter(s) needed to produce each Assamese letter. To remove the Help page and get back to typing text, you can click on the "Cancel" button at the bottom of the Help page.

Additional usage hints

These are some more hints to help you use the tool.

* Remembering to type 'o' for vowel অ after a consonant:

The one thing to keep in mind in Pramukh is to type in the vowel we want after the consonant, even for the vowel অ which is 'o'.

For the vowel অ, this takes a little getting used to, because in handwriting there is nothing to write for অ; we simply write the consonant, for example ক. So, we may forget to type an 'o' after the consonant and will need to remind ourselves to do so in Pramukh.

There is one exception, at the end of a word, where typing a space is sufficient. There is no need to type 'o' at the end of the word.

Here is an example to illustrate the rule. To get the word আহক, we type "ahok " in Pramukh. The 'o' after 'h' is for the vowel অ following the consonant হ. The space after 'k' is sufficient for the vowel অ following the last consonant in the word, ক.

Other than this mental note to regularly type 'o' for the vowel অ, Pramukh has kept the English letter(s) needed to write an Assamese script letter fairly intuitive.

* Always type the vowel letter after the consonant:

For vowels other than অ, it is easy to remember to type it with the consonant, because even in handwriting, we do write something for the vowel sign. As an example, for ই, we write it as a vowel sign, like in কি. But, one thing to remember is that the letter for the vowel must always be typed in Pramukh after the consonant, even if the vowel sign in Assamese is written to the left of the consonant. For example, for কি, we have to type "ki", and not "ik".
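Both hints match how the text is stored in Unicode, which can be checked with a short Python sketch. This says nothing about how Pramukh works internally; it only inspects the resulting characters, and the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for each character, in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


ki = "\u0995\u09BF"          # কি, typed as "ki"
ahok = "\u0986\u09B9\u0995"  # আহক, typed as "ahok "

for word in (ki, ahok):
    print(word)
    for code_point, name in describe(word):
        print("   ", code_point, name)
```

For কি, the consonant ক is stored first and the vowel sign second, even though the sign is drawn to the left of the consonant, which is the same order as typing "ki". For আহক, the stored text has only three letters and no separate character for the vowel অ after হ or ক, just as nothing is written for it by hand.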

Use the detailed help page for more tips

To get more detailed instructions in using Pramukh IME for Assamese, please click on this link.

Saturday, September 29, 2012

Unicode support for Assamese

There have been many discussions in the last several months about a perception that Assamese is not well supported in Unicode, the industry standard for the fundamentals required on a computer and on the internet for text input, processing, and display/printing in a natural language. In my opinion, this perception of lack of support is misplaced. After some detailed study, I have concluded that Unicode provides clear support for the Assamese LANGUAGE, distinct from the Bengali language. This conclusion is based on a study of the current Unicode standard and on some prior knowledge and experience with Indic text processing[1]. There is an issue with the name of the script, as people have pointed out. But, as I will explain below, it is a relatively minor issue. The rest of this post is an attempt to explain the details behind these conclusions.

On support for the script

In all the discussions and writings on the topic of Assamese language support on the internet, I have noticed that there is a lot of confusion about the distinction between "script" and "language". Although the differences are subtle, Unicode, being a computer related standard, has to be very precise about that distinction.

For example, I am currently typing in the Roman script to write in the English language. But, since "je parle un peu français", I have used the same Roman script to write in the French language in the previous clause. Thus, two languages (English and French) use the same script (Roman).

Unicode specifies this distinction between scripts and languages as follows. It defines a CODE CHART for a script, and, separately, it defines a LOCALE for a language that uses that code chart. Thus, Unicode standardized the Latin code chart (to represent an extended version of what we call the Roman script), and then defined English, French, and other languages as locales that use that code chart.

But why did Unicode choose the same code chart for those languages? It is because about 90% of the written script is the same for English, French, Dutch, German, and other related languages. Noticing this, Unicode defined the Latin code chart to include ALL letters used in writing those languages. That way it takes fewer character codes to cover all the letters used by all those languages together. This follows from the natural economy of encoding that computer design always strives for, and it is a guiding principle that Unicode uses in all its language standards work.

So, it should not be surprising that Unicode would apply the same principle to the Assamese and Bengali languages. The scripts used by Assamese and Bengali have some differences, but about 90% of the letters used in the scripts are the same. Thus, it made sense for Unicode to define a COMMON code chart that includes ALL letters used by either Assamese or Bengali.

So, there was some method to how Unicode chose a common code chart for the two languages. But did Unicode do everything right for the code chart for Assamese and Bengali? No, Unicode chose a poor name for the code chart. By calling it the "Bengali" code chart, they hurt the sentiment of the Assamese people because of the past history of Assamese/Bengali language relations, which I do not plan to include in this technical discussion. Suffice it to say that, due to the efforts of several people to sensitize the standards body about the issue, Unicode is beginning to make some changes in its communication to be more explicit that Assamese letters are included in the code chart, despite its name as the "Bengali" code chart. Their website now lists it as the "Bengali and Assamese" code chart: http://www.unicode.org/charts/.
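The naming issue is easy to see with Python's standard `unicodedata` module, which reports the official character names. This is a small sketch; the helper name `chart_entry` is made up for this example, and the block range U+0980 to U+09FF is the range of the code chart that Unicode calls "Bengali".

```python
import unicodedata

# Range of the code chart named "Bengali" in the standard: U+0980..U+09FF.
BENGALI_BLOCK = range(0x0980, 0x0A00)


def chart_entry(ch):
    """Return (code point, official Unicode name, is it in the Bengali chart?)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ord(ch) in BENGALI_BLOCK)


assamese_ra = "\u09F0"  # ৰ
assamese_wa = "\u09F1"  # ৱ

for ch in (assamese_ra, assamese_wa):
    print(ch, *chart_entry(ch))
```

Both letters are present in the chart, so nothing is missing for Assamese, but their official names begin with "BENGALI LETTER", which is exactly the naming complaint.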

But there is still an open issue about the name in the actual standard itself. The code chart specification document still calls it the "Bengali" code chart, even though the website lists it as the code chart for "Bengali and Assamese". Can Unicode do more to make the actual name of the code chart in the specification more inclusive of both languages?

It certainly could, but the change would cause a lot of work for all software platforms that implement the current Unicode standard. This includes Microsoft Windows and all the Unix variants (FreeBSD, Linux, MacOS, Solaris, Android OS, iOS, to name some of them), for which software has been written to support the list of code chart names in the Unicode standard (Latin, Arabic, Bengali, etc.). Any change to any of those code chart names in the standard would require software developers for all those platforms to modify existing source code and release an update affecting billions of computers, tablets, smartphones, and other computing devices worldwide.

So, Unicode has to carefully weigh the benefit of such a name change for input, processing, or display of Assamese characters on computing devices before making a decision. It turns out that such a name change would bring no functional benefit for Assamese. That is because (1) all the character codes required to support input, processing, or display of Assamese are already defined in the current code chart, and (2) additional language characteristics such as alphabetical order, which are language specific, are already standardized elsewhere in Unicode for Assamese. There is more on the additional language characteristics in the next section, but it is important to understand that these are the reasons why the Unicode organization has not changed the name in the actual code chart, and has stopped at the step of making the website list it as "Bengali and Assamese".

On support for the Assamese language in Unicode

Note that language characteristics such as alphabetical order differ even among other related languages, so this is not a new problem for Unicode to consider. In the Swedish alphabet, there are three extra vowels placed at its end (..., X, Y, Z, Å, Ä, Ö), similar to the Danish and Norwegian alphabets, but with different written symbols and a different alphabetical order. For the Swedish, Danish, and Norwegian languages, Unicode has represented the differences outside of the code chart, in the locale. Similarly, the different alphabetical orders of Assamese and Bengali are also represented using the same mechanism, the locale.

So, how does Unicode provide support for the Assamese LANGUAGE? The specification for locales in Unicode is called the Unicode Common Locale Data Repository (CLDR): http://cldr.unicode.org/.
The repository defines the Assamese language completely, by defining many things including:
- the name of the locale; its short code is 'as' (note that the Bengali locale is different: 'bn'),
- its code chart for script,
- its collation order (alphabetical order; note that Bengali and Assamese have different collation requirements, especially since the Assamese 'ra' (ৰ, U+09F0) and 'wa' (ৱ, U+09F1) have larger code values than the other letters),
- its date/time format, etc.

These are specified in the CLDR core.zip file which can be downloaded from the repository link above. The files that represent the Assamese locale are as follows:
./common/casing/as.xml
./common/collation/as.xml
./common/main/as.xml
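The collation point in the list above can be illustrated with a short Python sketch. Sorting by raw code value puts ৰ and ৱ after all the other letters, which is why a language-specific collation order is needed. The five-letter order used below is only an illustrative fragment that I am assuming for the example; it is not read from the CLDR files, and a real application would rely on a CLDR-aware collation library such as ICU.

```python
# Illustrative fragment of an Assamese alphabetical order (an assumption for
# this sketch, not data taken from CLDR): য ৰ ল ৱ শ
ILLUSTRATIVE_ORDER = ["\u09AF", "\u09F0", "\u09B2", "\u09F1", "\u09B6"]
RANK = {ch: i for i, ch in enumerate(ILLUSTRATIVE_ORDER)}


def illustrative_key(word):
    """Sort key that ranks the listed letters by position and all others after them."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = ["\u09F0", "\u09B2", "\u09AF", "\u09F1", "\u09B6"]  # ৰ ল য ৱ শ

by_code_point = sorted(letters)                     # plain code-value order
by_order = sorted(letters, key=illustrative_key)    # tailored order

print("code point order:", " ".join(by_code_point))
print("tailored order:  ", " ".join(by_order))
```

With plain code-value sorting, ৰ (U+09F0) and ৱ (U+09F1) land at the end, after শ (U+09B6); with the tailored key they fall into their places among the other consonants.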

I will do a follow-up post to describe the contents of each of these files and how they represent the characteristics of the Assamese language.

Summary

In summary, with the existing code chart (even though it is not well named), and the Assamese ('as') locale already defined by Unicode, everything needed for reading and writing in Assamese on the internet already exists[2].

References

[1] I have some knowledge in this area based on work I did in the mid-1980s on Indic language text processing. Here is a link to a peer-reviewed published paper, written by me and colleagues, that describes the work: https://docs.google.com/open?id=0B7lCYS3yYAjhdEpsclFmVWlLQk0.

[2] There is a gap currently on the Internet for searching reliably in Assamese but there is nothing needed from the Unicode standard to fix that. The fixes are required in search engines like Google and Bing. I have described the problem, the solution, and a workaround in the write-up at this link: https://docs.google.com/document/d/1gcgX1hua22rpvFFtgG50PVJILq64fr3MwWoJU6o7VD0/edit.

[3] Here is a Firefox plug-in for the workaround to search in Assamese: http://mycroft.mozdev.org/search-engines.html?name=Google+%28as%29+Assamese+%28Adds+ৰ%29.