This is a follow-up to an earlier post in which I tried to outline Unicode support for Assamese, concentrating mostly on the script aspects. It is instructive to also look at Unicode support for the Assamese language. In that earlier post, I mentioned the Unicode Common Locale Data Repository (CLDR) and briefly outlined how Unicode represents the distinctive attributes of the Assamese language as a Locale. This blog post looks at the same topic in further detail.
Here is the link to the Unicode page containing the core data for the Assamese language Locale:
https://st.unicode.org/cldr-apps/v#/as/Alphabetic_Information/7ed88347aa1b55ed
Note the following in the row named "Main Letters" in that table:
- Letter ক্ষ is present, denoted as {ক\u09CDষ}. It is located after হ, which is correct in the alphabetical order for the Assamese language
- Letter ৰ is present and is located between য়, denoted as {য\u09BC}, and ল. This location is correct in the alphabetical order for Assamese
- Letter র is not present, as expected, since that letter is not used in the Assamese language
- Letter ৱ is present and is located between ল and শ, which is correct in the alphabetical order for Assamese
Here is the link to the Unicode page containing the core data for the Bengali language Locale:
https://st.unicode.org/cldr-apps/v#/bn/Alphabetic_Information/7ed88347aa1b55ed
The following is the comparison of the row named "Main Letters" of the Bengali language locale with that of Assamese:
- Letter ক্ষ {ক\u09CDষ} is in a different place in alphabetical order -- between ক and খ -- which is correct for the Bengali language
- Letter ৰ is not present, as expected, since that letter is not used in the Bengali language
- Letter র is present and is located between য় {য\u09BC} and ল, which is correct in the alphabetical order for Bengali
- Letter ৱ is not present, as expected, since that letter is not used in the Bengali language
This comparison shows that the Unicode Locales for Assamese and Bengali clearly identify the two primary differences between the languages, namely:
- The specific letters of the script that are used and not used by each language, and
- The unique alphabetical order of each language
Further, since these differences are encapsulated by a standard Locale name for each language ('as' for Assamese, and 'bn' for Bengali), there is a clear way to distinguish Assamese language content from Bengali language content on computers and the internet, i.e. by marking the locale associated with the content.
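To make the two differences concrete, here is a minimal Python sketch. It uses only the letters discussed above, and the names (`ONLY_IN`, `ORDER`, `locale_specific_letters`, `sort_letters`) are hypothetical ones chosen for illustration. The order lists are tiny excerpts, not the full CLDR data, and real locale-aware sorting would normally be done with a collation library rather than by hand.

```python
# Code points for the letters discussed above (all in the Bengali code chart).
KA = "\u0995"                 # ক
KHA = "\u0996"                # খ
HA = "\u09B9"                 # হ
KSSA = "\u0995\u09CD\u09B7"   # ক্ষ, written as ক + U+09CD + ষ
RA_ASSAMESE = "\u09F0"        # ৰ
RA_BENGALI = "\u09B0"         # র
WA_ASSAMESE = "\u09F1"        # ৱ

# Letters that the comparison above lists for only one of the two locales.
ONLY_IN = {
    "as": {RA_ASSAMESE, WA_ASSAMESE},
    "bn": {RA_BENGALI},
}

# Deliberately tiny excerpts of each alphabetical order, just enough to
# show where ক্ষ sorts. These are NOT the full "Main Letters" lists.
ORDER = {
    "as": [KA, KHA, HA, KSSA],   # ক্ষ comes after হ
    "bn": [KA, KSSA, KHA, HA],   # ক্ষ comes between ক and খ
}


def locale_specific_letters(text: str) -> dict[str, list[str]]:
    """Return the locale-specific letters that occur in the text."""
    chars = set(text)
    return {loc: sorted(chars & letters) for loc, letters in ONLY_IN.items()}


def sort_letters(letters: list[str], locale: str) -> list[str]:
    """Sort letters by their position in the excerpted order for a locale."""
    return sorted(letters, key=ORDER[locale].index)


if __name__ == "__main__":
    print(locale_specific_letters(RA_ASSAMESE + "\u0982"))
    print(sort_letters([HA, KSSA, KHA, KA], "as"))
    print(sort_letters([HA, KSSA, KHA, KA], "bn"))
```

The same four letters come out in a different order depending on the locale passed in, which is the kind of distinction the locale data encodes.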
Why should we mark the locale associated with Assamese content?
As of writing this blog post, very little of the Assamese language content on the internet is marked with the Assamese locale. Marking content written in the Assamese language to be in the Assamese Locale will unambiguously distinguish it from content written in other languages that share the script, primarily the Bengali language.
A key benefit of doing this is that search spiders like Google or Bing will be able to record that the content they found during the spidering is in the Assamese locale, as distinct from Bengali.
Without a locale marking on a web page, search spiders (Google, Bing, etc.) make an educated guess about the language of the spidered content. They do this by inspecting the Unicode code points used in the page content.
For example, an Assamese language page that is not marked with the Assamese locale is guessed to be a Bengali language page. This is because the code points used on the page are in the Bengali code chart, and a much higher proportion of web pages using that code chart are in the Bengali language than in the Assamese language.
This is similar to how a French language page that is not marked to be in the French locale may end up being guessed as an English language page, since a much higher proportion of web pages using the Latin code chart are in English than in French.
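As a rough illustration of why code points alone cannot separate the two languages, the Python sketch below measures how much of a text falls in the Bengali code chart (U+0980 to U+09FF). The function name `bengali_block_share` is hypothetical, and this is only an assumption about the general idea; the actual detection methods used by search engines are not described here.

```python
# Range of the Unicode Bengali code chart, shared by Assamese and Bengali.
BENGALI_BLOCK_START = 0x0980
BENGALI_BLOCK_END = 0x09FF


def bengali_block_share(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(
        1 for ch in chars if BENGALI_BLOCK_START <= ord(ch) <= BENGALI_BLOCK_END
    )
    return in_block / len(chars)


if __name__ == "__main__":
    assamese_sample = "\u09F0\u0982"  # contains ৰ, which only Assamese uses
    bengali_sample = "\u09B0\u0982"   # contains র, which only Bengali uses
    # Both samples sit entirely inside the same code chart, so a check
    # based only on the chart cannot tell the two languages apart.
    print(bengali_block_share(assamese_sample))
    print(bengali_block_share(bengali_sample))
```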
With the locale clearly marked on Assamese pages as Assamese, search spiders will be able to correctly record and count the Assamese language pages. This will provide concrete data to Google and other internet providers to add full support for the Assamese language to display, search, and edit such Assamese language content, and also for new Assamese language content that gets added.
How can we mark a web page to be in a specific locale?
The HTML standard specifies how to mark a web page to be in a specific language locale. This is by including a lang attribute in the HTML element of the web page.
Concretely, to mark a web page to be in the Assamese Locale, we need to add the lang="as" attribute to the HTML element of that web page, i.e.
<html ... lang="as" ... >
Here is an example on one of my Assamese blog posts. The HTML element and the lang="as" attribute can be seen by viewing the source of the web page.
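More details of the lang attribute and locale codes can be found here: https://www.w3.org/TR/html4/struct/dirlang.html#h-8.1.1
and here: http://xml.coverpages.org/iso639a.html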
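To check whether a page already carries the marking, one option is a small script like the following Python sketch. It uses the standard library `html.parser` module; the class name `LangFinder` and the function `page_lang` are hypothetical names for this example, and the sample markup is made up.

```python
from html.parser import HTMLParser
from typing import Optional


class LangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.seen_html = False
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def page_lang(html_text: str) -> Optional[str]:
    """Return the lang value on the HTML element, or None if it is absent."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang


if __name__ == "__main__":
    marked = '<!DOCTYPE html><html lang="as"><head></head><body></body></html>'
    unmarked = "<html><head></head><body></body></html>"
    print(page_lang(marked))
    print(page_lang(unmarked))
```

A page for which this returns None is one that search spiders would have to guess the language of.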
How can we use locale to search for Assamese content?
Currently, Microsoft Bing has support to limit web searches to pages marked in the Assamese locale. Google does not currently have this support.
Support in Bing
Bing supports a 'language:' search operator that users can enter in the search field. This operator takes a language parameter (e.g. language:fr for French).
The behavior of this operator is to limit search results to the language specified ('fr' in the example above), by matching that value to the value of the 'lang' attribute (described in the previous section above) on web pages (and a couple of other language tags).
Bing supports 'as' (for Assamese) as a valid parameter for 'language:'.
A simple test is to try the following search query on http://bing.com:
language:as কবিতা
This search query on Bing currently returns three or four search results, because only a few Assamese web pages are marked with the lang="as" attribute (primarily in Assamese Wikipedia). Once more Assamese web pages are marked with the lang="as" attribute, those web pages will also be considered in the search results returned by queries using the "language:as" operator.
More details about the "language:" operator in Bing and the language codes it supports can be found here: https://msdn.microsoft.com/en-us/library/ff795616.aspx
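For anyone who wants to share or script such a search, the query can be turned into a link. The Python sketch below assumes the usual https://www.bing.com/search?q=... URL form, which is not stated in this post, and `bing_language_query_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed search endpoint; the post itself only mentions http://bing.com.
BING_SEARCH = "https://www.bing.com/search"


def bing_language_query_url(terms: str, lang: str = "as") -> str:
    """Build a search URL that prefixes the terms with the language: operator."""
    query = f"language:{lang} {terms}"
    # urlencode percent-encodes the query as UTF-8 and turns spaces into '+'.
    return BING_SEARCH + "?" + urlencode({"q": query})


if __name__ == "__main__":
    url = bing_language_query_url("কবিতা")
    print(url)
    # Decode the URL again to confirm the query survives the round trip.
    print(parse_qs(urlsplit(url).query)["q"][0])
```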
Lack of support in Google
Google does not currently have support to limit web searches to pages marked in the 'as' locale (for Assamese). Unlike Bing, Google does not support the 'language:' operator. It has other mechanisms to specify a language to limit searches. But those mechanisms do not currently have support for 'as' as a valid language value. Some mechanisms prevent 'as' from being specified. Other mechanisms silently ignore any 'as' specified, returning all matching search results, as if no language value was specified.
Interestingly, Google does not currently have support for 'bn' (for Bengali) as a valid language value either, in those search mechanisms. But the lack of support for 'bn' poses no practical problem when searching for Bengali language content. This is due to the much larger volume of Bengali language content relative to Assamese, and how the search algorithms rank the higher-volume content higher. However, the lack of support for 'as' in Google poses a lot of problems for searchers of Assamese language content. I have described the issues in some detail in this earlier write-up.
More details about the languages supported by Google search can be found here: https://developers.google.com/custom-search/docs/ref_languages
Conclusion
In conclusion, the call to action is to mark all Assamese web pages with the lang="as" attribute as described above. As more pages do that, effective searches for Assamese content on the web will become possible using Bing and the language:as operator and, in the future, on Google as well, if it can be lobbied to add the same level of support.