Khmer Software Initiative

 

Adaptation of Khmer to Open Source Software

 

 
 

Using Khmer script (like many other Indic scripts) on computers is more complicated than using languages written with standard Latin encoding (such as Spanish or French): fonts have to be interpreted, characters have to be placed in the right position (reordered), and many exceptions have to be handled. Special software is needed to handle Khmer and other Indic languages.
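The reordering mentioned above can be seen in the Unicode encoding itself. The following is a minimal Python sketch, assuming the word "Khmer" written in Khmer script as the example. It lists the code points in the order in which they are stored: the vowel sign is stored after the consonant cluster even though it is drawn to its left, and the subscript consonant is stored as a COENG sign followed by a base letter. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The word "Khmer" in Khmer script, spelled out with escapes so that the
# stored order is explicit: KHA, COENG, MO, vowel sign AE, RO.
word = "\u1781\u17D2\u1798\u17C2\u179A"

for code, name in describe(word):
    print(code, name)
```

A rendering engine has to turn this stored sequence into the correct visual arrangement, which is the work that libraries such as Uniscribe, ICU and Pango perform.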

There are several high quality user interfaces (UI) already developed in OpenSource Software (these interfaces are the equivalent of the Windows desktop in a Microsoft environment, the program that allows users to access the different applications). Some of them seem well adapted to the goals of this project: they already support many Indic complex scripts (such as Thai, Hindi or Kannada, which are not very different from Khmer in complexity) and are being translated into many languages, so the mechanics of translation are very well developed. Using one of these interfaces, which can easily integrate the capability of using Khmer script, seems to be a good technical solution. It would permit a low-risk adaptation of an interface that will be maintained and improved by the worldwide computing community.

Besides the user interface, support for Khmer Script also needs to be developed independently for office applications, Internet applications and some utilities. As with user interfaces, the state of internationalization of office and Internet applications is already very high, including support for some Indic languages, which simplifies the work.

The last three years have seen important advances in the handling of Indic language scripts by OpenSource Software.

Unfortunately, as Khmer was not yet standardized in Unicode (and no fonts were available), Khmer was not included in these developments, which embraced Indian languages such as Hindi, Kannada or Tamil, as well as other languages that are written using Indic scripts, such as Thai.

Khmer is now handled almost correctly by fully up-to-date versions of Windows XP using the latest version of Uniscribe (the Microsoft rendering engine, Usp10.dll, which is not included in the standard XP distribution). It may also work correctly in Windows 2000. For more information, see our page on this subject, or look here or here. MS Word still has some problems but, reportedly, MS Publisher handles Khmer very well. Other MS applications, such as PowerPoint, still have important problems with Khmer. Many Win32 (Microsoft Windows) versions of OpenSource software get language support from Uniscribe. Mozilla and OpenOffice handle Khmer correctly under Windows 2000 and XP.

Different projects aim at allowing OpenSource Software to handle all the languages of the world (I18n[1] or Internationalization projects) under Linux and to handle local date, currency and other formats (Localization or l10n). Many of these projects are being supported by major computer manufacturers such as IBM or Sun Microsystems.

The most global of these projects may be the OpenI18N group, which “aims to provide a common open-source environment where applications can be executed and behave correctly worldwide, with different scripts, cultures and languages.”

Here are some specific implementations that are required for the KhmerOS initiative:

The ICU project, managed by IBM, has included support for many Indic languages. ICU gives script layout support to the OpenOffice suite. No work has been done yet to implement Khmer in ICU, but the work done in Pango opens the way for the implementation (the two are very similar). See our page on ICU.

A modified form of the ICU libraries[2] is used to give support to Pango, a rendering infrastructure that is used in high level interfaces such as Gnome (see our page on Gnome and Pango) and is partially used in the Mozilla browser. The problem with using Pango is that the printing modules for Gnome still do not use it, so being able to display Khmer on screen does not imply being able to print it. This is the situation for now, but the Gnome-print maintainer is working on integrating Pango and Gnome-print.

Gnome (together with other Pango based applications) has been preliminarily chosen as the user interface for this project. Once implemented, Pango also gives support to quite a number of applications in the Gnome environment, including the Evolution e-mail/agenda tool, the Gimp graphic editor and some multimedia utilities.

When the time for implementation comes, it will be very important to make sure that the user interface chosen handles Khmer correctly both on screen and in printing. If Gnome is not prepared to handle Khmer correctly, another interface will have to be selected.

The OpenSource alternative for a user interface seems to be KDE, which receives support from Qt. It could be an option for this initiative if Gnome proved not to work 100% well with Khmer. This change would require replacing the set of tools included in the project with tools that use the KDE toolkit.

In relation to these developments, it is necessary to develop locales for Khmer. A locale is a data file that contains information about date formats, number formats, sorting and other cultural information, so that when dates and other data are printed, they follow local conventions. See ICU in the status pages for what is happening with locales.
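As a rough illustration of what a locale provides, the minimal Python sketch below formats a date using a small locale record. The record `KM_LOCALE`, its field names and the day/month/year order are assumptions made for this example only; real locale files hold far more data in their own formats. The Khmer digits themselves (U+17E0 to U+17E9) are standard Unicode characters.

```python
from datetime import date

# Hypothetical, simplified locale record (not a real locale file format).
KM_LOCALE = {
    "date_order": ("day", "month", "year"),  # assumed for illustration
    "date_separator": "/",                   # assumed for illustration
    # Khmer digits zero to nine, U+17E0..U+17E9
    "digits": "".join(chr(0x17E0 + i) for i in range(10)),
}

def localize_digits(text, locale):
    """Replace ASCII digits with the digits defined by the locale."""
    table = str.maketrans("0123456789", locale["digits"])
    return text.translate(table)

def format_date(d, locale):
    """Format a date using the field order and separator of the locale."""
    parts = {
        "day": f"{d.day:02d}",
        "month": f"{d.month:02d}",
        "year": str(d.year),
    }
    plain = locale["date_separator"].join(parts[k] for k in locale["date_order"])
    return localize_digits(plain, locale)

print(format_date(date(2004, 10, 22), KM_LOCALE))
```

Keeping the conventions in data rather than in code is what lets the same application follow local conventions for any language that has a locale.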

It is also necessary to develop a "dictionary" of Khmer words (a word list). This dictionary is used for spell checking, for indicating to the word processor and other programs where they may hyphenate or end a line of text (as in Khmer no spaces are inserted between words), and also for dictionary-based ordering (instead of rule-based ordering, as the official dictionary is not always systematic). The same dictionary format is used by OpenOffice and Mozilla (e-mail). A synonym dictionary should be considered in later versions of the software. See the status page for developments.
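To show how a word list can indicate where a line may be broken, here is a minimal Python sketch of greedy longest-match segmentation. This is one simple technique chosen for illustration; it is not claimed to be the method used by OpenOffice or Mozilla. The function name `segment` and the two-word list are assumptions for the example.

```python
def segment(text, words):
    """Split text into words by greedy longest match against a word list.

    Characters that do not start any known word are emitted one at a time.
    """
    longest = max((len(w) for w in words), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in words:
                break
        else:
            candidate = text[i]  # unknown character: emit it on its own
        result.append(candidate)
        i += len(candidate)
    return result

# Tiny illustrative word list, written with escapes.
LANGUAGE = "\u1797\u17B6\u179F\u17B6"    # "language"
KHMER = "\u1781\u17D2\u1798\u17C2\u179A"  # "Khmer"
WORDS = {LANGUAGE, KHMER}

# "Khmer language" is written without a space between the two words.
pieces = segment(LANGUAGE + KHMER, WORDS)
print(len(pieces), "words found; a line break is allowed between them")
```

A real implementation would need a much larger word list and a strategy for ambiguous matches, which greedy matching does not resolve.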

Next, ordering algorithms have to be developed. One, following governmental indications, should follow the Chuon Nath dictionary order (which is not systematic). Words not in the dictionary will be classified according to specific ordering rules. A second algorithm with similar rules, but not taking into consideration the specificities of the Chuon Nath dictionary, also needs to be developed.
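The idea of combining a dictionary order with a rule-based fallback can be sketched in Python as follows. This is only an illustration under strong assumptions: the word list uses Latin placeholder words rather than real Chuon Nath entries, the fallback rule is plain code point order, and unknown words are simply placed after all dictionary words, whereas a real algorithm would interleave them. The name `make_sort_key` is hypothetical.

```python
def make_sort_key(dictionary_order):
    """Build a sort key from a word list given in dictionary order."""
    rank = {word: i for i, word in enumerate(dictionary_order)}

    def key(word):
        if word in rank:
            # Known word: use its position in the dictionary.
            return (0, rank[word], "")
        # Unknown word: fall back to a rule (here, code point order).
        return (1, 0, word)

    return key

# Placeholder dictionary whose order is deliberately not alphabetical,
# standing in for a dictionary order that is not systematic.
dictionary_order = ["gamma", "alpha", "beta"]
key = make_sort_key(dictionary_order)

print(sorted(["beta", "zeta", "alpha", "gamma", "delta"], key=key))
```

The second algorithm described above would correspond to using only the rule-based part of the key for every word.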

Some software development has to be done in order to have all these applications work in Khmer in an OpenSource operating system such as Linux.

These developments can be done by the “maintainer” of the application, a person who usually does this work of his/her own will and on a non-profit basis. This person knows the program very well, so the effort required is not too large. The problem is that this person usually has a day job that leaves very little time for this work, and has many other priorities as a maintainer. The developments for Khmer, if done by maintainers, could take a long time.

They can also be done by a volunteer or a student in a Cambodian university who will write the necessary code. This solution could be used in case the project does not have enough funds, but it can also take a long time, as volunteers and students are not always available and have other priorities.

The third solution is to contract the development to a person or company who will take care of it. This person needs to learn how the program works, find similar developments, adjust them for Khmer and add them to the standard code of the applications.

Please look at the status page for work that has already been done.

Another project to keep an eye on is FreeType. The project maintains FreeType 2, “a software font engine that is designed to be small, efficient, highly customizable and portable while capable of producing high-quality output (glyph images). It can be used in graphics libraries, display servers, font conversion tools, text image generation tools, and many other products as well”.

There are a couple of other projects that so far do not seem to be moving much, but should be watched closely. They have not yet produced any results of interest for this project. They are: The Indian GNU/Linux Project (“The goal of this project is to create a Linux distribution that supports Indian Languages from a GUI/Application level as well as Kernel level”) and The Indic-Computing Project (“We create open-source infrastructural code, and provide technical documentation on Indian language computing issues. Our mailing lists provide forums where Indian language computing can be discussed”).


[1] I18n is short for Internationalization, because there are 18 letters between the first I and the last N. L10n is short for Localization.

[2] The problem with this is that ICU in Pango is not maintained. IBM may consider restructuring ICU to make integration into Pango easy, but they do not know when, or if, this will happen. For now, including Khmer in ICU would not automatically include it in Pango; adaptations would have to be made.

 

Page Last Updated: Friday, 22 October 2004

For any comments on the web, please contact the webmaster of this domain