Other Stuff (locale, collation...)


A locale (language and cultural data for Cambodia) has been proposed to the Common Locale Data Repository maintained by the Unicode Consortium. It consists of two files, which can be found here: km.xml and km_KH.xml

The latest version can be found on the server of the CLDR (Common Locale Data Repository).

Locale definition for GLIBC

After you download the source file, follow the instructions below. You must be the root user to do this!

These instructions are valid for Red Hat 9.0 and SUSE 9.3; other distributions may use different directories!

cp km_KH.UTF-8 /usr/share/i18n/locales/

cp /usr/share/i18n/charmaps/UTF-8.gz /tmp

gzip -d /tmp/UTF-8.gz

localedef -i /usr/share/i18n/locales/km_KH.UTF-8 -f /tmp/UTF-8 /usr/lib/locale/km_KH

rm -f /tmp/UTF-8

That is all. The Khmer Unicode locale file will be installed in the directory /usr/lib/locale/km_KH.

You can use the command locale -a to list the locales available on your system.
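
As a quick check after the steps above, the output of locale -a can be searched for the new entry. The sketch below is only an illustration: the helper name locale_in_list is made up here, and it assumes the locale is listed under a name starting with km_KH, as the target directory above suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original instructions):
# succeeds if any line on stdin starts with the given locale name.
locale_in_list() {
    grep -qi -- "^$1"
}

# Feed the system's locale list to the helper and report the result.
if locale -a 2>/dev/null | locale_in_list "km_KH"; then
    echo "km_KH is available"
else
    echo "km_KH was not found; re-check the localedef step"
fi
```

The match is case-insensitive and prefix-based, so a listing such as km_KH.utf8 would also count as present.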

Collation (sorting)

We have developed two collation sequences for Khmer: one based on the traditional Chuon Nat dictionary, and another based on the more modern Headley Khmer-English dictionary.
The source for a C program implementing them can be found here. The program sorts data contained in UTF-8 files and writes the sorted result to another UTF-8 file.
The collation sequence corresponding to the Chuon Nat ordering, in XML format, has been submitted to the Unicode Consortium.
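
Independently of that C program, the standard sort command orders lines according to the LC_COLLATE rules of the active locale, so a locale installed as described in the GLIBC section can in principle also be used for sorting from the shell. This is a sketch under assumptions: the function name sort_with_locale is invented here, and whether the km_KH locale file actually defines Khmer collation rules (and which of the two orderings, if any) is not stated on this page. If the named locale is missing, sort may fall back to plain byte order without any warning.

```shell
#!/bin/sh
# Hypothetical helper: sort a UTF-8 text file line by line using the
# collation rules of the given locale, writing the result to a new file.
#   $1 = locale name (for example km_KH)
#   $2 = input file
#   $3 = output file
sort_with_locale() {
    LC_ALL="$1" sort -- "$2" > "$3"
}
```

For example: sort_with_locale km_KH words.txt words_sorted.txt, where both file names are placeholders.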

Automatic insertion of ZWSP (word separation)

Jens Herden has developed a first version of a word breaking program for Khmer Unicode. The program goes through a Khmer Unicode text in UTF-8 format and inserts ZWSP characters between the words. It separates words using an internal dictionary (based on the Chuon Nat dictionary).

The program, which you can download here, is written in Java, so you need to have the Java Runtime Environment installed on your computer. It runs on any platform that has Java installed.

It can handle UTF-8 files, including HTML/XML files. It can also deal with simple RTF files (MS Word can save and read documents as RTF).

To learn how to use it, just type:

java -jar khwrdbrk.jar -o readme.txt -r

in a console (Linux) or the command prompt (Windows). Make sure that you are in the folder where khwrdbrk.jar is located. This will generate a file called readme.txt that contains the instructions.

A quicker way is to type:

java -jar khwrdbrk.jar -h

to get a short help message.
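
When testing the program, it can help to confirm that the output really contains ZWSP characters, which are invisible in most editors. The following sketch counts them in a UTF-8 file; ZWSP is U+200B, encoded in UTF-8 as the bytes E2 80 8B. The function name count_zwsp is made up for this example and is not part of the word breaking program.

```shell
#!/bin/sh
# Hypothetical helper: print the number of ZWSP (U+200B) characters in a file.
#   $1 = UTF-8 text file
count_zwsp() {
    # U+200B as raw UTF-8 bytes (octal 342 200 213).
    zwsp=$(printf '\342\200\213')
    # LC_ALL=C makes awk work on bytes; gsub returns the number of
    # matches it replaced on each line, which is summed in n.
    LC_ALL=C awk -v z="$zwsp" '{ n += gsub(z, "") } END { print n + 0 }' "$1"
}
```

For example: count_zwsp output.txt, where output.txt stands for a file produced by the word breaker. A result of 0 on Khmer text would suggest that no word breaks were inserted.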

If you test it and find any bugs or have any requests, please write to Jens so that he can improve it further.