Collating Unicode Text
The following explains the issues that arise when collating text written in a script other than the Roman alphabet.
The fundamental problem for the collation software is that all submissions must agree on a single character set. If they do not, the collated output can look like garbage to every reader. The following is a list of some common character sets.
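- ASCII: seven-bit characters, with no accented letters or other diacritical marks.
- ISO-8859-1: eight-bit characters. The first half of the set is identical to ASCII. This is a common standard that covers many Western European languages.
- Unicode: a single character set intended to cover all scripts. It was originally designed around 16-bit characters, but code points now range up to U+10FFFF. The first 256 code points match the ISO-8859-1 character set. Unicode characters are commonly stored as one to four bytes each in a format called UTF-8.
- Windows-1252: like ISO-8859-1, but slightly different. It places printable characters such as curly quotes in the byte range 0x80 to 0x9F, which ISO-8859-1 reserves for control codes.
- MacRoman: like ISO-8859-1 in that the first half is ASCII, but quite different in the upper half, where accented letters are assigned to other byte values.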
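To make the disagreement between character sets concrete, the following Python sketch encodes one accented letter in several of the character sets listed above and then decodes UTF-8 bytes with the wrong character set. It is an illustration only and is not part of the collation software; the variable names are invented for the example.

```python
# Illustration only: one character, several byte representations.
ch = "\u00e9"  # e with acute accent

# Codec names as spelled in Python's standard library.
encodings = ["latin-1", "cp1252", "mac_roman", "utf-8"]
encoded = {name: ch.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# ASCII has no representation for this character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Reading UTF-8 bytes as if they were ISO-8859-1 yields two wrong characters.
garbled = ch.encode("utf-8").decode("latin-1")
print(repr(garbled))
```

The same letter is one byte in ISO-8859-1, a different byte in MacRoman, and two bytes in UTF-8, which is why mixed submissions collate badly unless everyone agrees on one character set.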
The collation software has two modes for dealing with character sets. Let us call them ASCII and Unicode. The LatinStudy groups use ASCII mode. If the collation software detects a non-ASCII character while in ASCII mode, it transforms that character into an ASCII character. For instance, a vowel with a hat (circumflex) over it is replaced with a capitalized ASCII vowel, which is the LatinStudy convention for a vowel with a macron. The collation software forces everything into ASCII because everyone can read ASCII.
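The ASCII-mode transformation could be sketched roughly as follows. This is not the collation script's actual code: the function name `to_latinstudy_ascii`, the treatment of other accents (dropped), and the use of `?` for characters with no ASCII base letter are all assumptions made for illustration.

```python
import unicodedata

# Combining circumflex and combining macron: both are treated here as
# marking a long vowel (an assumption based on the convention described above).
LONG_MARKS = {"\u0302", "\u0304"}

def to_latinstudy_ascii(text):
    """Hypothetical sketch: force text into ASCII, capitalizing long vowels."""
    out = []
    # NFC first, so a base letter plus a separate combining mark becomes one character.
    for ch in unicodedata.normalize("NFC", text):
        if ch.isascii():
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], set(decomposed[1:])
        if base.isascii() and base.lower() in "aeiouy" and marks & LONG_MARKS:
            out.append(base.upper())   # long vowel becomes a capital letter
        elif base.isascii():
            out.append(base)           # assumption: other accents are dropped
        else:
            out.append("?")            # assumption: no ASCII equivalent
    return "".join(out)

print(to_latinstudy_ascii("R\u00f4ma"))     # o with circumflex
print(to_latinstudy_ascii("am\u012bcus"))   # i with macron
```

Under these assumptions the two calls print `ROma` and `amIcus`.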
The GreekStudy groups use Unicode mode. In Unicode mode, the script does not transform “exotic” characters into ASCII. Instead it assumes that its input is UTF-8 encoded Unicode. Unicode mode is not specific to Greek; the collation software has also been used to collate Unicode-encoded Sanskrit (Devanagari script) and Hebrew.
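Because Unicode mode assumes UTF-8 input, a submission saved in some other character set will either fail to decode or come out garbled. A minimal check might look like the sketch below; the function name `decode_submission` is hypothetical, and whether the real script rejects bad input or handles it differently is not stated here.

```python
def decode_submission(raw):
    """Hypothetical sketch: accept a submission only if it is valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"submission is not valid UTF-8 (problem at byte offset {err.start})"
        ) from err

# A Greek submission saved as UTF-8 decodes cleanly.
greek = "\u03bc\u1fc6\u03bd\u03b9\u03bd".encode("utf-8")
print(decode_submission(greek))

# The same check rejects text that was saved as ISO-8859-1.
try:
    decode_submission("caf\u00e9".encode("latin-1"))
except ValueError as exc:
    print(exc)
```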
If the collation software is used for a language that does not use the Roman script, there are several possible options. It is up to the group's coordinator to choose one.
First, everyone could use a Unicode-capable word processor to generate Unicode-encoded text in UTF-8 format. Word 2003 can now generate polytonic Greek. The Unicorn text editor works for ancient Greek and Hebrew. There are many other similar software packages. As long as no one in the group is held back by obsolete software, this is the recommended approach.
A straight ASCII transliteration is also possible. For ancient Greek, a recommended transliteration is the Beta Code standard. Other languages have their own ASCII transliterations. Everyone can get this method to work with their existing software, although a language rendered in something other than its proper script looks rather poor.
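For readers unfamiliar with Beta Code, the sketch below converts a small subset of it (lowercase letters plus the common accent and breathing signs) into Unicode Greek. It is only an illustration of how the transliteration relates to the proper script: the function name `beta_to_greek` is invented, capitals (marked with `*` in Beta Code) are not handled, and parentheses in ordinary running text would be misread as breathings.

```python
import re
import unicodedata

# Beta Code letters and their Greek equivalents, in Greek alphabetical order.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw",
                   "\u03b1\u03b2\u03b3\u03b4\u03b5\u03b6\u03b7\u03b8"
                   "\u03b9\u03ba\u03bb\u03bc\u03bd\u03be\u03bf\u03c0"
                   "\u03c1\u03c3\u03c4\u03c5\u03c6\u03c7\u03c8\u03c9"))

# Beta Code diacritic signs mapped to Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}

def beta_to_greek(beta):
    """Hypothetical sketch: convert simple lowercase Beta Code to Unicode Greek."""
    chars = [LETTERS.get(c) or MARKS.get(c, c) for c in beta.lower()]
    # NFC combines each letter with the marks that follow it.
    text = unicodedata.normalize("NFC", "".join(chars))
    # A sigma that ends a word takes its final form.
    return re.sub("\u03c3(?!\\w)", "\u03c2", text)

print(beta_to_greek("mh=nin a)/eide qea/"))
print(beta_to_greek("lo/gos"))
```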
In the hybrid approach, some people submit Unicode while others submit an ASCII transliteration. Because every ASCII file is also a valid UTF-8 file, the collation script can combine the two sorts of submissions without any problem. Some coordinators have successfully used this approach on the GreekStudy list.
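The reason the hybrid approach works can be shown in a few lines of Python: ASCII bytes decode to the same text whether they are read as ASCII or as UTF-8, so ASCII and UTF-8 submissions can be joined and read as a single UTF-8 file. The `collate` function below is a hypothetical stand-in for the real script, which does more than concatenate.

```python
def collate(submissions):
    """Hypothetical sketch: join byte submissions and decode the result as UTF-8."""
    return b"".join(submissions).decode("utf-8")

# One participant submits an ASCII transliteration, another submits Unicode Greek.
ascii_part = "mh=nin a)/eide qea/\n".encode("ascii")
unicode_part = "\u03bc\u1fc6\u03bd\u03b9\u03bd \u1f04\u03b5\u03b9\u03b4\u03b5 \u03b8\u03b5\u03ac\n".encode("utf-8")

# ASCII bytes mean the same thing under either reading.
same_text = ascii_part.decode("ascii") == ascii_part.decode("utf-8")

combined = collate([ascii_part, unicode_part])
print(combined)
```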