As I have intimated on more than one occasion, one of the challenges facing Unicode and WG2 is how to successfully encode historic scripts which mostly do not have a standard, well-defined repertoire and which frequently exhibit great variation in character repertoire and glyph forms geographically and/or chronologically. The problems are often exacerbated by the fact that different scholars may have very different opinions on how to encode the script and what names to use for the characters (people often get very hung up on names), and it can be exceedingly difficult to reconcile these differences.
When Unicode was first devised it was intended to accommodate all the scripts of the world in common modern usage, but as can be seen from Joe Becker's 1988 outline of the proposed Unicode standard, it was not envisaged that "obsolete or rare" scripts would be allowed into the Unicode repertoire :
Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.
Joe Becker, Proposal for the Unicode Standard (29th August 1988) page 5
Nearly ten years later, when Unicode had been around for nearly six years, there was still an antipathy in some quarters towards the encoding of rare and historic scripts, as can be seen from this position statement to SC2 by the Netherlands National Body (I just love the line about standardization bodies subsidizing academic research !) :
Market-relevance should guide selection of projects. This does not mean that academic preferences should be ignored, only that standards institutes, depending on industry contributions, cannot be expected to subsidize academic research. If Learned Societies want to raise their agreed conventions to the status of an International Standard, they should take the way of a Fast Track procedure, after having done the development themselves.
SC2 N2881 "Position of the Netherlands National Body (NNI) Regarding Further Development in JTC 1/SC 2" [1997-06-02]
Since the opening up of the supplementary planes this sort of attitude has thankfully become less prevalent, and most people involved in Unicode and 10646 have come to appreciate the importance to the scholarly community of being able to represent historical scripts (or even enigmatic script-like symbols) in electronic form. In many cases the encoding of an historic script is an important step towards greater understanding of the corpus of texts or even the decipherment of the script. As of Unicode 5.1 the following primarily historic scripts will have been encoded, in large part due to the single-handed dedication and hard work of Michael Everson :
- Ogham (Unicode 3.0)
- Runic (Unicode 3.0)
- Gothic (Unicode 3.1)
- Old Italic (Unicode 3.1)
- Tagalog (Unicode 3.2)
- Cypriot Syllabary (Unicode 4.0)
- Linear B (Unicode 4.0)
- Ugaritic (Unicode 4.0)
- Coptic (Unicode 4.1)
- Glagolitic (Unicode 4.1)
- Kharoshthi (Unicode 4.1)
- Old Persian (Unicode 4.1)
- Phags-pa (Unicode 5.0)
- Phoenician (Unicode 5.0)
- Sumero-Akkadian Cuneiform (Unicode 5.0)
- Carian (Unicode 5.1)
- Lycian (Unicode 5.1)
- Lydian (Unicode 5.1)
And under consideration for encoding are a number of other historic scripts, including :
- Anatolian Hieroglyphs
- Egyptian Hieroglyphs
- Imperial Aramaic
- Parthian, Inscriptional Pahlavi, and Psalter Pahlavi
- Tangut (multi-font code chart [17MB], standard code chart [10MB])
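To make the role of the supplementary planes a little more concrete, here is a minimal Python sketch that works out which plane each of the scripts encoded up to Unicode 5.1 (the first list above) lives in. The block start values are taken from the Unicode code charts rather than from anything in this post, and the names `BLOCK_STARTS` and `plane` are just labels chosen for the illustration.

```python
# First code point of the main block for each script in the first list above.
# These values come from the Unicode code charts, not from the post itself.
BLOCK_STARTS = {
    "Ogham": 0x1680,
    "Runic": 0x16A0,
    "Gothic": 0x10330,
    "Old Italic": 0x10300,
    "Tagalog": 0x1700,
    "Cypriot Syllabary": 0x10800,
    "Linear B": 0x10000,
    "Ugaritic": 0x10380,
    "Coptic": 0x2C80,
    "Glagolitic": 0x2C00,
    "Kharoshthi": 0x10A00,
    "Old Persian": 0x103A0,
    "Phags-pa": 0xA840,
    "Phoenician": 0x10900,
    "Sumero-Akkadian Cuneiform": 0x12000,
    "Carian": 0x102A0,
    "Lycian": 0x10280,
    "Lydian": 0x10920,
}


def plane(code_point: int) -> int:
    """Return the plane number of a code point (0 = BMP, 1 = SMP, ...)."""
    return code_point >> 16


# Group the scripts by the plane of their first code point.
by_plane = {}
for script, start in BLOCK_STARTS.items():
    by_plane.setdefault(plane(start), []).append(script)

for p in sorted(by_plane):
    print(f"Plane {p}: {', '.join(by_plane[p])}")
```

Most of the scripts in the list fall outside the Basic Multilingual Plane, which is one way of seeing why the opening up of the supplementary planes mattered so much for historic scripts.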
Scripts that were devised by a single person at a single point in time, such as Gothic and Phags-pa, generally have a clear-cut character repertoire, but it is often difficult to define the character repertoire of scripts that evolved over a long period of time, especially when they developed geographically distinct variants as in the case of Runic. In many cases it is difficult to even clearly define the limits of the script, and there may be arguments amongst experts as to whether different assemblages of inscriptions represent the same or different scripts, or whether a script that evolves over a long period of time should be treated as a single script or a number of distinct scripts in the same lineage. When this is the case, reaching a consensus on how best to encode a script (or even whether a script should be encoded separately) can be quite difficult. Matters are only made more difficult when a proposed script is an historic form of a living script, and users of the living script insist that the characters of the proposed script should be treated as glyph variants of the corresponding characters in the modern script. This was the case when Phoenician was proposed for encoding, and subscribers to the Unicode public mailing list will remember the endless vitriolic arguments between pro-encoders (Phoenician is a separate script in its own right and should be encoded separately from Hebrew) and anti-encoders (Phoenician is just an historical variant of Hebrew that should be dealt with at the font level not the character encoding level).
Which brings me in a roundabout way to "Old Hanzi" 古漢字 (hànzì 漢字 being the Chinese word for a Chinese character or "ideograph", equivalent to the Japanese word kanji). Like other long-lived scripts, the Chinese script is best viewed as a script continuum which evolved by stages to the modern form. Up until a few years ago I think that it was generally assumed within Unicode circles that ancient forms of the Chinese script should be dealt with at the font level rather than at the encoding level, but there was pressure from within China to encode at least the most important early forms of the Chinese script, resulting in an agreement in 2003 to initially encode three important nodes in the Chinese script continuum (the links are to encoding samples for each script prepared by the Chinese National Body) :
- Oracle Bone Script (jiǎgǔwén 甲骨文)
- Bronze Inscription Script (jīnwén 金文)
- Small Seal Script (xiǎo zhuàn 小篆)
Oracle Bone Script
No-one knows for sure when or where the Chinese script was devised, although a number of neolithic sites dating from as early as about 6600 BC up to about 2000 BC have yielded examples of individual symbols carved in isolation on tortoise shells or pottery shards that may or may not be early forms of Chinese characters (personally, I am quite sceptical that any of these marks are directly related to the Chinese script). However, the earliest undisputed stage in the Chinese script continuum that we have evidence for is the Oracle Bone Script (jiǎgǔwén 甲骨文), which was used for divination inscriptions in the royal court of the Shang 商 dynasty at the capital Yin 殷 (near modern Anyang in Henan province) during the period 1300-1050 BC (a few examples of inscribed oracle bones dating to the early Western Zhou period have also been found at a number of other sites).
A question, or more frequently a series of parallel questions, is asked by a specialist diviner, and the answer divined by applying intense localised heat to the shell or bone and observing the pattern of the resultant cracks and/or the sound that the cracks make (the character bǔ *pŏk 卜 "to divine" both graphically represents a crack, and onomatopoeically represents the sound of a crack being made). The question (usually prefixed by the cyclic day on which the divination took place and the name of the diviner) as well as the resultant prognostication are then inscribed on the shell or bone, and the object archived, so that thousands of years later archaeologists can unearth them and learn all about the daily ritual of court life in the Shang dynasty. Many thousands of inscribed oracle bones from the ancient capital of the Shang dynasty have been preserved, and they indicate that every aspect of royal life, from toothache to warfare, was governed by a complex cycle of divination and ritual.
An Oracle Bone Inscription on an Ox Scapula
Historical Relics Unearthed in New China 新中國出土文物 (Foreign Languages Press, 1972) plate 37.
The above oracle bone was discovered in 1955 southeast of the site of the ancient capital of Yin, and dates to the third of the five periods into which oracle bone inscriptions are classified. The inscription itself comprises a compound question inscribed in a single column :
On the cyclic days ding mao [Day 4] and gui hai [Day 60] it was divined: "Should the King enter the city of Shang, and on the cyclic day yi chou [Day 2] should the king not perform the hui rite ?"
The resultant prognostication, 弘吉 "very auspicious", is incised by the crack marks to the left of the question.
Bronze Inscription Script
The next stage in the history of the Chinese script is the Bronze Inscription Script (jīnwén 金文), which is a form of the Chinese script that was used for inscriptions on bronze bells and vessels. A few very short inscriptions on Shang dynasty bronze vessels (mostly little more than the name of the vessel's owner) have been found, but the vast majority of bronze inscriptions date to the succeeding Zhou dynasty (circa 1050 to 256 BC). Because of the long period during which these bronze inscriptions were made there is quite a large variation in the style of characters used. The characters found on the earliest bronze inscriptions from the Shang and early Zhou dynasties are very similar in form to those found on oracle bones (although as would be expected, oracle bone characters are generally more angular and often simpler than the corresponding bronze inscription characters due to the difficulty of inscribing characters on a hard medium such as bone and shell). Bronze inscriptions from the later period are much less closely related to the oracle bone script and are more closely related to the Small Seal script.
The Xing Hou gui 邢侯簋 ...
Chinese Bronzes: Art and Ritual (British Museum, 1987) plate 25.
... and its Inscription
Chinese Bronzes: Art and Ritual (British Museum, 1987) rubbing 10.
This is a very famous example of a ritual vessel for offering food known as a gui 簋 that was unearthed at Luoyang 洛陽 in 1921, and is now at the British Museum. The vessel dates to the early or middle Western Zhou period, and has quite a long and rather difficult-to-read inscription that seems to record the grant of men to the Marquis of Xing (Xing Hou 邢侯), and is dedicated to his famous ancestor, the Duke of Zhou (Zhou Gong 周公), brother of the first ruler of the Zhou dynasty ("〇" represents an undeciphered or unencoded character) :
Small Seal Script
The Small Seal Script (xiǎo zhuàn 小篆) was adopted by the First Emperor (Qin Shi Huang 秦始皇) as the standard script of the Qin dynasty (221-206 BC). It developed from the characters used for inscriptions during the latter part of the Zhou dynasty, and so many late Zhou bronze inscriptions are written with characters that are much closer in style to the small seal script than to the early Zhou bronze inscription script. By the time that the small seal script had developed the Chinese writing system had adopted the radical/phonetic method of character composition, and so the vast majority of small seal characters correspond directly to a modern character.
The main source for the Small Seal script repertoire will be editions of the Shuowen 說文 dictionary that was compiled by Xu Shen 許慎 in about the year 100. The illustration below shows a page from the table of 540 radicals at the beginning of a modern edition of Xu Shen's dictionary :
Table of Radicals in the Shuowen Dictionary
Shuowen Jiezi 說文解字 (Zhonghua Shuju, 1963) page 3.
You might have thought that the decision to encode these three historic script forms of Chinese would have led to the same level of complex debate and bitter argument that we saw for Phoenician, especially as the result of this decision will be to add many thousands more characters to Unicode, but there hasn't been a squeak. So here are my thoughts about some of the issues involved.
The first thing to realise is that the oracle bone script is quite different from the modern Chinese script in several respects, and that a large percentage of oracle bone characters remain undeciphered or do not correspond directly to any modern character. One of the reasons for this is that the method of composing characters by combining radical and phonetic elements, which is used for the majority of modern Chinese characters, is little used in the oracle bone script, with the result that a character that in the later script is written as a radical/phonetic compound may have been written in the earlier script as a completely different unitary character, which is unrecognisable to modern eyes.
The oracle bone script also makes use of compound characters, in which two separate characters are combined into a single glyph. For example the character jiǎ 甲 in oracle bone script is written as a cross (like 十), but the titles of the royal ancestors Shang Jia 上甲 "Upper Jia" and Xiao Jia 小甲 "Little Jia" are not written as a sequence of two characters shàng plus jiǎ and xiǎo plus jiǎ respectively, as would be expected according to the principles of the modern Chinese script, but Shang Jia is written as a cross (= 甲) in a square box, and Xiao Jia is written as a cross (= 甲) with a dot in each of the four corners (however Da Jia 大甲 "Big Jia" is written as a sequence of the two characters dà plus jiǎ). Likewise, the titles of the royal ancestors Bao Yi 報乙, Bao Bing 報丙 and Bao Ding 報丁 are each written as a sideways bowl shape (similar to a reversed "C") representing the character bào 報 with the second character of the title (yǐ 乙, bǐng 丙 or dīng 丁) enclosed within. All these compound characters are shown in the oracle bone inscription below, which is also a good example of how it is currently impossible to represent many oracle bone inscriptions accurately, and why most authors working with oracle bone texts write the characters out by hand (those characters that are currently unencoded are represented by "〇", although a couple of them are in the pipeline for CJK-D, including ⿰酉彡 for the character which looks like but isn't 酒) :
Complex numbers may also be written as compound characters, so that for example the numbers "50", "60", "70", "80" and "90" are represented by the characters for "5" (五), "6" (六), "7" (七), "8" (八) and "9" (九) with the character for "10" (十, written as a vertical line in oracle bone script) joined from above.
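The structure of these compound numerals can be sketched as a tiny data model. This is only an illustration in Python: the function name `compound_ten` and the (upper, lower) tuple are assumptions made for the sketch, and modern characters stand in for the oracle bone glyphs, which cannot yet be represented as encoded text.

```python
from typing import Tuple

# Modern characters standing in for the oracle bone digit glyphs.
DIGITS = {5: "五", 6: "六", 7: "七", 8: "八", 9: "九"}
TEN = "十"  # written as a single vertical line in oracle bone script


def compound_ten(value: int) -> Tuple[str, str]:
    """Describe the compound character for 50, 60, 70, 80 or 90.

    Returns the two components as (upper, lower): the character for "10"
    is joined to the digit from above, giving one glyph rather than a
    sequence of two characters.
    """
    tens, units = divmod(value, 10)
    if units != 0 or tens not in DIGITS:
        # Only the multiples of ten mentioned in the text are modelled.
        raise ValueError(f"no compound form described for {value}")
    return (TEN, DIGITS[tens])


for n in (50, 60, 70, 80, 90):
    upper, lower = compound_ten(n)
    print(f"{n}: {upper} joined above {lower}")
```

The point of modelling it this way is that the result is a single unit with two components, which is exactly what a plain sequence of CJK Unified Ideographs cannot express.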
In a few cases there is a complete disjuncture between the character used in the oracle bone script and the corresponding modern Chinese character. For example in the oracle bone script the character used to represent the first of the twelve earthly branches does not correspond to the modern character for the first earthly branch (zǐ 子) but is written with a completely unrelated glyph of unknown meaning; whereas the oracle bone character used to represent the sixth earthly branch (which is sì 巳 in modern Chinese) is actually written with the character for zǐ 子 "son". Thus zǐ 子 is the first earthly branch in the modern Chinese script but the sixth earthly branch in the oracle bone script.
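The consequence for anyone trying to read oracle bone text through a modern-character mapping can be shown with a short Python sketch. It models only the one exception described above; the function names are made up for the illustration, and treating every other branch glyph as matching its modern character is a simplifying assumption, not a claim about the script.

```python
# The twelve earthly branches in modern order (branch 1 is the first character).
MODERN_BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def branch_number_modern(char: str) -> int:
    """Branch number (1..12) of a character in the modern script."""
    return MODERN_BRANCHES.index(char) + 1


def branch_number_oracle_bone(glyph: str) -> int:
    """Branch number of a glyph as used in oracle bone inscriptions.

    Only the exception described in the text is modelled: the glyph for
    子 "son" writes the sixth branch. The glyph actually used for the
    first branch has no modern counterpart, so it cannot appear here.
    All other glyphs are ASSUMED to behave like their modern forms.
    """
    if glyph == "子":
        return 6
    return branch_number_modern(glyph)


print(branch_number_modern("子"))       # first branch in the modern script
print(branch_number_oracle_bone("子"))  # sixth branch on the oracle bones
```

A font that simply draws the oracle bone "son" glyph at the code point for 子 would therefore give a text that looks right but has the wrong cyclic day when read as modern Chinese.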
These sorts of issues are the reason why I think that it is not practical to treat the oracle bone script simply as a stylistic variant of the modern Han script. The fact that a majority of oracle bone characters either have no known counterpart in the modern Chinese script or are significantly different from the corresponding modern Chinese character with respect to their glyph composition also makes it very difficult to represent oracle bone script text using CJK Unified Ideographs and a suitable oracle bone style font that maps oracle bone glyphs to the corresponding modern characters (in many cases the mapping just does not exist). However, it has to be said that many artificial modernised versions of oracle bone and bronze inscription characters have been encoded already or are proposed for encoding (there are 367 characters in CJK-C and 1,481 characters in CJK-D that are derived from Yinzhou Jinwen Jicheng Yinde 殷周金文集成引得 [Concordance of Shang and Zhou Dynasty Bronze Inscriptions]). And it could be argued that if the encoding of artificial modernised forms of ancient characters is extended so that all ancient characters can be mapped to an encoded CJK Unified Ideograph then there would be no need to encode the oracle bone script separately. But the counterargument is that artificial modernised forms of ancient characters can only be encoded if they are attested, and not all oracle bone script or bronze inscription script characters have been or probably ever will be represented with artificial modernised forms (and often it is almost impossible to devise a modern form of an oracle bone script character). Another argument against this approach is that different scholars may modernise a character differently, so that there may be multiple artificial modern forms for the same oracle bone character.
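The two objections to the font-level approach (no modernised form at all, or several competing ones) amount to saying that the mapping from oracle bone characters to modern characters is not a function. A minimal Python sketch of that check follows. The glyph identifiers and the "modern-form-A/B" strings are entirely hypothetical placeholders; only ⿰酉彡 is taken from the text above.

```python
def classify(modernisations):
    """Sort oracle bone glyph IDs by how many modernised forms they have.

    `modernisations` maps a glyph ID to the set of modernised forms that
    scholars have used for it. A font-level mapping only works for the
    glyphs that end up under "unique".
    """
    report = {"unmapped": [], "unique": [], "ambiguous": []}
    for glyph_id, forms in sorted(modernisations.items()):
        if not forms:
            key = "unmapped"      # no modernised form has been devised
        elif len(forms) == 1:
            key = "unique"        # exactly one agreed modernised form
        else:
            key = "ambiguous"     # scholars modernise it differently
        report[key].append(glyph_id)
    return report


# Hypothetical sample data, for illustration only.
sample = {
    "OB-0001": {"⿰酉彡"},
    "OB-0002": set(),
    "OB-0003": {"modern-form-A", "modern-form-B"},
}

print(classify(sample))
```

Only the "unique" bucket could be handled by an oracle bone style font over CJK Unified Ideographs; everything in the other two buckets needs some other solution.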
A further problem that scholars of ancient Chinese inscriptions face is that most oracle bone and bronze inscription characters occur in a variety of different glyph forms, often composed using different combinations of component elements, and scholars want to be able to represent these significant glyph differences at the encoding level. Just picking at random the character for "spring" (chūn 春), it occurs in at least five distinct glyph forms :
Each of these five forms of the character is written with a different set of components, and they are thus not unifiable according to the rules of CJK unification. It is to be expected that when the oracle bone script repertoire is eventually submitted for encoding it will contain separate characters for each of these forms (and probably also for other less common forms of the character).
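As a rough sketch of the reasoning, the following Python groups variant forms by the set of components they are built from, on the simplified assumption that forms with different component sets cannot be unified. The real CJK unification rules also take structure and other factors into account, and the letters A, B and C are stand-ins for components, not a transcription of the actual forms of 春.

```python
from collections import defaultdict


def group_by_components(variants):
    """Group variant names by the set of components each is written with.

    Each resulting group would need its own encoded character under the
    simplified rule "different components means not unifiable". Using a
    frozenset ignores both the order and any repetition of components.
    """
    groups = defaultdict(list)
    for name, components in variants.items():
        groups[frozenset(components)].append(name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical variants; A, B and C are placeholder component labels.
variants = {
    "form-1": ["A", "B"],
    "form-2": ["B", "A"],        # same components as form-1
    "form-3": ["A", "C"],
    "form-4": ["A", "B", "C"],
}

groups = group_by_components(variants)
print(f"{len(groups)} separate characters needed: {groups}")
```

With five forms that each use a different set of components, every form would land in its own group, which is why each is expected to be proposed as a separate character.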
I personally think that encoding the oracle bone script separately from the ordinary Han script is the only way for scholars to be able to work with oracle bone script texts, and I am looking forward to seeing it encoded as soon as possible. The same arguments that I have used to support the encoding of the oracle bone script may also be used for the bronze inscription script, although it could be argued that due to the similarity between the characters on early bronze inscriptions and oracle bone inscriptions it would have been better (or at least more economical) to combine the two scripts at the encoding level so as to avoid encoding duplicate versions of characters that are used in both oracle bone and bronze inscriptions.
I haven't said much about the small seal script, mainly because the issues of character identity that affect the oracle bone and bronze inscription scripts largely do not apply to the seal script. There is a high level of correspondence between small seal characters and modern Chinese characters, so I think that it is quite possible to deal with small seal script satisfactorily at the font level. Nevertheless, I don't have any strong objections to seeing the small seal script encoded as a separate script if that is what the user community wants.