Character Set

Character Encoding

The POSIX locale contains the characters in Portable Character Set, which have the properties listed in LC_CTYPE. Implementations may also add other characters. In other locales, the presence, meaning and representation of any additional characters are locale-specific.

In locales other than the POSIX locale, a character may have a state-dependent encoding. There are two types of these encodings: single-shift encodings and locking-shift encodings.

While in the initial shift state, all characters in the portable character set retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state. A byte with all bits zero is interpreted as the null character independent of shift state. Thus a byte with all bits zero must never occur in the second or subsequent bytes of a character.

The maximum allowable number of bytes in a character in the current locale is indicated by MB_CUR_MAX, defined in the POSIX specification of <stdlib.h>, and by the <mb_cur_max> value in a character set description file; see Character Set Description File. The implementation's maximum number of bytes in a character is defined by the C-language macro {MB_LEN_MAX}, which is provided by <limits.h>.

C Language Wide-character Codes

In the shell, the standard utilities are written so that the encodings of characters are described by the locale's LC_CTYPE definition (see LC_CTYPE) and there is no differentiation between characters consisting of single octets (8-bit bytes), larger bytes, or multiple bytes. In the C language, however, a differentiation is made. To ease the handling of variable-length characters, the C language has introduced the concept of wide-character codes.

All wide-character codes in a given process consist of an equal number of bits. This is in contrast to characters, which can consist of a variable number of bytes. The byte or byte sequence that represents a character can also be represented as a wide-character code. Wide-character codes thus provide a uniform size for manipulating text data. A wide-character code having all bits zero is the null wide-character code, and terminates wide-character strings. The wide-character value for each member of the Portable Character Set will equal its value when used as the lone character in an integer character constant. Wide-character codes for other characters are locale- and implementation-dependent. State shift bytes do not have a wide-character code representation.
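As a loose illustration only, the following Python sketch contrasts the variable number of bytes in an encoded character with the single integer code assigned to each character. It is an analogy, not the C interface described here: Python code points stand in for wide-character codes, and the choice of UTF-8 as the multi-byte encoding is an assumption made for the example.

```python
# Analogy only: Python code points stand in for wide-character codes, and
# UTF-8 (an assumed encoding) stands in for the locale's multi-byte encoding.

def byte_lengths(text, encoding="utf-8"):
    """Return the number of bytes each character occupies when encoded."""
    return [len(ch.encode(encoding)) for ch in text]


def code_values(text):
    """Return one integer code per character, regardless of encoded length."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    sample = "a\u00e9\u20ac"  # 'a', e with acute accent, euro sign
    print(byte_lengths(sample))  # encoded lengths differ per character
    print(code_values(sample))   # exactly one integer per character
```

The point of the comparison is that the second list always has one uniformly sized entry per character, which is the property wide-character codes provide to C programs.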

Character Set Description File

Implementations provide a character set description file for at least one coded character set supported by the implementation. These files are referred to elsewhere in this specification set as charmap files. It is implementation-dependent whether or not users or applications can provide additional character set description files.

This specification set does not require that multiple character sets or codesets be supported. Although multiple charmap files are supported, it is the responsibility of the implementation to provide the file or files; if only one is provided, only that one will be accessible using the localedef utility's -f option (although in the case of just one file on the system, -f is not useful).

Each character set description file defines characteristics for the coded character set and the encoding for the characters specified in Portable Character Set and may define encoding for additional characters supported by the implementation. Other information about the coded character set may also be in the file. Coded character set character values are defined using symbolic character names followed by character encoding values.

The character set description file provides:

The charmap file was introduced to resolve portability problems, especially those affecting localedef sources. This specification set assumes that the portable character set is constant across all locales, but does not prohibit implementations from supporting two incompatible codings, such as both ASCII and EBCDIC. Such dual-support implementations should have all charmaps and localedef sources encoded using one portable character set, in effect cross-compiling for the other environment. Naturally, charmaps (and localedef sources) are only portable without transformation between systems using the same encodings for the portable character set. They can, however, be transformed between two sets using only a subset of the actual characters (the portable set). The particular coded character set used for an application or an implementation does not necessarily imply different characteristics or collation; on the contrary, these attributes should in many cases be identical, regardless of codeset. The charmap provides the capability to define a common locale definition for multiple codesets: the same localedef source can be used for codesets with different extended characters, and the ability in the charmap to define empty names allows for characters missing in certain codesets.

Each symbolic name specified in Portable Character Set is included in the file and is mapped to a unique encoding value (except for those symbolic names that are shown with identical glyphs). If the control characters commonly associated with the symbolic names in the following table are supported by the implementation, the symbolic names and their corresponding encoding values are included in the file.

The following declarations can precede the character definitions. Each must consist of the symbol shown in the following list, starting in column 1, including the surrounding brackets, followed by one or more blank characters, followed by the value to be assigned to the symbol.

<code_set_name>
The name of the coded character set for which the character set description file is defined. The characters of the name must be taken from the set of characters with visible glyphs defined in Portable Character Set.
 
<mb_cur_max>
The maximum number of bytes in a multi-byte character. This defaults to 1.
 
<mb_cur_min>
An unsigned positive integer value that defines the minimum number of bytes in a character for the encoded character set. On XSI-conformant systems, <mb_cur_min> is always 1.
 
<escape_char>
The escape character used to indicate that the characters following will be interpreted in a special way, as defined later in this section. This defaults to backslash (\), which is the character glyph used in all the following text and examples, unless otherwise noted.
 
<comment_char>
The character that, when placed in column 1 of a charmap line, indicates that the line is to be ignored. The default character is the number sign (#).
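The declarations above can be read with a few lines of code. The following Python sketch is one possible reading of the rules, not a reference implementation: the function name parse_declarations and the sample values are hypothetical, and only the defaults stated in the text (1 for <mb_cur_max>, backslash for <escape_char>, number sign for <comment_char>) are applied.

```python
# Defaults stated in the text. <code_set_name> and <mb_cur_min> have no
# default given there, so they stay absent unless the file declares them.
DEFAULTS = {"mb_cur_max": 1, "escape_char": "\\", "comment_char": "#"}
KNOWN = {"code_set_name", "mb_cur_max", "mb_cur_min", "escape_char", "comment_char"}
INTEGER_SYMBOLS = {"mb_cur_max", "mb_cur_min"}


def parse_declarations(lines):
    """Collect the declarations that precede the CHARMAP identifier line."""
    settings = dict(DEFAULTS)
    for line in lines:
        if line.startswith("CHARMAP"):
            break  # character definitions start here
        if not line.startswith("<"):
            continue  # a declaration must start in column 1 with its bracket
        symbol, _, value = line.partition(">")
        name = symbol[1:]
        value = value.strip()  # value follows one or more blank characters
        if name in KNOWN and value:
            settings[name] = int(value) if name in INTEGER_SYMBOLS else value
    return settings


if __name__ == "__main__":
    # Hypothetical header; the code set name is made up for the example.
    header = ["<code_set_name> EXAMPLE-1", "<mb_cur_max> 2", "CHARMAP"]
    print(parse_declarations(header))
```

Unknown symbols are silently skipped in this sketch; a stricter reader might reject them instead.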

The character set mapping definitions will be all the lines immediately following an identifier line containing the string CHARMAP starting in column 1, and preceding a trailer line containing the string ENDCHARMAP starting in column 1. Empty lines and lines containing a <comment_char> in the first column will be ignored. Each non-comment line of the character set mapping definition (that is, between the CHARMAP and ENDCHARMAP lines of the file) must be in either of two forms:

"%s %s %s\n", <symbolic-name>, <encoding>, <comments>

or:

"%s...%s %s %s\n", <symbolic-name>, <symbolic-name>, <encoding>, <comments>

In the first format, the line in the character set mapping definition defines a single symbolic name and a corresponding encoding. A symbolic name is one or more characters from the set shown with visible glyphs in Portable Character Set, enclosed between angle brackets. A character following an escape character is interpreted as itself; for example, the sequence <\\\>> represents the symbolic name \> enclosed between angle brackets.
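The escape rule for symbolic names can be sketched as a small scanner. The Python function below is hypothetical (the name parse_symbolic_name is not part of this specification) and assumes the input begins with the opening angle bracket; it returns the unescaped name together with the remainder of the line.

```python
def parse_symbolic_name(text, escape_char="\\"):
    """Return (name, rest) for text that begins with a bracketed symbolic name."""
    if not text.startswith("<"):
        raise ValueError("symbolic name must start with '<'")
    name = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == escape_char:
            if i + 1 >= len(text):
                raise ValueError("dangling escape character")
            name.append(text[i + 1])  # an escaped character stands for itself
            i += 2
        elif ch == ">":
            return "".join(name), text[i + 1:]  # unescaped '>' closes the name
        else:
            name.append(ch)
            i += 1
    raise ValueError("missing closing '>'")


if __name__ == "__main__":
    # The example from the text: the name consists of a backslash and '>'.
    print(parse_symbolic_name(r"<\\\>> \x5c"))
```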

In the second format, the line in the character set mapping definition defines a range of one or more symbolic names. In this form, the symbolic names must consist of zero or more non-numeric characters from the set shown with visible glyphs in Portable Character Set, followed by an integer formed by one or more decimal digits. The characters preceding the integer must be identical in the two symbolic names, and the integer formed by the digits in the second symbolic name must be equal to or greater than the integer formed by the digits in the first name. This is interpreted as a series of symbolic names formed from the common part and each of the integers between the first and the second integer, inclusive. As an example, <j0101>...<j0104> is interpreted as the symbolic names <j0101>, <j0102>, <j0103> and <j0104>, in that order.
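The range rule can be sketched as follows. This Python function is hypothetical and makes two assumptions not stated in the text: it ignores escape characters inside the names, and it pads each generated integer to the digit width of the first name, which matches the <j0101>...<j0104> example.

```python
import re

# Zero or more non-digit characters followed by one or more decimal digits.
_RANGE_NAME = re.compile(r"<([^0-9]*)([0-9]+)>")


def expand_symbolic_range(first, last):
    """Expand a pair such as '<j0101>', '<j0104>' into every name in the range."""
    m1 = _RANGE_NAME.fullmatch(first)
    m2 = _RANGE_NAME.fullmatch(last)
    if not m1 or not m2:
        raise ValueError("names must be a non-numeric prefix followed by digits")
    if m1.group(1) != m2.group(1):
        raise ValueError("the characters preceding the integer must be identical")
    start, end = int(m1.group(2)), int(m2.group(2))
    if end < start:
        raise ValueError("second integer must be equal to or greater than the first")
    width = len(m1.group(2))  # assumption: keep the first name's digit width
    prefix = m1.group(1)
    return [f"<{prefix}{n:0{width}d}>" for n in range(start, end + 1)]


if __name__ == "__main__":
    print(expand_symbolic_range("<j0101>", "<j0104>"))
```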

A character set mapping definition line must exist for all symbolic names specified in Portable Character Set, and must define the coded character value that corresponds to the character glyph indicated in the table, or the coded character value that corresponds with the control character symbolic name. If the control characters commonly associated with the symbolic names in Control Character Set are supported by the implementation, the symbolic name and the corresponding encoding value must be included in the file. Additional unique symbolic names may be included. A coded character value can be represented by more than one symbolic name.

The encoding part is expressed as one (for single-byte character values) or more concatenated decimal, octal or hexadecimal constants in the following formats:

"%cd%d", <escape_char>, <decimal byte value>
"%cx%x", <escape_char>, <hexadecimal byte value>
"%c%o", <escape_char>, <octal byte value>

Decimal constants must be represented by two or three decimal digits, preceded by the escape character and the lower-case letter d; for example, \d05, \d97 or \d143. Hexadecimal constants must be represented by two hexadecimal digits, preceded by the escape character and the lower-case letter x; for example, \x05, \x61 or \x8f. Octal constants must be represented by two or three octal digits, preceded by the escape character; for example, \05, \141 or \217. In a portable charmap file, each constant must represent an 8-bit byte. Implementations supporting other byte sizes may allow constants to represent values larger than those that can be represented in 8-bit bytes, and may allow additional digits in constants. When constants are concatenated for multi-byte character values, they must be of the same type, and are interpreted in byte order from first to last, with the least significant byte of the multi-byte character specified by the last constant. The manner in which these constants are represented in the character stored in the system is implementation-dependent. (This big-endian notation was chosen for reasons of portability. There is no requirement that the internal representation in the computer memory be in this same order.) Omitting bytes from a multi-byte character definition produces undefined results.
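The constant formats can be checked mechanically. The following Python sketch assumes a portable charmap (each constant is one 8-bit byte) and the default backslash escape character; the function name parse_encoding is hypothetical.

```python
import re

# One constant: \dNN or \dNNN, \xHH, or \OO / \OOO (octal), with the
# default backslash assumed as the escape character.
_CONSTANT = re.compile(r"\\(?:d([0-9]{2,3})|x([0-9A-Fa-f]{2})|([0-7]{2,3}))")


def parse_encoding(text):
    """Convert concatenated constants such as '\\d129\\d254' into bytes."""
    values = []
    kinds = set()
    pos = 0
    while pos < len(text):
        match = _CONSTANT.match(text, pos)
        if not match:
            raise ValueError(f"malformed constant at offset {pos}")
        dec, hexa, octal = match.groups()
        if dec is not None:
            kinds.add("decimal")
            values.append(int(dec, 10))
        elif hexa is not None:
            kinds.add("hexadecimal")
            values.append(int(hexa, 16))
        else:
            kinds.add("octal")
            values.append(int(octal, 8))
        pos = match.end()
    if not values:
        raise ValueError("empty encoding")
    if len(kinds) > 1:
        raise ValueError("concatenated constants must be of the same type")
    return bytes(values)  # bytes() rejects values that do not fit in 8 bits


if __name__ == "__main__":
    # The three notations from the text for the same byte value.
    print(parse_encoding(r"\d97"), parse_encoding(r"\x61"), parse_encoding(r"\141"))
```

The first constant becomes the first byte of the result, so the last constant is the least significant byte, as the text requires.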

In lines defining ranges of symbolic names, the encoded value is the value for the first symbolic name in the range (the symbolic name preceding the ellipsis). Subsequent symbolic names defined by the range will have encoding values in increasing order. For example, the line:

<j0101>...<j0104>    \d129\d254

will be interpreted as:

<j0101>              \d129\d254
<j0102>              \d129\d255
<j0103>              \d130\d00
<j0104>              \d130\d01

Note that this line will be interpreted as the example even on systems with bytes larger than 8 bits.
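The carry from one byte to the next in this example can be reproduced by treating the encoding as a big-endian integer. The Python sketch below assumes 8-bit constants, as in the example, and takes the number of names in the range as an argument; the function name expand_range_encodings is hypothetical.

```python
def expand_range_encodings(first_encoding, count):
    """Return `count` encodings, starting at first_encoding and increasing by one."""
    width = len(first_encoding)
    # The last constant is the least significant byte, so read as big-endian.
    start = int.from_bytes(first_encoding, "big")
    # to_bytes raises OverflowError if a value no longer fits in `width` bytes.
    return [(start + i).to_bytes(width, "big") for i in range(count)]


if __name__ == "__main__":
    # \d129\d254 for <j0101>, followed by the next three names in the range.
    for encoding in expand_range_encodings(bytes([129, 254]), 4):
        print(list(encoding))
```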

The comment is optional.
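Putting the line formats together, the mapping section of a charmap file can be split into its fields with a short generator. This Python sketch is illustrative only: the name mapping_lines is hypothetical, the default comment character is assumed, and fields are assumed to be separated by blank characters with no blanks inside a symbolic name.

```python
def mapping_lines(lines, comment_char="#"):
    """Yield (names, encoding, comment) for lines between CHARMAP and ENDCHARMAP."""
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not inside:
            # Skip everything up to and including the CHARMAP identifier line.
            inside = line.startswith("CHARMAP")
            continue
        if line.startswith("ENDCHARMAP"):
            break
        if not line.strip() or line.startswith(comment_char):
            continue  # empty lines and comment lines are ignored
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"missing encoding in line: {line!r}")
        comment = fields[2] if len(fields) == 3 else ""  # the comment is optional
        yield fields[0], fields[1], comment


if __name__ == "__main__":
    # Hypothetical charmap fragment for demonstration.
    sample = [
        "CHARMAP\n",
        "# a comment line\n",
        "<A> \\x41 LATIN CAPITAL LETTER A\n",
        "<j0101>...<j0104> \\d129\\d254\n",
        "ENDCHARMAP\n",
    ]
    for entry in mapping_lines(sample):
        print(entry)
```

A range line is returned with both symbolic names still joined by the ellipsis in the first field, so a caller can expand it separately.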