Character Encoding

A character encoding assigns a distinct number to each character in a given character set.

Example

In the ASCII character set, the character A is represented by the number 65, B by 66, C by 67, and so on in increasing sequence. These numbers are often called 'codes' or 'character codes'. Internally, software uses sequences of these numbers to represent text.
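
The mapping described above can be tried out with a short sketch. Python is used purely for illustration here (an assumption, not something IMan exposes), and the helper names char_codes and from_codes are hypothetical.

```python
# Illustrative sketch: map characters to their ASCII codes and back.

def char_codes(text):
    """Return the character code of each character in text."""
    return [ord(ch) for ch in text]


def from_codes(codes):
    """Rebuild the text from a sequence of character codes."""
    return "".join(chr(code) for code in codes)


codes = char_codes("ABC")
print(codes)              # [65, 66, 67]
print(from_codes(codes))  # ABC

# Encoding the text as US-ASCII yields the same numbers, one byte each.
print(list("ABC".encode("us-ascii")))  # [65, 66, 67]
```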

Supported Encodings

IMan provides implementations of the following encodings, covering the current Unicode encodings as well as a range of other encodings:

asmo-708            Arabic (ASMO 708)
big5                Chinese Traditional (Big5)
cp866               Cyrillic (DOS)
cp875               IBM EBCDIC (Greek Modern)
dos-720             Arabic (DOS)
dos-862             Hebrew (DOS)
euc-jp              Japanese (JIS 0208-1990 and 0212-1990)
euc-jp              Japanese (EUC)
euc-kr              Korean (EUC)
gb2312              Chinese Simplified (GB2312)
ibm00858            OEM Multilingual Latin I
ibm037              IBM EBCDIC (US-Canada)
ibm437              OEM United States
ibm500              IBM EBCDIC (International)
ibm737              Greek (DOS)
ibm775              Baltic (DOS)
ibm850              Western European (DOS)
ibm852              Central European (DOS)
ibm855              OEM Cyrillic
ibm857              Turkish (DOS)
ibm860              Portuguese (DOS)
ibm861              Icelandic (DOS)
ibm863              French Canadian (DOS)
ibm864              Arabic (864)
ibm865              Nordic (DOS)
ibm869              Greek, Modern (DOS)
ibm870              IBM EBCDIC (Multilingual Latin-2)
iso-2022-jp         Japanese (JIS)
iso-2022-jp         Japanese (JIS-Allow 1 byte Kana - SO/SI)
iso-2022-kr         Korean (ISO)
iso-8859-1          Western European (ISO)
iso-8859-13         Estonian (ISO)
iso-8859-15         Latin 9 (ISO)
iso-8859-2          Central European (ISO)
iso-8859-3          Latin 3 (ISO)
iso-8859-4          Baltic (ISO)
iso-8859-5          Cyrillic (ISO)
iso-8859-6          Arabic (ISO)
iso-8859-7          Greek (ISO)
iso-8859-8          Hebrew (ISO-Visual)
iso-8859-9          Turkish (ISO)
koi8-r              Cyrillic (KOI8-R)
koi8-u              Cyrillic (KOI8-U)
ks_c_5601-1987      Korean
macintosh           Western European (Mac)
shift_jis           Japanese (Shift-JIS)
unicodefffe         Unicode (Big endian)
us-ascii            US-ASCII
utf-32              Unicode (UTF-32)
utf-32be            Unicode (UTF-32 Big endian)
utf-7               Unicode (UTF-7)
utf-8               Unicode (UTF-8)
windows-1250        Central European (Windows)
windows-1251        Cyrillic (Windows)
windows-1252        Western European (Windows)
windows-1253        Greek (Windows)
windows-1254        Turkish (Windows)
windows-1255        Hebrew (Windows)
windows-1256        Arabic (Windows)
windows-1257        Baltic (Windows)
windows-1258        Vietnamese (Windows)
windows-874         Thai (Windows)
x-mac-arabic        Arabic (Mac)
x-mac-ce            Central European (Mac)
x-mac-chinesesimp   Chinese Simplified (Mac)
x-mac-chinesetrad   Chinese Traditional (Mac)
x-mac-croatian      Croatian (Mac)
x-mac-cyrillic      Cyrillic (Mac)
x-mac-greek         Greek (Mac)
x-mac-hebrew        Hebrew (Mac)
x-mac-icelandic     Icelandic (Mac)
x-mac-japanese      Japanese (Mac)
x-mac-korean        Korean (Mac)
x-mac-romanian      Romanian (Mac)
x-mac-thai          Thai (Mac)
x-mac-turkish       Turkish (Mac)
x-mac-ukrainian     Ukrainian (Mac)
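
To see why the choice of encoding matters, the following sketch encodes one Cyrillic letter under several of the encodings listed above. It is only an illustration: the names are resolved by Python's own codec registry, which is an assumption and may not match how IMan looks up encodings, and encode_all is a hypothetical helper.

```python
# Illustrative sketch: the same character becomes different bytes
# depending on the encoding. Names are resolved by Python's codec
# registry (an assumption; IMan's own lookup may differ).
ENCODINGS = ["cp866", "iso-8859-5", "koi8-r", "windows-1251", "utf-8"]


def encode_all(text, encodings=ENCODINGS):
    """Return a dict mapping each encoding name to the encoded bytes."""
    return {name: text.encode(name) for name in encodings}


encoded = encode_all("Я")  # U+042F CYRILLIC CAPITAL LETTER YA
for name, raw in encoded.items():
    print(f"{name:<14}{raw.hex(' ')}")

# Decoding with the wrong encoding does not fail here; it silently
# produces a different character.
wrong = encoded["windows-1251"].decode("koi8-r")
print(wrong == "Я")  # False
```

Because a mismatch like this raises no error, the encoding of a file generally has to be known or declared rather than guessed from the bytes alone.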

Byte Order Mark (Unicode Files)

The byte order mark (BOM) is a Unicode character used to signal the ‘endianness’ (byte order) of a text file or stream. Its code point is U+FEFF. Use of the BOM is optional; when present, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM may also indicate which of the several Unicode representations the text is encoded in.
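
As a rough sketch of how a leading BOM can identify the Unicode representation, the following Python function compares the first bytes of a stream against the known BOM byte sequences. The function name detect_bom and the returned labels are hypothetical, and the check is a heuristic rather than a description of how IMan detects encodings.

```python
import codecs

# Longer signatures come first: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE),
# so UTF-32 has to be tested before UTF-16.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]


def detect_bom(data):
    """Return (label, bom_length) for a leading BOM, or (None, 0).

    Heuristic only: FF FE 00 00 could also be a UTF-16 LE BOM
    followed by a NUL character.
    """
    for bom, label in BOM_TABLE:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))  # ('UTF-8', 3)
print(detect_bom(b"hello"))              # (None, 0)
```

A consumer would typically skip bom_length bytes before decoding the rest of the stream, and fall back to a default or declared encoding when no BOM is found.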

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's ‘endianness’ to the consumer of the text without requiring some contract or metadata outside of the text stream itself.
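
The byte-order problem can be illustrated with a minimal sketch in Python (chosen only for illustration). It serializes U+FEFF in both byte orders, then shows what happens when a consumer guesses the wrong order and how a leading BOM removes the guesswork.

```python
import codecs

BOM = 0xFEFF  # code point of the byte order mark

# The same 16-bit integer serialized in the two byte orders.
big = BOM.to_bytes(2, "big")        # bytes FE FF
little = BOM.to_bytes(2, "little")  # bytes FF FE
assert big == codecs.BOM_UTF16_BE
assert little == codecs.BOM_UTF16_LE

payload_be = "A".encode("utf-16-be")  # bytes 00 41

# Without a BOM, a consumer that assumes the wrong byte order
# reads 0x4100 instead of 0x0041.
misread = payload_be.decode("utf-16-le")

# With a BOM in front, Python's generic "utf-16" codec picks the
# byte order from it and drops the BOM from the decoded text.
decoded = (big + payload_be).decode("utf-16")

print(decoded, hex(ord(misread)))  # A 0x4100
```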