Java: Character sets and encoding

When you need to communicate with an external system or application and your text goes beyond plain US-ASCII, it is very important to use the right character set. If all involved systems are first-class Unicode citizens, it is pretty easy. But this is not always the case, especially when you work with older systems. As long as you stay in the Java world, every string is UTF-16, so there is no problem. When you leave this safe Unicode world, you need to encode your strings using an appropriate character set.

Often, you will want to convert your Java strings to a sequence of bytes representing the string in a given character set. For this you have the getBytes methods of the String class:

public byte[] getBytes(String charsetName) throws UnsupportedEncodingException

and:

public byte[] getBytes(Charset charset)

The second one is great if you already know in advance which character set you will be using. But if your software can be used in many countries world-wide and you do not know in advance which character set you will be needing, you’ll have to make the character set used configurable and go for the first method.
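As a minimal sketch of the two variants (the class and method names here are made up for illustration, and ISO-8859-1 is just an example choice), the difference mostly shows up in the exception handling:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class GetBytesExample {

    // Character set known in advance: no checked exception to handle.
    static byte[] encodeFixed(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Character set name read from configuration: the name may be unknown
    // to this JRE, so the checked exception has to be dealt with.
    static byte[] encodeConfigured(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(
                    "Unsupported character set: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent
        System.out.println(encodeFixed(text).length);               // 1 byte in ISO-8859-1
        System.out.println(encodeConfigured(text, "UTF-8").length); // 2 bytes in UTF-8
    }
}
```

Note that StandardCharsets only exists since Java 7; on older versions you would obtain the Charset with Charset.forName instead.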

The first method takes the name of a character set as a parameter. Unfortunately, it is not so easy to find a list of the supported character sets and especially to find out how they are named.
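One way to find out what your own JRE supports is to ask it. The following sketch (the class name is made up) prints every available character set with its aliases, using Charset.availableCharsets(). The output depends on the JRE version and on the installed providers.

```java
import java.nio.charset.Charset;

public class ListCharsets {

    // Formats one character set as "canonical-name [alias1, alias2, ...]".
    static String describe(Charset cs) {
        return cs.name() + " " + cs.aliases();
    }

    public static void main(String[] args) {
        // availableCharsets() returns a map sorted by canonical name.
        for (Charset cs : Charset.availableCharsets().values()) {
            System.out.println(describe(cs));
        }
    }
}
```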

This method basically uses the java.lang.StringCoding class to look up an instance of the Charset class from the provided character set name and uses a StringEncoder to transform the UTF-16 String to a byte array.

Note that this lookup of the Charset instance is not very expensive: the results are cached, since most programs are expected to use the same character set many times.

If the character set is not already in the cache, a lookup is done in three lists of character sets:

  1. Standard character sets
  2. Extended character sets
  3. Character sets from providers found via the application class loader

Note that many entries in the lists below refer to the same character set. Java has an additional layer here: character set aliases. Each character set can have multiple aliases, e.g. UTF-16 can be specified with any of the following aliases:

  • UTF_16
  • utf16
  • unicode
  • UnicodeBig
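To check how an alias resolves on your JRE, you can look it up with Charset.forName and compare canonical names. This is only a sketch with a made-up class name; which aliases exist may differ between JRE versions, so the names used in main are examples, not guarantees.

```java
import java.nio.charset.Charset;

public class AliasLookup {

    // Returns true if both names resolve to the same character set.
    // Charset.forName throws an unchecked exception for unknown or illegal names.
    static boolean sameCharset(String nameA, String nameB) {
        return Charset.forName(nameA).equals(Charset.forName(nameB));
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("utf16");
        System.out.println(cs.name());     // canonical name
        System.out.println(cs.aliases());  // all aliases known to this JRE
        System.out.println(sameCharset("latin1", "ISO-8859-1"));
    }
}
```

If the name comes from user configuration, Charset.isSupported(name) lets you test it up front instead of catching the exception.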

Standard character sets

Here is a list of standard character sets. This list should be complete but it might be different in different versions of the JRE.

iso-ir-6
ANSI_X3.4-1986
ISO_646.irv:1991
ASCII
ISO646-US
us
IBM367
cp367
csASCII
default
646
iso_646.irv:1983
ANSI_X3.4-1968
ascii7
UTF8
unicode-1-1-utf-8
UTF_16
utf16
unicode
UnicodeBig
UTF_16BE
ISO-10646-UCS-2
X-UTF-16BE
UnicodeBigUnmarked
UTF_16LE
X-UTF-16LE
UnicodeLittleUnmarked
UnicodeLittle
UTF_32
UTF32
UTF_32LE
X-UTF-32LE
UTF_32BE
X-UTF-32BE
UTF_32LE_BOM
UTF-32LE-BOM
UTF_32BE_BOM
UTF-32BE-BOM
iso-ir-100
ISO_8859-1
latin1
l1
IBM819
cp819
csISOLatin1
819
IBM-819
ISO8859_1
ISO_8859-1:1987
ISO_8859_1
8859_1
ISO8859-1
iso8859_2
8859_2
iso-ir-101
ISO_8859-2
ISO_8859-2:1987
ISO8859-2
latin2
l2
ibm912
ibm-912
cp912
912
csISOLatin2
iso8859_4
iso8859-4
8859_4
iso-ir-110
ISO_8859-4
ISO_8859-4:1988
latin4
l4
ibm914
ibm-914
cp914
914
csISOLatin4
iso8859_5
8859_5
iso-ir-144
ISO_8859-5
ISO_8859-5:1988
ISO8859-5
cyrillic
ibm915
ibm-915
cp915
915
csISOLatinCyrillic
iso8859_7
8859_7
iso-ir-126
ISO_8859-7
ISO_8859-7:1987
ELOT_928
ECMA-118
greek
greek8
csISOLatinGreek
sun_eu_greek
ibm813
ibm-813
813
cp813
iso8859-7
iso8859_9
8859_9
iso-ir-148
ISO_8859-9
ISO_8859-9:1989
ISO8859-9
latin5
l5
ibm920
ibm-920
920
cp920
csISOLatin5
iso8859_13
8859_13
iso_8859-13
ISO8859-13
ISO_8859-15
8859_15
ISO-8859-15
ISO8859_15
ISO8859-15
IBM923
IBM-923
cp923
923
LATIN0
LATIN9
L9
csISOlatin0
csISOlatin9
ISO8859_15_FDIS
koi8_r
koi8
cskoi8r
koi8_u
cp1250
cp5346
cp1251
cp5347
ansi-1251
cp1252
cp5348
cp1253
cp5349
cp1254
cp5350
cp1257
cp5353
cp437
ibm437
ibm-437
437
cspc8codepage437
windows-437
cp737
ibm737
ibm-737
737
cp775
ibm775
ibm-775
775
cp850
ibm-850
ibm850
850
cspc850multilingual
cp852
ibm852
ibm-852
852
csPCp852
cp855
ibm-855
ibm855
855
cspcp855
cp857
ibm857
ibm-857
857
csIBM857
cp858
ccsid00858
cp00858
858
cp862
ibm862
ibm-862
862
csIBM862
cspc862latinhebrew
cp866
ibm866
ibm-866
866
csIBM866
cp874
ibm874
ibm-874
874

Extended character sets

Extended character sets are the more exotic character sets. They include, for example, Asian character sets like Big5, GBK, GB18030, ISO-2022-JP and ISO-2022-KR.
Here is a list of those character sets. This list should also be complete but it might be different in different versions of the JRE.

Big5
csBig5
x-MS950-HKSCS
x-windows-950
windows-950
x-windows-874
ms-874
x-EUC-TW
euctw
cns11643
EUC-TW
Big5-HKSCS
big5hk
big5-hkscs
big5-hkscs:unicode3.0
x-Big5-Solaris
GBK
windows-936
CP936
GB18030
gb18030-2000
GB2312
gb2312
gb2312-80
gb2312-1980
euc-cn
euccn
x-mswin-936
ms_936
Shift_JIS
shift_jis
shift-jis
ms_kanji
x-sjis
csShiftJIS
windows-31j
windows-932
csWindows31J
JIS_X0201
JIS_X0201
X0201
csHalfWidthKatakana
x-JIS0208
JIS_C6226-1983
iso-ir-87
x0208
JIS_X0208-1983
csISO87JISX0208
JIS_X0212-1990
jis_x0212-1990
x0212
iso-ir-159
csISO159JISX02121990
EUC-JP
eucjis
eucjp
Extended_UNIX_Code_Packed_Format_for_Japanese
csEUCPkdFmtjapanese
x-euc-jp
x-eucjp
x-euc-jp-linux
euc-jp-linux
x-eucjp-open
eucJP-open
x-PCK
ISO-2022-JP
jis
csISO2022JP
jis_encoding
csjisencoding
ISO-2022-JP-2
csISO2022JP2
iso2022jp2
x-windows-50221
cp50221
x-windows-50220
cp50220
x-windows-iso2022jp
x-JISAutoDetect
EUC-KR
ksc5601
euckr
ks_c_5601-1987
ksc5601-1987
ksc5601_1987
ksc_5601
csEUCKR
5601
x-windows-949
windows949
windows-949
ms_949
x-Johab
ksc5601-1992
ksc5601_1992
ms1361
ISO-2022-KR
csISO2022KR
ISO-2022-CN
csISO2022CN
x-ISO-2022-CN-CNS
ISO-2022-CN-CNS
x-ISO-2022-CN-GB
ISO-2022-CN-GB
x-ISCII91
iscii
ST_SEV_358-88
iso-ir-153
csISO153GOST1976874
ISO-8859-3
8859_3
ISO_8859-3:1988
iso-ir-109
ISO_8859-3
ISO8859-3
latin3
l3
ibm913
ibm-913
cp913
913
csISOLatin3
ISO-8859-6
8859_6
iso-ir-127
ISO_8859-6
ISO_8859-6:1987
ISO8859-6
ECMA-114
ASMO-708
arabic
ibm1089
ibm-1089
cp1089
1089
csISOLatinArabic
ISO-8859-8
8859_8
iso-ir-138
ISO_8859-8
ISO_8859-8:1988
ISO8859-8
cp916
916
ibm916
ibm-916
hebrew
csISOLatinHebrew
x-ISO-8859-11
iso-8859-11
iso8859_11
TIS-620
tis620.2533
windows-1255
windows-1256
windows-1258
x-IBM942
ibm942
ibm-942
942
x-IBM942C
ibm942C
ibm-942C
942C
x-IBM943
ibm943
ibm-943
943
x-IBM943C
ibm943C
ibm-943C
943C
x-IBM948
ibm948
ibm-948
948
x-IBM950
ibm950
ibm-950
950
x-IBM930
ibm930
ibm-930
930
x-IBM935
ibm935
ibm-935
935
x-IBM937
ibm937
ibm-937
937
x-IBM856
ibm-856
ibm856
856
IBM860
ibm860
ibm-860
860
csIBM860
IBM861
ibm861
ibm-861
861
csIBM861
cp-is
IBM863
ibm863
ibm-863
863
csIBM863
IBM864
ibm864
ibm-864
864
csIBM864
IBM865
ibm865
ibm-865
865
csIBM865
IBM868
ibm868
ibm-868
868
cp-ar
csIBM868
IBM869
ibm869
ibm-869
869
cp-gr
csIBM869
x-IBM921
ibm921
ibm-921
921
x-IBM1006
ibm1006
ibm-1006
1006
x-IBM1046
ibm1046
ibm-1046
1046
IBM1047
ibm-1047
1047
x-IBM1098
ibm1098
ibm-1098
1098
IBM037
ibm037
ebcdic-cp-us
ebcdic-cp-ca
ebcdic-cp-wt
ebcdic-cp-nl
csIBM037
cs-ebcdic-cp-us
cs-ebcdic-cp-ca
cs-ebcdic-cp-wt
cs-ebcdic-cp-nl
ibm-037
ibm-37
cpibm37
037
x-IBM1025
ibm1025
ibm-1025
1025
IBM1026
ibm1026
ibm-1026
1026
x-IBM1112
ibm1112
ibm-1112
1112
x-IBM1122
ibm1122
ibm-1122
1122
x-IBM1123
ibm1123
ibm-1123
1123
x-IBM1124
ibm1124
ibm-1124
1124
IBM273
ibm273
ibm-273
273
IBM277
ibm277
ibm-277
277
IBM278
ibm278
ibm-278
278
ebcdic-sv
ebcdic-cp-se
csIBM278
IBM280
ibm280
ibm-280
280
IBM284
ibm284
ibm-284
284
csIBM284
cpibm284
IBM285
ibm285
ibm-285
285
ebcdic-cp-gb
ebcdic-gb
csIBM285
cpibm285
IBM297
ibm297
ibm-297
297
ebcdic-cp-fr
cpibm297
csIBM297
IBM420
ibm420
ibm-420
ebcdic-cp-ar1
420
csIBM420
IBM424
ibm424
ibm-424
424
ebcdic-cp-he
csIBM424
IBM500
ibm500
ibm-500
500
ebcdic-cp-ch
ebcdic-cp-bh
csIBM500
x-IBM834
cp834
ibm834
ibm-834
IBM-Thai
ibm838
ibm-838
838
IBM870
ibm870
ibm-870
870
ebcdic-cp-roece
ebcdic-cp-yu
csIBM870
IBM871
ibm871
ibm-871
871
ebcdic-cp-is
csIBM871
x-IBM875
ibm875
ibm-875
875
IBM918
ibm-918
918
ebcdic-cp-ar2
x-IBM922
ibm922
ibm-922
922
x-IBM1097
ibm1097
ibm-1097
1097
x-IBM949
ibm949
ibm-949
949
x-IBM949C
ibm949C
ibm-949C
949C
x-IBM939
ibm939
ibm-939
939
x-IBM933
ibm933
ibm-933
933
x-IBM1381
ibm1381
ibm-1381
1381
x-IBM1383
ibm1383
ibm-1383
1383
x-IBM970
ibm970
ibm-970
ibm-eucKR
970
x-IBM964
ibm964
ibm-964
964
x-IBM33722
ibm33722
ibm-33722
33722
IBM01140
ccsid01140
cp01140
1140
IBM01141
ccsid01141
cp01141
1141
IBM01142
ccsid01142
cp01142
1142
IBM01143
ccsid01143
cp01143
1143
IBM01144
ccsid01144
cp01144
1144
IBM01145
ccsid01145
cp01145
1145
IBM01146
ccsid01146
cp01146
1146
IBM01147
ccsid01147
cp01147
1147
IBM01148
ccsid01148
cp01148
1148
IBM01149
ccsid01149
cp01149
1149
x-MacRoman
x-MacCentralEurope
x-MacCroatian
x-MacGreek
x-MacCyrillic
x-MacUkraine
x-MacTurkish
x-MacArabic
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacSymbol
x-MacDingbat

Additional character set providers

If you need to work with other character sets, not supported by default, you can write your own character set provider.

Note that this can also be useful if you want to do some character substitution. Instead of configuring ISO-8859-1 (Latin-1) as the character set, you could write your own character set encoder/decoder and replace, for example, the typographic characters produced by the autocorrect feature of Microsoft Word with corresponding characters which exist in the target character set.

In order to do that you need to extend java.nio.charset.spi.CharsetProvider and make your class available in the application classpath. Additionally, you will need to create a new class extending java.nio.charset.Charset.

CharsetProvider

In this class, you need to implement two methods:

charsetForName

This method basically gets a String representing a character set name and returns a Charset object, or null if this particular provider doesn’t support the provided character set.

charsets

This returns an iterator over the character sets supported by this provider.
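A provider could look roughly like the sketch below. The class names and the character set name X-EXAMPLE-LATIN1 are invented for illustration, and to keep the example short the character set simply borrows the ISO-8859-1 encoder and decoder instead of implementing its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

public class ExampleCharsetProvider extends CharsetProvider {

    // Hypothetical character set that delegates all real work to ISO-8859-1.
    static final class ExampleCharset extends Charset {
        ExampleCharset() {
            super("X-EXAMPLE-LATIN1", new String[] {"example-latin1"});
        }

        @Override
        public boolean contains(Charset cs) {
            return cs instanceof ExampleCharset;
        }

        @Override
        public CharsetDecoder newDecoder() {
            return StandardCharsets.ISO_8859_1.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return StandardCharsets.ISO_8859_1.newEncoder();
        }
    }

    private final Charset charset = new ExampleCharset();

    // Returns the charset for its canonical name or an alias, otherwise null.
    @Override
    public Charset charsetForName(String charsetName) {
        if (charsetName.equalsIgnoreCase(charset.name())) {
            return charset;
        }
        for (String alias : charset.aliases()) {
            if (charsetName.equalsIgnoreCase(alias)) {
                return charset;
            }
        }
        return null;
    }

    // Iterates over the single character set this provider supports.
    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }
}
```

For the JRE to pick the provider up, its fully qualified class name is listed in a file named META-INF/services/java.nio.charset.spi.CharsetProvider on the classpath.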

Charset

This is the class which ties everything together. You have to implement the contains, newDecoder and newEncoder methods. Most of the actual work is done by the Encoder and the Decoder they return.

Decoder

To implement a decoder you have to implement the decodeLoop method. It basically does what its name says: it decodes bytes to characters. It iterates through the bytes and checks whether the current byte is the first byte of a multi-byte character. If not, it adds a character to the output character buffer; otherwise, it reads more bytes before adding a character to the output.

Encoder

Similarly, to implement an encoder, you need to implement the encodeLoop method. It takes a character buffer as input, iterates through it, converts each character to one or more bytes and writes the bytes to the byte buffer.
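To make this more concrete, here is a sketch of the character substitution idea mentioned earlier: a single-byte, Latin-1-like character set whose encoder replaces typographic quotes and dashes with plain ASCII characters. The class name, the character set name X-SUBST-LATIN1 and the substitution table are all assumptions made for this example. Since every character maps to exactly one byte, the decoder never has to collect several bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class SubstitutingLatin1 extends Charset {

    public SubstitutingLatin1() {
        super("X-SUBST-LATIN1", new String[0]);
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof SubstitutingLatin1;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // One byte always becomes exactly one character.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    out.put((char) (in.get() & 0xFF));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    // Peek first, so an unmappable character is not consumed.
                    char c = substitute(in.get(in.position()));
                    if (c > 0xFF) {
                        return CoderResult.unmappableForLength(1);
                    }
                    in.get();
                    out.put((byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // Example substitution table: typographic quotes and dashes to ASCII.
    static char substitute(char c) {
        switch (c) {
            case '\u2018': case '\u2019': return '\'';
            case '\u201C': case '\u201D': return '"';
            case '\u2013': case '\u2014': return '-';
            default: return c;
        }
    }
}
```

With such a class, "\u201Chi\u201D".getBytes(new SubstitutingLatin1()) yields the bytes of "hi" in plain double quotes, and characters that still do not fit are replaced by the encoder's default replacement byte.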
