org.apache.poi.util
Class CodePageUtil

java.lang.Object
  extended by org.apache.poi.util.CodePageUtil

public class CodePageUtil
extends java.lang.Object

Utilities for working with Microsoft CodePages.

Provides constants for understanding numeric codepages, along with utilities to translate these into Java Character Sets.


Field Summary
static int CP_037
          Codepage 037, a special case
static int CP_EUC_JP
          Codepage for EUC-JP
static int CP_EUC_KR
          Codepage for EUC-KR
static int CP_GB18030
          Codepage for GB18030
static int CP_GB2312
          Codepage for GB2312
static int CP_GBK
          Codepage for GBK, aka MS936
static int CP_ISO_2022_JP1
          Codepage for ISO-2022-JP
static int CP_ISO_2022_JP2
          Another codepage for ISO-2022-JP
static int CP_ISO_2022_JP3
          Yet another codepage for ISO-2022-JP
static int CP_ISO_2022_KR
          Codepage for ISO-2022-KR
static int CP_ISO_8859_1
          Codepage for ISO-8859-1
static int CP_ISO_8859_2
          Codepage for ISO-8859-2
static int CP_ISO_8859_3
          Codepage for ISO-8859-3
static int CP_ISO_8859_4
          Codepage for ISO-8859-4
static int CP_ISO_8859_5
          Codepage for ISO-8859-5
static int CP_ISO_8859_6
          Codepage for ISO-8859-6
static int CP_ISO_8859_7
          Codepage for ISO-8859-7
static int CP_ISO_8859_8
          Codepage for ISO-8859-8
static int CP_ISO_8859_9
          Codepage for ISO-8859-9
static int CP_JOHAB
          Codepage for Johab
static int CP_KOI8_R
          Codepage for KOI8-R
static int CP_MAC_ARABIC
          Codepage for Macintosh Arabic (Java: MacArabic)
static int CP_MAC_CENTRAL_EUROPE
          Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)
static int CP_MAC_CHINESE_SIMPLE
          Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)
static int CP_MAC_CHINESE_TRADITIONAL
          Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)
static int CP_MAC_CROATIAN
          Codepage for Macintosh Croatian (Java: MacCroatian)
static int CP_MAC_CYRILLIC
          Codepage for Macintosh Cyrillic (Java: MacCyrillic)
static int CP_MAC_GREEK
          Codepage for Macintosh Greek (Java: MacGreek)
static int CP_MAC_HEBREW
          Codepage for Macintosh Hebrew (Java: MacHebrew)
static int CP_MAC_ICELAND
          Codepage for Macintosh Iceland (Java: MacIceland)
static int CP_MAC_JAPAN
          Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)
static int CP_MAC_KOREAN
          Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)
static int CP_MAC_ROMAN
          Codepage for Macintosh Roman (Java: MacRoman)
static int CP_MAC_ROMAN_BIFF23
           
static int CP_MAC_ROMANIA
          Codepage for Macintosh Romanian (Java: MacRomania)
static int CP_MAC_THAI
          Codepage for Macintosh Thai (Java: MacThai)
static int CP_MAC_TURKISH
          Codepage for Macintosh Turkish (Java: MacTurkish)
static int CP_MAC_UKRAINE
          Codepage for Macintosh Ukrainian (Java: MacUkraine)
static int CP_MS949
          Codepage for MS949
static int CP_SJIS
          Codepage for SJIS
static int CP_UNICODE
          Codepage for Unicode
static int CP_US_ACSII
          Codepage for US-ASCII
static int CP_US_ASCII2
          Another codepage for US-ASCII
static int CP_UTF16
          Codepage for UTF-16
static int CP_UTF16_BE
          Codepage for UTF-16 big-endian
static int CP_UTF8
          Codepage for UTF-8
static int CP_WINDOWS_1250
          Codepage for Windows 1250
static int CP_WINDOWS_1251
          Codepage for Windows 1251
static int CP_WINDOWS_1252
          Codepage for Windows 1252
static int CP_WINDOWS_1252_BIFF23
           
static int CP_WINDOWS_1253
          Codepage for Windows 1253
static int CP_WINDOWS_1254
          Codepage for Windows 1254
static int CP_WINDOWS_1255
          Codepage for Windows 1255
static int CP_WINDOWS_1256
          Codepage for Windows 1256
static int CP_WINDOWS_1257
          Codepage for Windows 1257
static int CP_WINDOWS_1258
          Codepage for Windows 1258
static java.util.Set<java.nio.charset.Charset> DOUBLE_BYTE_CHARSETS
           
 
Constructor Summary
CodePageUtil()
           
 
Method Summary
static java.lang.String codepageToEncoding(int codepage)
          Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).
static java.lang.String codepageToEncoding(int codepage, boolean javaLangFormat)
          Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.
static java.lang.String cp950ToString(byte[] data, int offset, int lengthInBytes)
          Attempts to convert a little-endian byte array in cp950 (Microsoft's dialect of Big5) to a String.
static byte[] getBytesInCodePage(java.lang.String string, int codepage)
          Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
static java.lang.String getStringFromCodePage(byte[] string, int codepage)
          Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
static java.lang.String getStringFromCodePage(byte[] string, int offset, int length, int codepage)
          Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOUBLE_BYTE_CHARSETS

public static final java.util.Set<java.nio.charset.Charset> DOUBLE_BYTE_CHARSETS

CP_037

public static final int CP_037

Codepage 037, a special case

See Also:
Constant Field Values

CP_SJIS

public static final int CP_SJIS

Codepage for SJIS

See Also:
Constant Field Values

CP_GBK

public static final int CP_GBK

Codepage for GBK, aka MS936

See Also:
Constant Field Values

CP_MS949

public static final int CP_MS949

Codepage for MS949

See Also:
Constant Field Values

CP_UTF16

public static final int CP_UTF16

Codepage for UTF-16

See Also:
Constant Field Values

CP_UTF16_BE

public static final int CP_UTF16_BE

Codepage for UTF-16 big-endian

See Also:
Constant Field Values

CP_WINDOWS_1250

public static final int CP_WINDOWS_1250

Codepage for Windows 1250

See Also:
Constant Field Values

CP_WINDOWS_1251

public static final int CP_WINDOWS_1251

Codepage for Windows 1251

See Also:
Constant Field Values

CP_WINDOWS_1252

public static final int CP_WINDOWS_1252

Codepage for Windows 1252

See Also:
Constant Field Values

CP_WINDOWS_1252_BIFF23

public static final int CP_WINDOWS_1252_BIFF23
See Also:
Constant Field Values

CP_WINDOWS_1253

public static final int CP_WINDOWS_1253

Codepage for Windows 1253

See Also:
Constant Field Values

CP_WINDOWS_1254

public static final int CP_WINDOWS_1254

Codepage for Windows 1254

See Also:
Constant Field Values

CP_WINDOWS_1255

public static final int CP_WINDOWS_1255

Codepage for Windows 1255

See Also:
Constant Field Values

CP_WINDOWS_1256

public static final int CP_WINDOWS_1256

Codepage for Windows 1256

See Also:
Constant Field Values

CP_WINDOWS_1257

public static final int CP_WINDOWS_1257

Codepage for Windows 1257

See Also:
Constant Field Values

CP_WINDOWS_1258

public static final int CP_WINDOWS_1258

Codepage for Windows 1258

See Also:
Constant Field Values

CP_JOHAB

public static final int CP_JOHAB

Codepage for Johab

See Also:
Constant Field Values

CP_MAC_ROMAN

public static final int CP_MAC_ROMAN

Codepage for Macintosh Roman (Java: MacRoman)

See Also:
Constant Field Values

CP_MAC_ROMAN_BIFF23

public static final int CP_MAC_ROMAN_BIFF23
See Also:
Constant Field Values

CP_MAC_JAPAN

public static final int CP_MAC_JAPAN

Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)

See Also:
Constant Field Values

CP_MAC_CHINESE_TRADITIONAL

public static final int CP_MAC_CHINESE_TRADITIONAL

Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)

See Also:
Constant Field Values

CP_MAC_KOREAN

public static final int CP_MAC_KOREAN

Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)

See Also:
Constant Field Values

CP_MAC_ARABIC

public static final int CP_MAC_ARABIC

Codepage for Macintosh Arabic (Java: MacArabic)

See Also:
Constant Field Values

CP_MAC_HEBREW

public static final int CP_MAC_HEBREW

Codepage for Macintosh Hebrew (Java: MacHebrew)

See Also:
Constant Field Values

CP_MAC_GREEK

public static final int CP_MAC_GREEK

Codepage for Macintosh Greek (Java: MacGreek)

See Also:
Constant Field Values

CP_MAC_CYRILLIC

public static final int CP_MAC_CYRILLIC

Codepage for Macintosh Cyrillic (Java: MacCyrillic)

See Also:
Constant Field Values

CP_MAC_CHINESE_SIMPLE

public static final int CP_MAC_CHINESE_SIMPLE

Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)

See Also:
Constant Field Values

CP_MAC_ROMANIA

public static final int CP_MAC_ROMANIA

Codepage for Macintosh Romanian (Java: MacRomania)

See Also:
Constant Field Values

CP_MAC_UKRAINE

public static final int CP_MAC_UKRAINE

Codepage for Macintosh Ukrainian (Java: MacUkraine)

See Also:
Constant Field Values

CP_MAC_THAI

public static final int CP_MAC_THAI

Codepage for Macintosh Thai (Java: MacThai)

See Also:
Constant Field Values

CP_MAC_CENTRAL_EUROPE

public static final int CP_MAC_CENTRAL_EUROPE

Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)

See Also:
Constant Field Values

CP_MAC_ICELAND

public static final int CP_MAC_ICELAND

Codepage for Macintosh Iceland (Java: MacIceland)

See Also:
Constant Field Values

CP_MAC_TURKISH

public static final int CP_MAC_TURKISH

Codepage for Macintosh Turkish (Java: MacTurkish)

See Also:
Constant Field Values

CP_MAC_CROATIAN

public static final int CP_MAC_CROATIAN

Codepage for Macintosh Croatian (Java: MacCroatian)

See Also:
Constant Field Values

CP_US_ACSII

public static final int CP_US_ACSII

Codepage for US-ASCII

See Also:
Constant Field Values

CP_KOI8_R

public static final int CP_KOI8_R

Codepage for KOI8-R

See Also:
Constant Field Values

CP_ISO_8859_1

public static final int CP_ISO_8859_1

Codepage for ISO-8859-1

See Also:
Constant Field Values

CP_ISO_8859_2

public static final int CP_ISO_8859_2

Codepage for ISO-8859-2

See Also:
Constant Field Values

CP_ISO_8859_3

public static final int CP_ISO_8859_3

Codepage for ISO-8859-3

See Also:
Constant Field Values

CP_ISO_8859_4

public static final int CP_ISO_8859_4

Codepage for ISO-8859-4

See Also:
Constant Field Values

CP_ISO_8859_5

public static final int CP_ISO_8859_5

Codepage for ISO-8859-5

See Also:
Constant Field Values

CP_ISO_8859_6

public static final int CP_ISO_8859_6

Codepage for ISO-8859-6

See Also:
Constant Field Values

CP_ISO_8859_7

public static final int CP_ISO_8859_7

Codepage for ISO-8859-7

See Also:
Constant Field Values

CP_ISO_8859_8

public static final int CP_ISO_8859_8

Codepage for ISO-8859-8

See Also:
Constant Field Values

CP_ISO_8859_9

public static final int CP_ISO_8859_9

Codepage for ISO-8859-9

See Also:
Constant Field Values

CP_ISO_2022_JP1

public static final int CP_ISO_2022_JP1

Codepage for ISO-2022-JP

See Also:
Constant Field Values

CP_ISO_2022_JP2

public static final int CP_ISO_2022_JP2

Another codepage for ISO-2022-JP

See Also:
Constant Field Values

CP_ISO_2022_JP3

public static final int CP_ISO_2022_JP3

Yet another codepage for ISO-2022-JP

See Also:
Constant Field Values

CP_ISO_2022_KR

public static final int CP_ISO_2022_KR

Codepage for ISO-2022-KR

See Also:
Constant Field Values

CP_EUC_JP

public static final int CP_EUC_JP

Codepage for EUC-JP

See Also:
Constant Field Values

CP_EUC_KR

public static final int CP_EUC_KR

Codepage for EUC-KR

See Also:
Constant Field Values

CP_GB2312

public static final int CP_GB2312

Codepage for GB2312

See Also:
Constant Field Values

CP_GB18030

public static final int CP_GB18030

Codepage for GB18030

See Also:
Constant Field Values

CP_US_ASCII2

public static final int CP_US_ASCII2

Another codepage for US-ASCII

See Also:
Constant Field Values

CP_UTF8

public static final int CP_UTF8

Codepage for UTF-8

See Also:
Constant Field Values

CP_UNICODE

public static final int CP_UNICODE

Codepage for Unicode

See Also:
Constant Field Values
Constructor Detail

CodePageUtil

public CodePageUtil()
Method Detail

getBytesInCodePage

public static byte[] getBytesInCodePage(java.lang.String string,
                                        int codepage)
                                 throws java.io.UnsupportedEncodingException
Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.

Parameters:
string - The string to convert
codepage - The codepage number
Throws:
java.io.UnsupportedEncodingException
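
As a rough illustration of what this conversion amounts to, the sketch below encodes a string using only the JDK's `Charset` support. The class `EncodeSketch`, its methods, and the three-entry lookup are hypothetical and are not POI's implementation; the codepage-to-name pairs (1251, 1252, 65001) are the ones named elsewhere on this page. For brevity the sketch throws an unchecked exception, whereas the POI method declares `UnsupportedEncodingException`.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of POI.
final class EncodeSketch {

    // Assumed lookup covering only the codepages used as examples on this page.
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1251:  return Charset.forName("windows-1251");
            case 1252:  return Charset.forName("windows-1252");
            case 65001: return Charset.forName("UTF-8");
            default:
                throw new IllegalArgumentException("Sketch does not map codepage " + codepage);
        }
    }

    // Encodes the text in the charset that corresponds to the codepage number.
    static byte[] encode(String text, int codepage) {
        return text.getBytes(charsetFor(codepage));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the final letter
        System.out.println(Arrays.toString(encode(text, 1252))); // [99, 97, 102, -23]
        System.out.println(encode(text, 65001).length);          // 5
    }
}
```

With POI on the classpath, the corresponding call would be `CodePageUtil.getBytesInCodePage(text, CodePageUtil.CP_WINDOWS_1252)`.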

getStringFromCodePage

public static java.lang.String getStringFromCodePage(byte[] string,
                                                     int codepage)
                                              throws java.io.UnsupportedEncodingException
Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.

Parameters:
string - The bytes of the string to convert
codepage - The codepage number
Throws:
java.io.UnsupportedEncodingException

getStringFromCodePage

public static java.lang.String getStringFromCodePage(byte[] string,
                                                     int offset,
                                                     int length,
                                                     int codepage)
                                              throws java.io.UnsupportedEncodingException
Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.

Parameters:
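string - The bytes of the string to convert
offset - The index in the byte array at which the string starts
length - The number of bytes to convert
codepage - The codepage number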
Throws:
java.io.UnsupportedEncodingException
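
The offset and length variant is useful when the string is embedded in a larger record. The sketch below shows the same idea using only the JDK; `DecodeSliceSketch` and its `decode` method are hypothetical names, and the record layout (a two-byte header followed by text) is an assumption made for the example, not a POI format.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of POI.
final class DecodeSliceSketch {

    // Decodes `length` bytes starting at `offset`, leaving the rest of the array untouched.
    static String decode(byte[] data, int offset, int length, Charset charset) {
        return new String(data, offset, length, charset);
    }

    public static void main(String[] args) {
        // Assumed layout: 2 header bytes, then 5 bytes of windows-1252 text.
        byte[] record = {0x05, 0x00, 'H', (byte) 0xE9, 'l', 'l', 'o'};
        String text = decode(record, 2, 5, Charset.forName("windows-1252"));
        System.out.println(text.length()); // 5
    }
}
```

With POI on the classpath, the corresponding call would be `CodePageUtil.getStringFromCodePage(record, 2, 5, CodePageUtil.CP_WINDOWS_1252)`.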

codepageToEncoding

public static java.lang.String codepageToEncoding(int codepage)
                                           throws java.io.UnsupportedEncodingException

Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).

Parameters:
codepage - The codepage number
Returns:
The character encoding's name. If the codepage number is 65001, the encoding name is "UTF-8". All other positive numbers are mapped to their Java NIO names, normally either "windows-" followed by the number, e.g. "windows-1251" or "windows-1252", or "cp" followed by the number.
Throws:
java.io.UnsupportedEncodingException - if the specified codepage is less than zero.

codepageToEncoding

public static java.lang.String codepageToEncoding(int codepage,
                                                  boolean javaLangFormat)
                                           throws java.io.UnsupportedEncodingException

Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.

Parameters:
codepage - The codepage number
javaLangFormat - Should Java Lang or Java NIO naming be used?
Returns:
The character encoding's name, in either Java Lang format (eg Cp1251, ISO8859_5) or Java NIO format (eg windows-1252, ISO-8859-9)
Throws:
java.io.UnsupportedEncodingException - if the specified codepage is less than zero.
See Also:
Supported Encodings
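
The two naming schemes refer to the same underlying charsets. As a small sketch of how they relate, the JDK itself accepts the Java Lang style names as aliases and reports the NIO canonical name. The class `CharsetNamingSketch` and its `nioName` method are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of POI.
final class CharsetNamingSketch {

    // Resolves any name or alias the JDK recognises to its NIO canonical name.
    static String nioName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(nioName("Cp1251"));    // windows-1251
        System.out.println(nioName("ISO8859_5")); // ISO-8859-5
    }
}
```

Either form can therefore be passed to `Charset.forName`; the `javaLangFormat` flag only matters when the name itself is handed to an API that expects one particular spelling.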

cp950ToString

public static java.lang.String cp950ToString(byte[] data,
                                             int offset,
                                             int lengthInBytes)
Attempts to convert a little-endian byte array in cp950 (Microsoft's dialect of Big5) to a String. Microsoft zero-pads ASCII characters in this data, and the padding bytes are dropped. The conversion is best-effort and may have room for improvement.

Parameters:
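data - The byte array containing the cp950 data
offset - The index in the byte array at which to start reading
lengthInBytes - The number of bytes to read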
Returns:
Decoded String
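
One plausible reading of the description above is that the data consists of little-endian two-byte units, where a unit whose high byte is zero holds a single ASCII character. The sketch below implements that reading with the JDK only. It is an assumption about the layout, not POI's actual implementation; `Cp950Sketch` is a hypothetical name, and the JDK's "Big5" charset is used as a stand-in for cp950.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of POI.
final class Cp950Sketch {

    static String decode(byte[] data, int offset, int lengthInBytes) {
        StringBuilder sb = new StringBuilder();
        int end = offset + lengthInBytes;
        // Walk the array two bytes at a time; a trailing odd byte is ignored.
        for (int i = offset; i + 1 < end; i += 2) {
            byte low = data[i];
            byte high = data[i + 1];
            if (high == 0) {
                // Zero-padded ASCII: keep the character, drop the padding byte.
                sb.append((char) (low & 0xFF));
            } else {
                // Assumed double-byte character: restore lead/trail order, then decode.
                sb.append(new String(new byte[] {high, low}, Charset.forName("Big5")));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x41, 0x00, 0x42, 0x00}; // "A" and "B", each zero-padded
        System.out.println(decode(data, 0, data.length)); // AB
    }
}
```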