Package org.apache.lucene.util
Class UnicodeUtil
java.lang.Object
org.apache.lucene.util.UnicodeUtil
Class to encode Java's UTF-16 char[] into UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.
-
Field Summary
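Fields
Modifier and Type, Field, Description
static final BytesRef BIG_TERM
    A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.
static final int UNI_SUR_HIGH_START
static final int UNI_SUR_HIGH_END
static final int UNI_SUR_LOW_START
static final int UNI_SUR_LOW_END
static final int UNI_REPLACEMENT_CHAR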
-
Method Summary
Modifier and Type, Method, Description
static int codePointCount(BytesRef utf8)
    Returns the number of code points in this UTF8 sequence.
static String newString(int[] codePoints, int offset, int count)
    Cover JDK 1.5 API.
static String toHexString
static void UTF16toUTF8(char[] source, int offset, int length, BytesRef result)
    Encode characters from a char[] source, starting at offset for length chars.
static void UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result)
    Encode characters from this String, starting at offset for length characters.
static int UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result)
    Encode characters from a char[] source, starting at offset for length chars.
static void UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars)
    Interprets the given byte array as UTF-8 and converts to UTF-16.
static void UTF8toUTF16(BytesRef bytesRef, CharsRef chars)
    Utility method for UTF8toUTF16(byte[], int, int, CharsRef)
static void UTF8toUTF32(BytesRef utf8, IntsRef utf32)
    This method assumes valid UTF8 input.
static boolean validUTF16String(char[] s, int size)
static boolean validUTF16String
-
Field Details
-
BIG_TERM
public static final BytesRef BIG_TERM
A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.
WARNING: This is not a valid UTF8 Term
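To illustrate why a run of 0xff bytes sorts after any UTF-8 term: well-formed UTF-8 never contains the byte 0xFF, so under unsigned byte-wise comparison such a term compares greater. The sketch below uses only the JDK (Java 9 or later for Arrays.compareUnsigned). The class name and the 10-byte length are assumptions made for illustration, not values taken from this page.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BigTermSketch {
    // Hypothetical stand-in for BIG_TERM; the length of 10 is an assumption.
    static byte[] bigTerm() {
        byte[] b = new byte[10];
        Arrays.fill(b, (byte) 0xff);
        return b;
    }

    // True if 'big' sorts strictly after the UTF-8 encoding of 'term'
    // when bytes are compared as unsigned values.
    static boolean sortsAfter(byte[] big, String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(big, utf8) > 0;
    }

    public static void main(String[] args) {
        byte[] big = bigTerm();
        // Highest code point, U+10FFFF, encodes with lead byte 0xF4, still below 0xFF.
        String highest = new String(Character.toChars(0x10FFFF));
        System.out.println(sortsAfter(big, "zebra"));
        System.out.println(sortsAfter(big, highest));
    }
}
```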
-
UNI_SUR_HIGH_START
public static final int UNI_SUR_HIGH_START
-
UNI_SUR_HIGH_END
public static final int UNI_SUR_HIGH_END
-
UNI_SUR_LOW_START
public static final int UNI_SUR_LOW_START
-
UNI_SUR_LOW_END
public static final int UNI_SUR_LOW_END
-
UNI_REPLACEMENT_CHAR
public static final int UNI_REPLACEMENT_CHAR
-
-
Method Details
-
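UTF16toUTF8WithHash
public static int UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. Returns a hash of the resulting bytes. After encoding, result.offset will always be 0.
-
UTF16toUTF8
public static void UTF16toUTF8(char[] source, int offset, int length, BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
-
UTF16toUTF8
public static void UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result)
Encode characters from this String, starting at offset for length characters. After encoding, result.offset will always be 0.
-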
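The UTF16toUTF8 methods above write into a caller-supplied result rather than returning a fresh array. The following is a minimal JDK-only sketch of that idea, not the library's implementation: the Buffer class is a hypothetical stand-in for BytesRef, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```java
final class Utf16ToUtf8Sketch {
    // Hypothetical reusable output holder (stand-in for BytesRef).
    static final class Buffer {
        byte[] bytes = new byte[16];
        int length; // valid bytes start at index 0
    }

    // Encodes source[offset, offset + length) as UTF-8 into out. The backing
    // array is replaced only when it is too small for the worst case.
    static void encode(char[] source, int offset, int length, Buffer out) {
        int max = length * 3; // a UTF-16 char never needs more than 3 bytes
        if (out.bytes.length < max) {
            out.bytes = new byte[max];
        }
        int up = 0;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            int cp = source[i];
            if (Character.isHighSurrogate(source[i]) && i + 1 < end
                    && Character.isLowSurrogate(source[i + 1])) {
                cp = Character.toCodePoint(source[i], source[i + 1]);
                i++; // consumed the low surrogate as well
            } else if (Character.isSurrogate(source[i])) {
                cp = 0xFFFD; // assumption: unpaired surrogate becomes U+FFFD
            }
            if (cp < 0x80) {
                out.bytes[up++] = (byte) cp;
            } else if (cp < 0x800) {
                out.bytes[up++] = (byte) (0xC0 | (cp >> 6));
                out.bytes[up++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.bytes[up++] = (byte) (0xE0 | (cp >> 12));
                out.bytes[up++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out.bytes[up++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                out.bytes[up++] = (byte) (0xF0 | (cp >> 18));
                out.bytes[up++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out.bytes[up++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out.bytes[up++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        out.length = up;
    }

    public static void main(String[] args) {
        Buffer buf = new Buffer();
        char[] text = "h\u00e9llo".toCharArray();
        encode(text, 0, text.length, buf);
        System.out.println(buf.length + " bytes");
    }
}
```

Calling encode repeatedly with the same Buffer reuses the backing array whenever it is already large enough, which is the allocation saving the class description refers to.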
validUTF16String
-
validUTF16String
public static boolean validUTF16String(char[] s, int size)
-
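No description is given for validUTF16String. Going by its name and the surrogate constants under Field Details, a check of this kind would typically verify that every high surrogate is immediately followed by a low surrogate and that no low surrogate stands alone. The sketch below is an assumption about that behavior, not the library's code; the constant values are the surrogate ranges from the Unicode standard, and their names only mirror the fields on this page.

```java
final class ValidUtf16Sketch {
    // Surrogate ranges as defined by the Unicode standard.
    static final int SUR_HIGH_START = 0xD800;
    static final int SUR_HIGH_END = 0xDBFF;
    static final int SUR_LOW_START = 0xDC00;
    static final int SUR_LOW_END = 0xDFFF;

    // Returns true if the first 'size' chars of s contain no unpaired surrogate.
    static boolean valid(char[] s, int size) {
        for (int i = 0; i < size; i++) {
            char ch = s[i];
            if (ch >= SUR_HIGH_START && ch <= SUR_HIGH_END) {
                if (i + 1 >= size) {
                    return false; // high surrogate at the very end
                }
                char next = s[++i];
                if (next < SUR_LOW_START || next > SUR_LOW_END) {
                    return false; // high surrogate not followed by a low one
                }
            } else if (ch >= SUR_LOW_START && ch <= SUR_LOW_END) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] ok = "a\uD83D\uDE00".toCharArray();
        System.out.println(valid(ok, ok.length));
        System.out.println(valid(ok, 2)); // cuts the pair in half
    }
}
```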
codePointCount
public static int codePointCount(BytesRef utf8)
Returns the number of code points in this UTF8 sequence.
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
Throws:
IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
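The head-byte-only counting described for codePointCount can be sketched as follows. This is an illustrative JDK-only version under stated assumptions, not the library's code: it takes a plain byte[] with offset and length instead of a BytesRef, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CodePointCountSketch {
    // Counts code points by inspecting only the first byte of each sequence;
    // continuation bytes are skipped without being checked.
    static int count(byte[] utf8, int offset, int length) {
        int pos = offset;
        int limit = offset + length;
        int count = 0;
        while (pos < limit) {
            int v = utf8[pos] & 0xFF;
            if (v < 0x80) {
                pos += 1; // 0xxxxxxx: single byte
            } else if (v >= 0xC0 && v < 0xE0) {
                pos += 2; // 110xxxxx: two-byte sequence
            } else if (v >= 0xE0 && v < 0xF0) {
                pos += 3; // 1110xxxx: three-byte sequence
            } else if (v >= 0xF0 && v < 0xF8) {
                pos += 4; // 11110xxx: four-byte sequence
            } else {
                throw new IllegalArgumentException("invalid header byte at " + pos);
            }
            count++;
        }
        if (pos > limit) {
            throw new IllegalArgumentException("truncated sequence");
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(count(bytes, 0, bytes.length));
    }
}
```

Because only head bytes are examined, input whose continuation bytes are wrong is still counted without error; only a bad head byte or a sequence that runs past the end is reported.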
-
UTF8toUTF32
public static void UTF8toUTF32(BytesRef utf8, IntsRef utf32)
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
Throws:
IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
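As a rough illustration of converting UTF-8 to one int per code point while validating only the head byte, consider the sketch below. It is an assumption-laden stand-in, not the library's code: it returns a new int[] rather than filling an IntsRef, and the names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ToUtf32Sketch {
    // Decodes utf8[offset, offset + length) into code points. Only the head
    // byte of each sequence is checked; continuation bytes are trusted.
    static int[] decode(byte[] utf8, int offset, int length) {
        int[] out = new int[length]; // at most one code point per input byte
        int n = 0;
        int pos = offset;
        int limit = offset + length;
        while (pos < limit) {
            int b = utf8[pos] & 0xFF;
            int extra; // number of continuation bytes that follow
            int cp;    // payload bits taken from the head byte
            if (b < 0x80) {
                extra = 0; cp = b;
            } else if (b >= 0xC0 && b < 0xE0) {
                extra = 1; cp = b & 0x1F;
            } else if (b >= 0xE0 && b < 0xF0) {
                extra = 2; cp = b & 0x0F;
            } else if (b >= 0xF0 && b < 0xF8) {
                extra = 3; cp = b & 0x07;
            } else {
                throw new IllegalArgumentException("invalid header byte at " + pos);
            }
            if (pos + extra >= limit) {
                throw new IllegalArgumentException("truncated sequence");
            }
            pos++;
            for (int i = 0; i < extra; i++) {
                cp = (cp << 6) | (utf8[pos++] & 0x3F); // append 6 payload bits
            }
            out[n++] = cp;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(decode(bytes, 0, bytes.length)));
    }
}
```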
-
newString
public static String newString(int[] codePoints, int offset, int count)
Cover JDK 1.5 API. Create a String from an array of codePoints.
Parameters:
codePoints - The code point array
offset - The start of the text in the code point array
count - The number of code points
Returns:
a String representing the count code points starting at offset
Throws:
IllegalArgumentException - If an invalid code point is encountered
IndexOutOfBoundsException - If the offset or count are out of bounds.
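The JDK constructor String(int[] codePoints, int offset, int count) has the same parameters and exceptions as newString, so it serves as a convenient way to see the expected behavior. The wrapper below is a hypothetical helper for illustration, not this class's method.

```java
final class NewStringSketch {
    // Builds a String from 'count' code points starting at 'offset',
    // delegating to the JDK constructor.
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        // 'H', 'i', U+1F600 (outside the BMP), '!'
        int[] cps = {0x48, 0x69, 0x1F600, 0x21};
        String s = fromCodePoints(cps, 1, 2);
        // Two code points, but three UTF-16 chars because U+1F600 needs a surrogate pair.
        System.out.println(s.length());
        System.out.println(s.codePointCount(0, s.length()));
    }
}
```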
-
toHexString
-
UTF8toUTF16
public static void UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars)
Interprets the given byte array as UTF-8 and converts to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code unit.
NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
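The worst-case sizing mentioned above follows from the encoding: a 1-, 2- or 3-byte sequence yields one UTF-16 char and a 4-byte sequence yields two, so the output never needs more chars than there are input bytes. The sketch below illustrates this with a hypothetical Chars holder standing in for CharsRef. Like the method it illustrates, it performs no validation and may throw ArrayIndexOutOfBoundsException on malformed input; it is not the library's implementation.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ToUtf16Sketch {
    // Hypothetical reusable output holder (stand-in for CharsRef).
    static final class Chars {
        char[] chars = new char[8];
        int length;
    }

    // Decodes utf8[offset, offset + length) into out, growing out.chars only
    // when it is smaller than the worst case of one char per byte.
    static void decode(byte[] utf8, int offset, int length, Chars out) {
        if (out.chars.length < length) {
            out.chars = new char[length];
        }
        int n = 0;
        int pos = offset;
        int limit = offset + length;
        while (pos < limit) {
            int b = utf8[pos++] & 0xFF;
            if (b < 0xC0) {
                // ASCII; stray continuation bytes are not rejected in this sketch
                out.chars[n++] = (char) b;
            } else if (b < 0xE0) {
                out.chars[n++] = (char) (((b & 0x1F) << 6) | (utf8[pos++] & 0x3F));
            } else if (b < 0xF0) {
                out.chars[n++] = (char) (((b & 0x0F) << 12)
                        | ((utf8[pos] & 0x3F) << 6)
                        | (utf8[pos + 1] & 0x3F));
                pos += 2;
            } else {
                int cp = ((b & 0x07) << 18)
                        | ((utf8[pos] & 0x3F) << 12)
                        | ((utf8[pos + 1] & 0x3F) << 6)
                        | (utf8[pos + 2] & 0x3F);
                pos += 3;
                n += Character.toChars(cp, out.chars, n); // writes a surrogate pair
            }
        }
        out.length = n;
    }

    public static void main(String[] args) {
        Chars out = new Chars();
        byte[] bytes = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        decode(bytes, 0, bytes.length, out);
        System.out.println(new String(out.chars, 0, out.length).length());
    }
}
```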
-
UTF8toUTF16
public static void UTF8toUTF16(BytesRef bytesRef, CharsRef chars)
Utility method for UTF8toUTF16(byte[], int, int, CharsRef).
-