Object
Character.Subset
Filter
- Enclosing class:
Characters
Subsets of Unicode characters identified by their general category.
The categories are identified by constants defined in the
Character class, like
LOWERCASE_LETTER,
UPPERCASE_LETTER,
DECIMAL_DIGIT_NUMBER and
SPACE_SEPARATOR.
An instance of this class can be obtained from an enumeration of character types
using the forTypes(byte[]) method, or using one of the constants predefined
in this class. Then, Unicode characters can be tested for inclusion in the subset by
calling the contains(int) method.
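As an illustration of how a category-based subset can work, the following minimal sketch builds a filter from Character type constants and tests code points against it. It relies only on java.lang.Character. The class name TypeFilterSketch and its bit-mask representation are assumptions made for this example, not the SIS implementation.

```java
/** Hypothetical stand-in for a category-based character filter (not the SIS class). */
final class TypeFilterSketch {
    /** Bit n is set when the general category with Character type value n is accepted. */
    private final long typeMask;

    private TypeFilterSketch(long typeMask) {
        this.typeMask = typeMask;
    }

    /** Builds a filter accepting the union of the given general categories. */
    static TypeFilterSketch forTypes(byte... types) {
        long mask = 0;
        for (byte type : types) {
            mask |= 1L << type;          // Character type constants are small non-negative values
        }
        return new TypeFilterSketch(mask);
    }

    /** Returns true if the general category of the code point is one of the accepted types. */
    boolean contains(int codePoint) {
        return (typeMask & (1L << Character.getType(codePoint))) != 0;
    }
}
```

With SIS itself, the corresponding calls would be Characters.Filter.forTypes(byte...) followed by contains(int), as described above.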
Relationship with international standards
ISO 19162:2015 §B.5.2 recommends ignoring spaces, case, and the following characters when comparing two identified object names: “_” (underscore), “-” (minus sign), “/” (solidus),
“(” (left parenthesis) and “)” (right parenthesis).
The same specification also limits the set of valid characters in a name to the following (§6.3.1):
A-Z a-z 0-9 _ [ ] ( ) { } < = > . , : ; + - (space) % & ' " * ^ / \ ? | °
Note: SIS does not enforce this restriction in its programmatic API,
but may perform some character substitutions at Well Known Text (WKT) formatting time.
If we take only the characters in the above list which are valid in a Unicode identifier and remove the characters that ISO 19162 recommends to ignore, the only characters
left are letters and digits.
- Since:
- 0.3
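The reasoning in the Relationship with international standards section can be checked with a short sketch. It encodes the character list of ISO 19162 §6.3.1 as quoted above, keeps the characters that Character.isUnicodeIdentifierPart(int) accepts, and drops the ones that §B.5.2 recommends ignoring. The class and method names are hypothetical and only java.lang.Character is used, so this is not how SIS performs the check.

```java
/** Hypothetical helper encoding the character lists quoted from ISO 19162 (not SIS code). */
final class Iso19162CharactersSketch {
    /** Non-alphanumeric characters allowed by section 6.3.1, including the space and the degree sign. */
    private static final String ALLOWED_SYMBOLS = "_[](){}<=>.,:;+- %&'\"*^/\\?|\u00B0";

    /** Characters that section B.5.2 recommends ignoring in name comparisons (space included). */
    private static final String IGNORED = "_-/() ";

    /** Returns true if the code point is in the set allowed by section 6.3.1. */
    static boolean isAllowed(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                || ALLOWED_SYMBOLS.indexOf(c) >= 0;
    }

    /** Returns true if every code point of the name is in the allowed set. */
    static boolean isValidName(String name) {
        return name.codePoints().allMatch(Iso19162CharactersSketch::isAllowed);
    }

    /** Allowed, valid in a Unicode identifier, and not in the ignored list. */
    static boolean isSignificant(int c) {
        return isAllowed(c) && Character.isUnicodeIdentifierPart(c) && IGNORED.indexOf(c) < 0;
    }
}
```

Under these assumptions, every code point for which isSignificant returns true is also accepted by Character.isLetterOrDigit(int), which matches the conclusion stated above.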
Field Summary
Fields
Modifier and Type, Field, Description:
- static final Characters.Filter LETTERS_AND_DIGITS - The subset of all characters for which Character.isLetterOrDigit(int) returns true.
- static final Characters.Filter UNICODE_IDENTIFIER - The subset of all characters for which Character.isUnicodeIdentifierPart(int) returns true, excluding ignorable characters.
Method Summary
Modifier and Type, Method, Description:
- boolean contains(int codePoint) - Returns true if this subset contains the given Unicode character.
- static Characters.Filter forTypes(byte... types) - Returns a subset representing the union of all Unicode characters of the given types.
Methods inherited from class Character.Subset
equals, hashCode, toString
Field Details
LETTERS_AND_DIGITS
The subset of all characters for which Character.isLetterOrDigit(int) returns true. This subset includes the following general categories: Character.LOWERCASE_LETTER, UPPERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER and DECIMAL_DIGIT_NUMBER.
SIS uses this filter when comparing two identified object names. See the Relationship with international standards section in this class javadoc for more information.
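To make the name-comparison use case concrete, here is a minimal sketch of a comparison that keeps only letters and digits and ignores case. It uses Character.isLetterOrDigit(int) directly instead of the SIS filter. The class and method names are hypothetical, and lower-casing each code point is a simplification chosen for the example, not necessarily what SIS does.

```java
/** Hypothetical name comparison that ignores everything except letters and digits (not SIS code). */
final class NameComparisonSketch {
    /** Keeps only letters and digits, folded to lower case code point by code point. */
    static String simplify(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        name.codePoints()
            .filter(Character::isLetterOrDigit)   // drops spaces, "_", "-", "/", "(", ")" and other symbols
            .map(Character::toLowerCase)          // ignores case
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /** Returns true if both names have the same letters and digits, ignoring case. */
    static boolean sameName(String a, String b) {
        return simplify(a).equals(simplify(b));
    }
}
```

In this sketch, "WGS 84" and "wgs_84" compare as equal, while "WGS 84" and "WGS 72" do not.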
UNICODE_IDENTIFIER
The subset of all characters for which Character.isUnicodeIdentifierPart(int) returns true, excluding ignorable characters. This subset includes all the LETTERS_AND_DIGITS categories with the addition of the following ones: Character.LETTER_NUMBER, CONNECTOR_PUNCTUATION, NON_SPACING_MARK and COMBINING_SPACING_MARK.
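The predicate described above can be approximated with standard Character methods alone. The sketch below is an assumption about how such a test could be written: the class name is hypothetical, and a category-based implementation such as the SIS one may differ from it for some code points.

```java
/** Hypothetical approximation of the UNICODE_IDENTIFIER subset using java.lang.Character only. */
final class IdentifierFilterSketch {
    /** Returns true for identifier parts that are not ignorable characters. */
    static boolean contains(int codePoint) {
        return Character.isUnicodeIdentifierPart(codePoint)
                && !Character.isIdentifierIgnorable(codePoint);
    }
}
```

For example, the underscore (CONNECTOR_PUNCTUATION) and U+0301 COMBINING ACUTE ACCENT (NON_SPACING_MARK) are accepted by this sketch, while U+00AD SOFT HYPHEN is rejected because it is an ignorable format character.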
Method Details
contains
public boolean contains(int codePoint)
Returns true if this subset contains the given Unicode character.
- Parameters:
codePoint - the Unicode character, as a code point value.
- Returns:
true if this subset contains the given character.
forTypes
public static Characters.Filter forTypes(byte... types)
Returns a subset representing the union of all Unicode characters of the given types.
- Parameters:
types - the character types, as Character constants.
- Returns:
the subset of Unicode characters of the given types.