regexps.com
This chapter describes the foundation of support for the Unicode character set in the Hackerlab C library.
This chapter is not a tutorial introduction to Unicode. We presume that readers are already somewhat familiar with Unicode. A very brief introduction can be found in An Absurdly Brief Introduction to Unicode.
enum uni_encoding_schemes;
Values of the enumerated type uni_encoding_schemes are used in interfaces throughout the Hackerlab C library to identify encoding schemes for strings or streams of Unicode characters. (See An Absurdly Brief Introduction to Unicode.)
     enum uni_encoding_schemes
      {
        uni_iso8859_1,
        uni_utf8,
        uni_utf16,
        uni_utf16be,
        uni_utf16le,
      };     
uni_iso8859_1 refers to a degenerate encoding scheme. Each character is stored in one byte. Only characters in the range U+0000 .. U+00FF can be represented.
uni_utf8 refers to the UTF-8 encoding scheme.
uni_utf16 refers to UTF-16 in the native byte order of the machine.
uni_utf16be refers to UTF-16, explicitly in big-endian order.
uni_utf16le refers to UTF-16, explicitly in little-endian order.
Some low-level functions in the Hackerlab C library work with any of these five encodings. Higher-level functions work only with uni_iso8859_1, uni_utf8, and uni_utf16.
Code units in a uni_utf8 string are of type t_uchar (unsigned 8-bit integer). Code units in a uni_utf16 string are of type t_unichar (unsigned 16-bit integer). Unicode code points are of type t_unicode. (See Machine-Specific Definitions.)
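As a rough illustration of how the encoding scheme determines the code unit type, the sketch below maps each enumerator to the size of one code unit. The typedefs are local stand-ins for the library's machine-specific definitions (the 32-bit width chosen for t_unicode is an assumption), and code_unit_size is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the Hackerlab typedefs. */
typedef uint8_t  t_uchar;    /* UTF-8 or ISO 8859-1 code unit */
typedef uint16_t t_unichar;  /* UTF-16 code unit */
typedef uint32_t t_unicode;  /* code point; this width is an assumption */

enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le
};

/* Hypothetical helper: size in bytes of one code unit. */
static size_t
code_unit_size (enum uni_encoding_schemes enc)
{
  switch (enc)
    {
    case uni_iso8859_1:
    case uni_utf8:
      return sizeof (t_uchar);
    case uni_utf16:
    case uni_utf16be:
    case uni_utf16le:
      return sizeof (t_unichar);
    }
  assert (!"unknown encoding scheme");
  return 0;
}
```

A code point of type t_unicode may occupy more than one code unit in UTF-8 or UTF-16, which is why the two notions have separate types.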
The Hackerlab C Library is designed to operate correctly for programs which internally use any combination of the encodings iso8859-1, utf-8, and utf-16. (Future releases are likely to add support for utf-32.)
typedef struct uni__undefined_struct * uni_string;
The type uni_string is a pointer to a value of unknown size. It is used to represent the address of a Unicode string or an address within a Unicode string.
Any two uni_string pointers may be compared for equality.
uni_string pointers within a single string may be compared using any relational operator (<, >, etc.).
uni_string pointers are created from UTF-8 pointers (t_uchar *) and from UTF-16 pointers (t_unichar *) by means of a cast:
     uni_string s = (uni_string)utf_8_string;
     uni_string t = (uni_string)utf_16_string;
By convention, all functions that operate on Unicode strings accept two parameters for each string, an encoding scheme and a string pointer, as in this function declaration:
     void uni_fn (enum uni_encoding_scheme encoding,
                  uni_string s);
By convention, the length of a Unicode string is always measured in code units, no matter what the size of those code units. Integer string indexes are also measured in code units.
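The two conventions above (an encoding parameter paired with a uni_string, and lengths measured in code units) can be sketched as follows. This is only an illustration: sketch_length is a hypothetical helper, the typedefs are local stand-ins, and the assumption that strings end with a zero code unit is mine, not something this chapter states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  t_uchar;    /* stand-in: 8-bit code unit */
typedef uint16_t t_unichar;  /* stand-in: 16-bit code unit */

enum uni_encoding_schemes
{
  uni_iso8859_1, uni_utf8, uni_utf16, uni_utf16be, uni_utf16le
};

typedef struct uni__undefined_struct * uni_string;

/* Hypothetical helper: count code units before a terminating zero
   code unit (zero termination is an assumption of this sketch). */
static size_t
sketch_length (enum uni_encoding_schemes encoding, uni_string s)
{
  size_t n = 0;

  switch (encoding)
    {
    case uni_iso8859_1:
    case uni_utf8:
      {
        const t_uchar * p = (const t_uchar *)s;
        while (p[n] != 0)
          ++n;
        break;
      }
    case uni_utf16:
    case uni_utf16be:
    case uni_utf16le:
      {
        const t_unichar * p = (const t_unichar *)s;
        while (p[n] != 0)
          ++n;
        break;
      }
    }
  return n;
}

/* The two code points U+0068 U+00E9 in each encoding. */
static t_uchar   demo_utf8[]  = { 0x68, 0xC3, 0xA9, 0 };
static t_unichar demo_utf16[] = { 0x0068, 0x00E9, 0 };
```

With these definitions, sketch_length (uni_utf8, (uni_string)demo_utf8) is 3 and sketch_length (uni_utf16, (uni_string)demo_utf16) is 2, even though both arrays hold the same two characters.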
These functions were not ready for the current release of the Hackerlab C Library. They will be included in future releases.
The functions and macros in this chapter present programs with an interface to various properties extracted from the Unicode Character Database as published by the Unicode consortium.
For information about the version of the database used and the implications of using these functions on program size, see Data Sheet for the Hackerlab Unicode Database.
Function unidata_is_assigned_code_point
int unidata_is_assigned_code_point (t_unicode c);
Return 1 if c is an assigned code point, 0 otherwise.
A code point is assigned if it has an entry in unidata.txt or is part of a range of characters whose end-points are defined in unidata.txt.
Type enum uni_general_category
enum uni_general_category;
The General Category of a Unicode character is represented by an enumerated value of this type.
The primary category values are:
     uni_general_category_Lu         Letter, uppercase
     uni_general_category_Ll         Letter, lowercase
     uni_general_category_Lt         Letter, titlecase
     uni_general_category_Lm         Letter, modifier
     uni_general_category_Lo         Letter, other
     uni_general_category_Mn         Mark, nonspacing
     uni_general_category_Mc         Mark, spacing combining
     uni_general_category_Me         Mark, enclosing
     uni_general_category_Nd         Number, decimal digit
     uni_general_category_Nl         Number, letter
     uni_general_category_No         Number, other
     uni_general_category_Zs         Separator, space
     uni_general_category_Zl         Separator, line
     uni_general_category_Zp         Separator, paragraph
     uni_general_category_Cc         Other, control
     uni_general_category_Cf         Other, format
     uni_general_category_Cs         Other, surrogate
     uni_general_category_Co         Other, private use
     uni_general_category_Cn         Other, not assigned
     uni_general_category_Pc         Punctuation, connector
     uni_general_category_Pd         Punctuation, dash
     uni_general_category_Ps         Punctuation, open
     uni_general_category_Pe         Punctuation, close
     uni_general_category_Pi         Punctuation, initial quote
     uni_general_category_Pf         Punctuation, final quote
     uni_general_category_Po         Punctuation, other
     uni_general_category_Sm         Symbol, math
     uni_general_category_Sc         Symbol, currency
     uni_general_category_Sk         Symbol, modifier
     uni_general_category_So         Symbol, other
Seven additional synthetic categories are defined. These are:
     uni_general_category_L          Letter
     uni_general_category_M          Mark
     uni_general_category_N          Number
     uni_general_category_Z          Separator
     uni_general_category_C          Other
     uni_general_category_P          Punctuation
     uni_general_category_S          Symbol
No character is given a synthetic category as its general category. Rather, the synthetic categories are used in some interfaces to refer to all characters having a general category within one of the synthetic categories.
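The relationship between primary and synthetic categories can be sketched as a simple mapping. The enum below copies the names listed above, but the order of the enumerators is an assumption, and both synthetic_category_of and category_matches are hypothetical helpers, not part of the library.

```c
#include <assert.h>

enum uni_general_category
{
  uni_general_category_Lu, uni_general_category_Ll, uni_general_category_Lt,
  uni_general_category_Lm, uni_general_category_Lo,
  uni_general_category_Mn, uni_general_category_Mc, uni_general_category_Me,
  uni_general_category_Nd, uni_general_category_Nl, uni_general_category_No,
  uni_general_category_Zs, uni_general_category_Zl, uni_general_category_Zp,
  uni_general_category_Cc, uni_general_category_Cf, uni_general_category_Cs,
  uni_general_category_Co, uni_general_category_Cn,
  uni_general_category_Pc, uni_general_category_Pd, uni_general_category_Ps,
  uni_general_category_Pe, uni_general_category_Pi, uni_general_category_Pf,
  uni_general_category_Po,
  uni_general_category_Sm, uni_general_category_Sc, uni_general_category_Sk,
  uni_general_category_So,
  /* synthetic categories */
  uni_general_category_L, uni_general_category_M, uni_general_category_N,
  uni_general_category_Z, uni_general_category_C, uni_general_category_P,
  uni_general_category_S
};

/* Hypothetical helper: the synthetic category containing CAT.
   A synthetic category is returned unchanged. */
static enum uni_general_category
synthetic_category_of (enum uni_general_category cat)
{
  switch (cat)
    {
    case uni_general_category_Lu: case uni_general_category_Ll:
    case uni_general_category_Lt: case uni_general_category_Lm:
    case uni_general_category_Lo:
      return uni_general_category_L;
    case uni_general_category_Mn: case uni_general_category_Mc:
    case uni_general_category_Me:
      return uni_general_category_M;
    case uni_general_category_Nd: case uni_general_category_Nl:
    case uni_general_category_No:
      return uni_general_category_N;
    case uni_general_category_Zs: case uni_general_category_Zl:
    case uni_general_category_Zp:
      return uni_general_category_Z;
    case uni_general_category_Cc: case uni_general_category_Cf:
    case uni_general_category_Cs: case uni_general_category_Co:
    case uni_general_category_Cn:
      return uni_general_category_C;
    case uni_general_category_Pc: case uni_general_category_Pd:
    case uni_general_category_Ps: case uni_general_category_Pe:
    case uni_general_category_Pi: case uni_general_category_Pf:
    case uni_general_category_Po:
      return uni_general_category_P;
    case uni_general_category_Sm: case uni_general_category_Sc:
    case uni_general_category_Sk: case uni_general_category_So:
      return uni_general_category_S;
    default:
      return cat;
    }
}

/* Hypothetical helper: does a character whose primary category is
   ACTUAL belong to WANTED, where WANTED may be primary or synthetic? */
static int
category_matches (enum uni_general_category actual,
                  enum uni_general_category wanted)
{
  return wanted == actual || wanted == synthetic_category_of (actual);
}
```

An interface that accepts either kind of category could use a test like category_matches to decide membership.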
Function unidata_general_category
enum uni_general_category unidata_general_category (t_unicode c);
Return the general category of c.
The category returned for unassigned code points is uni_general_category_Cn (Other, Not Assigned).
Function unidata_decimal_digit_value
int unidata_decimal_digit_value (t_unicode c);
If c is a decimal digit (regardless of script), return its digit value. Otherwise, return -1.
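A typical use of a digit-value function is script-independent number parsing. The sketch below shows the idea with a stand-in, stub_decimal_digit_value, that follows the contract described above but knows only ASCII and Arabic-Indic digits; real code would call unidata_decimal_digit_value instead. parse_decimal is a hypothetical helper, the t_unicode typedef is a local stand-in, and overflow is not checked.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* stand-in; the width is an assumption */

/* Stand-in for unidata_decimal_digit_value: same contract, but it
   recognizes only U+0030..U+0039 and U+0660..U+0669. */
static int
stub_decimal_digit_value (t_unicode c)
{
  if (c >= 0x0030 && c <= 0x0039)
    return (int)(c - 0x0030);
  if (c >= 0x0660 && c <= 0x0669)
    return (int)(c - 0x0660);
  return -1;
}

/* Hypothetical helper: parse the leading run of decimal digits in a
   0-terminated array of code points.  Returns -1 if there is none. */
static long
parse_decimal (const t_unicode * s)
{
  long value = 0;
  int seen = 0;
  int d;

  while (*s != 0 && (d = stub_decimal_digit_value (*s)) >= 0)
    {
      value = value * 10 + d;
      seen = 1;
      ++s;
    }
  return seen ? value : -1;
}

/* ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO */
static const t_unicode demo_digits[] = { 0x0664, 0x0662, 0 };
/* '1', ARABIC-INDIC DIGIT ZERO, 'A' */
static const t_unicode demo_mixed[] = { 0x0031, 0x0660, 0x0041, 0 };
```

Here parse_decimal (demo_digits) yields 42, and parse_decimal (demo_mixed) yields 10 because parsing stops at the first non-digit.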
Type enum uni_bidi_category
enum uni_bidi_category;
The Bidirectional Category of a Unicode character is represented by an enumerated value of this type.
The bidi category values are:
     uni_bidi_L      Left-to-Right
     uni_bidi_LRE    Left-to-Right Embedding
     uni_bidi_LRO    Left-to-Right Override
     uni_bidi_R      Right-to-Left
     uni_bidi_AL     Right-to-Left Arabic
     uni_bidi_RLE    Right-to-Left Embedding
     uni_bidi_RLO    Right-to-Left Override
     uni_bidi_PDF    Pop Directional Format
     uni_bidi_EN     European Number
     uni_bidi_ES     European Number Separator
     uni_bidi_ET     European Number Terminator
     uni_bidi_AN     Arabic Number
     uni_bidi_CS     Common Number Separator
     uni_bidi_NSM    Non-Spacing Mark
     uni_bidi_BN     Boundary Neutral
     uni_bidi_B      Paragraph Separator
     uni_bidi_S      Segment Separator
     uni_bidi_WS     Whitespace
     uni_bidi_ON     Other Neutrals
Function unidata_bidi_category
enum uni_bidi_category unidata_bidi_category (t_unicode c);
Return the bidirectional category of c.
The category returned for unassigned code points is uni_bidi_ON (other neutrals).
int unidata_is_mirrored (t_unicode c);
Return 1 if c is mirrored in bidirectional text, 0 otherwise.
Macro unidata_canonical_combining_class
#define unidata_canonical_combining_class(C)
Return the canonical combining class of a Unicode character.
Combining classes are represented as unsigned 8-bit integers.
These functions use the case mappings in unidata.txt.
t_unicode unidata_to_upper (t_unicode c);
If c has a default uppercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_lower (t_unicode c);
If c has a default lowercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_title (t_unicode c);
If c has a default titlecase mapping, return that mapping. Otherwise, return c.
Type enum uni_decomposition_type
enum uni_decomposition_type;
The decomposition mapping of a character is described by values of this enumerated type:
     uni_decomposition_none
     uni_decomposition_canonical
     uni_decomposition_font
     uni_decomposition_noBreak
     uni_decomposition_initial
     uni_decomposition_medial
     uni_decomposition_final
     uni_decomposition_isolated
     uni_decomposition_circle
     uni_decomposition_super
     uni_decomposition_sub
     uni_decomposition_vertical
     uni_decomposition_wide
     uni_decomposition_narrow
     uni_decomposition_small
     uni_decomposition_square
     uni_decomposition_fraction
     uni_decomposition_compat
The value uni_decomposition_none indicates that a character has no decomposition mapping.
Type struct uni_decomposition_mapping
struct uni_decomposition_mapping;
A character's decomposition mapping is described by this structure. It has the fields:
     enum uni_decomposition_type type;
     t_unicode * decomposition;
type is the type of decomposition.
If type is not uni_decomposition_none, then decomposition is a 0-terminated array of code points which are the decomposition of the character.
Macro unidata_character_decomposition_mapping
#define unidata_character_decomposition_mapping(C)
Return the decomposition mapping of C. This macro returns a pointer to a struct uni_decomposition_mapping.
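Walking a decomposition mapping amounts to checking the type field and then scanning the 0-terminated array. The sketch below does this on hand-written sample data rather than on the macro's result: the struct and enum are local copies of the declarations above (the enumerator order is an assumption), decomposition_length is a hypothetical helper, and the sample entry encodes the canonical decomposition of U+00E9 into U+0065 U+0301.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* stand-in; the width is an assumption */

enum uni_decomposition_type
{
  uni_decomposition_none, uni_decomposition_canonical,
  uni_decomposition_font, uni_decomposition_noBreak,
  uni_decomposition_initial, uni_decomposition_medial,
  uni_decomposition_final, uni_decomposition_isolated,
  uni_decomposition_circle, uni_decomposition_super,
  uni_decomposition_sub, uni_decomposition_vertical,
  uni_decomposition_wide, uni_decomposition_narrow,
  uni_decomposition_small, uni_decomposition_square,
  uni_decomposition_fraction, uni_decomposition_compat
};

struct uni_decomposition_mapping
{
  enum uni_decomposition_type type;
  t_unicode * decomposition;
};

/* Hypothetical helper: number of code points in the decomposition,
   or 0 when the character has no decomposition mapping. */
static size_t
decomposition_length (const struct uni_decomposition_mapping * m)
{
  size_t n = 0;

  if (m->type == uni_decomposition_none)
    return 0;
  while (m->decomposition[n] != 0)
    ++n;
  return n;
}

/* Hand-written sample data, not taken from the library's tables. */
static t_unicode demo_e_acute_parts[] = { 0x0065, 0x0301, 0 };
static struct uni_decomposition_mapping demo_e_acute =
  { uni_decomposition_canonical, demo_e_acute_parts };
static struct uni_decomposition_mapping demo_plain =
  { uni_decomposition_none, 0 };
```

The decomposition pointer is only read after the type check, since the text above guarantees the array only when type is not uni_decomposition_none.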
struct uni_block;
A structure of this type describes one of the standard blocks of Unicode characters ("Basic Latin", "Latin-1 Supplement", etc.).
struct uni_block
{
  t_uchar * name;       /* name of the block */
  t_unichar start;      /* first character in the block */
  t_unichar end;        /* last character in the block */
};
extern struct uni_block uni_blocks[];
The names of the standard Unicode blocks. This array is sorted in code-point order, from least to greatest.
n_uni_blocks is the number of blocks in uni_blocks. The array entry just past the last block has a null name:
     uni_blocks[n_uni_blocks].name == 0
extern const struct uni_block uni_blocks[];
extern const int n_uni_blocks;
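Because the array is sorted by code point and ends with an entry whose name is 0, finding the block that contains a code point can be sketched as a linear scan. The example runs on a small hand-written table, demo_blocks, rather than on uni_blocks itself. The struct is a local copy of the declaration above, the typedefs are stand-ins, and find_block and block_name are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t  t_uchar;    /* stand-in */
typedef uint16_t t_unichar;  /* stand-in */
typedef uint32_t t_unicode;  /* stand-in; the width is an assumption */

struct uni_block
{
  t_uchar * name;       /* name of the block */
  t_unichar start;      /* first character in the block */
  t_unichar end;        /* last character in the block */
};

/* Hand-written sample table laid out like uni_blocks: sorted by code
   point and followed by an entry whose name is 0. */
static struct uni_block demo_blocks[] =
{
  { (t_uchar *)"Basic Latin",        0x0000, 0x007F },
  { (t_uchar *)"Latin-1 Supplement", 0x0080, 0x00FF },
  { (t_uchar *)"Latin Extended-A",   0x0100, 0x017F },
  { 0, 0, 0 }
};

/* Hypothetical helper: return the block containing C, or 0. */
static const struct uni_block *
find_block (const struct uni_block * blocks, t_unicode c)
{
  const struct uni_block * b;

  for (b = blocks; b->name != 0; ++b)
    {
      if (c < b->start)
        break;                  /* sorted: no later block contains c */
      if (c <= b->end)
        return b;
    }
  return 0;
}

/* Hypothetical helper: block name as a C string, or "(none)". */
static const char *
block_name (t_unicode c)
{
  const struct uni_block * b = find_block (demo_blocks, c);
  return b ? (const char *)b->name : "(none)";
}
```

For a table with many entries, the same sortedness would also permit a binary search over the first n_uni_blocks entries.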
bits uni_universal_bitset (void);
Return the set of all assigned code points which are not surrogate code points and are not private use code points. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.
Function uni_general_category_bitset
bits uni_general_category_bitset (enum uni_general_category c);
Return the set of all assigned code points having the indicated general category or synthetic general category. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
c indicates which category to return. It may be a Unicode general category or a synthetic general category. (See General Category.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.