1.1. uniseg.codepoint — Unicode code point

Unicode code point

uniseg.codepoint.ord(c, index=None)

Return the integer value of the Unicode code point c

NOTE: Some Unicode code points may be represented by a pair of other code points (a “surrogate pair”). This function treats a surrogate pair as a representation of the original code point; e.g. ord(u'\ud842\udf9f') returns 134047 (0x20b9f), because u'\ud842\udf9f' is the surrogate-pair representation of u'\U00020b9f'.

>>> ord('a')
97
>>> ord('\u3042')
12354
>>> ord('\U00020b9f')
134047
>>> ord('abc')
Traceback (most recent call last):
  ...
TypeError: need a single Unicode code point as parameter
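
The arithmetic behind the surrogate-pair note above can be sketched in a few lines. This is a minimal illustration of the standard UTF-16 decoding formula, not necessarily how uniseg implements it; combine_surrogates is a hypothetical helper name that is not part of the module.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point.

    Hypothetical helper for illustration; not part of uniseg.
    """
    # High surrogates occupy U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD842, 0xDF9F)))  # prints 0x20b9f
```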

For compatibility, it returns the result of the built-in ord() when c is a single str object:

>>> ord('a')
97

When the index argument is specified (that is, not None), this function treats c as a Unicode string and returns the integer value of the code point at c[index] (or at c[index:index+2] when a surrogate pair starts there):

>>> ord('hello', 0)
104
>>> ord('hello', 1)
101
>>> ord('a\U00020b9f', 1)
134047
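
The indexed lookup described above might work roughly as follows. This is a sketch under stated assumptions: ord_at is a hypothetical name, the index is assumed to count string elements (code units on a narrow build), and a str containing explicit surrogates stands in for a narrow-build string, since Python 3.3 and later no longer have narrow builds.

```python
def ord_at(s, index):
    """Return the code point that starts at s[index].

    Hypothetical helper for illustration; not uniseg's implementation.
    """
    first = ord(s[index])
    # A high surrogate followed by a low surrogate is read as one code point,
    # which corresponds to looking at s[index:index+2].
    if 0xD800 <= first <= 0xDBFF and index + 1 < len(s):
        second = ord(s[index + 1])
        if 0xDC00 <= second <= 0xDFFF:
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    # Otherwise the element at s[index] is already a whole code point.
    return first


narrow_like = 'a\ud842\udf9f'  # 'a' followed by an explicit surrogate pair
print(ord_at(narrow_like, 0))  # prints 97
print(ord_at(narrow_like, 1))  # prints 134047
```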

uniseg.codepoint.unichr(cp)

Return the unicode object that represents the code point integer cp

>>> unichr(0x61) == 'a'
True

Note that in narrow-build Python some Unicode code points are represented by a pair of other code points (a “surrogate pair”). In those cases, this function returns a unicode object whose length is greater than one; e.g. unichr(0x20b9f) returns u'\U00020b9f', whereas the built-in unichr() may raise ValueError.

>>> unichr(0x20b9f) == '\U00020b9f'
True
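
The reverse direction, splitting a code point above U+FFFF into a surrogate pair, can be sketched as below. The helper name to_surrogate_pair is hypothetical and the code only illustrates the standard UTF-16 encoding formula; it is not taken from uniseg.

```python
def to_surrogate_pair(cp):
    """Return a two-element string holding the surrogate pair for cp.

    Hypothetical helper for illustration; not part of uniseg.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return chr(high) + chr(low)


pair = to_surrogate_pair(0x20B9F)
print(len(pair))                    # prints 2
print([hex(ord(u)) for u in pair])  # prints ['0xd842', '0xdf9f']
```

A result of length two is what the note above means by a unicode object whose length is greater than one.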

uniseg.codepoint.code_points(s)

Iterate over every Unicode code point of the unicode string s

>>> s = 'hello'
>>> list(code_points(s)) == ['h', 'e', 'l', 'l', 'o']
True

The number of iterations may differ from len(s), because in narrow-build Python some code points are represented by a pair of other code points (a “surrogate pair”).

>>> s = 'abc\U00020b9f\u3042'
>>> list(code_points(s)) == ['a', 'b', 'c', '\U00020b9f', '\u3042']
True
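
To see why the number of iterations can differ from len(s), the following sketch walks a string and keeps surrogate pairs together. It is an assumption-laden illustration rather than uniseg's code: iter_code_points is a hypothetical name, and a str with explicit surrogates imitates a narrow-build string.

```python
def iter_code_points(s):
    """Yield each code point of s, keeping surrogate pairs together.

    Hypothetical helper for illustration; not uniseg's implementation.
    """
    i = 0
    while i < len(s):
        is_pair = (
            0xD800 <= ord(s[i]) <= 0xDBFF
            and i + 1 < len(s)
            and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF
        )
        step = 2 if is_pair else 1  # a surrogate pair spans two elements
        yield s[i:i + step]
        i += step


s = 'abc\ud842\udf9f\u3042'  # the surrogate pair imitates a narrow build
print(len(s))                          # prints 6
print(len(list(iter_code_points(s))))  # prints 5
```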