1.2. uniseg.graphemecluster — Grapheme cluster

Unicode grapheme cluster breaking

UAX #29: Unicode Text Segmentation (Unicode 6.2.0) http://www.unicode.org/reports/tr29/tr29-21.html

uniseg.graphemecluster.grapheme_cluster_break(c, index=0)

Return the Grapheme_Cluster_Break property of c

c must be a single Unicode code point string.

>>> print(grapheme_cluster_break('\x0d'))
CR
>>> print(grapheme_cluster_break('\x0a'))
LF
>>> print(grapheme_cluster_break('a'))
Other

If index is specified, this function treats c as a Unicode string and returns the Grapheme_Cluster_Break property of the code point at c[index].

>>> print(grapheme_cluster_break(u'a\x0d', 1))
CR

uniseg.graphemecluster.grapheme_cluster_breakables(s)

Iterate grapheme cluster breaking opportunities for every position of s

Each yielded value is 1 for “break” or 0 for “do not break” before the corresponding position of s. The iteration yields exactly len(s) values.

>>> list(grapheme_cluster_breakables(u'ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables(u'g̈'))
[1, 0]
>>> list(grapheme_cluster_breakables(u''))
[]
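
The [1, 0] result for 'g̈' makes sense once the string is seen as two code points: a base letter followed by a combining mark. The following is a minimal sketch using only the standard-library unicodedata module, not uniseg. It assumes the string is 'g' followed by U+0308, which is how it appears in the example above. The canonical combining class is shown only as an illustration; uniseg decides breaks from the Grapheme_Cluster_Break property.

```python
import unicodedata

# Assumption: the 'g̈' in the example above is 'g' + U+0308.
s = 'g\u0308'

# One entry per code point: (code point, Unicode name, canonical combining class).
code_points = [
    (f'U+{ord(ch):04X}', unicodedata.name(ch), unicodedata.combining(ch))
    for ch in s
]

for entry in code_points:
    print(entry)
```

The second entry has a nonzero combining class, which marks it as a combining character that attaches to the preceding 'g'.
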
uniseg.graphemecluster.grapheme_cluster_boundaries(s, tailor=None)

Iterate indices of the grapheme cluster boundaries of s

For a non-empty string, the yielded indices start at 0 and end at len(s). An empty string yields nothing.

>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
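
The boundary lists above line up with the 0/1 sequences shown for grapheme_cluster_breakables. The following sketch shows one way the two could be related. The helper name boundaries_from_breakables is hypothetical, and this is not presented as uniseg's actual implementation.

```python
def boundaries_from_breakables(breakables):
    """Yield boundary indices from a 0/1 breakables sequence (hypothetical helper)."""
    length = 0
    for index, breakable in enumerate(breakables):
        if breakable:
            # A break is allowed before position `index`.
            yield index
        length = index + 1
    if length:
        # Close the last cluster at the end of the input.
        yield length


print(list(boundaries_from_breakables([1, 1, 1])))  # [0, 1, 2, 3]
print(list(boundaries_from_breakables([1, 0])))     # [0, 2]
print(list(boundaries_from_breakables([])))         # []
```
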
uniseg.graphemecluster.grapheme_clusters(s, tailor=None)

Iterate every grapheme cluster token of s

Grapheme clusters (both legacy and extended):

>>> list(grapheme_clusters('g̈')) == ['g̈']
True
>>> list(grapheme_clusters('각')) == ['각']
True
>>> list(grapheme_clusters('각')) == ['각']
True

Extended grapheme clusters:

>>> list(grapheme_clusters('நி')) == ['நி']
True
>>> list(grapheme_clusters('षि')) == ['षि']
True

An empty string yields an empty sequence:

>>> list(grapheme_clusters('')) == []
True
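
The results above are consistent with slicing s between consecutive boundary indices. The following sketch illustrates that relationship with a hypothetical helper, split_at_boundaries, fed with the boundary lists shown for grapheme_cluster_boundaries. It is an illustration under that assumption, not the library's own code.

```python
def split_at_boundaries(s, boundaries):
    """Slice s into pieces between consecutive boundary indices (hypothetical helper)."""
    pieces = []
    previous = None
    for index in boundaries:
        if previous is not None:
            pieces.append(s[previous:index])
        previous = index
    return pieces


print(split_at_boundaries('ABC', [0, 1, 2, 3]))        # ['A', 'B', 'C']
print(len(split_at_boundaries('g\u0308', [0, 2])))     # 1
print(split_at_boundaries('', []))                     # []
```
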

You can customize the default breaking behavior for a specific locale by passing a tailor function. The function receives s and its default breaking sequence (an iterator) as its arguments and returns the sequence of customized breaking opportunities:

>>> def tailor_grapheme_cluster_breakables(s, breakables):
...     """Keep 'c' followed by 'h' together as one grapheme cluster."""
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
... 
>>> s = 'Czech'
>>> list(grapheme_clusters(s)) == ['C', 'z', 'e', 'c', 'h']
True
>>> list(grapheme_clusters(s, tailor_grapheme_cluster_breakables)) == ['C', 'z', 'e', 'ch']
True
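
The tailor above hard-codes a single pair of letters. As a sketch of how the same idea might be generalized, the hypothetical factory make_digraph_tailor below builds a tailor function from a collection of two-character digraphs. The factory name and the digraph list are assumptions made for illustration; only the (s, breakables) signature comes from the description above. The demonstration feeds it a hand-written breakables list so that it runs without uniseg.

```python
def make_digraph_tailor(digraphs):
    """Return a tailor function that never breaks inside the given digraphs.

    Hypothetical helper: `digraphs` is an iterable of two-character strings.
    """
    digraph_set = frozenset(digraphs)

    def tailor(s, breakables):
        for i, breakable in enumerate(breakables):
            # Suppress the break before s[i] when s[i-1:i+1] is a digraph.
            if i > 0 and s[i - 1:i + 1] in digraph_set:
                yield 0
            else:
                yield breakable

    return tailor


keep_ch = make_digraph_tailor(['ch'])
# Default breakables for 'Czech' allow a break before every character.
print(list(keep_ch('Czech', [1, 1, 1, 1, 1])))  # [1, 1, 1, 1, 0]
```

Because the returned function takes s and breakables like the tailor above, it could presumably be passed as the tailor argument of grapheme_clusters in the same way.
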