1.3. uniseg.wordbreak — Word break

Unicode word breaking

UAX #29: Unicode Text Segmentation (Unicode 6.2.0) http://www.unicode.org/reports/tr29/tr29-21.html

uniseg.wordbreak.word_break(c, index=0)

Return the Word_Break property of c

c must be a single Unicode code point string.

>>> print(word_break('\x0d'))
CR
>>> print(word_break('\x0b'))
Newline
>>> print(word_break('ア'))
Katakana

If index is specified, this function treats c as a Unicode string and returns the Word_Break property of the code point at c[index].

>>> print(word_break('Aア', 1))
Katakana

uniseg.wordbreak.word_breakables(s)

Iterate word breaking opportunities for every position of s

Yields 1 for “break” and 0 for “do not break” before the code point at each position. The number of items yielded is the same as len(s).

>>> list(word_breakables(u'ABC'))
[1, 0, 0]
>>> list(word_breakables(u'Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables(u'\x01̈\x01'))
[1, 0, 1]

uniseg.wordbreak.word_boundaries(s, tailor=None)

Iterate indices of the word boundaries of s

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
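
The relation between the 0/1 flags of word_breakables and the indices described here can be illustrated without the library. The following is only a sketch under stated assumptions, not the uniseg implementation: boundaries_from_flags and split_at are hypothetical helper names, and the flag list is copied by hand from the documented word_breakables(u'Hello, world.') output above.

```python
def boundaries_from_flags(flags):
    """Yield boundary indices > 0, ending with len(flags) (hypothetical helper)."""
    length = 0
    for i, flag in enumerate(flags):
        # Position 0 is skipped: the description starts at the first boundary > 0.
        if flag and i > 0:
            yield i
        length = i + 1
    # The end of a non-empty string is always the last boundary.
    if length:
        yield length


def split_at(s, boundaries):
    """Cut s into the segments delimited by the given boundary indices."""
    start = 0
    for end in boundaries:
        yield s[start:end]
        start = end


s = 'Hello, world.'
# Flags copied from the documented word_breakables(u'Hello, world.') output.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]

bounds = list(boundaries_from_flags(flags))
print(bounds)                     # [5, 6, 7, 12, 13]
print(list(split_at(s, bounds)))  # ['Hello', ',', ' ', 'world', '.']
```

Slicing between consecutive indices in this way gives the kind of segments that words(s) is documented to iterate.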

uniseg.wordbreak.words(s, tailor=None)

Iterate user-perceived words of s

The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(u''))
[]
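
As the example above shows, the segments include whitespace and punctuation as separate items. A caller who wants only word-like items may filter them afterwards. The sketch below assumes one possible filter (keep a segment if it contains at least one alphanumeric character); this filter is an illustration and not part of uniseg. To stay self-contained, it rebuilds the segment list from the documented output string rather than calling words.

```python
# Documented output of '|'.join(words(s)) from the example above.
joined = 'The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?'

# Recover the individual segments; none of them contains '|' itself.
segments = joined.split('|')

# Assumed filter: a segment is word-like if any character is alphanumeric.
wordlike = [seg for seg in segments if any(ch.isalnum() for ch in seg)]

print(wordlike)
# ['The', 'quick', 'brown', 'fox', 'can’t', 'jump', '32.3', 'feet', 'right']
```

With the library available, the same filter could be applied directly to the items yielded by words(s).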