1.3. uniseg.wordbreak - Word break
Unicode word breaking
UAX #29: Unicode Text Segmentation (Unicode 6.2.0) http://www.unicode.org/reports/tr29/tr29-21.html
uniseg.wordbreak.word_break(c, index=0)
Return the Word_Break property of c.
c must be a single Unicode code point string.
>>> print(word_break('\x0d'))
CR
>>> print(word_break('\x0b'))
Newline
>>> print(word_break('ア'))
Katakana
If index is specified, this function treats c as a Unicode string and returns the Word_Break property of the code point at c[index].
>>> print(word_break('Aア', 1))
Katakana
uniseg.wordbreak.word_breakables(s)
Iterate word breaking opportunities for every position of s.
Each item is 1 for “break” or 0 for “do not break”. The number of items yielded is the same as len(s).
>>> list(word_breakables(u'ABC'))
[1, 0, 0]
>>> list(word_breakables(u'Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables(u'\x01\u0308\x01'))
[1, 0, 1]
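To make the 0/1 flags concrete, the following is a minimal sketch of how such a flag sequence could be turned into text segments. The helper name split_at_breaks is hypothetical and not part of uniseg; the flag list is copied by hand from the 'Hello, world.' example rather than computed by the library.

```python
def split_at_breaks(s, flags):
    """Yield segments of s, starting a new segment wherever flags[i] is 1.

    Hypothetical helper for illustration; not part of uniseg.
    """
    start = 0
    for i, flag in enumerate(flags):
        # A flag of 1 at position i means a break is allowed before s[i].
        if flag and i > start:
            yield s[start:i]
            start = i
    # Emit whatever remains after the last break.
    if start < len(s):
        yield s[start:]


text = 'Hello, world.'
# Flags copied from the documented example for this string.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
segments = list(split_at_breaks(text, flags))
print(segments)
```

With these flags the segments are 'Hello', ',', ' ', 'world' and '.', one per run that starts at a 1.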
uniseg.wordbreak.word_boundaries(s, tailor=None)
Iterate indices of the word boundaries of s.
This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
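As an illustration of how boundary indices of this shape can be consumed, here is a small sketch that slices a string at each index. The helper slices_from_boundaries is hypothetical, and the index list is written by hand to match the description above (first boundary greater than 0, last equal to len(s)); it is not output captured from the library.

```python
def slices_from_boundaries(s, boundaries):
    """Yield consecutive slices of s that end at each boundary index.

    Hypothetical helper for illustration; not part of uniseg.
    """
    start = 0
    for end in boundaries:
        yield s[start:end]
        start = end


text = 'Hello, world.'
# Hand-written indices, assumed for illustration: each one is > 0 and
# the last one equals len(text).
boundaries = [5, 6, 7, 12, 13]
pieces = list(slices_from_boundaries(text, boundaries))
print(pieces)
```

Because the final index equals len(s), the slices cover the whole string with nothing left over.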
uniseg.wordbreak.words(s, tailor=None)
Iterate user-perceived words of s.
The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries
>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(u''))
[]
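The segments in the example above include spaces and punctuation as separate items. If only the word-like items are wanted, one possible approach is to keep the segments that contain at least one alphanumeric character. The sketch below assumes that filtering rule, uses a hypothetical helper named content_words, and works on a token list copied by hand from the output above.

```python
def content_words(tokens):
    """Keep tokens that contain at least one letter or digit.

    Hypothetical helper for illustration; not part of uniseg.
    """
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


# Tokens copied from the documented output of words(s).
tokens = ['The', ' ', 'quick', ' ', '(', '“', 'brown', '”', ')', ' ',
          'fox', ' ', 'can’t', ' ', 'jump', ' ', '32.3', ' ', 'feet',
          ',', ' ', 'right', '?']
kept = content_words(tokens)
print(kept)
```

This keeps 'can’t' and '32.3' intact because the segmentation has already grouped them as single items; the filter only decides which items to drop.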