You can now use Intl.Segmenter for locale-sensitive text segmentation to split a string into words, sentences, or graphemes.
Many non-Latin languages, such as Chinese and Japanese, don't use spaces to separate words. Therefore, using the JavaScript split() method on whitespace to split text into words will return incorrect results.
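To make the problem concrete, here is a minimal sketch comparing whitespace splitting on an English and a Japanese string. Both sample strings are hypothetical and chosen only for illustration.

```javascript
// Hypothetical sample strings, not taken from the original article.
const english = 'The quick brown fox';
const japanese = '吾輩は猫である。名前はまだ無い。';

// Splitting on whitespace works for text that separates words with spaces.
const englishWords = english.split(/\s+/);

// The Japanese text contains no whitespace, so the whole string
// comes back as a single "word".
const japaneseWords = japanese.split(/\s+/);

console.log(englishWords.length);  // 4
console.log(japaneseWords.length); // 1
```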
When creating a new Intl.Segmenter object with the Intl.Segmenter() constructor, pass in a locale and an options object that includes granularity, which can be "grapheme" (the default), "word", or "sentence". The following example creates a new Intl.Segmenter object for Japanese, splitting on words.
const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });
Calling the segment() method on an Intl.Segmenter object with a string of text returns an iterable:
// str holds the text to segment.
const segments = segmenter.segment(str);
console.table(Array.from(segments));
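Putting the pieces together, the following is a minimal, self-contained sketch of word segmentation for Japanese. The sample text is hypothetical, and the exact word boundaries depend on the locale data shipped with the JavaScript engine, so no specific output is shown.

```javascript
// Hypothetical sample text, used only for illustration.
const text = '吾輩は猫である。名前はまだ無い。';

const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });

// Each item is an object with `segment`, `index`, and `input` properties.
// With word granularity it also carries an `isWordLike` boolean.
const segments = Array.from(segmenter.segment(text));

// Keep only word-like segments, dropping punctuation and whitespace.
const words = segments
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(words);
```

Filtering on isWordLike is a design choice for this sketch: the segments always cover the entire input, so punctuation appears as its own segment unless you drop it.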
Read Using the Intl.Segmenter API on the Polypane blog for an excellent tutorial on how to use this feature.
International Text Segmentation with Intl.Segmenter in JavaScript has more examples, including how to use Intl.Segmenter with emoji.
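As a brief illustration of the emoji case and the other granularity values, here is a sketch that is not taken from either linked article. It counts graphemes in a family emoji (written with escape sequences so the zero-width joiners are visible) and splits a hypothetical English string into sentences.

```javascript
// A family emoji: three emoji joined by zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(graphemeSegmenter.segment(family), (s) => s.segment);

console.log(family.length);    // 8 (UTF-16 code units)
console.log(graphemes.length); // 1 (a single user-perceived character)

// Sentence granularity on a hypothetical sample string.
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const sentences = Array.from(
  sentenceSegmenter.segment('Hello world. How are you?'),
  (s) => s.segment
);

console.log(sentences.length); // 2
```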