📦 @bntk/tokenization

tokenizeToSentences()

function tokenizeToSentences(text): string[];

Defined in: sentence.ts:50

Tokenizes a Bangla text into an array of sentences.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The input Bangla text to tokenize. Can contain mixed content including URLs, emails, and special characters. |

Returns

string[]

An array of cleaned and tokenized sentences, with duplicates removed.

Description

This function performs the following steps:

  1. Splits text by line breaks
  2. Further splits by Bangla sentence separators
  3. Cleans each sentence by:
    • Removing text within parentheses, brackets, braces, and angle brackets
    • Removing URLs and email addresses
    • Removing HTML entities
    • Removing Latin characters
    • Keeping only Bangla characters, spaces, and essential punctuation
    • Normalizing spaces and punctuation
  4. Filters sentences based on the following criteria:
    • Must contain Bangla characters (Unicode range: \u0980-\u09FF)
    • Must have more than 3 words
    • Must not be empty
  5. Removes duplicates with a Set and returns the remaining sentences as an array
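
The steps above can be sketched roughly as follows. This is an illustration only, not the package's source. The names `cleanSentence` and `sketchTokenizeToSentences`, the separator set (।, ?, !), every regular expression, and the punctuation that is kept are assumptions. The word-count rule is read literally as "more than 3 words", and only whitespace is normalized.

```typescript
// Hypothetical sketch of the documented steps; NOT the @bntk/tokenization source.
const BANGLA_CHAR = /[\u0980-\u09FF]/;

// Step 3: clean one candidate sentence (all patterns are assumptions).
function cleanSentence(raw: string): string {
  return raw
    .replace(/\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|<[^>]*>/g, " ") // bracketed text
    .replace(/https?:\/\/\S+|www\.\S+/gi, " ")               // URLs
    .replace(/\S+@\S+\.\S+/g, " ")                            // email addresses
    .replace(/&[a-zA-Z]+;|&#\d+;/g, " ")                      // HTML entities
    .replace(/[^\u0980-\u09FF\s,;:'"-]/g, " ")                // drops Latin and other symbols
    .replace(/\s+/g, " ")                                     // normalize spaces
    .trim();
}

function sketchTokenizeToSentences(text: string): string[] {
  const seen = new Set<string>();                  // Step 5: a Set drops duplicates
  for (const line of text.split(/\r?\n/)) {        // Step 1: line breaks
    for (const part of line.split(/[।?!]/)) {      // Step 2: assumed separators
      const sentence = cleanSentence(part);
      const wordCount = sentence === "" ? 0 : sentence.split(" ").length;
      // Step 4: non-empty, contains Bangla, more than 3 words
      if (BANGLA_CHAR.test(sentence) && wordCount > 3) {
        seen.add(sentence);
      }
    }
  }
  return Array.from(seen);
}

const sample =
  "আমি আজ সকালে গান গাই। Visit https://example.com or email@example.com\n" +
  "আমি আজ সকালে গান গাই।";
console.log(sketchTokenizeToSentences(sample));
// Output: ["আমি আজ সকালে গান গাই"]
```

Because this sketch splits on separators before it removes URLs, a URL containing ? would be cut in two before cleaning. The library may handle that case differently.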

Examples

Basic usage with simple Bangla text:

const text = "আমি বাংলায় গান গাই। তুমি কি শুনবে?";
console.log(tokenizeToSentences(text));
// Output: ["আমি বাংলায় গান গাই", "তুমি কি শুনবে"]

Handling mixed content:

const mixedText =
  "আমি বাংলায় গান গাই। Visit https://example.com or email@example.com";
console.log(tokenizeToSentences(mixedText));
// Output: ["আমি বাংলায় গান গাই"]

Handling text with special characters:

const specialText =
  "বাংলা টেক্সট (ইংরেজি টেক্সট) [বন্ধনী টেক্সট] {কোঁকড়া টেক্সট}";
console.log(tokenizeToSentences(specialText));
// Output: ["বাংলা টেক্সট"]

tokenizeToWords()

function tokenizeToWords(text): string[];

Defined in: word.ts:57

Tokenizes a Bangla text string into an array of words.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The input Bangla text to tokenize. Can contain mixed content including punctuation and special characters. |

Returns

string[]

An array of cleaned and tokenized words, with empty strings removed.

Description

This function performs the following steps:

  1. Cleans the input text by:
    • Removing non-Bangla characters (keeping only Unicode range: \u0980-\u09FF)
    • Preserving essential punctuation marks (।, ,, ;, :, ', ", ?, !)
    • Preserving hyphens for compound words
  2. Splits the text by whitespace
  3. Further splits each segment by punctuation (excluding hyphens)
  4. Cleans each word by:
    • Removing trailing hyphens
    • Removing Bangla digits from start and end
    • Trimming whitespace
  5. Filters out empty strings
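
A minimal sketch of these steps is shown below. It is an illustration under stated assumptions, not the package's implementation. The name `sketchTokenizeToWords` and both regular expressions are hypothetical. The sketch also assumes that removed characters are deleted rather than replaced with spaces, and that "Bangla digits" means U+09E6 to U+09EF.

```typescript
// Hypothetical sketch of the documented steps; NOT the @bntk/tokenization source.
const PUNCTUATION = /[।,;:'"?!]/;                                // hyphen deliberately excluded
const EDGE_BANGLA_DIGITS = /^[\u09E6-\u09EF]+|[\u09E6-\u09EF]+$/g; // ০-৯ at start or end

function sketchTokenizeToWords(text: string): string[] {
  // Step 1: keep Bangla characters, whitespace, listed punctuation, and hyphens.
  const cleaned = text.replace(/[^\u0980-\u09FF\s।,;:'"?!-]/g, "");

  const words: string[] = [];
  for (const segment of cleaned.split(/\s+/)) {        // Step 2: split by whitespace
    for (const piece of segment.split(PUNCTUATION)) {  // Step 3: split by punctuation
      const word = piece
        .replace(/-+$/, "")                            // Step 4: trailing hyphens
        .replace(EDGE_BANGLA_DIGITS, "")               //         leading/trailing digits
        .trim();
      if (word !== "") {                               // Step 5: drop empty strings
        words.push(word);
      }
    }
  }
  return words;
}

console.log(sketchTokenizeToWords("১টি বই ২টি খাতা"));
// Output: ["টি", "বই", "টি", "খাতা"]
```

Hyphens are left out of the punctuation pattern so that compounds such as "আমি-তুমি" stay in one piece. Only a hyphen at the end of a piece is stripped.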

Examples

Basic usage with simple Bangla text:

const text = "আমি বাংলায় গান গাই";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই"]

Handling text with punctuation:

const text = "আমি, বাংলায় গান গাই। তুমি কি শুনবে?";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই", "তুমি", "কি", "শুনবে"]

Handling compound words with hyphens:

const text = "আমি-তুমি বাংলা-ভাষা শিখছি";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি-তুমি", "বাংলা-ভাষা", "শিখছি"]

Handling text with Bangla digits:

const text = "১টি বই ২টি খাতা";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["টি", "বই", "টি", "খাতা"]