📦 @bntk/tokenization
tokenizeToSentences()
function tokenizeToSentences(text): string[];
Defined in: sentence.ts:50
Tokenizes a Bangla text into an array of sentences.
Parameters
| Parameter | Type | Description |
|---|---|---|
| `text` | `string` | The input Bangla text to tokenize. Can contain mixed content including URLs, emails, and special characters. |
Returns
string[]
An array of cleaned and tokenized sentences, with duplicates removed.
Description
This function performs the following steps:
- Splits the text by line breaks
- Further splits each line by Bangla sentence separators
- Cleans each sentence by:
  - Removing text within parentheses, brackets, braces, and angle brackets
  - Removing URLs and email addresses
  - Removing HTML entities
  - Removing Latin characters
  - Keeping only Bangla characters, spaces, and essential punctuation
  - Normalizing spaces and punctuation
- Filters sentences based on the following criteria:
  - Must contain Bangla characters (Unicode range: \u0980-\u09FF)
  - Must have more than 3 words
  - Must not be empty
- Removes duplicates with a Set and returns the remaining sentences as an array
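The steps above can be sketched as a small standalone function. This is an illustrative approximation, not the library's source: the names `cleanSentenceSketch` and `sketchTokenizeToSentences`, the separator set (`।`, `?`, `!`), the punctuation kept during cleaning, and every regular expression are assumptions. The documented examples return sentences of two or three words, so the library's actual word threshold may differ from the listed criterion; the sketch therefore exposes it as the hypothetical `minWords` parameter.

```typescript
// Illustrative sketch only; names and regular expressions are assumptions,
// not the implementation in @bntk/tokenization.
const BANGLA_CHAR = /[\u0980-\u09FF]/;

function cleanSentenceSketch(raw: string): string {
  return raw
    .replace(/\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|<[^>]*>/g, " ") // bracketed text
    .replace(/https?:\/\/\S+/g, " ") // URLs (assumed http/https only)
    .replace(/\S+@\S+\.\S+/g, " ") // email addresses
    .replace(/&#?\w+;/g, " ") // HTML entities such as &amp; or &#39;
    .replace(/[^\u0980-\u09FF\s,;:-]/g, " ") // keep Bangla, spaces, assumed punctuation
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

function sketchTokenizeToSentences(text: string, minWords: number = 4): string[] {
  const kept: string[] = [];
  // Split by line breaks, then by the assumed separators (danda U+0964, ?, !).
  for (const line of text.split(/\r?\n/)) {
    for (const part of line.split(/[\u0964?!]/)) {
      const sentence = cleanSentenceSketch(part);
      const wordCount = sentence.length === 0 ? 0 : sentence.split(" ").length;
      // Keep non-empty sentences that contain Bangla and have enough words.
      if (sentence.length > 0 && BANGLA_CHAR.test(sentence) && wordCount >= minWords) {
        kept.push(sentence);
      }
    }
  }
  // A Set drops duplicates while preserving first-occurrence order.
  return Array.from(new Set(kept));
}
```

With the default `minWords` of 4 the sketch applies the "more than 3 words" rule literally, so a three-word sentence is dropped; passing a lower value keeps shorter sentences.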
Examples
Basic usage with simple Bangla text:
const text = "আমি বাংলায় গান গাই। তুমি কি শুনবে?";
console.log(tokenizeToSentences(text));
// Output: ["আমি বাংলায় গান গাই", "তুমি কি শুনবে"]
Handling mixed content:
const mixedText =
  "আমি বাংলায় গান গাই। Visit https://example.com or email@example.com";
console.log(tokenizeToSentences(mixedText));
// Output: ["আমি বাংলায় গান গাই"]
Handling text with special characters:
const specialText =
  "বাংলা টেক্সট (ইংরেজি টেক্সট) [বন্ধনী টেক্সট] {কোঁকড়া টেক্সট}";
console.log(tokenizeToSentences(specialText));
// Output: ["বাংলা টেক্সট"]
tokenizeToWords()
function tokenizeToWords(text): string[];
Defined in: word.ts:57
Tokenizes a Bangla text string into an array of words.
Parameters
| Parameter | Type | Description |
|---|---|---|
| `text` | `string` | The input Bangla text to tokenize. Can contain mixed content including punctuation and special characters. |
Returns
string[]
An array of cleaned and tokenized words, with empty strings removed.
Description
This function performs the following steps:
- Cleans the input text by:
  - Removing non-Bangla characters (keeping only Unicode range: \u0980-\u09FF)
  - Preserving essential punctuation marks (।, ,, ;, :, ', ", ?, !)
  - Preserving hyphens for compound words
- Splits the text by whitespace
- Further splits each segment by punctuation (excluding hyphens)
- Cleans each word by:
  - Removing trailing hyphens
  - Removing Bangla digits from start and end
  - Trimming whitespace
- Filters out empty strings
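As a rough illustration of the steps above, the following sketch reproduces the described behavior in a standalone function. It is an approximation under stated assumptions, not the library's source: the name `sketchTokenizeToWords` and the exact regular expressions are hypothetical.

```typescript
// Illustrative sketch only; the name and regular expressions are assumptions,
// not the implementation in @bntk/tokenization.
function sketchTokenizeToWords(text: string): string[] {
  // Keep Bangla characters, whitespace, the listed punctuation, and hyphens.
  // \u0964 is the danda; everything else becomes a space.
  const kept = text.replace(/[^\u0980-\u09FF\s\u0964,;:'"?!-]/g, " ");

  const words: string[] = [];
  // Split by whitespace, then by punctuation (hyphens are not split points).
  for (const segment of kept.split(/\s+/)) {
    for (const piece of segment.split(/[\u0964,;:'"?!]/)) {
      const word = piece
        .replace(/-+$/, "") // trailing hyphens
        .replace(/^[\u09E6-\u09EF]+|[\u09E6-\u09EF]+$/g, "") // Bangla digits at either end
        .trim();
      // Drop empty strings left behind by the splits and cleanup.
      if (word.length > 0) {
        words.push(word);
      }
    }
  }
  return words;
}
```

Because hyphens are excluded from the punctuation split, a compound such as "বাংলা-ভাষা" stays a single token, while a leading digit as in "১টি" is stripped to "টি".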
Examples
Basic usage with simple Bangla text:
const text = "আমি বাংলায় গান গাই";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই"]
Handling text with punctuation:
const text = "আমি, বাংলায় গান গাই। তুমি কি শুনবে?";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই", "তুমি", "কি", "শুনবে"]
Handling compound words with hyphens:
const text = "আমি-তুমি বাংলা-ভাষা শিখছি";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি-তুমি", "বাংলা-ভাষা", "শিখছি"]
Handling text with Bangla digits:
const text = "১টি বই ২টি খাতা";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["টি", "বই", "টি", "খাতা"]
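The two functions compose naturally: split a document into sentences, then split each sentence into words. The helper below is a minimal sketch of that pattern. The name `tokenizeDocument` and the `Tokenizer` type are hypothetical, and the tokenizers are passed in as arguments so the snippet stays self-contained; the import shown in the trailing comment assumes the package exposes both functions as named exports.

```typescript
// Hypothetical helper for illustration; not part of @bntk/tokenization.
type Tokenizer = (text: string) => string[];

function tokenizeDocument(
  text: string,
  toSentences: Tokenizer,
  toWords: Tokenizer
): string[][] {
  return toSentences(text)
    .map((sentence) => toWords(sentence)) // one word list per sentence
    .filter((words) => words.length > 0); // skip sentences that yield no words
}

// Possible usage with the library (named exports assumed):
// import { tokenizeToSentences, tokenizeToWords } from "@bntk/tokenization";
// const tokens = tokenizeDocument(text, tokenizeToSentences, tokenizeToWords);
```

Passing the tokenizers as parameters also makes it easy to substitute stubs in unit tests.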