tessl/npm-natural

Comprehensive natural language processing library with tokenization, stemming, classification, sentiment analysis, phonetics, distance algorithms, and WordNet integration.


Text Processing

Name: tessl/npm-natural
Author: tessl

Comprehensive text preprocessing tools including tokenization, stemming, and normalization for multiple languages. These are essential building blocks for preparing raw text data for natural language processing tasks.

Capabilities

Text Analysis

Sentence Analyzer

Analyzes a sentence and reports basic counts (words, characters, sentences) that can feed readability metrics.

/**
 * Sentence analyzer for readability and complexity metrics
 */
class SentenceAnalyzer {
  /** Analyze sentence structure and readability */
  static analyze(sentence: string): {
    numWords: number;
    numChars: number;
    averageWordsPerSentence: number;
    numSentences: number;
  };
}

Example usage:

const { SentenceAnalyzer } = require('natural');

const analysis = SentenceAnalyzer.analyze('This is a sample sentence.');
console.log(analysis);
// { numWords: 5, numChars: 25, averageWordsPerSentence: 5, numSentences: 1 }

Tokenization

Breaking text into individual tokens (words, punctuation, etc.) using various strategies.

Word Tokenizer

Basic word tokenization that splits on whitespace and punctuation.

/**
 * Basic word tokenizer
 */
class WordTokenizer {
  constructor();

  /** Tokenize text into words */
  tokenize(text: string): string[];
}
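
To make "splits on whitespace and punctuation" concrete, here is a minimal sketch in plain JavaScript. It does not call the library: `sketchWordTokenize` is a hypothetical helper, and the library's own tokenizer may treat digits, underscores, and non-ASCII letters differently.

```javascript
// Hypothetical helper, not part of natural.
// Splits on runs of characters that are not ASCII letters, digits, or
// underscores, then drops the empty strings left at the edges.
function sketchWordTokenize(text) {
  return text.split(/\W+/).filter((token) => token.length > 0);
}

console.log(sketchWordTokenize('Hello world, how are you?'));
// [ 'Hello', 'world', 'how', 'are', 'you' ]
```

Because `\W` only knows ASCII word characters, this sketch would break accented words apart, which is one reason language-specific tokenizers exist.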

Regular Expression Tokenizer

Flexible tokenizer using regular expressions for custom tokenization patterns.

/**
 * Regular expression-based tokenizer
 * @param options - Tokenization options including pattern
 */
class RegexpTokenizer {
  constructor(options?: {pattern?: RegExp, discardEmpty?: boolean});
  
  /** Tokenize text using regex pattern */
  tokenize(text: string): string[];
}

/**
 * Orthography-aware tokenizer
 */
class OrthographyTokenizer extends RegexpTokenizer {
  constructor();
}

/**
 * Word and punctuation tokenizer
 */
class WordPunctTokenizer extends RegexpTokenizer {
  constructor();
}
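
The effect of the `pattern` and `discardEmpty` options can be illustrated with a small stand-alone class. This is a sketch under the assumption that the pattern describes the gaps between tokens; `SketchRegexpTokenizer` is a hypothetical name and not the library's implementation.

```javascript
// Hypothetical gap-based regex tokenizer, written only for illustration.
class SketchRegexpTokenizer {
  constructor({ pattern = /\s+/, discardEmpty = true } = {}) {
    this.pattern = pattern;           // regex describing the separators
    this.discardEmpty = discardEmpty; // whether to drop empty tokens
  }

  tokenize(text) {
    const parts = text.split(this.pattern);
    return this.discardEmpty ? parts.filter((part) => part !== '') : parts;
  }
}

const keepEmpty = new SketchRegexpTokenizer({ pattern: /,/, discardEmpty: false });
console.log(keepEmpty.tokenize('a,,b')); // [ 'a', '', 'b' ]

const dropEmpty = new SketchRegexpTokenizer({ pattern: /,/ });
console.log(dropEmpty.tokenize('a,,b')); // [ 'a', 'b' ]
```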

Aggressive Tokenizers

Aggressive tokenizers split text on most non-word characters. Each language variant applies tokenization rules suited to that language.

/**
 * Base aggressive tokenizer
 */
class AggressiveTokenizer {
  constructor();
  tokenize(text: string): string[];
}

// Language-specific aggressive tokenizers
class AggressiveTokenizerNl extends AggressiveTokenizer {} // Dutch
class AggressiveTokenizerFr extends AggressiveTokenizer {} // French
class AggressiveTokenizerDe extends AggressiveTokenizer {} // German
class AggressiveTokenizerEs extends AggressiveTokenizer {} // Spanish
class AggressiveTokenizerIt extends AggressiveTokenizer {} // Italian
class AggressiveTokenizerRu extends AggressiveTokenizer {} // Russian
class AggressiveTokenizerPt extends AggressiveTokenizer {} // Portuguese
class AggressiveTokenizerNo extends AggressiveTokenizer {} // Norwegian
class AggressiveTokenizerSv extends AggressiveTokenizer {} // Swedish
class AggressiveTokenizerPl extends AggressiveTokenizer {} // Polish
class AggressiveTokenizerVi extends AggressiveTokenizer {} // Vietnamese
class AggressiveTokenizerFa extends AggressiveTokenizer {} // Persian/Farsi
class AggressiveTokenizerId extends AggressiveTokenizer {} // Indonesian
class AggressiveTokenizerHi extends AggressiveTokenizer {} // Hindi
class AggressiveTokenizerUk extends AggressiveTokenizer {} // Ukrainian

Other Tokenizers

/**
 * Case-preserving tokenizer
 */
class CaseTokenizer {
  constructor();
  tokenize(text: string): string[];
}

/**
 * Penn Treebank word tokenizer
 */
class TreebankWordTokenizer {
  constructor();
  tokenize(text: string): string[];
}

/**
 * Japanese tokenizer
 */
class TokenizerJa {
  constructor();
  tokenize(text: string): string[];
}

/**
 * Sentence tokenizer
 */
class SentenceTokenizer {
  constructor();
  tokenize(text: string): string[];
}

Usage Examples:

const natural = require('natural');

// Basic word tokenization
const wordTokenizer = new natural.WordTokenizer();
const tokens = wordTokenizer.tokenize('Hello world, how are you?');
console.log(tokens); // ['Hello', 'world', 'how', 'are', 'you']

// Regular expression tokenizer
const regexTokenizer = new natural.RegexpTokenizer({pattern: /\s+/, discardEmpty: true});
const regexTokens = regexTokenizer.tokenize('Hello   world');
console.log(regexTokens); // ['Hello', 'world']

// Aggressive tokenizer
const aggressive = new natural.AggressiveTokenizer();
const aggressiveTokens = aggressive.tokenize("Don't you think?");
console.log(aggressiveTokens); // ['Don', 't', 'you', 'think']

// Language-specific tokenizer
const frenchTokenizer = new natural.AggressiveTokenizerFr();
const frenchTokens = frenchTokenizer.tokenize("Bonjour, comment allez-vous?");

// Sentence tokenizer
const sentenceTokenizer = new natural.SentenceTokenizer();
const sentences = sentenceTokenizer.tokenize('Hello world. How are you? Fine, thanks.');
console.log(sentences); // ['Hello world.', 'How are you?', 'Fine, thanks.']
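
For intuition about sentence tokenization, the following sketch splits on terminal punctuation with a single regular expression. It is a deliberately naive stand-in, not the library's algorithm: `sketchSentenceSplit` is a hypothetical helper, it splits abbreviations such as "Dr." incorrectly, and it drops trailing text that has no closing punctuation.

```javascript
// Hypothetical helper: a sentence is a run of non-terminator characters
// followed by one or more of '.', '!' or '?'.
function sketchSentenceSplit(text) {
  const matches = text.match(/[^.!?]+[.!?]+/g) || [];
  return matches.map((sentence) => sentence.trim());
}

console.log(sketchSentenceSplit('Hello world. How are you? Fine, thanks.'));
// [ 'Hello world.', 'How are you?', 'Fine, thanks.' ]
```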

Stemming

Reducing words to a root form (stem), typically by stripping suffixes. The result is not always a dictionary word.

Porter Stemmer

The classic Porter stemming algorithm with support for multiple languages.

/**
 * Porter stemmer for English
 */
class PorterStemmer {
  /** Stem a single word */
  static stem(word: string): string;
  
  /** Stem an array of tokens */
  static stemTokens(tokens: string[]): string[];
}

// Language-specific Porter stemmers
class PorterStemmerFr { static stem(word: string): string; } // French
class PorterStemmerDe { static stem(word: string): string; } // German
class PorterStemmerEs { static stem(word: string): string; } // Spanish
class PorterStemmerIt { static stem(word: string): string; } // Italian
class PorterStemmerRu { static stem(word: string): string; } // Russian
class PorterStemmerPt { static stem(word: string): string; } // Portuguese
class PorterStemmerNo { static stem(word: string): string; } // Norwegian
class PorterStemmerSv { static stem(word: string): string; } // Swedish
class PorterStemmerNl { static stem(word: string): string; } // Dutch
class PorterStemmerFa { static stem(word: string): string; } // Persian/Farsi
class PorterStemmerUk { static stem(word: string): string; } // Ukrainian
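
Stems such as 'fli' for 'flies' follow from the rule-based nature of the algorithm. As an illustration, here is step 1a of the original Porter algorithm (plural endings) in isolation. `porterStep1a` is a hypothetical helper; the full algorithm applies several further steps, so this is not a complete stemmer and not the library's code.

```javascript
// Step 1a of the Porter algorithm, shown on its own.
// Rules are tried in order and only the first match is applied.
function porterStep1a(word) {
  if (word.endsWith('sses')) return word.slice(0, -2); // caresses -> caress
  if (word.endsWith('ies')) return word.slice(0, -2);  // flies -> fli
  if (word.endsWith('ss')) return word;                // caress -> caress
  if (word.endsWith('s')) return word.slice(0, -1);    // cats -> cat
  return word;
}

console.log(porterStep1a('flies'));    // 'fli'
console.log(porterStep1a('caresses')); // 'caress'
console.log(porterStep1a('cats'));     // 'cat'
```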

Other Stemmers

/**
 * Lancaster stemmer (more aggressive than Porter)
 */
class LancasterStemmer {
  static stem(word: string): string;
}

/**
 * Japanese stemmer
 */
class StemmerJa {
  static stem(word: string): string;
}

/**
 * Indonesian stemmer
 */
class StemmerId {
  static stem(word: string): string;
}

/**
 * French Carry stemmer
 */
class CarryStemmerFr {
  static stem(word: string): string;
}

/**
 * Token class for advanced stemming operations and morphological analysis
 * Provides detailed control over stemming processes with region-based operations
 */
class Token {
  constructor(string: string);
  
  /** Set vowels for this token language */
  usingVowels(vowels: string | string[]): Token;
  
  /** Mark a region in the token by index or callback */
  markRegion(region: string, args: number | any[], callback?: Function, context?: object): Token;
  
  /** Replace all instances of a string with another */
  replaceAll(find: string, replace: string): Token;
  
  /** Replace suffix if it exists within specified region */
  replaceSuffixInRegion(suffix: string | string[], replace: string, region: string): Token;
  
  /** Check if token has vowel at specific index */
  hasVowelAtIndex(index: number): boolean;
  
  /** Find next vowel index starting from position */
  nextVowelIndex(start: number): number;
  
  /** Find next consonant index starting from position */
  nextConsonantIndex(start: number): number;
  
  /** Check if token has specific suffix */
  hasSuffix(suffix: string): boolean;
  
  /** Check if token has suffix within specified region */
  hasSuffixInRegion(suffix: string, region: string): boolean;
  
  /** Get current token string */
  toString(): string;
  
  /** Token string (mutable) */
  string: string;
  
  /** Original token string (immutable) */
  original: string;
  
  /** Vowels definition for this token */
  vowels: string;
  
  /** Defined regions for morphological operations */
  regions: {[key: string]: number};
}
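
The idea behind region-based suffix replacement can be sketched without the library. The example below assumes the Snowball convention for region R1 (the part of the word after the first non-vowel that follows a vowel). Both functions are hypothetical helpers, and the `Token` class may compute and store regions differently.

```javascript
const SKETCH_VOWELS = 'aeiouy';

// Start index of R1: just past the first non-vowel that follows a vowel,
// or word.length when no such position exists.
function sketchR1Start(word) {
  for (let i = 1; i < word.length; i++) {
    const isVowel = SKETCH_VOWELS.includes(word[i]);
    const prevIsVowel = SKETCH_VOWELS.includes(word[i - 1]);
    if (!isVowel && prevIsVowel) return i + 1;
  }
  return word.length;
}

// Replace `suffix` only when it lies entirely inside the region.
function sketchReplaceSuffixInRegion(word, suffix, replacement, regionStart) {
  const suffixStart = word.length - suffix.length;
  if (word.endsWith(suffix) && suffixStart >= regionStart) {
    return word.slice(0, suffixStart) + replacement;
  }
  return word;
}

const r1 = sketchR1Start('beautiful'); // 5, so R1 is 'iful'
console.log(sketchReplaceSuffixInRegion('beautiful', 'ful', '', r1)); // 'beauti'

// In 'sing', R1 is only 'g', so the suffix 'ing' is outside it and is kept.
console.log(sketchReplaceSuffixInRegion('sing', 'ing', '', sketchR1Start('sing'))); // 'sing'
```

Restricting replacements to a region is what stops a stemmer from stripping a suffix that makes up most of a short word.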

Usage Examples:

const natural = require('natural');

// English Porter stemming
console.log(natural.PorterStemmer.stem('running')); // 'run'
console.log(natural.PorterStemmer.stem('flies')); // 'fli'

// Stem multiple tokens
const tokens = ['running', 'flies', 'dying', 'lying'];
const stemmed = natural.PorterStemmer.stemTokens(tokens);
console.log(stemmed); // ['run', 'fli', 'dy', 'ly']

// Lancaster stemmer (more aggressive)
console.log(natural.LancasterStemmer.stem('running')); // 'run'
console.log(natural.LancasterStemmer.stem('maximum')); // 'maxim'

// Language-specific stemming
console.log(natural.PorterStemmerFr.stem('courante')); // French stemming
console.log(natural.PorterStemmerDe.stem('laufende')); // German stemming

// Token-based suffix checks
const token = new natural.Token('running');
console.log(token.hasSuffix('ing')); // true
console.log(token.toString()); // 'running'

Normalization

Text normalization for cleaning and standardizing text data.

/**
 * Normalize array of tokens
 * @param tokens - Array of token strings
 * @returns Normalized token array
 */
function normalize(tokens: string[]): string[];

/**
 * Japanese text normalization
 * @param text - Japanese text to normalize
 * @returns Normalized Japanese text
 */
function normalizeJa(text: string): string;

/**
 * Norwegian text normalization (diacritic removal)
 * @param text - Norwegian text to normalize
 * @returns Text with diacritics removed
 */
function normalizeNo(text: string): string;

/**
 * Swedish text normalization
 * @param text - Swedish text to normalize
 * @returns Normalized Swedish text
 */
function normalizeSv(text: string): string;

/**
 * Remove diacritical marks from text
 * @param text - Text with diacritics
 * @returns Text without diacritics
 */
function removeDiacritics(text: string): string;

/**
 * Japanese character conversion utilities
 */
interface Converters {
  hiraganaToKatakana(text: string): string;
  katakanaToHiragana(text: string): string;
  romajiToHiragana(text: string): string;
  romajiToKatakana(text: string): string;
}
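
Diacritic removal can be approximated with standard Unicode normalization. The sketch below is an assumption about one common technique, not a description of how `removeDiacritics` is implemented; `sketchRemoveDiacritics` is a hypothetical helper.

```javascript
// Hypothetical helper: decompose characters (NFD), then strip the
// combining marks in the range U+0300..U+036F.
function sketchRemoveDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(sketchRemoveDiacritics('café naïve résumé')); // 'cafe naive resume'
```

Letters such as `ø` and `ð` have no canonical decomposition, so this approach leaves them unchanged. Handling them requires an explicit mapping table.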

Usage Examples:

const natural = require('natural');

// Basic normalization
const tokens = ['Hello', 'WORLD', 'Test'];
const normalized = natural.normalize(tokens);
console.log(normalized); // Normalized tokens

// Remove diacritics
const textWithDiacritics = 'café naïve résumé';
const clean = natural.removeDiacritics(textWithDiacritics);
console.log(clean); // 'cafe naive resume'

// Japanese normalization
const japaneseText = 'こんにちは世界';
const normalizedJa = natural.normalizeJa(japaneseText);

// Norwegian diacritic removal
const norwegianText = 'Hålløj verðen';
const normalizedNo = natural.normalizeNo(norwegianText);

// Japanese character conversion
const hiragana = 'こんにちは';
const katakana = natural.Converters.hiraganaToKatakana(hiragana);
console.log(katakana); // 'コンニチハ'
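
The hiragana-to-katakana conversion above works because the two Unicode blocks are laid out in parallel, 0x60 code points apart. A minimal sketch of that mapping follows; `sketchHiraganaToKatakana` is a hypothetical helper and the library's converter may cover additional characters.

```javascript
// Hypothetical helper: shift hiragana U+3041..U+3096 to the matching
// katakana U+30A1..U+30F6 and leave every other character untouched.
function sketchHiraganaToKatakana(text) {
  let out = '';
  for (const ch of text) {
    const code = ch.codePointAt(0);
    const isHiragana = code >= 0x3041 && code <= 0x3096;
    out += isHiragana ? String.fromCodePoint(code + 0x60) : ch;
  }
  return out;
}

console.log(sketchHiraganaToKatakana('こんにちは')); // 'コンニチハ'
```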

Inflection

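Word inflection for grammatical transformations: pluralizing and singularizing nouns, forming present-tense verbs, and producing ordinals from numbers.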

/**
 * English noun inflector (singular/plural)
 */
class NounInflector {
  /** Convert singular noun to plural */
  pluralize(noun: string): string;
  
  /** Convert plural noun to singular */
  singularize(noun: string): string;
}

/**
 * French noun inflector
 */
class NounInflectorFr {
  pluralize(noun: string): string;
  singularize(noun: string): string;
}

/**
 * Japanese noun inflector
 */
class NounInflectorJa {
  pluralize(noun: string): string;
}

/**
 * Present tense verb inflector
 */
class PresentVerbInflector {
  /** Convert verb to present tense form */
  present(verb: string): string;
}

/**
 * Count inflector for numbers
 */
class CountInflector {
  /** Get ordinal form of number */
  nth(number: number): string;
}

/**
 * French count inflector
 */
class CountInflectorFr {
  nth(number: number): string;
}
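
Noun inflectors are typically built from ordered suffix rules plus a list of irregular forms. The sketch below shows only three regular English rules to convey the idea. `sketchPluralize` is a hypothetical helper, it ignores irregular nouns such as 'child' or 'mouse', and it is not the rule set used by `NounInflector`.

```javascript
// Hypothetical helper covering three regular English plural rules.
function sketchPluralize(noun) {
  // Sibilant endings take 'es': box -> boxes, church -> churches.
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + 'es';
  // Consonant + 'y' becomes 'ies': city -> cities (but day -> days).
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + 'ies';
  // Default rule: append 's'.
  return noun + 's';
}

console.log(sketchPluralize('cat'));  // 'cats'
console.log(sketchPluralize('box'));  // 'boxes'
console.log(sketchPluralize('city')); // 'cities'
console.log(sketchPluralize('day'));  // 'days'
```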

Usage Examples:

const natural = require('natural');

// Noun inflection
const nounInflector = new natural.NounInflector();
console.log(nounInflector.pluralize('cat')); // 'cats'
console.log(nounInflector.singularize('cats')); // 'cat'

// Count inflection
const countInflector = new natural.CountInflector();
console.log(countInflector.nth(1)); // '1st'
console.log(countInflector.nth(2)); // '2nd'
console.log(countInflector.nth(3)); // '3rd'
console.log(countInflector.nth(21)); // '21st'
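
The ordinal outputs above follow the usual English rule: the suffix depends on the last digit, except that numbers ending in 11, 12, or 13 always take 'th'. A minimal sketch, assuming non-negative integers (`sketchNth` is a hypothetical helper, not the library's code):

```javascript
// Hypothetical helper: English ordinal suffix for a non-negative integer.
function sketchNth(n) {
  const lastTwo = n % 100;
  // 11th, 12th, 13th (and 111th, 212th, ...) are the exceptions.
  if (lastTwo >= 11 && lastTwo <= 13) return n + 'th';
  switch (n % 10) {
    case 1: return n + 'st';
    case 2: return n + 'nd';
    case 3: return n + 'rd';
    default: return n + 'th';
  }
}

console.log(sketchNth(1));   // '1st'
console.log(sketchNth(11));  // '11th'
console.log(sketchNth(21));  // '21st'
console.log(sketchNth(112)); // '112th'
```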