Build a language pattern analyzer that can identify patterns in text samples written in different European languages using statistical analysis techniques. The system should analyze byte-level patterns in encoded text and provide confidence scores for pattern matches.
You are building a text analysis tool that needs to identify language patterns from text samples that may be encoded in various single-byte character encodings (like ISO-8859 or Windows code pages). The tool should be efficient enough to process large volumes of text data.
Implement a language pattern analyzer with the following capabilities:
Text Normalization: Normalize text by converting uppercase letters to lowercase and replacing control characters with spaces. This should be done at the byte level.
N-gram Extraction: Extract 3-byte sequences (trigrams) from normalized text. Each n-gram should be treated as a 24-bit integer value for efficient processing.
Pattern Matching: Compare extracted n-grams against a predefined list of common language patterns. Use an efficient search algorithm that can handle sorted pattern lists of exactly 64 entries.
Confidence Scoring: Calculate a confidence score based on the ratio of matched patterns to total n-grams extracted. The score should be scaled appropriately (e.g., multiply by 300 and cap at 100).
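The four capabilities above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not a reference solution. The function names (`normalize`, `extractTrigrams`, `search64`, `scoreConfidence`, `analyze`) are hypothetical. The sketch assumes that only ASCII `A`-`Z` is lowercased (accented uppercase letters in ISO-8859 or Windows code pages would need a per-encoding table), that bytes below 0x20 plus 0x7F count as control characters, and that the score is truncated to an integer. The functions take a `Uint8Array`, so a Node.js `Buffer` can be passed directly.

```typescript
// Sketch only: all names below are illustrative assumptions.
interface AnalysisResult {
  matchCount: number;
  totalNgrams: number;
  confidence: number;
}

// Lowercase ASCII A-Z and replace control bytes with a space (0x20).
function normalize(input: Uint8Array): Uint8Array {
  const out = new Uint8Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const b = input[i];
    if (b >= 0x41 && b <= 0x5a) out[i] = b + 0x20; // 'A'-'Z' -> 'a'-'z'
    else if (b < 0x20 || b === 0x7f) out[i] = 0x20; // control byte -> space
    else out[i] = b;
  }
  return out;
}

// Pack every window of three consecutive bytes into one 24-bit integer.
function extractTrigrams(bytes: Uint8Array): number[] {
  const out: number[] = [];
  for (let i = 0; i + 2 < bytes.length; i++) {
    out.push((bytes[i] << 16) | (bytes[i + 1] << 8) | bytes[i + 2]);
  }
  return out;
}

// Fixed-step binary search over a sorted table of exactly 64 entries.
// Returns the index of `value`, or -1 if it is not present.
function search64(table: readonly number[], value: number): number {
  if (table.length !== 64) {
    throw new Error("expected exactly 64 sorted patterns");
  }
  let index = 0;
  for (let step = 32; step >= 1; step >>= 1) {
    if (table[index + step] <= value) index += step;
  }
  return table[index] === value ? index : -1;
}

// Ratio of matches to trigrams, scaled by 300 and capped at 100.
// Truncating to an integer is an assumption; the spec does not say.
function scoreConfidence(matchCount: number, totalNgrams: number): number {
  if (totalNgrams === 0) return 0;
  return Math.min(100, Math.floor((matchCount / totalNgrams) * 300));
}

function analyze(text: Uint8Array, patterns: readonly number[]): AnalysisResult {
  const trigrams = extractTrigrams(normalize(text));
  let matchCount = 0;
  for (const trigram of trigrams) {
    if (search64(patterns, trigram) >= 0) matchCount++;
  }
  const totalNgrams = trigrams.length;
  return {
    matchCount,
    totalNgrams,
    confidence: scoreConfidence(matchCount, totalNgrams),
  };
}
```

The fixed-step search relies on the table holding exactly 64 sorted entries: steps of 32, 16, 8, 4, 2, and 1 reach any index from 0 to 63 in six comparisons. With 15 matches out of 58 trigrams, `scoreConfidence(15, 58)` truncates 77.59 to 77.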
Input:
Output:
matchCount: Number of n-grams that matched the pattern list
totalNgrams: Total number of n-grams analyzed
confidence: Confidence score (0-100) indicating how well the text matches the patterns
Implement your solution in src/analyzer.ts and tests in src/analyzer.test.ts.
Input:
const text = Buffer.from("The quick brown fox jumps over the lazy dog. This is a test.");
const patterns = [0x206120, 0x206520, 0x206920, 0x206f20, ...]; // 64 common English trigrams
Expected Output:
{
matchCount: 15, // approximately
totalNgrams: 58,
confidence: 77 // approximately (15/58 * 300, capped at 100)
}
Description: Verify that the pattern matching uses binary search by testing with a large text sample (10 KB) and ensuring it completes within a reasonable time.
Input:
const largeText = Buffer.from("a".repeat(10000));
const patterns = [0x616161, ...]; // 64 sorted patterns including "aaa"
Expected Behavior:
Input:
const text = Buffer.from("Hello WORLD\n\tTest");
Expected Behavior:
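The expected behavior is not spelled out here. Under the normalization rule stated earlier (uppercase to lowercase, control characters to spaces), a byte-level pass over this input would behave roughly as in the sketch below. The helper name `normalizeBytes` is hypothetical, and the sketch assumes only ASCII `A`-`Z` and control bytes below 0x20 (plus 0x7F) are affected.

```typescript
// Hypothetical helper: byte-level normalization as described in the spec.
function normalizeBytes(input: Uint8Array): Uint8Array {
  return input.map((b) => {
    if (b >= 0x41 && b <= 0x5a) return b | 0x20; // 'A'-'Z' -> 'a'-'z'
    if (b < 0x20 || b === 0x7f) return 0x20; // control byte -> space
    return b;
  });
}

// Encode the sample as single bytes (every character here is ASCII).
const sample = "Hello WORLD\n\tTest";
const sampleBytes = new Uint8Array(sample.length);
for (let i = 0; i < sample.length; i++) {
  sampleBytes[i] = sample.charCodeAt(i);
}

// Decode the normalized bytes back to a string for display.
const normalizedBytes = normalizeBytes(sampleBytes);
let normalized = "";
for (let i = 0; i < normalizedBytes.length; i++) {
  normalized += String.fromCharCode(normalizedBytes[i]);
}

console.log(JSON.stringify(normalized)); // "hello world  test"
```

The newline and the tab each become one space, so two spaces separate "world" and "test", and the byte length stays at 17.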
Character encoding detection library that provides text analysis capabilities.