
Language Pattern Analyzer

Summary { .summary }

Build a language pattern analyzer that uses statistical techniques to identify patterns in text samples written in different European languages. The system should analyze byte-level patterns in encoded text and provide confidence scores for pattern matches.

Context { .context }

You are building a text analysis tool that needs to identify language patterns from text samples that may be encoded in various single-byte character encodings (like ISO-8859 or Windows code pages). The tool should be efficient enough to process large volumes of text data.

Requirements { .requirements }

Core Functionality

Implement a language pattern analyzer with the following capabilities:

  1. Text Normalization: Normalize text by converting uppercase letters to lowercase and replacing control characters with spaces. This should be done at the byte level.

  2. N-gram Extraction: Extract 3-byte sequences (trigrams) from normalized text. Each n-gram should be treated as a 24-bit integer value for efficient processing.

  3. Pattern Matching: Compare extracted n-grams against a predefined list of common language patterns. Use an efficient search algorithm that can handle sorted pattern lists of exactly 64 entries.

  4. Confidence Scoring: Calculate a confidence score from the ratio of matched patterns to total n-grams extracted. Multiply the ratio by 300 and cap the result at 100, yielding a score from 0 to 100.
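The four steps could fit together as in the sketch below. This is illustrative only: the function names are assumptions, not prescribed by the spec, and the sketch treats any byte below 0x20 as a control character (which covers `\n` and `\t`; whether 0x7F should also count is left open here).

```typescript
// Illustrative sketch only — names and the exact control-byte rule are assumptions.

// Step 1: byte-level normalization in a single pass.
function normalize(input: Uint8Array): Uint8Array {
  const out = new Uint8Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const b = input[i];
    if (b >= 0x41 && b <= 0x5a) out[i] = b | 0x20; // 'A'-'Z' -> 'a'-'z'
    else if (b < 0x20) out[i] = 0x20;              // control byte -> space
    else out[i] = b;
  }
  return out;
}

// Step 3: binary search over the sorted 64-entry pattern list, O(log n).
function containsPattern(sorted: number[], value: number): boolean {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] === value) return true;
    if (sorted[mid] < value) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

// Steps 2 and 4: pack each trigram into a 24-bit integer, count matches,
// then scale the match ratio by 300 and cap at 100.
function analyze(input: Uint8Array, patterns: number[]) {
  const text = normalize(input);
  const totalNgrams = Math.max(0, text.length - 2);
  let matchCount = 0;
  for (let i = 0; i + 2 < text.length; i++) {
    const ngram = (text[i] << 16) | (text[i + 1] << 8) | text[i + 2];
    if (containsPattern(patterns, ngram)) matchCount++;
  }
  const confidence =
    totalNgrams === 0 ? 0 : Math.min(100, Math.floor((matchCount / totalNgrams) * 300));
  return { matchCount, totalNgrams, confidence };
}
```

Note how the edge case of text shorter than 3 bytes falls out naturally: `totalNgrams` is clamped to 0 and the confidence is reported as 0 rather than producing a division by zero.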

Input and Output

Input:

  • A byte array (Buffer or Uint8Array) containing text in a single-byte encoding
  • A sorted array of 64 pattern values representing common trigrams for a specific language

Output:

  • An object containing:
    • matchCount: Number of n-grams that matched the pattern list
    • totalNgrams: Total number of n-grams analyzed
    • confidence: Confidence score (0-100) indicating how well the text matches the patterns
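In TypeScript terms, the output could be typed as follows. The interface name is an assumption for illustration; the field names come from the spec.

```typescript
// Hypothetical result type; field names follow the spec above.
interface AnalysisResult {
  matchCount: number;  // n-grams found in the pattern list
  totalNgrams: number; // n-grams examined
  confidence: number;  // 0-100, already scaled and capped
}
```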

Performance Requirements

  • The pattern search must use binary search to achieve O(log n) lookup time
  • The solution should handle text buffers of at least 10KB efficiently
  • The normalization step should be completed in a single pass

Test Cases { .test-cases }

Implement your solution in src/analyzer.ts and tests in src/analyzer.test.ts.

Test 1: English Text Pattern Matching { .test-case @test }

Input:

const text = Buffer.from("The quick brown fox jumps over the lazy dog. This is a test.");
const patterns = [0x206120, 0x206520, 0x206920, 0x206f20, ...]; // 64 common English trigrams

Expected Output:

{
  matchCount: 15, // approximately
  totalNgrams: 58,
  confidence: 77 // approximately (15/58 * 300, capped at 100)
}

Test 2: Binary Search Efficiency { .test-case @test }

Description: Verify that the pattern matching uses binary search by testing with a large text sample (10KB) and ensuring it completes within reasonable time.

Input:

const largeText = Buffer.from("a".repeat(10000));
const patterns = [0x616161, ...]; // 64 sorted patterns including "aaa"

Expected Behavior:

  • Function completes in less than 50ms
  • Returns valid confidence score

Test 3: Normalization Correctness { .test-case @test }

Input:

const text = Buffer.from("Hello WORLD\n\tTest");

Expected Behavior:

  • Text should be normalized to lowercase
  • Control characters (\n, \t) should become spaces
  • N-grams extracted should reflect normalized text (e.g., "hel", "ell", "llo", " wo", etc.)
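This expectation can be traced by hand with a short standalone sketch (assuming, as above, that "control character" means any byte below 0x20):

```typescript
// Normalize "Hello WORLD\n\tTest" at the byte level and inspect the trigrams.
const bytes = new TextEncoder().encode("Hello WORLD\n\tTest");
const norm = new Uint8Array(bytes.length);
for (let i = 0; i < bytes.length; i++) {
  const b = bytes[i];
  norm[i] = b >= 0x41 && b <= 0x5a ? b | 0x20 : b < 0x20 ? 0x20 : b;
}
const normText = new TextDecoder().decode(norm); // "hello world  test"
const trigrams: string[] = [];
for (let i = 0; i + 2 < norm.length; i++) trigrams.push(normText.slice(i, i + 3));
// trigrams begins "hel", "ell", "llo", "lo ", "o w", " wo", ...
```

Note that `\n` and `\t` each become a space, so the normalized text contains two consecutive spaces between "world" and "test".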

Dependencies { .dependencies }

chardet { .dependency }

Character encoding detection library that provides text analysis capabilities.

Notes { .notes }

  • Focus on correctness and efficiency of the binary search implementation
  • Consider edge cases like text shorter than 3 bytes
  • The 64-entry pattern list is a common size used in production encoding detection systems
  • All byte values should be treated as unsigned integers (0-255)