
Language Pattern Analyzer

Summary { .summary }

Build a language pattern analyzer that uses statistical techniques to identify patterns in text samples written in different European languages. The system should analyze byte-level patterns in encoded text and provide confidence scores for pattern matches.

Context { .context }

You are building a text analysis tool that needs to identify language patterns from text samples that may be encoded in various single-byte character encodings (like ISO-8859 or Windows code pages). The tool should be efficient enough to process large volumes of text data.

Requirements { .requirements }

Core Functionality

Implement a language pattern analyzer with the following capabilities:

  1. Text Normalization: Normalize text by converting uppercase letters to lowercase and replacing control characters with spaces. This should be done at the byte level.

  2. N-gram Extraction: Extract 3-byte sequences (trigrams) from normalized text. Each n-gram should be treated as a 24-bit integer value for efficient processing.

  3. Pattern Matching: Compare extracted n-grams against a predefined list of common language patterns. Use an efficient search algorithm that can handle sorted pattern lists of exactly 64 entries.

  4. Confidence Scoring: Calculate a confidence score from the ratio of matched patterns to total n-grams extracted. Multiply the ratio by 300 and cap the result at 100, yielding a score from 0 to 100.
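The four steps could fit together as in the sketch below. This is illustrative only: the function names are assumptions, not prescribed by the spec, and the sketch treats any byte below 0x20 as a control character (which covers `\n` and `\t`; whether 0x7F should also count is left open here).

```typescript
// Illustrative sketch only — names and the exact control-byte rule are assumptions.

// Step 1: byte-level normalization in a single pass.
function normalize(input: Uint8Array): Uint8Array {
  const out = new Uint8Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const b = input[i];
    if (b >= 0x41 && b <= 0x5a) out[i] = b | 0x20; // 'A'-'Z' -> 'a'-'z'
    else if (b < 0x20) out[i] = 0x20;              // control byte -> space
    else out[i] = b;
  }
  return out;
}

// Step 3: binary search over the sorted 64-entry pattern list, O(log n).
function containsPattern(sorted: number[], value: number): boolean {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] === value) return true;
    if (sorted[mid] < value) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

// Steps 2 and 4: pack each trigram into a 24-bit integer, count matches,
// then scale the match ratio by 300 and cap at 100.
function analyze(input: Uint8Array, patterns: number[]) {
  const text = normalize(input);
  const totalNgrams = Math.max(0, text.length - 2);
  let matchCount = 0;
  for (let i = 0; i + 2 < text.length; i++) {
    const ngram = (text[i] << 16) | (text[i + 1] << 8) | text[i + 2];
    if (containsPattern(patterns, ngram)) matchCount++;
  }
  const confidence =
    totalNgrams === 0 ? 0 : Math.min(100, Math.floor((matchCount / totalNgrams) * 300));
  return { matchCount, totalNgrams, confidence };
}
```

Note how the edge case of text shorter than 3 bytes falls out naturally: `totalNgrams` is clamped to 0 and the confidence is reported as 0 rather than producing a division by zero.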

Input and Output

Input:

  • A byte array (Buffer or Uint8Array) containing text in a single-byte encoding
  • A sorted array of 64 pattern values representing common trigrams for a specific language

Output:

  • An object containing:
    • matchCount: Number of n-grams that matched the pattern list
    • totalNgrams: Total number of n-grams analyzed
    • confidence: Confidence score (0-100) indicating how well the text matches the patterns
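In TypeScript terms, the output could be typed as follows. The interface name is an assumption for illustration; the field names come from the spec.

```typescript
// Hypothetical result type; field names follow the spec above.
interface AnalysisResult {
  matchCount: number;  // n-grams found in the pattern list
  totalNgrams: number; // n-grams examined
  confidence: number;  // 0-100, already scaled and capped
}
```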

Performance Requirements

  • The pattern search must use binary search to achieve O(log n) lookup time
  • The solution should handle text buffers of at least 10KB efficiently
  • The normalization step should be completed in a single pass

Test Cases { .test-cases }

Implement your solution in src/analyzer.ts and tests in src/analyzer.test.ts.

Test 1: English Text Pattern Matching { .test-case @test }

Input:

const text = Buffer.from("The quick brown fox jumps over the lazy dog. This is a test.");
const patterns = [0x206120, 0x206520, 0x206920, 0x206f20, ...]; // 64 common English trigrams

Expected Output:

{
  matchCount: 15, // approximately
  totalNgrams: 58,
  confidence: 77 // approximately (15/58 * 300, capped at 100)
}

Test 2: Binary Search Efficiency { .test-case @test }

Description: Verify that the pattern matching uses binary search by testing with a large text sample (10KB) and ensuring it completes within reasonable time.

Input:

const largeText = Buffer.from("a".repeat(10000));
const patterns = [0x616161, ...]; // 64 sorted patterns including "aaa"

Expected Behavior:

  • Function completes in less than 50ms
  • Returns valid confidence score

Test 3: Normalization Correctness { .test-case @test }

Input:

const text = Buffer.from("Hello WORLD\n\tTest");

Expected Behavior:

  • Text should be normalized to lowercase
  • Control characters (\n, \t) should become spaces
  • N-grams extracted should reflect normalized text (e.g., "hel", "ell", "llo", " wo", etc.)
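This expectation can be traced by hand with a short standalone sketch (assuming, as above, that "control character" means any byte below 0x20):

```typescript
// Normalize "Hello WORLD\n\tTest" at the byte level and inspect the trigrams.
const bytes = new TextEncoder().encode("Hello WORLD\n\tTest");
const norm = new Uint8Array(bytes.length);
for (let i = 0; i < bytes.length; i++) {
  const b = bytes[i];
  norm[i] = b >= 0x41 && b <= 0x5a ? b | 0x20 : b < 0x20 ? 0x20 : b;
}
const normText = new TextDecoder().decode(norm); // "hello world  test"
const trigrams: string[] = [];
for (let i = 0; i + 2 < norm.length; i++) trigrams.push(normText.slice(i, i + 3));
// trigrams begins "hel", "ell", "llo", "lo ", "o w", " wo", ...
```

Note that `\n` and `\t` each become a space, so the normalized text contains two consecutive spaces between "world" and "test".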

Dependencies { .dependencies }

chardet { .dependency }

Character encoding detection library that provides text analysis capabilities.

Notes { .notes }

  • Focus on correctness and efficiency of the binary search implementation
  • Consider edge cases like text shorter than 3 bytes
  • The 64-entry pattern list is a common size used in production encoding detection systems
  • All byte values should be treated as unsigned integers (0-255)