
tessl/maven-org-bytedeco--tesseract

JavaCPP Presets for Tesseract - Java wrapper library providing JNI bindings to the native Tesseract OCR library version 5.5.1, enabling optical character recognition capabilities in Java applications


Data Structures and Types

Supporting data types for progress monitoring, character information, Unicode handling, container classes, and callback functions used throughout the Tesseract API.

Capabilities

Progress Monitoring

Structure for monitoring OCR progress and implementing cancellation callbacks.

/**
 * Progress monitoring and cancellation structure for OCR operations
 * Provides real-time feedback and cancellation capability during recognition
 */
public class ETEXT_DESC extends Pointer {
    public ETEXT_DESC();
    
    // Progress information (read-only fields)
    public short count();              // Character count processed
    public short progress();           // Progress percentage (0-100)
    public byte more_to_come();        // More processing flag
    public byte ocr_alive();           // OCR engine alive flag  
    public byte err_code();            // Error code (0 = no error)
    
    // Callback functions
    public CANCEL_FUNC cancel();                    // Cancellation callback
    public PROGRESS_FUNC progress_callback();       // Progress callback
    public Pointer cancel_this();                   // Cancellation context
    
    // Character data access
    public EANYCODE_CHAR text(int i);              // Character data array
    
    // Deadline management
    public void set_deadline_msecs(int deadline_msecs); // Set processing deadline
    public boolean deadline_exceeded();                 // Check if deadline exceeded
}

Progress Monitoring Example:

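import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.leptonica.*;  // pixRead, pixDestroy
import static org.bytedeco.tesseract.global.tesseract.*;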

// Create progress monitor
ETEXT_DESC monitor = TessMonitorCreate();

// Set 30-second deadline
monitor.set_deadline_msecs(30000);

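TessBaseAPI api = new TessBaseAPI();
// Init returns 0 on success
if (api.Init(null, "eng") != 0) {
    throw new IllegalStateException("Could not initialize Tesseract");
}

PIX image = pixRead("large-document.png");
if (image == null) {
    throw new IllegalStateException("Could not read large-document.png");
}
api.SetImage(image);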

// Perform OCR with progress monitoring
int result = api.Recognize(monitor);

if (result == 0) {
    System.out.println("OCR completed successfully");
    System.out.println("Final progress: " + monitor.progress() + "%");
    System.out.println("Characters processed: " + monitor.count());
} else {
    System.err.println("OCR failed with error code: " + monitor.err_code());
    if (monitor.deadline_exceeded()) {
        System.err.println("Processing deadline exceeded");
    }
}

// Cleanup
TessMonitorDelete(monitor);
pixDestroy(image);
api.End();
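
The deadline methods used above can be modelled in plain Java to make their behaviour concrete. This is a minimal sketch, not the binding itself: `DeadlineModel` and its injectable clock are hypothetical. It assumes, based on the Tesseract C++ header, that non-positive deadlines are ignored and that a monitor with no deadline never reports it as exceeded.

```java
import java.util.function.LongSupplier;

/** Stand-in model of the monitor's deadline logic; not part of the bindings. */
final class DeadlineModel {
    private final LongSupplier clockMillis; // injectable clock so the logic is testable
    private boolean armed = false;          // false until a positive deadline is set
    private long endTimeMillis = 0;

    DeadlineModel(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Models set_deadline_msecs: only positive values arm the deadline (assumption). */
    void setDeadlineMsecs(int deadlineMsecs) {
        if (deadlineMsecs > 0) {
            endTimeMillis = clockMillis.getAsLong() + deadlineMsecs;
            armed = true;
        }
    }

    /** Models deadline_exceeded: false when no deadline is armed (assumption). */
    boolean deadlineExceeded() {
        return armed && clockMillis.getAsLong() > endTimeMillis;
    }

    public static void main(String[] args) {
        long[] now = {0};
        DeadlineModel model = new DeadlineModel(() -> now[0]);
        model.setDeadlineMsecs(30_000);
        now[0] = 29_999;
        System.out.println("before deadline: " + model.deadlineExceeded());
        now[0] = 30_001;
        System.out.println("after deadline:  " + model.deadlineExceeded());
    }
}
```

In real code the clock would be `System::currentTimeMillis`; the fake clock here only keeps the example deterministic.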

Character Information

Detailed character description with position, font, and formatting information.

/**
 * Single character description with position, font, and formatting information
 * Used in ETEXT_DESC for detailed character-level analysis
 */
public class EANYCODE_CHAR extends Pointer {
    public EANYCODE_CHAR();
    
    // Character identification
    public short char_code();          // UTF-8 character code
    
    // Position information  
    public short left();               // Left coordinate
    public short right();              // Right coordinate
    public short top();                // Top coordinate
    public short bottom();             // Bottom coordinate
    
    // Font and style information
    public short font_index();         // Font identifier
    public byte point_size();          // Font size in points
    public byte formatting();          // Formatting flags (bold, italic, etc.)
    
    // Recognition quality
    public byte confidence();          // Confidence score (0=perfect, 100=reject)
    public byte blanks();              // Number of spaces before character
}
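
Because `confidence()` runs from 0 (perfect) to 100 (reject), it is easy to read the wrong way round. The sketch below copies the accessor values into a plain Java object and derives a box size and a "higher is better" score. `CharBox` and its methods are hypothetical helpers, not part of the bindings, and `Math.abs` is used so that no assumption is made about the direction of the y axis.

```java
/** Hypothetical value type holding fields copied out of an EANYCODE_CHAR. */
final class CharBox {
    final int left, right, top, bottom, confidence;

    CharBox(int left, int right, int top, int bottom, int confidence) {
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
        this.confidence = confidence;
    }

    int width() {
        return right - left;
    }

    /** Absolute difference, so the result does not depend on the y-axis direction. */
    int height() {
        return Math.abs(top - bottom);
    }

    /** Converts the 0=perfect..100=reject scale to 0..100 where higher is better. */
    int certaintyPercent() {
        int clamped = Math.max(0, Math.min(100, confidence));
        return 100 - clamped;
    }

    public static void main(String[] args) {
        CharBox box = new CharBox(10, 30, 50, 20, 15);
        System.out.println(box.width() + " x " + box.height()
                + ", certainty " + box.certaintyPercent() + "%");
    }
}
```

With the bindings, an instance could be filled as `new CharBox(c.left(), c.right(), c.top(), c.bottom(), c.confidence())` for an `EANYCODE_CHAR c`.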

Unicode Character Handling

Unicode character representation supporting UTF-8 encoding and ligatures.

/**
 * Unicode character representation supporting UTF-8 encoding and ligatures
 * Handles complex Unicode characters and multi-byte sequences
 */
public class UNICHAR extends Pointer {
    public UNICHAR();
    public UNICHAR(String utf8_str, int len);
    public UNICHAR(int unicode);
    
    // Character access methods
    public int first_uni();            // Get first character as UCS-4
    public int utf8_len();             // Get UTF-8 byte length
    public BytePointer utf8();         // Get UTF-8 bytes (not null-terminated)
    public BytePointer utf8_str();     // Get terminated UTF-8 string (must delete)
    
    // Static utility methods
    public static int utf8_step(String utf8_str);                    // Get bytes in first character
    public static IntPointer UTF8ToUTF32(String utf8_str);           // Convert UTF-8 to UTF-32
    public static String UTF32ToUTF8(IntPointer str32);              // Convert UTF-32 to UTF-8
    
    // Nested iterator class for UTF-8 strings
    public static class const_iterator extends Pointer {
        public const_iterator increment();    // Step to next character
        public int multiply();                 // Get current UCS-4 value
        public int get_utf8(byte[] buf);       // Get UTF-8 bytes
        public boolean is_legal();             // Check if current position is legal UTF-8
    }
}

Unicode Processing Example:

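// Create UNICHAR from a UTF-8 string; the length is in bytes, not characters
// ("Hello " is 6 bytes and each of the two CJK characters is 3 bytes)
UNICHAR unichar = new UNICHAR("Hello 世界", 12);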

System.out.println("UTF-8 length: " + unichar.utf8_len());
System.out.println("First character UCS-4: " + unichar.first_uni());

// Get terminated UTF-8 string
BytePointer utf8_str = unichar.utf8_str();
System.out.println("UTF-8 string: " + utf8_str.getString());
utf8_str.deallocate(); // Must delete UTF-8 string

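// Iterating through a UTF-8 string character by character
String text = "Café";
// const_iterator is not meant to be constructed directly: in the C++ API it is
// obtained from UNICHAR::begin() and UNICHAR::end(). Check whether your version
// of the bindings exposes these factory methods before relying on the iterator.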
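
The length arguments and `utf8_step` both count bytes, so it helps to see how the byte count of a character follows from its first byte. The sketch below applies the standard UTF-8 lead-byte rule in plain Java, without the bindings. `Utf8Step` is a hypothetical name, and returning 0 for an invalid lead byte is an assumption about how the native function behaves.

```java
import java.nio.charset.StandardCharsets;

/** Plain-Java illustration of stepping through UTF-8 by lead byte. */
final class Utf8Step {
    /** Bytes in the sequence that starts with this lead byte, or 0 if it is not a lead byte. */
    static int step(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1; // ASCII
        if (b < 0xC0) return 0; // continuation byte, cannot start a character
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        if (b < 0xF8) return 4;
        return 0;               // not used by UTF-8
    }

    /** Counts characters by stepping over the UTF-8 bytes; -1 if a lead byte is invalid. */
    static int countCharacters(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int count = 0;
        for (int i = 0; i < bytes.length; ) {
            int n = step(bytes[i]);
            if (n == 0) return -1;
            i += n;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Hello " followed by two CJK characters, written with escapes
        // so the source file encoding does not matter
        String s = "Hello \u4e16\u754c";
        System.out.println("bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("characters: " + countCharacters(s));
    }
}
```

For this string the byte length is 12 while the character count is 8, which is why a byte length has to be computed from the encoded form rather than from `String.length()`.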

Container Classes

Vector containers for strings and other data types with Java-friendly interfaces.

/**
 * Vector container for strings (std::vector<std::string>)
 * Provides Java-friendly interface to C++ string vectors
 */
public class StringVector extends Pointer {
    public StringVector();                      // Empty vector
    public StringVector(long n);                // Vector with n elements
    public StringVector(String value);          // Single string vector
    public StringVector(String... array);       // Vector from string array
    
    // Size management
    public long size();                         // Get size
    public boolean empty();                     // Check if empty
    public void clear();                        // Clear all elements
    public void resize(long n);                 // Resize vector
    
    // Element access
    public BytePointer get(long i);             // Get element at index
    public StringVector put(long i, String value);  // Set element at index
    public BytePointer front();                 // Get first element
    public BytePointer back();                  // Get last element
    
    // Modification
    public StringVector push_back(String value); // Add element to end
    public BytePointer pop_back();              // Remove and return last element
    public StringVector put(String... array);   // Set from array
}

/**
 * Vector container for bytes (std::vector<char>)
 * Similar interface to StringVector for byte data
 */
public class ByteVector extends Pointer {
    public ByteVector();
    public ByteVector(long n);
    public long size();
    public boolean empty();
    public void clear();
    public byte get(long i);
    public ByteVector put(long i, byte value);
    public ByteVector push_back(byte value);
}

/**
 * Complex nested vector structure for LSTM timestep data
 * Used for neural network output analysis
 */
public class StringFloatPairVectorVector extends Pointer {
    public StringFloatPairVectorVector();
    public long size();
    // Additional methods for LSTM-specific data manipulation
}

Container Usage Examples:

// Create and populate string vector
StringVector languages = new StringVector();
languages.push_back("eng");
languages.push_back("fra");
languages.push_back("deu");

System.out.println("Languages count: " + languages.size());
for (int i = 0; i < languages.size(); i++) {
    System.out.println("Language " + i + ": " + languages.get(i).getString());
}

// Use with Tesseract API
TessBaseAPI api = new TessBaseAPI();
api.Init(null, "eng");

StringVector availableLanguages = new StringVector();
api.GetAvailableLanguagesAsVector(availableLanguages);

System.out.println("Available languages:");
for (int i = 0; i < availableLanguages.size(); i++) {
    System.out.println("  " + availableLanguages.get(i).getString());
}

api.End();
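
The index loops above can be wrapped once so the rest of an application works with ordinary Java collections. This is a small sketch under stated assumptions: `VectorCopy.toList` is a hypothetical helper that only needs a size and an index-based getter, so it runs here against a plain array instead of a real `StringVector`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

/** Hypothetical helper that copies an index-addressed container into a List. */
final class VectorCopy {
    static List<String> toList(long size, LongFunction<String> elementAt) {
        List<String> out = new ArrayList<>();
        for (long i = 0; i < size; i++) {
            out.add(elementAt.apply(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a native vector holding three language codes
        String[] backing = {"eng", "fra", "deu"};
        List<String> languages = toList(backing.length, i -> backing[(int) i]);
        System.out.println(languages);
    }
}
```

With the bindings, the call would look like `VectorCopy.toList(languages.size(), i -> languages.get(i).getString())`, which copies the strings out before the native vector is released.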

Callback Function Types

Function pointer interfaces for progress monitoring and cancellation.

/**
 * Callback for cancelling OCR operations
 * Return true to cancel processing
 */
public abstract class TessCancelFunc extends FunctionPointer {
    public abstract boolean call(Pointer cancel_this, int words);
}

/**
 * Progress monitoring callback  
 * Called periodically during OCR processing
 */
public abstract class TessProgressFunc extends FunctionPointer {
    public abstract boolean call(ETEXT_DESC ths, int left, int right, int top, int bottom);
}

/**
 * Dictionary validation function
 * Used for custom dictionary checking
 */
public abstract class DictFunc extends FunctionPointer {
    public abstract int call(Dict o, Pointer arg0, UNICHARSET arg1, int arg2, boolean arg3);
}

/**
 * Context-based probability calculation
 * Used in advanced recognition scenarios
 */
public abstract class ProbabilityInContextFunc extends FunctionPointer {
    public abstract double call(Dict o, String arg0, String arg1, int arg2, String arg3, int arg4);
}

/**
 * File reading callback for custom input sources
 * Allows custom file handling implementations
 */
public abstract class FileReader extends FunctionPointer {
    public abstract boolean call(String filename, ByteVector data);
}

// C++ API callback types (the types of the ETEXT_DESC callback fields)
public abstract class CANCEL_FUNC extends FunctionPointer { }
public abstract class PROGRESS_FUNC extends FunctionPointer { }
public abstract class PROGRESS_FUNC2 extends FunctionPointer { }

Custom Progress Callback Example:

// Create custom progress callback
TessProgressFunc progressCallback = new TessProgressFunc() {
    @Override
    public boolean call(ETEXT_DESC desc, int left, int right, int top, int bottom) {
        int progress = desc.progress();
        System.out.println("OCR Progress: " + progress + "% - Processing region (" + 
                          left + "," + top + ") to (" + right + "," + bottom + ")");
        
        // Assumption to verify: Tesseract's recognition loop appears to ignore this
        // return value; cancellation is decided by the cancel callback or the deadline
        return false;
    }
};

// Create cancellation callback
TessCancelFunc cancelCallback = new TessCancelFunc() {
    @Override
    public boolean call(Pointer cancel_this, int words) {
        System.out.println("Processed " + words + " words so far");
        // Return true to cancel processing
        return false; // Don't cancel
    }
};

// Use callbacks with monitor
ETEXT_DESC monitor = TessMonitorCreate();
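// Note: the callbacks still have to be registered on the monitor. Tesseract's C API
// declares TessMonitorSetCancelFunc and TessMonitorSetProgressFunc for this purpose;
// check that your version of the bindings exposes them.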
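
A cancel callback usually has to reflect a decision made on another thread, for example a user pressing a stop button. A thread-safe flag that the callback polls is one way to do this. The sketch below is a stand-in in plain Java: `CancelToken` is a hypothetical class, and its `shouldCancel` method only mimics the "return true to cancel" contract of `TessCancelFunc.call`, without the context pointer.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical thread-safe cancellation flag for use inside a cancel callback. */
final class CancelToken {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    /** Called from any thread that wants recognition to stop. */
    void requestCancel() {
        cancelled.set(true);
    }

    /** Same contract as the cancel callback: true means "stop processing". */
    boolean shouldCancel(int words) {
        return cancelled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CancelToken token = new CancelToken();
        System.out.println("before request: " + token.shouldCancel(10));

        // Simulate a request arriving from another thread
        Thread requester = new Thread(token::requestCancel);
        requester.start();
        requester.join();

        System.out.println("after request:  " + token.shouldCancel(10));
    }
}
```

Inside a `TessCancelFunc` subclass, the overridden `call` could simply return `token.shouldCancel(words)`.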

Internal and Opaque Classes

Classes representing internal Tesseract structures with limited exposed functionality.

// Core engine classes (opaque - limited functionality)
public class Tesseract extends Pointer { }              // Core Tesseract engine
public class ImageThresholder extends Pointer { }       // Image processing
public class OSResults extends Pointer { }              // Orientation/script detection results

// Dictionary and language model classes  
public class Dict extends Pointer { }                   // Dictionary management
public class Dawg extends Pointer { }                   // Directed Acyclic Word Graph
public class UNICHARSET extends Pointer { }             // Character set management

// Analysis and detection classes
public class EquationDetect extends Pointer { }         // Equation detection
public class ParagraphModel extends Pointer { }         // Paragraph modeling
public class BlamerBundle extends Pointer { }           // Training/debugging information

// Internal result structures
public class PAGE_RES extends Pointer { }               // Page results
public class PAGE_RES_IT extends Pointer { }            // Page results iterator
public class WERD extends Pointer { }                   // Word structure
public class WERD_RES extends Pointer { }               // Word results
public class BLOB_CHOICE_IT extends Pointer { }         // Blob choice iterator
public class C_BLOB_IT extends Pointer { }              // C blob iterator
public class BLOCK_LIST extends Pointer { }             // Block list structure

Constants and Enumerations

Important constants used throughout the API.

// Version Constants
public static final int TESSERACT_MAJOR_VERSION = 5;
public static final int TESSERACT_MINOR_VERSION = 5;
public static final int TESSERACT_MICRO_VERSION = 1;
public static final String TESSERACT_VERSION_STR = "5.5.1";

// Unicode Constants
public static final int UNICHAR_LEN = 30;              // Maximum UNICHAR length
public static final int INVALID_UNICHAR_ID = -1;       // Invalid character ID

// Script Direction Constants
public static final int DIR_NEUTRAL = 0;               // Neutral direction
public static final int DIR_LEFT_TO_RIGHT = 1;         // Left-to-right text
public static final int DIR_RIGHT_TO_LEFT = 2;         // Right-to-left text
public static final int DIR_MIX = 3;                   // Mixed directions

// Orientation Constants
public static final int ORIENTATION_PAGE_UP = 0;       // Page upright
public static final int ORIENTATION_PAGE_RIGHT = 1;    // Page rotated right
public static final int ORIENTATION_PAGE_DOWN = 2;     // Page upside down
public static final int ORIENTATION_PAGE_LEFT = 3;     // Page rotated left

// Writing Direction Constants
public static final int WRITING_DIRECTION_LEFT_TO_RIGHT = 0;
public static final int WRITING_DIRECTION_RIGHT_TO_LEFT = 1;
public static final int WRITING_DIRECTION_TOP_TO_BOTTOM = 2;

// Text Line Order Constants  
public static final int TEXTLINE_ORDER_LEFT_TO_RIGHT = 0;
public static final int TEXTLINE_ORDER_RIGHT_TO_LEFT = 1;
public static final int TEXTLINE_ORDER_TOP_TO_BOTTOM = 2;

// Paragraph Justification Constants
public static final int JUSTIFICATION_UNKNOWN = 0;
public static final int JUSTIFICATION_LEFT = 1;
public static final int JUSTIFICATION_CENTER = 2;
public static final int JUSTIFICATION_RIGHT = 3;
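
The three numeric version constants make it possible to guard code paths that depend on a minimum Tesseract version. This is a minimal sketch in plain Java: `VersionCheck` is a hypothetical helper, and the values are passed in as arguments so the example does not depend on the bindings.

```java
/** Hypothetical helper for comparing major.minor.micro version triples. */
final class VersionCheck {
    static String versionString(int major, int minor, int micro) {
        return major + "." + minor + "." + micro;
    }

    /** True if (major, minor, micro) is the same as or newer than the required triple. */
    static boolean atLeast(int major, int minor, int micro,
                           int reqMajor, int reqMinor, int reqMicro) {
        if (major != reqMajor) return major > reqMajor;
        if (minor != reqMinor) return minor > reqMinor;
        return micro >= reqMicro;
    }

    public static void main(String[] args) {
        // Values taken from the version constants listed above
        int major = 5, minor = 5, micro = 1;
        System.out.println(versionString(major, minor, micro));
        System.out.println(atLeast(major, minor, micro, 5, 3, 0));
    }
}
```

With the bindings, the first three arguments would be `TESSERACT_MAJOR_VERSION`, `TESSERACT_MINOR_VERSION` and `TESSERACT_MICRO_VERSION`.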

PolyBlock Type Constants

Constants for identifying different types of page regions.

// PolyBlock Type Constants for page layout analysis
public static final int PT_UNKNOWN = 0;             // Unknown region type
public static final int PT_FLOWING_TEXT = 1;        // Flowing text within column
public static final int PT_HEADING_TEXT = 2;        // Heading text spanning columns
public static final int PT_PULLOUT_TEXT = 3;        // Pull-out text region
public static final int PT_EQUATION = 4;            // Mathematical equation
public static final int PT_INLINE_EQUATION = 5;     // Inline equation
public static final int PT_TABLE = 6;               // Table region
public static final int PT_VERTICAL_TEXT = 7;       // Vertical text line
public static final int PT_CAPTION_TEXT = 8;        // Image caption text
public static final int PT_FLOWING_IMAGE = 9;       // Image within column
public static final int PT_HEADING_IMAGE = 10;      // Image spanning columns
public static final int PT_PULLOUT_IMAGE = 11;      // Pull-out image region
public static final int PT_HORZ_LINE = 12;          // Horizontal line
public static final int PT_VERT_LINE = 13;          // Vertical line
public static final int PT_NOISE = 14;              // Noise outside columns
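
Layout code rarely cares about all fifteen values individually; it usually asks whether a region is text, an image, or a ruling line. The sketch below groups the constants that way in plain Java. `PolyBlockKinds` is a hypothetical helper, and counting tables, captions and inline equations as text is an assumption modelled on the grouping used in Tesseract's C++ headers; adjust it to your own needs.

```java
/** Hypothetical grouping of PolyBlock type values into text, image and line regions. */
final class PolyBlockKinds {
    // Values copied from the constant listing above
    static final int PT_FLOWING_TEXT = 1, PT_HEADING_TEXT = 2, PT_PULLOUT_TEXT = 3;
    static final int PT_EQUATION = 4, PT_INLINE_EQUATION = 5, PT_TABLE = 6;
    static final int PT_VERTICAL_TEXT = 7, PT_CAPTION_TEXT = 8;
    static final int PT_FLOWING_IMAGE = 9, PT_HEADING_IMAGE = 10, PT_PULLOUT_IMAGE = 11;
    static final int PT_HORZ_LINE = 12, PT_VERT_LINE = 13, PT_NOISE = 14;

    static boolean isText(int type) {
        switch (type) {
            case PT_FLOWING_TEXT:
            case PT_HEADING_TEXT:
            case PT_PULLOUT_TEXT:
            case PT_INLINE_EQUATION:
            case PT_TABLE:
            case PT_VERTICAL_TEXT:
            case PT_CAPTION_TEXT:
                return true;
            default:
                return false;
        }
    }

    static boolean isImage(int type) {
        return type == PT_FLOWING_IMAGE || type == PT_HEADING_IMAGE || type == PT_PULLOUT_IMAGE;
    }

    static boolean isLine(int type) {
        return type == PT_HORZ_LINE || type == PT_VERT_LINE;
    }

    public static void main(String[] args) {
        System.out.println("table is text:  " + isText(PT_TABLE));
        System.out.println("noise is line:  " + isLine(PT_NOISE));
        System.out.println("image is image: " + isImage(PT_HEADING_IMAGE));
    }
}
```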

Memory Management Guidelines

Important Data Structure Guidelines:

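  • Call deallocate() on BytePointer objects returned by text methods such as GetUTF8Text() once the text has been copied into a Java String
  • Pair every TessMonitorCreate() with a TessMonitorDelete()
  • StringVector and other containers created with new are released by JavaCPP when the Java object is garbage collected; call close() or deallocate() to release them earlier
  • The BytePointer returned by the UNICHAR utf8_str() method requires a manual deallocate() call
  • Keep a strong Java reference to callback objects for as long as native code may invoke them, so they are not garbage collected while still registered
  • Check for null pointers before accessing data structure members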

Performance Considerations:

  • Reuse containers when possible to reduce memory allocation
  • Clear large containers when done to free memory promptly
  • Use appropriate container sizes to avoid frequent reallocations
  • Monitor memory usage when processing large batches of documents
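
JavaCPP's `Pointer` implements `AutoCloseable`, so try-with-resources is one way to make release order explicit. The sketch below shows only the ordering guarantee, using a hypothetical stand-in class `Handle` instead of real native objects: resources are closed in the reverse order of their declaration, after the body has run.

```java
import java.util.ArrayList;
import java.util.List;

/** Demonstrates try-with-resources release order with stand-in resources. */
final class CleanupOrder {
    /** Stand-in for a native resource; records its name when released. */
    static final class Handle implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Handle(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Handle api = new Handle("api", log);
             Handle image = new Handle("image", log)) {
            log.add("work"); // the body runs before any resource is closed
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Objects obtained from factory functions, such as a monitor from TessMonitorCreate(), still need their matching delete function; a `finally` block is the usual place for that call.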

Install with Tessl CLI

npx tessl i tessl/maven-org-bytedeco--tesseract
