tessl/maven-org-bytedeco--tesseract

JavaCPP Presets for Tesseract - Java wrapper library providing JNI bindings to the native Tesseract OCR library version 5.5.1, enabling optical character recognition capabilities in Java applications


Basic OCR Operations

Name: tessl/maven-org-bytedeco--tesseract
Author: tessl

Core text recognition functionality providing the primary interface for extracting text from images using the Tesseract OCR engine.

Capabilities

TessBaseAPI Class

The main entry point for Tesseract OCR operations, providing initialization, configuration, image processing, and text extraction capabilities.

/**
 * Main Tesseract OCR API class providing complete OCR functionality
 */
public class TessBaseAPI extends Pointer {
    public TessBaseAPI();
    
    // Initialization and cleanup
    public int Init(String datapath, String language);
    public int Init(String datapath, String language, int oem);
    public void InitForAnalysePage();
    public void End();
    
    // Image input
    public void SetImage(PIX pix);
    public void SetImage(byte[] imagedata, int width, int height, int bytes_per_pixel, int bytes_per_line);
    public void SetInputImage(PIX pix);
    public PIX GetInputImage();
    public void SetSourceResolution(int ppi);
    public void SetRectangle(int left, int top, int width, int height);
    
    // OCR processing
    public int Recognize(ETEXT_DESC monitor);
    public BytePointer TesseractRect(byte[] imagedata, int bytes_per_pixel, int bytes_per_line, 
                                     int left, int top, int width, int height);
    
    // Text output
    public BytePointer GetUTF8Text();
    public BytePointer GetHOCRText(int page_number);
    public BytePointer GetAltoText(int page_number);
    public BytePointer GetPAGEText(int page_number);
    public BytePointer GetTSVText(int page_number);
    public BytePointer GetBoxText(int page_number);
    public BytePointer GetLSTMBoxText(int page_number);
    public BytePointer GetUNLVText();
    
    // Analysis results
    public PageIterator AnalyseLayout();
    public ResultIterator GetIterator();
    public MutableIterator GetMutableIterator();
    public int MeanTextConf();
    public IntPointer AllWordConfidences();
    
    // Image processing results
    public PIX GetThresholdedImage();
    
    // Static utilities
    public static BytePointer Version();
    public static void ClearPersistentCache();
}

Basic OCR Example:

import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.leptonica.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicOcrExample {
    public static void main(String[] args) {
        // Initialize Tesseract; a null datapath uses the default tessdata location
        TessBaseAPI api = new TessBaseAPI();
        if (api.Init(null, "eng") != 0) {
            System.err.println("Could not initialize Tesseract.");
            System.exit(1);
        }

        // Load image using Leptonica; pixRead returns null if the file cannot be read
        PIX image = pixRead("document.png");
        if (image == null) {
            System.err.println("Could not read document.png.");
            api.End();
            System.exit(1);
        }
        api.SetImage(image);

        // Extract text (runs recognition if it has not been run yet)
        BytePointer text = api.GetUTF8Text();
        System.out.println("Extracted text: " + text.getString());

        // Get confidence score
        int confidence = api.MeanTextConf();
        System.out.println("Average confidence: " + confidence + "%");

        // Cleanup
        api.End();
        text.deallocate();
        pixDestroy(image);
    }
}

Initialization Methods

Initialize the Tesseract engine with language models and configuration.

/**
 * Initialize Tesseract with default OCR engine mode
 * @param datapath Path to tessdata directory (null for system default)
 * @param language Language code (e.g., "eng", "eng+fra", "chi_sim")
 * @return 0 on success, -1 on failure
 */
public int Init(String datapath, String language);

/**
 * Initialize Tesseract with specific OCR engine mode
 * @param datapath Path to tessdata directory (null for system default)  
 * @param language Language code
 * @param oem OCR Engine Mode (OEM_LSTM_ONLY, OEM_DEFAULT, etc.)
 * @return 0 on success, -1 on failure
 */
public int Init(String datapath, String language, int oem);

/**
 * Initialize only for layout analysis (faster than full OCR)
 */
public void InitForAnalysePage();

/**
 * Shutdown Tesseract and free resources
 */
public void End();
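
Init() fails when the tessdata directory does not contain a model for each requested language. The following is a minimal pre-flight sketch in plain Java, assuming that each language code in a specification such as "eng+fra" maps to a file named `<code>.traineddata` directly inside the tessdata directory. The class `TessdataCheck` and its method are hypothetical helpers, not part of the bindings.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight helper; not part of the Tesseract bindings. */
final class TessdataCheck {
    private TessdataCheck() {}

    /**
     * Returns the language codes from a spec such as "eng+fra" whose
     * {@code <code>.traineddata} file is absent from the given directory.
     */
    static List<String> missingLanguages(Path tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String code : languageSpec.split("\\+")) {
            String lang = code.trim();
            if (lang.isEmpty()) {
                continue; // tolerate stray separators such as "eng+"
            }
            if (!Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"))) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

Calling `TessdataCheck.missingLanguages(dir, "eng+fra")` before `Init(dir.toString(), "eng+fra")` lets the application report which model is missing instead of only seeing a -1 return value.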

Image Input Methods

Set the input image for OCR processing using various formats.

/**
 * Set image from Leptonica PIX structure (recommended)
 * @param pix Leptonica PIX image structure
 */
public void SetImage(PIX pix);

/**
 * Set image from raw image data
 * @param imagedata Raw image bytes
 * @param width Image width in pixels
 * @param height Image height in pixels
 * @param bytes_per_pixel Bytes per pixel (1, 3, or 4)
 * @param bytes_per_line Bytes per line (width * bytes_per_pixel + padding)
 */
public void SetImage(byte[] imagedata, int width, int height, int bytes_per_pixel, int bytes_per_line);

/**
 * Set source image resolution for better accuracy
 * @param ppi Pixels per inch (typical values: 200-300)
 */
public void SetSourceResolution(int ppi);

/**
 * Set rectangular region of interest for OCR
 * @param left Left coordinate
 * @param top Top coordinate  
 * @param width Width of region
 * @param height Height of region
 */
public void SetRectangle(int left, int top, int width, int height);
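
The raw-data overload of SetImage needs a byte array plus its layout parameters. As a sketch of how those values relate, the hypothetical `GrayBytes` class below packs a `java.awt.image.BufferedImage` into 8-bit grayscale rows with no padding, so bytes_per_pixel is 1 and bytes_per_line equals the width. The grayscale conversion uses an integer approximation of the Rec. 601 luma weights, which is an illustrative choice rather than anything Tesseract requires.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper; packs a BufferedImage into 8-bit grayscale rows. */
final class GrayBytes {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel = 1;
    final int bytesPerLine;

    GrayBytes(BufferedImage img) {
        width = img.getWidth();
        height = img.getHeight();
        bytesPerLine = width * bytesPerPixel; // rows are tightly packed, no padding
        data = new byte[bytesPerLine * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the Rec. 601 luma weights
                int gray = (299 * r + 587 * g + 114 * b) / 1000;
                data[y * bytesPerLine + x] = (byte) gray;
            }
        }
    }
}
```

With a `GrayBytes g`, the call would then look like `api.SetImage(g.data, g.width, g.height, g.bytesPerPixel, g.bytesPerLine)`, following the signature documented above.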

OCR Processing Methods

Perform the actual OCR recognition with optional progress monitoring.

/**
 * Perform OCR recognition with optional progress monitoring
 * @param monitor Progress monitor (can be null)
 * @return 0 on success, negative on failure
 */
public int Recognize(ETEXT_DESC monitor);

/**
 * One-shot OCR for rectangular region of raw image data
 * @param imagedata Raw image bytes
 * @param bytes_per_pixel Bytes per pixel
 * @param bytes_per_line Bytes per line
 * @param left Left coordinate of region
 * @param top Top coordinate of region
 * @param width Width of region
 * @param height Height of region
 * @return Recognized text as BytePointer (must deallocate)
 */
public BytePointer TesseractRect(byte[] imagedata, int bytes_per_pixel, int bytes_per_line,
                                 int left, int top, int width, int height);

Text Output Methods

Extract recognized text in various formats.

/**
 * Get recognized text as UTF-8 encoded string
 * @return Text as BytePointer (must call deallocate())
 */
public BytePointer GetUTF8Text();

/**
 * Get text in hOCR HTML format with position information
 * @param page_number Page number (0-based)
 * @return hOCR HTML as BytePointer (must deallocate)
 */
public BytePointer GetHOCRText(int page_number);

/**
 * Get text in ALTO XML format
 * @param page_number Page number (0-based)
 * @return ALTO XML as BytePointer (must deallocate)
 */
public BytePointer GetAltoText(int page_number);

/**
 * Get text in PAGE XML format
 * @param page_number Page number (0-based)
 * @return PAGE XML as BytePointer (must deallocate)
 */
public BytePointer GetPAGEText(int page_number);

/**
 * Get text in Tab Separated Values format
 * @param page_number Page number (0-based)
 * @return TSV data as BytePointer (must deallocate)
 */
public BytePointer GetTSVText(int page_number);

/**
 * Get character bounding boxes in training format
 * @param page_number Page number (0-based)
 * @return Box coordinates as BytePointer (must deallocate)
 */
public BytePointer GetBoxText(int page_number);

Multi-format Output Example:

// Requires java.nio.file.Files, java.nio.file.Paths and java.nio.charset.StandardCharsets

// Get plain text
BytePointer plainText = api.GetUTF8Text();
System.out.println("Plain text: " + plainText.getString());

// Get hOCR with position information (Files.write throws IOException)
BytePointer hocrText = api.GetHOCRText(0);
Files.write(Paths.get("output.hocr"), hocrText.getString().getBytes(StandardCharsets.UTF_8));

// Write a searchable PDF to output.pdf through a renderer.
// The second argument is the tessdata directory; the path below is installation-specific.
TessPDFRenderer pdfRenderer = new TessPDFRenderer("output", "/usr/share/tesseract-ocr/4.00/tessdata");
pdfRenderer.BeginDocument("OCR Results");
pdfRenderer.AddImage(api);
pdfRenderer.EndDocument();

// Cleanup
plainText.deallocate();
hocrText.deallocate();
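
The TSV format is the easiest of these outputs to post-process in code. The sketch below parses a TSV string, such as the one obtained from `GetTSVText(0).getString()`, into word records. It assumes the usual twelve Tesseract TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) with level 5 marking word rows. `TsvWord` is a hypothetical value type introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type for one word row of Tesseract TSV output. */
final class TsvWord {
    final int left, top, width, height;
    final float conf;
    final String text;

    TsvWord(int left, int top, int width, int height, float conf, String text) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
        this.conf = conf;
        this.text = text;
    }

    /** Extracts word-level rows (level 5) from a TSV string. */
    static List<TsvWord> parse(String tsv) {
        List<TsvWord> words = new ArrayList<>();
        for (String line : tsv.split("\\R")) {
            // Limit of 12 keeps any tab inside the text column intact
            String[] f = line.split("\t", 12);
            // Skip blank lines, an optional header row, and non-word levels
            if (f.length < 12 || !f[0].equals("5")) {
                continue;
            }
            words.add(new TsvWord(
                    Integer.parseInt(f[6]), Integer.parseInt(f[7]),
                    Integer.parseInt(f[8]), Integer.parseInt(f[9]),
                    Float.parseFloat(f[10]), f[11]));
        }
        return words;
    }
}
```

Each record carries the word's bounding box and confidence, which is usually enough for highlighting or filtering without walking a ResultIterator.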

Analysis Result Methods

Get confidence scores and detailed analysis results.

/**
 * Get average confidence score for all recognized text
 * @return Confidence percentage (0-100)
 */
public int MeanTextConf();

/**
 * Get confidence scores for all individual words
 * @return Array of confidence scores (0-100), terminated by -1 (must call deallocate())
 */
public IntPointer AllWordConfidences();

/**
 * Get layout analysis iterator (without OCR)
 * @return PageIterator for layout structure
 */
public PageIterator AnalyseLayout();

/**
 * Get OCR results iterator
 * @return ResultIterator for detailed OCR results
 */
public ResultIterator GetIterator();

/**
 * Get processed binary image used for OCR
 * @return PIX structure with thresholded image
 */
public PIX GetThresholdedImage();
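
The pointer returned by AllWordConfidences() carries no length. The sketch below assumes the convention described in the upstream Tesseract header, where the array ends with a -1 sentinel, and copies the values into a Java array. `WordConfidences` is a hypothetical helper. It takes an index accessor rather than an IntPointer so it stays independent of the bindings; with JavaCPP the accessor would be something like `i -> confidences.get(i)`.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical helper for reading a -1-terminated confidence array. */
final class WordConfidences {
    private WordConfidences() {}

    /**
     * Copies values from the accessor until the -1 sentinel is found.
     * maxWords is a safety bound in case the sentinel is missing.
     */
    static int[] readUntilSentinel(IntUnaryOperator at, int maxWords) {
        int n = 0;
        while (n < maxWords && at.applyAsInt(n) != -1) {
            n++;
        }
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = at.applyAsInt(i);
        }
        return out;
    }

    /** Counts words whose confidence is below the given threshold. */
    static int countBelow(int[] confidences, int threshold) {
        int count = 0;
        for (int c : confidences) {
            if (c < threshold) {
                count++;
            }
        }
        return count;
    }
}
```

A count of low-confidence words can be a more useful quality signal than MeanTextConf() alone, since a few badly recognized words barely move the average.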

Advanced Layout Analysis Methods

Extract detailed layout components including regions, textlines, strips, words, and connected components.

/**
 * Get page regions as bounding boxes and images
 * @param pixa Output parameter for region images
 * @return BOXA with region bounding boxes
 */
public BOXA GetRegions(PIXA pixa);

/**
 * Get textlines with detailed positioning information
 * @param raw_image If true, extract from original image instead of thresholded
 * @param raw_padding Padding pixels for raw image extraction  
 * @param pixa Output parameter for textline images
 * @param blockids Output parameter for block IDs of each line
 * @param paraids Output parameter for paragraph IDs within blocks
 * @return BOXA with textline bounding boxes
 */
public BOXA GetTextlines(boolean raw_image, int raw_padding, PIXA pixa, 
                        IntPointer blockids, IntPointer paraids);
public BOXA GetTextlines(PIXA pixa, IntPointer blockids);

/**
 * Get textlines and strips for non-rectangular regions
 * @param pixa Output parameter for strip images
 * @param blockids Output parameter for block IDs
 * @return BOXA with strip bounding boxes
 */
public BOXA GetStrips(PIXA pixa, IntPointer blockids);

/**
 * Get individual words as bounding boxes and images
 * @param pixa Output parameter for word images
 * @return BOXA with word bounding boxes
 */
public BOXA GetWords(PIXA pixa);

/**
 * Get connected components (individual character shapes)
 * @param pixa Output parameter for component images
 * @return BOXA with component bounding boxes
 */
public BOXA GetConnectedComponents(PIXA pixa);

/**
 * Get component images after layout analysis
 * @param level Page iterator level (block, paragraph, textline, word)
 * @param text_only If true, only return text components
 * @param raw_image If true, extract from original image
 * @param raw_padding Padding for raw image extraction
 * @param pixa Output parameter for component images
 * @param blockids Output parameter for block IDs
 * @param paraids Output parameter for paragraph IDs  
 * @return BOXA with component bounding boxes
 */
public BOXA GetComponentImages(int level, boolean text_only, boolean raw_image, 
                              int raw_padding, PIXA pixa, IntPointer blockids, 
                              IntPointer paraids);

Orientation and Script Detection

Detect page orientation and the dominant script so that pages can be rotated and processed correctly.

/**
 * Detect page orientation and script, filling an OSResults structure
 * @param results Output parameter for orientation and script results
 * @return True if detection was successful
 */
public boolean DetectOS(OSResults results);

/**
 * Detect page orientation and script, returning the best guess for each
 * @param orient_deg Output parameter for detected orientation in degrees (0, 90, 180, 270)
 * @param orient_conf Output parameter for orientation confidence
 * @param script_name Output parameter for the detected script name (e.g. "Latin")
 * @param script_conf Output parameter for script confidence
 * @return True if detection was successful
 */
public boolean DetectOrientationScript(IntPointer orient_deg, FloatPointer orient_conf,
                                       PointerPointer script_name, FloatPointer script_conf);

Adaptive Training Methods

Advanced functionality for improving recognition accuracy through adaptive training.

/**
 * Apply a known word to the adaptive classifier
 * The image must already be set (SetImage/SetRectangle) to show just this word
 * @param mode Page segmentation mode used for layout analysis (PSM_SINGLE_WORD or PSM_CIRCLE_WORD)
 * @param wordstr Space-delimited UTF-8 word (e.g. "w o r d") matching the image
 * @return True if adaptation was successful
 */
public boolean AdaptToWordStr(int mode, String wordstr);

/**
 * Check if a word is valid according to the current language model
 * @param word Word to validate
 * @return Non-zero if the word is considered valid, 0 otherwise
 */
public int IsValidWord(String word);

/**
 * Check if a character is valid in the current character set
 * @param utf8_character UTF-8 encoded character to check
 * @return True if character is valid
 */
public boolean IsValidCharacter(String utf8_character);

LSTM Advanced Methods

Access to LSTM neural network specific features and raw recognition data.

/**
 * Get raw LSTM timestep data for detailed analysis
 * @return Vector of symbol-confidence pairs for each timestep
 */
public StringFloatPairVectorVector GetRawLSTMTimesteps();

/**
 * Get best symbol choices from LSTM at each position
 * @return Vector of symbol-confidence pairs for best choices
 */
public StringFloatPairVectorVector GetBestLSTMSymbolChoices();

Static Utility Methods

Version information and cache management.

/**
 * Get Tesseract version string
 * @return Version string as BytePointer (do not deallocate)
 */
public static BytePointer Version();

/**
 * Clear internal caches to free memory
 */
public static void ClearPersistentCache();

Memory Management

Important: JavaCPP uses native memory management. Always:

Call deallocate() on BytePointer objects returned by text methods
Call End() on TessBaseAPI before program exit
Use pixDestroy() on PIX images when done
Check for null pointers before accessing results
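
One way to keep these cleanup calls from being skipped on early returns or exceptions is to register each one as soon as the resource is created. The sketch below is a plain-Java illustration of that idea. `CleanupScope` is a hypothetical helper, not part of JavaCPP, and it runs the registered actions in reverse order of registration when the try-with-resources block exits.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper; runs registered cleanup actions in reverse order. */
final class CleanupScope implements AutoCloseable {
    private final Deque<Runnable> actions = new ArrayDeque<>();

    /** Registers an action to run when the scope closes. */
    void defer(Runnable action) {
        actions.push(action);
    }

    @Override
    public void close() {
        RuntimeException first = null;
        while (!actions.isEmpty()) {
            try {
                actions.pop().run();
            } catch (RuntimeException e) {
                // Keep going so one failed cleanup does not leak the rest
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

In OCR code this might be used as `scope.defer(api::End)`, `scope.defer(() -> pixDestroy(image))` and `scope.defer(text::deallocate)` inside a `try (CleanupScope scope = new CleanupScope())` block, registering the actions in whatever order yields the teardown sequence you need.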

Error Handling

Initialization Errors: Init() returns 0 on success, -1 on failure
Recognition Errors: Recognize() returns negative values on failure
Memory Errors: Check for null results from getter methods
Resource Errors: Always call cleanup methods to prevent memory leaks
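
Because these failures are reported through return values rather than exceptions, it is easy to ignore them by accident. A minimal sketch of converting them into exceptions follows; `OcrStatus` and its methods are hypothetical helpers that treat any non-zero status as a failure.

```java
/** Hypothetical helper; turns Tesseract-style status codes into exceptions. */
final class OcrStatus {
    private OcrStatus() {}

    /** Throws if a call such as Init() or Recognize() reported a non-zero status. */
    static void requireOk(int code, String operation) {
        if (code != 0) {
            throw new IllegalStateException(operation + " failed with status " + code);
        }
    }

    /** Throws if a getter returned null instead of a result. */
    static <T> T requireResult(T result, String operation) {
        if (result == null) {
            throw new IllegalStateException(operation + " returned no result");
        }
        return result;
    }
}
```

Typical use would be `OcrStatus.requireOk(api.Init(null, "eng"), "Init")` and `OcrStatus.requireResult(api.GetUTF8Text(), "GetUTF8Text")`, so that a failed step stops the pipeline at the point where it happened.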

Install with Tessl CLI

npx tessl i tessl/maven-org-bytedeco--tesseract
