
tessl/maven-dev-langchain4j--langchain4j-chroma

LangChain4j integration for Chroma embedding store enabling storage, retrieval, and similarity search of vector embeddings with metadata filtering support for both API V1 and V2.


Add Operations

Adding embeddings to the Chroma vector store. In the examples below, store is an already-configured ChromaEmbeddingStore instance.

Single Embedding

With Auto-Generated ID

import dev.langchain4j.data.embedding.Embedding;

Embedding embedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});
String id = store.add(embedding);
// Returns: auto-generated UUID string

With Custom ID

Embedding embedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});
store.add("custom-id-123", embedding);
// Returns: void
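Custom IDs are often derived from the content itself, so that the same text always maps to the same ID. The following is a minimal sketch using only the JDK; the ContentIds class and its forText method are hypothetical helpers, not part of LangChain4j or Chroma.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper (not part of LangChain4j): derives a stable ID from text. */
final class ContentIds {

    private ContentIds() {}

    /** Returns a name-based (type 3) UUID string; equal text yields an equal ID. */
    static String forText(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }
}
```

A call such as store.add(ContentIds.forText(text), embedding) would then use the derived ID in place of a hand-written one.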

With Text Segment

import dev.langchain4j.data.segment.TextSegment;

Embedding embedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});
TextSegment segment = TextSegment.from("This is the document text");

String id = store.add(embedding, segment);
// Returns: auto-generated UUID string

With Text Segment and Metadata

import dev.langchain4j.data.document.Metadata;

Embedding embedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});

Metadata metadata = new Metadata()
    .put("author", "John Doe")
    .put("year", 2024)
    .put("category", "technology");

TextSegment segment = TextSegment.from("Document text", metadata);

String id = store.add(embedding, segment);

Batch Operations

Multiple Embeddings (Auto-Generated IDs)

import java.util.Arrays;
import java.util.List;

List<Embedding> embeddings = Arrays.asList(
    Embedding.from(new float[]{0.1f, 0.2f, 0.3f}),
    Embedding.from(new float[]{0.4f, 0.5f, 0.6f}),
    Embedding.from(new float[]{0.7f, 0.8f, 0.9f})
);

List<String> ids = store.addAll(embeddings);
// Returns: list of auto-generated IDs

With Custom IDs

List<String> ids = Arrays.asList("id1", "id2", "id3");

List<Embedding> embeddings = Arrays.asList(
    Embedding.from(new float[]{0.1f, 0.2f, 0.3f}),
    Embedding.from(new float[]{0.4f, 0.5f, 0.6f}),
    Embedding.from(new float[]{0.7f, 0.8f, 0.9f})
);

store.addAll(ids, embeddings, null);
// Returns: void

With Text Segments and Metadata

List<String> ids = Arrays.asList("doc1", "doc2", "doc3");

// emb1, emb2, emb3 are Embedding instances created earlier
List<Embedding> embeddings = Arrays.asList(emb1, emb2, emb3);

List<TextSegment> segments = Arrays.asList(
    TextSegment.from("First document", new Metadata().put("index", 1)),
    TextSegment.from("Second document", new Metadata().put("index", 2)),
    TextSegment.from("Third document", new Metadata().put("index", 3))
);

store.addAll(ids, embeddings, segments);
// Returns: void
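The three-argument form takes parallel lists: the ID, embedding, and segment at the same index describe one entry. A small pre-flight check can catch misaligned lists before any request is sent. This is only a sketch; ParallelLists and requireAligned are hypothetical names, and the check says nothing about what the store itself does with mismatched input.

```java
import java.util.List;

/** Hypothetical pre-flight check (not part of LangChain4j). */
final class ParallelLists {

    private ParallelLists() {}

    /**
     * Verifies that ids, embeddings, and segments line up index by index.
     * A null segments list is accepted, matching calls that pass null for segments.
     */
    static void requireAligned(List<String> ids, List<?> embeddings, List<?> segments) {
        if (ids.size() != embeddings.size()) {
            throw new IllegalArgumentException(
                "ids (" + ids.size() + ") and embeddings (" + embeddings.size() + ") differ in size");
        }
        if (segments != null && segments.size() != embeddings.size()) {
            throw new IllegalArgumentException(
                "segments (" + segments.size() + ") and embeddings (" + embeddings.size() + ") differ in size");
        }
    }
}
```

Calling ParallelLists.requireAligned(ids, embeddings, segments) immediately before store.addAll(ids, embeddings, segments) keeps the failure close to the code that built the lists.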

With Auto-Generated IDs and Segments

// emb1, emb2, emb3 are Embedding instances created earlier
List<Embedding> embeddings = Arrays.asList(emb1, emb2, emb3);

List<TextSegment> segments = Arrays.asList(
    TextSegment.from("First document"),
    TextSegment.from("Second document"),
    TextSegment.from("Third document")
);

List<String> ids = store.addAll(embeddings, segments);
// Returns: list of auto-generated IDs

Creating Embeddings

From Float Array

Embedding embedding = Embedding.from(new float[]{0.1f, 0.2f, 0.3f});

From List

List<Float> vector = Arrays.asList(0.1f, 0.2f, 0.3f);
Embedding embedding = Embedding.from(vector);

With Embedding Model

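import dev.langchain4j.model.embedding.EmbeddingModel;
// Recent LangChain4j releases ship this model in the onnx.allminilml6v2 package
// (langchain4j-embeddings-all-minilm-l6-v2 dependency); older releases used
// dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel.
import dev.langchain4j.model.embedding.onnx.allminilml6v2.AllMiniLmL6V2EmbeddingModel;

EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();
Embedding embedding = model.embed("text to embed").content();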

Performance Considerations

Batch vs Individual Adds

// INEFFICIENT: Multiple HTTP requests
for (Embedding embedding : embeddings) {
    store.add(embedding);  // N requests
}

// EFFICIENT: Single HTTP request
List<String> ids = store.addAll(embeddings);  // 1 request

Batch operations are significantly faster for multiple embeddings because they use a single HTTP request instead of N requests.

When to Use Each Method

Use add(embedding) when:

  • Adding a single embedding
  • You need the generated ID immediately after each add
  • Adding embeddings one at a time as they're generated

Use addAll(embeddings) when:

  • Adding multiple embeddings from a collection
  • Processing a batch of documents
  • Performance matters (always prefer for multiple items)
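When embeddings arrive one at a time but batch performance still matters, a small buffer can bridge the two styles by collecting items and handing them off in groups. The sketch below uses only the JDK; BatchBuffer is a hypothetical helper, not part of LangChain4j, and the sink is whatever batch call you choose to give it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of LangChain4j): buffers items and hands them off in batches. */
final class BatchBuffer<T> {

    private final int capacity;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int capacity, Consumer<List<T>> sink) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
        this.sink = sink;
    }

    /** Queues one item and flushes automatically once the buffer is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= capacity) {
            flush();
        }
    }

    /** Sends any queued items to the sink as one batch; call once more at the end. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink owns its list
        pending.clear();
    }
}
```

For example, new BatchBuffer<Embedding>(100, batch -> store.addAll(batch)) would issue one addAll call per 100 embeddings, plus a final call when flush() is invoked after the last item.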

Metadata Supported Types

Metadata metadata = new Metadata()
    .put("string_field", "value")           // String
    .put("int_field", 42)                   // Integer
    .put("long_field", 123456789L)          // Long
    .put("float_field", 3.14f)              // Float
    .put("double_field", 3.14159)           // Double
    .put("uuid_field", UUID.randomUUID());  // UUID

// NOT SUPPORTED: Boolean values. LangChain4j's Metadata has no put overload for boolean.
// metadata.put("bool_field", true);  // Rejected
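When metadata comes from an untyped source such as a parsed JSON object, it can help to check the values against the types listed above before building a Metadata instance. This is a minimal sketch with only JDK types; MetadataTypeCheck and unsupportedKeys are hypothetical names, and the supported set simply mirrors the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical pre-flight check (not part of LangChain4j). */
final class MetadataTypeCheck {

    // Mirrors the value types listed in this section.
    private static final Set<Class<?>> SUPPORTED = Set.of(
        String.class, Integer.class, Long.class, Float.class, Double.class, UUID.class);

    private MetadataTypeCheck() {}

    /** Returns the keys whose values are null or of a type outside the supported set. */
    static List<String> unsupportedKeys(Map<String, ?> values) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, ?> entry : values.entrySet()) {
            Object value = entry.getValue();
            if (value == null || !SUPPORTED.contains(value.getClass())) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }
}
```

An empty result means every value can be passed to one of the put calls shown above; a non-empty result names the keys to convert (for example, a boolean to the string "true") or drop.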

Common Patterns

Indexing Documents from List

import dev.langchain4j.model.embedding.EmbeddingModel;

List<String> documents = loadDocuments();
EmbeddingModel model = createEmbeddingModel();

List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> segments = new ArrayList<>();

for (String doc : documents) {
    embeddings.add(model.embed(doc).content());
    segments.add(TextSegment.from(doc));
}

List<String> ids = store.addAll(embeddings, segments);

Indexing with Metadata

// The id component is required because the loop below calls doc.id()
record Document(String id, String text, String author, int year, String category) {}

List<Document> documents = loadDocuments();

List<String> ids = new ArrayList<>();
List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> segments = new ArrayList<>();

for (Document doc : documents) {
    ids.add(doc.id());
    embeddings.add(model.embed(doc.text()).content());

    Metadata metadata = new Metadata()
        .put("author", doc.author())
        .put("year", doc.year())
        .put("category", doc.category());

    segments.add(TextSegment.from(doc.text(), metadata));
}

store.addAll(ids, embeddings, segments);

Chunked Batch Indexing

For very large datasets, process in chunks to avoid memory issues:

int batchSize = 100;
List<Document> allDocuments = loadLargeDataset();

for (int i = 0; i < allDocuments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allDocuments.size());
    List<Document> batch = allDocuments.subList(i, end);

    List<Embedding> embeddings = new ArrayList<>();
    List<TextSegment> segments = new ArrayList<>();

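    for (Document doc : batch) {
        embeddings.add(model.embed(doc.text()).content());

        // The Document record has no metadata() accessor, so build it from the fields
        Metadata metadata = new Metadata()
            .put("author", doc.author())
            .put("year", doc.year())
            .put("category", doc.category());

        segments.add(TextSegment.from(doc.text(), metadata));
    }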

    store.addAll(embeddings, segments);
    System.out.println("Indexed batch: " + (i/batchSize + 1));
}
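The index arithmetic in the loop above can be factored into a reusable helper that splits any list into fixed-size chunks. The sketch below uses only the JDK; Chunks and partition are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LangChain4j): splits a list into fixed-size chunks. */
final class Chunks {

    private Chunks() {}

    /**
     * Returns consecutive sublist views of at most {@code size} elements each.
     * The last chunk may be shorter; an empty input yields an empty result.
     */
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            chunks.add(items.subList(start, end)); // view, not a copy
        }
        return chunks;
    }
}
```

With this helper the outer loop becomes for (List<Document> batch : Chunks.partition(allDocuments, 100)). Because the chunks are views of the original list, the source list should not be modified while they are in use.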

Error Scenarios

Dimension Mismatch

All embeddings in a collection must have the same dimensions:

// First embedding: 3 dimensions
store.add(Embedding.from(new float[]{0.1f, 0.2f, 0.3f}));

// ERROR: Different dimensions (4)
store.add(Embedding.from(new float[]{0.1f, 0.2f, 0.3f, 0.4f}));
// Throws: ChromaException about dimension mismatch
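A mismatch can also be caught on the client before anything is sent, by comparing vector lengths against the dimension the collection already uses. The following is a sketch using plain float arrays; DimensionCheck and firstMismatch are hypothetical names, and the expected dimension is assumed to be known by the caller.

```java
import java.util.List;

/** Hypothetical pre-flight check (not part of LangChain4j). */
final class DimensionCheck {

    private DimensionCheck() {}

    /**
     * Returns the index of the first vector whose length differs from
     * {@code expected}, or -1 if every vector has the expected length.
     */
    static int firstMismatch(List<float[]> vectors, int expected) {
        for (int i = 0; i < vectors.size(); i++) {
            if (vectors.get(i).length != expected) {
                return i;
            }
        }
        return -1;
    }
}
```

The raw array of a LangChain4j Embedding is available through its vector() method, so a list of embeddings can be mapped to arrays and checked before calling addAll.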

Invalid Metadata Types

Metadata metadata = new Metadata();

// VALID
metadata.put("name", "value");
metadata.put("count", 42);
metadata.put("score", 3.14);

// INVALID: Boolean is not a supported metadata value type
// metadata.put("active", true);  // Rejected: Metadata has no put overload for boolean

Connection Errors

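try {
    String id = store.add(embedding);
} catch (RuntimeException e) {
    // store.add declares no checked exceptions, so HTTP timeouts can only arrive wrapped
    // as the cause of an unchecked exception; the wrapper type depends on the HTTP client.
    Throwable cause = e.getCause();
    if (cause instanceof java.net.http.HttpConnectTimeoutException) {
        System.err.println("Cannot connect to Chroma: " + cause.getMessage());
    } else if (cause instanceof java.net.http.HttpTimeoutException) {
        System.err.println("Add operation timed out: " + cause.getMessage());
    } else {
        throw e;
    }
}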
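Transient connection failures are commonly handled by retrying with a growing delay. The sketch below shows one way to do that with only the JDK; Retry and withBackoff are hypothetical names, and it assumes the failure surfaces as an unchecked exception.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of LangChain4j): retries an action with exponential backoff. */
final class Retry {

    private Retry() {}

    /**
     * Runs the action up to maxAttempts times, doubling the delay after each failure.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T withBackoff(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the final attempt
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("interrupted while waiting to retry", interrupted);
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

A call might look like String id = Retry.withBackoff(() -> store.add(embedding), 3, 200). Note that retrying an add with an auto-generated ID may store the item twice if an earlier attempt reached the server before timing out.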

API Reference

See: ChromaEmbeddingStore API for complete method signatures.
