Ingests unstructured and semi-structured documents into Neo4j as a knowledge graph. Use when chunking PDFs, HTML, plain text, or Markdown; extracting entities and relationships from text with an LLM (SimpleKGPipeline, neo4j-graphrag); loading JSON via apoc.load.json; building Document→Chunk→Entity graph structures; or connecting LangChain/LlamaIndex document loaders to Neo4j. Covers neo4j-graphrag SimpleKGPipeline, LLM Graph Builder web UI, entity resolution, chunking strategies, and graph schema design for RAG pipelines. Does NOT handle structured CSV/relational import — use neo4j-import-skill. Does NOT handle GraphRAG retrieval after ingestion — use neo4j-graphrag-skill. Does NOT handle vector index creation — use neo4j-vector-search-skill.
71
88%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
:Chunk nodes with embeddingsSimpleKGPipeline (neo4j-graphrag) programmaticallyapoc.load.jsonneo4j-import-skillneo4j-graphrag-skillneo4j-vector-search-skillneo4j-cypher-skill| Situation | Approach |
|---|---|
| No code; drag-and-drop UX wanted | LLM Graph Builder web UI |
| Programmatic pipeline; PDFs/text | SimpleKGPipeline (neo4j-graphrag) |
| JSON / REST API responses | apoc.load.json or Python + UNWIND |
| LangChain already in stack | Neo4jGraph + document loader |
| LlamaIndex already in stack | Neo4jQueryEngine / Neo4jVectorStore |
| Chunk-only (no entity extraction) | Manual chunking + MERGE pattern |
pip install neo4j-graphrag # includes SimpleKGPipeline
pip install neo4j-graphrag[openai] # + OpenAI LLM/embedder
pip install neo4j-graphrag[anthropic] # + Anthropic Claude
pip install neo4j-graphrag[google] # + Vertex AI / Gemini
# spaCy entity resolver (Python <= 3.13 only — unsupported on 3.14+):
pip install neo4j-graphrag[nlp]Requires: neo4j>=6.0.0, Python>=3.10, Neo4j>=5.18.1 (Aura>=5.18.0).
Schema controls what the LLM extracts. Define before pipeline construction.
# Option A — Simple string lists (LLM infers descriptions)
entities = ["Person", "Organization", "Location", "Product", "Event"]
relations = ["WORKS_AT", "LOCATED_IN", "KNOWS", "MENTIONS", "PART_OF"]
patterns = [
("Person", "WORKS_AT", "Organization"),
("Organization", "LOCATED_IN", "Location"),
("Person", "KNOWS", "Person"),
("Article", "MENTIONS", "Organization"),
]
# Option B — Rich schema (better extraction quality)
from neo4j_graphrag.experimental.components.schema import (
SchemaBuilder, SchemaEntity, SchemaRelation
)
schema = SchemaBuilder().create_schema_from_dict({
"entities": {
"Person": {"description": "A human individual", "properties": {"name": "str", "role": "str"}},
"Organization": {"description": "A company or institution", "properties": {"name": "str", "industry": "str"}},
},
"relations": {
"WORKS_AT": {"description": "Employment relationship"},
},
"patterns": [("Person", "WORKS_AT", "Organization")],
})
# Option C — Auto-extract schema from text (no constraints)
schema = "EXTRACTED" # LLM infers types; noisier output
schema = "FREE" # No schema guidance; most noiseUse Option B for production; Option A for prototyping; "EXTRACTED" only for exploration.
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
driver = GraphDatabase.driver(
"neo4j+s://xxxx.databases.neo4j.io",
auth=("neo4j", "password")
)
llm = OpenAILLM(
model_name="gpt-4o",
model_params={"temperature": 0, "response_format": {"type": "json_object"}},
)
embedder = OpenAIEmbeddings() # OPENAI_API_KEY from env
pipeline = SimpleKGPipeline(
llm=llm,
driver=driver,
embedder=embedder,
entities=entities, # from Step 1
relations=relations,
patterns=patterns,
from_file=True, # False → pass text= instead of file_path=
on_error="IGNORE", # RAISE to surface extraction failures
perform_entity_resolution=True,
neo4j_database="neo4j", # omit to use default
)LLM alternatives (same interface):
AnthropicLLM(model_name="claude-3-5-sonnet-20241022")VertexAILLM(model_name="gemini-1.5-pro-002")OllamaLLM(model_name="llama3") — local; no API key needed# From PDF or Markdown file:
result = asyncio.run(pipeline.run_async(
file_path="report.pdf",
document_metadata={"source": "Q4 report", "year": 2025},
))
# From raw text:
result = asyncio.run(pipeline.run_async(
text=document_text,
))
# Batch — process multiple files:
async def ingest_all(paths):
for p in paths:
await pipeline.run_async(file_path=str(p))
asyncio.run(ingest_all(list(pdf_dir.glob("*.pdf"))))document_metadata dict is stored as properties on the :Document node.
Default splitter: FixedSizeSplitter(chunk_size=300, chunk_overlap=50).
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
splitter = FixedSizeSplitter(
chunk_size=512, # tokens; 300–512 typical for GPT-4o
chunk_overlap=50, # ~10% of chunk_size; preserves boundary context
approximate=True, # respect sentence/word boundaries when possible
)
pipeline = SimpleKGPipeline(
...,
text_splitter=splitter,
)Chunking guidance:
| Document type | chunk_size | chunk_overlap |
|---|---|---|
| Dense technical text | 256–512 | 50–80 |
| Narrative / news articles | 512–1024 | 80–128 |
| Legal / financial docs | 256–384 | 40–64 |
Rule: chunk must fit within LLM context for extraction + within embedding model limits. GPT-4o: 128k context; text-embedding-3-small: 8191 tokens. Never set chunk_size > 2048.
Merge duplicate extracted entities after pipeline run.
from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver, # identical name → merge
FuzzyMatchResolver, # Levenshtein similarity; needs rapidfuzz
SpaCySemanticMatchResolver, # cosine similarity; needs neo4j-graphrag[nlp]
)
# Exact match (fastest; good baseline)
resolver = SinglePropertyExactMatchResolver(driver)
asyncio.run(resolver.run())
# Fuzzy match (handles typos / alternate spellings)
from neo4j_graphrag.experimental.components.resolver import FuzzyMatchResolver
resolver = FuzzyMatchResolver(driver, threshold=0.9)
asyncio.run(resolver.run())
# Scope resolution to specific labels only:
resolver = SinglePropertyExactMatchResolver(
driver,
filter_query="WHERE n:Organization OR n:Person",
)
asyncio.run(resolver.run())Run resolvers after ingestion, not inline — bulk merges are faster.
Pipeline always produces this lexical graph layer:
(:Document {id, fileName, status, ...metadata})
-[:HAS_CHUNK]->
(:Chunk {id, text, index, embedding, ...})
-[:NEXT_CHUNK]-> ← linked list for ordered traversal
(:Chunk {...})
(:Chunk)-[:FROM_DOCUMENT]->(:Document) ← back-pointerEntity extraction adds:
(:Chunk)-[:MENTIONS]->(:Person {name, ...})
(:Chunk)-[:MENTIONS]->(:Organization {name, ...})
(:Person)-[:WORKS_AT]->(:Organization)Verify after ingestion:
CYPHER 25
MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk)
RETURN d.fileName, count(c) AS chunks LIMIT 10;
MATCH (c:Chunk)-[:MENTIONS]->(e)
RETURN labels(e)[0] AS type, count(*) AS cnt ORDER BY cnt DESC LIMIT 20;Use when: non-developers need to ingest docs; rapid prototyping; no Python environment.
Hosted: https://llm-graph-builder.neo4jlabs.com/
Local (Docker):
git clone https://github.com/neo4j-labs/llm-graph-builder
cd llm-graph-builder
# Set OPENAI_API_KEY (or other provider keys) in .env
docker-compose up
# Opens at http://localhost:8080Supported sources: PDF, plain text, Markdown, images, web pages, YouTube transcripts, S3/GCS bucket uploads.
LLM providers: OpenAI, Gemini, Claude, Llama3, Diffbot, Qwen.
Limitations: best with long-form English text; poor on tabular data (use neo4j-import-skill for CSV/Excel); visual diagrams not extracted.
Use when source is JSON from REST APIs, S3, or file exports.
CYPHER 25
CALL apoc.load.json("https://example.com/articles.json") YIELD value
UNWIND value.articles AS article
CALL (article) {
MERGE (d:Document {id: article.id})
SET d.title = article.title, d.url = article.url, d.publishedAt = article.publishedAt
FOREACH (tag IN article.tags |
MERGE (t:Tag {name: tag})
MERGE (d)-[:HAS_TAG]->(t)
)
} IN TRANSACTIONS OF 1000 ROWSLocal file: apoc.load.json("file:///import/data.json"). File must be in $NEO4J_HOME/import/ or APOC allowlist configured.
Check APOC available: RETURN apoc.version(). APOC is included on all Aura tiers.
from langchain_community.graphs import Neo4jGraph
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase
graph = Neo4jGraph(
url="neo4j+s://xxxx.databases.neo4j.io",
username="neo4j",
password="password",
)
loader = PyPDFLoader("report.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
embedder = OpenAIEmbeddings()
driver = GraphDatabase.driver(url, auth=("neo4j", "password"))
for i, chunk in enumerate(chunks):
emb = embedder.embed_query(chunk.page_content)
driver.execute_query(
"""
MERGE (doc:Document {id: $doc_id})
SET doc.source = $source
CREATE (c:Chunk {id: $chunk_id, text: $text, embedding: $emb, index: $idx})
CREATE (doc)-[:HAS_CHUNK]->(c)
""",
doc_id=chunk.metadata.get("source", "unknown"),
source=chunk.metadata.get("source"),
chunk_id=f"chunk-{i}",
text=chunk.page_content,
emb=emb,
idx=i,
)For entity extraction with LangChain: use LLMGraphTransformer (from langchain_experimental.graph_transformers). Produces same :Document/:Chunk/entity pattern.
CYPHER 25
// Prevent duplicate documents
CREATE CONSTRAINT doc_id_unique IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;
// Prevent duplicate chunks
CREATE CONSTRAINT chunk_id_unique IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.id IS UNIQUE;
// Entity deduplication
CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT org_name_unique IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE;
// Vector index for chunk embeddings (adjust dims for your model)
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON c.embedding
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
// Poll until index ONLINE:
// SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE'Do not start ingestion until all indexes are ONLINE:
SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE';If rows returned: wait, then re-run. ONLINE = safe to ingest.
| Error | Cause | Fix |
|---|---|---|
| LLM extracts node types not in schema | Schema too loose or "EXTRACTED" mode | Define explicit entities + patterns; use Option B schema |
MissingEmbedderError | embedder= omitted | Always pass embedder= even if not doing vector search — pipeline stores embeddings on Chunk nodes |
| Zero entities extracted | LLM context overflow | Reduce chunk_size; switch to model with larger context |
| Duplicate entity nodes after ingestion | Entity resolution not run | Run SinglePropertyExactMatchResolver after bulk ingest |
apoc.load.json permission denied | APOC allowlist not configured | Add URL to apoc.import.file.enabled=true and dbms.security.allow_csv_import_from_file_urls=true |
| Chunking loses sentence mid-way | approximate=False (default) cuts at exact token count | Set approximate=True in FixedSizeSplitter |
chunk_size too large → LLM timeouts | Extraction prompt + chunk exceeds context | Keep chunk_size ≤ 512 for GPT-4o extraction; ≤ 2048 absolute max |
SpaCySemanticMatchResolver fails on Python 3.14 | spaCy not supported on 3.14+ | Use FuzzyMatchResolver or downgrade to Python 3.13 |
neo4j-driver package not found | Deprecated package name since 6.0 | Use neo4j package: pip install neo4j>=6.0.0 |
chunk_size within embedding model limit (≤2048; ≤512 for extraction)chunk_overlap set to 10–15% of chunk_sizeDocument→HAS_CHUNK→Chunk pattern used (enables graph traversal in retrieval)document_metadata populated with source identifierapoc.version() confirmed if using apoc.load.json.env has API keys; .env in .gitignoreMATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk) RETURN count(c)MATCH (c:Chunk)-[:MENTIONS]->(e) RETURN labels(e)[0], count(*)entities/relations/potential_schema deprecated since 1.7.1. Use schema=GraphSchema(...):
from neo4j_graphrag.experimental.components.schema import (
GraphSchema, NodeType, RelationshipType, PropertyType
)
schema = GraphSchema(
node_types=[
NodeType(label="Person", properties=[PropertyType(name="name", type="STRING")]),
NodeType(label="Organization", properties=[PropertyType(name="name", type="STRING")]),
],
relationship_types=[RelationshipType(label="WORKS_AT")],
patterns=[("Person", "WORKS_AT", "Organization")],
)
pipeline = SimpleKGPipeline(llm=llm, driver=driver, embedder=embedder, schema=schema)schema="FREE" (no guidance) or schema="EXTRACTED" (LLM infers) — exploration only, noisier output.
Override default lexical layer labels (keep defaults unless integrating with existing graph):
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig
# All fields have sensible defaults — only override what differs from your graph's conventions
config = LexicalGraphConfig(
document_node_label="Article", # default: "Document"
chunk_node_label="Passage", # default: "Chunk"
node_to_chunk_relationship_type="HAS_ENTITY", # default: "MENTIONS"
chunk_text_property="content", # default: "text"
)
pipeline = SimpleKGPipeline(..., lexical_graph_config=config)Default file_loader auto-dispatches by extension (.pdf→PdfLoader, .md→MarkdownLoader).
Supports fsspec URIs (s3://, gcs://). Subclass DataLoader for HTML/web/custom formats:
from neo4j_graphrag.experimental.components.data_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, LoadedDocument
class WebPageLoader(DataLoader):
async def run(self, filepath, metadata=None):
import httpx
text = httpx.get(filepath).text # strip HTML in real impl
return LoadedDocument(text=text,
document_info=DocumentInfo(path=filepath, metadata=metadata))
pipeline = SimpleKGPipeline(..., file_loader=WebPageLoader(), from_file=True)Chunking strategy by use-case and full resolver config: references/kg-construction.md.
Load on demand:
66ed0e1
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.