Use the semantic-consistency-auditor skill for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
ID: 212
Name: semantic-consistency-auditor
Description: Applies the BERTScore and COMET algorithms to evaluate the semantic consistency between AI-generated clinical notes and expert gold standards at the level of semantic entailment.
- `scripts/main.py`
- `references/` for task-specific guidance

See ## Prerequisites above for related details.
- Python: 3.10+ (repository baseline for current packaged skills)
- bert_score: version unspecified; declared in requirements.txt
- comet: version unspecified; declared in requirements.txt
- dataclasses: version unspecified; declared in requirements.txt (redundant on Python 3.7+, where dataclasses ships in the standard library)
- numpy: version unspecified; declared in requirements.txt
- torch: version unspecified; declared in requirements.txt
- yaml: version unspecified; declared in requirements.txt (the `yaml` module is provided by the PyPI package PyYAML)

See ## Usage above for related details.
```bash
cd "20260318/scientific-skills/Academic Writing/semantic-consistency-auditor"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Check the CONFIG block or documented parameters if the script uses fixed settings.
2. Run `python scripts/main.py` with the validated inputs.

See ## Workflow above for related details.
- `scripts/main.py`
- `references/` contains supporting rules, prompts, or checklists

Use this command to verify that the packaged script entry point can be parsed before deeper execution:

```bash
python -m py_compile scripts/main.py
```

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths:

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Semantic Consistency Auditor is a medical AI evaluation tool that assesses the semantic consistency between AI-generated clinical notes and expert-written gold standards at the semantic level. It is not limited to traditional string matching or bag-of-words models: it uses deep learning models to capture semantic entailment, so it can identify expressions with different wording but the same meaning.
BERTScore uses pre-trained BERT model contextual embeddings to calculate similarity between candidate text and reference text:
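For reference, the standard BERTScore definition greedily matches each token embedding to its most similar counterpart. With $x$ the reference token embeddings and $\hat{x}$ the candidate's (both pre-normalized, so the dot product is cosine similarity):

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top}\hat{x}_j,
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}}\, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

The `rescale_with_baseline` option in the configuration below linearly rescales these scores against a per-language baseline so they spread over a more readable range.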
COMET is a neural network-based evaluation metric originally used for machine translation evaluation, applicable to semantic entailment tasks:
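The skill's exact COMET invocation is not shown; the following is a sketch using the `unbabel-comet` package's documented API (`download_model`, `load_from_checkpoint`, `predict`). The helper name `build_comet_batch`, and reusing the gold text as the `src` segment for this monolingual, reference-based setting, are our assumptions:

```python
# Sketch of scoring with a reference-based COMET model via the
# `unbabel-comet` package. `build_comet_batch` and the reuse of the
# gold text as "src" are assumptions, not documented behavior.

def build_comet_batch(pairs):
    """Convert (ai_text, gold_text) pairs into COMET's input format:
    a list of dicts with "src", "mt", and "ref" keys."""
    return [{"src": gold, "mt": ai, "ref": gold} for ai, gold in pairs]

def score_with_comet(pairs, batch_size=8):
    """Run a COMET checkpoint over the pairs (downloads on first use)."""
    from comet import download_model, load_from_checkpoint
    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    output = model.predict(build_comet_batch(pairs), batch_size=batch_size)
    return output.scores  # one float per pair
```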
```bash
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# Or: venv\Scripts\activate  # Windows

# Install dependencies
pip install bert-score unbabel-comet transformers torch
```

Note: the PyPI package names are `bert-score` (not `bertscore`) and `unbabel-comet`; `comet-ml` is an unrelated experiment-tracking library.

Configure in `~/.openclaw/skills/semantic-consistency-auditor/config.yaml`:
```yaml
# BERTScore Configuration
bertscore:
  model: "microsoft/deberta-xlarge-mnli"  # Or "bert-base-chinese" for Chinese
  lang: "zh"  # Language code: zh, en, etc.
  rescale_with_baseline: true
  device: "auto"  # auto, cpu, cuda

# COMET Configuration
comet:
  model: "Unbabel/wmt22-comet-da"  # COMET model
  batch_size: 8
  device: "auto"

# Evaluation Thresholds
thresholds:
  bertscore_f1: 0.85
  comet_score: 0.75
  semantic_consistency: 0.80  # Comprehensive score threshold
```

```bash
# Evaluate a single case pair
python scripts/main.py \
  --ai-generated "Patient presented with fever for 3 days, highest temperature 39°C, accompanied by cough." \
  --gold-standard "Patient chief complaint of fever for 3 days, highest temperature 39°C, accompanied by cough symptoms." \
  --output results.json

# Batch evaluation from a JSON file
python scripts/main.py \
  --input-file batch_cases.json \
  --output results.json \
  --format detailed

# Use specific models
python scripts/main.py \
  --ai-generated "..." \
  --gold-standard "..." \
  --bert-model "bert-base-chinese" \
  --comet-model "Unbabel/wmt20-comet-da"
```

```python
from semantic_consistency_auditor import SemanticConsistencyAuditor

# Initialize evaluator
auditor = SemanticConsistencyAuditor(
    bert_model="microsoft/deberta-xlarge-mnli",
    comet_model="Unbabel/wmt22-comet-da",
    lang="zh"
)

# Evaluate a single case
result = auditor.evaluate(
    ai_text="Patient presented with fever for 3 days...",
    gold_text="Patient chief complaint of fever for 3 days..."
)
print(f"BERTScore F1: {result['bertscore']['f1']:.4f}")
print(f"COMET Score: {result['comet']['score']:.4f}")
print(f"Consistency: {result['consistency']:.4f}")
print(f"Passed: {result['passed']}")

# Batch evaluation
results = auditor.evaluate_batch([
    {"ai": "...", "gold": "..."},
    {"ai": "...", "gold": "..."}
])
```

For direct input, pass text through the `--ai-generated` and `--gold-standard` parameters.
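How `consistency` and `passed` are derived is not documented, but the sample outputs on this page match a plain average of BERTScore F1 and the COMET score (e.g. (0.9028 + 0.8234) / 2 = 0.8631) gated by the configured thresholds. A hypothetical sketch under that assumption:

```python
# Hypothetical reconstruction of the composite score; the equal
# 0.5/0.5 weighting is inferred from the sample outputs, not documented.

THRESHOLDS = {"bertscore_f1": 0.85, "comet_score": 0.75,
              "semantic_consistency": 0.80}

def consistency(bertscore_f1, comet_score, w_bert=0.5):
    """Weighted average of BERTScore F1 and the COMET score."""
    return w_bert * bertscore_f1 + (1.0 - w_bert) * comet_score

def passed(bertscore_f1, comet_score, thresholds=THRESHOLDS):
    """A case passes only if each metric and the composite
    clear their configured thresholds."""
    return (bertscore_f1 >= thresholds["bertscore_f1"]
            and comet_score >= thresholds["comet_score"]
            and consistency(bertscore_f1, comet_score)
                >= thresholds["semantic_consistency"])
```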
Batch input file format (`batch_cases.json`):

```json
[
  {
    "case_id": "CASE001",
    "ai_generated": "Patient presented with fever for 3 days, highest temperature 39°C, accompanied by cough.",
    "gold_standard": "Patient chief complaint of fever for 3 days, highest temperature 39°C, accompanied by cough symptoms.",
    "metadata": {
      "department": "Respiratory",
      "disease_type": "Upper respiratory infection"
    }
  },
  {
    "case_id": "CASE002",
    "ai_generated": "...",
    "gold_standard": "..."
  }
]
```

Summary output:

```json
{
  "overall": {
    "total_cases": 100,
    "passed_cases": 85,
    "pass_rate": 0.85,
    "avg_bertscore_f1": 0.8923,
    "avg_comet_score": 0.8234,
    "avg_consistency": 0.8579
  },
  "thresholds": {
    "bertscore_f1": 0.85,
    "comet_score": 0.75,
    "semantic_consistency": 0.80
  }
}
```

Detailed output (`--format detailed`):

```json
{
  "cases": [
    {
      "case_id": "CASE001",
      "ai_generated": "Patient presented with fever for 3 days...",
      "gold_standard": "Patient chief complaint of fever for 3 days...",
      "metrics": {
        "bertscore": {
          "precision": 0.9123,
          "recall": 0.8934,
          "f1": 0.9028
        },
        "comet": {
          "score": 0.8234,
          "system_score": 0.8156
        },
        "semantic_consistency": 0.8631
      },
      "passed": true,
      "details": {
        "semantic_gaps": [],
        "matched_concepts": ["fever for 3 days", "temperature 39°C", "cough"]
      }
    }
  ],
  "summary": { ... }
}
```

If `scripts/main.py` fails, report the failure point, summarize what can still be completed safely, and provide a manual fallback.

```bash
# Python dependencies
pip install -r requirements.txt
```

Every final response should make these items explicit when they are relevant:
This skill accepts requests that match the documented purpose of semantic-consistency-auditor and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
semantic-consistency-auditor only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
Use the following fixed structure for non-trivial requests:
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.