unstructured-medical-text-miner

Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic and treatment details

Unstructured Medical Text Miner (ID: 213)

Overview

Mine the long-overlooked free-text data in MIMIC-IV: extract diagnostic logic, order details, and progress-note content from unstructured clinical text.

Purpose

The MIMIC-IV database contains large amounts of structured data (vital signs, laboratory results, etc.), but its true clinical value is often hidden in unstructured text:

  • Diagnostic reasoning chains in discharge summaries
  • Subtle finding descriptions in imaging reports
  • Treatment decision logic in progress notes
  • Personalized medication considerations in orders

This Skill provides a complete text mining toolchain to transform raw medical text into analyzable structured insights.

Features

1. Text Extraction

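  • NOTEEVENTS: Extract clinical notes from the MIMIC-IV-Note module (NOTEEVENTS is the MIMIC-III table name; MIMIC-IV-Note provides the discharge and radiology tables)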
  • Radiology Reports: Extract imaging diagnostic text
  • ECG Reports: Parse ECG interpretation text
  • Discharge Summaries: Extract complete diagnostic and treatment course

2. Information Extraction

  • Entity Recognition: Diseases, symptoms, medications, procedures, anatomical sites
  • Relation Extraction: Medication-disease treatment relationships, symptom-disease diagnostic relationships
  • Timeline Extraction: Event occurrence times, disease progression sequence
  • Negation Detection: Identify negated clinical findings (e.g., "no fever")
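Negation detection of the kind listed above can be approximated with a NegEx-style trigger window. The sketch below is a pure-Python illustration only, not the skill's implementation (the dependency list suggests negspacy is used for this). The trigger list, the window size, and the function name `is_negated` are all assumptions.

```python
import re

# Hypothetical, deliberately small trigger list; real NegEx lists are much longer.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for", "no evidence of")
WINDOW = 5  # assumed: a trigger negates findings within the next 5 tokens


def is_negated(sentence: str, finding: str) -> bool:
    """Return True if `finding` appears shortly after a negation trigger."""
    text = sentence.lower()
    pos = text.find(finding.lower())
    if pos == -1:
        return False
    # Alphabetic tokens that precede the finding in the same sentence
    preceding = re.findall(r"[a-z]+", text[:pos])
    window = " ".join(preceding[-WINDOW:])
    # Word boundaries keep "no" from matching inside words such as "known"
    return any(re.search(rf"\b{re.escape(t)}\b", window) for t in NEGATION_TRIGGERS)


print(is_negated("Patient has no fever or chills.", "fever"))
print(is_negated("Patient reports fever.", "fever"))
```

This sketch ignores scope terminators such as "but", so it will over-negate in sentences like "no fever but cough"; a full NegEx implementation handles those cases.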

3. Clinical Logic Parsing

  • Diagnostic Reasoning Chain: Reasoning path from symptoms → examination → diagnosis
  • Treatment Decision Tree: Clinical basis for medication selection and dosage adjustment
  • Disease Progression: Descriptions of disease course and outcomes

4. Structured Output

  • FHIR-compatible clinical document format
  • Knowledge graph-friendly triple format
  • Temporal event sequences
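For the knowledge-graph output, one plausible shape is a list of (subject, predicate, object) tuples. The relation record layout used below (`head`, `type`, `tail`) and the function name are assumptions; this document shows only entity output and names the relation types in its configuration. The sample relation is illustrative, not extracted data.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def relations_to_triples(relations: List[Dict[str, str]]) -> List[Triple]:
    """Convert relation records into de-duplicated (subject, predicate, object) triples."""
    seen = set()
    triples: List[Triple] = []
    for rel in relations:
        triple = (rel["head"], rel["type"], rel["tail"])
        if triple not in seen:  # keep the first occurrence, preserve input order
            seen.add(triple)
            triples.append(triple)
    return triples


# Illustrative input with one duplicate record (hypothetical layout)
example_relations = [
    {"head": "aspirin", "type": "TREATS", "tail": "acute myocardial infarction"},
    {"head": "aspirin", "type": "TREATS", "tail": "acute myocardial infarction"},
]
print(relations_to_triples(example_relations))
```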

Usage

from skills.unstructured_medical_text_miner.scripts.main import MedicalTextMiner

# Initialize miner
miner = MedicalTextMiner()

# Load MIMIC-IV note data
miner.load_notes(notes_path="path/to/noteevents.csv")

# Extract all text records for a specific patient
patient_texts = miner.get_patient_texts(subject_id=10000032)

# Execute complete information extraction
insights = miner.extract_insights(
    text=patient_texts,
    extract_entities=True,
    extract_relations=True,
    extract_timeline=True
)

Input

Data Sources

  • MIMIC-IV NOTEEVENTS table (csv/parquet format)
  • Discharge summary files
  • Imaging report files
  • Custom medical text

Field Requirements

| Field Name | Description | Required |
| --- | --- | --- |
| subject_id | Patient unique identifier | Yes |
| hadm_id | Hospital admission record identifier | No |
| note_type | Note type (DS/RR/ECG, etc.) | Yes |
| note_text | Note text content | Yes |
| charttime | Record time | No |
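
A quick header check can catch missing required fields before any extraction runs. This is a minimal sketch using only the standard library; the function name `missing_required_fields` is hypothetical, and the required/optional split simply mirrors the field list above.

```python
import csv
import io

REQUIRED_FIELDS = ("subject_id", "note_type", "note_text")  # fields marked "Yes"
OPTIONAL_FIELDS = ("hadm_id", "charttime")                  # fields marked "No"


def missing_required_fields(csv_file) -> list:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_FIELDS if name not in header]


# In-memory sample whose header lacks note_text
sample = io.StringIO("subject_id,hadm_id,note_type,charttime\n10000032,,DS,\n")
print(missing_required_fields(sample))  # ['note_text']
```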

Output

Entity Extraction Results

{
  "entities": [
    {
      "text": "acute myocardial infarction",
      "type": "DISEASE",
      "start": 156,
      "end": 183,
      "confidence": 0.94
    },
    {
      "text": "aspirin 81mg",
      "type": "MEDICATION",
      "start": 245,
      "end": 257,
      "attributes": {
        "dose": "81mg",
        "frequency": "daily"
      }
    }
  ]
}
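
Assuming `start` and `end` are character offsets with an exclusive end (the example values are consistent with that reading, since 183 - 156 equals the length of "acute myocardial infarction"), consumers can sanity-check spans and filter by confidence. The helper names and the 0.9 threshold are illustrative, and keeping entities that lack a `confidence` field is an assumed policy.

```python
def span_is_consistent(entity: dict) -> bool:
    """Check that end - start equals the length of the entity text."""
    return entity["end"] - entity["start"] == len(entity["text"])


def filter_confident(entities: list, threshold: float = 0.9) -> list:
    """Keep entities whose confidence meets the threshold.

    Entities without a confidence field are kept (assumed policy).
    """
    return [e for e in entities if e.get("confidence", 1.0) >= threshold]


# The two entities from the example output above
entities = [
    {"text": "acute myocardial infarction", "type": "DISEASE",
     "start": 156, "end": 183, "confidence": 0.94},
    {"text": "aspirin 81mg", "type": "MEDICATION", "start": 245, "end": 257,
     "attributes": {"dose": "81mg", "frequency": "daily"}},
]
print(all(span_is_consistent(e) for e in entities))
print([e["text"] for e in filter_confident(entities, threshold=0.95)])
```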

Clinical Logic Graph

{
  "clinical_logic": {
    "presenting_complaint": "chest pain",
    "differential_diagnoses": ["ACS", "PE", "aortic dissection"],
    "workup": ["ECG", "troponin", "CTA chest"],
    "final_diagnosis": "STEMI",
    "treatment_plan": ["PCI", "dual antiplatelet"]
  }
}
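
The symptoms → examination → diagnosis path can be read straight off this structure. The sketch below renders it as a single line; `reasoning_chain` is a hypothetical helper, and it assumes the three keys it reads are always present.

```python
def reasoning_chain(logic: dict) -> str:
    """Render complaint → workup → diagnosis as one readable line."""
    stages = [
        logic["presenting_complaint"],
        ", ".join(logic["workup"]),
        logic["final_diagnosis"],
    ]
    return " → ".join(stages)


# The clinical_logic object from the example above
clinical_logic = {
    "presenting_complaint": "chest pain",
    "differential_diagnoses": ["ACS", "PE", "aortic dissection"],
    "workup": ["ECG", "troponin", "CTA chest"],
    "final_diagnosis": "STEMI",
    "treatment_plan": ["PCI", "dual antiplatelet"],
}
print(reasoning_chain(clinical_logic))  # chest pain → ECG, troponin, CTA chest → STEMI
```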

Temporal Events

{
  "timeline": [
    {
      "time": "2020-03-15 08:30",
      "event": "admission",
      "description": "presented with chest pain"
    },
    {
      "time": "2020-03-15 09:15",
      "event": "ECG",
      "description": "ST elevation in V1-V4"
    }
  ]
}
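
Because `time` is a plain string, downstream code needs to parse it before ordering events or measuring intervals. A minimal sketch, assuming every timestamp follows the "YYYY-MM-DD HH:MM" pattern seen in the example; the helper names are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # matches the "time" strings in the example


def sort_timeline(events: list) -> list:
    """Return events ordered chronologically by their parsed time."""
    return sorted(events, key=lambda e: datetime.strptime(e["time"], TIME_FORMAT))


def minutes_between(first: dict, second: dict) -> float:
    """Elapsed minutes from `first` to `second` (negative if out of order)."""
    start = datetime.strptime(first["time"], TIME_FORMAT)
    end = datetime.strptime(second["time"], TIME_FORMAT)
    return (end - start).total_seconds() / 60


# The example events, deliberately given out of order
timeline = [
    {"time": "2020-03-15 09:15", "event": "ECG"},
    {"time": "2020-03-15 08:30", "event": "admission"},
]
ordered = sort_timeline(timeline)
print([e["event"] for e in ordered])            # ['admission', 'ECG']
print(minutes_between(ordered[0], ordered[1]))  # 45.0
```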

Dependencies

pandas>=1.3.0
spacy>=3.4.0
scispacy>=0.5.1
radlex (for radiology terminology)
negspacy (for negation detection)

Configuration

# config.yaml
extraction:
  entity_types: ["DISEASE", "SYMPTOM", "MEDICATION", "PROCEDURE", "ANATOMY"]
  relation_types: ["TREATS", "CAUSES", "CONTRAINDICATED_WITH"]
  enable_negation_detection: true
  
models:
  ner_model: "en_core_sci_lg"  # or "en_core_sci_scibert"
  relation_model: "custom_relation_extractor"
  
output:
  format: "json"  # json/fhir/kg
  include_raw_text: false
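
Once the YAML has been parsed into a dictionary (by whichever loader the skill uses), a light validation pass can reject unusable settings early. This is a sketch under stated assumptions: the allowed formats come from the comment on `output.format`, while the fallback defaults and the function name `validate_config` are hypothetical.

```python
ALLOWED_FORMATS = {"json", "fhir", "kg"}  # from the comment on output.format


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; empty means the config looks usable."""
    problems = []
    extraction = config.get("extraction", {})
    output = config.get("output", {})
    if not extraction.get("entity_types"):
        problems.append("extraction.entity_types must be a non-empty list")
    # Assumed default: negation detection is on when the key is omitted
    if not isinstance(extraction.get("enable_negation_detection", True), bool):
        problems.append("extraction.enable_negation_detection must be true or false")
    # Assumed default: JSON output when the key is omitted
    if output.get("format", "json") not in ALLOWED_FORMATS:
        problems.append("output.format must be one of: " + ", ".join(sorted(ALLOWED_FORMATS)))
    return problems


parsed = {
    "extraction": {"entity_types": ["DISEASE", "MEDICATION"], "enable_negation_detection": True},
    "output": {"format": "xml", "include_raw_text": False},
}
print(validate_config(parsed))
```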

CLI Usage

# Process single file
python -m skills.unstructured_medical_text_miner.scripts.main \
  --input notes.csv \
  --output extracted.json \
  --extract all

# Process specific patient
python -m skills.unstructured_medical_text_miner.scripts.main \
  --subject-id 10000032 \
  --db-path mimic_iv.db \
  --output patient_insights.json
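
The flags in these two invocations could be parsed with `argparse` roughly as follows. This is a guess at the interface, not the script's actual parser: the help strings, the `all` default for `--extract`, making `--output` required, and the rule that either `--input` or `--db-path` must be given are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="unstructured_medical_text_miner")
    parser.add_argument("--input", help="notes file to process")
    parser.add_argument("--db-path", help="database containing MIMIC-IV notes")
    parser.add_argument("--subject-id", type=int, help="restrict to one patient")
    parser.add_argument("--output", required=True, help="where to write results")
    parser.add_argument("--extract", default="all", help="what to extract (assumed default: all)")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and require at least one data source (assumed rule)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.input or args.db_path):
        parser.error("provide --input or --db-path")
    return args


args = parse_args(["--subject-id", "10000032", "--db-path", "mimic_iv.db",
                   "--output", "patient_insights.json"])
print(args.subject_id, args.db_path, args.extract)
```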

References

  1. MIMIC-IV Clinical Database: https://physionet.org/content/mimiciv/
  2. scispacy: https://allenai.github.io/scispacy/
  3. NegEx/negspacy for negation detection
  4. FHIR Clinical Document specifications

Author

  • Skill ID: 213
  • Category: Medical Data Mining
  • Complexity: Advanced

Risk Assessment

| Risk Indicator | Assessment | Level |
| --- | --- | --- |
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

Security Checklist

  • No hardcoded credentials or API keys
  • No unauthorized file system access (../)
  • Output does not expose sensitive information
  • Prompt injection protections in place
  • Input file paths validated (no ../ traversal)
  • Output directory restricted to workspace
  • Script execution in sandboxed environment
  • Error messages sanitized (no stack traces exposed)
  • Dependencies audited
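
The two path-related checklist items (no `../` traversal, output restricted to the workspace) can both be enforced by resolving a path and confirming it still sits under the workspace root. A minimal sketch; `resolve_inside` is a hypothetical helper, not part of the skill.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under root
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path}") from None
    return candidate


print(resolve_inside(".", "out/extracted.json").name)
try:
    resolve_inside(".", "../outside.json")
except ValueError as err:
    print(err)
```

Resolving before the comparison matters: a plain string check for "../" would miss absolute paths and symlinked directories, both of which `resolve()` normalizes.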

Prerequisites

# Python dependencies
pip install -r requirements.txt

Evaluation Criteria

Success Metrics

  • Successfully executes main functionality
  • Output meets quality standards
  • Handles edge cases gracefully
  • Performance is acceptable

Test Cases

  1. Basic Functionality: Standard input → Expected output
  2. Edge Case: Invalid input → Graceful error handling
  3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

  • Current Stage: Draft
  • Next Review Date: 2026-03-06
  • Known Issues: None
  • Planned Improvements:
    • Performance optimization
    • Additional feature support

Repository: aipoch/medical-research-skills