data-engineering

Transforms, validates, and loads data in ETL pipelines. Use when building scrapers, validating NDJSON feeds, or importing data into CMS/DB targets.

Data Engineering

Generic pipeline patterns. For project-specific sources and full schema references, see REFERENCE.md.

Scraper Architecture

Launch a headless browser cluster (puppeteer-cluster or Playwright) configured with retryLimit: 3, retryDelay: 5000, timeout: 30000, and browser args ['--no-sandbox', '--disable-setuid-sandbox'].
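The settings above can be collected into a puppeteer-cluster launch config. A minimal sketch; the option names follow puppeteer-cluster's Cluster.launch API, and maxConcurrency is an assumption, not a value from this skill:

```javascript
// Launch options matching the values described above.
// maxConcurrency is an illustrative choice, not prescribed by this skill.
const clusterOptions = {
  maxConcurrency: 4,      // parallel workers (assumption)
  retryLimit: 3,          // re-queue a failed task up to 3 times
  retryDelay: 5000,       // wait 5 s between retries
  timeout: 30000,         // per-task timeout in ms
  puppeteerOptions: {
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  },
};

// Usage (requires puppeteer-cluster to be installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch(clusterOptions);
// await cluster.task(async ({ page, data: url }) => { /* scrape url */ });

console.log(clusterOptions.retryLimit, clusterOptions.timeout);
```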

NDJSON Output

One record per line. Schema:

| Field | Type | Notes |
| --- | --- | --- |
| name | Required | Preserve original encoding |
| lat / lng | Required | GPS coordinates |
| address | Required | Full text |
| source | Required | e.g. google-maps |
| sourceId | Required | Source-unique ID |
| category | Required | Domain category |
| rating, reviewCount, phone, website, openingHours, photos, priceLevel | Optional | |
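A record matching the schema above, serialized one per line (all field values here are illustrative, including the sourceId):

```javascript
// Illustrative record: required fields from the schema plus one optional
// field. NDJSON = one JSON object per line, newline-terminated.
const record = {
  name: 'Café Central',            // original encoding preserved
  lat: 48.2104,
  lng: 16.3664,
  address: 'Herrengasse 14, 1010 Wien',
  source: 'google-maps',
  sourceId: 'example-id-001',      // hypothetical source-unique ID
  category: 'cafe',
  rating: 4.4,                     // optional field
};

const line = JSON.stringify(record) + '\n';
process.stdout.write(line);
```

JSON.stringify never emits raw newlines, which is what keeps one record per line.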

Recommended Workflow (numbered, with validation)

  1. Scrape: run the scraper with --dry-run to collect a sample (50–200 records).
    • Checkpoint: sample contains expected fields and geo data.
    • Recovery: fix extractor selectors, re-run sample.
  2. Validate NDJSON: run line-by-line JSON parse + schema validator (see validate-ndjson.js example).
    • Checkpoint: 0 parse errors, required fields present.
    • Recovery: run ndjson-filter to isolate failing records and inspect source HTML.
  3. Dry-run import: import into staging with createOrReplace disabled; check counts and duplicates.
    • Checkpoint: counts match expectation ±5% and no duplicates inserted.
    • Recovery: revert staging and adjust dedupe key.
  4. Backup: snapshot current target (DB export) and store with timestamp.
  5. Import: run import with idempotent keys and monitor logs; on failure revert to backup.
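The idempotent keys in step 5 can be built from the two required identity fields in the schema (source and sourceId). A minimal sketch, assuming a plain `source:sourceId` format; the exact key format is a convention of your import target, not fixed by this skill:

```javascript
// Derive a stable import key so re-running the import updates existing
// records instead of inserting duplicates. Key format is an assumption.
function importKey(record) {
  if (!record.source || !record.sourceId) {
    throw new Error('record is missing source or sourceId');
  }
  return `${record.source}:${record.sourceId}`;
}

console.log(importKey({ source: 'google-maps', sourceId: 'abc123' }));
// google-maps:abc123
```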

Quick executable pipeline (copy & adapt)

```shell
node ./scripts/scrape-to-ndjson.js --out=data.ndjson --pages=100
node ./scripts/validate-ndjson.js data.ndjson
node ./scripts/dry-import.js data.ndjson --target=staging
node ./scripts/import.js data.ndjson --target=production
```

Inline: minimal NDJSON validator

```javascript
const fs = require('fs'), readline = require('readline'), { z } = require('zod');
const schema = z.object({ name: z.string(), source: z.string(), sourceId: z.string() });

// Wrapped in an async IIFE: top-level `for await` is not valid in CommonJS.
(async () => {
  const iface = readline.createInterface({ input: fs.createReadStream(process.argv[2]) });
  let line = 0, errors = 0;
  for await (const l of iface) {
    line++;
    if (!l.trim()) continue; // tolerate blank lines
    try { schema.parse(JSON.parse(l)); }
    catch (e) { console.error(`Line ${line}:`, e.message); errors++; }
  }
  if (errors) { console.error(`${errors} errors`); process.exit(2); }
  console.log('OK');
})();
```
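Step 2's recovery path mentions isolating failing records. A hypothetical ndjson-filter sketch (the helper name and behavior are assumptions, not part of this skill's scripts) that splits lines into parseable and unparseable sets:

```javascript
// Split NDJSON text into lines that parse as JSON and lines that do not,
// so broken records can be inspected against the source HTML.
function splitNdjson(text) {
  const good = [];
  const bad = [];
  for (const line of text.split('\n')) {
    if (!line.trim()) continue; // ignore blank lines
    try {
      JSON.parse(line);
      good.push(line);
    } catch {
      bad.push(line);
    }
  }
  return { good, bad };
}

// Two valid records and one truncated line:
const sample = '{"name":"A","source":"s","sourceId":"1"}\n{"name":"B"\n{"name":"C","source":"s","sourceId":"3"}';
const { good, bad } = splitNdjson(sample);
console.log(good.length, bad.length); // 2 1
```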

Full scraper and extended validator: see REFERENCE.md.

Repository: monkilabs/opencastle
