
alonso-skills/sql-ml-features

Use when preparing data for machine learning from SQL Server — feature engineering in T-SQL, building training/test datasets, statistical aggregations for ML pipelines, sampling strategies, data normalization and encoding in SQL, writing queries that feed pandas or scikit-learn, exporting to Parquet or CSV for model training, or when a data scientist asks for a 'feature table' or 'training set' from a SQL Server database.


SKILL.md

SQL for Machine Learning

Patterns for extracting ML-ready features from SQL Server — turning normalized relational data into the wide, denormalized, NULL-free datasets that models consume.

When to Use

  • Building a feature table or training dataset from application data
  • Feature engineering in T-SQL (rolling aggregations, lag features, RFM scoring)
  • Encoding categorical variables (one-hot, ordinal, frequency encoding)
  • Handling NULLs for ML (imputation strategies in SQL)
  • Sampling and train/test splitting from SQL Server
  • Exporting large datasets to CSV, Parquet, or pandas
  • Preventing data leakage in temporal feature queries
  • Writing queries that feed scikit-learn, XGBoost, or PyTorch pipelines

When NOT to use:

  • Application schema design (table design, naming conventions, access control)
  • Query performance tuning (execution plans, index tuning, wait stats)
  • BI dashboards and summary reports (GROUPING SETS, pivot tables, dashboard queries)
  • Running R/Python inside SQL Server (SQL Server ML Services)

Feature Table Build Workflow

Follow these steps in order. Each step has a validation checkpoint.

  1. Set snapshot date — Anchor all features to a fixed @SnapshotDate. Never use GETDATE() inside feature queries.

    • Validate: SELECT @SnapshotDate returns the intended date.
  2. Build base entity table — SELECT DISTINCT the entity key into a temp table #Base. One row per entity, primary key only — no features yet.

    ```sql
    SELECT DISTINCT CustomerId
    INTO #Base
    FROM Customers
    WHERE SignupDate < @SnapshotDate;
    ```
    • Validate: SELECT COUNT(*), COUNT(DISTINCT CustomerId) FROM #Base — counts must match (no duplicates).
  3. Join features with temporal bounds — LEFT JOIN each feature set to #Base. Bound every join with AND EventDate <= @SnapshotDate.

    ```sql
    SELECT b.CustomerId,
           DATEDIFF(DAY, f.LastOrder, @SnapshotDate) AS Recency,
           f.OrderCount AS Frequency
    FROM #Base b
    LEFT JOIN FeatureCTE f ON f.CustomerId = b.CustomerId;
    -- FeatureCTE itself must filter to EventDate <= @SnapshotDate
    ```
    • Validate: Row count equals #Base row count. No feature references dates after @SnapshotDate.
  4. Impute NULLs — Replace every NULL with an explicit value. Use COALESCE with mean, median, zero, or a sentinel depending on the column semantics.

    • Validate: SELECT SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) FROM FeatureTable returns 0 for every column.
  5. Encode categoricals — One-hot, ordinal, or frequency encode all non-numeric columns.

    • Validate: SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'FeatureTable' — all columns are numeric types (int, float, decimal, bit).
  6. Split train/test — Use hash-based split for cross-sectional data or cutoff-date split for time series.

    • Validate: SELECT SplitLabel, COUNT(*) FROM FeatureTable GROUP BY SplitLabel — verify expected proportions (e.g., 80/20).
  7. Export — BCP, pandas.read_sql, or FOR JSON depending on the consumer.
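The seven steps above can be exercised end to end against any SQL engine. A minimal Python sketch follows, using the standard-library sqlite3 module as a stand-in for SQL Server — the schema, table names, and values are hypothetical, and T-SQL-only constructs (temp tables, DATEDIFF, HASHBYTES) are replaced with portable equivalents:

```python
import sqlite3
import hashlib

# Toy stand-in schema; all names and values here are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (CustomerId INTEGER, SignupDate TEXT);
CREATE TABLE Orders    (CustomerId INTEGER, OrderDate TEXT, Amount REAL);
INSERT INTO Customers VALUES (1,'2023-01-05'), (2,'2023-02-10'), (3,'2024-06-01');
INSERT INTO Orders    VALUES (1,'2023-03-01',10.0), (1,'2023-04-01',20.0),
                             (2,'2023-05-01',5.0),  (2,'2024-05-01',99.0);
""")

snapshot = "2024-01-01"   # step 1: fixed snapshot date, never "now"

rows = con.execute("""
    WITH Base AS (                        -- step 2: one row per entity
        SELECT DISTINCT CustomerId FROM Customers WHERE SignupDate < :snap
    ),
    Feat AS (                             -- step 3: temporally bounded features
        SELECT CustomerId,
               COUNT(*)       AS OrderCount,
               MAX(OrderDate) AS LastOrder
        FROM Orders
        WHERE OrderDate <= :snap          -- nothing after the snapshot leaks in
        GROUP BY CustomerId
    )
    SELECT b.CustomerId,
           COALESCE(f.OrderCount, 0) AS Frequency,              -- step 4: impute
           julianday(:snap) - julianday(COALESCE(f.LastOrder, :snap)) AS Recency
    FROM Base b
    LEFT JOIN Feat f ON f.CustomerId = b.CustomerId
    ORDER BY b.CustomerId
""", {"snap": snapshot}).fetchall()

def split_bucket(entity_id, buckets=10):
    # step 6: deterministic hash split (hashlib stands in for HASHBYTES)
    digest = hashlib.sha256(str(entity_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

features = [(cid, freq, rec, "test" if split_bucket(cid) < 2 else "train")
            for cid, freq, rec in rows]
print(features)
```

Customer 3 signs up after the snapshot and is excluded at step 2; customer 2's 2024 order is excluded by the temporal bound at step 3, so its Frequency is 1, not 2.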

Feature Engineering Taxonomy

| Category | Examples | T-SQL tools |
|---|---|---|
| Numeric | Raw values, LOG(col + 1) for skewed, ratios, z-scores | LOG(col + 1), SQRT, window AVG/STDEV |
| Categorical | One-hot encoding, ordinal, frequency | CASE, PIVOT, DENSE_RANK, COUNT ratios |
| Temporal | Recency, duration, day-of-week | DATEDIFF, DATEPART, DATENAME |
| Rolling window | 7-day sum, 30-day average | SUM/AVG OVER (ROWS BETWEEN ...) |
| Lag / offset | Previous value, delta from prior | LAG, LEAD |
| Interaction | Product, ratio of two features | Computed expressions in SELECT |
| Text signals | Length, keyword presence | LEN, CHARINDEX, PATINDEX |
| Missingness | Is this value missing? | CASE WHEN col IS NULL THEN 1 ELSE 0 END |
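The rolling-window and lag rows in the taxonomy carry over to any engine with SQL window functions. A small sketch using Python's sqlite3 (SQLite 3.25+ is assumed for window-function support; the table and values are illustrative, and the T-SQL semantics of ROWS BETWEEN and LAG are the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (CustomerId INT, OrderDate TEXT, Amount REAL)")
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [
    (1, "2024-01-01", 10.0),
    (1, "2024-01-02", 20.0),
    (1, "2024-01-03", 30.0),
    (2, "2024-01-01",  5.0),
])

rows = con.execute("""
    SELECT CustomerId, OrderDate, Amount,
           -- rolling sum over this row and the 2 preceding rows, per customer
           SUM(Amount) OVER (PARTITION BY CustomerId ORDER BY OrderDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Roll3,
           -- lag feature: previous order's amount, NULL on each entity's first row
           LAG(Amount, 1) OVER (PARTITION BY CustomerId
                                ORDER BY OrderDate) AS PrevAmount
    FROM Orders
    ORDER BY CustomerId, OrderDate
""").fetchall()
print(rows)
```

Note that both windows use PARTITION BY CustomerId, so customer 2's single row gets Roll3 = 5.0 and PrevAmount = NULL rather than inheriting customer 1's history.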

Quick Reference: SQL Pattern → ML Concept

| ML concept | T-SQL pattern |
|---|---|
| Recency feature | DATEDIFF(DAY, LastEventDate, @SnapshotDate) |
| Frequency feature (30-row) | COUNT(*) OVER (PARTITION BY entity ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) |
| Rolling 7-day revenue | SUM(Amount) OVER (ORDER BY OrderDate ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) |
| Lag feature (t-1) | LAG(Amount, 1) OVER (PARTITION BY CustomerId ORDER BY OrderDate) |
| One-hot encode | CASE WHEN Category = 'A' THEN 1 ELSE 0 END AS Category_A |
| Ordinal encode | DENSE_RANK() OVER (ORDER BY Category) |
| Frequency encode | COUNT(*) OVER (PARTITION BY Category) * 1.0 / COUNT(*) OVER () |
| Log transform | LOG(Amount + 1) — the +1 offset handles zero values; omit only when zeros are impossible |
| Quantile bucket | NTILE(10) OVER (ORDER BY Score) |
| Row hash for split | ABS(CAST(CAST(HASHBYTES('SHA2_256', CAST(Id AS NVARCHAR(20))) AS BINARY(8)) AS BIGINT)) % 10 |
| Mean imputation | COALESCE(col, AVG(col) OVER ()) |
| Median imputation | COALESCE(col, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) OVER ()) |
| NULL indicator | CASE WHEN col IS NULL THEN 1 ELSE 0 END AS col_is_missing |
| Min-max scaling | (col - MIN(col) OVER ()) * 1.0 / NULLIF(MAX(col) OVER () - MIN(col) OVER (), 0) |
| Date bucketing | DATE_BUCKET(WEEK, 1, EventDate) (SQL Server 2022+) |
| Date truncation | DATETRUNC(MONTH, EventDate) (SQL Server 2022+) |
| First value in series | FIRST_VALUE(col) OVER (PARTITION BY entity ORDER BY dt ROWS UNBOUNDED PRECEDING) |
| Rolling volatility | STDEV(col) OVER (PARTITION BY entity ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) |
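The row-hash pattern's cast chain maps directly onto a few lines of Python. A sketch with hashlib standing in for HASHBYTES — the buckets will not match the T-SQL expression byte-for-byte (NVARCHAR is UTF-16, and this uses UTF-8), but the structure of the chain and its determinism are the same; the 80/20 threshold is illustrative:

```python
import hashlib

def split_label(entity_id, test_buckets=2, total_buckets=10):
    """Deterministic hash split mirroring the T-SQL chain:
    cast key to string -> SHA-256 -> first 8 bytes -> signed int -> ABS % buckets."""
    digest = hashlib.sha256(str(entity_id).encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big", signed=True)   # like CAST AS BIGINT
    bucket = abs(value) % total_buckets
    return "test" if bucket < test_buckets else "train"

labels = [split_label(i) for i in range(10_000)]
share = labels.count("test") / len(labels)
print(round(share, 3))   # roughly 0.2 for a uniform hash
```

Because the split is a pure function of the key, re-running the pipeline (or adding new rows) never reshuffles existing entities between train and test — the property that makes hash splits preferable to ORDER BY NEWID() sampling.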

Common Mistakes

| Mistake | What goes wrong | Fix |
|---|---|---|
| Using AVG(col) without understanding NULL exclusion | AVG divides by non-NULL count, not row count — mean is inflated when NULLs represent zeros | Use AVG(COALESCE(col, 0)) when NULLs mean zero |
| Rolling window with RANGE instead of ROWS | Tied timestamps include extra rows, producing different counts per row | Always use ROWS BETWEEN n PRECEDING AND CURRENT ROW |
| Joining to future events in feature query | Leaks information that wasn't available at prediction time | Bound all joins with AND EventDate <= @SnapshotDate |
| Computing global statistics before the train/test split | Mean and variance incorporate test data into training features | Compute statistics only over training rows |
| Using LAG without checking temporal order | LAG requires ORDER BY — missing it gives non-deterministic offsets | Always include PARTITION BY entity ORDER BY timestamp |
| Forward-filling sensor NULLs across entity boundaries | PARTITION BY is missing, so last value bleeds across entities | Always PARTITION BY entity_id in forward-fill window |
| HASHBYTES on NULL key | HASHBYTES(algo, NULL) returns NULL — entity lands in no split | Exclude or handle NULL keys before splitting |
| UNPIVOT for categorical to indicator columns | UNPIVOT silently drops NULL values | Use CROSS APPLY VALUES to preserve NULLs as zeros |
| Random split for time-series data | Test set leaks future patterns into training window | Use a cutoff date for time-series train/test split |
| Including the label column in lag features | LAG(label) leaks ground truth about adjacent rows | Exclude any derivative of the target from feature set |
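The "global statistics before the split" row is the easiest of these to get wrong, because the leaky and safe versions produce identical schemas. A minimal sketch of the safe ordering in plain Python (the values are illustrative):

```python
# Leakage-safe scaling: fit statistics on the training rows only,
# then apply the *training* statistics to both splits.
train = [10.0, 20.0, 30.0]
test  = [40.0, 50.0]

train_mean = sum(train) / len(train)    # 20.0 — no test rows involved

def center(x, mean=train_mean):
    # Apply the frozen training-set statistic to any row.
    return x - mean

train_scaled = [center(x) for x in train]   # [-10.0, 0.0, 10.0]
test_scaled  = [center(x) for x in test]    # [20.0, 30.0]

# The leaky version would compute the mean over train + test (26.0 here),
# silently shifting every training feature by information from the test set.
print(train_scaled, test_scaled)
```

In SQL terms: assign SplitLabel first (step 6 of the workflow), then compute AVG, STDEV, or percentile statistics with a WHERE SplitLabel = 'train' filter before applying them via COALESCE or scaling expressions.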

Reference Files

  • Feature Engineering — numeric, categorical, temporal, rolling windows, lag, RFM, text
  • Sampling and Splitting — TABLESAMPLE, hash-based splits, NTILE k-fold, stratified, time-based
  • NULL Imputation — mean/median/mode/forward-fill/constant, missingness indicators
  • Export Patterns — BCP, pandas.read_sql, chunked reads, FOR JSON, Parquet, feature stores
  • Data Leakage — temporal, target, train/test leakage; prevention patterns in SQL
