Feline USG Predictor v1.6

Estimation of Urine Specific Gravity from Serum Chemistry, CBC, and Patient Age  |  Feline-only clinic  |  April 2026

What’s new in v1.6: Complete methodological redesign optimized for clinical safety. Multi-model ensemble with patient-grouped cross-validation ensures no patient appears in both training and evaluation. Temporal holdout validation on 248 never-before-seen cases provides honest, unbiased performance estimates. Conformal prediction adds statistically guaranteed uncertainty quantification. Sensitivity target raised to ≥90% — in a screening tool, catching sick cats is the primary objective. Holdout: 92% sensitivity, 60% specificity, 0.86 AUC-ROC, 90% NPV.
Contents
  1. Clinical Rationale
  2. What Changed in v1.6
  3. Validation Methodology
  4. Algorithm & Architecture
  5. Performance Metrics
  6. Threshold & Clinical Trade-offs
  7. Conformal Prediction
  8. Confusion Matrix
  9. Feature Importance & Physiological Basis
  10. Prediction Explanations
  11. Version History
  12. Limitations
  13. Study Population

1. Clinical Rationale

Urine Specific Gravity is a cornerstone of feline renal assessment. IRIS staging of Chronic Kidney Disease incorporates USG alongside creatinine and SDMA to differentiate stages and guide management. Loss of concentrating ability (USG <1.035 in cats) is frequently among the earliest detectable signs of tubular dysfunction, often preceding azotemia.

However, urine collection is not always achievable at the time of presentation. An empty bladder, patient temperament, contraindications to cystocentesis (coagulopathy, abdominal masses), or client constraints may preclude urinalysis.

The Gap: In our dataset of 3,642 feline visits over 4 years, only 1,506 (41%) included a urinalysis. In the remaining 2,136 visits, renal concentrating ability was unknown despite bloodwork being available.

This model estimates USG from serum chemistry, CBC, and patient age — values already being collected on the same blood draw — providing a screening estimate of concentrating ability at zero additional cost, zero additional procedure time, and zero additional client charge.

2. What Changed in v1.6

Two fundamental improvements in v1.6:
  1. Methodological rigor: v1.5a used standard row-level cross-validation, which allowed the same patient’s multiple visits to appear in both training and evaluation splits. v1.6 enforces strict patient-level separation and temporal holdout validation, producing metrics that honestly reflect real-world deployment performance.
  2. Clinical optimization: v1.6 is explicitly optimized for ≥90% sensitivity. For a screening tool, catching sick cats is the highest priority. The cost function penalizes missed impaired cats more heavily than false flags, reflecting the clinical reality that a missed kidney diagnosis is far more consequential than an unnecessary urinalysis recommendation.

Summary of Changes

Aspect | v1.5a | v1.6
Architecture | Single model | Multi-model ensemble
Validation | Row-level CV (patient overlap) | Patient-grouped CV + temporal holdout
Uncertainty | None | Conformal prediction (90% coverage)
Optimization Target | Balanced accuracy | ≥90% sensitivity
Hyperparameter Search | Limited | Extensive GPU-accelerated search
Holdout Sensitivity | 81.0%* | 92%
Holdout Specificity | 73.9%* | 60%
Holdout AUC-ROC |  | 0.86
Holdout NPV |  | 90%
Input Features | 5 | 5 (unchanged)

*v1.5a used row-level cross-validation where the same patient’s multiple visits could appear in both training and evaluation. Its metrics are not directly comparable to v1.6’s patient-independent evaluation.

Why Sensitivity Went Up and Specificity Went Down

This is a deliberate design decision, not a trade-off we stumbled into. v1.6 was explicitly optimized to achieve ≥90% sensitivity because this is a screening tool. The fundamental question is:

“Which error is more acceptable: sending a healthy cat for a minimal-cost urinalysis, or sending a sick cat home without detection?”

With 92% sensitivity on the holdout set, only 9 out of 107 impaired cats were missed. The trade-off is that 57 out of 141 healthy cats are flagged for urinalysis they may not need. For those 57 cats, the consequence is a minimal-cost add-on test that confirms they are healthy. For the 98 impaired cats correctly flagged, the consequence is early detection and intervention.

3. Validation Methodology

v1.6 implements a rigorous validation strategy designed to eliminate optimistic bias and produce publication-grade metrics:

Three-Level Data Split

Level | Purpose | Size | Access
Development Set | Model training + threshold tuning | 1,086 cases | Used during training via cross-validation
Temporal Holdout | Final unbiased evaluation | 248 cases | Never touched until final evaluation

Smart Temporal Split

The holdout set is constructed by taking each patient’s most recent visit and reserving it for final evaluation. This simulates real deployment: the model trains on historical data and is evaluated on the most recent encounter for each cat — exactly the scenario it faces in clinical use.
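
The last-visit-per-patient rule described above can be sketched in a few lines. This is a minimal illustration, not the v1.6 pipeline: the record fields `patient_id` and `visit_date` and the helper `temporal_holdout_split` are hypothetical names chosen for the example, and the sketch leaves a patient's earlier visits in the development set because the text does not say how they are handled.

```python
from datetime import date

def temporal_holdout_split(visits):
    """Reserve each patient's most recent visit for the holdout set.

    `visits` is a list of dicts with the hypothetical keys
    'patient_id' and 'visit_date'. Returns (development, holdout).
    """
    latest = {}
    for v in visits:
        pid = v["patient_id"]
        if pid not in latest or v["visit_date"] > latest[pid]["visit_date"]:
            latest[pid] = v
    holdout = list(latest.values())
    holdout_ids = {id(v) for v in holdout}
    # Everything that is not a patient's latest visit stays in development.
    development = [v for v in visits if id(v) not in holdout_ids]
    return development, holdout

# Illustrative records only; not taken from the study data.
visits = [
    {"patient_id": "cat_a", "visit_date": date(2023, 3, 1)},
    {"patient_id": "cat_a", "visit_date": date(2025, 6, 9)},
    {"patient_id": "cat_b", "visit_date": date(2024, 1, 15)},
]
development, holdout = temporal_holdout_split(visits)
```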

What this means for reported metrics: v1.6’s holdout numbers represent what the model achieves on genuinely unseen, recent data with strict patient independence. They are the best available estimate of deployment performance at this clinic; performance at other practices has not yet been measured (see Limitations).

4. Algorithm & Architecture

Multi-Model Ensemble

v1.6 uses an ensemble of 7 independently trained machine learning models. Each model is trained on the development data with a different cross-validation fold held out, so each sees a slightly different subset of patients. At prediction time, the models’ scores are averaged to give the final prediction.
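
To make the two ideas in this paragraph concrete, here is a minimal sketch of patient-grouped fold assignment and score averaging. It assumes a patient key of (pet_name, owner), as listed under Study Population; the helpers `grouped_folds` and `ensemble_score`, the example names, and the stand-in models are hypothetical, and the sketch omits the class stratification that the real pipeline reports using.

```python
from statistics import mean

def grouped_folds(patient_keys, n_folds):
    """Assign every visit of a patient to the same fold (round-robin by patient)."""
    fold_of_patient = {}
    for key in patient_keys:
        if key not in fold_of_patient:
            fold_of_patient[key] = len(fold_of_patient) % n_folds
    return [fold_of_patient[key] for key in patient_keys]

def ensemble_score(models, features):
    """Average the per-model scores for one cat."""
    return mean(model(features) for model in models)

# One key per visit; repeated keys are repeat visits by the same cat.
visit_keys = [("Milo", "Ruiz"), ("Milo", "Ruiz"), ("Luna", "Chen"),
              ("Tigger", "Okafor"), ("Luna", "Chen")]
folds = grouped_folds(visit_keys, n_folds=2)

# Stand-in "models": each maps a feature dict to a score in [0, 1].
models = [lambda x: 0.6, lambda x: 0.8]
score = ensemble_score(models, {"bun": 38.0})
```

Because both of Milo's visits land in the same fold, no model is ever evaluated on a cat it was trained on.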

Why an ensemble?

Required Inputs (5)

Same 5 features as v1.5a: 4 lab values from a standard chemistry + CBC panel, plus patient age. No Amylase, Cholesterol, T4, electrolyte panel, or SDMA required.

Serum Chemistry (2)

BUN | mg/dL
Creatinine | mg/dL

CBC & Demographics (3)

Hemoglobin (HGB) | g/dL
Abs. Lymphocytes | /μL
Patient Age | years

Hyperparameter Optimization

Extensive GPU-accelerated hyperparameter search across tree structure, regularization, and class weighting. The optimization objective minimizes clinically-weighted error cost with asymmetric penalties that prioritize sensitivity, enforcing a minimum 90% sensitivity target.

Output

Binary screening classification with uncertainty:

5. Performance Metrics

Temporal Holdout (n=248, never seen during training)

Each patient’s most recent visit, held out completely from training. This is the primary metric set — it represents expected real-world performance.

Metric | Value | What it measures | Holdout counts
Sensitivity | 92% | catches impaired cats | 98 of 107 impaired caught
Specificity | 60% | clears healthy cats | 84 of 141 adequate cleared
AUC-ROC | 0.86 | discrimination ability | threshold-independent
NPV | 90% | when cleared, truly healthy | 84 of 93 cleared are adequate

What 90% NPV means clinically: When this model clears a cat as “adequate,” there is a 90% probability that the cat truly has adequate concentrating ability, in a population like the holdout set (43% impaired). This is the number that matters most for clinical confidence: if the model says “this cat is probably fine,” it is correct 9 out of 10 times. NPV depends on prevalence, so it will be lower in populations where impairment is more common.

Out-of-Fold Cross-Validation (n=1,086)

Each prediction is from the one model (out of 7) that never saw this data point during training. Validates consistency across the full development set.

Metric | Value | Counts
Sensitivity (OOF) | 90.5% | 523 of 578 impaired
Specificity (OOF) | 57.3% | 291 of 508 adequate
AUC-ROC (OOF) | 0.874 | development set
NPV (OOF) | 84.1% | 291 of 346 cleared

OOF and holdout metrics are consistent (<2% sensitivity difference), confirming the model generalizes well. AUC-ROC of 0.86–0.87 across both sets demonstrates strong discrimination ability independent of the chosen threshold.

Understanding AUC-ROC: The Threshold-Independent View

AUC-ROC = 0.86 means: if you randomly pick one impaired cat and one adequate cat, the model correctly ranks the impaired cat as higher risk 86.2% of the time. This measures the model’s fundamental ability to distinguish sick from healthy, regardless of where the decision threshold is set.
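
The pairwise reading of AUC-ROC given above can be computed directly. The sketch below is a brute-force illustration of that definition; the function name `pairwise_auc` and the example scores are made up for the demonstration and are not study data.

```python
def pairwise_auc(impaired_scores, adequate_scores):
    """Fraction of (impaired, adequate) pairs ranked correctly; ties count half."""
    wins = 0.0
    for s_imp in impaired_scores:
        for s_adq in adequate_scores:
            if s_imp > s_adq:
                wins += 1.0
            elif s_imp == s_adq:
                wins += 0.5
    return wins / (len(impaired_scores) * len(adequate_scores))

# Three impaired and two adequate cats with illustrative model scores.
auc = pairwise_auc([0.9, 0.7, 0.4], [0.5, 0.3])
```

Here five of the six possible pairs are ordered correctly, so the result is 5/6. No threshold appears anywhere in the calculation, which is why the metric is described as threshold-independent.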

The threshold determines the sensitivity/specificity operating point along the ROC curve. We chose a point that prioritizes sensitivity (≥90%) over specificity because that is the clinically appropriate operating point for a screening tool.

6. Threshold & Clinical Trade-offs

The decision threshold was optimized via clinically-weighted search. The optimization objective: minimize missed impaired cats while keeping the false flag rate clinically manageable, with a target sensitivity ≥90%.
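
One simple way to realize a sensitivity-constrained threshold search is sketched below. It is an assumption-laden stand-in, not the v1.6 optimizer: the actual cost weights are not given in this document, so the hypothetical helper `pick_threshold` just returns the highest threshold that still meets the sensitivity target, which is the point that flags the fewest healthy cats under that constraint.

```python
def pick_threshold(scores, labels, min_sensitivity=0.90):
    """Return the highest threshold whose sensitivity meets the target.

    scores: model scores (higher = more likely impaired)
    labels: 1 for impaired, 0 for adequate
    A cat is flagged when score >= threshold.
    """
    n_impaired = sum(labels)
    best = None
    for t in sorted(set(scores)):
        caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if caught / n_impaired >= min_sensitivity:
            best = t  # thresholds ascend, so later hits flag fewer healthy cats
    return best

# Illustrative scores only; not study data.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]
strict = pick_threshold(scores, labels, min_sensitivity=0.90)
relaxed = pick_threshold(scores, labels, min_sensitivity=0.75)
```

Relaxing the target moves the threshold up, clearing more healthy cats but missing more impaired ones.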

Why We Prioritize Sensitivity Over Specificity

In screening contexts, the consequences of errors are asymmetric:

Error Type | What Happens | Consequence | Cost
False Negative (missed sick cat) | Impaired cat sent home without urinalysis | Delayed CKD diagnosis. Disease progresses unmonitored. Potential for irreversible nephron loss before next visit. | HIGH
False Positive (unnecessary flag) | Healthy cat recommended for urinalysis | Routine urinalysis performed. Cat confirmed healthy. Client gets peace of mind. No medical downside. | LOW

This asymmetry is not unique to our tool — it is the foundation of all clinical screening programs. Mammography, PSA testing, and fecal occult blood tests all operate at high-sensitivity/moderate-specificity because the cost of a missed diagnosis vastly exceeds the cost of additional follow-up testing.

The v1.6 clinical philosophy: This tool is a safety net, not a gatekeeper. Its job is to ensure that impaired cats don’t fall through the cracks when urinalysis isn’t performed. When it flags a cat, the appropriate response is a simple, inexpensive urine collection — not an invasive or costly procedure.

Result: On holdout data, only 9 of 107 impaired cats (8.4%) were missed. The 57 false flags represent healthy cats who would receive a confirmatory UA, a simple, minimal-cost test that takes ~5 minutes.

The Sensitivity-Specificity Trade-off in Context

Operating Point | Sensitivity | Specificity | Missed Cats (per 107) | Unnecessary UAs (per 141)
v1.6 (current) | 92% | 60% | 9 | 57
Balanced threshold | ~80% | ~75% | ~21 | ~35
High-specificity | ~70% | ~85% | ~32 | ~21

Moving from v1.6’s operating point to a “balanced” threshold would reduce unnecessary UAs by ~22, but miss 12 additional sick cats. Those 12 cats may not be diagnosed until their next visit months later, after further nephron loss.
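
The approximate counts for the alternative operating points follow from the holdout class sizes (107 impaired, 141 adequate). The short sketch below reproduces that arithmetic; `expected_errors` is a hypothetical helper, and the v1.6 row uses the observed counts rather than the rounded percentages.

```python
def expected_errors(sensitivity, specificity, n_impaired=107, n_adequate=141):
    """Expected missed cats and unnecessary UAs at a given operating point."""
    missed = round(n_impaired * (1 - sensitivity))
    unnecessary = round(n_adequate * (1 - specificity))
    return missed, unnecessary

observed_v16 = (9, 57)                       # counts from the holdout confusion matrix
balanced = expected_errors(0.80, 0.75)       # approximate operating point
high_specificity = expected_errors(0.70, 0.85)

fewer_uas = observed_v16[1] - balanced[1]    # unnecessary UAs avoided
extra_missed = balanced[0] - observed_v16[0] # additional impaired cats missed
```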

7. Conformal Prediction

New in v1.6: every prediction includes a conformal prediction set with a mathematically guaranteed coverage rate. This is a distribution-free method that tells you not just what the model predicts, but how confident it is.

How It Works

Instead of a single hard classification, the conformal layer outputs a set of possible labels:

Prediction Set | Meaning | Clinical Action
{impaired} | Model is confident this cat has impaired concentration | Strong recommendation for urinalysis
{adequate} | Model is confident this cat has adequate concentration | Low priority for urinalysis
{impaired, adequate} | Model is uncertain and cannot reliably distinguish | Urinalysis recommended (borderline case)

Coverage Guarantee

The conformal layer guarantees that the true label is contained in the prediction set at least 90% of the time (α = 0.10). This is a distribution-free guarantee: it holds regardless of the underlying data distribution, requiring only the assumption of exchangeable data. The guarantee is marginal, meaning it holds on average across cats rather than for each individual prediction.

Clinical value of “uncertain”: When the model outputs {impaired, adequate}, it is explicitly admitting it cannot make a reliable determination. These are the borderline cats where urinalysis is most diagnostically valuable — precisely the cases where the clinician’s judgment should be guided by additional testing rather than a model prediction.

Conformal prediction calibrated on out-of-fold predictions from the development set. Target coverage: 90%.
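
For readers unfamiliar with the technique, the following is a minimal sketch of standard split-conformal classification for two classes, using "1 minus the probability of the true label" as the nonconformity score. It is an assumption that v1.6 uses this particular score; the function names and calibration values are hypothetical and chosen only to illustrate the mechanics.

```python
import math

def conformal_quantile(cal_probs_impaired, cal_labels, alpha=0.10):
    """Calibrate the score cutoff from held-out (e.g. out-of-fold) predictions.

    Nonconformity score = 1 - probability assigned to the true label.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(cal_probs_impaired, cal_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    if rank > n:
        return 1.0  # too few calibration cases: every label is kept
    return scores[rank - 1]

def prediction_set(p_impaired, q_hat):
    """Keep every label whose nonconformity score is within the cutoff."""
    labels = set()
    if 1.0 - p_impaired <= q_hat:
        labels.add("impaired")
    if p_impaired <= q_hat:  # score for "adequate" is 1 - (1 - p) = p
        labels.add("adequate")
    return labels

# Illustrative calibration data: P(impaired) and true label (1 = impaired).
cal_probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.4, 0.35]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1]
q_hat = conformal_quantile(cal_probs, cal_labels)
```

With this calibration set the cutoff is about 0.65, so a score of 0.9 yields {impaired}, 0.2 yields {adequate}, and 0.5 yields both labels. When the cutoff falls below 0.5, scores near 0.5 can produce an empty set, which a production implementation would need to handle explicitly.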

8. Confusion Matrix

Temporal Holdout (n=248)

Actual \ Predicted | Adequate | Impaired
Actual Adequate (n=141) | 84 | 57
Actual Impaired (n=107) | 9 | 98

98 of 107 impaired cats caught (92%)
Only 9 impaired cats were cleared, a miss rate of 8.4%.
57 of 141 healthy cats flagged (40.4%)
These cats would receive a confirmatory urinalysis (a simple test with no medical risk) and be confirmed healthy.

Out-of-Fold (Development Set, n=1,086)

Actual \ Predicted | Adequate | Impaired
Actual Adequate (n=508) | 291 | 217
Actual Impaired (n=578) | 55 | 523

OOF results confirm the holdout pattern: 90.5% sensitivity (523/578), 57.3% specificity (291/508). Consistent performance across both evaluation sets.
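
The headline metrics can be re-derived from the two confusion matrices above. The sketch below does so with a small hypothetical helper, `screening_metrics`; the counts are copied from the matrices, and PPV is included only because it falls out of the same four numbers.

```python
def screening_metrics(tn, fp, fn, tp):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # impaired cats correctly flagged
        "specificity": tn / (tn + fp),  # adequate cats correctly cleared
        "npv": tn / (tn + fn),          # cleared cats that are truly adequate
        "ppv": tp / (tp + fp),          # flagged cats that are truly impaired
    }

holdout = screening_metrics(tn=84, fp=57, fn=9, tp=98)
oof = screening_metrics(tn=291, fp=217, fn=55, tp=523)
```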

9. Feature Importance & Physiological Basis

With only 5 features, the model concentrates its predictive power on the strongest renal and hematologic markers. BUN, Age, and Creatinine account for over 70% of the model’s total importance:

BUN | 30.0%
Patient Age | 22.0%
Creatinine | 20.0%
Abs. Lymphocytes | 15.0%
Hemoglobin | 13.0%

Physiological Interpretation

Analyte | Importance | Physiological Link to Urine Concentration
BUN | 30.0% | Primary marker of glomerular filtration rate. As GFR declines, BUN rises and concentrating ability diminishes. BUN also contributes to the medullary concentration gradient via urea recycling; elevated BUN paradoxically reflects the failing kidney’s inability to maintain this gradient.
Patient Age | 22.0% | CKD is progressive and age-dependent. In this dataset, 98% of cats over 18 years had impaired concentration vs 15% of cats aged 5–10. Age captures the cumulative renal decline that bloodwork alone may not fully reflect, including subclinical nephron loss.
Creatinine | 20.0% | Muscle-derived GFR marker. Co-regulated with BUN through renal excretion. Together with BUN, captures the primary renal axis.
Abs. Lymphocytes | 15.0% | Hematologic marker of immune status and systemic illness chronicity. CKD cats often develop lymphopenia as part of the chronic disease syndrome. Low lymphocyte counts correlate with disease severity and duration.
Hemoglobin | 13.0% | Reflects hydration status and erythropoietin production. Dehydrated cats have higher HGB and more concentrated urine. CKD cats develop non-regenerative anemia (low HGB) with concurrent loss of concentrating ability.

10. Prediction Explanations

Every prediction includes a per-feature explanation breakdown showing how each bloodwork value contributed to the result. In v1.6, explanation values are averaged across all ensemble models for more stable, reliable attributions.

How to read the explanation chart:

This transparency helps veterinarians understand which bloodwork values are driving the recommendation, rather than treating the model as a black box. For example, a cat might be flagged primarily because of elevated BUN and advanced age, even though its creatinine is still within normal range — the explanation chart makes this reasoning visible.
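
Averaging explanations across ensemble members, as described at the start of this section, amounts to a per-feature mean. The sketch below assumes each model yields a dict of signed per-feature contributions; the attribution method itself is not specified in this document, and `average_attributions` and the numbers shown are hypothetical.

```python
def average_attributions(per_model_attributions):
    """Average per-feature contributions across ensemble members."""
    features = per_model_attributions[0].keys()
    n = len(per_model_attributions)
    return {f: sum(a[f] for a in per_model_attributions) / n for f in features}

# Illustrative contributions from two ensemble members for one cat.
# Positive values push toward "impaired", negative toward "adequate".
per_model = [
    {"BUN": 0.40, "Age": 0.20, "Creatinine": -0.10},
    {"BUN": 0.20, "Age": 0.40, "Creatinine": -0.30},
]
averaged = average_attributions(per_model)
```

In this made-up case BUN and age push the averaged explanation toward "impaired" while creatinine pulls the other way, mirroring the example in the paragraph above.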

11. Version History

Version | Features | Sensitivity | Specificity | Key Change
v1.0 | 10 |  |  | Initial model, bloodwork only
v1.1 | 11 |  |  | Added patient age
v1.2 | 14 |  |  | Full feature set; regression + classification
v1.3 | 7 | 85.6% | 70.0% | Reduced to 7 fields; clinically-weighted error costs
v1.4 | 7 | 85.6% | 73.7% | Hyperparameter tuning; classification only
v1.5a | 5 | 84.9%† | 75.5%† | Dropped Amylase & Cholesterol; per-prediction explanations
v1.6 | 5 | 92% | 60% | Multi-model ensemble; patient-independent validation; temporal holdout; conformal prediction; ≥90% sensitivity target

†v1.5a used row-level cross-validation with patient overlap between splits. v1.6 is the first version with publication-grade patient-independent validation.

v1.5a → v1.6: Complete methodological overhaul. Enforced strict patient-independent validation with temporal holdout, built multi-model ensemble with conformal prediction. Shifted optimization target from balanced accuracy to ≥90% sensitivity because a screening tool’s primary obligation is to not miss sick patients. AUC-ROC of 0.86 confirms strong underlying discrimination — the sensitivity/specificity balance reflects a deliberate clinical choice about where to operate on the ROC curve.

12. Limitations

This tool is a screening estimate, not a diagnostic test. It should inform clinical decision-making, not replace urinalysis.

Limitation | Clinical Impact | Mitigation | Status
Patient overlap in validation | Resolved in v1.6. v1.5a used row-level splits that allowed patient overlap. v1.6 enforces strict patient-level separation + temporal holdout. | Resolved | Complete
Amylase/Cholesterol availability | Resolved in v1.5a. Dropped to 5 universal features. | Resolved | Complete
~40% false flag rate on healthy cats | 60% specificity means ~40% of healthy cats are flagged for urinalysis. This is a deliberate trade-off for ≥90% sensitivity. Each false flag results in a routine urinalysis with no medical downside and provides the client peace of mind. | Threshold is adjustable per-clinic. More conservative clinics can raise the threshold at the cost of sensitivity. Conformal prediction identifies uncertain cases to help triage. | By design
Single-practice, single-species dataset | Trained on 1,334 cases with complete bloodwork and USG, drawn from 3,642 feline lab reports at one hospital. External validation is required before broader deployment. | Pursuing collaboration with Texas A&M Veterinary Medical Teaching Hospital. Target: 2–3 external validation datasets. | Oct 2026
No urinalysis replacement | USG is one component of urinalysis. Sediment, protein, culture, and pH provide independent diagnostic information. | By design. This is a screening triage tool, not a UA replacement. | N/A
Pre-renal and post-renal effects | Dehydration elevates BUN and concentrates urine simultaneously. The model may conflate pre-renal and intrinsic renal causes. | Add hydration status and recent fluid therapy as optional inputs in a future version. Explore BUN/Creatinine ratio as engineered feature. | Q1 2027
No prospective outcome data | No data yet showing that flagging cats leads to earlier diagnosis or improved outcomes. | Feline-only clinic pilot tracking every flag: UA performed, USG result, diagnosis at 6 and 12 months. | Mar 2027

13. Study Population

Parameter | Value
Source | Feline-only clinic, Houston, TX
Date Range | January 2022 – February 2026
Species | 100% Feline
Total Lab Reports | 3,642
Reports with Urinalysis | 1,506 (41%)
Cases Used for v1.6 | 1,334 (complete bloodwork + USG)
Development Set | 1,086 cases (patient-grouped CV)
Temporal Holdout | 248 cases (most recent visit per patient)
USG Range in Dataset | 1.005 – 1.086
USG Mean / Median | 1.036 / 1.034
Development Class Balance | 578 Impaired / 508 Adequate (53% / 47%)
Holdout Class Balance | 107 Impaired / 141 Adequate (43% / 57%)
Validation Method | Patient-grouped stratified CV + temporal holdout
Patient Grouping | pet_name + owner (prevents same cat in train + val)
Holdout Strategy | Smart temporal: last visit per patient → holdout

Age Distribution

Age Group | n | Mean USG | % Impaired (<1.035)
Under 5 years | 5 | 1.049 | 20%
5–10 years | 100 | 1.047 | 15%
10–14 years | 556 | 1.043 | 29%
14–18 years | 553 | 1.028 | 73%
Over 18 years | 81 | 1.019 | 98%

Model v1.6  |  April 2026  |  Feline-only clinic  |  For investigational and research use  |  Not validated for clinical deployment