Feline USG Predictor v1.6

Estimation of Urine Specific Gravity from Serum Chemistry, CBC, and Patient Age | Feline-only clinic | April 2026

What’s new in v1.6: Complete methodological redesign optimized for clinical safety. Multi-model ensemble with patient-grouped cross-validation ensures no patient appears in both training and evaluation. Temporal holdout validation on 248 never-before-seen cases provides honest, unbiased performance estimates. Conformal prediction adds statistically guaranteed uncertainty quantification. Sensitivity target raised to ≥90% — in a screening tool, catching sick cats is the primary objective. Holdout: 92% sensitivity, 60% specificity, 0.86 AUC-ROC, 90% NPV.

Contents

Clinical Rationale
What Changed in v1.6
Validation Methodology
Algorithm & Architecture
Performance Metrics
Threshold & Clinical Trade-offs
Conformal Prediction
Confusion Matrix
Feature Importance
Prediction Explanations
Version History
Limitations
Study Population

1. Clinical Rationale

Urine Specific Gravity is a cornerstone of feline renal assessment. IRIS staging of Chronic Kidney Disease incorporates USG alongside creatinine and SDMA to differentiate stages and guide management. Loss of concentrating ability (USG <1.035 in cats) is frequently among the earliest detectable signs of tubular dysfunction, often preceding azotemia.

However, urine collection is not always achievable at the time of presentation. An empty bladder, patient temperament, contraindications to cystocentesis (coagulopathy, abdominal masses), or client constraints may preclude urinalysis.

The Gap: In our dataset of 3,642 feline visits over 4 years, only 1,506 (41%) included a urinalysis. In the remaining 2,136 visits, renal concentrating ability was unknown despite bloodwork being available.

This model estimates USG from serum chemistry, CBC, and patient age — values already being collected on the same blood draw — providing a screening estimate of concentrating ability at zero additional cost, zero additional procedure time, and zero additional client charge.

2. What Changed in v1.6

Two fundamental improvements in v1.6:

Methodological rigor: v1.5a used standard row-level cross-validation, which allowed the same patient’s multiple visits to appear in both training and evaluation splits. v1.6 enforces strict patient-level separation and temporal holdout validation, producing metrics that honestly reflect real-world deployment performance.
Clinical optimization: v1.6 is explicitly optimized for ≥90% sensitivity. For a screening tool, catching sick cats is the highest priority. The cost function penalizes missed impaired cats more heavily than false flags, reflecting the clinical reality that a missed kidney diagnosis is far more consequential than an unnecessary urinalysis recommendation.

Summary of Changes

Aspect	v1.5a	v1.6
Architecture	Single model	Multi-model ensemble
Validation	Row-level CV (patient overlap)	Patient-grouped CV + temporal holdout
Uncertainty	None	Conformal prediction (90% coverage)
Optimization Target	Balanced accuracy	≥90% sensitivity
Hyperparameter Search	Limited	Extensive GPU-accelerated search
Holdout Sensitivity	81.0%*	92%
Holdout Specificity	73.9%*	60%
Holdout AUC-ROC	—	0.86
Holdout NPV	—	90%
Input Features	5	5 (unchanged)

*v1.5a used row-level cross-validation where the same patient’s multiple visits could appear in both training and evaluation. Its metrics are not directly comparable to v1.6’s patient-independent evaluation.

Why Sensitivity Went Up and Specificity Went Down

This is a deliberate design decision, not a trade-off we stumbled into. v1.6 was explicitly optimized to achieve ≥90% sensitivity because this is a screening tool. The fundamental question is:

“Which error is more acceptable: sending a healthy cat for a minimal-cost urinalysis, or sending a sick cat home without detection?”

With 92% sensitivity on the holdout set, only 9 out of 107 impaired cats were missed. The trade-off is that 57 out of 141 healthy cats are flagged for urinalysis they may not need. For those 57 cats, the consequence is a minimal-cost add-on test that confirms they are healthy. For the 98 impaired cats correctly flagged, the consequence is early detection and intervention.

3. Validation Methodology

v1.6 implements a rigorous validation strategy designed to eliminate optimistic bias and produce publication-grade metrics:

Three-Level Data Split

Level	Purpose	Size	Access
Development Set	Model training + threshold tuning	1,086 cases	Used during training via cross-validation
Temporal Holdout	Final unbiased evaluation	248 cases	Never touched until final evaluation

Smart Temporal Split

The holdout set is constructed by taking each patient’s most recent visit and reserving it for final evaluation. This simulates real deployment: the model trains on historical data and is evaluated on the most recent encounter for each cat — exactly the scenario it faces in clinical use.

What this means for reported metrics: v1.6’s holdout numbers represent what the model achieves on genuinely unseen, recent data with strict patient independence. These metrics are trustworthy estimates of real-world deployment performance.
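The two splitting ideas above (grouping by patient, and reserving the most recent visit) can be sketched as follows. This is a minimal illustration, not the clinic's pipeline: the record fields `pet_name`, `owner`, and `date`, the function names, and the example visits are all hypothetical. The round-robin fold assignment ignores class stratification, and what happens to a holdout patient's earlier visits is not specified in this document, so the sketch leaves that step out.

```python
from datetime import date

def patient_key(visit):
    # Grouping key as listed under Study Population: pet_name + owner.
    return (visit["pet_name"], visit["owner"])

def latest_visit_per_patient(visits):
    """Return each patient's most recent visit (candidate holdout rows)."""
    latest = {}
    for v in visits:
        k = patient_key(v)
        if k not in latest or v["date"] > latest[k]["date"]:
            latest[k] = v
    return list(latest.values())

def assign_patient_folds(visits, n_folds=7):
    """Give every visit of the same patient the same fold index.

    Simplification: round-robin over sorted patient keys, no stratification.
    """
    patients = sorted({patient_key(v) for v in visits})
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    return [fold_of[patient_key(v)] for v in visits]

# Made-up example records, for illustration only.
visits = [
    {"pet_name": "Miso", "owner": "Lee", "date": date(2023, 3, 1)},
    {"pet_name": "Miso", "owner": "Lee", "date": date(2025, 6, 9)},
    {"pet_name": "Olive", "owner": "Ruiz", "date": date(2024, 1, 15)},
]
holdout_candidates = latest_visit_per_patient(visits)
folds = assign_patient_folds(visits, n_folds=2)
```

Because both of Miso's visits share one fold index, no model is trained on one of that cat's visits and validated on another.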

4. Algorithm & Architecture

Multi-Model Ensemble

v1.6 uses an ensemble of independently trained machine learning models. Each model is trained with a different cross-validation fold held out, so each sees a slightly different subset of the development data. At prediction time, all models produce scores that are averaged for the final prediction.

Why an ensemble?

Lower variance: Averaging multiple models reduces variance and produces more stable score estimates (note: scores are uncalibrated — they are ranking signals, not probabilities)
Natural uncertainty signal: When models disagree (high standard deviation), the prediction is less certain
Patient independence: Each data point has exactly one out-of-fold prediction from a model that never saw it
Robustness: No single model’s quirks dominate the output
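The averaging and disagreement signal described above can be sketched in a few lines. This is an illustration under assumptions: the function name and the seven example scores are made up, and the document does not say whether the reported standard deviation is the population or sample form (the sketch uses the population form).

```python
from statistics import mean, pstdev

def ensemble_summary(model_scores):
    """Combine per-model adequacy scores into (ensemble mean, disagreement).

    model_scores: one uncalibrated 0-1 adequacy score per ensemble model.
    The standard deviation is the uncertainty signal: larger = models disagree.
    """
    return mean(model_scores), pstdev(model_scores)

# Hypothetical scores from seven models for one cat.
scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.64, 0.67]
ensemble_score, ensemble_sd = ensemble_summary(scores)
```

When every model returns the same score the standard deviation is zero; the wider the spread, the less weight a single averaged score deserves.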

Required Inputs (5)

Same 5 features as v1.5a: 4 lab values from a standard chemistry + CBC panel, plus patient age. No Amylase, Cholesterol, T4, electrolyte panel, or SDMA required.

Serum Chemistry (2)

BUN	mg/dL
Creatinine	mg/dL

CBC & Demographics (3)

Hemoglobin (HGB)	g/dL
Abs. Lymphocytes	/μL
Patient Age	years

Hyperparameter Optimization

Extensive GPU-accelerated hyperparameter search across tree structure, regularization, and class weighting. The optimization objective minimizes clinically-weighted error cost with asymmetric penalties that prioritize sensitivity, enforcing a minimum 90% sensitivity target.

Output

Binary screening classification with uncertainty:

Classification: Adequate (≥1.035) or Impaired (<1.035)
Adequacy score on a 0–1 scale (ensemble mean); higher = more likely adequate. Uncalibrated — treat as a ranking signal, not a probability.
Ensemble uncertainty (standard deviation across models)
Conformal prediction set: {impaired}, {adequate}, or {impaired, adequate} = uncertain
Per-feature SHAP explanations showing which values drive the prediction

5. Performance Metrics

Temporal Holdout (n=248, never seen during training)

Each patient’s most recent visit, held out completely from training. This is the primary metric set — it represents expected real-world performance.

Metric	Value	Meaning	Detail
Sensitivity	92%	Catches impaired cats	98 of 107 impaired caught
Specificity	60%	Clears healthy cats	84 of 141 adequate cleared
AUC-ROC	0.86	Discrimination ability	Threshold-independent
NPV	90%	When cleared, truly healthy	84 of 93 cleared are adequate

What 90% NPV means clinically: When this model clears a cat as “adequate,” there is a 90% probability that the cat truly has adequate concentrating ability. This is the number that matters most for clinical confidence — if the model says “this cat is probably fine,” it is correct 9 out of 10 times.

Out-of-Fold Cross-Validation (n=1,086)

Each prediction is from the one model (out of 7) that never saw this data point during training. Validates consistency across the full development set.

Metric (OOF)	Value	Detail
Sensitivity	90.5%	523 of 578 impaired
Specificity	57.3%	291 of 508 adequate
AUC-ROC	0.874	Development set
NPV	84.1%	291 of 346 cleared

OOF and holdout metrics are consistent (sensitivity differs by less than 2 percentage points), suggesting the model generalizes well. AUC-ROC of 0.86–0.87 across both sets indicates strong discrimination ability independent of the chosen threshold.

Understanding AUC-ROC: The Threshold-Independent View

AUC-ROC = 0.86 means: if you randomly pick one impaired cat and one adequate cat, the model correctly ranks the impaired cat as higher risk 86.2% of the time. This measures the model’s fundamental ability to distinguish sick from healthy, regardless of where the decision threshold is set.
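The pairwise reading of AUC-ROC can be checked directly by counting pairs. The sketch below is illustrative: the function name and the six scores are made up, and it assumes the adequacy-score convention from the Output section (lower score = higher risk of impairment). Ties are counted as half, which is the usual convention.

```python
def pairwise_auc(impaired_scores, adequate_scores):
    """Fraction of (impaired, adequate) pairs ranked correctly; ties count half.

    Scores are adequacy scores, so a correct ranking means the impaired cat
    has the LOWER score of the pair.
    """
    wins = 0.0
    for s_imp in impaired_scores:
        for s_adq in adequate_scores:
            if s_imp < s_adq:
                wins += 1.0
            elif s_imp == s_adq:
                wins += 0.5
    return wins / (len(impaired_scores) * len(adequate_scores))

# Hypothetical adequacy scores for three impaired and three adequate cats.
impaired = [0.10, 0.35, 0.60]
adequate = [0.30, 0.70, 0.90]
auc = pairwise_auc(impaired, adequate)  # 7 of 9 pairs ranked correctly
```

No threshold appears anywhere in the calculation, which is why AUC-ROC is described as threshold-independent.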

The threshold determines the sensitivity/specificity operating point along the ROC curve. We chose a point that keeps sensitivity at or above 90% because that is the clinically appropriate operating point for a screening tool.

6. Threshold & Clinical Trade-offs

The decision threshold was optimized via clinically-weighted search. The optimization objective: minimize missed impaired cats while keeping the false flag rate clinically manageable, with a target sensitivity ≥90%.
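A clinically-weighted threshold search of this kind might look like the sketch below. It is an assumption-laden illustration, not the actual optimizer: the cost weights `fn_cost` and `fp_cost` are hypothetical (the real weights are not given here), the function name is made up, and the input is taken to be a risk score (1 minus the adequacy score) so that raising the threshold lowers sensitivity.

```python
def pick_threshold(risk_scores, is_impaired, fn_cost=5.0, fp_cost=1.0,
                   min_sensitivity=0.90):
    """Return (threshold, cost) with the lowest weighted error cost among
    thresholds that meet the sensitivity floor.

    A cat is flagged as impaired when its risk score is >= threshold.
    Requires at least one impaired case in the data.
    """
    n_impaired = sum(is_impaired)
    best = None
    for t in sorted(set(risk_scores)):
        fn = sum(1 for r, y in zip(risk_scores, is_impaired) if y and r < t)
        fp = sum(1 for r, y in zip(risk_scores, is_impaired) if not y and r >= t)
        sensitivity = (n_impaired - fn) / n_impaired
        if sensitivity < min_sensitivity:
            continue  # violates the sensitivity target; skip this threshold
        cost = fn_cost * fn + fp_cost * fp
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Made-up risk scores and labels for ten cats.
risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
impaired = [True, True, True, False, True, False, False, False, False, False]
threshold, cost = pick_threshold(risk, impaired)
```

In this toy data the search settles on 0.5: any higher threshold misses the impaired cat scored 0.5 and drops sensitivity to 75%, while any lower threshold only adds false flags.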

Why We Prioritize Sensitivity Over Specificity

In screening contexts, the consequences of errors are asymmetric:

Error Type	What Happens	Consequence	Cost
False Negative (missed sick cat)	Impaired cat sent home without urinalysis	Delayed CKD diagnosis. Disease progresses unmonitored. Potential for irreversible nephron loss before next visit.	HIGH
False Positive (unnecessary flag)	Healthy cat recommended for urinalysis	Routine urinalysis performed. Cat confirmed healthy. Client gets peace of mind. No medical downside.	LOW

This asymmetry is not unique to our tool. Established screening programs such as mammography, PSA testing, and fecal occult blood testing face the same trade-off between the cost of a missed diagnosis and the cost of additional follow-up testing.

The v1.6 clinical philosophy: This tool is a safety net, not a gatekeeper. Its job is to ensure that impaired cats don’t fall through the cracks when urinalysis isn’t performed. When it flags a cat, the appropriate response is a simple, inexpensive urine collection — not an invasive or costly procedure.

Result: On holdout data, only 9 of 107 impaired cats (8.4%) were missed. The 57 false flags represent healthy cats who would receive a confirmatory UA, a simple, minimal-cost test that takes ~5 minutes.

The Sensitivity-Specificity Trade-off in Context

Operating Point	Sensitivity	Specificity	Missed Cats (per 107)	Unnecessary UAs (per 141)
v1.6 (current)	92%	60%	9	57
Balanced threshold	~80%	~75%	~21	~35
High-specificity	~70%	~85%	~32	~21

Moving from v1.6’s operating point to a “balanced” threshold would avoid ~22 unnecessary UAs (57 vs ~35) but miss ~12 additional sick cats (~21 vs 9). Those ~12 cats may not be diagnosed until their next visit months later, after further nephron loss.

7. Conformal Prediction

New in v1.6: every prediction includes a conformal prediction set with a mathematically guaranteed coverage rate. This is a distribution-free method that tells you not just what the model predicts, but how confident it is.

How It Works

Instead of a single hard classification, the conformal layer outputs a set of possible labels:

Prediction Set	Meaning	Clinical Action
{impaired}	Model is confident this cat has impaired concentration	Strong recommendation for urinalysis
{adequate}	Model is confident this cat has adequate concentration	Low priority for urinalysis
{impaired, adequate}	Model is uncertain — cannot reliably distinguish	Urinalysis recommended (borderline case)

Coverage Guarantee

The conformal layer is designed so that the true label is contained in the prediction set at least 90% of the time (α = 0.10), averaged over cases. This guarantee is distribution-free: it holds regardless of the underlying data distribution, requiring only that calibration cases and future cases are exchangeable.

Clinical value of “uncertain”: When the model outputs {impaired, adequate}, it is explicitly admitting it cannot make a reliable determination. These are the borderline cats where urinalysis is most diagnostically valuable — precisely the cases where the clinician’s judgment should be guided by additional testing rather than a model prediction.

Conformal prediction calibrated on out-of-fold predictions from the development set. Target coverage: 90%.
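One common way to build such prediction sets is split conformal prediction, sketched below. This is an illustration under assumptions, not a description of the deployed layer: the nonconformity score (1 minus the score assigned to the true label) is a standard choice that this document does not confirm, and the function names and calibration values are made up.

```python
import math

def conformal_quantile(calib_scores, calib_labels, alpha=0.10):
    """Calibrate on held-out (adequacy score, true label) pairs.

    Nonconformity = 1 - score given to the TRUE label:
      'adequate' -> 1 - adequacy_score, 'impaired' -> adequacy_score.
    """
    nonconf = sorted(
        1.0 - s if y == "adequate" else s
        for s, y in zip(calib_scores, calib_labels)
    )
    n = len(nonconf)
    # Finite-sample rank; round() guards against floating-point noise.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return float("inf")  # too few calibration cases: every label is kept
    return nonconf[rank - 1]

def prediction_set(adequacy_score, qhat):
    """Keep every label whose nonconformity is within the calibrated quantile."""
    labels = set()
    if 1.0 - adequacy_score <= qhat:
        labels.add("adequate")
    if adequacy_score <= qhat:
        labels.add("impaired")
    return labels

# Made-up calibration data: five adequate cats, four impaired cats.
calib_scores = [0.9, 0.8, 0.7, 0.95, 0.85, 0.1, 0.2, 0.3, 0.6]
calib_labels = ["adequate"] * 5 + ["impaired"] * 4
qhat = conformal_quantile(calib_scores, calib_labels)
```

With this toy calibration set, a score of 0.95 yields {adequate}, 0.1 yields {impaired}, and 0.5 yields both labels. Under this scoring rule a quantile below 0.5 could also produce an empty set for mid-range scores; the table above does not list that case, and how the tool handles it is not stated here.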

8. Confusion Matrix

Temporal Holdout (n=248)

		Predicted
		Adequate	Impaired
Actual	Adequate (n=141)	84	57
Actual	Impaired (n=107)	9	98

98 of 107 impaired cats caught (92%)
Only 9 impaired cats were cleared — a miss rate of 8.4%.

57 of 141 healthy cats flagged (40.4%)
These cats would receive a confirmatory urinalysis (a simple test with no medical risk) and be confirmed healthy.

Out-of-Fold (Development Set, n=1,086)

		Predicted
		Adequate	Impaired
Actual	Adequate (n=508)	291	217
Actual	Impaired (n=578)	55	523

OOF results confirm the holdout pattern: 90.5% sensitivity (523/578), 57.3% specificity (291/508). Consistent performance across both evaluation sets.
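The headline metrics follow directly from the four cells of each matrix, with "impaired" as the positive class. The short sketch below recomputes them from the counts in the two tables above; the function name is illustrative.

```python
def screening_metrics(tn, fp, fn, tp):
    """Sensitivity, specificity, and NPV with 'impaired' as the positive class.

    tn: adequate cats cleared      fp: adequate cats flagged
    fn: impaired cats cleared      tp: impaired cats flagged
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }

# Counts taken from the two confusion matrices above.
holdout = screening_metrics(tn=84, fp=57, fn=9, tp=98)
oof = screening_metrics(tn=291, fp=217, fn=55, tp=523)
```

Rounded to three decimals, the holdout values are 0.916, 0.596, and 0.903, and the out-of-fold values are 0.905, 0.573, and 0.841, matching the percentages reported in Section 5.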

9. Feature Importance & Physiological Basis

With only 5 features, the model concentrates its predictive power on the strongest renal and hematologic markers. BUN, Age, and Creatinine account for over 70% of the model’s total importance:

Feature	Importance
BUN	30.0%
Patient Age	22.0%
Creatinine	20.0%
Abs. Lymphocytes	15.0%
Hemoglobin	13.0%

Physiological Interpretation

Analyte	Importance	Physiological Link to Urine Concentration
BUN	30.0%	Primary marker of glomerular filtration rate. As GFR declines, BUN rises and concentrating ability diminishes. BUN also contributes to the medullary concentration gradient via urea recycling — elevated BUN paradoxically reflects the failing kidney’s inability to maintain this gradient.
Patient Age	22.0%	CKD is progressive and age-dependent. In this dataset, 98% of cats over 18 years had impaired concentration vs 15% of cats aged 5–10. Age captures the cumulative renal decline that bloodwork alone may not fully reflect, including subclinical nephron loss.
Creatinine	20.0%	Muscle-derived GFR marker. Co-regulated with BUN through renal excretion. Together with BUN, captures the primary renal axis.
Abs. Lymphocytes	15.0%	Hematologic marker of immune status and systemic illness chronicity. CKD cats often develop lymphopenia as part of the chronic disease syndrome. Low lymphocyte counts correlate with disease severity and duration.
Hemoglobin	13.0%	Reflects hydration status and erythropoietin production. Dehydrated cats have higher HGB and more concentrated urine. CKD cats develop non-regenerative anemia (low HGB) with concurrent loss of concentrating ability.

10. Prediction Explanations

Every prediction includes a per-feature explanation breakdown showing how each bloodwork value contributed to the result. In v1.6, explanation values are averaged across all ensemble models for more stable, reliable attributions.

How to read the explanation chart:

Each bar represents one input value from this patient’s bloodwork
Red bars push the prediction toward Impaired (flag for urinalysis)
Green bars push the prediction toward Adequate (clear)
Longer bars = stronger influence on this prediction
The actual patient value is shown next to each feature name

This transparency helps veterinarians understand which bloodwork values are driving the recommendation, rather than treating the model as a black box. For example, a cat might be flagged primarily because of elevated BUN and advanced age, even though its creatinine is still within normal range — the explanation chart makes this reasoning visible.
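Averaging explanation values across ensemble members amounts to a per-feature mean. The sketch below shows only that averaging step and does not call any explanation library. The function name, the feature values, and the sign convention (negative pushes toward Impaired, positive toward Adequate) are assumptions for illustration.

```python
def average_attributions(per_model):
    """Average per-feature contributions across ensemble models.

    per_model: list of {feature_name: contribution} dicts, one per model,
    all sharing the same feature names.
    """
    features = per_model[0].keys()
    return {f: sum(m[f] for m in per_model) / len(per_model) for f in features}

# Hypothetical attributions from two models for one cat.
# Assumed convention: negative pushes toward Impaired, positive toward Adequate.
per_model = [
    {"BUN": -0.30, "Age": -0.10, "Creatinine": 0.05},
    {"BUN": -0.20, "Age": -0.20, "Creatinine": 0.15},
]
averaged = average_attributions(per_model)
```

In this made-up case BUN and age push the averaged explanation toward Impaired while creatinine pushes mildly toward Adequate, mirroring the example in the paragraph above.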

11. Version History

Version	Features	Sensitivity	Specificity	Key Change
v1.0	10	—	—	Initial model — bloodwork only
v1.1	11	—	—	Added patient age
v1.2	14	—	—	Full feature set; regression + classification
v1.3	7	85.6%	70.0%	Reduced to 7 fields; clinically-weighted error costs
v1.4	7	85.6%	73.7%	Hyperparameter tuning; classification only
v1.5a	5	84.9%†	75.5%†	Dropped Amylase & Cholesterol; per-prediction explanations
v1.6	5	92%	60%	Multi-model ensemble; patient-independent validation; temporal holdout; conformal prediction; ≥90% sensitivity target

†v1.5a used row-level cross-validation with patient overlap between splits. v1.6 is the first version with publication-grade patient-independent validation.

v1.5a → v1.6: Complete methodological overhaul. Enforced strict patient-independent validation with temporal holdout, built multi-model ensemble with conformal prediction. Shifted optimization target from balanced accuracy to ≥90% sensitivity because a screening tool’s primary obligation is to not miss sick patients. AUC-ROC of 0.86 confirms strong underlying discrimination — the sensitivity/specificity balance reflects a deliberate clinical choice about where to operate on the ROC curve.

12. Limitations

This tool is a screening estimate, not a diagnostic test. It should inform clinical decision-making, not replace urinalysis.

Limitation	Clinical Impact	Mitigation	Status
Patient overlap in validation	Resolved in v1.6. v1.5a used row-level splits that allowed patient overlap. v1.6 enforces strict patient-level separation + temporal holdout.	Resolved	Complete
Amylase/Cholesterol availability	Resolved in v1.5a. Dropped to 5 universal features.	Resolved	Complete
~40% false flag rate on healthy cats	60% specificity means ~40% of healthy cats are flagged for urinalysis. This is a deliberate trade-off for ≥90% sensitivity.	Each false flag results in a routine urinalysis with no medical downside — and provides the client peace of mind. Threshold is adjustable per-clinic. More conservative clinics can raise the threshold at the cost of sensitivity. Conformal prediction identifies uncertain cases to help triage.	By design
Single-practice, single-species dataset	Developed on 1,334 feline cases (drawn from 3,642 lab reports) from one hospital. External validation is required before broader deployment.	Pursuing collaboration with Texas A&M Veterinary Medical Teaching Hospital. Target: 2–3 external validation datasets.	Oct 2026
No urinalysis replacement	USG is one component of urinalysis. Sediment, protein, culture, and pH provide independent diagnostic information.	By design. This is a screening triage tool, not a UA replacement.	N/A
Pre-renal and post-renal effects	Dehydration elevates BUN and concentrates urine simultaneously. The model may conflate pre-renal and intrinsic renal causes.	Add hydration status and recent fluid therapy as optional inputs in a future version. Explore BUN/Creatinine ratio as engineered feature.	Q1 2027
No prospective outcome data	No data yet showing that flagging cats leads to earlier diagnosis or improved outcomes.	Feline-only clinic pilot tracking every flag: UA performed, USG result, diagnosis at 6 and 12 months.	Mar 2027

13. Study Population

Parameter	Value
Source	Feline-only clinic, Houston, TX
Date Range	January 2022 – February 2026
Species	100% Feline
Total Lab Reports	3,642
Reports with Urinalysis	1,506 (41%)
Cases Used for v1.6	1,334 (complete bloodwork + USG)
Development Set	1,086 cases (patient-grouped CV)
Temporal Holdout	248 cases (most recent visit per patient)
USG Range in Dataset	1.005 – 1.086
USG Mean / Median	1.036 / 1.034
Development Class Balance	578 Impaired / 508 Adequate (53% / 47%)
Holdout Class Balance	107 Impaired / 141 Adequate (43% / 57%)
Validation Method	Patient-grouped stratified CV + temporal holdout
Patient Grouping	pet_name + owner (prevents same cat in train + val)
Holdout Strategy	Smart temporal: last visit per patient → holdout

Age Distribution

Age Group	n	Mean USG	% Impaired (<1.035)
Under 5 years	5	1.049	20%
5–10 years	100	1.047	15%
10–14 years	556	1.043	29%
14–18 years	553	1.028	73%
Over 18 years	81	1.019	98%

Model v1.6 | April 2026 | Feline-only clinic | For investigational and research use | Not validated for clinical deployment