Step 4: Evaluate models
After creating your experiment version, multiple models are generated and trained.
The top-performing model (topModelId) is returned by the API, but you can evaluate and compare all generated models to select the one that best aligns with your business objective, for example, identifying customers at risk of churn as precisely as possible.
List models
Retrieve all models generated by an experiment version with the following API call:
curl -L "https://<TENANT>/api/v1/ml/experiments/{experimentId}/models" ^
-H "Content-Type: application/json" ^
-H "Accept: application/json" ^
-H "Authorization: Bearer <ACCESS_TOKEN>"
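If you prefer to script this step, the same call can be sketched in Python using only the standard library. This is a minimal, illustrative equivalent of the curl command above, assuming the same `<TENANT>` host and bearer token; error handling and any pagination are omitted.

```python
import json
import urllib.request

def models_url(tenant: str, experiment_id: str) -> str:
    """Build the endpoint that lists all models for an experiment."""
    return f"https://{tenant}/api/v1/ml/experiments/{experiment_id}/models"

def list_models(tenant: str, experiment_id: str, access_token: str) -> list:
    """Fetch every model generated for the experiment and return the 'data' array."""
    req = urllib.request.Request(
        models_url(tenant, experiment_id),
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]
```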
Response example
{
"data": [
{
"type": "model",
"id": "db823841-c551-4e18-84e3-d3a2a98edd60",
"attributes": {
"id": "db823841-c551-4e18-84e3-d3a2a98edd60",
"experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
"createdAt": "2024-11-27T13:19:36.403719Z",
"updatedAt": "2024-11-27T13:19:46.442265Z",
"batchNum": 1,
"algorithm": "random_forest_classifier",
"name": "v03_RAFC_01_01",
"description": null,
"seqNum": 1,
"algoAbbrv": "RAFC",
"status": "ready",
"errorMessage": null,
"metrics": {
"binary": {
"truePositive": 311,
"falsePositive": 103,
"falseNegative": 54,
"trueNegative": 2060,
"accuracy": 0.9375,
"mcc": 0.7621741380656772,
"auc": 0.9629434005281857,
"logLoss": 0.17645366167456517,
"missRate": 0.14794520547945206,
"fallout": 0.047619047619047616,
"npv": 0.9744560075685903,
"specificity": 0.9523809523809523,
"recall": 0.852054794520548,
"precision": 0.751207729468599,
"f1": 0.7984595635430038,
"threshold": 0.30128437693042176,
"truePositiveTest": 77,
"falsePositiveTest": 30,
"falseNegativeTest": 14,
"trueNegativeTest": 511,
"accuracyTest": 0.930379746835443,
"mccTest": 0.7402187229033519,
"aucTest": 0.9540533403749669,
"logLossTest": 0.1789144736636058,
"missRateTest": 0.15384615384615385,
"falloutTest": 0.05545286506469501,
"npvTest": 0.9733333333333334,
"specificityTest": 0.944547134935305,
"recallTest": 0.8461538461538461,
"precisionTest": 0.719626168224299,
"f1Test": 0.7777777777777778,
"thresholdTest": 0.30128437693042176
}
},
"hpoNum": null,
"droppedFeatures": [
{
"name": "DaysSinceLastService",
"reason": "has_target_leakage"
}
],
"samplingRatio": 1,
"columns": [
"Territory",
"DeviceType",
"Promotion",
"HasRenewed",
"PlanType",
"BaseFee",
"AdditionalFeatureSpend",
"NumberOfPenalties",
"CurrentPeriodUsage",
"PriorPeriodUsage",
"ServiceRating",
"ServiceTickets",
"StartMonth",
"StartWeek",
"CustomerTenure",
"Churned"
],
"modelState": "inactive"
}
},
{
"type": "model",
"id": "6953dbe3-c997-4f82-af77-494bda9b1247",
"attributes": {
"id": "6953dbe3-c997-4f82-af77-494bda9b1247",
"experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
"createdAt": "2024-11-27T13:19:36.391646Z",
"updatedAt": "2024-11-27T13:19:44.029329Z",
"batchNum": 1,
"algorithm": "random_forest_classifier",
"name": "v03_RAFC_01_00",
"description": null,
"seqNum": 0,
"algoAbbrv": "RAFC",
"status": "ready",
"errorMessage": null,
"metrics": {
"binary": {
"truePositive": 275,
"falsePositive": 58,
"falseNegative": 90,
"trueNegative": 2105,
"accuracy": 0.9410601265822784,
"mcc": 0.753269275386698,
"auc": 0.9583315917136904,
"logLoss": 0.16340603954795424,
"missRate": 0.2465753424657534,
"fallout": 0.026814609338881183,
"npv": 0.958997722095672,
"specificity": 0.9731853906611189,
"recall": 0.7534246575342466,
"precision": 0.8258258258258259,
"f1": 0.7879656160458453,
"threshold": 0.4284064499957584,
"truePositiveTest": 73,
"falsePositiveTest": 16,
"falseNegativeTest": 18,
"trueNegativeTest": 525,
"accuracyTest": 0.9462025316455697,
"mccTest": 0.7798157631261908,
"aucTest": 0.9696735796550954,
"logLossTest": 0.14611503001799023,
"missRateTest": 0.1978021978021978,
"falloutTest": 0.029574861367837338,
"npvTest": 0.9668508287292817,
"specificityTest": 0.9704251386321626,
"recallTest": 0.8021978021978022,
"precisionTest": 0.8202247191011236,
"f1Test": 0.8111111111111112,
"thresholdTest": 0.4284064499957584
}
},
"hpoNum": null,
"droppedFeatures": [
{
"name": "DaysSinceLastService",
"reason": "has_target_leakage"
},
{
"name": "PriorPeriodUsage",
"reason": "highly_correlated"
},
{
"name": "Territory",
"reason": "feature_with_low_importance"
},
{
"name": "StartMonth",
"reason": "feature_with_low_importance"
},
{
"name": "CurrentPeriodUsage",
"reason": "feature_with_low_importance"
},
{
"name": "DeviceType",
"reason": "feature_with_low_importance"
},
{
"name": "StartWeek",
"reason": "feature_with_low_importance"
},
{
"name": "CustomerTenure",
"reason": "feature_with_low_importance"
}
],
"samplingRatio": 1,
"columns": [
"PlanType",
"NumberOfPenalties",
"HasRenewed",
"BaseFee",
"ServiceTickets",
"AdditionalFeatureSpend",
"ServiceRating",
"Promotion"
],
"modelState": "inactive"
}
},
{
"type": "model",
"id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
"attributes": {
"id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
"experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
"createdAt": "2024-11-27T13:19:20.880715Z",
"updatedAt": "2024-11-27T13:19:30.471811Z",
"batchNum": 0,
"algorithm": "random_forest_classifier",
"name": "v03_RAFC_00_00",
"description": null,
"seqNum": 0,
"algoAbbrv": "RAFC",
"status": "ready",
"errorMessage": null,
"metrics": {
"binary": {
"truePositive": 290,
"falsePositive": 76,
"falseNegative": 75,
"trueNegative": 2087,
"accuracy": 0.939873417721519,
"mcc": 0.7566444372668605,
"auc": 0.9604823336436583,
"logLoss": 0.17675936885512278,
"missRate": 0.2054794520547945,
"fallout": 0.03513638465094776,
"npv": 0.9653098982423681,
"specificity": 0.9648636153490523,
"recall": 0.7945205479452054,
"precision": 0.7923497267759563,
"f1": 0.7934336525307798,
"threshold": 0.36075704789531526,
"truePositiveTest": 74,
"falsePositiveTest": 24,
"falseNegativeTest": 17,
"trueNegativeTest": 517,
"accuracyTest": 0.935126582278481,
"mccTest": 0.7456978462730665,
"aucTest": 0.9520220998964067,
"logLossTest": 0.18079716644028357,
"missRateTest": 0.18681318681318682,
"falloutTest": 0.04436229205175601,
"npvTest": 0.9681647940074907,
"specificityTest": 0.955637707948244,
"recallTest": 0.8131868131868132,
"precisionTest": 0.7551020408163265,
"f1Test": 0.783068783068783,
"thresholdTest": 0.36075704789531526
}
},
"hpoNum": null,
"droppedFeatures": [
{
"name": "DaysSinceLastService",
"reason": "has_target_leakage"
},
{
"name": "PriorPeriodUsage",
"reason": "highly_correlated"
}
],
"samplingRatio": 1,
"columns": [
"Territory",
"DeviceType",
"Promotion",
"HasRenewed",
"PlanType",
"BaseFee",
"AdditionalFeatureSpend",
"NumberOfPenalties",
"CurrentPeriodUsage",
"ServiceRating",
"ServiceTickets",
"StartMonth",
"StartWeek",
"CustomerTenure",
"Churned"
],
"modelState": "inactive"
}
}
]
}
In this example, three models have been generated using the random_forest_classifier algorithm.
These models were trained to predict customer churn (binary classification: churned vs. not churned).
Compare models
When evaluating models, you should ask yourself the following questions:
- Which model performs best for the business objective?
- How should metrics like accuracy, precision, recall, and F1 score influence the decision?
- What trade-offs exist in terms of feature usage and performance?
For more information, see Interpreting model scores on Qlik Help.
Example values
The following table includes performance metrics for all models generated in the previous example:
Metric | Model v03_RAFC_01_01 | Model v03_RAFC_01_00 | Model v03_RAFC_00_00 |
---|---|---|---|
Accuracy (Test) | 93.04% | 94.62% | 93.51% |
Precision (Test) | 71.96% | 82.02% | 75.51% |
Recall (Test) | 84.62% | 80.22% | 81.32% |
F1 score (Test) | 77.78% | 81.11% | 78.31% |
AUC (Test) | 95.41% | 96.97% | 95.20% |
Log loss (Test) | 0.1789 | 0.1461 | 0.1807 |
Dropped features | DaysSinceLastService | DaysSinceLastService, PriorPeriodUsage, Territory, StartMonth, CurrentPeriodUsage, DeviceType, StartWeek, CustomerTenure | DaysSinceLastService, PriorPeriodUsage |
Key metrics
Before selecting a model, understand what each metric signifies:
Metric | Description |
---|---|
Accuracy | Overall proportion of correct predictions across both classes. |
Precision | Proportion of predicted positives that are actually positive. Higher precision means fewer false positives (in this example, non-churners incorrectly flagged as churners). |
Recall | Proportion of actual positives that the model identifies. Higher recall means fewer false negatives (in this example, missed churners). |
F1 score | Harmonic mean of precision and recall, balancing the two. |
AUC | Ability to distinguish between classes across all decision thresholds (higher is better). |
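As a sanity check, these definitions can be verified against the test-set confusion matrix reported for v03_RAFC_01_01 in the response above (truePositiveTest=77, falsePositiveTest=30, falseNegativeTest=14, trueNegativeTest=511). The derived values match the `accuracyTest`, `precisionTest`, `recallTest`, and `f1Test` fields returned by the API.

```python
import math

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive headline metrics from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Test-set confusion matrix for v03_RAFC_01_01 from the example response.
m = binary_metrics(tp=77, fp=30, fn=14, tn=511)
```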
Key insights for model selection
Based on the example metrics:
- v03_RAFC_01_01 has the best recall (84.62%), making it suitable for use cases where identifying as many churners as possible is critical, even at the risk of more false positives.
- v03_RAFC_01_00 has the highest precision (82.02%), F1 score (81.11%), and AUC (96.97%), as well as the lowest log loss, making it ideal for minimizing false positives.
- v03_RAFC_00_00 has balanced metrics but trails slightly behind the other two models.
For more information about evaluating binary classification models, see Scoring binary classification models on Qlik Help.
Model selection
Select a model based on your business objective. In this example:
- If catching as many churners as possible is critical, choose the model with the highest recall.
- If acting only on customers who are genuinely likely to churn matters more, prioritize the model with the highest precision (fewest false positives).
Based on this evaluation, v03_RAFC_01_00 is selected for deployment because it has the highest precision, aligning with the business objective of minimizing false positives.
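Once you have decided which metric reflects your business objective, the selection itself is a one-liner. The sketch below is illustrative and assumes rows shaped like the comparison table above (for log loss you would minimize rather than maximize); the metric values are rounded from the example response.

```python
def select_model(rows: list, metric: str) -> dict:
    """Pick the model that maximizes the chosen test metric."""
    return max(rows, key=lambda r: r[metric])

# Rounded test-set metrics from the three example models.
candidates = [
    {"name": "v03_RAFC_01_01", "precision": 0.7196, "recall": 0.8462},
    {"name": "v03_RAFC_01_00", "precision": 0.8202, "recall": 0.8022},
    {"name": "v03_RAFC_00_00", "precision": 0.7551, "recall": 0.8132},
]
```

Selecting on precision yields v03_RAFC_01_00, matching the manual evaluation; selecting on recall would instead yield v03_RAFC_01_01.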
Next step
With the best-performing model identified, deploy it and make it available for predictions.