Step 4: Evaluate models

After creating your experiment version, multiple models are generated and trained. The top-performing model (topModelId) is returned by the API, but you can evaluate and compare all generated models to select one that aligns with your business objectives, for example, preventing customer churn as precisely as possible.

List models

Retrieve all models generated by an experiment version with the following API call:

curl -L "https://<TENANT>/api/v1/ml/experiments/{experimentId}/models" ^
-H "Content-Type: application/json" ^
-H "Accept: application/json" ^
-H "Authorization: Bearer <ACCESS_TOKEN>"
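
The same request can also be prepared from a script. The following is a minimal sketch using only the Python standard library; the function name `build_list_models_request` and the placeholder tenant, experiment ID, and token are illustrative assumptions and are not part of the API. The request is built but not sent, so the snippet runs without network access.

```python
import urllib.request


def build_list_models_request(tenant, experiment_id, access_token):
    """Build (but do not send) the GET request that lists an experiment's models."""
    url = f"https://{tenant}/api/v1/ml/experiments/{experiment_id}/models"
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {access_token}",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Hypothetical placeholder values; substitute your own tenant, ID, and token.
request = build_list_models_request("example.invalid", "my-experiment-id", "my-token")
print(request.get_method(), request.full_url)

# To send the request and read the model list:
#   import json
#   with urllib.request.urlopen(request) as response:
#       models = json.load(response)["data"]
```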

Response example

{
    "data": [
        {
            "type": "model",
            "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
            "attributes": {
                "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:36.403719Z",
                "updatedAt": "2024-11-27T13:19:46.442265Z",
                "batchNum": 1,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_01_01",
                "description": null,
                "seqNum": 1,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 311,
                        "falsePositive": 103,
                        "falseNegative": 54,
                        "trueNegative": 2060,
                        "accuracy": 0.9375,
                        "mcc": 0.7621741380656772,
                        "auc": 0.9629434005281857,
                        "logLoss": 0.17645366167456517,
                        "missRate": 0.14794520547945206,
                        "fallout": 0.047619047619047616,
                        "npv": 0.9744560075685903,
                        "specificity": 0.9523809523809523,
                        "recall": 0.852054794520548,
                        "precision": 0.751207729468599,
                        "f1": 0.7984595635430038,
                        "threshold": 0.30128437693042176,
                        "truePositiveTest": 77,
                        "falsePositiveTest": 30,
                        "falseNegativeTest": 14,
                        "trueNegativeTest": 511,
                        "accuracyTest": 0.930379746835443,
                        "mccTest": 0.7402187229033519,
                        "aucTest": 0.9540533403749669,
                        "logLossTest": 0.1789144736636058,
                        "missRateTest": 0.15384615384615385,
                        "falloutTest": 0.05545286506469501,
                        "npvTest": 0.9733333333333334,
                        "specificityTest": 0.944547134935305,
                        "recallTest": 0.8461538461538461,
                        "precisionTest": 0.719626168224299,
                        "f1Test": 0.7777777777777778,
                        "thresholdTest": 0.30128437693042176
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "Territory",
                    "DeviceType",
                    "Promotion",
                    "HasRenewed",
                    "PlanType",
                    "BaseFee",
                    "AdditionalFeatureSpend",
                    "NumberOfPenalties",
                    "CurrentPeriodUsage",
                    "PriorPeriodUsage",
                    "ServiceRating",
                    "ServiceTickets",
                    "StartMonth",
                    "StartWeek",
                    "CustomerTenure",
                    "Churned"
                ],
                "modelState": "inactive"
            }
        },
        {
            "type": "model",
            "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
            "attributes": {
                "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:36.391646Z",
                "updatedAt": "2024-11-27T13:19:44.029329Z",
                "batchNum": 1,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_01_00",
                "description": null,
                "seqNum": 0,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 275,
                        "falsePositive": 58,
                        "falseNegative": 90,
                        "trueNegative": 2105,
                        "accuracy": 0.9410601265822784,
                        "mcc": 0.753269275386698,
                        "auc": 0.9583315917136904,
                        "logLoss": 0.16340603954795424,
                        "missRate": 0.2465753424657534,
                        "fallout": 0.026814609338881183,
                        "npv": 0.958997722095672,
                        "specificity": 0.9731853906611189,
                        "recall": 0.7534246575342466,
                        "precision": 0.8258258258258259,
                        "f1": 0.7879656160458453,
                        "threshold": 0.4284064499957584,
                        "truePositiveTest": 73,
                        "falsePositiveTest": 16,
                        "falseNegativeTest": 18,
                        "trueNegativeTest": 525,
                        "accuracyTest": 0.9462025316455697,
                        "mccTest": 0.7798157631261908,
                        "aucTest": 0.9696735796550954,
                        "logLossTest": 0.14611503001799023,
                        "missRateTest": 0.1978021978021978,
                        "falloutTest": 0.029574861367837338,
                        "npvTest": 0.9668508287292817,
                        "specificityTest": 0.9704251386321626,
                        "recallTest": 0.8021978021978022,
                        "precisionTest": 0.8202247191011236,
                        "f1Test": 0.8111111111111112,
                        "thresholdTest": 0.4284064499957584
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    },
                    {
                        "name": "PriorPeriodUsage",
                        "reason": "highly_correlated"
                    },
                    {
                        "name": "Territory",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "StartMonth",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "CurrentPeriodUsage",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "DeviceType",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "StartWeek",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "CustomerTenure",
                        "reason": "feature_with_low_importance"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "PlanType",
                    "NumberOfPenalties",
                    "HasRenewed",
                    "BaseFee",
                    "ServiceTickets",
                    "AdditionalFeatureSpend",
                    "ServiceRating",
                    "Promotion"
                ],
                "modelState": "inactive"
            }
        },
        {
            "type": "model",
            "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
            "attributes": {
                "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:20.880715Z",
                "updatedAt": "2024-11-27T13:19:30.471811Z",
                "batchNum": 0,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_00_00",
                "description": null,
                "seqNum": 0,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 290,
                        "falsePositive": 76,
                        "falseNegative": 75,
                        "trueNegative": 2087,
                        "accuracy": 0.939873417721519,
                        "mcc": 0.7566444372668605,
                        "auc": 0.9604823336436583,
                        "logLoss": 0.17675936885512278,
                        "missRate": 0.2054794520547945,
                        "fallout": 0.03513638465094776,
                        "npv": 0.9653098982423681,
                        "specificity": 0.9648636153490523,
                        "recall": 0.7945205479452054,
                        "precision": 0.7923497267759563,
                        "f1": 0.7934336525307798,
                        "threshold": 0.36075704789531526,
                        "truePositiveTest": 74,
                        "falsePositiveTest": 24,
                        "falseNegativeTest": 17,
                        "trueNegativeTest": 517,
                        "accuracyTest": 0.935126582278481,
                        "mccTest": 0.7456978462730665,
                        "aucTest": 0.9520220998964067,
                        "logLossTest": 0.18079716644028357,
                        "missRateTest": 0.18681318681318682,
                        "falloutTest": 0.04436229205175601,
                        "npvTest": 0.9681647940074907,
                        "specificityTest": 0.955637707948244,
                        "recallTest": 0.8131868131868132,
                        "precisionTest": 0.7551020408163265,
                        "f1Test": 0.783068783068783,
                        "thresholdTest": 0.36075704789531526
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    },
                    {
                        "name": "PriorPeriodUsage",
                        "reason": "highly_correlated"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "Territory",
                    "DeviceType",
                    "Promotion",
                    "HasRenewed",
                    "PlanType",
                    "BaseFee",
                    "AdditionalFeatureSpend",
                    "NumberOfPenalties",
                    "CurrentPeriodUsage",
                    "ServiceRating",
                    "ServiceTickets",
                    "StartMonth",
                    "StartWeek",
                    "CustomerTenure",
                    "Churned"
                ],
                "modelState": "inactive"
            }
        }
    ]
}

In this example, three models have been generated using the random_forest_classifier algorithm. These models were trained to predict customer churn (binary classification: churned vs. not churned).

Compare models

When evaluating models, you should ask yourself the following questions:

Which model performs best for the business objective?
How should metrics like accuracy, precision, recall, and F1 score influence the decision?
What trade-offs exist in terms of feature usage and performance?

For more information, see Interpreting model scores on Qlik Help.

Example values

The following table includes performance metrics for all models generated in the previous example:

Metric	Model `v03_RAFC_01_01`	Model `v03_RAFC_01_00`	Model `v03_RAFC_00_00`
Accuracy (Test)	93.04%	94.62%	93.51%
Precision (Test)	71.96%	82.02%	75.51%
Recall (Test)	84.62%	80.22%	81.32%
F1 score (Test)	77.78%	81.11%	78.31%
AUC (Test)	95.41%	96.97%	95.20%
Log loss (Test)	0.1789	0.1461	0.1808
Dropped features	`DaysSinceLastService`	`DaysSinceLastService`, `PriorPeriodUsage`, `Territory`, `StartMonth`, `CurrentPeriodUsage`, `DeviceType`, `StartWeek`, `CustomerTenure`	`DaysSinceLastService`, `PriorPeriodUsage`
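
A table like the one above can be derived directly from the response. The sketch below assumes the response has been parsed into a Python dictionary; to keep it short, it uses a trimmed copy of the response example that contains only the fields it reads. The helper name `summarize` is an illustrative choice.

```python
# Trimmed copy of the response example: only the fields used here are kept.
response = {"data": [
    {"attributes": {"name": "v03_RAFC_01_01", "metrics": {"binary": {
        "accuracyTest": 0.930379746835443, "precisionTest": 0.719626168224299,
        "recallTest": 0.8461538461538461, "f1Test": 0.7777777777777778,
        "aucTest": 0.9540533403749669, "logLossTest": 0.1789144736636058}}}},
    {"attributes": {"name": "v03_RAFC_01_00", "metrics": {"binary": {
        "accuracyTest": 0.9462025316455697, "precisionTest": 0.8202247191011236,
        "recallTest": 0.8021978021978022, "f1Test": 0.8111111111111112,
        "aucTest": 0.9696735796550954, "logLossTest": 0.14611503001799023}}}},
    {"attributes": {"name": "v03_RAFC_00_00", "metrics": {"binary": {
        "accuracyTest": 0.935126582278481, "precisionTest": 0.7551020408163265,
        "recallTest": 0.8131868131868132, "f1Test": 0.783068783068783,
        "aucTest": 0.9520220998964067, "logLossTest": 0.18079716644028357}}}},
]}

# Metrics reported as percentages in the comparison table.
PERCENT_METRICS = ["accuracyTest", "precisionTest", "recallTest", "f1Test", "aucTest"]


def summarize(model):
    """Return one table row: the model name plus formatted test metrics."""
    attributes = model["attributes"]
    binary = attributes["metrics"]["binary"]
    row = {"name": attributes["name"]}
    for key in PERCENT_METRICS:
        row[key] = f"{binary[key] * 100:.2f}%"
    row["logLossTest"] = f"{binary['logLossTest']:.4f}"
    return row


rows = [summarize(model) for model in response["data"]]
for row in rows:
    print("\t".join(row.values()))
```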

Key metrics

Before selecting a model, understand what each metric signifies:

Metric	Description
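Accuracy	Share of all predictions that are correct.
Precision	Share of predicted positives that are correct. Higher precision means fewer false positives (in this example, non-churners incorrectly flagged as churners).
Recall	Share of actual positives that the model identifies. Higher recall means fewer false negatives (in this example, missed churners).
F1 score	Harmonic mean of precision and recall, balancing the two.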
AUC	Measures the ability to distinguish between classes (higher is better).
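
To make these definitions concrete, the sketch below recomputes accuracy, precision, recall, and F1 score from confusion-matrix counts using the standard formulas. The inputs are the counts with the Test suffix reported for v03_RAFC_01_01 in the response example; the function name `binary_metrics` is an illustrative choice.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }


# truePositiveTest, falsePositiveTest, falseNegativeTest, trueNegativeTest
# for v03_RAFC_01_01 in the response example.
scores = binary_metrics(tp=77, fp=30, fn=14, tn=511)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the accuracyTest, precisionTest, recallTest, and f1Test values returned by the API for this model.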

Key insights for model selection

Based on the example metrics:

v03_RAFC_01_01 has the best recall (84.62%), making it suitable for use cases where identifying as many churners as possible is critical, even at the cost of more false positives (it has the lowest precision, 71.96%).
v03_RAFC_01_00 has the highest precision (82.02%), F1 score (81.11%), and AUC (96.97%), making it the best choice for minimizing false positives. It also uses the fewest features.
v03_RAFC_00_00 falls between the other two models on accuracy, precision, recall, and F1 score, but has the lowest AUC (95.20%).

For more information about evaluating binary classification models, see Scoring binary classification models on Qlik Help.

Model selection

Select a model based on your business objective. In this example:

If catching as many churners as possible is critical, choose the model with the highest recall.
If avoiding false positives is more important, choose the model with the highest precision.

Based on the evaluation, v03_RAFC_01_00 is selected for deployment because it has the highest precision (82.02%), which aligns with the business objective of minimizing false positives.
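
The selection rule can also be applied programmatically. The sketch below picks the model that maximizes a chosen metric; the IDs, names, and metric values are copied from the response example, while the flattened record layout and the helper name `pick_model` are illustrative assumptions.

```python
# Model IDs, names, and test metrics copied from the response example.
models = [
    {"id": "db823841-c551-4e18-84e3-d3a2a98edd60", "name": "v03_RAFC_01_01",
     "precisionTest": 0.719626168224299, "recallTest": 0.8461538461538461},
    {"id": "6953dbe3-c997-4f82-af77-494bda9b1247", "name": "v03_RAFC_01_00",
     "precisionTest": 0.8202247191011236, "recallTest": 0.8021978021978022},
    {"id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1", "name": "v03_RAFC_00_00",
     "precisionTest": 0.7551020408163265, "recallTest": 0.8131868131868132},
]


def pick_model(candidates, metric):
    """Return the candidate with the highest value for the given metric."""
    return max(candidates, key=lambda model: model[metric])


best_precision = pick_model(models, "precisionTest")
best_recall = pick_model(models, "recallTest")
print("Highest precision:", best_precision["name"], best_precision["id"])
print("Highest recall:", best_recall["name"], best_recall["id"])
```

Keeping the model ID alongside the metrics means the selected ID is at hand for the deployment step.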

Next step

With the best-performing model identified, deploy it and make it available for predictions.