Step 4: Evaluate models

After creating your experiment version, multiple models are generated and trained. The top-performing model (topModelId) is returned by the API, but you can evaluate and compare all generated models to select one that aligns with your business objectives, for example, preventing customer churn as precisely as possible.

List models

Retrieve all models generated by an experiment version with the following API call:

curl -L "https://<TENANT>/api/v1/ml/experiments/{experimentId}/models" ^
-H "Content-Type: application/json" ^
-H "Accept: application/json" ^
-H "Authorization: Bearer <ACCESS_TOKEN>"
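
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `build_models_request` and `list_models` are hypothetical, and the tenant, experiment ID, and access token are values you supply. The `Content-Type` header is left out because a GET request has no body.

```python
import json
import urllib.request


def build_models_request(tenant: str, experiment_id: str, access_token: str) -> urllib.request.Request:
    """Build the GET request with the same URL and headers as the curl call."""
    url = f"https://{tenant}/api/v1/ml/experiments/{experiment_id}/models"
    headers = {
        "Accept": "application/json",
        "Authorization": f"Bearer {access_token}",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def list_models(tenant: str, experiment_id: str, access_token: str) -> list:
    """Send the request and return the entries of the response's "data" array."""
    request = build_models_request(tenant, experiment_id, access_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]
```

Calling `list_models("<TENANT>", "<EXPERIMENT_ID>", "<ACCESS_TOKEN>")` with real values should return a list shaped like the `data` array in the response example.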

Response example

{
    "data": [
        {
            "type": "model",
            "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
            "attributes": {
                "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:36.403719Z",
                "updatedAt": "2024-11-27T13:19:46.442265Z",
                "batchNum": 1,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_01_01",
                "description": null,
                "seqNum": 1,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 311,
                        "falsePositive": 103,
                        "falseNegative": 54,
                        "trueNegative": 2060,
                        "accuracy": 0.9375,
                        "mcc": 0.7621741380656772,
                        "auc": 0.9629434005281857,
                        "logLoss": 0.17645366167456517,
                        "missRate": 0.14794520547945206,
                        "fallout": 0.047619047619047616,
                        "npv": 0.9744560075685903,
                        "specificity": 0.9523809523809523,
                        "recall": 0.852054794520548,
                        "precision": 0.751207729468599,
                        "f1": 0.7984595635430038,
                        "threshold": 0.30128437693042176,
                        "truePositiveTest": 77,
                        "falsePositiveTest": 30,
                        "falseNegativeTest": 14,
                        "trueNegativeTest": 511,
                        "accuracyTest": 0.930379746835443,
                        "mccTest": 0.7402187229033519,
                        "aucTest": 0.9540533403749669,
                        "logLossTest": 0.1789144736636058,
                        "missRateTest": 0.15384615384615385,
                        "falloutTest": 0.05545286506469501,
                        "npvTest": 0.9733333333333334,
                        "specificityTest": 0.944547134935305,
                        "recallTest": 0.8461538461538461,
                        "precisionTest": 0.719626168224299,
                        "f1Test": 0.7777777777777778,
                        "thresholdTest": 0.30128437693042176
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "Territory",
                    "DeviceType",
                    "Promotion",
                    "HasRenewed",
                    "PlanType",
                    "BaseFee",
                    "AdditionalFeatureSpend",
                    "NumberOfPenalties",
                    "CurrentPeriodUsage",
                    "PriorPeriodUsage",
                    "ServiceRating",
                    "ServiceTickets",
                    "StartMonth",
                    "StartWeek",
                    "CustomerTenure",
                    "Churned"
                ],
                "modelState": "inactive"
            }
        },
        {
            "type": "model",
            "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
            "attributes": {
                "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:36.391646Z",
                "updatedAt": "2024-11-27T13:19:44.029329Z",
                "batchNum": 1,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_01_00",
                "description": null,
                "seqNum": 0,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 275,
                        "falsePositive": 58,
                        "falseNegative": 90,
                        "trueNegative": 2105,
                        "accuracy": 0.9410601265822784,
                        "mcc": 0.753269275386698,
                        "auc": 0.9583315917136904,
                        "logLoss": 0.16340603954795424,
                        "missRate": 0.2465753424657534,
                        "fallout": 0.026814609338881183,
                        "npv": 0.958997722095672,
                        "specificity": 0.9731853906611189,
                        "recall": 0.7534246575342466,
                        "precision": 0.8258258258258259,
                        "f1": 0.7879656160458453,
                        "threshold": 0.4284064499957584,
                        "truePositiveTest": 73,
                        "falsePositiveTest": 16,
                        "falseNegativeTest": 18,
                        "trueNegativeTest": 525,
                        "accuracyTest": 0.9462025316455697,
                        "mccTest": 0.7798157631261908,
                        "aucTest": 0.9696735796550954,
                        "logLossTest": 0.14611503001799023,
                        "missRateTest": 0.1978021978021978,
                        "falloutTest": 0.029574861367837338,
                        "npvTest": 0.9668508287292817,
                        "specificityTest": 0.9704251386321626,
                        "recallTest": 0.8021978021978022,
                        "precisionTest": 0.8202247191011236,
                        "f1Test": 0.8111111111111112,
                        "thresholdTest": 0.4284064499957584
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    },
                    {
                        "name": "PriorPeriodUsage",
                        "reason": "highly_correlated"
                    },
                    {
                        "name": "Territory",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "StartMonth",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "CurrentPeriodUsage",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "DeviceType",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "StartWeek",
                        "reason": "feature_with_low_importance"
                    },
                    {
                        "name": "CustomerTenure",
                        "reason": "feature_with_low_importance"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "PlanType",
                    "NumberOfPenalties",
                    "HasRenewed",
                    "BaseFee",
                    "ServiceTickets",
                    "AdditionalFeatureSpend",
                    "ServiceRating",
                    "Promotion"
                ],
                "modelState": "inactive"
            }
        },
        {
            "type": "model",
            "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
            "attributes": {
                "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
                "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                "createdAt": "2024-11-27T13:19:20.880715Z",
                "updatedAt": "2024-11-27T13:19:30.471811Z",
                "batchNum": 0,
                "algorithm": "random_forest_classifier",
                "name": "v03_RAFC_00_00",
                "description": null,
                "seqNum": 0,
                "algoAbbrv": "RAFC",
                "status": "ready",
                "errorMessage": null,
                "metrics": {
                    "binary": {
                        "truePositive": 290,
                        "falsePositive": 76,
                        "falseNegative": 75,
                        "trueNegative": 2087,
                        "accuracy": 0.939873417721519,
                        "mcc": 0.7566444372668605,
                        "auc": 0.9604823336436583,
                        "logLoss": 0.17675936885512278,
                        "missRate": 0.2054794520547945,
                        "fallout": 0.03513638465094776,
                        "npv": 0.9653098982423681,
                        "specificity": 0.9648636153490523,
                        "recall": 0.7945205479452054,
                        "precision": 0.7923497267759563,
                        "f1": 0.7934336525307798,
                        "threshold": 0.36075704789531526,
                        "truePositiveTest": 74,
                        "falsePositiveTest": 24,
                        "falseNegativeTest": 17,
                        "trueNegativeTest": 517,
                        "accuracyTest": 0.935126582278481,
                        "mccTest": 0.7456978462730665,
                        "aucTest": 0.9520220998964067,
                        "logLossTest": 0.18079716644028357,
                        "missRateTest": 0.18681318681318682,
                        "falloutTest": 0.04436229205175601,
                        "npvTest": 0.9681647940074907,
                        "specificityTest": 0.955637707948244,
                        "recallTest": 0.8131868131868132,
                        "precisionTest": 0.7551020408163265,
                        "f1Test": 0.783068783068783,
                        "thresholdTest": 0.36075704789531526
                    }
                },
                "hpoNum": null,
                "droppedFeatures": [
                    {
                        "name": "DaysSinceLastService",
                        "reason": "has_target_leakage"
                    },
                    {
                        "name": "PriorPeriodUsage",
                        "reason": "highly_correlated"
                    }
                ],
                "samplingRatio": 1,
                "columns": [
                    "Territory",
                    "DeviceType",
                    "Promotion",
                    "HasRenewed",
                    "PlanType",
                    "BaseFee",
                    "AdditionalFeatureSpend",
                    "NumberOfPenalties",
                    "CurrentPeriodUsage",
                    "ServiceRating",
                    "ServiceTickets",
                    "StartMonth",
                    "StartWeek",
                    "CustomerTenure",
                    "Churned"
                ],
                "modelState": "inactive"
            }
        }
    ]
}

In this example, three models have been generated using the random_forest_classifier algorithm. These models were trained to predict customer churn (binary classification: churned vs. not churned).

Compare models

When evaluating models, you should ask yourself the following questions:

  • Which model performs best for the business objective?
  • How should metrics like accuracy, precision, recall, and F1 score influence the decision?
  • What trade-offs exist in terms of feature usage and performance?

For more information, see Interpreting model scores on Qlik Help.

Example values

The following table includes performance metrics for all models generated in the previous example:

| Metric | Model v03_RAFC_01_01 | Model v03_RAFC_01_00 | Model v03_RAFC_00_00 |
| --- | --- | --- | --- |
| Accuracy (Test) | 93.04% | 94.62% | 93.51% |
| Precision (Test) | 71.96% | 82.02% | 75.51% |
| Recall (Test) | 84.62% | 80.22% | 81.32% |
| F1 score (Test) | 77.78% | 81.11% | 78.31% |
| AUC (Test) | 95.41% | 96.97% | 95.20% |
| Log loss (Test) | 0.1789 | 0.1461 | 0.1808 |
| Dropped features | DaysSinceLastService | DaysSinceLastService, PriorPeriodUsage, Territory, StartMonth, CurrentPeriodUsage, DeviceType, StartWeek, CustomerTenure | DaysSinceLastService, PriorPeriodUsage |
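
These values come from the `metrics.binary` object and the `droppedFeatures` array of each model in the response. The sketch below shows one way to derive such rows from a parsed response. The helper name `summarize_test_metrics` and the output keys are hypothetical, and the sample is a trimmed copy of the first model from the response example.

```python
def summarize_test_metrics(models: list) -> list:
    """Return one row per model: name, rounded test metrics, dropped feature names."""
    rows = []
    for model in models:
        attrs = model["attributes"]
        binary = attrs["metrics"]["binary"]
        rows.append({
            "name": attrs["name"],
            # Ratios are converted to percentages with two decimals.
            "accuracy_pct": round(binary["accuracyTest"] * 100, 2),
            "precision_pct": round(binary["precisionTest"] * 100, 2),
            "recall_pct": round(binary["recallTest"] * 100, 2),
            "f1_pct": round(binary["f1Test"] * 100, 2),
            "auc_pct": round(binary["aucTest"] * 100, 2),
            # Log loss is not a ratio, so it stays on its original scale.
            "log_loss": round(binary["logLossTest"], 4),
            "dropped": [feature["name"] for feature in attrs["droppedFeatures"]],
        })
    return rows


# Trimmed copy of the first model in the response example.
sample_models = [{
    "attributes": {
        "name": "v03_RAFC_01_01",
        "metrics": {"binary": {
            "accuracyTest": 0.930379746835443,
            "precisionTest": 0.719626168224299,
            "recallTest": 0.8461538461538461,
            "f1Test": 0.7777777777777778,
            "aucTest": 0.9540533403749669,
            "logLossTest": 0.1789144736636058,
        }},
        "droppedFeatures": [
            {"name": "DaysSinceLastService", "reason": "has_target_leakage"},
        ],
    },
}]

rows = summarize_test_metrics(sample_models)
print(rows[0])
```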

Key metrics

Before selecting a model, understand what each metric signifies:

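| Metric | Description |
| --- | --- |
| Accuracy | Share of all predictions that are correct. |
| Precision | Share of predicted churners that actually churned. Higher precision means fewer false positives (in this example, non-churners incorrectly flagged as churners). |
| Recall | Share of actual churners that the model identified. Higher recall means fewer false negatives (in this example, missed churners). |
| F1 score | Harmonic mean of precision and recall; balances the two. |
| AUC | Measures the ability to distinguish between classes (higher is better). |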
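
Accuracy, precision, recall, and F1 score follow directly from the confusion matrix counts in the response. As a worked illustration, the sketch below recomputes them from the test counts of v03_RAFC_01_01 (`truePositiveTest`, `falsePositiveTest`, `falseNegativeTest`, `trueNegativeTest`). The function name `confusion_metrics` is hypothetical, and the code assumes the denominators are nonzero. AUC is not included because it depends on the predicted probabilities, not on the counts at a single threshold.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp)  # of the predicted churners, how many churned
    recall = tp / (tp + fn)     # of the actual churners, how many were found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-set counts of v03_RAFC_01_01 from the response example.
metrics = confusion_metrics(tp=77, fp=30, fn=14, tn=511)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the `accuracyTest`, `precisionTest`, `recallTest`, and `f1Test` values returned for that model.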

Key insights for model selection

Based on the example metrics:

  • v03_RAFC_01_01 has the best recall (84.62%) but the lowest precision (71.96%), making it suitable for use cases where identifying as many churners as possible is critical, even at the cost of more false positives.
  • v03_RAFC_01_00 has the highest precision (82.02%), F1 score (81.11%), accuracy (94.62%), and AUC (96.97%), making it the best choice for minimizing false positives. It also uses the fewest features.
  • v03_RAFC_00_00 has balanced metrics but does not lead on any of them.

For more information about evaluating binary classification models, see Scoring binary classification models on Qlik Help.

Model selection

Select a model based on your business objective. In this example:

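  • If catching as many churners as possible is critical, choose the model with the highest recall (v03_RAFC_01_01).
  • If avoiding false alarms is more important, choose the model with the highest precision, that is, the fewest false positives relative to predicted churners (v03_RAFC_01_00).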

Based on the evaluation, v03_RAFC_01_00 is selected for deployment because it has the highest precision, which aligns with the business objective of minimizing false positives.
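
The selection rule can be sketched as picking the model with the highest value of whichever metric matches the objective. This is only an illustration: the `CANDIDATES` dictionary and the `select_model` helper are hypothetical, and the values are the test metrics from the example response rounded to four decimals.

```python
# Test metrics from the example response, rounded to four decimals.
CANDIDATES = {
    "v03_RAFC_01_01": {"precisionTest": 0.7196, "recallTest": 0.8462, "f1Test": 0.7778},
    "v03_RAFC_01_00": {"precisionTest": 0.8202, "recallTest": 0.8022, "f1Test": 0.8111},
    "v03_RAFC_00_00": {"precisionTest": 0.7551, "recallTest": 0.8132, "f1Test": 0.7831},
}


def select_model(candidates: dict, metric: str) -> str:
    """Return the name of the model with the highest value for the given metric."""
    return max(candidates, key=lambda name: candidates[name][metric])


print(select_model(CANDIDATES, "precisionTest"))  # objective: fewest false positives
print(select_model(CANDIDATES, "recallTest"))     # objective: fewest missed churners
```

This helper only suits metrics where higher is better. For log loss, where lower is better, use `min` instead of `max`.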

Next step

With the best-performing model identified, deploy it and make it available for predictions.
