---
source: https://qlik.dev/embed/machine-learning/tutorial/evaluate-models/
last_updated: 2024-12-20T16:26:49+01:00
---

# Step 4: Evaluate models

After you create your experiment version, multiple models are generated and trained.
The top-performing model (`topModelId`) is [returned by the API](https://qlik.dev/embed/machine-learning/tutorial/evaluate-models/create-experiment-version#monitor-training-progress),
but you can evaluate and compare all generated models to select one that aligns with your business objectives, for
example, preventing customer churn as precisely as possible.

## List models

Retrieve all models generated by an experiment version with the following API call:

```bash
curl -L "https://<TENANT>/api/v1/ml/experiments/{experimentId}/models" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer <ACCESS_TOKEN>"
```

<details>
  <summary>
    **Response example**
  </summary>

  ```json
  {
      "data": [
          {
              "type": "model",
              "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
              "attributes": {
                  "id": "db823841-c551-4e18-84e3-d3a2a98edd60",
                  "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                  "createdAt": "2024-11-27T13:19:36.403719Z",
                  "updatedAt": "2024-11-27T13:19:46.442265Z",
                  "batchNum": 1,
                  "algorithm": "random_forest_classifier",
                  "name": "v03_RAFC_01_01",
                  "description": null,
                  "seqNum": 1,
                  "algoAbbrv": "RAFC",
                  "status": "ready",
                  "errorMessage": null,
                  "metrics": {
                      "binary": {
                          "truePositive": 311,
                          "falsePositive": 103,
                          "falseNegative": 54,
                          "trueNegative": 2060,
                          "accuracy": 0.9375,
                          "mcc": 0.7621741380656772,
                          "auc": 0.9629434005281857,
                          "logLoss": 0.17645366167456517,
                          "missRate": 0.14794520547945206,
                          "fallout": 0.047619047619047616,
                          "npv": 0.9744560075685903,
                          "specificity": 0.9523809523809523,
                          "recall": 0.852054794520548,
                          "precision": 0.751207729468599,
                          "f1": 0.7984595635430038,
                          "threshold": 0.30128437693042176,
                          "truePositiveTest": 77,
                          "falsePositiveTest": 30,
                          "falseNegativeTest": 14,
                          "trueNegativeTest": 511,
                          "accuracyTest": 0.930379746835443,
                          "mccTest": 0.7402187229033519,
                          "aucTest": 0.9540533403749669,
                          "logLossTest": 0.1789144736636058,
                          "missRateTest": 0.15384615384615385,
                          "falloutTest": 0.05545286506469501,
                          "npvTest": 0.9733333333333334,
                          "specificityTest": 0.944547134935305,
                          "recallTest": 0.8461538461538461,
                          "precisionTest": 0.719626168224299,
                          "f1Test": 0.7777777777777778,
                          "thresholdTest": 0.30128437693042176
                      }
                  },
                  "hpoNum": null,
                  "droppedFeatures": [
                      {
                          "name": "DaysSinceLastService",
                          "reason": "has_target_leakage"
                      }
                  ],
                  "samplingRatio": 1,
                  "columns": [
                      "Territory",
                      "DeviceType",
                      "Promotion",
                      "HasRenewed",
                      "PlanType",
                      "BaseFee",
                      "AdditionalFeatureSpend",
                      "NumberOfPenalties",
                      "CurrentPeriodUsage",
                      "PriorPeriodUsage",
                      "ServiceRating",
                      "ServiceTickets",
                      "StartMonth",
                      "StartWeek",
                      "CustomerTenure",
                      "Churned"
                  ],
                  "modelState": "inactive"
              }
          },
          {
              "type": "model",
              "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
              "attributes": {
                  "id": "6953dbe3-c997-4f82-af77-494bda9b1247",
                  "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                  "createdAt": "2024-11-27T13:19:36.391646Z",
                  "updatedAt": "2024-11-27T13:19:44.029329Z",
                  "batchNum": 1,
                  "algorithm": "random_forest_classifier",
                  "name": "v03_RAFC_01_00",
                  "description": null,
                  "seqNum": 0,
                  "algoAbbrv": "RAFC",
                  "status": "ready",
                  "errorMessage": null,
                  "metrics": {
                      "binary": {
                          "truePositive": 275,
                          "falsePositive": 58,
                          "falseNegative": 90,
                          "trueNegative": 2105,
                          "accuracy": 0.9410601265822784,
                          "mcc": 0.753269275386698,
                          "auc": 0.9583315917136904,
                          "logLoss": 0.16340603954795424,
                          "missRate": 0.2465753424657534,
                          "fallout": 0.026814609338881183,
                          "npv": 0.958997722095672,
                          "specificity": 0.9731853906611189,
                          "recall": 0.7534246575342466,
                          "precision": 0.8258258258258259,
                          "f1": 0.7879656160458453,
                          "threshold": 0.4284064499957584,
                          "truePositiveTest": 73,
                          "falsePositiveTest": 16,
                          "falseNegativeTest": 18,
                          "trueNegativeTest": 525,
                          "accuracyTest": 0.9462025316455697,
                          "mccTest": 0.7798157631261908,
                          "aucTest": 0.9696735796550954,
                          "logLossTest": 0.14611503001799023,
                          "missRateTest": 0.1978021978021978,
                          "falloutTest": 0.029574861367837338,
                          "npvTest": 0.9668508287292817,
                          "specificityTest": 0.9704251386321626,
                          "recallTest": 0.8021978021978022,
                          "precisionTest": 0.8202247191011236,
                          "f1Test": 0.8111111111111112,
                          "thresholdTest": 0.4284064499957584
                      }
                  },
                  "hpoNum": null,
                  "droppedFeatures": [
                      {
                          "name": "DaysSinceLastService",
                          "reason": "has_target_leakage"
                      },
                      {
                          "name": "PriorPeriodUsage",
                          "reason": "highly_correlated"
                      },
                      {
                          "name": "Territory",
                          "reason": "feature_with_low_importance"
                      },
                      {
                          "name": "StartMonth",
                          "reason": "feature_with_low_importance"
                      },
                      {
                          "name": "CurrentPeriodUsage",
                          "reason": "feature_with_low_importance"
                      },
                      {
                          "name": "DeviceType",
                          "reason": "feature_with_low_importance"
                      },
                      {
                          "name": "StartWeek",
                          "reason": "feature_with_low_importance"
                      },
                      {
                          "name": "CustomerTenure",
                          "reason": "feature_with_low_importance"
                      }
                  ],
                  "samplingRatio": 1,
                  "columns": [
                      "PlanType",
                      "NumberOfPenalties",
                      "HasRenewed",
                      "BaseFee",
                      "ServiceTickets",
                      "AdditionalFeatureSpend",
                      "ServiceRating",
                      "Promotion"
                  ],
                  "modelState": "inactive"
              }
          },
          {
              "type": "model",
              "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
              "attributes": {
                  "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
                  "experimentVersionId": "eb60cbd9-838f-4d4e-bb1c-18aaa6ad3ccf",
                  "createdAt": "2024-11-27T13:19:20.880715Z",
                  "updatedAt": "2024-11-27T13:19:30.471811Z",
                  "batchNum": 0,
                  "algorithm": "random_forest_classifier",
                  "name": "v03_RAFC_00_00",
                  "description": null,
                  "seqNum": 0,
                  "algoAbbrv": "RAFC",
                  "status": "ready",
                  "errorMessage": null,
                  "metrics": {
                      "binary": {
                          "truePositive": 290,
                          "falsePositive": 76,
                          "falseNegative": 75,
                          "trueNegative": 2087,
                          "accuracy": 0.939873417721519,
                          "mcc": 0.7566444372668605,
                          "auc": 0.9604823336436583,
                          "logLoss": 0.17675936885512278,
                          "missRate": 0.2054794520547945,
                          "fallout": 0.03513638465094776,
                          "npv": 0.9653098982423681,
                          "specificity": 0.9648636153490523,
                          "recall": 0.7945205479452054,
                          "precision": 0.7923497267759563,
                          "f1": 0.7934336525307798,
                          "threshold": 0.36075704789531526,
                          "truePositiveTest": 74,
                          "falsePositiveTest": 24,
                          "falseNegativeTest": 17,
                          "trueNegativeTest": 517,
                          "accuracyTest": 0.935126582278481,
                          "mccTest": 0.7456978462730665,
                          "aucTest": 0.9520220998964067,
                          "logLossTest": 0.18079716644028357,
                          "missRateTest": 0.18681318681318682,
                          "falloutTest": 0.04436229205175601,
                          "npvTest": 0.9681647940074907,
                          "specificityTest": 0.955637707948244,
                          "recallTest": 0.8131868131868132,
                          "precisionTest": 0.7551020408163265,
                          "f1Test": 0.783068783068783,
                          "thresholdTest": 0.36075704789531526
                      }
                  },
                  "hpoNum": null,
                  "droppedFeatures": [
                      {
                          "name": "DaysSinceLastService",
                          "reason": "has_target_leakage"
                      },
                      {
                          "name": "PriorPeriodUsage",
                          "reason": "highly_correlated"
                      }
                  ],
                  "samplingRatio": 1,
                  "columns": [
                      "Territory",
                      "DeviceType",
                      "Promotion",
                      "HasRenewed",
                      "PlanType",
                      "BaseFee",
                      "AdditionalFeatureSpend",
                      "NumberOfPenalties",
                      "CurrentPeriodUsage",
                      "ServiceRating",
                      "ServiceTickets",
                      "StartMonth",
                      "StartWeek",
                      "CustomerTenure",
                      "Churned"
                  ],
                  "modelState": "inactive"
              }
          }
      ]
  }
  ```
</details>

In this example, three models have been generated using the `random_forest_classifier` algorithm.
These models were trained to predict customer churn (binary classification: churned vs. not churned).
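
As a minimal sketch, the same request can be made from Python with only the standard library, and the response reduced to the fields that matter for comparison. The function names `list_models` and `summarize` are illustrative and not part of any Qlik SDK. The `SAMPLE` payload is abridged from the response example above.

```python
import json
import urllib.request


def list_models(tenant, experiment_id, token):
    """Call the list-models endpoint shown above and return the parsed JSON body."""
    url = f"https://{tenant}/api/v1/ml/experiments/{experiment_id}/models"
    request = urllib.request.Request(
        url,
        headers={"Accept": "application/json", "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(payload):
    """Reduce each model to its ID, name, status, F1 (test), and dropped features."""
    rows = []
    for item in payload["data"]:
        attrs = item["attributes"]
        binary = attrs["metrics"]["binary"]
        rows.append({
            "id": item["id"],
            "name": attrs["name"],
            "status": attrs["status"],
            "f1Test": binary["f1Test"],
            "droppedFeatures": [f["name"] for f in attrs["droppedFeatures"]],
        })
    return rows


# Abridged copy of one model from the response example (not a live API call).
SAMPLE = {
    "data": [
        {
            "type": "model",
            "id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1",
            "attributes": {
                "name": "v03_RAFC_00_00",
                "status": "ready",
                "metrics": {"binary": {"f1Test": 0.783068783068783}},
                "droppedFeatures": [
                    {"name": "DaysSinceLastService", "reason": "has_target_leakage"},
                    {"name": "PriorPeriodUsage", "reason": "highly_correlated"},
                ],
            },
        }
    ]
}

for row in summarize(SAMPLE):
    print(row["name"], row["status"], row["f1Test"], row["droppedFeatures"])
```

In a real script, `summarize(list_models(tenant, experiment_id, token))` would replace the `SAMPLE` payload.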

## Compare models

When evaluating models, you should ask yourself the following questions:

- Which model performs best for the business objective?
- How should metrics like accuracy, precision, recall, and F1 score influence the decision?
- What trade-offs exist in terms of feature usage and performance?

For more information, see [Interpreting model scores](https://help.qlik.com/en-US/cloud-services/Subsystems/Hub/Content/Sense_Hub/AutoML/scoring-models.htm)
on Qlik Help.

### Example values

The following table includes performance metrics for all models generated in the previous example:

| **Metric**           | **Model** `v03_RAFC_01_01` | **Model** `v03_RAFC_01_00`                                                                                                               | **Model** `v03_RAFC_00_00`                 |
| -------------------- | -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
| **Accuracy (Test)**  | 93.04%                     | **94.62%**                                                                                                                               | 93.51%                                     |
| **Precision (Test)** | 71.96%                     | **82.02%**                                                                                                                               | 75.51%                                     |
| **Recall (Test)**    | **84.62%**                 | 80.22%                                                                                                                                   | 81.32%                                     |
| **F1 score (Test)**  | 77.78%                     | **81.11%**                                                                                                                               | 78.31%                                     |
| **AUC (Test)**       | 95.41%                     | **96.97%**                                                                                                                               | 95.20%                                     |
| **Log loss (Test)**  | 0.1789                     | **0.1461**                                                                                                                               | 0.1807                                     |
| **Dropped features** | `DaysSinceLastService`     | `DaysSinceLastService`, `PriorPeriodUsage`, `Territory`, `StartMonth`, `CurrentPeriodUsage`, `DeviceType`, `StartWeek`, `CustomerTenure` | `DaysSinceLastService`, `PriorPeriodUsage` |
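
The bold values in the table mark the best result per metric. A short sketch of how that comparison could be automated is shown here. It assumes the test metrics have already been extracted from the response into a dictionary keyed by model name, and `best_per_metric` is a hypothetical helper. Note that log loss is the one metric in the table where lower is better.

```python
# Test metrics copied from the response example above, keyed by model name.
MODELS = {
    "v03_RAFC_01_01": {
        "accuracyTest": 0.930379746835443,
        "precisionTest": 0.719626168224299,
        "recallTest": 0.8461538461538461,
        "f1Test": 0.7777777777777778,
        "aucTest": 0.9540533403749669,
        "logLossTest": 0.1789144736636058,
    },
    "v03_RAFC_01_00": {
        "accuracyTest": 0.9462025316455697,
        "precisionTest": 0.8202247191011236,
        "recallTest": 0.8021978021978022,
        "f1Test": 0.8111111111111112,
        "aucTest": 0.9696735796550954,
        "logLossTest": 0.14611503001799023,
    },
    "v03_RAFC_00_00": {
        "accuracyTest": 0.935126582278481,
        "precisionTest": 0.7551020408163265,
        "recallTest": 0.8131868131868132,
        "f1Test": 0.783068783068783,
        "aucTest": 0.9520220998964067,
        "logLossTest": 0.18079716644028357,
    },
}

# Metrics where a smaller value indicates a better model.
LOWER_IS_BETTER = {"logLossTest"}


def best_per_metric(models):
    """Return a mapping of metric name to the name of the best-scoring model."""
    metric_names = next(iter(models.values())).keys()
    winners = {}
    for metric in metric_names:
        pick = min if metric in LOWER_IS_BETTER else max
        winners[metric] = pick(models, key=lambda name: models[name][metric])
    return winners


winners = best_per_metric(MODELS)
for metric, name in winners.items():
    print(f"{metric}: {name}")
```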

### Key metrics

Before selecting a model, understand what each metric signifies:

| **Metric** | **Description** |
| ---------- | --------------- |
| Accuracy   | Share of all predictions that are correct. |
| Precision  | Share of predicted positives that are correct. Higher values mean fewer false positives (in this example, non-churners incorrectly flagged as churners). |
| Recall     | Share of actual positives that are detected. Higher values mean fewer false negatives (in this example, missed churners). |
| F1 score   | Harmonic mean of precision and recall. |
| AUC        | Measures the ability to distinguish between classes (higher is better). |
| Log loss   | Measures how far predicted probabilities are from the actual outcomes (lower is better). |
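
To see how these metrics relate to the confusion-matrix counts in the response, the following sketch recomputes four of them using the standard textbook definitions. The counts are the `truePositiveTest`, `falsePositiveTest`, `falseNegativeTest`, and `trueNegativeTest` values of `v03_RAFC_01_00`. The function name `binary_metrics` is illustrative, and the sketch assumes that at least one positive is predicted and at least one positive is present (otherwise the divisions fail).

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted churners that actually churned
    recall = tp / (tp + fn)      # share of actual churners that were caught
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }


# Test-set counts of v03_RAFC_01_00 from the response example.
metrics = binary_metrics(tp=73, fp=16, fn=18, tn=525)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the `accuracyTest`, `precisionTest`, `recallTest`, and `f1Test` values that the API reports for this model.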

### Key insights for model selection

Based on the example metrics:

- `v03_RAFC_01_01` has the best recall (84.62%), making it suitable for use cases where identifying as many churners
  as possible is critical, even at the risk of more false positives. Its precision (71.96%) is the lowest of the three.
- `v03_RAFC_01_00` has the highest precision (82.02%) and F1 score (81.11%), as well as the best accuracy, AUC, and
  log loss, making it ideal for minimizing false positives. It also uses the fewest features.
- `v03_RAFC_00_00` has balanced metrics but does not lead on any of them.

For more information about evaluating binary classification models, see
[Scoring binary classification models](https://help.qlik.com/en-US/cloud-services/Subsystems/Hub/Content/Sense_Hub/AutoML/scoring-binary-classification.htm)
on Qlik Help.

### Model selection

Select a model based on your business objective. In this example:

- If catching as many churners as possible is critical, choose the model with the highest recall.
- If avoiding false positives is more important, choose the model with the highest precision.

Based on the evaluation, `v03_RAFC_01_00` is selected for deployment because it has the highest precision, which
aligns with the business objective of minimizing false positives.
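
The selection rule above could be expressed as a small helper, sketched here. The objective labels in `OBJECTIVE_METRIC` and the function `select_model` are hypothetical and only illustrate the mapping from a business objective to the metric that drives the choice. The IDs and metric values are copied from the response example.

```python
# ID, name, status, and two test metrics per model, copied from the response example.
MODELS = [
    {"id": "db823841-c551-4e18-84e3-d3a2a98edd60", "name": "v03_RAFC_01_01",
     "status": "ready", "precisionTest": 0.719626168224299,
     "recallTest": 0.8461538461538461},
    {"id": "6953dbe3-c997-4f82-af77-494bda9b1247", "name": "v03_RAFC_01_00",
     "status": "ready", "precisionTest": 0.8202247191011236,
     "recallTest": 0.8021978021978022},
    {"id": "9fda6cb5-321a-4e3e-b9b9-b640bc8708f1", "name": "v03_RAFC_00_00",
     "status": "ready", "precisionTest": 0.7551020408163265,
     "recallTest": 0.8131868131868132},
]

# Hypothetical objective labels mapped to the metric that best reflects them.
OBJECTIVE_METRIC = {
    "minimize_false_positives": "precisionTest",
    "catch_most_churners": "recallTest",
}


def select_model(models, objective):
    """Return the ready model with the highest value of the objective's metric."""
    metric = OBJECTIVE_METRIC[objective]
    ready = [m for m in models if m["status"] == "ready"]
    return max(ready, key=lambda m: m[metric])


chosen = select_model(MODELS, "minimize_false_positives")
print(chosen["name"], chosen["id"])
```

The `id` of the chosen model is the value to carry into the deployment step.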

## Next step

With the best-performing model identified, [deploy it](https://qlik.dev/embed/machine-learning/tutorial/evaluate-models/deploy-model) and make it available for predictions.