Step 2: Profile your dataset

Profiling your dataset helps ensure it is suitable for machine learning. This process generates insights about features, highlights potential issues, and identifies features that align with your experiment goals.

Retrieve dataset information

Use the following call to the Items API to find all datasets in your tenant:

curl -L "https://<TENANT>/api/v1/items?resourceType=dataset" ^
-H "Authorization: Bearer <ACCESS_TOKEN>"

Locate your dataset in the response and save its resourceId (for example, 6745f77ebf86ce46d48ed34f in the response example below). You’ll need this ID to profile the dataset.

Response example

"data": [
        {
            "name": "AutoML Example - Churn data - training.csv",
            "spaceId": "6745f737f536738170dfe82f",
            "resourceAttributes": {
                "appType": "QIX-DF",
                "dataStoreName": "DataFilesStore",
                "dataStoreType": "qix-datafiles",
                "qri": "qdf:qix-datafiles:GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR:sid@6745f737f536738170dfe82f:AutoML Example - Churn data - training.csv",
                "secureQri": "qri:qdf:space://cxT3ijCQNKeVFuqux1p1C7SLfDe7crQa8-BAbZvXXDA#UitfNqc2PShxZs6qPzkllSp7PQ37AQYCZri7OIdcLgo",
                "sourceSystemId": "QIX-DF_88b30e9b-5738-4c9b-bd4e-dd2c2d3b542d",
                "technicalDescription": "",
                "technicalName": "AutoML Example - Churn data - training.csv",
                "type": "DELIMETED",
                "version": "1"
            },
            "resourceCustomAttributes": null,
            "resourceUpdatedAt": "2024-11-26T16:29:55Z",
            "resourceType": "dataset",
            "resourceSubType": "qix-df",
            "resourceId": "6745f77ebf86ce46d48ed34f",
            "resourceCreatedAt": "2024-11-26T16:29:50Z",
            "id": "6745f77ef10262fdb0a50c55",
            "createdAt": "2024-11-26T16:29:50Z",
            "updatedAt": "2024-11-26T16:29:55Z",
            "creatorId": "62d58b0a6cdb747267985bb2",
            "updaterId": "62d58b0a6cdb747267985bb2",
            "tenantId": "GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR",
            "isFavorited": false,
            "links": {
                "self": {
                    "href": "https://tenant.us.qlik.com/api/v1/items/6745f77ef10262fdb0a50c55"
                },
                "collections": {
                    "href": "https://tenant.us.qlik.com/api/v1/items/6745f77ef10262fdb0a50c55/collections"
                }
            },
            "actions": [
                "create",
                "delete",
                "list",
                "profile",
                "read",
                "update"
            ],
            "collectionIds": [],
            "meta": {
                "isFavorited": false,
                "actions": [
                    "create",
                    "delete",
                    "list",
                    "profile",
                    "read",
                    "update"
                ],
                "tags": [],
                "collections": []
            },
            "ownerId": "62d58b0a6cdb747267985bb2",
            "resourceReloadEndTime": "",
            "resourceReloadStatus": "",
            "resourceSize": {
                "appFile": 0,
                "appMemory": 0
            },
            "itemViews": {}
        }
        ]
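
Picking the resourceId out of this response can be scripted. The following is a minimal sketch, assuming the response body has already been parsed into a Python dictionary with the top-level data list shown above. The function name find_dataset_resource_id is hypothetical and not part of any Qlik library.

```python
from typing import Optional


def find_dataset_resource_id(items_response: dict, dataset_name: str) -> Optional[str]:
    """Return the resourceId of the first dataset item whose name matches."""
    for item in items_response.get("data", []):
        # Only consider dataset items, in case the query was not filtered.
        if item.get("resourceType") == "dataset" and item.get("name") == dataset_name:
            return item.get("resourceId")
    return None


# Trimmed copy of the response example above.
example_response = {
    "data": [
        {
            "name": "AutoML Example - Churn data - training.csv",
            "resourceType": "dataset",
            "resourceId": "6745f77ebf86ce46d48ed34f",
            "id": "6745f77ef10262fdb0a50c55",
        }
    ]
}

print(find_dataset_resource_id(example_response, "AutoML Example - Churn data - training.csv"))
```

Note that the item's id and its resourceId are different values. The profiling request uses the resourceId.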

Profile your dataset

Use the following call to generate insights about your dataset, identifying features suitable for training and highlighting potential issues.

Include the resourceId obtained previously as the dataSetId in the request body:

curl -L -X POST "https://<TENANT>/api/v1/ml/profile-insights" ^
-H "Content-Type: application/json" ^
-H "Accept: application/json" ^
-H "Authorization: Bearer <ACCESS_TOKEN>" ^
-d "{
    \"data\": {
        \"type\": \"profile-insights\",
        \"attributes\": {
            \"dataSetId\": \"<DATASET_ID>\"
        }
    }
}"
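
The same request can also be prepared from Python. This is a minimal sketch using only the standard library. It builds the request object without sending it. The helper name build_profile_request and the tenant, token, and dataset ID values passed to it are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_profile_request(tenant: str, access_token: str, dataset_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for profile insights."""
    body = {
        "data": {
            "type": "profile-insights",
            "attributes": {"dataSetId": dataset_id},
        }
    }
    return urllib.request.Request(
        url=f"https://{tenant}/api/v1/ml/profile-insights",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


# Placeholder values; replace them with your tenant, token, and resourceId.
request = build_profile_request("tenant.us.qlik.com", "ACCESS_TOKEN", "DATASET_ID")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen and parsing the body with json.load should return a response shaped like the example further down, provided the token and dataset ID are valid.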

The response includes feature-level insights.

The example response below contains the following insights and flags:

  • high_cardinality: The feature has many unique values.
  • will_be_one_hot_encoded: The feature will be one-hot encoded during preprocessing.
  • will_be_impact_encoded: The categorical feature will be impact encoded during preprocessing.
  • willBeDropped: When true, the feature will be excluded from the training process.
  • cannotBeTarget: When true, the feature isn’t suitable as a target variable.

The experimentTypes property lists suitable experiment types for each feature.
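
With many features, it can help to summarize these flags programmatically. The sketch below assumes the profile insights response has been parsed into a dictionary. It groups feature names by whether they will be dropped and by the experiment types they suit. The function name summarize_insights is hypothetical, and the sample is a trimmed excerpt of this page's response example.

```python
from collections import defaultdict


def summarize_insights(profile_response: dict) -> dict:
    """Group feature names by dropped status and by suitable experiment type."""
    features = profile_response["data"]["attributes"]["insights"]
    dropped = []
    by_type = defaultdict(list)
    for feature in features:
        if feature.get("willBeDropped"):
            dropped.append(feature["name"])
        for experiment_type in feature.get("experimentTypes", []):
            by_type[experiment_type].append(feature["name"])
    return {"dropped": dropped, "by_experiment_type": dict(by_type)}


# Trimmed excerpt of the response example on this page.
sample_profile = {
    "data": {
        "attributes": {
            "insights": [
                {"name": "AccountID", "experimentTypes": [], "willBeDropped": True},
                {"name": "Country", "experimentTypes": [], "willBeDropped": True},
                {"name": "DeviceType", "experimentTypes": ["multiclass"], "willBeDropped": False},
                {"name": "BaseFee", "experimentTypes": ["regression"], "willBeDropped": False},
                {"name": "Churned", "experimentTypes": ["binary"], "willBeDropped": False},
            ]
        }
    }
}

summary = summarize_insights(sample_profile)
print(summary["dropped"])
print(summary["by_experiment_type"])
```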

The response also includes a defaultVersionConfig property, which serves as a template for creating experiment versions. This property provides a feature list with data types, inclusion settings, and other configuration for downstream processes. You can copy, edit, and use this in the body of the POST /ml/experiments/{experimentId}/versions request to create an experiment version.
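
As an illustration of the copy-and-edit step, the sketch below copies defaultVersionConfig and sets include to false for every feature that the insights mark with willBeDropped. Whether the API requires dropped features to be excluded explicitly is an assumption, so treat this as an example of editing the template rather than a required step. The function name exclude_dropped_features is hypothetical, and the full body of the versions request is not shown here.

```python
import copy


def exclude_dropped_features(profile_response: dict) -> dict:
    """Return an edited copy of defaultVersionConfig; the input is left unchanged."""
    attributes = profile_response["data"]["attributes"]
    dropped = {f["name"] for f in attributes["insights"] if f.get("willBeDropped")}
    config = copy.deepcopy(attributes["defaultVersionConfig"])
    for feature in config["featuresList"]:
        if feature["name"] in dropped:
            feature["include"] = False
    return config


# Trimmed excerpt of the response example on this page.
sample_profile = {
    "data": {
        "attributes": {
            "insights": [
                {"name": "AccountID", "willBeDropped": True},
                {"name": "Churned", "willBeDropped": False},
            ],
            "defaultVersionConfig": {
                "dataSetId": "6749ddb893296645bb4bd795",
                "experimentMode": "intelligent",
                "featuresList": [
                    {"name": "AccountID", "dataType": "STRING", "include": True},
                    {"name": "Churned", "dataType": "STRING", "include": True},
                ],
            },
        }
    }
}

edited_config = exclude_dropped_features(sample_profile)
for feature in edited_config["featuresList"]:
    print(feature["name"], feature["include"])
```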

In this example, the experimentMode returned is set to intelligent. Intelligent model optimization automatically refines models through iterations. For more information, see Intelligent model optimization on Qlik Help.

Response example

{
  "data": {
      "type": "profile-insights",
      "id": "6749ddb893296645bb4bd795",
      "attributes": {
          "tenantId": "GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR",
          "ownerId": "67475097984561d02f0cb3dc",
          "status": "ready",
          "insights": [
              {
                  "name": "AccountID",
                  "experimentTypes": [],
                  "insights": [
                      "high_cardinality",
                      "will_be_impact_encoded",
                      "valid_index"
                  ],
                  "willBeDropped": true,
                  "cannotBeTarget": true
              },
              {
                  "name": "Territory",
                  "experimentTypes": [],
                  "insights": [
                      "will_be_impact_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": true
              },
              {
                  "name": "Country",
                  "experimentTypes": [],
                  "insights": [
                      "constant",
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": true,
                  "cannotBeTarget": true
              },
              {
                  "name": "DeviceType",
                  "experimentTypes": [
                      "multiclass"
                  ],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "Promotion",
                  "experimentTypes": [
                      "binary"
                  ],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "HasRenewed",
                  "experimentTypes": [
                      "binary"
                  ],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "PlanType",
                  "experimentTypes": [
                      "multiclass"
                  ],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "BaseFee",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "AdditionalFeatureSpend",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "NumberOfPenalties",
                  "experimentTypes": [
                      "multiclass"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "CurrentPeriodUsage",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "PriorPeriodUsage",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "DaysSinceLastService",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "ServiceRating",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "ServiceTickets",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "StartMonth",
                  "experimentTypes": [],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": true
              },
              {
                  "name": "StartWeek",
                  "experimentTypes": [],
                  "insights": [
                      "will_be_impact_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": true
              },
              {
                  "name": "CustomerTenure",
                  "experimentTypes": [
                      "regression"
                  ],
                  "insights": [],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              },
              {
                  "name": "Churned",
                  "experimentTypes": [
                      "binary"
                  ],
                  "insights": [
                      "will_be_one_hot_encoded"
                  ],
                  "willBeDropped": false,
                  "cannotBeTarget": false
              }
          ],
          "defaultVersionConfig": {
              "name": "2024-12-10T15:08:06.570Z",
              "datasetOrigin": "new",
              "dataSetId": "6749ddb893296645bb4bd795",
              "experimentMode": "intelligent",
              "featuresList": [
                  {
                      "name": "AccountID",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "Territory",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "Country",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "DeviceType",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "Promotion",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "HasRenewed",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "PlanType",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "BaseFee",
                      "dataType": "DOUBLE",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "AdditionalFeatureSpend",
                      "dataType": "INTEGER",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "NumberOfPenalties",
                      "dataType": "INTEGER",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "CurrentPeriodUsage",
                      "dataType": "DOUBLE",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "PriorPeriodUsage",
                      "dataType": "DOUBLE",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "DaysSinceLastService",
                      "dataType": "INTEGER",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "ServiceRating",
                      "dataType": "DOUBLE",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "ServiceTickets",
                      "dataType": "INTEGER",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "StartMonth",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "StartWeek",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "CustomerTenure",
                      "dataType": "INTEGER",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  },
                  {
                      "name": "Churned",
                      "dataType": "STRING",
                      "include": true,
                      "featureType": "numeric",
                      "changeType": null
                  }
              ]
          }
      }
  }
}

Select the target feature

Use the insights property from the profile insights response to select a target feature. A suitable target feature:

  • Has "cannotBeTarget": false.
  • Isn’t flagged with an insight that makes it unsuitable (for example, high_cardinality).
  • Represents a meaningful and actionable objective.

In the example response, the Churned column is a valid target because:

  • "cannotBeTarget": false shows it can be used as the target.
  • Predicting churn is actionable for improving customer retention.
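
The first two checks can be automated. The sketch below assumes the profile insights response has been parsed into a dictionary. It lists every feature that has "cannotBeTarget": false and no high_cardinality insight, together with its experiment types. The function name target_candidates and the unsuitable parameter are hypothetical. Deciding which candidate represents a meaningful objective still requires your own judgment.

```python
def target_candidates(profile_response: dict, unsuitable=("high_cardinality",)) -> list:
    """List (name, experimentTypes) for features that pass the target checks."""
    candidates = []
    for feature in profile_response["data"]["attributes"]["insights"]:
        # Treat a missing flag as "cannot be target" to stay on the safe side.
        if feature.get("cannotBeTarget", True):
            continue
        if any(flag in feature.get("insights", []) for flag in unsuitable):
            continue
        candidates.append((feature["name"], feature.get("experimentTypes", [])))
    return candidates


# Trimmed excerpt of the response example on this page.
sample_profile = {
    "data": {
        "attributes": {
            "insights": [
                {
                    "name": "AccountID",
                    "experimentTypes": [],
                    "insights": ["high_cardinality", "will_be_impact_encoded", "valid_index"],
                    "cannotBeTarget": True,
                },
                {
                    "name": "BaseFee",
                    "experimentTypes": ["regression"],
                    "insights": [],
                    "cannotBeTarget": False,
                },
                {
                    "name": "Churned",
                    "experimentTypes": ["binary"],
                    "insights": ["will_be_one_hot_encoded"],
                    "cannotBeTarget": False,
                },
            ]
        }
    }
}

candidates = target_candidates(sample_profile)
for name, experiment_types in candidates:
    print(name, experiment_types)
```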

Next step

Profiling your dataset provides everything you need to start creating an experiment version:

  • Feature insights that help you select which features to include or exclude.
  • A basis for selecting a meaningful target feature.
  • Experiment types (experimentTypes) for the selected target feature.
  • A defaultVersionConfig template that simplifies the creation of an experiment version.

With this, you can now create your experiment versions and train models.
