Step 2: Profile your dataset
Profiling your dataset helps ensure it is suitable for machine learning. This process generates insights about features, highlights potential issues, and identifies features that align with your experiment goals.
Retrieve dataset information
Use the following call to the Items API to find all datasets in your tenant:
curl -L "https://<TENANT>/api/v1/items?resourceType=dataset" ^
-H "Authorization: Bearer <ACCESS_TOKEN>"
Locate your dataset in the response and save its resourceId (for example, 6749ddb893296645bb4bd795). You’ll need this ID to profile the dataset.
Response example
"data": [
{
"name": "AutoML Example - Churn data - training.csv",
"spaceId": "6745f737f536738170dfe82f",
"resourceAttributes": {
"appType": "QIX-DF",
"dataStoreName": "DataFilesStore",
"dataStoreType": "qix-datafiles",
"qri": "qdf:qix-datafiles:GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR:sid@6745f737f536738170dfe82f:AutoML Example - Churn data - training.csv",
"secureQri": "qri:qdf:space://cxT3ijCQNKeVFuqux1p1C7SLfDe7crQa8-BAbZvXXDA#UitfNqc2PShxZs6qPzkllSp7PQ37AQYCZri7OIdcLgo",
"sourceSystemId": "QIX-DF_88b30e9b-5738-4c9b-bd4e-dd2c2d3b542d",
"technicalDescription": "",
"technicalName": "AutoML Example - Churn data - training.csv",
"type": "DELIMETED",
"version": "1"
},
"resourceCustomAttributes": null,
"resourceUpdatedAt": "2024-11-26T16:29:55Z",
"resourceType": "dataset",
"resourceSubType": "qix-df",
"resourceId": "6745f77ebf86ce46d48ed34f",
"resourceCreatedAt": "2024-11-26T16:29:50Z",
"id": "6745f77ef10262fdb0a50c55",
"createdAt": "2024-11-26T16:29:50Z",
"updatedAt": "2024-11-26T16:29:55Z",
"creatorId": "62d58b0a6cdb747267985bb2",
"updaterId": "62d58b0a6cdb747267985bb2",
"tenantId": "GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR",
"isFavorited": false,
"links": {
"self": {
"href": "https://tenant.us.qlik.com/api/v1/items/6745f77ef10262fdb0a50c55"
},
"collections": {
"href": "https://tenant.us.qlik.com/api/v1/items/6745f77ef10262fdb0a50c55/collections"
}
},
"actions": [
"create",
"delete",
"list",
"profile",
"read",
"update"
],
"collectionIds": [],
"meta": {
"isFavorited": false,
"actions": [
"create",
"delete",
"list",
"profile",
"read",
"update"
],
"tags": [],
"collections": []
},
"ownerId": "62d58b0a6cdb747267985bb2",
"resourceReloadEndTime": "",
"resourceReloadStatus": "",
"resourceSize": {
"appFile": 0,
"appMemory": 0
},
"itemViews": {}
}
]
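If you handle the Items API response in a script rather than reading it by eye, the lookup can be sketched as follows. This is a minimal illustration in Python, not part of the API: the helper name find_dataset_resource_id is hypothetical, and the sample data is a trimmed copy of the response example above. Note that it returns resourceId, not the item's own id.

```python
from typing import Optional


def find_dataset_resource_id(items_response: dict, dataset_name: str) -> Optional[str]:
    """Return the resourceId of the first dataset item whose name matches.

    Hypothetical helper: it only inspects an already-parsed response body.
    """
    for item in items_response.get("data", []):
        if item.get("resourceType") == "dataset" and item.get("name") == dataset_name:
            return item.get("resourceId")
    return None


# Trimmed copy of the response example; only the fields used here are kept.
sample_response = {
    "data": [
        {
            "name": "AutoML Example - Churn data - training.csv",
            "resourceType": "dataset",
            "resourceId": "6745f77ebf86ce46d48ed34f",
            "id": "6745f77ef10262fdb0a50c55",
        }
    ]
}

dataset_id = find_dataset_resource_id(
    sample_response, "AutoML Example - Churn data - training.csv"
)
print(dataset_id)
```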
Profile your dataset
Use the following call to generate insights about your dataset, identifying features suitable for training and highlighting potential issues.
Include the resourceId obtained previously as the dataSetId in the request body:
curl -L -X POST "https://<TENANT>/api/v1/ml/profile-insights" ^
-H "Content-Type: application/json" ^
-H "Accept: application/json" ^
-H "Authorization: Bearer <ACCESS_TOKEN>" ^
-d "{
\"data\": {
\"type\": \"profile-insights\",
\"attributes\": {
\"dataSetId\": \"<DATASET_ID>\"
}
}
}"
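Escaping the quotes in the request body by hand is easy to get wrong. As an alternative, the same body can be generated with a JSON serializer. The sketch below uses Python's standard json module; the helper name build_profile_request_body is an assumption for illustration, and the structure simply mirrors the body shown in the curl call.

```python
import json


def build_profile_request_body(dataset_id: str) -> str:
    """Serialize the profile-insights request body for a dataset ID.

    Hypothetical helper: the structure mirrors the curl example in this step.
    """
    payload = {
        "data": {
            "type": "profile-insights",
            "attributes": {"dataSetId": dataset_id},
        }
    }
    return json.dumps(payload)


# Example dataset ID taken from this tutorial.
body = build_profile_request_body("6749ddb893296645bb4bd795")
print(body)
```

One way to use the output is to save it to a file and pass that file to curl with -d @<file>, which avoids shell quoting entirely.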
The response includes feature-level insights.
The example response below contains these key insights and flags:
- high_cardinality: The feature has many unique values.
- will_be_one_hot_encoded: The feature will be one-hot encoded during preprocessing.
- will_be_impact_encoded: The feature will use impact encoding for categorical data.
- willBeDropped: When true, the feature will be excluded from the training process.
- cannotBeTarget: When true, the feature isn’t suitable as a target variable.
The experimentTypes property lists suitable experiment types for each feature.
The response also includes a defaultVersionConfig property, which serves as a template for creating experiment versions. This property provides a feature list with data types, inclusion settings, and other configuration for downstream processes. You can copy, edit, and use this in the body of the POST /ml/experiments/{experimentId}/versions request to create an experiment version.
In this example, the experimentMode returned is set to intelligent. Intelligent model optimization automatically refines models through iterations. For more information, see Intelligent model optimization on Qlik Help.
Response example
{
"data": {
"type": "profile-insights",
"id": "6749ddb893296645bb4bd795",
"attributes": {
"tenantId": "GIlHILBfb5R6drAY2L7Zvi2c_YnlFDHR",
"ownerId": "67475097984561d02f0cb3dc",
"status": "ready",
"insights": [
{
"name": "AccountID",
"experimentTypes": [],
"insights": [
"high_cardinality",
"will_be_impact_encoded",
"valid_index"
],
"willBeDropped": true,
"cannotBeTarget": true
},
{
"name": "Territory",
"experimentTypes": [],
"insights": [
"will_be_impact_encoded"
],
"willBeDropped": false,
"cannotBeTarget": true
},
{
"name": "Country",
"experimentTypes": [],
"insights": [
"constant",
"will_be_one_hot_encoded"
],
"willBeDropped": true,
"cannotBeTarget": true
},
{
"name": "DeviceType",
"experimentTypes": [
"multiclass"
],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "Promotion",
"experimentTypes": [
"binary"
],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "HasRenewed",
"experimentTypes": [
"binary"
],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "PlanType",
"experimentTypes": [
"multiclass"
],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "BaseFee",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "AdditionalFeatureSpend",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "NumberOfPenalties",
"experimentTypes": [
"multiclass"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "CurrentPeriodUsage",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "PriorPeriodUsage",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "DaysSinceLastService",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "ServiceRating",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "ServiceTickets",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "StartMonth",
"experimentTypes": [],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": true
},
{
"name": "StartWeek",
"experimentTypes": [],
"insights": [
"will_be_impact_encoded"
],
"willBeDropped": false,
"cannotBeTarget": true
},
{
"name": "CustomerTenure",
"experimentTypes": [
"regression"
],
"insights": [],
"willBeDropped": false,
"cannotBeTarget": false
},
{
"name": "Churned",
"experimentTypes": [
"binary"
],
"insights": [
"will_be_one_hot_encoded"
],
"willBeDropped": false,
"cannotBeTarget": false
}
],
"defaultVersionConfig": {
"name": "2024-12-10T15:08:06.570Z",
"datasetOrigin": "new",
"dataSetId": "6749ddb893296645bb4bd795",
"experimentMode": "intelligent",
"featuresList": [
{
"name": "AccountID",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "Territory",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "Country",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "DeviceType",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "Promotion",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "HasRenewed",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "PlanType",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "BaseFee",
"dataType": "DOUBLE",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "AdditionalFeatureSpend",
"dataType": "INTEGER",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "NumberOfPenalties",
"dataType": "INTEGER",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "CurrentPeriodUsage",
"dataType": "DOUBLE",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "PriorPeriodUsage",
"dataType": "DOUBLE",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "DaysSinceLastService",
"dataType": "INTEGER",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "ServiceRating",
"dataType": "DOUBLE",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "ServiceTickets",
"dataType": "INTEGER",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "StartMonth",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "StartWeek",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "CustomerTenure",
"dataType": "INTEGER",
"include": true,
"featureType": "numeric",
"changeType": null
},
{
"name": "Churned",
"dataType": "STRING",
"include": true,
"featureType": "numeric",
"changeType": null
}
]
}
}
}
}
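A response of this size is easier to review once it is condensed. The following Python sketch groups feature names by insight tag and lists the features flagged with willBeDropped. The function name summarize_insights is hypothetical, and the sample is a four-feature excerpt of the response above.

```python
from collections import defaultdict


def summarize_insights(profile_response: dict) -> dict:
    """Group feature names by insight tag and collect dropped features."""
    features = profile_response["data"]["attributes"]["insights"]
    by_tag = defaultdict(list)
    for feature in features:
        for tag in feature.get("insights", []):
            by_tag[tag].append(feature["name"])
    dropped = [f["name"] for f in features if f.get("willBeDropped")]
    return {"by_tag": dict(by_tag), "dropped": dropped}


# Excerpt of the response example; only the fields used here are kept.
sample_profile = {
    "data": {
        "attributes": {
            "insights": [
                {"name": "AccountID", "willBeDropped": True,
                 "insights": ["high_cardinality", "will_be_impact_encoded", "valid_index"]},
                {"name": "Country", "willBeDropped": True,
                 "insights": ["constant", "will_be_one_hot_encoded"]},
                {"name": "BaseFee", "willBeDropped": False, "insights": []},
                {"name": "Churned", "willBeDropped": False,
                 "insights": ["will_be_one_hot_encoded"]},
            ]
        }
    }
}

summary = summarize_insights(sample_profile)
print(summary["dropped"])
print(summary["by_tag"])
```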
Select the target feature
Use the insights property from the profile insights response to select a target feature. A suitable target feature:
- Has "cannotBeTarget": false.
- Isn’t flagged as unsuitable (for example, high_cardinality).
- Represents a meaningful and actionable objective.
In the example response, the Churned column is a valid target because:
- "cannotBeTarget": false shows it can be used as the target.
- Predicting churn is actionable for improving customer retention.
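The first two criteria can be checked mechanically; the third (whether the objective is meaningful) remains a judgment call. As a rough sketch in Python, the filter below keeps features with cannotBeTarget set to false that carry none of the tags treated as unsuitable. The function name candidate_targets and the default tag set (only high_cardinality, the one example named above) are assumptions for illustration.

```python
def candidate_targets(features: list, unsuitable_tags=frozenset({"high_cardinality"})) -> dict:
    """Map each possible target feature to its supported experiment types."""
    candidates = {}
    for feature in features:
        if feature.get("cannotBeTarget"):
            continue  # explicitly ruled out by the profiler
        if unsuitable_tags & set(feature.get("insights", [])):
            continue  # carries a tag we treat as unsuitable (assumption)
        candidates[feature["name"]] = feature.get("experimentTypes", [])
    return candidates


# Excerpt of the insights array from the example response.
sample_features = [
    {"name": "AccountID", "experimentTypes": [], "cannotBeTarget": True,
     "insights": ["high_cardinality", "will_be_impact_encoded", "valid_index"]},
    {"name": "Territory", "experimentTypes": [], "cannotBeTarget": True,
     "insights": ["will_be_impact_encoded"]},
    {"name": "BaseFee", "experimentTypes": ["regression"], "cannotBeTarget": False,
     "insights": []},
    {"name": "Churned", "experimentTypes": ["binary"], "cannotBeTarget": False,
     "insights": ["will_be_one_hot_encoded"]},
]

targets = candidate_targets(sample_features)
print(targets)
```

For this excerpt, the result keeps BaseFee (regression) and Churned (binary). Choosing Churned from those candidates is the part that depends on your experiment goal.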
Next step
Profiling your dataset provides everything you need to start creating an experiment version:
- Feature insights that help you select which features to include or exclude.
- A basis for selecting a meaningful target feature.
- Experiment types (experimentTypes) for the selected target feature.
- A defaultVersionConfig template that simplifies the creation of an experiment version.
With this, you can now create your experiment versions and train models.
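As a starting point for that next step, the defaultVersionConfig template can be copied and adjusted before use. The Python sketch below works on a copy of the template and sets include to false for every feature that the profile flags with willBeDropped. Whether the API needs this adjustment is an assumption (the source only says the template can be copied and edited), and the function name prepare_version_config is hypothetical. The sketch does not add a target or send any request.

```python
import copy


def prepare_version_config(attributes: dict) -> dict:
    """Return an edited copy of defaultVersionConfig; the input is not modified."""
    config = copy.deepcopy(attributes["defaultVersionConfig"])
    dropped = {f["name"] for f in attributes["insights"] if f.get("willBeDropped")}
    for feature in config["featuresList"]:
        if feature["name"] in dropped:
            feature["include"] = False  # assumption: exclude features the profiler drops
    return config


# Excerpt of data.attributes from the profile insights response.
sample_attributes = {
    "insights": [
        {"name": "AccountID", "willBeDropped": True},
        {"name": "Churned", "willBeDropped": False},
    ],
    "defaultVersionConfig": {
        "experimentMode": "intelligent",
        "featuresList": [
            {"name": "AccountID", "dataType": "STRING", "include": True},
            {"name": "Churned", "dataType": "STRING", "include": True},
        ],
    },
}

version_config = prepare_version_config(sample_attributes)
print([(f["name"], f["include"]) for f in version_config["featuresList"]])
```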