DataRobot PartnersUnify your AI stack with our open platform, extend your cloud investments, and connect with service providers to help you build, deploy, or migrate to DataRobot.
This AI Accelerator shows how to extract cluster insights from DataRobot models, use prompt engineering to label clusters and then rename the clusters in the DataRobot project.
DataRobot has democratized unsupervised learning by allowing you to run clustering on datasets with minimal effort and risk with guardrails and smart visualizations. You can build segmentation models on any kind of dataset they can conjure. Once the segments/clusters are built, DataRobot provides cluster level insights which can be used to understand each cluster/segment further. Generally, cluster/segment analysis requires a subject matter expert to be able to understand and label the segment. However, with the advancement of Generative AI, users without subject matter expertise can leverage LLMs to provide last mile analytics on clustering or segmentation models.
This accelerator shows how you can use cluster insights provided by DataRobot with ChatGPT to provide business- or domain-specific labels to the clusters using OpenAI and DataRobot APIs.
Out[3]:
<datarobot.rest.RESTClientObject at 0x7fac77e4e9d0>
Prompt completion
This demo uses Openai’s chatGPT but the approach can be used on similar LLM models. The prompt structure and completion functions are inspired from Andrew Ng’s course on Prompt Engineering.
Out[10]:
'Cluster 1: liveborn, outcome, delivery, mellitus, anomalies, mother, accompanying, healthy, person,Average time_in_hospital is 3.0102249488752557, Average number_diagnoses is 4.875937286980232; Cluster 2: esophagus, acromegaly, gigantism, inguinal, dementia, senile, cysts, amblyopia, bleeding,Average time_in_hospital is 3.8177290836653386, Average number_diagnoses is 7.886454183266932; Cluster 3: postmyocardial, insomnia, organic, coronary, native, vessel, postmyocardial, syndrome, infarction,Average time_in_hospital is 4.021739130434782, Average number_diagnoses is 6.911764705882353; Cluster 4: mellitus, stated, ii, inertia, uterine, nontoxic, carcinoma, cystic, deformity,Average time_in_hospital is 3.371548117154812, Average number_diagnoses is 5.773221757322176; Cluster 5: iv, stage, through, alkaloids, opium, disseminated, pelvis, hereditary, infiltrative,Average time_in_hospital is 4.708683473389356, Average number_diagnoses is 7.927170868347339; Cluster 6: 19, adult, body, pharynx, thyroiditis, head, skull, effect, body,Average time_in_hospital is 3.8707627118644066, Average number_diagnoses is 7.46045197740113; Cluster 7: thrombosis, epilepsy, involvement, aortic, anomaly, empyema, rheumatic, alveolar, hypertrophic,Average time_in_hospital is 8.433709449929479, Average number_diagnoses is 8.25176304654443; '
Prompt function
The below functions uses the Cluster Summary and requests LLM model to label the clusters. The parameter “label_type” can be used to tweak the flavor of the cluster labels.
In[12]:
def get_cluster_names(cluster_info, label_types="human friendly and descriptive"):
prompt = (
'you are an business analyst. You have run a clustering model and following text in double quotes shows the cluster level values."'
+ cluster_info
+ "”. Please provide "
+ label_types
+ " cluster names for each cluster. Output format is json with fields cluster description, cluster name, cluster id."
)
response = get_completion(prompt)
return prompt, response
Demo
Using the project built on the 10K Diabetes dataset, label the same clusters for different audiences.
Out[14]:
[{'cluster_description': 'This cluster includes cases related to liveborn outcomes, delivery, maternal diabetes, anomalies, and accompanying healthy individuals.',
'cluster_name': 'Maternal and Neonatal Health',
'cluster_id': 1},
{'cluster_description': 'This cluster includes cases related to esophageal disorders, acromegaly, gigantism, inguinal conditions, dementia, cysts, amblyopia, and bleeding.',
'cluster_name': 'Rare Disorders',
'cluster_id': 2},
{'cluster_description': 'This cluster includes cases related to postmyocardial conditions, insomnia, organic disorders, coronary issues, native vessel problems, and infarction.',
'cluster_name': 'Cardiovascular Health',
'cluster_id': 3},
{'cluster_description': 'This cluster includes cases related to diabetes mellitus, uterine issues, nontoxic carcinoma, cystic deformities, and hormonal inertia.',
'cluster_name': 'Endocrine Disorders',
'cluster_id': 4},
{'cluster_description': 'This cluster includes cases related to advanced stage diseases, opium alkaloid use, disseminated conditions, and hereditary infiltrative disorders.',
'cluster_name': 'Advanced Stage Diseases',
'cluster_id': 5},
{'cluster_description': 'This cluster includes cases related to adult body conditions, pharyngeal disorders, thyroiditis, head and skull issues, and their effects on the body.',
'cluster_name': 'Head and Body Disorders',
'cluster_id': 6},
{'cluster_description': 'This cluster includes cases related to thrombosis, epilepsy, aortic involvement, alveolar anomalies, empyema, and hypertrophic conditions.',
'cluster_name': 'Vascular and Neurological Disorders',
'cluster_id': 7}]
Update clusters
Using the DataRobot API, you can update the cluster names in the UI automatically, reducing manual effort.
Warning: Ensure that the Cluster name matches when updating.
In[15]:
cluster_name_mappings = [
("Cluster " + str(cluster["cluster_id"]), cluster["cluster_name"])
for cluster in cluster_json["clusters"]
]
model.update_cluster_names(cluster_name_mappings)
Out[15]:
[Cluster(name=Maternal and Neonatal Health, percent=16.3),
Cluster(name=Rare Disorders, percent=22.31111111111111),
Cluster(name=Cardiovascular Health, percent=8.688888888888888),
Cluster(name=Endocrine Disorders, percent=13.277777777777779),
Cluster(name=Advanced Stage Diseases, percent=7.933333333333334),
Cluster(name=Head and Body Disorders, percent=15.733333333333333),
Cluster(name=Vascular and Neurological Disorders, percent=15.755555555555556)]
Conclusion
This is how you can use Cluster Insights provided by DataRobot and use ChatGPT to provide business- or domain-specific labels to the clusters using OpenAI and DataRobot APIs. You also saw how you can customized the prompting to be able to tailor the cluster labels to suit the end users.
Get Started with Free Trial
Experience new features and capabilities previously only available in our full AI Platform product.