Dimensionality Reduction in DataRobot Using t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful technique for dimensionality reduction that can effectively visualize high-dimensional data in a lower-dimensional space.

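Dimensionality reduction can improve machine learning results by reducing the computational complexity of the algorithms, lowering the risk of overfitting, and focusing on the most relevant features in the dataset. Note that t-SNE is computationally expensive, so it is best applied when the number of features is low; for very high-dimensional data, the scikit-learn documentation recommends first reducing the features with another method such as PCA.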
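When a dataset has many features, one common workaround is to shrink it with PCA before running t-SNE. The sketch below illustrates that idea on synthetic data; the array sizes, the choice of 50 PCA components, and the variable names are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# synthetic stand-in for a wide dataset: 300 rows, 200 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))

# step 1: reduce to a moderate number of dimensions with PCA
X_pca = PCA(n_components=50, random_state=42).fit_transform(X)

# step 2: run t-SNE on the PCA output to get a 2-D embedding
embedding = TSNE(n_components=2, learning_rate=100, random_state=42).fit_transform(X_pca)

print(X_pca.shape, embedding.shape)
```

For a dataset with only a handful of features, such as the one used in this example, the PCA step is unnecessary.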

Import libraries

In [ ]:
import datarobot as dr
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE

Connect to DataRobot

Instructions for obtaining your endpoint and token are available in the DataRobot API documentation.

In [3]:

# either directly pass in your endpoint/token, use a config file, or connect using DataRobot notebooks
dr.Client()

Out[3]:

<datarobot.rest.RESTClientObject at 0x7f5f10312280>
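If you prefer to pass the endpoint and token explicitly, one possible approach is to read them from environment variables and hand them to `dr.Client`. The sketch below is a hedged illustration: the helper name `client_kwargs` is hypothetical, and the default endpoint URL is an assumption that you should replace with the endpoint for your own installation.

```python
import os


def client_kwargs(env=os.environ):
    """Build keyword arguments for dr.Client from environment variables.

    `client_kwargs` is a hypothetical helper, not part of the datarobot package.
    """
    # assumption: fall back to the managed cloud endpoint if none is set
    endpoint = env.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise ValueError("Set DATAROBOT_API_TOKEN before connecting")
    return {"endpoint": endpoint, "token": token}


# usage (requires the datarobot package and valid credentials):
# import datarobot as dr
# dr.Client(**client_kwargs())
```

Keeping the token in an environment variable rather than in the notebook avoids committing credentials alongside the code.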

Get dataset

This example uses data on the movement of a double pendulum. The dataset has already been loaded into DataRobot, and it can also be found here.

In [40]:

# replace the dataset ID with your own data
ds_id = "62fbcdf583b30f0ef972dc31"

# get dataset from DataRobot
ds = dr.Dataset.get(ds_id)
df = ds.get_as_dataframe()
display(df)

Out[40]:

	t	x1	x2	v1	v2	a1	a2
0	0.000000	2.36	3.14	-0.0100	-0.01000	-9.24	6.53
1	0.000862	2.36	3.14	-0.0180	-0.00437	-9.24	6.53
2	0.001720	2.36	3.14	-0.0259	0.00126	-9.24	6.53
3	0.002590	2.36	3.14	-0.0339	0.00689	-9.24	6.53
4	0.003450	2.36	3.14	-0.0418	0.01250	-9.24	6.53
…	…	…	…	…	…	…	…
2424	9.970000	-14.70	-22.40	1.1400	1.82000	6.94	-3.84
2425	9.980000	-14.70	-22.30	1.2000	1.79000	7.04	-3.64
2426	9.980000	-14.70	-22.30	1.2500	1.76000	7.12	-3.42
2427	9.990000	-14.70	-22.30	1.3100	1.73000	7.20	-3.19
2428	10.000000	-14.70	-22.30	1.3700	1.70000	7.28	-2.95

2429 rows × 7 columns

Reduce the number of features in the dataset

In [ ]:

# columns to exclude from the reduction,
# such as target, ID, or time columns
exclude_cols = ["t", "a2"]

# fit t-SNE on the remaining columns; the default n_components=2 yields two output columns
model = TSNE(learning_rate=100, random_state=42)
transformed = model.fit_transform(df.drop(exclude_cols, axis=1))

In [25]:

transformed

Out[25]:

array([[  2.542573 , -80.301025 ],
       [  2.5057044, -80.29103  ],
       [  2.869162 , -80.113396 ],
       ...,
       [  9.5524645,  74.92201  ],
       [  9.630235 ,  74.90384  ],
       [  9.827253 ,  74.67084  ]], dtype=float32)
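t-SNE is built on pairwise distances between rows, so columns with a larger numeric range can dominate the embedding. The sketch below shows one optional variation, standardizing the features before fitting; this step is not in the original notebook, and the synthetic columns, their scales, and the `perplexity` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# synthetic columns on deliberately different scales (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "x1": rng.normal(0, 10, 200),
        "v1": rng.normal(0, 1, 200),
        "a1": rng.normal(0, 5, 200),
    }
)

# rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(demo)

# perplexity roughly controls how many neighbors each point considers
tsne = TSNE(n_components=2, perplexity=30, learning_rate=100, random_state=42)
embedded = tsne.fit_transform(scaled)

# kl_divergence_ is the final value of the t-SNE objective (lower is a closer fit)
print(embedded.shape, tsne.kl_divergence_)
```

Note that scikit-learn's `TSNE` only offers `fit` and `fit_transform`; it cannot project new rows onto an existing embedding, so the whole dataset has to be embedded in one call.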

Create new dataframe with reduced columns and previously excluded columns

In [39]:

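# wrap the t-SNE output in a dataframe
reduced_df = pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"])

# add back the excluded target and time columns from the original dataset
# (rows are matched on the index, which is the default 0..n-1 in both dataframes)
reduced_df = pd.concat([reduced_df, df[exclude_cols]], axis=1)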

display(reduced_df)

Out[39]:

	tsne_x	tsne_y	t	a2
0	2.542573	-80.301025	0.000000	6.53
1	2.505704	-80.291031	0.000862	6.53
2	2.869162	-80.113396	0.001720	6.53
3	2.899721	-80.068108	0.002590	6.53
4	2.924986	-80.020332	0.003450	6.53
…	…	…	…	…
2424	9.658271	74.433037	9.970000	-3.84
2425	9.417135	74.999992	9.980000	-3.64
2426	9.552464	74.922012	9.980000	-3.42
2427	9.630235	74.903839	9.990000	-3.19
2428	9.827253	74.670837	10.000000	-2.95

2429 rows × 4 columns
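A quick scatter plot is a simple way to inspect the reduced data. The sketch below uses seaborn on a small synthetic dataframe with the same column names as `reduced_df`; the data values, the color palette, and the plot title are assumptions for illustration, and in the notebook you would pass `reduced_df` instead of `demo_df`.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# synthetic stand-in with the same columns as reduced_df (illustrative only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "tsne_x": rng.normal(size=100),
        "tsne_y": rng.normal(size=100),
        "t": np.linspace(0, 10, 100),
    }
)

# color each point by time to see how the embedding evolves along the trajectory
ax = sns.scatterplot(data=demo_df, x="tsne_x", y="tsne_y", hue="t", palette="viridis", s=15)
ax.set_title("t-SNE embedding colored by time")
```

Coloring by `t` (or by the target `a2`) makes it easier to judge whether nearby points in the embedding correspond to nearby moments in the original series.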

Upload back to DataRobot

In [42]:


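# upload the reduced dataframe as a new dataset
# note: ds.name already ends in ".csv" here, so the uploaded file name ends in ".csv.csv"
ds = dr.Dataset.create_from_in_memory_data(
    data_frame=reduced_df, fname=f"{ds.name}.csv"
)
# rename the new dataset so it is easy to tell apart from the original
ds.modify(name=f"{ds.name} t-SNE Reduced")
ds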

Out[42]:

Dataset(name='Double Pendulum.csv.csv t-SNE Reduced', id='65a970bc040d9a438cdfb9de')
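The doubled `.csv.csv` in the name above suggests that the original dataset name already ended in `.csv`. If you would rather avoid that, a small helper along the following lines could build both names first; `reduced_dataset_names` is a hypothetical function written for this sketch, not part of the DataRobot client.

```python
from pathlib import Path


def reduced_dataset_names(original_name, suffix="t-SNE Reduced"):
    """Return (file name, display name) without a doubled .csv extension.

    Hypothetical helper for illustration; not part of the datarobot package.
    """
    # strip a trailing .csv only if it is present
    if original_name.lower().endswith(".csv"):
        stem = Path(original_name).stem
    else:
        stem = original_name
    return f"{stem}.csv", f"{stem} {suffix}"


fname, display_name = reduced_dataset_names("Double Pendulum.csv")
print(fname, "|", display_name)
```

The two returned strings could then be passed as `fname` to `dr.Dataset.create_from_in_memory_data` and as `name` to `ds.modify`, respectively.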
