Page 373 - Python Data Science Handbook

Figure 5-18. The handwritten digits data; each sample is represented by one 8×8 grid of
pixels

In order to work with this data within Scikit-Learn, we need a two-dimensional,
[n_samples, n_features] representation. We can accomplish this by treating each
pixel in the image as a feature—that is, by flattening out the pixel arrays so that we
have a length-64 array of pixel values representing each digit. Additionally, we need
the target array, which gives the previously determined label for each digit. These two
quantities are built into the digits dataset under the data and target attributes,
respectively:

In[24]: X = digits.data
        X.shape
Out[24]: (1797, 64)

In[25]: y = digits.target
        y.shape
Out[25]: (1797,)

We see here that there are 1,797 samples and 64 features.
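
To make the flattening step concrete, here is a minimal sketch of how a stack of 8×8 pixel grids maps onto a [n_samples, n_features] array. The `images` array below is a small random stand-in (an assumption for illustration, not the actual digits data), so only the shapes and the index mapping carry over:

```python
import numpy as np

# Hypothetical stand-in for a stack of digit images: 5 samples, each 8x8 pixels,
# with integer intensities from 0 to 16. This is NOT the real digits data.
rng = np.random.default_rng(0)
images = rng.integers(0, 17, size=(5, 8, 8))

# Flatten each 8x8 grid into one length-64 row:
# one row per sample, one column per pixel.
X = images.reshape(len(images), -1)

print(images.shape)  # (5, 8, 8)
print(X.shape)       # (5, 64)

# With NumPy's default row-major order, the pixel at (row r, column c)
# ends up in feature column r * 8 + c.
assert X[0, 3 * 8 + 5] == images[0, 3, 5]
```

Reshaping any row back with `X[i].reshape(8, 8)` recovers the original grid, so the flattened representation loses no pixel information.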

Unsupervised learning: Dimensionality reduction
We’d like to visualize our points within the 64-dimensional parameter space, but it’s
difficult to visualize points effectively in such a high-dimensional space. Instead, we’ll
reduce the dimensions to 2, using an unsupervised method. Here, we’ll make use of a
