Multiview Datasets

UCI multiple feature dataset

mvlearn.datasets.load_UCImultifeature(select_labeled='all', views='all', shuffle=False, random_state=None)[source]

Load the UCI multiple features dataset 1, taken from the UCI Machine Learning Repository 2 at https://archive.ics.uci.edu/ml/datasets/Multiple+Features. This data set consists of 6 views of handwritten digit images, with classes 0-9. The 6 views are the following:

  1. 76 Fourier coefficients of the character shapes

  2. 216 profile correlations

  3. 64 Karhunen-Loève coefficients

  4. 240 pixel averages of the images from 2x3 windows

  5. 47 Zernike moments

  6. 6 morphological features

Each class contains 200 labeled examples.

Parameters
  • select_labeled (optional, array-like, default='all') -- A list of the class labels (digits 0-9) to load. If not specified, all examples in the dataset are returned. Repeated labels are ignored.

  • views (optional, array-like, shape (n_views,) default (all)) -- A list of the data views that the user would like in the indicated order. If not specified, all data views will be returned. Repeated views are ignored.

  • shuffle (bool, default=False) -- If True, returns each array with its rows and corresponding labels shuffled randomly according to random_state.

  • random_state (int, default=None) -- Determines the order data is shuffled if shuffle=True. Used so that data loaded is reproducible but shuffled.

Returns

  • data (list of np.ndarray, each of shape (200*num_classes, n_features)) -- List with each element being the data for one of the selected views, in the order requested.

  • labels (np.ndarray, shape (200*num_classes,)) -- Array of class labels for each example.

References

1

M. van Breukelen, et al. "Handwritten digit recognition by combined classifiers", Kybernetika, 34(4):381-386, 1998

2

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Examples

>>> from mvlearn.datasets import load_UCImultifeature
>>> # Load 6-view dataset with all 10 classes
>>> mv_data, labels = load_UCImultifeature()
>>> print(len(mv_data))
6
>>> print([mv_data[i].shape for i in range(6)])
[(2000, 76), (2000, 216), (2000, 64), (2000, 240), (2000, 47), (2000, 6)]
>>> print(labels.shape)
(2000,)

Nutrimouse dataset

mvlearn.datasets.load_nutrimouse(return_Xs_y=False)[source]

Load the Nutrimouse dataset 3, a two-view dataset from a nutrition study on mice, as available from https://CRAN.R-project.org/package=CCA 4.

Parameters

return_Xs_y (bool, default=False) -- If True, returns an (Xs, y) tuple of the multiple views and sample labels as strings. Otherwise returns all the data in a dictionary-like sklearn.utils.Bunch object.

Returns

  • dataset (sklearn.utils.Bunch object with the following key: value pairs; see Notes for details) -- Returned if return_Xs_y is False.

    gene : numpy.ndarray, shape (40, 120)

    The gene expressions (1st view).

    lipid : numpy.ndarray, shape (40, 21)

    The fatty acid concentrations (2nd view).

    genotype : numpy.ndarray, shape (40,)

    The genotype label (1st label).

    diet : numpy.ndarray, shape (40,)

    The diet label (2nd label).

    gene_feature_names : list, length 120

    The names of the genes.

    lipid_feature_names : list, length 21

    The names of the fatty acids.

  • (Xs, y) (2-tuple of the multiple views and labels as strings) -- Returned if return_Xs_y is True.

Notes

This data consists of two views from a nutrition study of 40 mice:

  • gene : expressions of 120 potentially relevant genes

  • lipid : concentrations of 21 hepatic fatty acids

Each mouse has two labels, four mice per pair of labels:

  • genotype (2 classes) : wt, ppar

  • diet (5 classes) : REF, COC, SUN, LIN, FISH

References

3

P. Martin, H. Guillou, F. Lasserre, S. Déjean, A. Lan, J-M. Pascussi, M. San Cristobal, P. Legrand, P. Besse, T. Pineau. "Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study." Hepatology, 2007.

4

González I., Déjean S., Martin P.G.P and Baccini, A. (2008) CCA: "An R Package to Extend Canonical Correlation Analysis." Journal of Statistical Software, 23(12).

Examples

>>> from mvlearn.datasets import load_nutrimouse
>>> # Load both views and labels
>>> Xs, y = load_nutrimouse(return_Xs_y=True)
>>> print(len(Xs))
2
>>> print([X.shape for X in Xs])
[(40, 120), (40, 21)]
>>> print(y.shape)
(40, 2)

Data Simulator

mvlearn.datasets.make_gaussian_mixture(n_samples, centers, covariances, transform='linear', noise=None, noise_dims=None, class_probs=None, random_state=None, shuffle=False, shuffle_random_state=None, seed=1, return_latents=False)[source]

Creates a two-view dataset from a Gaussian mixture model and a transformation.

Parameters
  • n_samples (int) -- The number of points in each view, divided across Gaussians per class_probs.

  • centers (1D array-like or list of 1D array-likes) -- The mean(s) of the Gaussian(s) from which the latent points are sampled. If a list of 1D array-likes, each is the mean of a distinct Gaussian, sampled from with probability given by class_probs. Otherwise, it is the mean of a single Gaussian from which all points are sampled.

  • covariances (2D array-like or list of 2D array-likes) -- The covariance matrix (or matrices) of the Gaussian(s), matched to the specified centers.

  • transform ('linear' | 'sin' | 'poly' | callable, default='linear') -- Transformation to perform on the latent variable. If a function, applies it to the latent samples. Otherwise uses an implemented function.

  • noise (double or None (default=None)) -- Variance of mean zero Gaussian noise added to the first view

  • noise_dims (int or None (default=None)) -- Number of additional dimensions of standard normal noise to add

  • class_probs (array-like, default=None) -- A list of probabilities specifying the probability of a latent point being sampled from each of the Gaussians. Must sum to 1. If None, then is taken to be uniform over the Gaussians.

  • random_state (int, default=None) -- If set, can be used to reproduce the data generated.

  • shuffle (bool, default=False) -- If True, data is shuffled so the labels are not ordered.

  • shuffle_random_state (int, default=None) -- If given, then sets the random state for shuffling the samples. Ignored if shuffle=False.

  • return_latents (bool, default=False) -- If True, returns the non-noisy latent variables.

Returns

  • Xs (list of np.ndarray, each shape (n_samples, n_features)) -- The latent data and its noisy transformation

  • y (np.ndarray, shape (n_samples,)) -- The integer labels for each sample's Gaussian membership

  • latents (np.ndarray, shape (n_samples, n_features)) -- The non-noisy latent variables. Only returned if return_latents=True.

Notes

For each class \(i\) with prior probability \(p_i\), center and covariance matrix \(\mu_i\) and \(\Sigma_i\), and \(n\) total samples, the latent data is sampled such that:

\[X_1, \dots, X_{np_i} \overset{i.i.d.}{\sim} \mathcal{N}(\mu_i, \Sigma_i)\]

Two views of data are returned, the first being the latent samples and the second being a specified transformation of the latent samples. Additional noise may be added to the first view or added as noise dimensions to both views.

Examples

>>> from mvlearn.datasets import make_gaussian_mixture
>>> import numpy as np
>>> n_samples = 10
>>> centers = [[0,1], [0,-1]]
>>> covariances = [np.eye(2), np.eye(2)]
>>> Xs, y = make_gaussian_mixture(n_samples, centers, covariances,
...                               shuffle=True, shuffle_random_state=42)
>>> print(y)
[1. 0. 1. 0. 1. 0. 1. 0. 0. 1.]

Factor Model

mvlearn.datasets.sample_joint_factor_model(n_views, n_samples, n_features, joint_rank=1, noise_std=1, m=1.5, random_state=None, return_decomp=False)[source]

Samples from a low rank, joint factor model where there is one set of shared scores.

Parameters
  • n_views (int) -- Number of views to sample

  • n_samples (int) -- Number of samples in each view

  • n_features (int, or list of ints) -- Number of features in each view. A list specifies a different number of features for each view.

  • joint_rank (int (default 1)) -- Rank of the common signal across views.

  • noise_std (float (default 1)) -- Scale of noise distribution.

  • m (float (default 1.5)) -- Signal strength.

  • random_state (int or RandomState instance, optional (default=None)) -- Controls random orthonormal matrix sampling and random noise generation. Set for reproducible results.

  • return_decomp (boolean, default=False) -- If True, returns the view_loadings as well.

Returns

  • Xs (list of array-likes or numpy.ndarray) --

    • Xs length: n_views

    • Xs[i] shape: (n_samples, n_features_i)

    List of sampled data matrices

  • U ((n_samples, joint_rank)) -- The true orthonormal joint scores matrix. Returned if return_decomp is True.

  • view_loadings (list of numpy.ndarray) -- The true view loadings matrices. Returned if return_decomp is True.

Notes

For b = 1, ..., B:

X_b = U @ diag(svals) @ W_b^T + noise_std * E_b

where U and each W_b are orthonormal matrices and E_b is a noise matrix with i.i.d. standard normal entries. The singular values svals are linearly increasing, following (Choi et al., 2017), section 2.2.3.