Multiview Datasets

UCI multiple feature dataset

mvlearn.datasets.load_UCImultifeature(select_labeled='all', views='all', shuffle=False, random_state=None)[source]

Load the UCI multiple features dataset 1, taken from the UCI Machine Learning Repository 2 at https://archive.ics.uci.edu/ml/datasets/Multiple+Features. This data set consists of 6 views of handwritten digit images, with classes 0-9. The 6 views are the following:

  1. 76 Fourier coefficients of the character shapes

  2. 216 profile correlations

  3. 64 Karhunen-Loève coefficients

  4. 240 pixel averages of the images from 2x3 windows

  5. 47 Zernike moments

  6. 6 morphological features

Each class contains 200 labeled examples.

Parameters
  • select_labeled (optional, array-like, default='all') -- A list of the class labels (digits 0-9) to load. If not specified, all examples in the dataset are returned. Repeated labels are ignored.

  • views (optional, array-like, shape (n_views,) default (all)) -- A list of the data views that the user would like in the indicated order. If not specified, all data views will be returned. Repeated views are ignored.

  • shuffle (bool, default=False) -- If True, returns each array with its rows and corresponding labels shuffled randomly according to random_state.

  • random_state (int, default=None) -- Determines the order data is shuffled if shuffle=True. Used so that data loaded is reproducible but shuffled.

Returns

  • data (list of np.ndarray, each of shape (200*num_classes, n_features)) -- List with each element being the data for one of the selected views, in the order requested.

  • labels (np.ndarray, shape (200*num_classes,)) -- Array of class labels for each example.

References

1

M. van Breukelen, et al. "Handwritten digit recognition by combined classifiers", Kybernetika, 34(4):381-386, 1998

2

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Examples

>>> from mvlearn.datasets import load_UCImultifeature
>>> # Load 6-view dataset with all 10 classes
>>> mv_data, labels = load_UCImultifeature()
>>> print(len(mv_data))
6
>>> print([mv_data[i].shape for i in range(6)])
[(2000, 76), (2000, 216), (2000, 64), (2000, 240), (2000, 47), (2000, 6)]
>>> print(labels.shape)
(2000,)

Nutrimouse dataset

mvlearn.datasets.load_nutrimouse(return_Xs_y=False)[source]

Load the Nutrimouse dataset 3, a two-view dataset from a nutrition study on mice, as available from https://CRAN.R-project.org/package=CCA 4.

Parameters

return_Xs_y (bool, default=False) -- If True, returns an (Xs, y) tuple of the multiple views and sample labels as strings. Otherwise returns all the data in a dictionary-like sklearn.utils.Bunch object.

Returns

  • dataset (sklearn.utils.Bunch object with the following key: value pairs; see Notes for details) -- Returned if return_Xs_y is False.

    gene : numpy.ndarray, shape (40, 120)

    The gene expressions (1st view).

    lipid : numpy.ndarray, shape (40, 21)

    The fatty acid concentrations (2nd view).

    genotype : numpy.ndarray, shape (40,)

    The genotype label (1st label).

    diet : numpy.ndarray, shape (40,)

    The diet label (2nd label).

    gene_feature_names : list, length 120

    The names of the genes.

    lipid_feature_names : list, length 21

    The names of the fatty acids.

  • (Xs, y) (2-tuple of the multiple views and labels as strings) -- Returned if return_Xs_y is True.

Notes

This data consists of two views from a nutrition study of 40 mice:

  • gene : expressions of 120 potentially relevant genes

  • lipid : concentrations of 21 hepatic fatty acids

Each mouse has two labels, four mice per pair of labels:

  • genotype (2 classes) : wt, ppar

  • diet (5 classes) : REF, COC, SUN, LIN, FISH

References

3

P. Martin, H. Guillou, F. Lasserre, S. Déjean, A. Lan, J-M. Pascussi, M. San Cristobal, P. Legrand, P. Besse, T. Pineau. "Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study." Hepatology, 2007.

4

González I., Déjean S., Martin P.G.P and Baccini, A. (2008) CCA: "An R Package to Extend Canonical Correlation Analysis." Journal of Statistical Software, 23(12).

Examples

>>> from mvlearn.datasets import load_nutrimouse
>>> # Load both views and labels
>>> Xs, y = load_nutrimouse(return_Xs_y=True)
>>> print(len(Xs))
2
>>> print([X.shape for X in Xs])
[(40, 120), (40, 21)]
>>> print(y.shape)
(40, 2)

Data Simulator

mvlearn.datasets.make_gaussian_mixture(n_samples, centers, covariances, transform='linear', noise=None, noise_dims=None, class_probs=None, random_state=None, shuffle=False, shuffle_random_state=None, seed=1, return_latents=False)[source]

Creates a two-view dataset from a Gaussian mixture model and a transformation.

Parameters
  • n_samples (int) -- The number of points in each view, divided across Gaussians per class_probs.

  • centers (1D array-like or list of 1D array-likes) -- The mean(s) of the Gaussian(s) from which the latent points are sampled. If a list of 1D array-likes, each is the mean of a distinct Gaussian, sampled from with probability given by class_probs. Otherwise, it is the mean of a single Gaussian from which all points are sampled.

  • covariances (2D array-like or list of 2D array-likes) -- The covariance matrix (or matrices) of the Gaussian(s), matched to the specified centers.

  • transform ('linear' | 'sin' | 'poly' | callable, default='linear') -- Transformation to perform on the latent variable. If a function, applies it to the latent samples. Otherwise uses an implemented function.

  • noise (double or None (default=None)) -- Variance of mean zero Gaussian noise added to the first view

  • noise_dims (int or None (default=None)) -- Number of additional dimensions of standard normal noise to add

  • class_probs (array-like, default=None) -- A list of probabilities specifying the probability of a latent point being sampled from each of the Gaussians. Must sum to 1. If None, then is taken to be uniform over the Gaussians.

  • random_state (int, default=None) -- If set, can be used to reproduce the data generated.

  • shuffle (bool, default=False) -- If True, data is shuffled so the labels are not ordered.

  • shuffle_random_state (int, default=None) -- If given, then sets the random state for shuffling the samples. Ignored if shuffle=False.

  • return_latents (bool, default=False) -- If True, returns the non-noisy latent variables.

Returns

  • Xs (list of np.ndarray, each shape (n_samples, n_features)) -- The latent data and its noisy transformation

  • y (np.ndarray, shape (n_samples,)) -- The integer labels for each sample's Gaussian membership

  • latents (np.ndarray, shape (n_samples, n_features)) -- The non-noisy latent variables. Only returned if return_latents=True.

Notes

For each class \(i\) with prior probability \(p_i\), center and covariance matrix \(\mu_i\) and \(\Sigma_i\), and \(n\) total samples, the latent data is sampled such that:

\[X_1, \dots, X_{np_i} \overset{i.i.d.}{\sim} \mathcal{N}(\mu_i, \Sigma_i)\]

Two views of data are returned, the first being the latent samples and the second being a specified transformation of the latent samples. Additional noise may be added to the first view or added as noise dimensions to both views.

Examples

>>> from mvlearn.datasets import make_gaussian_mixture
>>> import numpy as np
>>> n_samples = 10
>>> centers = [[0,1], [0,-1]]
>>> covariances = [np.eye(2), np.eye(2)]
>>> Xs, y = make_gaussian_mixture(n_samples, centers, covariances,
...                               shuffle=True, shuffle_random_state=42)
>>> print(y)
[1. 0. 1. 0. 1. 0. 1. 0. 0. 1.]

Factor Model

mvlearn.datasets.sample_joint_factor_model(n_views, n_samples, n_features, joint_rank=1, noise_std=1, m=1.5, random_state=None, return_decomp=False)[source]

Samples from a low rank, joint factor model where there is one set of shared scores.

Parameters
  • n_views (int) -- Number of views to sample

  • n_samples (int) -- Number of samples in each view

  • n_features (int, or list of ints) -- Number of features in each view. A list specifies a different number of features for each view.

  • joint_rank (int (default 1)) -- Rank of the common signal across views.

  • noise_std (float (default 1)) -- Scale of noise distribution.

  • m (float (default 1.5)) -- Signal strength.

  • random_state (int or RandomState instance, optional (default=None)) -- Controls random orthonormal matrix sampling and random noise generation. Set for reproducible results.

  • return_decomp (boolean, default=False) -- If True, returns the view_loadings as well.

Returns

  • Xs (list of array-likes or numpy.ndarray) --

    • Xs length: n_views

    • Xs[i] shape: (n_samples, n_features_i)

    List of sampled data matrices

  • U ((n_samples, joint_rank)) -- The true orthonormal joint scores matrix. Returned if return_decomp is True.

  • view_loadings (list of numpy.ndarray) -- The true view loadings matrices. Returned if return_decomp is True.

Notes

For b = 1, ..., B:

X_b = U @ diag(svals) @ W_b^T + noise_std * E_b

where U and each W_b are orthonormal matrices and E_b is a noise matrix with i.i.d. standard normal entries. The singular values svals are linearly increasing, following (Choi et al., 2017), section 2.2.3.