Multiview Datasets¶
UCI multiple feature dataset¶
- mvlearn.datasets.load_UCImultifeature(select_labeled='all', views='all', shuffle=False, random_state=None)[source]¶
Load the UCI multiple features dataset 1, taken from the UCI Machine Learning Repository 2 at https://archive.ics.uci.edu/ml/datasets/Multiple+Features. This data set consists of 6 views of handwritten digit images, with classes 0-9. The 6 views are the following:
76 Fourier coefficients of the character shapes
216 profile correlations
64 Karhunen-Loève coefficients
240 pixel averages of the images from 2x3 windows
47 Zernike moments
6 morphological features
Each class contains 200 labeled examples.
- Parameters
select_labeled (optional, array-like, default (all)) -- A list of the class labels (0-9) whose examples the user wants. If not specified, all examples in the dataset are returned. Repeated labels are ignored.
views (optional, array-like, shape (n_views,) default (all)) -- A list of the data views that the user would like in the indicated order. If not specified, all data views will be returned. Repeated views are ignored.
shuffle (bool, default=False) -- If True, returns each array with its rows and corresponding labels shuffled randomly according to random_state.
random_state (int, default=None) -- Determines the order data is shuffled if shuffle=True. Used so that data loaded is reproducible but shuffled.
- Returns
data (list of np.ndarray, each of size (200*num_classes, n_features)) -- List of length 6 with each element being the data for one of the views.
labels (np.ndarray, shape (200*num_classes,)) -- Array of digit labels corresponding to the rows of each data view.
References
- 1: M. van Breukelen, et al. "Handwritten digit recognition by combined classifiers", Kybernetika, 34(4):381-386, 1998.
- 2: Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
Examples
>>> from mvlearn.datasets import load_UCImultifeature
>>> # Load 6-view dataset with all 10 classes
>>> mv_data, labels = load_UCImultifeature()
>>> print(len(mv_data))
6
>>> print([mv_data[i].shape for i in range(6)])
[(2000, 76), (2000, 216), (2000, 64), (2000, 240), (2000, 47), (2000, 6)]
>>> print(labels.shape)
(2000,)
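The select_labeled and views arguments can be combined to load a subset of the data. The following is a minimal sketch, assuming the views keep the order listed above and that each selected class contributes its 200 examples:
>>> # Load only digits 0-2 from the Fourier (view 0) and Karhunen-Loève (view 2) views
>>> mv_subset, sub_labels = load_UCImultifeature(select_labeled=[0, 1, 2],
...                                              views=[0, 2])
>>> print([X.shape for X in mv_subset])
[(600, 76), (600, 64)]
>>> print(sub_labels.shape)
(600,)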
Nutrimouse dataset¶
- mvlearn.datasets.load_nutrimouse(return_Xs_y=False)[source]¶
Load the Nutrimouse dataset 3, a two-view dataset from a nutrition study on mice, as available from https://CRAN.R-project.org/package=CCA 4.
- Parameters
return_Xs_y (bool, default=False) -- If True, returns an (Xs, y) tuple of the multiple views and sample labels as strings. Otherwise returns all the data in a dictionary-like sklearn.utils.Bunch object.
- Returns
dataset (sklearn.utils.Bunch object with the following key: value pairs) -- (see Notes for details). Returned if return_Xs_y is False.
- gene : numpy.ndarray, shape (40, 120)
The gene expressions (1st view).
- lipid : numpy.ndarray, shape (40, 21)
The fatty acid concentrations (2nd view).
- genotype : numpy.ndarray, shape (40,)
The genotype label (1st label).
- diet : numpy.ndarray, shape (40,)
The diet label (2nd label).
- gene_feature_names : list, length 120
The names of the genes.
- lipid_feature_names : list, length 21
The names of the fatty acids.
(Xs, y) (2-tuple of the multiple views and labels as strings) -- Returned if return_Xs_y is True.
Notes
This data consists of two views from a nutrition study of 40 mice:
gene : expressions of 120 potentially relevant genes
lipid : concentrations of 21 hepatic fatty acids
Each mouse has two labels, four mice per pair of labels:
genotype (2 classes) : wt, ppar
diet (5 classes) : REF, COC, SUN, LIN, FISH
References
- 3: P. Martin, H. Guillou, F. Lasserre, S. Déjean, A. Lan, J-M. Pascussi, M. San Cristobal, P. Legrand, P. Besse, T. Pineau. "Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study." Hepatology, 2007.
- 4: González I., Déjean S., Martin P.G.P. and Baccini A. (2008). CCA: "An R Package to Extend Canonical Correlation Analysis." Journal of Statistical Software, 23(12).
Examples
>>> from mvlearn.datasets import load_nutrimouse
>>> # Load both views and labels
>>> Xs, y = load_nutrimouse(return_Xs_y=True)
>>> print(len(Xs))
2
>>> print([X.shape for X in Xs])
[(40, 120), (40, 21)]
>>> print(y.shape)
(40, 2)
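With the default return_Xs_y=False, the loader instead returns the dictionary-like Bunch. The following is a brief sketch assuming the keys documented in the Returns section above:
>>> # Load the full Bunch and inspect the documented keys
>>> dataset = load_nutrimouse()
>>> print(dataset['gene'].shape, dataset['lipid'].shape)
(40, 120) (40, 21)
>>> print(len(dataset['gene_feature_names']), len(dataset['lipid_feature_names']))
120 21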
Data Simulator¶
- mvlearn.datasets.make_gaussian_mixture(n_samples, centers, covariances, transform='linear', noise=None, noise_dims=None, class_probs=None, random_state=None, shuffle=False, shuffle_random_state=None, seed=1, return_latents=False)[source]¶
Creates a two-view dataset from a Gaussian mixture model and a transformation.
- Parameters
n_samples (int) -- The number of points in each view, divided across Gaussians per class_probs.
centers (1D array-like or list of 1D array-likes) -- The mean(s) of the Gaussian(s) from which the latent points are sampled. If a list of 1D array-likes is given, each is the mean of a distinct Gaussian, sampled from with probability given by class_probs. Otherwise, it is the mean of a single Gaussian from which all points are sampled.
covariances (2D array-like or list of 2D array-likes) -- The covariance matrix (or matrices) of the Gaussian(s), matched to the specified centers.
transform ('linear' | 'sin' | 'poly' | callable, (default 'linear')) -- Transformation to perform on the latent variable. If a function, applies it to the latent. Otherwise uses an implemented function.
noise (double or None (default=None)) -- Variance of mean-zero Gaussian noise added to the first view.
noise_dims (int or None (default=None)) -- Number of additional dimensions of standard normal noise to add.
class_probs (array-like, default=None) -- A list of probabilities specifying the probability of a latent point being sampled from each of the Gaussians. Must sum to 1. If None, then is taken to be uniform over the Gaussians.
random_state (int, default=None) -- If set, can be used to reproduce the data generated.
shuffle (bool, default=False) -- If True, data is shuffled so the labels are not ordered.
shuffle_random_state (int, default=None) -- If given, sets the random state for shuffling the samples. Ignored if shuffle=False.
return_latents (boolean (default False)) -- If True, returns the non-noisy latent variables.
- Returns
Xs (list of np.ndarray, each shape (n_samples, n_features)) -- The latent data and its noisy transformation
y (np.ndarray, shape (n_samples,)) -- The integer labels for each sample's Gaussian membership
latents (np.ndarray, shape (n_samples, n_features)) -- The non-noisy latent variables. Only returned if return_latents=True.
Notes
For each class \(i\) with prior probability \(p_i\), center and covariance matrix \(\mu_i\) and \(\Sigma_i\), and \(n\) total samples, the latent data is sampled such that:
\[(X_1, y_1), \dots, (X_{np_i}, y_{np_i}) \overset{i.i.d.}{\sim} \mathcal{N}(\mu_i, \Sigma_i)\]
Two views of data are returned, the first being the latent samples and the second being a specified transformation of the latent samples. Additional noise may be added to the first view or added as noise dimensions to both views.
Examples
>>> from mvlearn.datasets import make_gaussian_mixture
>>> import numpy as np
>>> n_samples = 10
>>> centers = [[0, 1], [0, -1]]
>>> covariances = [np.eye(2), np.eye(2)]
>>> Xs, y = make_gaussian_mixture(n_samples, centers, covariances,
...                               shuffle=True, shuffle_random_state=42)
>>> print(y)
[1. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
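The latent samples can also be requested directly via return_latents. The sketch below assumes the return order (Xs, y, latents) given in the Returns section and that, with no noise added, both views and the latents share the documented (n_samples, n_features) shape:
>>> # Apply the built-in polynomial transform and keep the non-noisy latents
>>> Xs, y, latents = make_gaussian_mixture(n_samples, centers, covariances,
...                                        transform='poly', return_latents=True,
...                                        random_state=42)
>>> print([X.shape for X in Xs])
[(10, 2), (10, 2)]
>>> print(latents.shape)
(10, 2)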
Factor Model¶
- mvlearn.datasets.sample_joint_factor_model(n_views, n_samples, n_features, joint_rank=1, noise_std=1, m=1.5, random_state=None, return_decomp=False)[source]¶
Samples from a low rank, joint factor model where there is one set of shared scores.
- Parameters
n_views (int) -- Number of views to sample
n_samples (int) -- Number of samples in each view
n_features (int, or list of ints) -- Number of features in each view. A list specifies a different number of features for each view.
joint_rank (int (default 1)) -- Rank of the common signal across views.
noise_std (float (default 1)) -- Scale of noise distribution.
m (float (default 1.5)) -- Signal strength.
random_state (int or RandomState instance, optional (default=None)) -- Controls random orthonormal matrix sampling and random noise generation. Set for reproducible results.
return_decomp (boolean, default=False) -- If True, returns the view_loadings as well.
- Returns
Xs (list of array-likes or numpy.ndarray) --
Xs length: n_views
Xs[i] shape: (n_samples, n_features_i)
List of sampled data matrices
U ((n_samples, joint_rank)) -- The true orthonormal joint scores matrix. Returned if return_decomp is True.
view_loadings (list of numpy.ndarray) -- The true view loadings matrices. Returned if return_decomp is True.
Notes
- For b = 1, ..., B
X_b = U @ diag(svals) @ W_b^T + noise_std * E_b
where U and each W_b are orthonormal matrices. The singular values are linearly increasing following (Choi et al. 2017) section 2.2.3.
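Examples
A minimal usage sketch based on the documented signature and return shapes, assuming return_decomp=True yields (Xs, U, view_loadings) in that order and that each view loadings matrix W_b has shape (n_features_b, joint_rank) per the model above:
>>> from mvlearn.datasets import sample_joint_factor_model
>>> # Three views with different feature counts sharing a rank-2 joint signal
>>> Xs, U, view_loadings = sample_joint_factor_model(
...     n_views=3, n_samples=20, n_features=[6, 5, 4],
...     joint_rank=2, return_decomp=True, random_state=0)
>>> print([X.shape for X in Xs])
[(20, 6), (20, 5), (20, 4)]
>>> print(U.shape)
(20, 2)
>>> print([W.shape for W in view_loadings])
[(6, 2), (5, 2), (4, 2)]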