Semi-Supervised

Cotraining Classifier

class mvlearn.semi_supervised.CTClassifier(estimator1=None, estimator2=None, p=None, n=None, unlabeled_pool_size=75, num_iter=50, random_state=None)[source]

This class implements the co-training classifier for supervised and semi-supervised learning with the framework described in [1]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [1]. However, it can also be successful when a single matrix of input data is given as both views and two estimators are chosen which are quite different [2]. See the examples below.

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the classifier has fit well.

Parameters
  • estimator1 (classifier object, (default=sklearn GaussianNB)) -- The classifier object which will be trained on view 1 of the data. This classifier should support the predict_proba() function so that classification probabilities can be computed and co-training can be performed effectively.

  • estimator2 (classifier object, (default=sklearn GaussianNB)) -- The classifier object which will be trained on view 2 of the data. Does not need to be of the same type as estimator1, but should support predict_proba().

  • p (int, optional (default=None)) -- The number of positive classifications from the unlabeled_pool training set which will be given a positive "label". If None, the default is the floor of the ratio of positive to negative examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, positive is defined as 1, and in general, positive is the larger label.

  • n (int, optional (default=None)) -- The number of negative classifications from the unlabeled_pool training set which will be given a negative "label". If None, the default is the floor of the ratio of negative to positive examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, negative is defined as 0, and in general, negative is the smaller label.

  • unlabeled_pool_size (int, optional (default=75)) -- The number of unlabeled_pool samples which will be kept in a separate pool for classification and selection by the updated classifier at each training iteration.

  • num_iter (int, optional (default=50)) -- The maximum number of training iterations to run.

  • random_state (int (default=None)) -- The starting random seed for fit() and class operations, passed to numpy.random.seed().

estimator1_

The classifier used on view 1.

Type

classifier object

estimator2_

The classifier used on view 2.

Type

classifier object

class_name_

The name of the class.

Type

string

p_

The number of positive classifications from the unlabeled_pool training set which will be given a positive "label" each round.

Type

int

n_

The number of negative classifications from the unlabeled_pool training set which will be given a negative "label" each round.

Type

int

classes_

Unique class labels.

Type

array-like of shape (n_classes,)

Examples

>>> # Supervised learning of single-view data with 2 distinct estimators
>>> from mvlearn.semi_supervised import CTClassifier
>>> from mvlearn.datasets import load_UCImultifeature
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split
>>> data, labels = load_UCImultifeature(select_labeled=[0,1])
>>> X1 = data[0]  # Only using the first view
>>> X1_train, X1_test, l_train, l_test = train_test_split(X1, labels)
>>> # Supervised learning with a single view of data and 2 estimator types
>>> estimator1 = GaussianNB()
>>> estimator2 = RandomForestClassifier()
>>> ctc = CTClassifier(estimator1, estimator2, random_state=1)
>>> # Use the same matrix for each view
>>> ctc = ctc.fit([X1_train, X1_train], l_train)
>>> preds = ctc.predict([X1_test, X1_test])
>>> print("Accuracy: ", sum(preds==l_test) / len(preds))
Accuracy:  0.97
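
A brief sketch of the semi-supervised case, assuming unlabeled training samples are marked with np.nan in the label vector (the convention used in the CTRegressor example below). The split and the choice of p and n are purely illustrative, and performance should be checked on the held-out portion as recommended above.

>>> # Semi-supervised sketch: two views, most training labels hidden
>>> X2 = data[1]  # second view of the same samples
>>> X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(
...     X1, X2, labels, random_state=1)
>>> y_semi = y_tr.astype(float)
>>> y_semi[50:] = np.nan  # treat most training labels as unknown
>>> ctc_semi = CTClassifier(p=2, n=2, random_state=1)  # default GaussianNB per view
>>> ctc_semi = ctc_semi.fit([X1_tr, X2_tr], y_semi)
>>> semi_acc = np.mean(ctc_semi.predict([X1_te, X2_te]) == y_te)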

Notes

Multi-view co-training is most helpful for tasks in semi-supervised learning where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good classification results even when the number of unlabeled samples far exceeds the number of labeled samples. This classifier uses two classifiers which work individually on each view but share information, resulting in improved performance over treating the views completely separately or even concatenating the views to get more features in a single-view setting. The classifier can be initialized with or without the classifiers desired for each view being specified; if the classifier for a certain view is specified, it must support a predict_proba() method so that the algorithm can determine which of the training samples it is most confident about during training epochs. The algorithm, as first proposed by Blum and Mitchell [1], is described in detail below.

Algorithm

Given:

  • a set L of labeled training samples (with 2 views)

  • a set U of unlabeled samples (with 2 views)

Create a pool U' of examples by choosing u examples at random from U

Loop for k iterations

  • Use L to train a classifier h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])

  • Use L to train a classifier h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])

  • Allow h1 to label p (self.p_) positive and n (self.n_) negative samples from view 1 of U'

  • Allow h2 to label p positive and n negative samples from view 2 of U'

  • Add these self-labeled samples to L

  • Randomly take 2p + 2n samples from U to replenish U'
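
A minimal sketch of this loop, assuming binary 0/1 labels, NumPy arrays for both views, and two generic scikit-learn classifiers that implement predict_proba(). It is illustrative only and does not reproduce CTClassifier's internal implementation (ties and duplicate picks between the two classifiers are not handled).

import numpy as np


def cotrain(h1, h2, X1_lab, X2_lab, y, X1_un, X2_un,
            p=1, n=1, pool_size=75, num_iter=50, seed=0):
    """Illustrative Blum-Mitchell co-training loop for binary 0/1 labels."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X1_un))          # U, visited in random order
    pool, rest = list(order[:pool_size]), list(order[pool_size:])
    for _ in range(num_iter):
        if not pool:
            break
        h1.fit(X1_lab, y)                        # train h1 on view 1 of L
        h2.fit(X2_lab, y)                        # train h2 on view 2 of L
        picked, pseudo = [], []
        for h, X_un in ((h1, X1_un), (h2, X2_un)):
            proba = h.predict_proba(X_un[pool])[:, 1]
            ranked = np.argsort(proba)
            for j in ranked[-p:]:                # p most confident positives
                picked.append(pool[j])
                pseudo.append(1)
            for j in ranked[:n]:                 # n most confident negatives
                picked.append(pool[j])
                pseudo.append(0)
        # Move the self-labeled samples from U' into L.
        X1_lab = np.vstack([X1_lab, X1_un[picked]])
        X2_lab = np.vstack([X2_lab, X2_un[picked]])
        y = np.concatenate([y, pseudo])
        pool = [q for q in pool if q not in picked]
        # Replenish U' with 2p + 2n fresh samples from U.
        pool += rest[:2 * p + 2 * n]
        rest = rest[2 * p + 2 * n:]
    return h1.fit(X1_lab, y), h2.fit(X2_lab, y)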

References

[1]

Blum, A., and Mitchell, T. "Combining labeled and unlabeled data with co-training." In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

[2]

Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Cotraining Regressor

class mvlearn.semi_supervised.CTRegressor(estimator1=None, estimator2=None, k_neighbors=5, unlabeled_pool_size=50, num_iter=100, random_state=None)[source]

This class implements co-training regression for supervised and semi-supervised learning with the framework described in [3]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [3]. However, it can also be successful when a single matrix of input data is given as both views and two estimators are chosen which are quite different [4].

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the regression model has fit well.

Parameters
  • estimator1 (sklearn object, (only supports KNeighborsRegressor)) -- The regressor object which will be trained on view 1 of the data.

  • estimator2 (sklearn object, (only supports KNeighborsRegressor)) -- The regressor object which will be trained on view 2 of the data.

  • k_neighbors (int, optional (default = 5)) -- The number of neighbors to be considered for determining the mean squared error.

  • unlabeled_pool_size (int, optional (default = 50)) -- The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.

  • num_iter (int, optional (default = 100)) -- The maximum number of training iterations to be performed.

  • random_state (int (default = None)) -- The starting random seed for fit() and other class operations.

estimator1_

The regressor used on view 1.

Type

regressor object

estimator2_

The regressor used on view 2.

Type

regressor object

class_name_

The name of the class.

Type

string

k_neighbors_

The number of neighbors to be considered for determining the mean squared error.

Type

int

unlabeled_pool_size

The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.

Type

int

num_iter

The maximum number of iterations to be performed

Type

int

n_views

The number of views in the data

Type

int

Examples

>>> from mvlearn.semi_supervised import CTRegressor
>>> from sklearn.neighbors import KNeighborsRegressor
>>> import numpy as np
>>> # X1 and X2 are the 2 views of the data
>>> X1 = [[0], [1], [2], [3], [4], [5], [6]]
>>> X2 = [[2], [3], [4], [6], [7], [8], [10]]
>>> y = [10, 11, 12, 13, 14, 15, 16]
>>> # Converting some of the labeled values to nan
>>> y_train = [10, np.nan, 12, np.nan, 14, np.nan, 16]
>>> knn1 = KNeighborsRegressor(n_neighbors=2)
>>> knn2 = KNeighborsRegressor(n_neighbors=2)
>>> ctr = CTRegressor(knn1, knn2, k_neighbors=2, random_state=42)
>>> ctr = ctr.fit([X1, X2], y_train)
>>> pred = ctr.predict([X1, X2])
>>> print("True value\n{}".format(y))
True value
[10, 11, 12, 13, 14, 15, 16]
>>> print("Predicted value\n{}".format(pred))
Predicted value
[10.75 11.25 11.25 13.25 13.25 14.75 15.25]

Notes

Multi-view co-training is most helpful for tasks in semi-supervised learning where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good regression results even when the number of unlabeled samples far exceeds the number of labeled samples. This regressor uses two sklearn regressors which work individually on each view but share information, resulting in improved performance over treating the views completely separately. The regressors must be KNeighborsRegressor, as described in [3].

Algorithm

Given:

  • sets L1 and L2 of labeled training samples for view 1 and view 2, respectively

  • a set U of unlabeled samples

  • a set U of unlabeled samples

Create a pool U' of examples by choosing examples at random from U

  • Use L1 to train a regressor h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])

  • Use L2 to train a regressor h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])

Loop for T iterations
  • for each view j:

    • for each sample u in U':

      • Find the k_neighbors nearest labeled neighbors of u

      • Predict the value of u with the current estimator hj

      • Create a new estimator hj' with the same parameters as hj and train it on (Lj union u), pairing u with its predicted value

      • Predict the values of the neighbors with hj and compute the mean squared error with respect to their true values

      • Predict the values of the neighbors with hj' and compute the mean squared error with respect to their true values

      • Store the difference between the two errors in a list deltaj

  • Select the index with the maximum positive value from delta1 and from delta2; call the selected indexes index1 and index2

  • Add index1 to L2 and index2 to L1

  • Remove the selected indexes from U' and replenish U' with unlabeled samples from U

  • Use L1 to retrain h1 and use L2 to retrain h2
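
The per-candidate criterion in the inner loop can be sketched compactly. The helper below is illustrative only (its name and signature are not part of mvlearn): for one view, it estimates how much the mean squared error on the labeled neighbors of a candidate u would drop if the self-labeled pair (u, hj(u)) were added to that view's training set; the candidate with the largest positive value is the one that would be transferred to the other view's labeled set.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsRegressor


def delta_for_candidate(h, X_lab, y_lab, x_u, k_neighbors=5):
    """MSE improvement on the labeled neighbors of x_u if the self-labeled
    point (x_u, h(x_u)) were added to the training set (illustrative)."""
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab, dtype=float)
    x_u = np.asarray(x_u, dtype=float).reshape(1, -1)
    y_u = h.predict(x_u)                        # self-labeled value for u
    # Neighbors of u among the labeled points the current model was fit on.
    nbrs = h.kneighbors(x_u, n_neighbors=k_neighbors, return_distance=False)[0]
    X_nb, y_nb = X_lab[nbrs], y_lab[nbrs]
    # Refit a copy of h on (L union u), pairing u with its predicted value.
    h_new = clone(h).fit(np.vstack([X_lab, x_u]), np.append(y_lab, y_u))
    err_old = np.mean((y_nb - h.predict(X_nb)) ** 2)
    err_new = np.mean((y_nb - h_new.predict(X_nb)) ** 2)
    return err_old - err_new                    # positive means improvement


# Possible use for one view: fit h on the labeled data, score every candidate
# in the pool U', and keep the best one only if its delta is positive.
# h = KNeighborsRegressor(n_neighbors=2).fit(X_lab, y_lab)
# deltas = [delta_for_candidate(h, X_lab, y_lab, x, k_neighbors=2) for x in X_pool]
# best = int(np.argmax(deltas))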

References

[3]

Zhou, Zhi-Hua, and Li, Ming. "Semi-supervised regression with co-training." In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 908–913, 2005.

[4]

Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.