Semi-Supervised

Cotraining Classifier

class mvlearn.semi_supervised.CTClassifier(estimator1=None, estimator2=None, p=None, n=None, unlabeled_pool_size=75, num_iter=50, random_state=None)[source]

This class implements the co-training classifier for supervised and semi-supervised learning with the framework described in [1]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [1]. However, it can also be successful when a single matrix of input data is given as both views and two estimators are chosen which are quite different [2]. See the examples below.

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the classifier has fit well.

Parameters
  • estimator1 (classifier object, (default=sklearn GaussianNB)) -- The classifier object which will be trained on view 1 of the data. This classifier should support the predict_proba() function so that classification probabilities can be computed and co-training can be performed effectively.

  • estimator2 (classifier object, (default=sklearn GaussianNB)) -- The classifier object which will be trained on view 2 of the data. Does not need to be of the same type as estimator1, but should support predict_proba().

  • p (int, optional (default=None)) -- The number of positive classifications from the unlabeled_pool training set which will be given a positive "label". If None, the default is the floor of the ratio of positive to negative examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, positive is defined as 1, and in general, positive is the larger label.

  • n (int, optional (default=None)) -- The number of negative classifications from the unlabeled_pool training set which will be given a negative "label". If None, the default is the floor of the ratio of negative to positive examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, negative is defined as 0, and in general, negative is the smaller label.

  • unlabeled_pool_size (int, optional (default=75)) -- The number of unlabeled_pool samples which will be kept in a separate pool for classification and selection by the updated classifier at each training iteration.

  • num_iter (int, optional (default=50)) -- The maximum number of training iterations to run.

  • random_state (int (default=None)) -- The starting random seed for fit() and class operations, passed to numpy.random.seed().

estimator1_

The classifier used on view 1.

Type

classifier object

estimator2_

The classifier used on view 2.

Type

classifier object

class_name_

The name of the class.

Type

string

p_

The number of positive classifications from the unlabeled_pool training set which will be given a positive "label" each round.

Type

int

n_

The number of negative classifications from the unlabeled_pool training set which will be given a negative "label" each round.

Type

int

classes_

Unique class labels.

Type

array-like of shape (n_classes,)

Examples

>>> # Supervised learning of single-view data with 2 distinct estimators
>>> from mvlearn.semi_supervised import CTClassifier
>>> from mvlearn.datasets import load_UCImultifeature
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split
>>> data, labels = load_UCImultifeature(select_labeled=[0,1])
>>> X1 = data[0]  # Only using the first view
>>> X1_train, X1_test, l_train, l_test = train_test_split(X1, labels)
>>> # Supervised learning with a single view of data and 2 estimator types
>>> estimator1 = GaussianNB()
>>> estimator2 = RandomForestClassifier()
>>> ctc = CTClassifier(estimator1, estimator2, random_state=1)
>>> # Use the same matrix for each view
>>> ctc = ctc.fit([X1_train, X1_train], l_train)
>>> preds = ctc.predict([X1_test, X1_test])
>>> print("Accuracy: ", sum(preds==l_test) / len(preds))
Accuracy:  0.97
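
A brief sketch of the semi-supervised case, assuming unlabeled training samples are marked with np.nan in the label vector (the convention used in the CTRegressor example below). The split and the choice of p and n are purely illustrative, and performance should be checked on the held-out portion as recommended above.

>>> # Semi-supervised sketch: two views, most training labels hidden
>>> X2 = data[1]  # second view of the same samples
>>> X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(
...     X1, X2, labels, random_state=1)
>>> y_semi = y_tr.astype(float)
>>> y_semi[50:] = np.nan  # treat most training labels as unknown
>>> ctc_semi = CTClassifier(p=2, n=2, random_state=1)  # default GaussianNB per view
>>> ctc_semi = ctc_semi.fit([X1_tr, X2_tr], y_semi)
>>> semi_acc = np.mean(ctc_semi.predict([X1_te, X2_te]) == y_te)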

Notes

Multi-view co-training is most helpful for tasks in semi-supervised learning where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good classification results even when the number of unlabeled samples far exceeds the number of labeled samples. This classifier uses two classifiers which work individually on each view but share information, resulting in improved performance over treating the views completely separately or even concatenating the views to get more features in a single-view setting. The classifier can be initialized with or without the classifiers desired for each view being specified; if the classifier for a certain view is specified, it must support a predict_proba() method so that the algorithm can determine which of the training samples it is most confident about during training epochs. The algorithm, as first proposed by Blum and Mitchell [1], is described in detail below.

Algorithm

Given:

  • a set L of labeled training samples (with 2 views)

  • a set U of unlabeled samples (with 2 views)

Create a pool U' of examples by choosing u examples at random from U

Loop for k iterations

  • Use L to train a classifier h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])

  • Use L to train a classifier h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])

  • Allow h1 to label p (self.p_) positive and n (self.n_) negative samples from view 1 of U'

  • Allow h2 to label p positive and n negative samples from view 2 of U'

  • Add these self-labeled samples to L

  • Randomly take 2p + 2n samples from U to replenish U'
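
A minimal sketch of this loop, assuming binary 0/1 labels, NumPy arrays for both views, and two generic scikit-learn classifiers that implement predict_proba(). It is illustrative only and does not reproduce CTClassifier's internal implementation (ties and duplicate picks between the two classifiers are not handled).

import numpy as np


def cotrain(h1, h2, X1_lab, X2_lab, y, X1_un, X2_un,
            p=1, n=1, pool_size=75, num_iter=50, seed=0):
    """Illustrative Blum-Mitchell co-training loop for binary 0/1 labels."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X1_un))          # U, visited in random order
    pool, rest = list(order[:pool_size]), list(order[pool_size:])
    for _ in range(num_iter):
        if not pool:
            break
        h1.fit(X1_lab, y)                        # train h1 on view 1 of L
        h2.fit(X2_lab, y)                        # train h2 on view 2 of L
        picked, pseudo = [], []
        for h, X_un in ((h1, X1_un), (h2, X2_un)):
            proba = h.predict_proba(X_un[pool])[:, 1]
            ranked = np.argsort(proba)
            for j in ranked[-p:]:                # p most confident positives
                picked.append(pool[j])
                pseudo.append(1)
            for j in ranked[:n]:                 # n most confident negatives
                picked.append(pool[j])
                pseudo.append(0)
        # Move the self-labeled samples from U' into L.
        X1_lab = np.vstack([X1_lab, X1_un[picked]])
        X2_lab = np.vstack([X2_lab, X2_un[picked]])
        y = np.concatenate([y, pseudo])
        pool = [q for q in pool if q not in picked]
        # Replenish U' with 2p + 2n fresh samples from U.
        pool += rest[:2 * p + 2 * n]
        rest = rest[2 * p + 2 * n:]
    return h1.fit(X1_lab, y), h2.fit(X2_lab, y)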

References

[1]

Blum, A., and Mitchell, T. "Combining labeled and unlabeled data with co-training." In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

[2]

Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Cotraining Regressor

class mvlearn.semi_supervised.CTRegressor(estimator1=None, estimator2=None, k_neighbors=5, unlabeled_pool_size=50, num_iter=100, random_state=None)[source]

This class implements co-training regression for supervised and semi-supervised learning with the framework described in [3]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [3]. However, it can also be successful when a single matrix of input data is given as both views and two estimators are chosen which are quite different [4].

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the regression model has fit well.

Parameters
  • estimator1 (sklearn object, (only supports KNeighborsRegressor)) -- The regressor object which will be trained on view 1 of the data.

  • estimator2 (sklearn object, (only supports KNeighborsRegressor)) -- The regressor object which will be trained on view 2 of the data.

  • k_neighbors (int, optional (default = 5)) -- The number of neighbors to be considered for determining the mean squared error.

  • unlabeled_pool_size (int, optional (default = 50)) -- The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.

  • num_iter (int, optional (default = 100)) -- The maximum number of training iterations to be performed.

  • random_state (int (default = None)) -- The starting random seed for fit() and other class operations.

estimator1_

The regressor used on view 1.

Type

regressor object

estimator2_

The regressor used on view 2.

Type

regressor object

class_name_

The name of the class.

Type

string

k_neighbors_

The number of neighbors to be considered for determining the mean squared error.

Type

int

unlabeled_pool_size

The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.

Type

int

num_iter

The maximum number of iterations to be performed

Type

int

n_views

The number of views in the data

Type

int

Examples

>>> from mvlearn.semi_supervised import CTRegressor
>>> from sklearn.neighbors import KNeighborsRegressor
>>> import numpy as np
>>> # X1 and X2 are the 2 views of the data
>>> X1 = [[0], [1], [2], [3], [4], [5], [6]]
>>> X2 = [[2], [3], [4], [6], [7], [8], [10]]
>>> y = [10, 11, 12, 13, 14, 15, 16]
>>> # Converting some of the labeled values to nan
>>> y_train = [10, np.nan, 12, np.nan, 14, np.nan, 16]
>>> knn1 = KNeighborsRegressor(n_neighbors=2)
>>> knn2 = KNeighborsRegressor(n_neighbors=2)
>>> ctr = CTRegressor(knn1, knn2, k_neighbors=2, random_state=42)
>>> ctr = ctr.fit([X1, X2], y_train)
>>> pred = ctr.predict([X1, X2])
>>> print("True value\n{}".format(y))
True value
[10, 11, 12, 13, 14, 15, 16]
>>> print("Predicted value\n{}".format(pred))
Predicted value
[10.75 11.25 11.25 13.25 13.25 14.75 15.25]

Notes

Multi-view co-training is most helpful for tasks in semi-supervised learning where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good regression results even when the number of unlabeled samples far exceeds the number of labeled samples. This regressor uses two sklearn regressors which work individually on each view but share information, resulting in improved performance over treating the views completely separately. The regressors must be KNeighborsRegressor, as described in [3].

Algorithm

Given:

  • sets L1 and L2 of labeled training samples for view 1 and view 2, respectively

  • a set U of unlabeled samples

  • a set U of unlabeled samples

Create a pool U' of examples by choosing examples at random from U

  • Use L1 to train a regressor h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])

  • Use L2 to train a regressor h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])

Loop for T iterations
  • for each view j:

    • for each sample u in U':

      • Find the k_neighbors nearest labeled neighbors of u

      • Predict the value of u with the current estimator hj

      • Create a new estimator hj' with the same parameters as hj and train it on (Lj union u), pairing u with its predicted value

      • Predict the values of the neighbors with hj and compute the mean squared error with respect to their true values

      • Predict the values of the neighbors with hj' and compute the mean squared error with respect to their true values

      • Store the difference between the two errors in a list deltaj

  • Select the index with the maximum positive value from delta1 and from delta2; call the selected indexes index1 and index2

  • Add index1 to L2 and index2 to L1

  • Remove the selected indexes from U' and replenish U' with unlabeled samples from U

  • Use L1 to retrain h1 and use L2 to retrain h2
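
The per-candidate criterion in the inner loop can be sketched compactly. The helper below is illustrative only (its name and signature are not part of mvlearn): for one view, it estimates how much the mean squared error on the labeled neighbors of a candidate u would drop if the self-labeled pair (u, hj(u)) were added to that view's training set; the candidate with the largest positive value is the one that would be transferred to the other view's labeled set.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsRegressor


def delta_for_candidate(h, X_lab, y_lab, x_u, k_neighbors=5):
    """MSE improvement on the labeled neighbors of x_u if the self-labeled
    point (x_u, h(x_u)) were added to the training set (illustrative)."""
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab, dtype=float)
    x_u = np.asarray(x_u, dtype=float).reshape(1, -1)
    y_u = h.predict(x_u)                        # self-labeled value for u
    # Neighbors of u among the labeled points the current model was fit on.
    nbrs = h.kneighbors(x_u, n_neighbors=k_neighbors, return_distance=False)[0]
    X_nb, y_nb = X_lab[nbrs], y_lab[nbrs]
    # Refit a copy of h on (L union u), pairing u with its predicted value.
    h_new = clone(h).fit(np.vstack([X_lab, x_u]), np.append(y_lab, y_u))
    err_old = np.mean((y_nb - h.predict(X_nb)) ** 2)
    err_new = np.mean((y_nb - h_new.predict(X_nb)) ** 2)
    return err_old - err_new                    # positive means improvement


# Possible use for one view: fit h on the labeled data, score every candidate
# in the pool U', and keep the best one only if its delta is positive.
# h = KNeighborsRegressor(n_neighbors=2).fit(X_lab, y_lab)
# deltas = [delta_for_candidate(h, X_lab, y_lab, x, k_neighbors=2) for x in X_pool]
# best = int(np.argmax(deltas))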

References

[3]

Zhou, Zhi-Hua, and Li, Ming. "Semi-supervised regression with co-training." In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 908–913, 2005.

[4]

Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.