What does score() do in Python?
Syntax: sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Parameters:
y_true : 1d array-like, or label indicator array / sparse matrix. Ground truth (correct) labels.
y_pred : 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.
normalize : bool, optional (default=True). If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
sample_weight : array-like of shape = [n_samples], optional. Sample weights.

Returns: the fraction of correctly classified samples if normalize == True, otherwise the number of correctly classified samples. The best performance is 1 with normalize == True and the number of samples with normalize == False.

For more information you can refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score
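For instance, a minimal illustration (the values follow directly from the definition above: 2 of the 3 predictions are correct):

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score([0, 1, 1], [0, 1, 0])
0.66...
>>> accuracy_score([0, 1, 1], [0, 1, 0], normalize=False)
2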
You don't specify the language or library you're using; assuming it's scikit-learn: if you have derived the predictions anyway (e.g. with model.predict(X_test)), you can pass y_test and those predictions directly to a metric function instead of calling score().
You must indeed compute your metric score between your target variable y_test and the output of the model y_pred.
No, notice that you're using your model for scoring. The score() method takes an input X_test and its target values y_test; your model computes y_pred for X_test and assigns a score to that prediction, using the metric associated with your model. You can't feed two target values to a model: you feed it an input and it gives you a result. The score() method does that and additionally compares the result against the real target values for evaluation. Metric functions do take y_pred and y_test, but score() is built into your model, so it does that step for you.
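In other words, a minimal sketch (assuming a fitted scikit-learn classifier clf and held-out data X_test, y_test; all of these names are placeholders):

>>> # score() computes predictions internally, then applies the default metric
>>> # (mean accuracy for classifiers, R^2 for regressors):
>>> s1 = clf.score(X_test, y_test)
>>> # equivalent for a classifier, done by hand:
>>> from sklearn.metrics import accuracy_score
>>> s2 = accuracy_score(y_test, clf.predict(X_test))  # s1 == s2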
There are 3 different APIs for evaluating the quality of a model's predictions:

Estimator score method: estimators have a score method providing a default evaluation criterion for the problem they are designed to solve.
Scoring parameter: model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy.
Metric functions: the sklearn.metrics module implements functions assessing prediction error for specific purposes.
Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.

3.3.1. The scoring parameter: defining model evaluation rules

Model selection and evaluation using tools such as model_selection.GridSearchCV and model_selection.cross_val_score take a scoring parameter that controls what metric they apply to the estimators evaluated.

3.3.1.1. Common cases: predefined values

For the most common use cases, you can designate a scorer object with the scoring parameter, using one of the predefined scorer names; all scorer objects follow the convention that higher return values are better than lower return values.
Usage examples:

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = svm.SVC(random_state=0)
>>> cross_val_score(clf, X, y, cv=5, scoring='recall_macro')
array([0.96..., 0.96..., 0.96..., 0.93..., 1. ])
>>> model = svm.SVC()
>>> cross_val_score(model, X, y, cv=5, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Use
sklearn.metrics.get_scorer_names() to get valid options.

Note: the values listed by the ValueError exception correspond to the functions measuring prediction accuracy described in the following sections; the full list of valid names can be obtained with sklearn.metrics.get_scorer_names().

3.3.1.2. Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions measuring a prediction error given ground truth and prediction.
Metrics available for various machine learning tasks are detailed in sections below. Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object, for example with make_scorer. One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter for the fbeta_score function:

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
...                     scoring=ftwo_scorer, cv=5)

The second use case is to build a completely custom scorer object from a simple python function using make_scorer.
Here is an example of building custom scorers, and of using the greater_is_better parameter:

>>> import numpy as np
>>> def my_custom_loss_func(y_true, y_pred):
...     diff = np.abs(y_true - y_pred).max()
...     return np.log1p(diff)
...
>>> # score will negate the return value of my_custom_loss_func,
>>> # which will be np.log(2), 0.693, given the values for X
>>> # and y defined below.
>>> score = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> X = [[1], [1]]
>>> y = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(X, y)
>>> my_custom_loss_func(y, clf.predict(X))
0.69...
>>> score(clf, X, y)
-0.69...

3.3.1.3. Implementing your own scoring object

You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the make_scorer factory.
Note: using custom scorers in functions where n_jobs > 1. While defining the custom scoring function alongside the calling function should work out of the box with the default joblib backend (loky), importing it from another module will be a more robust approach and work independently of the joblib backend. For example, to use a custom_scoring_function defined in a module custom_scorer_module:

>>> from custom_scorer_module import custom_scoring_function
>>> cross_val_score(model,
...                 X_train,
...                 y_train,
...                 scoring=make_scorer(custom_scoring_function, greater_is_better=False),
...                 cv=5,
...                 n_jobs=-1)

3.3.1.4. Using multiple metric evaluation

Scikit-learn also permits evaluation of multiple metrics in GridSearchCV, RandomizedSearchCV and cross_validate. There are three ways to specify multiple scoring metrics for the scoring parameter: as an iterable of string metric names, as a dict mapping scorer names to scorer callables, or as a single callable that returns a dict of scores (see the sketch below).
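As a quick sketch of the first form, an iterable of predefined scorer names (toy data; cross_validate returns one test_<name> entry per requested metric):

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> results = cross_validate(SVC(), X, y, cv=5,
...                          scoring=['accuracy', 'recall_macro'])
>>> sorted(results.keys())
['fit_time', 'score_time', 'test_accuracy', 'test_recall_macro']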
3.3.2. Classification metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics may require probability estimates of the positive class, confidence values, or binary decision values. Some of these metrics are restricted to the binary classification case, others also work in the multiclass case, some also work in the multilabel case, and some work with binary and multilabel (but not multiclass) problems. In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and metric definition.

3.3.2.1. From binary to multiclass and multilabel

Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of
binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.
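A small sketch of how the choice of average changes the result on an imbalanced toy problem (the values follow from the definitions: macro is the unweighted mean of the per-class recalls, 1.0 and 0.0, while micro counts correct labels globally, 4 of 6):

>>> from sklearn.metrics import recall_score
>>> y_true = [0, 0, 0, 0, 1, 1]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average='macro')
0.5
>>> recall_score(y_true, y_pred, average='micro')
0.66...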
While multiclass data is provided to the metric, like binary targets, as an array of class labels, multilabel data is specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.

3.3.2.2. Accuracy score

The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions. In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample and \(y_i\) is the corresponding true value, then the fraction of correct predictions over \(n_\text{samples}\) is defined as

\[\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)\]

where \(1(x)\) is the indicator function.

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

In the multilabel case with binary label indicators:

>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5

3.3.2.3. Top-k accuracy score

The top_k_accuracy_score function is a generalization of accuracy_score. The difference is that a prediction is considered correct as long as the true label is associated with one of the k highest predicted scores. The function covers the binary and multiclass classification cases but not the multilabel case.

If \(\hat{f}_{i,j}\) is the predicted class for the \(i\)-th sample corresponding to the \(j\)-th largest predicted score and \(y_i\) is the corresponding true value, then the fraction of correct predictions over \(n_\text{samples}\) is defined as

\[\texttt{top-k accuracy}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \sum_{j=1}^{k} 1(\hat{f}_{i,j} = y_i)\]

where \(k\) is the number of guesses allowed and \(1(x)\) is the indicator function.

>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],
...                     [0.3, 0.4, 0.2],
...                     [0.2, 0.4, 0.3],
...                     [0.7, 0.2, 0.1]])
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3

3.3.2.4. Balanced accuracy score

The balanced_accuracy_score function computes the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. It is the macro-average of recall scores per class.
In the binary case, balanced accuracy is equal to the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), or the area under the ROC curve with binary predictions rather than scores:

\[\texttt{balanced-accuracy} = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right )\]

If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number of correct predictions divided by the total number of predictions). In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to \(\frac{1}{n\_classes}\). The score ranges from 0 to 1, or, when adjusted=True is used, it is rescaled to the range \(\frac{1}{1 - n\_classes}\) to 1, inclusive, with performance at random scoring 0.

If \(y_i\) is the true value of the \(i\)-th sample, and \(w_i\) is the corresponding sample weight, then we adjust the sample weight to:

\[\hat{w}_i = \frac{w_i}{\sum_j{1(y_j = y_i) w_j}}\]

where \(1(x)\) is the indicator function. Given predicted \(\hat{y}_i\) for sample \(i\), balanced accuracy is defined as:

\[\texttt{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum{\hat{w}_i}} \sum_i 1(\hat{y}_i = y_i) \hat{w}_i\]

With adjusted=True, balanced accuracy reports the relative increase over the score obtained at random. Note: the multiclass definition here seems the most reasonable extension of the metric used in binary classification, though there is no certain consensus in the literature.
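Here is a small sketch of the binary case (class 0 recall is 3/4 and class 1 recall is 1/2, so balanced accuracy is their mean, 0.625):

>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625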
3.3.2.5. Cohen's kappa

The function cohen_kappa_score computes Cohen's kappa statistic. This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.

The kappa score (see docstring) is a number between -1 and 1. Scores above .8 are generally considered good agreement; zero or lower means no agreement (practically random labels). Kappa scores can be computed for binary or multiclass problems, but not for multilabel problems (except by manually computing a per-label score) and not for more than two annotators.

>>> from sklearn.metrics import cohen_kappa_score
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> cohen_kappa_score(y_true, y_pred)
0.4285714285714286

3.3.2.6. Confusion matrix

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix, with each row corresponding to the true class.
By definition, entry \(i, j\) in a confusion matrix is the number of observations actually in group \(i\), but predicted to be in group \(j\). Here is an example:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
The parameter normalize allows reporting ratios instead of counts. The confusion matrix can be normalized in 3 different ways: 'pred', 'true', and 'all', which will divide the counts by the sum of each column, each row, or the entire matrix, respectively.

>>> y_true = [0, 0, 0, 1, 1, 1, 1, 1]
>>> y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
>>> confusion_matrix(y_true, y_pred, normalize='all')
array([[0.25 , 0.125],
       [0.25 , 0.375]])

For binary problems, we can get counts of true negatives, false positives, false negatives and true positives as follows:

>>> y_true = [0, 0, 0, 1, 1, 1, 1, 1]
>>> y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
>>> tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
>>> tn, fp, fn, tp
(2, 1, 2, 3)

3.3.2.7. Classification report

The classification_report function builds a text report showing the main classification metrics. Here is a small example with custom target_names:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5

3.3.2.8. Hamming loss

The hamming_loss computes the average Hamming loss or Hamming distance between two sets of samples.

If \(\hat{y}_j\) is the predicted value for the \(j\)-th label of a given sample, \(y_j\) is the corresponding true value, and \(n_\text{labels}\) is the number of classes or labels, then the Hamming loss \(L_{Hamming}\) between two samples is defined as:

\[L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \not= y_j)\]

where \(1(x)\) is the indicator function.

>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> hamming_loss(y_true, y_pred)
0.25

In the multilabel case with binary label indicators:

>>> hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))
0.75

Note: in multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and y_pred, which is similar to the zero-one loss. However, while zero-one loss penalizes prediction sets that do not strictly match true sets, the Hamming loss penalizes individual labels.

3.3.2.9. Precision, recall and F-measures

Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. The F-measure (\(F_\beta\) and \(F_1\) measures) can be interpreted as a weighted harmonic mean of the precision and recall. A \(F_\beta\) measure reaches its best value at 1 and its worst score at 0. With \(\beta = 1\), \(F_\beta\) and \(F_1\) are equivalent, and the recall and the precision are equally important. The precision_recall_curve computes a precision-recall curve from the ground truth label and a score given by the classifier by varying a decision threshold. The average_precision_score function computes the average precision (AP) from prediction scores; the value is between 0 and 1, higher is better, and AP is defined as
\[\text{AP} = \sum_n (R_n - R_{n-1}) P_n\] where \(P_n\) and \(R_n\) are the precision and recall at the nth threshold. With random predictions, the AP is the fraction of positive samples. References [Manning2008] and
[Everingham2010] present alternative variants of AP that interpolate the precision-recall curve. Currently, average_precision_score does not implement any interpolated variant. Several functions allow you to analyze the precision, recall and F-measures score: average_precision_score, f1_score, fbeta_score, precision_recall_curve, precision_recall_fscore_support, precision_score and recall_score.
Note that the precision_recall_curve function is restricted to the binary case.

3.3.2.9.1. Binary classification

In a binary classification task, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). Given these definitions, we can formulate the following table:

                         actual positive         actual negative
predicted positive       tp (true positive)      fp (false positive)
predicted negative       fn (false negative)     tn (true negative)
In this context, we can define the notions of precision, recall and F-measure:

\[\text{precision} = \frac{tp}{tp + fp},\]

\[\text{recall} = \frac{tp}{tp + fn},\]

\[F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \text{precision} + \text{recall}}.\]

Here are some small examples in binary classification:

>>> from sklearn import metrics
>>> y_pred = [0, 1, 0, 0]
>>> y_true = [0, 1, 0, 1]
>>> metrics.precision_score(y_true, y_pred)
1.0
>>> metrics.recall_score(y_true, y_pred)
0.5
>>> metrics.f1_score(y_true, y_pred)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=0.5)
0.83...
>>> metrics.fbeta_score(y_true, y_pred, beta=1)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=2)
0.55...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)
(array([0.66..., 1. ]), array([1. , 0.5]), array([0.71..., 0.83...]), array([2, 2]))

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, threshold = precision_recall_curve(y_true, y_scores)
>>> precision
array([0.5 , 0.66..., 0.5 , 1. , 1. ])
>>> recall
array([1. , 1. , 0.5, 0.5, 0. ])
>>> threshold
array([0.1 , 0.35, 0.4 , 0.8 ])
>>> average_precision_score(y_true, y_scores)
0.83...

3.3.2.9.2. Multiclass and multilabel classification

In a multiclass and multilabel classification task, the notions of precision, recall, and F-measures can be applied to each label independently. There are a few ways to combine results across labels, specified by the average argument to functions such as precision_score, recall_score, f1_score and fbeta_score: 'macro' computes the unweighted mean of the per-label metrics; 'weighted' weights each label's metric by its support; 'micro' counts the total true positives, false negatives and false positives globally; 'samples' averages over instances (multilabel only); and average=None returns the score for each label.
>>> from sklearn import metrics
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> metrics.precision_score(y_true, y_pred, average='macro')
0.22...
>>> metrics.recall_score(y_true, y_pred, average='micro')
0.33...
>>> metrics.f1_score(y_true, y_pred, average='weighted')
0.26...
>>> metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5)
0.23...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None)
(array([0.66..., 0. , 0. ]), array([1., 0., 0.]), array([0.71..., 0. , 0. ]), array([2, 2, 2]...))

For multiclass classification with a "negative class", it is possible to exclude some labels:

>>> metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro')
... # excluding 0, no labels were correctly recalled
0.0

Similarly, labels not present in the data sample may be accounted for in macro-averaging.

>>> metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro')
0.166...

3.3.2.10. Jaccard similarity coefficient score

The jaccard_score function computes the average of Jaccard similarity coefficients, also called the Jaccard index, between pairs of label sets.

The Jaccard similarity coefficient of the \(i\)-th samples, with a ground truth label set \(y_i\) and predicted label set \(\hat{y}_i\), is defined as

\[J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}.\]
In the binary case:

>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])
>>> jaccard_score(y_true[0], y_pred[0])
0.6666...

In the 2D comparison case (e.g. image similarity):

>>> jaccard_score(y_true, y_pred, average="micro")
0.6

In the multilabel case with binary label indicators:

>>> jaccard_score(y_true, y_pred, average='samples')
0.5833...
>>> jaccard_score(y_true, y_pred, average='macro')
0.6666...
>>> jaccard_score(y_true, y_pred, average=None)
array([0.5, 0.5, 1. ])

Multiclass problems are binarized and treated like the corresponding multilabel problem:

>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1. , 0. , 0.33...])
>>> jaccard_score(y_true, y_pred, average='macro')
0.44...
>>> jaccard_score(y_true, y_pred, average='micro')
0.33...

3.3.2.11. Hinge loss

The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors.

If the labels are encoded with +1 and -1, \(y\) is the true value, and \(w\) is the predicted decision as output by decision_function, then the hinge loss is defined as:

\[L_\text{Hinge}(y, w) = \max\left\{1 - wy, 0\right\} = \left|1 - wy\right|_+\]

If there are more than two labels, hinge_loss uses a multiclass variant due to Crammer & Singer.
If \(y_w\) is the predicted decision for the true label and \(y_t\) is the maximum of the predicted decisions for all other labels, where predicted decisions are output by the decision function, then the multiclass hinge loss is defined by:

\[L_\text{Hinge}(y_w, y_t) = \max\left\{1 + y_t - y_w, 0\right\}\]

Here is a small example demonstrating the use of the hinge_loss function with an SVM classifier in a binary class problem:

>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = [[0], [1]]
>>> y = [-1, 1]
>>> est = svm.LinearSVC(random_state=0)
>>> est.fit(X, y)
LinearSVC(random_state=0)
>>> pred_decision = est.decision_function([[-2], [3], [0.5]])
>>> pred_decision
array([-2.18..., 2.36..., 0.09...])
>>> hinge_loss([-1, 1, 1], pred_decision)
0.3...

Here is an example demonstrating the use of the hinge_loss function with an SVM classifier in a multiclass problem:

>>> X = np.array([[0], [1], [2], [3]])
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC()
>>> est.fit(X, Y)
LinearSVC()
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> hinge_loss(y_true, pred_decision, labels=labels)
0.56...

3.3.2.12. Log loss

Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization, and can be used to evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions.

For binary classification with a true label \(y \in \{0,1\}\) and a probability estimate \(p = \operatorname{Pr}(y = 1)\), the log loss per sample is the negative log-likelihood of the classifier given the true label:

\[L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))\]

This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary indicator matrix \(Y\), i.e., \(y_{i,k} = 1\) if sample \(i\) has label \(k\) taken from a set of \(K\) labels. Let \(P\) be a matrix of probability estimates, with \(p_{i,k} = \operatorname{Pr}(y_{i,k} = 1)\). Then the log loss of the whole set is

\[L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}\]

To see how this generalizes the binary log loss given above, note that in the binary case, \(p_{i,0} = 1 - p_{i,1}\) and \(y_{i,0} = 1 - y_{i,1}\), so expanding the inner sum over \(y_{i,k} \in \{0,1\}\) gives the binary log loss.

The log_loss function computes log loss given a list of ground-truth labels and a probability matrix, as returned by an estimator's predict_proba method.

>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)
0.1738...

The first [.9, .1] in y_pred denotes a 90% probability that the first sample has label 0. The log loss is non-negative.

3.3.2.13. Matthews correlation coefficient

The matthews_corrcoef function computes the Matthews correlation coefficient (MCC), a balanced measure that takes into account true and false positives and negatives and can be used even if the classes are of very different sizes.
In the binary (two-class) case, where \(tp\), \(tn\), \(fp\) and \(fn\) are respectively the number of true positives, true negatives, false positives and false negatives, the MCC is defined as

\[MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.\]

In the multiclass case, the Matthews correlation coefficient can be defined in terms of a confusion_matrix \(C\) for \(K\) classes. To simplify the definition, consider the following intermediate variables: \(t_k=\sum_i^{K} C_{ik}\), the number of times class \(k\) truly occurred; \(p_k=\sum_i^{K} C_{ki}\), the number of times class \(k\) was predicted; \(c=\sum_k^{K} C_{kk}\), the total number of samples correctly predicted; and \(s=\sum_i^{K} \sum_j^{K} C_{ij}\), the total number of samples.
Then the multiclass MCC is defined as:

\[MCC = \frac{ c \times s - \sum_{k}^{K} p_k \times t_k }{\sqrt{ (s^2 - \sum_{k}^{K} p_k^2) \times (s^2 - \sum_{k}^{K} t_k^2) }}\]

When there are more than two labels, the value of the MCC will no longer range between -1 and +1. Instead, the minimum value will be somewhere between -1 and 0, depending on the number and distribution of ground-truth labels. The maximum value is always +1. Here is a small example illustrating the usage of the matthews_corrcoef function:
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...

3.3.2.14. Multi-label confusion matrix

The multilabel_confusion_matrix function computes class-wise (default) or sample-wise (samplewise=True) multilabel confusion matrices to evaluate the accuracy of a classification. When calculating the class-wise multilabel confusion matrix \(C\), the count of true negatives for class \(i\) is \(C_{i,0,0}\), false negatives is \(C_{i,1,0}\), true positives is \(C_{i,1,1}\) and false positives is \(C_{i,0,1}\). Here is an example
demonstrating the use of the multilabel_confusion_matrix function with multilabel indicator matrix input:

>>> import numpy as np
>>> from sklearn.metrics import multilabel_confusion_matrix
>>> y_true = np.array([[1, 0, 1],
...                    [0, 1, 0]])
>>> y_pred = np.array([[1, 0, 0],
...                    [0, 1, 1]])
>>> multilabel_confusion_matrix(y_true, y_pred)
array([[[1, 0],
        [0, 1]],

       [[1, 0],
        [0, 1]],

       [[0, 1],
        [1, 0]]])

Or a confusion matrix can be constructed for each sample's labels:

>>> multilabel_confusion_matrix(y_true, y_pred, samplewise=True)
array([[[1, 0],
        [1, 1]],

       [[1, 1],
        [0, 1]]])

Here is an example demonstrating the use of the multilabel_confusion_matrix function with multiclass input:

>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
>>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
>>> multilabel_confusion_matrix(y_true, y_pred,
...                             labels=["ant", "bird", "cat"])
array([[[3, 1],
        [0, 2]],

       [[5, 0],
        [1, 0]],

       [[2, 1],
        [1, 2]]])

Here are some examples
demonstrating the use of the multilabel_confusion_matrix function to calculate recall, specificity, fall out and miss rate for each class.

Calculating recall (also called the true positive rate or the sensitivity) for each class:

>>> y_true = np.array([[0, 0, 1],
...                    [0, 1, 0],
...                    [1, 1, 0]])
>>> y_pred = np.array([[0, 1, 0],
...                    [0, 0, 1],
...                    [1, 1, 0]])
>>> mcm = multilabel_confusion_matrix(y_true, y_pred)
>>> tn = mcm[:, 0, 0]
>>> tp = mcm[:, 1, 1]
>>> fn = mcm[:, 1, 0]
>>> fp = mcm[:, 0, 1]
>>> tp / (tp + fn)
array([1. , 0.5, 0. ])

Calculating specificity (also called the true negative rate) for each class:

>>> tn / (tn + fp)
array([1. , 0. , 0.5])

Calculating fall out (also called the false positive rate) for each class:

>>> fp / (fp + tn)
array([0. , 1. , 0.5])

Calculating miss rate (also called the false negative rate) for each class:

>>> fn / (fn + tp)
array([0. , 0.5, 1. ])

3.3.2.15. Receiver operating characteristic (ROC)

The function roc_curve computes the receiver operating characteristic curve, or ROC curve.
This function requires the true binary value and the target scores, which can either be probability estimates of the positive class, confidence values, or binary decisions. Here is a small example of how to use the roc_curve function:

>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])

(The original documentation shows an example figure of such an ROC curve here.)

The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. Compared to metrics such as the subset accuracy, the Hamming loss, or the F1 score, ROC doesn't require optimizing a threshold for each label.

3.3.2.15.1. Binary case

In the binary case, you can either provide the probability estimates, using the classifier.predict_proba() method, or the non-thresholded decision values given by the classifier.decision_function() method.

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear").fit(X, y)
>>> clf.classes_
array([0, 1])

We can use the probability estimates corresponding to clf.classes_[1]:

>>> y_score = clf.predict_proba(X)[:, 1]
>>> roc_auc_score(y, y_score)
0.99...

Otherwise, we can use the non-thresholded decision values:

>>> roc_auc_score(y, clf.decision_function(X))
0.99...

3.3.2.15.2. Multi-class case

The roc_auc_score function can also be used in multi-class classification. Two averaging strategies are currently supported: the one-vs-one algorithm computes the average of the pairwise ROC AUC scores, and the one-vs-rest algorithm computes the average of the ROC AUC scores for each class against all other classes.

One-vs-one

Algorithm: computes the average AUC of all possible pairwise combinations of classes. [HT2001] defines a multiclass AUC metric weighted uniformly:

\[\frac{1}{c(c-1)}\sum_{j=1}^{c}\sum_{k > j}^c (\text{AUC}(j | k) + \text{AUC}(k | j))\]

where \(c\) is the number of classes and \(\text{AUC}(j | k)\) is the AUC with class
\(j\) as the positive class and class \(k\) as the negative class. In general, \(\text{AUC}(j | k) \neq \text{AUC}(k | j)\) in the multiclass case. This algorithm is used by setting the keyword argument multi_class to 'ovo' and average to 'macro'.

The [HT2001] multiclass AUC metric can be extended to be weighted by the prevalence:

\[\frac{1}{c(c-1)}\sum_{j=1}^{c}\sum_{k > j}^c p(j \cup k)( \text{AUC}(j | k) + \text{AUC}(k | j))\]

where \(c\) is the number of classes. This algorithm is used by setting the keyword argument multi_class to 'ovo' and average to 'weighted'.

One-vs-rest
Algorithm: computes the AUC of each class against the rest [PD2000]. The algorithm is functionally the same as the multilabel case. To enable this algorithm, set the keyword argument multi_class to 'ovr'. In applications where a high false positive rate is not tolerable, the parameter max_fpr of roc_auc_score can be used to summarize the ROC curve up to the given limit.

3.3.2.15.3. Multi-label case

In multi-label classification, the roc_auc_score function is extended by averaging over the labels as above. In this case, you should provide a y_score of shape (n_samples, n_classes); when using the probability estimates, one needs to select the probability of the class with the greater label for each output:

>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> inner_clf = LogisticRegression(solver="liblinear", random_state=0)
>>> clf = MultiOutputClassifier(inner_clf).fit(X, y)
>>> y_score = np.transpose([y_pred[:, 1] for y_pred in clf.predict_proba(X)])
>>> roc_auc_score(y, y_score, average=None)
array([0.82..., 0.86..., 0.94..., 0.85... , 0.94...])

And the decision values do not require such processing:

>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> y_score = clf.decision_function(X)
>>> roc_auc_score(y, y_score, average=None)
array([0.81..., 0.84... , 0.93..., 0.87..., 0.94...])

3.3.2.16. Detection error tradeoff (DET)

The function det_curve computes the detection error tradeoff curve (DET curve).
DET curves are a variation of receiver operating characteristic (ROC) curves where the false negative rate is plotted on the y-axis instead of the true positive rate. DET curves are commonly plotted in normal deviate scale by transformation with \(\phi^{-1}\) (with \(\phi\) being the cumulative distribution function). The resulting performance curves explicitly visualize the tradeoff of error types for given classification algorithms. See [Martin1997] for examples and further motivation. (The original documentation shows a figure comparing the ROC and DET curves of two example classifiers on the same classification task.)

Properties: DET curves form a linear curve in normal deviate scale if the detection scores are normally (or close-to normally) distributed.

Applications and limitations: DET curves are intuitive to read and hence allow quick visual assessment of a classifier's performance. Additionally, DET curves can be consulted for threshold analysis and operating point selection. This is particularly helpful if a comparison of error types is required. On the other hand, DET curves do not provide their metric as a single number. Therefore, for either automated evaluation or comparison to other classification tasks, metrics like the derived area under the ROC curve might be better suited.
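Here is a small sketch of computing the DET coordinates with det_curve, reusing the toy scores from the ROC example above; the returned arrays hold one point per threshold:

>>> import numpy as np
>>> from sklearn.metrics import det_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> # fpr is the false positive rate, fnr the false negative rate
>>> # (the miss rate plotted on the y-axis of a DET curve)
>>> fpr, fnr, thresholds = det_curve(y_true, y_scores)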
3.3.2.17. Zero one loss

The zero_one_loss function computes the sum or the average of the 0-1 classification loss over \(n_{\text{samples}}\). By default, the function normalizes over the sample; to get the sum of the 0-1 loss, set normalize to False. In multilabel classification, the zero_one_loss scores a subset as one if its labels strictly match the predictions, and as zero if there are any errors.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample and \(y_i\) is the corresponding true value, then the 0-1 loss \(L_{0-1}\) is defined as:

\[L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \not= y_i)\]

where \(1(x)\) is the indicator function.

>>> from sklearn.metrics import zero_one_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> zero_one_loss(y_true, y_pred)
0.25
>>> zero_one_loss(y_true, y_pred, normalize=False)
1

In the multilabel case with binary label indicators, where the first label set [0,1] has an error:

>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)), normalize=False)
1

3.3.2.18. Brier score loss

The brier_score_loss function computes the Brier score for binary classes.
This function returns the mean squared error of the actual outcome \(y \in \{0,1\}\) and the predicted probability estimate \(p = \operatorname{Pr}(y = 1)\) (predict_proba):

\[BS = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1}(y_i - p_i)^2\]

The Brier score loss is also between 0 and 1, and the lower the value (the smaller the mean square difference), the more accurate the prediction is. Here is a small example of usage of this function:

>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
>>> y_pred = np.array([0, 1, 1, 0])
>>> brier_score_loss(y_true, y_prob)
0.055
>>> brier_score_loss(y_true, 1 - y_prob, pos_label=0)
0.055
>>> brier_score_loss(y_true_categorical, y_prob, pos_label="ham")
0.055
>>> brier_score_loss(y_true, y_prob > 0.5)
0.0

The Brier score can be used to assess how well a classifier is calibrated. However, a lower Brier score loss does not always mean a better calibration. This is because, by analogy with the bias-variance decomposition of the mean squared error, the Brier score loss can be decomposed as the sum of calibration loss and refinement loss [Bella2012]. Calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC segments. Refinement loss can be defined as the expected optimal loss as measured by the area under the optimal cost curve. Refinement loss can change independently from calibration loss, thus a lower Brier score loss does not necessarily mean a better calibrated model. "Only when refinement loss remains the same does a lower Brier score loss always mean better calibration" [Bella2012], [Flach2008].

3.3.3. Multilabel ranking metrics

In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give high scores and better rank to the ground truth labels.

3.3.3.1. Coverage error

The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted. This is useful if you want to know how many top-scored labels you have to predict on average without missing any true one.

Note: our implementation's score is 1 greater than the one given in Tsoumakas et al., 2010. This extends it to handle the degenerate case in which an instance has 0 true labels.

Formally, given a binary indicator matrix of the ground truth labels \(y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}\) and the score associated with each label \(\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}\), the coverage is defined as

\[coverage(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \max_{j:y_{ij} = 1} \text{rank}_{ij}\]

with \(\text{rank}_{ij} = \left|\left\{k: \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\right|\). Given the rank definition, ties in y_score are broken by giving the maximal rank that would have been assigned to all tied values.

Here is a small example of usage of this function:

>>> import numpy as np
>>> from sklearn.metrics import coverage_error
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> coverage_error(y_true, y_score)
2.5

3.3.3.2. Label ranking average precision

The label_ranking_average_precision_score function implements label ranking average precision (LRAP).
Label ranking average precision (LRAP) averages over the samples the answer to the following question: for each ground truth label, what fraction of higher-ranked labels were true labels? This performance measure will be higher if you are able to give better rank to the labels associated with each sample. The obtained score is always strictly greater than 0, and the best value is 1. If there is exactly one relevant label per sample, label ranking average precision is equivalent to the mean reciprocal rank.

Formally, given a binary indicator matrix of the ground truth labels \(y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}\) and the score associated with each label \(\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}\), the average precision is defined as

\[LRAP(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \frac{1}{||y_i||_0} \sum_{j:y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}\]

where \(\mathcal{L}_{ij} = \left\{k: y_{ik} = 1, \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\), \(\text{rank}_{ij} = \left|\left\{k: \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\right|\), \(|\cdot|\) computes the cardinality of the set (i.e., the number of elements in the set), and \(||\cdot||_0\) is the \(\ell_0\) "norm" (which computes the number of nonzero elements in a vector).

Here is a small example of usage of this function:

>>> import numpy as np
>>> from sklearn.metrics import label_ranking_average_precision_score
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_average_precision_score(y_true, y_score)
0.416...

3.3.3.3. Ranking loss

The label_ranking_loss function computes the ranking loss, which averages over the samples the number of label pairs that are incorrectly ordered, i.e. true labels that have a lower score than false labels, weighted by the inverse of the number of ordered pairs of false and true labels. The lowest achievable ranking loss is zero.

Formally, given a binary indicator matrix of the ground truth labels \(y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}\) and the score associated with each label \(\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}\), the ranking loss is defined as

\[ranking\_loss(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \frac{1}{||y_i||_0(n_\text{labels} - ||y_i||_0)} \left|\left\{(k, l): \hat{f}_{ik} \leq \hat{f}_{il}, y_{ik} = 1, y_{il} = 0 \right\}\right|\]

where \(|\cdot|\) computes the cardinality of the set (i.e., the number of elements in the set) and \(||\cdot||_0\) is the \(\ell_0\) "norm" (which computes the number of nonzero elements in a vector).

Here is a small example of usage of this function:

>>> import numpy as np
>>> from sklearn.metrics import label_ranking_loss
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_loss(y_true, y_score)
0.75...
>>> # With the following prediction, we have perfect and minimal loss
>>> y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
>>> label_ranking_loss(y_true, y_score)
0.0

3.3.3.4. Normalized Discounted Cumulative Gain

Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG) are ranking metrics implemented in the dcg_score and ndcg_score functions; they compare a predicted order to ground-truth scores, such as the relevance of answers to a query.
From the Wikipedia page for Discounted Cumulative Gain: "Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks."

DCG orders the true targets (e.g. relevance of query answers) in the predicted order, then multiplies them by a logarithmic decay and sums the result. The sum can be truncated after the first \(K\) results, in which case we call it DCG@K. NDCG, or NDCG@K, is DCG divided by the DCG obtained by a perfect prediction, so that it is always between 0 and 1. Usually, NDCG is preferred to DCG.

Compared with the ranking loss, NDCG can take into account relevance scores, rather than a ground-truth ranking. So if the ground-truth consists only of an ordering, the ranking loss should be preferred; if the ground-truth consists of actual usefulness scores (e.g. 0 for irrelevant, 1 for relevant, 2 for very relevant), NDCG can be used.

For one sample, given the vector of continuous ground-truth values for each target \(y \in \mathbb{R}^{M}\), where \(M\) is the number of outputs, and the prediction \(\hat{y}\), which induces the ranking function \(f\), the DCG score is

\[\sum_{r=1}^{\min(K, M)}\frac{y_{f(r)}}{\log(1 + r)}\]

and the NDCG score is the DCG score divided by the DCG score obtained for \(y\).
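Here is a small sketch with one sample and graded relevances. In the predicted order the relevances are (1, 0, 2), so DCG is 1/log2(2) + 0/log2(3) + 2/log2(4) = 2.0, and the ideal ordering gives 2 + 1/log2(3), roughly 2.63:

>>> import numpy as np
>>> from sklearn.metrics import dcg_score, ndcg_score
>>> y_true = np.array([[2, 1, 0]])          # graded ground-truth relevances
>>> y_score = np.array([[0.1, 0.9, 0.2]])   # predicted scores inducing the order
>>> dcg_score(y_true, y_score)
2.0
>>> ndcg_score(y_true, y_score)
0.76...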
3.3.4. Regression metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure regression performance. Some of those have been enhanced to handle the multioutput case: mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_pinball_loss, d2_pinball_score and d2_absolute_error_score.

These functions have a multioutput keyword argument which specifies the way the scores or losses for each individual target should be averaged. The default is 'uniform_average', which specifies a uniformly weighted mean over outputs; 'raw_values' returns the unaveraged per-output values instead, and an ndarray of weights averages the outputs with those weights. The r2_score and explained_variance_score functions accept an additional value 'variance_weighted' for the multioutput parameter.

3.3.4.1. R² score, the coefficient of determination

The r2_score function computes the coefficient of determination, usually denoted as \(R^2\). It represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

As such variance is dataset dependent, \(R^2\) may not be meaningfully comparable across different datasets. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected (average) value of y, disregarding the input features, would get an \(R^2\) score of 0.0. Note: when the prediction residuals have zero mean, the \(R^2\) score and the explained variance score are identical.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample and \(y_i\) is the corresponding true value for a total of \(n\) samples, the estimated \(R^2\) is defined as:

\[R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

where \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\) and \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2\). Note that r2_score calculates unadjusted \(R^2\) without correcting for bias in the sample variance of y.

In the particular case where the true target is constant, the \(R^2\) score is not finite: it is either NaN (perfect predictions) or -Inf (imperfect predictions); the default force_finite=True replaces these with 1.0 and 0.0 respectively.

Here is a small example of usage of the r2_score function:

>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='variance_weighted')
0.938...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='uniform_average')
0.936...
>>> r2_score(y_true, y_pred, multioutput='raw_values')
array([0.965..., 0.908...])
>>> r2_score(y_true, y_pred, multioutput=[0.3, 0.7])
0.925...
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2]
>>> r2_score(y_true, y_pred)
1.0
>>> r2_score(y_true, y_pred, force_finite=False)
nan
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2 + 1e-8]
>>> r2_score(y_true, y_pred)
0.0
>>> r2_score(y_true, y_pred, force_finite=False)
-inf

3.3.4.2. Mean absolute error

The mean_absolute_error function computes mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or \(l1\)-norm loss.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample, and \(y_i\) is the corresponding true value, then the mean absolute error (MAE) estimated over \(n_{\text{samples}}\) is defined as

\[\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|.\]

Here is a small example of usage of the mean_absolute_error function:

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.85...

3.3.4.3. Mean squared error

The mean_squared_error function computes mean square error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample, and \(y_i\) is the corresponding true value, then the mean squared error (MSE) estimated over \(n_{\text{samples}}\) is defined as

\[\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2.\]

Here is a small example of usage of the mean_squared_error function:
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.7083...

3.3.4.4. Mean squared logarithmic error

The mean_squared_log_error function computes a risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample, and \(y_i\) is the corresponding true value, then the mean squared logarithmic error (MSLE) estimated over \(n_{\text{samples}}\) is defined as

\[\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2.\]

where \(\log_e (x)\) means the natural logarithm of \(x\). This metric is best to use when targets have exponential growth, such as population counts, average sales of a commodity over a span of years, etc. Note that this metric penalizes an under-predicted estimate more heavily than an over-predicted one.

Here is a small example of usage of the mean_squared_log_error function:
>>> from sklearn.metrics import mean_squared_log_error
>>> y_true = [3, 5, 2.5, 7]
>>> y_pred = [2.5, 5, 4, 8]
>>> mean_squared_log_error(y_true, y_pred)
0.039...
>>> y_true = [[0.5, 1], [1, 2], [7, 6]]
>>> y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
>>> mean_squared_log_error(y_true, y_pred)
0.044...

3.3.4.5. Mean absolute percentage error

The mean_absolute_percentage_error (MAPE), also known as mean absolute percentage deviation (MAPD), is an evaluation metric for regression problems. The idea of this metric is to be sensitive to relative errors; it is for example not changed by a global scaling of the target variable.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample and \(y_i\) is the corresponding true value, then the mean absolute percentage error (MAPE) estimated over \(n_{\text{samples}}\) is defined as

\[\text{MAPE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \frac{{}\left| y_i - \hat{y}_i \right|}{\max(\epsilon, \left| y_i \right|)}\]

where \(\epsilon\) is an arbitrary small yet strictly positive number to avoid undefined results when y is zero. The mean_absolute_percentage_error function supports multioutput.

Here is a small example of usage of the mean_absolute_percentage_error function:

>>> from sklearn.metrics import mean_absolute_percentage_error
>>> y_true = [1, 10, 1e6]
>>> y_pred = [0.9, 15, 1.2e6]
>>> mean_absolute_percentage_error(y_true, y_pred)
0.2666...

In the above example, if we had used mean_absolute_error, it would have ignored the small magnitude values and only reflected the error in the prediction of the highest magnitude value. That problem is resolved with MAPE because it calculates the relative percentage error with respect to the actual output.

3.3.4.6. Median absolute error

The median_absolute_error is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction.
If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample and \(y_i\) is the corresponding true value, then the median absolute error (MedAE) estimated over \(n_{\text{samples}}\) is defined as

\[\text{MedAE}(y, \hat{y}) = \text{median}(\mid y_1 - \hat{y}_1 \mid, \ldots, \mid y_n - \hat{y}_n \mid).\]

Here is a small example of usage of the median_absolute_error function:

>>> from sklearn.metrics import median_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> median_absolute_error(y_true, y_pred)
0.5

3.3.4.7. Max error

The max_error function computes the maximum residual error, a metric that captures the worst case error between the predicted value and the true value. In a perfectly fitted single-output regression model, max_error would be 0 on the training set; though highly unlikely in the real world, this metric shows the extent of error the model had when it was fitted.

If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample, and \(y_i\) is the corresponding true value, then the max error is defined as

\[\text{Max Error}(y, \hat{y}) = \max(| y_i - \hat{y}_i |)\]

Here is a small example of usage of the max_error function:

>>> from sklearn.metrics import max_error
>>> y_true = [3, 2, 7, 1]
>>> y_pred = [9, 2, 7, 1]
>>> max_error(y_true, y_pred)
6

The max_error does not support multioutput.

3.3.4.8. Explained variance score

The explained_variance_score computes the explained variance regression score.
If \(\hat{y}\) is the estimated target output, \(y\) the corresponding (correct) target output, and \(Var\) is variance, the square of the standard deviation, then the explained variance is estimated as follows:

\[explained\_{}variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}\]

The best possible score is 1.0; lower values are worse. In the particular case where the true target is constant, the explained variance score is not finite: it is either NaN (perfect predictions) or -Inf (imperfect predictions); the default force_finite=True replaces these with 1.0 and 0.0 respectively.

Here is a small example of usage of the explained_variance_score function:

>>> from sklearn.metrics import explained_variance_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> explained_variance_score(y_true, y_pred)
0.957...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> explained_variance_score(y_true, y_pred, multioutput='raw_values')
array([0.967..., 1. ])
>>> explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])
0.990...
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2]
>>> explained_variance_score(y_true, y_pred)
1.0
>>> explained_variance_score(y_true, y_pred, force_finite=False)
nan
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2 + 1e-8]
>>> explained_variance_score(y_true, y_pred)
0.0
>>> explained_variance_score(y_true, y_pred, force_finite=False)
-inf

3.3.4.9. Mean Poisson, Gamma, and Tweedie deviances

The mean_tweedie_deviance function computes the mean Tweedie deviance error with a power parameter. This is a metric that elicits predicted expectation values of regression targets.
The following special cases exist: for power=0 it is equivalent to mean_squared_error; for power=1 it is equivalent to mean_poisson_deviance; for power=2 it is equivalent to mean_gamma_deviance.
If \(\hat{y}_i\) is the predicted value of the \(i\)-th sample, and \(y_i\) is the corresponding true value, then the mean Tweedie deviance error (D) for power \(p\), estimated over \(n_{\text{samples}}\), is defined as

\[\begin{split}\text{D}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \begin{cases} (y_i-\hat{y}_i)^2, & \text{for }p=0\text{ (Normal)}\\ 2(y_i \log(y_i/\hat{y}_i) + \hat{y}_i - y_i), & \text{for }p=1\text{ (Poisson)}\\ 2(\log(\hat{y}_i/y_i) + y_i/\hat{y}_i - 1), & \text{for }p=2\text{ (Gamma)}\\ 2\left(\frac{\max(y_i,0)^{2-p}}{(1-p)(2-p)}- \frac{y_i\,\hat{y}_i^{1-p}}{1-p}+\frac{\hat{y}_i^{2-p}}{2-p}\right), & \text{otherwise} \end{cases}\end{split}\]

Tweedie deviance is a homogeneous function of degree 2-power. Thus, for the Gamma distribution (power=2), simultaneously scaling y_true and y_pred has no effect on the deviance; for the Poisson distribution (power=1) the deviance scales linearly, and for the Normal distribution (power=0), quadratically.

For instance, let's compare the two predictions 1.5 and 150 that are both 50% larger than their corresponding true value. The mean squared error (power=0) is very sensitive to the prediction difference of the second point:

>>> from sklearn.metrics import mean_tweedie_deviance
>>> mean_tweedie_deviance([1.0], [1.5], power=0)
0.25
>>> mean_tweedie_deviance([100.], [150.], power=0)
2500.0

If we increase power to 1:
>>> mean_tweedie_deviance([1.0], [1.5], power=1)
0.18...
>>> mean_tweedie_deviance([100.], [150.], power=1)
18.9...

the difference in errors decreases. Finally, by setting power=2:

>>> mean_tweedie_deviance([1.0], [1.5], power=2)
0.14...
>>> mean_tweedie_deviance([100.], [150.], power=2)
0.14...

we would get identical errors. The deviance when power=2 is thus only sensitive to relative errors.

3.3.4.10. Pinball loss

The mean_pinball_loss function is used to evaluate the predictive performance of quantile regression models:

\[\text{pinball}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \alpha \max(y_i - \hat{y}_i, 0) + (1 - \alpha) \max(\hat{y}_i - y_i, 0)\]

The value of pinball loss is equivalent to half of mean_absolute_error when the quantile parameter alpha is set to 0.5.

Here is a small example of usage of the mean_pinball_loss function:
>>> from sklearn.metrics import mean_pinball_loss
>>> y_true = [1, 2, 3]
>>> mean_pinball_loss(y_true, [0, 2, 3], alpha=0.1)
0.03...
>>> mean_pinball_loss(y_true, [1, 2, 4], alpha=0.1)
0.3...
>>> mean_pinball_loss(y_true, [0, 2, 3], alpha=0.9)
0.3...
>>> mean_pinball_loss(y_true, [1, 2, 4], alpha=0.9)
0.03...
>>> mean_pinball_loss(y_true, y_true, alpha=0.1)
0.0
>>> mean_pinball_loss(y_true, y_true, alpha=0.9)
0.0

It is possible to build a scorer object with a specific choice of alpha:

>>> from sklearn.metrics import make_scorer
>>> mean_pinball_loss_95p = make_scorer(mean_pinball_loss, alpha=0.95)

Such a scorer can be used to evaluate the generalization performance of a quantile regressor via cross-validation:

>>> from sklearn.datasets import make_regression
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.ensemble import GradientBoostingRegressor
>>>
>>> X, y = make_regression(n_samples=100, random_state=0)
>>> estimator = GradientBoostingRegressor(
...     loss="quantile",
...     alpha=0.95,
...     random_state=0,
... )
>>> cross_val_score(estimator, X, y, cv=5, scoring=mean_pinball_loss_95p)
array([13.6..., 9.7..., 23.3..., 9.5..., 10.4...])

It is also possible to build scorer objects for hyper-parameter tuning. The sign of the loss must be switched to ensure that greater means better, as explained in the example linked below.

3.3.4.11. D² score

The D² score computes the fraction of deviance explained. It is a generalization of R², where the squared error is generalized and replaced by a deviance of choice \(\text{dev}(y, \hat{y})\) (e.g., Tweedie, pinball or mean absolute error). D² is a form of a skill score. It is calculated as

\[D^2(y, \hat{y}) = 1 - \frac{\text{dev}(y, \hat{y})}{\text{dev}(y, y_{\text{null}})} \,.\]

Where \(y_{\text{null}}\) is the optimal prediction of an intercept-only model (e.g., the mean of y_true for the Tweedie case, the median for the absolute error, and the alpha-quantile for the pinball loss).

Like R², the best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts \(y_{\text{null}}\), disregarding the input features, would get a D² score of 0.0.

3.3.4.11.1. D² Tweedie score

The d2_tweedie_score function implements the special case of D² where \(\text{dev}(y, \hat{y})\) is the Tweedie deviance. The argument power defines the Tweedie power, as for mean_tweedie_deviance.

A scorer object with a specific choice of power can be built by:

>>> from sklearn.metrics import d2_tweedie_score, make_scorer
>>> d2_tweedie_score_15 = make_scorer(d2_tweedie_score, power=1.5)

3.3.4.11.2. D² pinball score

The d2_pinball_score function implements the special case of D² with the pinball loss, i.e.:

\[\text{dev}(y, \hat{y}) = \text{pinball}(y, \hat{y}).\]

The argument alpha defines the slope of the pinball loss, as for mean_pinball_loss.

A scorer object with a specific choice of alpha can be built by:

>>> from sklearn.metrics import d2_pinball_score, make_scorer
>>> d2_pinball_score_08 = make_scorer(d2_pinball_score, alpha=0.8)

3.3.4.11.3. D² absolute error score

The d2_absolute_error_score function implements the special case of the mean absolute error:

\[\text{dev}(y, \hat{y}) = \text{MAE}(y, \hat{y}).\]

Here are some usage examples of the d2_absolute_error_score function:

>>> from sklearn.metrics import d2_absolute_error_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> d2_absolute_error_score(y_true, y_pred)
0.764...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> d2_absolute_error_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> d2_absolute_error_score(y_true, y_pred)
0.0

3.3.5. Clustering metrics

The sklearn.metrics module also implements several loss, score, and utility functions to measure clustering performance; see the Clustering performance evaluation section of the scikit-learn documentation.

3.3.6. Dummy estimators

When doing supervised learning, a simple sanity check consists of comparing one's estimator against simple rules of thumb. DummyClassifier implements several such simple strategies for classification: 'stratified' generates random predictions by respecting the training set class distribution; 'most_frequent' always predicts the most frequent label in the training set; 'prior' always predicts the class that maximizes the class prior; 'uniform' generates predictions uniformly at random; and 'constant' always predicts a constant label provided by the user.
Note that with all these strategies, the predict method completely ignores the input data! To illustrate DummyClassifier, first let's create an imbalanced dataset:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let's compare the accuracy of SVC and most_frequent:

>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...

We see that SVC doesn't do much better than a dummy classifier. Now, let's change the kernel:

>>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.94...

We see that the accuracy was boosted to almost 100%. A cross validation strategy is recommended for a better estimate of the accuracy, if it is not too CPU costly. For more information see the Cross-validation: evaluating estimator performance section. Moreover, if you want to optimize over the parameter space, it is highly recommended to use an appropriate methodology; see the Tuning the hyper-parameters of an estimator section for details. More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc.
DummyRegressor implements similar simple rules of thumb for regression; in all these strategies, the predict method completely ignores the input data.

What does the score function do in Python?

score(self, X, y, sample_weight=None) returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
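A minimal sketch of that equivalence (toy data; a scikit-learn regressor's score method performs the same R^2 computation as r2_score):

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import r2_score
>>> X = np.array([[1.0], [2.0], [3.0], [4.0]])
>>> y = np.array([1.0, 2.1, 2.9, 4.2])
>>> reg = LinearRegression().fit(X, y)
>>> float(reg.score(X, y)) == float(r2_score(y, reg.predict(X)))
True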
What does Model score() do?

Model scoring is an AI Studio operator that stores the value predicted by a supervised learning model for the objective field, i.e., the field you want to predict. When you make a prediction in AI Studio, the model returns the predicted value along with a performance measure.
What is the score method in linear regression?

Linear regression scoring: this type of scoring is performed by applying the linear regression algorithm to a random sample of the data. The process applies scoring techniques to variables that have linear dependencies.
What is the score function in sklearn?

In Python, the accuracy_score function of the sklearn.metrics package calculates the accuracy score for a set of predicted labels against the true labels.