### SPIE Press Book

Optimal Bayesian Classification

The most basic problem of engineering is the design of optimal operators. Design takes different forms depending on the random process constituting the scientific model and the operator class of interest. For classification, the random process is a feature-label distribution, and a Bayes classifier minimizes classification error. Rarely do we know the feature-label distribution or have sufficient data to estimate it. To best use available knowledge and data, this book takes a Bayesian approach to modeling the feature-label distribution and designs an optimal classifier relative to a posterior distribution governing an uncertainty class of feature-label distributions. The origins of this approach lie in estimating classifier error when there are insufficient data to hold out test data, in which case an optimal error estimate can be obtained relative to the uncertainty class. A natural next step is to forgo classical ad hoc classifier design and find an optimal classifier relative to the posterior distribution over the uncertainty class—this being an optimal Bayesian classifier.

Pages: 362

ISBN: 9781510630697

Volume: PM310

### Table of Contents

*Preface*

*Acknowledgments*

**1 Classification and Error Estimation**

- 1.1 Classifiers
- 1.2 Constrained Classifiers
- 1.3 Error Estimation
- 1.4 Random Versus Separate Sampling
- 1.5 Epistemology and Validity
- 1.5.1 RMS bounds
- 1.5.2 Error RMS in the Gaussian model

**2 Optimal Bayesian Error Estimation**

- 2.1 The Bayesian MMSE Error Estimator
- 2.2 Evaluation of the Bayesian MMSE Error Estimator
- 2.3 Performance Evaluation at a Fixed Point
- 2.4 Discrete Model
- 2.4.1 Representation of the Bayesian MMSE error estimator
- 2.4.2 Performance and robustness in the discrete model
- 2.5 Gaussian Model
- 2.5.1 Independent covariance model
- 2.5.2 Homoscedastic covariance model
- 2.5.3 Effective class-conditional densities
- 2.5.4 Bayesian MMSE error estimator for linear classification
- 2.6 Performance in the Gaussian Model with LDA
- 2.6.1 Fixed circular Gaussian distributions
- 2.6.2 Robustness to falsely assuming identity covariances
- 2.6.3 Robustness to falsely assuming Gaussianity
- 2.6.4 Average performance under proper priors
- 2.7 Consistency of Bayesian Error Estimation
- 2.7.1 Convergence of posteriors
- 2.7.2 Sufficient conditions for consistency
- 2.7.3 Discrete and Gaussian models
- 2.8 Calibration
- 2.8.1 MMSE calibration function
- 2.8.2 Performance with LDA
- 2.9 Optimal Bayesian ROC-based Analysis
- 2.9.1 Bayesian MMSE FPR and TPR estimation
- 2.9.2 Bayesian MMSE ROC and AUC estimation
- 2.9.3 Performance study

**3 Sample-Conditioned MSE of Error Estimation**

- 3.1 Conditional MSE of Error Estimators
- 3.2 Evaluation of the Conditional MSE
- 3.3 Discrete Model
- 3.4 Gaussian Model
- 3.4.1 Effective joint class-conditional densities
- 3.4.2 Sample-conditioned MSE for linear classification
- 3.4.3 Closed-form expressions for functions *I* and *R*
- 3.5 Average Performance in the Gaussian Model
- 3.6 Convergence of the Sample-Conditioned MSE
- 3.7 A Performance Bound for the Discrete Model
- 3.8 Censored Sampling
- 3.8.1 Gaussian model
- 3.9 Asymptotic Approximation of the RMS
- 3.9.1 Bayesian-Kolmogorov asymptotic conditions
- 3.9.2 Conditional expectation
- 3.9.3 Unconditional expectation
- 3.9.4 Conditional second moments
- 3.9.5 Unconditional second moments
- 3.9.6 Unconditional MSE

**4 Optimal Bayesian Classification**

- 4.1 Optimal Operator Design Under Uncertainty
- 4.2 Optimal Bayesian Classifier
- 4.3 Discrete Model
- 4.4 Gaussian Model
- 4.4.1 Both covariances known
- 4.4.2 Both covariances diagonal
- 4.4.3 Both covariances scaled identity or general
- 4.4.4 Mixed covariance models
- 4.4.5 Average performance in the Gaussian model
- 4.5 Transformations of the Feature Space
- 4.6 Convergence of the Optimal Bayesian Classifier
- 4.7 Robustness in the Gaussian Model
- 4.7.1 Falsely assuming homoscedastic covariances
- 4.7.2 Falsely assuming the variance of the features
- 4.7.3 Falsely assuming the mean of a class
- 4.7.4 Falsely assuming Gaussianity under Johnson distributions
- 4.8 Intrinsically Bayesian Robust Classifiers
- 4.9 Missing Values
- 4.9.1 Computation for application
- 4.10 Optimal Sampling
- 4.10.1 MOCU-based optimal experimental design
- 4.10.2 MOCU-based optimal sampling
- 4.11 OBC for Autoregressive Dependent Sampling
- 4.11.1 Prior and posterior distributions for VAR processes
- 4.11.2 OBC for VAR processes

**5 Optimal Bayesian Risk-based Multi-class Classification**

- 5.1 Bayes Decision Theory
- 5.2 Bayesian Risk Estimation
- 5.3 Optimal Bayesian Risk Classification
- 5.4 Sample-Conditioned MSE of Risk Estimation
- 5.5 Efficient Computation
- 5.6 Evaluation of Posterior Mixed Moments: Discrete Model
- 5.7 Evaluation of Posterior Mixed Moments: Gaussian Models
- 5.7.1 Known covariance
- 5.7.2 Homoscedastic general covariance
- 5.7.3 Independent general covariance
- 5.8 Simulations

**6 Optimal Bayesian Transfer Learning**

- 6.1 Joint Prior Distribution
- 6.2 Posterior Distribution in the Target Domain
- 6.3 Optimal Bayesian Transfer Learning Classifier
- 6.3.1 OBC in the target domain
- 6.4 OBTLC with Negative Binomial Distribution

**7 Construction of Prior Distributions**

- 7.1 Prior Construction Using Data from Discarded Features
- 7.2 Prior Knowledge from Stochastic Differential Equations
- 7.2.1 Binary classification of Gaussian processes
- 7.2.2 SDE prior knowledge in the BCGP model
- 7.3 Maximal Knowledge-Driven Information Prior
- 7.3.1 Conditional probabilistic constraints
- 7.3.2 Dirichlet prior distribution
- 7.4 REMLP for a Normal-Wishart Prior
- 7.4.1 Pathway knowledge
- 7.4.2 REMLP optimization
- 7.4.3 Application of a normal-Wishart prior
- 7.4.4 Incorporating regulation types
- 7.4.5 A synthetic example

*References*

*Index*

## Preface

The most basic problem of engineering is the design of optimal (or close-to-optimal) operators. The design of optimal operators takes different forms depending on the random process constituting the scientific model and the operator class of interest. The operators might be filters, controllers, or classifiers, each having numerous domains of application. The underlying random process might be a random signal/image for filtering, a Markov process for control, or a feature-label distribution for classification. Here we are interested in classification, and an optimal operator is a Bayes classifier, which is a classifier minimizing the classification error.

With sufficient knowledge we can construct the feature-label distribution and thereby find a Bayes classifier. Rarely, and in practice virtually never, do we possess such knowledge. On the other hand, if we had unlimited data, we could accurately estimate the feature-label distribution and obtain a Bayes classifier. Rarely do we possess sufficient data. Therefore, we must use whatever knowledge and data are available to design a classifier whose performance is hopefully close to that of a Bayes classifier.

Classification theory has historically developed mainly on the side of data,
the classical case being *linear discriminant analysis (LDA)*, where a Bayes
classifier is deduced from a Gaussian model and the parameters of the
classifier are estimated from data via maximum-likelihood estimation. The
idea is that we have knowledge that the true feature-label distribution is
Gaussian (or close to Gaussian) and the data can fill in the parameters, in this
case, the mean vectors and common covariance matrix. Much contemporary
work takes an even less knowledge-driven approach by assuming some very
general classifier form such as a neural network and estimating the network
parameters by fitting the network to the data in some manner. The more
general the classifier form, the more parameters to determine, and the more
data needed. Moreover, there is growing danger of overfitting the classifier
form to the data as the classifier structure becomes more complex. Lack of
knowledge presents us with model uncertainty, and hypothesizing a classifier
form and then estimating the parameters is an ad hoc way of dealing with that
uncertainty. It is ad hoc because the designer postulates a classification rule
based on some heuristics and then applies the rule to the data.
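
As an illustration of the classical plug-in approach just described, the following minimal sketch fits LDA by maximum-likelihood estimation of the mean vectors and common covariance matrix. The function names `fit_lda` and `predict_lda` are hypothetical, and estimating the class prior probabilities by sample proportions is an added assumption that presumes random sampling.

```python
import numpy as np

def fit_lda(X0, X1):
    """Plug-in LDA sketch. Returns (a, b) such that a point x is
    assigned to class 1 when a @ x + b > 0, and to class 0 otherwise."""
    n0, n1 = len(X0), len(X1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled maximum-likelihood covariance (normalized by n0 + n1).
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (n0 + n1)
    # Direction of the linear discriminant: S^{-1} (mu1 - mu0).
    a = np.linalg.solve(S, mu1 - mu0)
    # Offset; the log term uses sample proportions as class priors (assumption).
    b = -0.5 * a @ (mu0 + mu1) + np.log(n1 / n0)
    return a, b

def predict_lda(a, b, X):
    """Apply the fitted linear discriminant to the rows of X."""
    return (X @ a + b > 0).astype(int)
```

Nothing in this procedure accounts for uncertainty in the estimated parameters, which is the gap the Bayesian approach in this book is meant to address.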

This book takes a Bayesian approach to modeling the feature-label
distribution and designs an optimal classifier relative to a posterior
distribution governing an uncertainty class of feature-label distributions. In
this way it takes full advantage of knowledge regarding the underlying system
and the available data. Its origins lie in the need to estimate classifier error
when there are insufficient data to hold out test data, in which case an optimal
error estimate can be obtained relative to the uncertainty class. A natural next
step is to forgo classical ad hoc classifier design and simply find an optimal
classifier relative to the posterior distribution over the uncertainty class—this
being an *optimal Bayesian classifier*.

A critical point is that, in general, for optimal operator design, the prior distribution is not on the parameters of the operator (controller, filter, classifier), but on the unknown parameters of the scientific model, which for classification is the feature-label distribution. If the model were known with certainty, then one would optimize with respect to the known model; if the model is uncertain, then the optimization is naturally extended to include model uncertainty and the prior distribution on that uncertainty. Model uncertainty induces uncertainty on the operator parameters, and the distribution of the latter uncertainty follows from the prior distribution on the model. If one places the prior directly on the operator parameters while ignoring model uncertainty, then there is a scientific gap, meaning that the relation between scientific knowledge and operator design is broken.

The first chapter reviews the basics of classification and error estimation. It addresses the issue that confronts much of contemporary science and engineering: How do we characterize validity when data are insufficient for the complexity of the problem? In particular, what can be said regarding the accuracy of an error estimate? This is the most fundamental question for classification since the error estimate characterizes the predictive capacity of a classifier. The chapter closes with a section on the theory of optimal operator design under uncertainty. Optimal Bayesian classification is a particular case in which the operators are classifiers.

Chapter 2 develops the theory of optimal Bayesian error estimation: What
is the best estimate of classifier error given our knowledge and the data? It
introduces what is perhaps the most important concept in the book: effective
class-conditional densities. Optimal classifier design and error estimation for a
particular feature-label distribution are based on the class-conditional
densities. In the context of an uncertainty class of class-conditional densities,
the key role is played by the effective class-conditional densities, which are the
expected densities relative to the posterior distribution. The Bayesian
*minimum-mean-square error (MMSE)* theory is developed for the discrete
multinomial model and several Gaussian models. Sufficient conditions for
error-estimation consistency are provided. The chapter closes with a
discussion of optimal Bayesian ROC estimation.
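
To make effective class-conditional densities concrete, the sketch below works in the discrete model under assumptions that this preface does not spell out: each class has a Dirichlet prior over its bin probabilities, the class-0 probability `c` has an independent beta prior, and sampling is random. Under these assumptions the effective density is the posterior mean of the bin probabilities, and the Bayesian MMSE error estimate of a fixed classifier is its error measured against those effective densities. All function and parameter names are hypothetical.

```python
import numpy as np

def effective_density(counts, alpha):
    """Posterior-mean bin probabilities under a Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

def bayesian_mmse_error(labels, counts0, counts1, alpha0, alpha1, a=1.0, b=1.0):
    """Expected error of a fixed discrete classifier relative to the posterior.

    labels[i] in {0, 1} is the class the classifier assigns to bin i;
    counts0/counts1 are per-bin sample counts for classes 0 and 1;
    (a, b) are the beta prior hyperparameters for the class-0 probability c.
    """
    labels = np.asarray(labels)
    n0, n1 = np.sum(counts0), np.sum(counts1)
    c_hat = (n0 + a) / (n0 + n1 + a + b)      # posterior mean of c
    f0 = effective_density(counts0, alpha0)
    f1 = effective_density(counts1, alpha1)
    err0 = f0[labels == 1].sum()              # class-0 mass in bins labeled 1
    err1 = f1[labels == 0].sum()              # class-1 mass in bins labeled 0
    return c_hat * err0 + (1.0 - c_hat) * err1
```

For example, with two bins, counts `[3, 1]` and `[1, 3]`, uniform hyperparameters, and the classifier `[0, 1]`, the estimate works out to 1/3.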

Chapter 3 addresses error-estimation accuracy. In the typical ad hoc
classification paradigm, there is no way to address the accuracy of a particular
error estimate. We can only quantify the *mean-square error (MSE)* of error
estimation relative to the sampling distribution. With Bayesian MMSE error
estimation, we can compute the MSE of the error estimate conditioned on the
actual sample relative to the uncertainty class. The sample-conditioned MSE
is studied in the discrete and Gaussian models, and its consistency is
established. Because a running MSE calculation can be performed as new
sample points are collected, one can do censored sampling: stop sampling
when the error estimate and MSE of the error estimate are sufficiently small.
Section 3.9 provides double-asymptotic approximations of the first and
second moments of the Bayesian MMSE error estimate relative to the
sampling distribution and the uncertainty class, thereby providing an asymptotic
approximation to the MSE or its square root, the *root-mean-square (RMS)*
error. Double-asymptotic convergence means that both the sample size and
the dimension of the space increase to infinity with a fixed limiting ratio between the two.
Even though we omit many theoretical details (referring instead to the
literature), this section is rather long, on account of double asymptotics, and
contains many complicated equations. Nevertheless, it provides an instructive
analysis of the relationship between the conditional and unconditional RMS.
Regarding the chapter as a whole, it is not a logical prerequisite for succeeding
chapters and can be skipped by those wishing to move directly to
classification.
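
The sample-conditioned MSE can be sketched for the discrete model. Because the Bayesian MMSE estimate is the posterior mean of the true error, its MSE conditioned on the sample is the posterior variance of the true error. The sketch assumes independent Dirichlet posteriors for the two classes and an independent beta posterior for the class-0 probability `c`, so that each class-conditional error (a sum of Dirichlet components) is beta distributed. The names are hypothetical, and this is not presented as the book's own derivation.

```python
import numpy as np

def beta_moments(p, q):
    """First and second moments of a Beta(p, q) random variable."""
    m1 = p / (p + q)
    m2 = p * (p + 1.0) / ((p + q) * (p + q + 1.0))
    return m1, m2

def conditional_mse(labels, counts0, counts1, alpha0, alpha1, a=1.0, b=1.0):
    """Return (error estimate, sample-conditioned MSE) for a fixed classifier."""
    labels = np.asarray(labels)
    post0 = np.asarray(counts0, dtype=float) + np.asarray(alpha0, dtype=float)
    post1 = np.asarray(counts1, dtype=float) + np.asarray(alpha1, dtype=float)
    n0, n1 = np.sum(counts0), np.sum(counts1)
    c1, c2 = beta_moments(n0 + a, n1 + b)          # moments of c
    A0 = post0[labels == 1].sum()                  # class-0 error mass
    A1 = post1[labels == 0].sum()                  # class-1 error mass
    e0_1, e0_2 = beta_moments(A0, post0.sum() - A0)
    e1_1, e1_2 = beta_moments(A1, post1.sum() - A1)
    # True error is c * e0 + (1 - c) * e1 with c, e0, e1 independent.
    mean = c1 * e0_1 + (1.0 - c1) * e1_1
    second = (c2 * e0_2
              + 2.0 * (c1 - c2) * e0_1 * e1_1
              + (1.0 - 2.0 * c1 + c2) * e1_2)
    return mean, second - mean ** 2
```

A censored-sampling rule along the lines described above could recompute these two numbers after each new sample point and stop once the estimate and `sqrt(mse)` both fall below chosen thresholds; the thresholds themselves are a design choice not specified here.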

Chapter 4 defines an optimal Bayesian classifier as one possessing
minimum expected error relative to the uncertainty class, this expectation
agreeing with the Bayesian MMSE error estimate. Optimal Bayesian
classifiers are developed for a discrete model and several Gaussian models,
and convergence to a Bayes classifier for the true feature-label distribution is
studied. The robustness of assumptions on the prior distribution is discussed.
The chapter has a section on *intrinsically Bayesian robust* classification, which
is equivalent to optimal Bayesian classification with a null dataset. It next has
a section showing how missing values in the data are incorporated into the
overall optimization without having to implement an intermediate imputation
step, which would cause a loss of optimality. The chapter closes with two
sections in which sampling is not random. Section 4.10 considers optimal
sampling, and Section 4.11 examines the effect of dependent sampling.
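
As a small illustration of minimizing expected error relative to the uncertainty class, the sketch below builds a classifier for the discrete model by comparing the weighted effective class-conditional densities bin by bin. It assumes independent Dirichlet priors on the bin probabilities of each class, an independent beta prior on the class-0 probability, and random sampling. The function name `discrete_obc` and its parameters are hypothetical.

```python
import numpy as np

def discrete_obc(counts0, counts1, alpha0, alpha1, a=1.0, b=1.0):
    """Return (labels, expected_error) for the discrete model.

    labels[i] is the class assigned to bin i; expected_error is the
    expected error of that classifier relative to the posterior.
    """
    counts0 = np.asarray(counts0, dtype=float)
    counts1 = np.asarray(counts1, dtype=float)
    alpha0 = np.asarray(alpha0, dtype=float)
    alpha1 = np.asarray(alpha1, dtype=float)
    n0, n1 = counts0.sum(), counts1.sum()
    c_hat = (n0 + a) / (n0 + n1 + a + b)           # posterior mean of c
    # Effective class-conditional densities weighted by expected class mass.
    w0 = c_hat * (counts0 + alpha0) / (n0 + alpha0.sum())
    w1 = (1.0 - c_hat) * (counts1 + alpha1) / (n1 + alpha1.sum())
    labels = (w1 > w0).astype(int)                 # ties go to class 0
    expected_error = np.minimum(w0, w1).sum()
    return labels, expected_error
```

Passing all-zero counts leaves only the prior in play, which corresponds to the null-dataset case mentioned above for intrinsically Bayesian robust classification.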

Chapter 5 extends the theory to multi-class classification via optimal Bayesian risk classification. It includes evaluation of the sample-conditioned MSE of risk estimation and evaluation of the posterior mixed moments for both the discrete and Gaussian models.

Chapter 6 extends the multi-class theory to transfer learning. Here, there are data from a different (source) feature-label distribution, and one wishes to use these data together with whatever data are available from the (target) feature-label distribution of interest. The source and target are linked via a joint prior distribution, and an optimal Bayesian transfer learning classifier is derived for the posterior distribution in the target domain. Both Gaussian and negative binomial distributions are considered.

The final chapter addresses the fundamental problem of prior construction: How do we transform prior knowledge into a prior distribution? The first two sections address special cases: using data from discarded features, and using knowledge from a partially known physical system. The heart of the chapter is the development of a general method for transforming scientific knowledge into a prior distribution by performing an information-theoretic optimization over a class of potential priors with the optimization constrained by a set of conditional probability statements characterizing our scientific knowledge.

In a sense, this book is the last of a trilogy. *The Evolution of Scientific
Knowledge: From Certainty to Uncertainty* (Dougherty, 2016) traces the
epistemology of modern science from its deterministic beginnings in the
Seventeenth Century up through the inherent stochasticity of quantum theory
in the first half of the Twentieth Century, and then to the uncertainty in
scientific models that became commonplace in the latter part of the Twentieth
Century. This uncertainty leads to an inability to validate physical models,
thereby limiting the scope of valid science. The last chapter of the book
presents, from a philosophical perspective, the structure of operator design in
the context of model uncertainty. *Optimal Signal Processing Under
Uncertainty* (Dougherty, 2018) develops the mathematical theory articulated
in that last chapter, applying it to filtering, control, classification, clustering,
and experimental design. In this book, we extensively develop the classification
theory summarized in that book.

## Acknowledgments

The material in this book was developed over a number of years and involves the contributions of numerous students and colleagues, whom we would like to acknowledge: Mohammadmahdi Rezaei Yousefi, Amin Zollanvari, Mohammad Shahrokh Esfahani, Shahin Boluki, Alireza Karbalayghareh, Siamak Zamani Dadaneh, Roozbeh Dehghannasiri, Xiaoning Qian, Byung-Jun Yoon, and Ulisses Braga-Neto. We also extend our appreciation to Dara Burrows for her careful editing of the book.

**Edward R. Dougherty**

**Lori A. Dalton**

December 2019

**© SPIE.**