Locality Regularized Robust-PCRC: A Novel Simultaneous Feature Extraction and Classification Framework for Hyperspectral Images

Despite the successful applications of probabilistic collaborative representation classification (PCRC) in pattern classification, it still suffers from two challenges when applied to hyperspectral image (HSI) classification: 1) ineffective feature extraction in HSIs under noisy conditions; and 2) lack of prior information for HSI classification. To tackle the first problem, we impose sparse representation on PCRC, i.e., we replace the 2-norm with the 1-norm for effective feature extraction under noisy conditions. To utilize the prior information in HSIs, we first introduce the Euclidean distance (ED) between the training samples and the testing samples into PCRC to improve its performance. Then, we bring the coordinate information (CI) of the HSI into the proposed model, which finally leads to the proposed locality regularized robust PCRC (LRR-PCRC). Experimental results show that the proposed LRR-PCRC outperformed PCRC and other state-of-the-art pattern recognition and machine learning algorithms.

Hyperspectral images (HSIs) record hundreds of narrow spectral bands, providing rich spectral information for discriminating different land-cover types [1]. Supervised classification [2] is one of the main topics in HSI processing. In the supervised classification scenario, the labels of the testing samples are determined based on the given training samples labeled for each class [1]. Adopting supervised classification on HSIs, however, poses great challenges for various reasons. First, the ratio of the large number of spectral bands to the limited number of training pixels is high, leading to the well-known Hughes phenomenon [3]. Second, materials of the same category may exhibit different spectral features, and samples from different classes may share similar spectral characteristics due to sensor or environmental noise [4], [5].
To address the Hughes phenomenon in HSIs or remote sensing images, many advanced supervised classification methods have been proposed and have achieved good performance, such as the support vector machine (SVM) [5], extreme learning machine (ELM) [6], [7], and sparse multinomial logistic regression (SMLR) [8], [9]. In addition, some recent studies have also addressed the Hughes phenomenon. For example, in [10] and [11], band selection techniques were proposed to reduce the number of spectral bands via an adaptive subspace partition strategy and an optimal clustering framework, respectively; a subpixel component analysis and a recurrent attention technique were proposed for HSI and scene classification in [12] and [13], respectively; and in [14] and [15], HSI classification performance was improved by multilabel sample augmentation-based and superpixel-based semisupervised active learning. These methods have achieved relatively good performance.

On the other hand, methods such as sparse representation classification (SRC) [16], [17] and collaborative representation classification (CRC) [1] have also achieved good performance in HSI classification. Different from SVM, ELM, or SMLR, SRC and CRC need no training process because they represent a testing sample directly as a linear combination of all the training samples from all classes, with the 1-norm and the 2-norm imposed on the representation coefficient, respectively [18], [19]. Both methods then assign class labels to the testing samples directly by evaluating the minimum representation residual among all the classes [1]. This advantage of SRC and CRC has drawn considerable research attention. However, SRC has a drawback in HSI processing: it is time-consuming because computing the sparse 1-norm solution is a complicated procedure [1]. Meanwhile, Zhang et al. [19] pointed out that it is unnecessary to regularize the coefficients with the 1-norm if the feature dimension of the samples is high enough [1], and argued that the success of SRC can be largely attributed to the collaborative representation between the test sample and the training samples of all classes [19]. Therefore, CRC has attracted much attention due to its good performance. For example, in [20], a relaxed CRC method was proposed for multiple-feature fusion and classification with good performance. In [21], good classification performance was obtained with a collaborative neighbor representation method. In [22], a probabilistic collaborative representation classifier (PCRC) was proposed for pattern classification from the viewpoint of probability and achieved better performance than CRC and several other methods. The CRC family has also performed well in HSI classification. For example, a joint within-class collaborative representation was proposed for HSI classification in [23]. In [17], a nonlocal joint collaborative representation with a locally adaptive dictionary was proposed for HSI classification. In [24], a joint collaborative representation with multitask learning was proposed for HSI classification. These CRC-based methods have achieved good performance to some extent, but they ignore a critical issue of CRC: the 2-norm is sensitive to noise, while HSIs are prone to noise corruption, as stated in [16].
Noise may be introduced into HSIs during data acquisition and transmission [25]. Hence, it is important to design an algorithm that extracts features from noisy HSIs efficiently and effectively. For this purpose, many algorithms have been proposed, such as graph-regularized low-rank destriping (GRLD) [26], principal component analysis with wavelet shrinkage (PCA-WS) [27], spectral-spatial adaptive hyperspectral total variation (SSAHTV) [28], and singular spectrum analysis (SSA) [25]. However, these methods focus on either feature extraction or denoising independently, so the classification results may not be optimal because the preprocessing and classification tasks are not considered jointly.
To address the above challenges, we propose a locality regularized robust probabilistic collaborative representation classification (LRR-PCRC) framework that performs feature extraction and classification simultaneously for HSIs. First, we introduce sparse representation (SR) into PCRC, because the 1-norm SR [29], [30] is known to be a robust choice of loss function when the data are corrupted by noise. Then, we introduce the rich prior information of HSIs into PCRC to extract effective features, including the Euclidean distance (ED) between training and testing samples and the coordinate information (CI) of the HSI. The motivations are twofold. On the one hand, samples of the same category share similar spectral features to some extent even when the data points are corrupted; hence, discriminative features can still be extracted. On the other hand, and more importantly, neighboring pixels within a region are more likely to belong to the same category [1]. Therefore, the CI within an HSI can be used to extract spatial information, since the coordinates of the pixels remain unchanged even when the data points are corrupted. For the ED and CI priors, we calculate the ED and the coordinate distance between each training sample and all the testing samples, and use this prior information to impose constraints on the representation coefficients. Finally, locality regularization is employed for robust, simultaneous HSI feature extraction and classification.
The main contributions of our proposed work can be summarized as follows.
1) A novel LRR-PCRC framework for simultaneous feature extraction and classification is proposed.
2) Sparse representation is imposed on PCRC by replacing the 2-norm coefficient regularizer with the 1-norm, which makes the model robust to noise.
3) The rich prior information of HSIs, namely, the ED between the training and testing samples and the CI of the pixels, is incorporated to regularize the representation coefficients locally.

The remainder of this article is structured as follows. Section II introduces the related work. The details of the proposed LRR-PCRC framework are discussed in Section III. Section IV shows the experimental results and a comprehensive analysis. Section V concludes this article with insights into future work.

II. PROBABILISTIC COLLABORATIVE REPRESENTATION CLASSIFICATION (PCRC)
Given $N$ training samples from $K$ classes of an HSI, $X = [X_{N_1}, X_{N_2}, \ldots, X_{N_K}] \in \mathbb{R}^{d \times N}$, where $d$ is the number of bands in the HSI and $X_{N_k}$ is the data matrix of the $k$th-class training samples ($N_1 + N_2 + \cdots + N_K = N$). Let $la_X$ and $S$ denote the label set of all the classes in $X$ and the linear subspace collaboratively spanned by all the samples in $X$, respectively [22]. Then a data point $x \in \mathbb{R}^{d \times 1}$ in the collaborative subspace $S$ can be represented as $x = X\alpha = \sum_{k=1}^{K} X_{N_k}\alpha_{N_k}$, where $\alpha$ is the representation coefficient. The PCRC [22] formulated $S$ as a probabilistic collaborative subspace, arguing that a data point $x$ should have different probabilities of $la_x \in la_X$. Once the 2-norm of $\alpha$ is small, the probability $P(la_x \in la_X)$ should be high [22]. It is intuitive to use a Gaussian function to define this probability, that is,

$$P(la_x \in la_X) \propto \exp\left(-b\|\alpha\|_2^2\right) \tag{1}$$

where $b$ is a positive constant. Then, for every testing sample $y$, the probability that $y$ has the same label as $x$ can be formulated as $P(la_y = la_x)$. Hence, the probability that the testing sample $y$ lies in the subspace $S$ can be expressed as

$$P(la_y \in la_X) = P(la_y = la_x \mid la_x \in la_X)\, P(la_x \in la_X) \tag{2}$$

where $P(la_y = la_x \mid la_x \in la_X) \propto \exp(-c\|y - x\|_2^2)$ and $c$ is a positive constant. Then (2) can be rewritten as

$$P(la_y \in la_X) \propto \exp\left(-\left(c\|y - X\alpha\|_2^2 + b\|\alpha\|_2^2\right)\right) \tag{3}$$

Since $x$ can be seen as a sample that belongs to the $k$th class, the probability that the testing sample $y$ belongs to the $k$th class can be formulated as

$$P(la_y = k) = P(la_y = la_x \mid la_x = k)\, P(la_x = k \mid la_x \in la_X)\, P(la_x \in la_X) \tag{4}$$

where $P(la_x = k \mid la_x \in la_X)$ can be formulated as

$$P(la_x = k \mid la_x \in la_X) \propto \exp\left(-\beta\|X\alpha - X_{N_k}\alpha_{N_k}\|_2^2\right) \tag{5}$$

where $\beta$ is a positive constant.

Since $P(la_y = la_x \mid la_x \in la_X)$ is independent of the $k$th class once $k$ belongs to $la_X$, i.e., $P(la_y = la_x \mid la_x \in la_X) = P(la_y = la_x \mid la_x = k) \propto \exp(-c\|y - x\|_2^2)$, $P(la_y = k)$ can be expressed as

$$P(la_y = k) \propto \exp\left(-\left(c\|y - X\alpha\|_2^2 + b\|\alpha\|_2^2 + \beta\|X\alpha - X_{N_k}\alpha_{N_k}\|_2^2\right)\right) \tag{6}$$

Assuming the probabilities $P(la_y = k)$ are independent, the joint probability over all classes ($k = 1, 2, \ldots, K$) can be maximized as

$$\max\, P(la_y = 1, la_y = 2, \ldots, la_y = K) = \max \prod_{k=1}^{K} P(la_y = k) \tag{7}$$

Then, ignoring the constant terms after applying the logarithmic operator to (7), we have

$$\hat{\alpha} = \arg\min_{\alpha} \left\{ \|y - X\alpha\|_2^2 + \lambda\|\alpha\|_2^2 + \frac{\beta}{K} \sum_{k=1}^{K} \|X\alpha - X_{N_k}\alpha_{N_k}\|_2^2 \right\} \tag{8}$$

where $\lambda = b/c$ and the constant $c$ has been absorbed into $\beta$. The solution of (8) can be expressed as

$$\hat{\alpha} = \left( X^T X + \lambda I + \frac{\beta}{K} \sum_{k=1}^{K} \bar{X}_k^T \bar{X}_k \right)^{-1} X^T y \tag{9}$$

where $\bar{X}_k$ is $X$ with the $k$th-class block set to zero, so that $\bar{X}_k \alpha = X\alpha - X_{N_k}\alpha_{N_k}$. Finally, the label of testing sample $y$ is determined as

$$la_y = \arg\min_{k} \|X\hat{\alpha} - X_{N_k}\hat{\alpha}_{N_k}\|_2^2 \tag{10}$$
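To make the closed-form classifier concrete, the following minimal NumPy sketch implements (9) and the labeling rule (10) as reconstructed above; the function name, default parameter values, and data layout are illustrative assumptions rather than the article's original code (which was written in MATLAB).

```python
import numpy as np

def pcrc_classify(X, class_index, y, lam=2**-7, beta=2**-10):
    # X: d x N training matrix (columns grouped by class);
    # class_index: length-N label array; y: d-dimensional test sample.
    N = X.shape[1]
    classes = np.unique(class_index)
    K = len(classes)
    # (beta/K) * sum_k Xbar_k^T Xbar_k, where Xbar_k is X with the
    # kth-class columns zeroed, so Xbar_k @ alpha = X@alpha - X_k@alpha_k.
    Sk = np.zeros((N, N))
    for k in classes:
        Xbar = X.copy()
        Xbar[:, class_index == k] = 0.0
        Sk += Xbar.T @ Xbar
    lhs = X.T @ X + lam * np.eye(N) + (beta / K) * Sk
    alpha = np.linalg.solve(lhs, X.T @ y)       # closed-form solution (9)
    # Labeling rule (10): minimize ||X alpha - X_k alpha_k||_2 over k.
    Xalpha = X @ alpha
    residuals = [np.linalg.norm(Xalpha - X[:, class_index == k]
                                @ alpha[class_index == k]) for k in classes]
    return classes[int(np.argmin(residuals))]
```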

III. PROPOSED LRR-PCRC FRAMEWORK

A. LRR-PCRC Model
Although PCRC has achieved good performance, some problems remain. As mentioned above, PCRC is sensitive to noise, and its performance degrades when HSIs contain much noise. To address this drawback, we propose the LRR-PCRC for simultaneous HSI feature extraction and classification. Given $N$ training samples from $K$ classes of an HSI, $X \in \mathbb{R}^{d \times N}$, we need to represent the $n$ testing samples $Y \in \mathbb{R}^{d \times n}$ using the $N$ training samples, assuming that both of them are corrupted by noise, that is,

$$Y = XA + E \tag{11}$$

where $A \in \mathbb{R}^{N \times n}$ is the coefficient matrix and $E$ denotes the noise. From (11), we can see that the coefficient $A$ needs to represent the $n$ testing samples $Y$ using the $N$ training samples under the situation that both of them are corrupted. To address this problem, we reformulate the PCRC into the following structure:

$$\hat{A} = \arg\min_{A} \left\{ \|Y - XA\|_F^2 + \lambda\|A\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 \right\} \tag{12}$$

In (12), we replace the 2-norm coefficient regularizer with the 1-norm, considering that the 1-norm is robust to noise [30]. Furthermore, the feature extraction performance can be improved by incorporating the rich prior information of HSIs into the model, thus producing better classification results. Recalling the ED between training and testing samples, the spectral signals will be similar if a training sample and a testing sample belong to the same class. Hence, we incorporate this information into (12), and the model can be rewritten as

$$\hat{A} = \arg\min_{A} \left\{ \|Y - XA\|_F^2 + \lambda\|C \odot A\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 \right\} \tag{13}$$

where $\odot$ is the Hadamard operator [24] and $C \in \mathbb{R}^{N \times n}$, built from the EDs between the training samples and the testing samples, represents the similarity between each testing pixel $Y_j$ and the whole set of training samples $X$. The probability of the testing pixel $Y_j$ belonging to the $k$th class should be higher when the ED between $Y_j$ and the $k$th-class training samples is smaller, following the Gaussian probability model in (1). Also, we can deduce that the probability of the testing pixel $Y_j$ belonging to the $k$th class should be higher when the weighted 1-norm of the corresponding coefficients is smaller. On the other hand, the 1-norm sparse representation is robust to noise, as reported in [30].
Furthermore, we use the ED between the testing samples and training samples to constrain the coefficient $A$, which can improve the classification performance in terms of both accuracy and robustness. Beyond the ED between the testing and training samples, the CI of the training and testing samples can also be used to constrain the coefficient $A$. First, we calculate the coordinate distances between each training sample and all the testing samples; then the CI, which contains the locality information of the training and testing samples, is used to enhance the HSI classification performance. This can be explained from two aspects: first, the probability that two samples belong to the same class is larger if they are located close to each other [4]; second, even when the data points in an HSI are corrupted by noise, the CI of each pixel remains the same. Hence, (13) can be extended as

$$\hat{A} = \arg\min_{A} \left\{ \|Y - XA\|_F^2 + \lambda\|C \odot A\|_1 + \gamma\|CC \odot A\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 \right\} \tag{14}$$

where $CC \in \mathbb{R}^{N \times n}$ encodes the coordinate distances, with entries

$$CC_{i,j} = \exp\left( \frac{\sqrt{(h_i - h_j)^2 + (w_i - w_j)^2}}{f} \right) \tag{15}$$

where $h_i$ and $w_i$ are the abscissa and ordinate of a sample in the image, respectively, and $f$ is the smoothing parameter that adjusts the distance decay speed. Fig. 1 shows the flowchart of the proposed LRR-PCRC.
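For illustration, the two weight matrices could be computed as in the sketch below; the article's exact normalization of the ED weights $C$ is not shown in the text above, so using the raw distances (larger distance, larger penalty) is an assumption here, as is the exponential form given in (15).

```python
import numpy as np

def ed_weights(X, Y):
    # Assumed ED weight matrix C (N x n): entry (i, j) is the Euclidean
    # distance between training sample x_i and testing pixel Y_j, so that
    # dissimilar pairs receive a larger l1 penalty on A[i, j].
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.sqrt(np.maximum(d2, 0.0))

def ci_weights(train_hw, test_hw, f=3.0):
    # Coordinate weight matrix CC per (15): train_hw (N x 2) and test_hw
    # (n x 2) hold the (h, w) pixel coordinates; f adjusts the decay speed.
    diff = train_hw[:, None, :].astype(float) - test_hw[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=2))
    return np.exp(dist / f)
```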

B. Solutions for LRR-PCRC
In this section, we derive the optimization algorithm for the proposed LRR-PCRC model based on the inexact augmented Lagrange multiplier (IALM) method [31]. Recall that the proposed LRR-PCRC has three contributions compared with PCRC, i.e., the SR, the ED between the training and testing samples, and the CI of the HSI. We now derive the solution of the proposed LRR-PCRC as follows. (The solutions of PCRC with SR or ED, and PCRC with SR and ED, are given in the Supplementary Material.) First, we introduce two auxiliary variables $H$ and $J$ to convert (14) by splitting the variable $A$:

$$\hat{A} = \arg\min_{A,H,J} \left\{ \|Y - XA\|_F^2 + \lambda\|C \odot H\|_1 + \gamma\|CC \odot J\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 \right\} \quad \text{s.t.} \quad H = A,\; J = A \tag{16}$$

Then, the corresponding augmented Lagrangian function of (16) can be written as

$$\mathcal{L} = \|Y - XA\|_F^2 + \lambda\|C \odot H\|_1 + \gamma\|CC \odot J\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 + \langle Y_1, H - A \rangle + \langle Y_2, J - A \rangle + \frac{\tau}{2}\left( \|H - A\|_F^2 + \|J - A\|_F^2 \right) \tag{17}$$

where $Y_1$ and $Y_2$ are the Lagrange multipliers and $\tau > 0$ is a penalty parameter. The alternating optimization algorithm [32] can then be applied to solve (17). The details are introduced as follows.
Update $H$: Fix $A$ and $J$; then $H$ can be updated as

$$H^{t+1} = \arg\min_{H} \; \lambda\|C \odot H\|_1 + \frac{\tau^t}{2} \left\| H - \left( A^t - \frac{Y_1^t}{\tau^t} \right) \right\|_F^2 \tag{18}$$

The solution of (18) is given by the soft-threshold rule [33]

$$H^{t+1} = \max\left\{ 0,\; \mathrm{abs}(e) - \frac{\lambda C}{\tau^t} \right\} \odot \mathrm{sign}(e) \tag{19}$$

where $e = A^t - Y_1^t/\tau^t$.

Update $J$: Fix $A$ and $H$; then $J$ can be updated as

$$J^{t+1} = \arg\min_{J} \; \gamma\|CC \odot J\|_1 + \frac{\tau^t}{2} \left\| J - \left( A^t - \frac{Y_2^t}{\tau^t} \right) \right\|_F^2 \tag{20}$$

The solution of (20) can be computed as

$$J^{t+1} = \max\left\{ 0,\; \mathrm{abs}(e) - \frac{\gamma \, CC}{\tau^t} \right\} \odot \mathrm{sign}(e) \tag{21}$$

where $e = A^t - Y_2^t/\tau^t$.

Update $A$: Fix $H$ and $J$; then $A$ can be updated as

$$A^{t+1} = \arg\min_{A} \; \|Y - XA\|_F^2 + \frac{\beta}{K} \sum_{k=1}^{K} \|\bar{X}_k A\|_F^2 + \langle Y_1^t, H^{t+1} - A \rangle + \langle Y_2^t, J^{t+1} - A \rangle + \frac{\tau^t}{2}\left( \|H^{t+1} - A\|_F^2 + \|J^{t+1} - A\|_F^2 \right) \tag{22}$$

Setting the first-order derivative of (22) with respect to $A$ to zero gives

$$A^{t+1} = \left( X^T X + \frac{\beta}{K} \sum_{k=1}^{K} \bar{X}_k^T \bar{X}_k + \tau^t I \right)^{-1} \left( X^T Y + \frac{Y_1^t + Y_2^t}{2} + \frac{\tau^t}{2}\left( H^{t+1} + J^{t+1} \right) \right) \tag{23}$$

Finally, the overall optimization procedure for solving the proposed LRR-PCRC is summarized in Algorithm 1.
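As a sanity check on the update rules, here is a compact IALM sketch of (18)-(23); the stopping tolerance, the $\tau$ schedule ($\rho$, $\tau_{\max}$), and the zero initializations are illustrative assumptions, and `Sk` denotes the precomputed $(\beta/K)\sum_k \bar{X}_k^T\bar{X}_k$ term.

```python
import numpy as np

def soft(e, thresh):
    # Elementwise soft-thresholding: max(0, |e| - thresh) * sign(e).
    return np.maximum(0.0, np.abs(e) - thresh) * np.sign(e)

def lrr_pcrc_solver(X, Y, C, CC, Sk, lam, gamma,
                    tau=1e-2, rho=1.1, tau_max=1e6, iters=100, tol=1e-6):
    # X: d x N training matrix; Y: d x n testing matrix;
    # C, CC: N x n ED and coordinate weight matrices;
    # Sk: precomputed (beta/K) * sum_k Xbar_k^T Xbar_k (N x N).
    N, n = X.shape[1], Y.shape[1]
    A = np.zeros((N, n))
    H, J = np.zeros_like(A), np.zeros_like(A)
    Y1, Y2 = np.zeros_like(A), np.zeros_like(A)
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(iters):
        H = soft(A - Y1 / tau, lam * C / tau)      # update H, eq. (19)
        J = soft(A - Y2 / tau, gamma * CC / tau)   # update J, eq. (21)
        lhs = XtX + Sk + tau * np.eye(N)           # update A, eq. (23)
        rhs = XtY + 0.5 * (Y1 + Y2) + 0.5 * tau * (H + J)
        A = np.linalg.solve(lhs, rhs)
        Y1 += tau * (H - A)                        # multiplier updates
        Y2 += tau * (J - A)
        tau = min(rho * tau, tau_max)
        if max(np.abs(H - A).max(), np.abs(J - A).max()) < tol:
            break
    return A
```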

IV. EXPERIMENTAL RESULTS AND ANALYSIS

B. Benchmarking Approaches
To validate the proposed LRR-PCRC, we compare it with several state-of-the-art methods to benchmark its performance, including SMLR [9], SVM [34], SVM with composite kernel (SVM-CK), SMLR with attribute profiles (SMLR-AP) [9], PCRC, and PCRC with attribute profiles (PCRC-AP). The MATLAB codes of SVM and SVM-CK are downloadable from [35], while the codes of SMLR and AP are downloadable from [36]. The parameters of AP are set according to the recommendations in [9].
The experiments are carried out on a computer with a 2.9 GHz i7 7820HQ CPU and 32 GB of RAM running the Windows 10 OS. The codes are implemented in MATLAB (R2015a), and all the experimental results in this article are averaged over 10 repetitions.
The key performance indicators (KPIs) include the overall accuracy (OA), average accuracy (AA), kappa coefficient (κ), category accuracies (CAs), training time (Tr), and testing time (Ts). In addition, all the abbreviations used in this article are listed in Table I.

C. Parameter Analysis
The key parameters of PCRC are β and λ in (9), and the additional parameters of the proposed LRR-PCRC are f in (15) and γ in (14). Four experiments are carried out to evaluate the parameters λ, β, γ, and f, respectively, using five training samples per class. Since the proposed LRR-PCRC focuses on feature extraction and classification, we consider two different situations for the raw data, i.e., without additional noise and with normally distributed noise (i.i.d. zero-mean Gaussian noise with variance σ²). In this section, we set σ to 0.02. We fix the other parameters when evaluating one parameter. The experimental parameter settings for this section are given in Table II. To achieve reliable results, we repeated each experiment ten times and averaged the results.

Experiment 1 (λ): In this experiment, we evaluate the parameter λ = 2^{a_1}, where a_1 ranges over [−20, −19, …, 5]; β = 2^{−10} for both PCRC and LRR-PCRC, and f = 3 and γ = 2^{12} for LRR-PCRC. It can be seen from Fig. 2(a) and (b) that λ has some impact on both PCRC and LRR-PCRC. In addition, on the Indian Pines and Pavia University data sets, the OA of the proposed LRR-PCRC is higher than that of PCRC both with and without additional noise. We can also see that the OA of PCRC decreases dramatically when noise is added to the HSI data, whereas the proposed LRR-PCRC is more robust to noise than PCRC. As shown in Fig. 2(a) and (b), PCRC achieved its best accuracies at about λ = 2^{−7} and λ = 2^{−2} with and without noise, respectively, on Indian Pines. On the Pavia University data set, PCRC achieves good classification accuracies at λ = 2^{−7} under both the noise-free and noisy conditions. Hence, these parameters are selected for PCRC unless otherwise mentioned. For the proposed LRR-PCRC, the classification accuracies are relatively stable from λ = 2^{−15} to λ = 2^{−5} on the Indian Pines data set and from λ = 2^{−15} to λ = 2^{3} on the Pavia University data set. Hence, λ = 2^{−10} and λ = 2^{0} are selected for the Indian Pines and Pavia University data sets, respectively.

Experiment 2 (β): In this experiment, we evaluated the parameter β = 2^{a_2}, where a_2 ranges over [−20, −19, …, 5]. It can be seen from Fig. 2(c) and (d) that β also has a certain impact on PCRC on both the Indian Pines and Pavia University data sets, and on LRR-PCRC on the Indian Pines data set. In addition, the proposed LRR-PCRC not only produces higher classification accuracies than PCRC but also demonstrates more robustness to noise. As can also be seen from Fig. 2(c) and (d), PCRC produced its best classification accuracies at about β = 2^{−10} on the Indian Pines data set. On the Pavia University data set, PCRC produced its best classification accuracies at β = 2^{−7} and β = 2^{−4} with and without additional noise, respectively. Hence, these parameters are selected for PCRC unless otherwise mentioned. The proposed LRR-PCRC performs stably when a_2 varies over [−20, …, −5] and [−20, …, 5] on the Indian Pines and Pavia University data sets, respectively. Hence, unless specially mentioned in the following experiments, β is set to 2^{−8} and 2^{3} for the proposed LRR-PCRC on the Indian Pines and Pavia University data sets, respectively.

Experiment 3 (γ): In this experiment, we evaluated the parameter γ = 2^{a_3}, where a_3 ranges over [1, 2, …, 15]. It can be seen from Fig. 2(e) and (f) that the OA of LRR-PCRC first increases and then decreases as γ increases.
In addition, noise has more impact on the Indian Pines data set than on the Pavia University data set when γ is small. However, the impact can be mostly eliminated when γ is set to a higher value. Classification performance benefits when λ and β are set to small values while γ is set to a large value, because the CI then plays the dominant role in feature extraction and classification. We can see that LRR-PCRC yields its best classification accuracies on the Indian Pines and Pavia University data sets at γ = 2^{12} and γ = 2^{9}, respectively. Hence, γ = 2^{12} and γ = 2^{9} are selected for LRR-PCRC.
Experiment 4 (f): In this experiment, we evaluated the parameter f, ranging over [1, 2, …, 10]. It can be seen from Fig. 2(g) and (h) that the OA of LRR-PCRC first increases and then decreases as f increases. It can also be seen that the best value of f is 3 on both data sets, both with and without additional noise. Therefore, f = 3 is selected in the following experiments.
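As a reproducibility note, the noisy condition used throughout these experiments can be sketched as follows; the helper name, the seeding scheme, and the `evaluate`/`hsi` placeholders are illustrative assumptions.

```python
import numpy as np

def add_noise(data, sigma=0.02, seed=None):
    # i.i.d. zero-mean Gaussian noise with standard deviation sigma,
    # matching the noisy condition used in the parameter experiments.
    rng = np.random.default_rng(seed)
    return data + rng.normal(0.0, sigma, size=data.shape)

# Each experiment is repeated ten times and the OAs are averaged, e.g.:
# mean_oa = np.mean([evaluate(add_noise(hsi, 0.02, seed=s))
#                    for s in range(10)])
```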

D. Contribution Analysis
In this section, we analyze the three contributions of the proposed LRR-PCRC. Compared with PCRC, the proposed LRR-PCRC has three main contributions: 1) the sparse representation, which improves the feature extraction performance; 2) the ED prior added on top of the SR; and 3) the CI prior added on top of the SR and ED, which extracts more effective HSI features to further improve the classification accuracies. Hence, we show the impact of each contribution in this section. To show the exact impact of each contribution on the proposed model, we selected the best parameters for these contributions, as given in Table III. We used 20 training samples per class, with and without additional noise of σ = 0.02 (i.i.d. zero-mean Gaussian noise with variance σ²).
As can be seen from Table IV, each of the three contributions improves the classification accuracies on both the Indian Pines and Pavia University data sets. We can also observe a phenomenon: the ED information seems to be less important on the Indian Pines data set but important on the Pavia University data set. This can be explained by the different spatial structures of the two data sets. In more detail, as can be seen from Fig. 3, the pixels of the same category in the Indian Pines data set are located together, and the regions of the same category tend to be block-shaped; hence, the CI plays a dominant role on the Indian Pines data set. In the Pavia University data set, the regions of some categories tend to be stripe-shaped; hence, both the ED and the CI are very important there. From the above discussion, it is more reasonable to use the CI and ED collaboratively.

E. Effect of Varying Numbers of Training Samples and the Varying Levels of Noise
In this section, we further evaluate the feature extraction and classification performance of the proposed LRR-PCRC by varying the number of training samples and the level of noise. The number of training samples per class varies among 20, 25, and 30, while the noise level σ varies among 0.2, 0.4, 0.6, and 0.8.
As can be seen from Tables V and VI, the classification results of LRR-PCRC on the Indian Pines and Pavia University data sets are better and more stable than those of PCRC as the number of training samples and the noise level vary. This further demonstrates the superior performance of the proposed LRR-PCRC.

F. Comparison With Other State-of-the-Art Algorithms
In this section, we first verify the proposed method using under 1% of the samples for training; the corresponding numbers of training and testing samples are shown in Table VII. The experiments are conducted under two conditions, i.e., no additional noise and noise with a level of σ = 0.02. Tables VIII and IX display the classification accuracies on the Indian Pines and Pavia University data sets. As can be seen from Tables VIII and IX, the classification accuracies of the proposed LRR-PCRC are better than those of PCRC and the other methods under both conditions. This again verifies the good performance of the proposed LRR-PCRC. Figs. 4 and 5 show the corresponding classification maps.
Second, to further verify the proposed method, we divided the entire hyperspectral data set into three subsets (denoted as Subsets 1, 2, and 3) randomly, ten times. Each time, Subsets 1 and 2 are composed of randomly selected samples containing five samples per class. They are used for twofold cross validation, i.e., Subset 1 for training and Subset 2 for validation, and vice versa. Subset 3, containing the rest of the samples, is used for testing; a sketch of this split is given below. Finally, all the classification results are averaged. Tables X and XI show the results on the training (Tr), validation (Val), and testing (Ts) sets. As can be seen from these two tables, in general, the proposed method acquires better results than PCRC and the other methods on both data sets.
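The random subset protocol described above might be implemented as follows; the function name and seeding are illustrative, not taken from the article.

```python
import numpy as np

def threefold_split(labels, per_class=5, seed=0):
    # Subsets 1 and 2: per_class random samples per class each (used for
    # twofold cross validation); Subset 3: all remaining samples (testing).
    rng = np.random.default_rng(seed)
    s1, s2, s3 = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        s1.extend(idx[:per_class])
        s2.extend(idx[per_class:2 * per_class])
        s3.extend(idx[2 * per_class:])
    return np.array(s1), np.array(s2), np.array(s3)
```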
Third, Table XII shows the time consumed in training and testing with 20 training samples per class without additional noise. It should be noted that the time costs of these methods do not change when noise is added to the HSIs, since the dimensionality of the HSIs remains the same. From Table XII, we can see that on both the Indian Pines and Pavia University data sets, the time consumed by the proposed LRR-PCRC is higher than that of SMLR, SVM, SMLR-AP, and SVM-AP, but less than that of PCRC and PCRC-AP. The proposed LRR-PCRC has a higher time cost than SVM-CK on the Pavia University data set but a lower one on the Indian Pines data set. Furthermore, the classification accuracies of the proposed LRR-PCRC are much higher than those of the other classifiers. Hence, we can conclude that the proposed LRR-PCRC performs well in terms of the above analysis.

G. Extended Experiments and Analysis
In this section, to further demonstrate the performance of the proposed LRR-PCRC, we conduct additional comparison experiments with the well-known random forest (RF) [37] and its variants, including rotation forest (RoF) [38] and rotation random forest with kernel principal component analysis (RoRF-KPCA) [39].
The experiments are conducted with 20 samples per class used for training and the rest used for testing. The kernel function of RoRF-KPCA has three types: the linear function, the radial basis function (RBF), and the polynomial (Poly) function. The results of RF, RoF, and RoRF-KPCA are taken directly from [39]. Tables XIII and XIV show the classification results of the proposed LRR-PCRC, RF, and its variants on the Indian Pines and Pavia University data sets, respectively. As can be seen from these two tables, in general, the proposed method obtains better classification accuracies than RF and its variants. In addition, the confusion matrices corresponding to the classification results of the proposed method on both data sets are shown in Tables XV and XVI. We can hence conclude that the proposed method achieves good performance.

V. CONCLUSION
In this article, a novel framework based on a sparse constraint and the prior information of HSIs, LRR-PCRC, has been proposed to extract effective features and classify HSIs. By imposing the sparse constraint on PCRC, the resulting robust PCRC (RPCRC) can tolerate environmental noise and thus extracts effective features of HSIs and improves the classification accuracies. In addition, by adding the prior information of HSIs (ED and CI) to RPCRC, the proposed LRR-PCRC improves the classification accuracies significantly. Experiments have been conducted to compare the proposed LRR-PCRC with other state-of-the-art methods, and the results show that our method has superior performance.
Currently, the computational time of our algorithm is relatively high compared with SVM or SMLR due to the matrix inverse operation. One direction of our future work will therefore focus on reducing the time cost. One promising way is to use mathematical models to find theoretical solutions; another is to seek dimensionality reduction methods or semisupervised learning to further improve the classification accuracies while reducing the computational time. With respect to the prerequisites of the proposed method, labeled training samples are needed to train the model due to its supervised nature. Hence, another research stream of our work will extend the method to unsupervised learning.

SUPPLEMENTARY MATERIAL

A. Solution of LRR-PCRC Without ED and CI
First, we introduce an auxiliary variable $H$ to split the variable and simplify problem (12). Thus, (12) can be converted to

$$\hat{A} = \arg\min_{A,H} \left\{ \|Y - XA\|_F^2 + \lambda\|H\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 \right\} \quad \text{s.t.} \quad H = A \tag{24}$$

Then, the corresponding augmented Lagrangian function of (24) can be written as

$$\mathcal{L} = \|Y - XA\|_F^2 + \lambda\|H\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 + \langle Y_1, H - A \rangle + \frac{\tau}{2}\|H - A\|_F^2 \tag{25}$$

where $\langle Y_1, H - A \rangle = \mathrm{trace}(Y_1^T(H - A))$, $\tau > 0$ is a penalty parameter, and $Y_1$ is the Lagrange multiplier. The alternating optimization algorithm [26] can then be applied to solve (25). The details are as follows.
Update $H$: Fix $A$; then $H$ can be updated as

$$H^{t+1} = \arg\min_{H} \; \lambda\|H\|_1 + \frac{\tau^t}{2} \left\| H - \left( A^t - \frac{Y_1^t}{\tau^t} \right) \right\|_F^2 \tag{26}$$

The solution of (26) is given by the simple soft-threshold rule [32]

$$H^{t+1} = \mathrm{soft}\left( A^t - \frac{Y_1^t}{\tau^t}, \frac{\lambda}{\tau^t} \right) = \max\left\{ 0,\; \mathrm{abs}(e) - \frac{\lambda}{\tau^t} \right\} \odot \mathrm{sign}(e) \tag{27}$$

where $\mathrm{abs}(e)$ is the elementwise absolute value of $e$, $e = A^t - Y_1^t/\tau^t$, and $\mathrm{sign}$ is the sign function [33].
Update $A$: Fix $H$; then $A$ can be updated as

$$A^{t+1} = \arg\min_{A} \; \|Y - XA\|_F^2 + \frac{\beta}{K} \sum_{k=1}^{K} \|\bar{X}_k A\|_F^2 + \langle Y_1^t, H^{t+1} - A \rangle + \frac{\tau^t}{2}\|H^{t+1} - A\|_F^2 \tag{28}$$

Setting the first-order derivative of (28) with respect to $A$ to zero gives

$$A^{t+1} = \left( X^T X + \frac{\beta}{K} \sum_{k=1}^{K} \bar{X}_k^T \bar{X}_k + \frac{\tau^t}{2} I \right)^{-1} \left( X^T Y + \frac{Y_1^t}{2} + \frac{\tau^t}{2} H^{t+1} \right) \tag{29}$$
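Putting (26)-(29) together, a minimal sketch of this SR-only IALM loop could look as follows; the $\tau$ schedule and iteration count are assumptions, and `Sk` again denotes the precomputed $(\beta/K)\sum_k \bar{X}_k^T\bar{X}_k$ term from (12).

```python
import numpy as np

def sr_pcrc_solver(X, Y, Sk, lam, tau=1e-2, rho=1.1, iters=100):
    # Solves (24) with a single auxiliary variable H (unweighted threshold).
    N, n = X.shape[1], Y.shape[1]
    A = np.zeros((N, n))
    H, Y1 = np.zeros_like(A), np.zeros_like(A)
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(iters):
        e = A - Y1 / tau
        H = np.maximum(0.0, np.abs(e) - lam / tau) * np.sign(e)   # (27)
        A = np.linalg.solve(XtX + Sk + 0.5 * tau * np.eye(N),
                            XtY + 0.5 * Y1 + 0.5 * tau * H)       # (29)
        Y1 += tau * (H - A)            # multiplier update
        tau *= rho                     # penalty schedule (assumed)
    return A
```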

B. Solutions of LRR-PCRC Without CI or Without ED
The corresponding augmented Lagrangian functions for LRR-PCRC without CI and LRR-PCRC without ED can be written as follows.
1) LRR-PCRC Without CI: An auxiliary variable $H$ is introduced for the ED-weighted term in (13), giving the augmented Lagrangian

$$\mathcal{L} = \|Y - XA\|_F^2 + \lambda\|C \odot H\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 + \langle Y_1, H - A \rangle + \frac{\tau}{2}\|H - A\|_F^2 \tag{30}$$

Update $H$: Fix $A$; then $H$ can be updated as

$$H^{t+1} = \arg\min_{H} \; \lambda\|C \odot H\|_1 + \frac{\tau^t}{2} \left\| H - \left( A^t - \frac{Y_1^t}{\tau^t} \right) \right\|_F^2 \tag{31}$$

whose solution follows the weighted soft-threshold rule

$$H^{t+1} = \max\left\{ 0,\; \mathrm{abs}(e) - \frac{C \times \lambda}{\tau^t} \right\} \odot \mathrm{sign}(e) \tag{32}$$

where $e = A^t - Y_1^t/\tau^t$.

Update $A$: Fix $H$; then $A$ can be updated as

$$A^{t+1} = \arg\min_{A} \; \|Y - XA\|_F^2 + \frac{\beta}{K} \sum_{k=1}^{K} \|\bar{X}_k A\|_F^2 + \langle Y_1^t, H^{t+1} - A \rangle + \frac{\tau^t}{2}\|H^{t+1} - A\|_F^2 \tag{33}$$

Then the solution of (33) can be achieved by setting the first-order derivative to zero:

$$A^{t+1} = \left( X^T X + \frac{\beta}{K} \sum_{k=1}^{K} \bar{X}_k^T \bar{X}_k + \frac{\tau^t}{2} I \right)^{-1} \left( X^T Y + \frac{Y_1^t}{2} + \frac{\tau^t}{2} H^{t+1} \right) \tag{34}$$

2) LRR-PCRC Without ED: Here an auxiliary variable $J$ is introduced to split the variable and make the model easier to solve, with Lagrange multiplier $Y_2$:

$$\mathcal{L} = \|Y - XA\|_F^2 + \gamma\|CC \odot J\|_1 + \frac{\beta}{K} \sum_{k=1}^{K} \|XA - X_{N_k}A_{N_k}\|_F^2 + \langle Y_2, J - A \rangle + \frac{\tau}{2}\|J - A\|_F^2 \tag{35}$$

Update $J$: Fix $A$; then $J$ can be updated as

$$J^{t+1} = \arg\min_{J} \; \gamma\|CC \odot J\|_1 + \frac{\tau^t}{2} \left\| J - \left( A^t - \frac{Y_2^t}{\tau^t} \right) \right\|_F^2 \tag{36}$$

The solution of (36) is given by the weighted soft-threshold rule

$$J^{t+1} = \max\left\{ 0,\; \mathrm{abs}(e) - \frac{CC \times \gamma}{\tau^t} \right\} \odot \mathrm{sign}(e) \tag{37}$$

where $e = A^t - Y_2^t/\tau^t$.

Update $A$: Fix $J$; then $A$ can be updated as

$$A^{t+1} = \arg\min_{A} \; \|Y - XA\|_F^2 + \frac{\beta}{K} \sum_{k=1}^{K} \|\bar{X}_k A\|_F^2 + \langle Y_2^t, J^{t+1} - A \rangle + \frac{\tau^t}{2}\|J^{t+1} - A\|_F^2 \tag{38}$$

Then the solution of (38) can be achieved by setting the first-order derivative to zero:

$$A^{t+1} = \left( X^T X + \frac{\beta}{K} \sum_{k=1}^{K} \bar{X}_k^T \bar{X}_k + \frac{\tau^t}{2} I \right)^{-1} \left( X^T Y + \frac{Y_2^t}{2} + \frac{\tau^t}{2} J^{t+1} \right) \tag{39}$$