## Abstract

In this paper, we propose a method for detecting outliers in an unlabeled target dataset by exploiting an unlabeled source dataset. Outlier detection has attracted the attention of data miners for over two decades, since outliers can be crucial in decision making, knowledge discovery, and fraud detection, to name but a few. The fact that outliers are scarce and often tedious to label has motivated researchers to propose methods that detect them from an unlabeled dataset, some of which borrow strength from relevant labeled datasets in the framework of transfer learning. He et al. tackled a more challenging situation in which the input datasets, coming from multiple tasks, are all unlabeled. Their method, ML-OCSVM, conducts multi-task learning with one-class support vector machines (SVMs) and yields a mean model plus task-specific increments to detect outliers in the test datasets of the multiple tasks. We inherit part of their problem setting, taking only unlabeled datasets as input, but increase the difficulty by assuming only one source dataset in addition to the target dataset. Consequently, the source dataset consists of examples relevant to the target task as well as examples that are less relevant. To cope with this situation, we extend the Selective Transfer Machine, which weights individual examples in the framework of covariate shift and learns an SVM classifier, to our one-class setting by replacing the binary SVM with a one-class SVM. Experiments on two public datasets and an artificial dataset show that our method mostly outperforms baseline methods, including ML-OCSVM and a state-of-the-art ensemble anomaly detection method, in \(F_1\) score and AUC.
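The example-weighting idea mentioned above can be illustrated with a minimal sketch of kernel mean matching, which assigns larger weights to source examples that resemble the target data. This is not the authors' implementation. It assumes a Gaussian kernel, keeps only the box constraint on the weights, and uses plain projected gradient descent instead of a quadratic programming solver. The names `rbf_kernel` and `kmm_weights` and the parameters `gamma`, `upper`, and `n_iter` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kmm_weights(x_src, x_tar, gamma=0.5, upper=10.0, n_iter=2000):
    """Approximately minimize 0.5 * s^T K s - kappa^T s subject to 0 <= s <= upper.

    Simplification (assumption): the usual constraint on the sum of the
    weights is dropped, and projected gradient descent replaces a QP solver.
    """
    n_src, n_tar = len(x_src), len(x_tar)
    K = rbf_kernel(x_src, x_src, gamma)
    kappa = (n_src / n_tar) * rbf_kernel(x_src, x_tar, gamma).sum(axis=1)
    step = 1.0 / np.linalg.eigvalsh(K)[-1]  # 1 / largest eigenvalue of K
    s = np.ones(n_src)
    for _ in range(n_iter):
        grad = K @ s - kappa
        s = np.clip(s - step * grad, 0.0, upper)  # project onto the box
    return s

# Toy data: half of the source examples resemble the target, half do not.
rng = np.random.default_rng(0)
x_tar = rng.normal(0.0, 1.0, size=(40, 2))
x_src = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
                   rng.normal(5.0, 1.0, size=(30, 2))])
weights = kmm_weights(x_src, x_tar)
print("mean weight, relevant source examples:     ", weights[:30].mean())
print("mean weight, less relevant source examples:", weights[30:].mean())
```

This sketch covers only the weighting step. In the method summarized above, the weighting is coupled with learning a one-class SVM. One possible way to experiment with that coupling is to pass such weights as `sample_weight` to a weighted one-class SVM implementation, such as scikit-learn's `OneClassSVM.fit`.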

## Notes

- 1.
Transfer learning obtains knowledge while solving one or several source tasks and applies it to solve the target task.

- 2.
We initially proposed OCSTM for detecting anomalous facial expressions [15]. This paper extends it to general outlier detection by adding experiments in ECG and synthetic domains.

- 3.
As we stated in Sect. 1, the test dataset in classification is used as an unlabeled target dataset in several methods.

- 4.
Chen and Liu tackled pain recognition as an application [37] and thus we use “pain” and “normal” as class labels for a clear explanation.

- 5.
Bold lower-case letters represent column vectors, e.g., \(\mathbf{x}\); \(x_j\) denotes the value of the *j*th element of \(\mathbf{x}\). \(\mathbf{I}_n\in \mathbb {R}^{n\times n}\) is the identity matrix. \(\mathbf{0}_d\) represents the zero vector of length *d*, and \(\mathbf{1}_d\) represents the all-ones vector of length *d*.

- 6.
Since the source and target datasets can be regarded as the training and test datasets in classification, we use \(\mathrm{sc}\) and \(\mathrm{tar}\) in their symbols, respectively.

- 7.
Intuitively, importance sampling is a statistical technique for estimating the expected value of a random variable by oversampling the important region, i.e., the region that contributes most to the expectation.

- 8.
Since we assume unlabeled data, we dropped *y* from the argument of the weighting function.

- 9.
STM was originally proposed for the problem of detecting the facial expressions of a test subject based on training subjects. Since we handle transfer learning (TL), we call them the target task and the source tasks, respectively.

- 10.
The representer theorem states that, for an optimization problem whose objective is a loss function plus a regularization term \(\lambda \Vert \mathbf{w}\Vert ^2\), the optimal solution can be written as a linear combination of kernel functions evaluated at the training (source) samples [43].

- 11.
In this subsection, we follow the original notation and, unlike in the previous subsection, do not stack a 1 onto each data vector \(\mathbf{x}_i^{\mathrm{sc}}\).

- 12.
We give the details in the next paragraph.

- 13.
One of the *T* tasks is chosen as the target, and the remaining ones are agglomerated as the source task. We will explain the details in Sect. 5.3.

- 14.
- 15.
- 16.
In this case, *Precision* and *Recall* are both zero.

- 17.
- 18.
- 19.
- 20.
To optimize Eqs. (30) and (33) in the experiments, we use the interior-point method [60] provided in the CVXOPT Python library (http://cvxopt.org/index.html).

- 21.
The predicted values of the decision function are negative.

- 22.
Recall again that we tuned the parameters of other methods and not ours.
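
The intuition behind importance sampling given in note 7 can be made concrete with a small worked example. The following is a minimal sketch, unrelated to the paper's experiments: it estimates the rare-event probability \(P(X > 4)\) for a standard normal \(X\) by sampling from a proposal centered on the important region and reweighting each sample by the density ratio. The function names and the choice of proposal are assumptions made for illustration.

```python
import math
import random

def tail_prob_naive(threshold, n_samples, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), by sampling from N(0, 1) itself."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n_samples))
    return hits / n_samples

def tail_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), by sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal q
        if x > threshold:              # indicator of the rare event
            # Density ratio p(x) / q(x) for p = N(0, 1) and q = N(threshold, 1).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form tail probability
print("exact:              ", exact)
print("naive sampling:     ", tail_prob_naive(4.0, 20000))
print("importance sampling:", tail_prob_importance_sampling(4.0, 20000))
```

With the same number of samples, the naive estimator rarely observes the event at all, whereas the reweighted estimator draws about half of its samples from the region that determines the expectation.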

## References

- 1.
Hawkins DM (1980) Identification of outliers. Chapman and Hall, London

- 2.
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J. 8(3–4):237–253

- 3.
Deguchi Y, Suzuki E (2015) Hidden fatigue detection for a desk worker using clustering of successive tasks. In: Ambient Intelligence, vol 9425 of LNCS. Springer, pp 268–283

- 4.
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

- 5.
Tax DMJ, Duin RPW (2004) Support vector data description. Mach Learn 54(1):45–66

- 6.
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359

- 7.
Ganin Y, Lempitsky V (2015) Unsupervised domain adaptation by backpropagation. In: Proceedings of ICML, pp 1180–1189

- 8.
Sener O, Song HO, Saxena A, Savarese S (2016) Learning transferrable representations for unsupervised domain adaptation. In: Proceedings of NIPS, pp 2110–2118

- 9.
Long M, Zhu H, Wang J, Jordan MI (2016) Unsupervised domain adaptation with residual transfer networks. In: Proceedings of NIPS, pp 136–144

- 10.
Yang H, King I, Lyu MR (2010) Multi-task learning for one-class classification. In: Proceedings of IJCNN, pp 1–8

- 11.
He X, Mourot G, Maquin D, Ragot J, Beauseroy P, Smolarz A, Grall-Maës E (2014) Multi-task learning with one-class SVM. Neurocomputing 133:416–426

- 12.
Chu W-S, Torre FDL, Cohn JF (2017) Selective transfer machine for personalized facial expression analysis. IEEE Trans Pattern Anal Mach Intell 39(3):529–545

- 13.
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift by kernel mean matching. In: Dataset shift in machine learning, chapter 8, pp 131–160. The MIT Press, Cambridge

- 14.
Sugiyama M, Krauledat M, Müller K-R (2007) Covariate shift adaptation by importance weighted cross validation. J Mach Learn Res 8:985–1005

- 15.
Fujita H, Matsukawa T, Suzuki E (2018) One-class selective transfer machine for personalized anomalous facial expression detection. In: Proceedings of VISIGRAPP, vol 5: VISAPP, pp 274–283

- 16.
Han J, Kamber M, Pei J (2012) Data mining, 3rd edn. Morgan Kaufmann, Waltham

- 17.
Schapire RE (1999) A brief introduction to boosting. In: Proceedings of IJCAI, pp 1401–1406

- 18.
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226–231

- 19.
Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of SIGMOD, pp 49–60

- 20.
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of KDD, pp 58–65

- 21.
Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of VLDB, pp 186–195

- 22.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of SIGMOD, pp 94–105

- 23.
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. SIGMOD Rec 29(2):93–104

- 24.
Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: Proceedings of NIPS, pp 467–475

- 25.
Bellman RE (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

- 26.
Vapnik V (1995) The nature of statistical learning theory. Springer, New York

- 27.
Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167

- 28.
Liu H, Liu T, Wu J, Tao D, Fu Y (2015) Spectral ensemble clustering. In: Proceedings of KDD, pp 715–724

- 29.
Zhao Y, Nasrullah Z, Hryniewicki MK, Li Z (2019) LSCP: locally selective combination in parallel outlier ensembles. In: Proceedings of SDM

- 30.
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99

- 31.
Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. In: Proceedings of CVPR, pp 1855–1862

- 32.
Ge L, Gao J, Ngo H, Li K, Zhang A (2014) On handling negative transfer and imbalanced distributions in multiple source transfer learning. Stat Anal Data Min ASA Data Sci J 7(4):254–271

- 33.
Cao B, Pan SJ, Zhang Y, Yeung D-Y, Yang Q (2010) Adaptive transfer learning. In: Proceedings of AAAI, pp 407–412

- 34.
Tzeng E, Hoffman J, Darrell T, Saenko K (2015) Simultaneous deep transfer across domains and tasks. In: Proceedings of ICCV, pp 4068–4076

- 35.
Chen J, Liu X, Tu P, Aragones A (2013) Learning person-specific models for facial expression and action unit recognition. Pattern Recognit Lett 34(15):1964–1970

- 36.
Kodirov E, Xiang T, Fu Z-Y, Gong S (2015) Unsupervised domain adaptation for zero-shot learning. In: Proceedings of ICCV, pp 2452–2460

- 37.
Chen J, Liu X (2014) Transfer learning with one-class data. Pattern Recognit Lett 37:32–40

- 38.
Sangineto E, Zen G, Ricci E, Sebe N (2014) We are not all equal: personalizing models for facial expression analysis with transductive parameter transfer. In: Proceedings of ACM international conference on multimedia, pp 357–366

- 39.
Zen G, Porzi L, Sangineto E, Ricci E, Sebe N (2016) Learning personalized models for facial expression analysis and gesture recognition. IEEE Trans Multimed 18(4):775–788

- 40.
Sugiyama M, Nakajima S, Kashima H, Buenau PV, Kawanabe M (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of NIPS, pp 1433–1440

- 41.
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

- 42.
Candela JQ, Sugiyama M, Schwaighofer A, Lawrence ND (2009) Dataset shift in machine learning. MIT Press, Cambridge

- 43.
Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178

- 44.
Amari S, Wu S (1999) Improving support vector machine classifiers by modifying kernel functions. Neural Netw 12(6):783–789

- 45.
Gorski J, Pfeuffer F, Klamroth K (2007) Biconvex sets and optimization with biconvex functions: a survey and extensions. Math Methods Oper Res 66(3):373–407

- 46.
Monteiro RDC, Adler I (1989) Interior path following primal–dual algorithms. Part II: convex quadratic programming. Math Program 44(1–3):43–66

- 47.
Lucey P, Cohn JF, Prkachin KM, Solomon PE, Matthews I (2011) Painful data: the UNBC-McMaster shoulder pain expression archive database. In: Proceedings of IEEE international conference on automatic face and gesture recognition and workshops, pp 57–64

- 48.
Prkachin KM, Solomon PE (2008) The structure, reliability and validity of pain expression: evidence from patients with shoulder pain. Pain 139(2):267–274

- 49.
Moody GB, Mark RG (2001) The impact of the MIT-BIH arrhythmia database. IEEE Eng Med Biol Mag 20(3):45–50

- 50.
Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of ICCV, pp 1150–1157

- 51.
Ahonen T, Hadid A, Pietikäinen M (2006) Face description with local binary patterns: application to face recognition. IEEE Trans Pattern Anal Mach Intell 28(12):2037–2041

- 52.
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685

- 53.
Ekman P, Friesen WV (1975) Unmasking the face: a guide to recognizing emotions from facial cues. Prentice Hall, Englewood Cliffs

- 54.
Mohammadian A, Aghaeinia H, Towhidkhah F et al (2016) Subject adaptation using selective style transfer mapping for detection of facial action units. Expert Syst Appl 56:282–290

- 55.
Pan J, Tompkins WJ (1985) A real-time QRS detection algorithm. IEEE Trans Biomed Eng BME-32(3):230–236

- 56.
Yu S-N, Chen Y-H (2007) Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network. Pattern Recognit Lett 28(10):1142–1150

- 57.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

- 58.
Novikov A (2019) PyClustering: data mining library. J Open Source Softw 4(36):1230

- 59.
Zhao Y, Nasrullah Z, Li Z (2019) PyOD: a Python toolbox for scalable outlier detection. arXiv preprint arXiv:1901.01588

- 60.
Andersen M, Dahl J, Liu Z, Vandenberghe L (2011) Interior-point methods for large-scale cone programming. In: Optimization for machine learning, pp 55–83. MIT Press, Cambridge

## Acknowledgements

A part of this research was supported by Grants-in-Aid for Scientific Research JP15K12100 and JP18H03290 from the Japan Society for the Promotion of Science (JSPS).

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## About this article

### Cite this article

Fujita, H., Matsukawa, T. & Suzuki, E. Detecting outliers with one-class selective transfer machine. *Knowl Inf Syst* **62**, 1781–1818 (2020). https://doi.org/10.1007/s10115-019-01407-5

### Keywords

- One-class outlier detection
- One-class support vector machines
- Kernel mean matching
- Transfer learning