Size of Data Sets for Support Vector Machine Classification
The College of New Jersey
Jessie Burger
May 10, 2002
Submitted in partial completion of the requirements of CMSC 497,
Mentored Research in Computer Science.
Abstract
When attempting to use support vector machines to classify data, it is useful to know the quantity of training examples as well as the number of features needed to create a support vector machine that can accurately distinguish between classes. This paper will discuss how much data is needed to train a classifier to accurately predict the class of an example as well as which characteristics are most important when creating a data set. The number of instances in a training set was found to be more important than the number of features representing the classes. Also, the balance of a set is very influential to the accuracy of the prediction method.
Size of Data Sets for Support Vector Machine Classification
Introduction
When designing the datasets for training support vector machines, it is useful to know the amount of information needed to create accurate classifiers. Using too many training examples or too many features to describe each class may actually lower the effectiveness of the classifier. However, too little information will also prevent the support vector machine from being able to accurately divide the classes. To ensure the best performance of the classifier, a middle ground between too little and too much data must be found. The purpose of this project is to attempt to find such a compromise.
Support vector machines trained using sets containing a comparable number of examples of each class have a much high accuracy than those trained using sets with a disproportionate number of examples of one class. When the training set contains an equal number of examples for each class, the number of training examples has a greater effect on the accuracy of the classifier than does the number of features describing each example. For both the number of features and the number of examples, the accuracy of the classifier increases with the amount of data used until a certain point at which the gains in accuracy level off.
Background
Support vector machines are based on statistical learning theory. Because the learning method is statistically based, support vector machines are a fast, efficient method of machine learning. Classification-type support vector machines are trained using a data set containing a positive class and a negative class. The support vector machine learns to distinguish between the two classes. It can then classify new examples by deciding if an example is a member of the positive class or not. If the testing example is not a member of the positive class, it is assumed to be a member of the negative class, since the support vector machine can only recognize two classes.
Each instance of a class is referred to as an attribute vector. The vector is labeled with either a positive or a negative one, showing if it is a member of the positive or the negative class. Each attribute vector contains attributes that are represented as real numbers. Each attribute represents a feature of the item being represented. The positive and negative vectors are normally arranged in pairs, so that the data sets alternate between positive and negative vectors.
The vectors representing each class can be thought of as points in multidimensional space, with each attribute providing the value for one dimension. The support vector machine training method finds the hyperplane that separates the two classes by the widest margin possible while minimizing the number of instances of each class that are grouped with the opposite class. The attribute vectors that define the hyperplane are the support vectors. The training method creates a support vector machine model, which contains the information necessary to predict the class of other vectors. The prediction method uses the training set and the model created by the training method to predict the class of each vector in the testing set. It then checks the stated class of each vector and determines its prediction accuracy for the set.
The training sets for support vector machine classifiers can be divided into two types. Balanced training sets use an equal number of positive and negative training examples. Classification using support vector machines trained on balanced sets is the main subject of this paper. Unbalanced training sets contain either a greater number of positive vectors or a greater number of negative vectors. These will be referred to as positively unbalanced sets and negatively unbalanced sets, respectively.
The Web Host Access Tools (WHAT) project is an attempt to create web search tools that learn the preferences of the user and return the most appropriate responses. Support vector machines, because of their efficient learning method, would seem to be an ideal form of machine learning for WHAT. However, since a large number of training examples or a large number of attributes may be difficult to create, it was necessary to discover how much data support vector machines need before they begin to accurately predict the class of new examples.
Design & Methodology
The data used in these experiments are representations of fluorescent patterns in ten different subcellular structures such as anti-mitochondria and anti-nucleolin. Attributes are such properties as the number of fluorescent objects in the image, the ratio of the size of the largest object to the size of the smallest, and the average distance of the objects to the center of the cell. Each class represents a type of subcellular structure. Fifty-five instances from each class were randomly chosen as training examples, and eighteen instances were randomly chosen as test examples. Thus, each training set contained fifty-five positive vectors and fifty-five negative vectors, which were arranged by alternating positive vectors and negative vectors. The corresponding test sets contained eighteen positive and eighteen negative vectors. Each vector contained thirty-four attributes. For each attribute the mean and standard deviation were determined based on the training data in all of the classes, and then the mean and standard deviation were used to normalize both the training and the testing sets. By comparing each class with every other class, thirty-six unique combinations were created. Six of the test sets were corrupted during the initial stages of experimentation and could not be replaced, so thirty data sets were used.
The accuracy of the prediction method of the support vector machine is affected both by the number of attributes per vector and the number of vectors in the training set. Thus, the experiments to determine how much data is needed to accurately train a support vector machine needed to vary both the number of vectors and the number of attributes. Vectors and attributes were methodically removed from the original thirty data sets, which contained fifty-five positive vectors, fifty-five negative vectors, and thirty-four attributes per vector. Attributes were removed from the ends of the vectors, and vectors were removed from the ends of the data sets. For balanced sets, the training and testing sets had between thirty-four and two attributes decreasing in increments of four attributes. The number of vectors in the balanced training sets ranged between fifty-five positive and fifty-five negative vectors to five positive and five negative vectors in increments of five vector pairs. Because five vector pairs still provide ten training examples, two vector pairs were also used to provide a lower number of training vectors. A total of one thousand eight hundred and ninety training sets of various sizes were used to determine the number of vectors and attributes needed in a balanced set to effectively train a support vector machine.
The balance of the set also helps determine the accuracy of prediction. To test the effects of balance on the prediction accuracy of classification-type support vector machines, the number of either the positive or the negative vectors in the training set was varied. Positively unbalanced training sets were created by deleting negative vectors from the ends of the data sets. The number of positive vectors in the positively unbalanced training sets was always fifty-five, while the number of negative vectors ranged between fifty-five and five, in increments of ten negative vectors. Negatively balanced training sets were created by removing positive vectors from the ends of the data sets in the same manner as negative vectors were removed from the positively balanced sets. Three hundred and thirty training sets were used to determine the effect of balance on the training of a support vector machine.
Experimental Results
For classifiers trained using balanced sets, the average prediction accuracy was 92.89%. The average accuracy for sets containing fifty-five vector pairs was 96.45%, sets with forty-five vector pairs had an average accuracy of 95.74%, sets with thirty-five vector pairs were 95.47% accurate, sets with twenty-five vector pairs were 94.12% accurate, sets with fifteen vector pairs had average accuracy of 92.94%, sets with five vector pairs were 88.72% accurate, and sets with two vector pairs were 86.78% accurate. The average accuracy for sets containing thirty-four attributes was 94.29%, the average accuracy of sets with thirty attributes was 93.77%, sets containing twenty-six attributes were 94.01% accurate, sets with twenty-two attributes were 94.22% accurate, sets with eighteen attributes had an average accuracy of 94.17%, sets with fourteen attributes were 93.37% accurate, sets with ten attributes were 91.90% accurate, sets containing six attributes were 92.28% accurate, and sets containing only two attributes were 88.00% accurate.
The percentage of balanced training sets that created classifiers with less than 100.0% prediction accuracy was 61.96%. The number of sets with between 30.00% and 39.99% accuracy was 0.053% of the total number of sets, 0.318% of the sets had accuracy between 40.00% and 49.99%, 3.02% of the sets were between 50.00% and 59.99% accurate, 4.55% of the sets were between 60.00% and 69.99% accurate, 4.13% of the sets had accuracy between 70.00% and 79.99%, 8.04% of the sets had an average accuracy between 80.00% and 89.99%, 41.85% of the sets had between 90.0% and 99.99% accuracy, and 38.04% of the sets were 100.0% accurate in their predictions. The lowest accuracy of any set was 38.89%.
Figure 1 shows the effect of varying the number of vector pairs and the number of attributes in a balanced training set. The number of vector pairs varies along the x-axis, while each series represents a different number of attributes. Each thin trendline shows the effect on the accuracy of varying the number of vector pairs in a balanced set containing the stated number of attributes. The thick blue trendline shows the overall effect varying the number of vector pairs in a balanced training set has on the accuracy of the classifier. Each trendline is a second-order polynomial regression. As can clearly be seen in Figure 1, as the number of vector pairs in a balanced training set increases, so does the accuracy, until a plateau is reached and adding further pairs of vectors does not increase the accuracy. Except for the line representing the data sets using only two attributes, all of the trendlines are closely grouped together, showing that the threshold of the necessary number of attributes is fairly low.
[INSERT FIGURE 1- Balanced Sets with Varying Number of Vector Pairs]
Figure 2 also shows the effect of varying the number of vector pairs and the number of attributes in a balanced training set. However, in Figure 2, the number of attributes varies along the x-axis, while each series represents a different number of vector pairs. Each thin trendline shows how varying the number of attributes in a balanced set containing the stated number of vector pairs affects the prediction accuracy. The thick blue trendline shows the overall effect varying the number attributes used in a balanced training set has on the accuracy of the classifier. Each trendline is a second-order polynomial regression. As can be seen in Figure 2, as the number of attributes in a balanced training set increases, so does the accuracy, until a point where adding further attributes does not increase the accuracy. The trendlines are spaced fairly evenly apart, showing that increasing the number of vectors in a balanced set increases the accuracy. It is clear from Figures 1 and 2 that the number of vectors in a training set plays a larger role in the accuracy of a classifier than does the number of attributes.
[INSERT FIGURE 2- Balanced Sets with Varying Number of Attributes]
For classifiers trained using positively unbalanced sets, the average prediction accuracy was 92.28%. The average accuracy for sets containing fifty-five negative vectors was 97.41%, sets with forty-five negative vectors had an average accuracy of 97.78%, sets with thirty-five negative vectors were 97.04% accurate, sets with twenty-five negative vectors were 94.44% accurate, sets with fifteen negative vectors had average accuracy of 88.89%, and sets with five negative vectors were 78.15% accurate.
The percentage of positively unbalanced training sets that created classifiers with less than 100.0% prediction accuracy was 51.11%. The number of sets with between 50.00% and 59.99% accuracy was 6.11%, 2.22% of the sets were between 60.00% and 69.99% accurate, 3.89% of the sets had accuracy between 70.00% and 79.99%, 13.33% of the sets had an average accuracy between 80.00% and 89.99%, 25.56% of the sets had between 90.0% and 99.99% accuracy, and 48.89% of the sets were 100.0% accurate in their predictions. The lowest accuracy of any set was 50.00%.
Figure 3 shows the effect of varying the number of negative vectors in a positively unbalanced training set. The trendline shows the overall effect varying the number of negative vectors in a positively unbalanced training set has on the accuracy of the classifier. The trendline is a second-order polynomial regression. Clearly, as the number of negative vectors in a positively unbalanced training set increases, so does the accuracy of the resulting classifier.
[INSERT FIGURE 3- Positively Unbalanced Sets]
For classifiers trained using negatively unbalanced sets, the average prediction accuracy was 92.19%. The average accuracy for sets containing fifty-five positive vectors was 97.41%, sets with forty-five positive vectors had an average accuracy of 97.41%, sets with thirty-five positive vectors were 96.30% accurate, sets with twenty-five positive vectors were 95.83% accurate, sets with fifteen positive vectors had average accuracy of 91.57%, and sets with five negative vectors were 77.31% accurate.
The percentage of negatively unbalanced training sets that created classifiers with less than 100.0% prediction accuracy was 62.78%. The number of sets with between 50.00% and 59.99% accuracy was 5.56%, 3.33% of the sets were between 60.00% and 69.99% accurate, 3.33% of the sets had accuracy between 70.00% and 79.99%, 7.78% of the sets had an average accuracy between 80.00% and 89.99%, 42.78% of the sets had between 90.0% and 99.99% accuracy, and 37.22% of the sets were 100.0% accurate in their predictions. The lowest accuracy of any set was 50.00%.
Figure 4 shows the effect varying the number of positive vectors in a negatively unbalanced training set has on the accuracy of a classifier. The trendline shows the overall effect varying the number of positive vectors in a negatively unbalanced training set has on the accuracy of the classifier. The trendline is a second-order polynomial regression. As the number of positive vectors in a negatively unbalanced training set increases, so does the accuracy of the resulting classifier. The trendlines and average accuracies of both forms of unbalanced sets are nearly identical, despite the negatively unbalanced sets having a higher percentage of classifiers that did not achieve 100.0% accuracy. Clearly, the best training sets are well balanced.
[INSERT FIGURE 4- Negatively Unbalanced Sets]
Discussion & Future Work
There are several factors that could potentially affect the results of these experiments. The data sets used in the experiments had been optimized through a process of selecting the attributes best able to distinguish the classes from one another. The data was thus fairly clustered, making the classes easily distinguishable from one another. Also, the manner in which the data sets were created resulted in all training sets containing the same attributes and vectors as the next smaller training set, plus additional vectors and attributes. Thus, if the first attributes were more important for distinguishing between classes than the later attributes, or the first vectors more important than the later vectors, their presence would increase the accuracy of the classifier. The first vectors increasing the accuracy of the classifier are not likely to be too serious of a problem, since the vectors were randomly arranged when the training sets were created and the results for when only the number of vectors was varied were fairly uniform across the classes. Additionally, the unbalanced sets always contained fifty-five examples of one class, and a variable number of the other class. There were no unbalanced data sets of a relatively small size. Unbalanced sets also always contained thirty-four attributes. Thus, the accuracy of a support vector machine trained on an unbalanced set with a low number of attributes could not be observed. When the number of attributes used is lowered, the balance of a training set might more heavily influence the prediction method.
The most logical way to overcome the problem of optimized data is to create the data sets for the express purpose of experimenting with varying the number of attributes and vectors. The problem with this solution is that before a data set can be created, appropriate features must be selected, values measured, vectors created and randomly assigned to training or testing sets, and many other details attended to. While the eventual goal is to create support vector machines for use with the WHAT project, the purpose of this project was to determine how many features were needed to accurately distinguish between classes, how many instances of each class were needed to accurately train each classifier, and other practical issues. Once a general idea of the requirements of support vector machines is reached, data sets for use with WHAT can then be created. A more feasible solution would be to use data sets representing text, as web search results are represented primarily as text, and observe the requirements of support vector machines when classifying different types of text.
To ensure that vectors from the beginning of the training set are not consistently chosen as support vectors, thus increasing the accuracy of the classifier, vectors could be randomly removed from throughout the data set. The only requirement would be that the correct numbers of positive and negative vectors remain in the training set.
Randomly selecting attributes to be removed from a set would ensure that the most distinguishing features of the classes do not remain while the less distinguishing features are removed. The same attribute must removed from all of the vectors in a training set and its corresponding test set. However, the same attributes do not need to be removed from all of the data sets, as long as each set has the correct number of attributes. An additional difficulty would be that the attributes are numbered, so when attributes are removed from locations other than the ends of the vectors the following attributes would need to be renumbered.
Positively unbalanced sets with fewer than fifty-five positive vectors and negatively unbalanced sets with less than fifty-five negative vectors would be relatively simple to create. The easiest way would be to remove vectors of the predominant class from the ends of the training sets, much as vectors were removed to create the current unbalanced sets. However, this method would still be prone to the problems inherent in removing vectors from the ends of training sets. A better way would be to create new unbalanced sets with varying numbers of positive and negative vectors by randomly removing vectors from throughout the training sets. Thus, the accuracy of unbalanced sets of smaller sizes could be tested, and the results would be unaffected by the possibility of the first vectors being the most commonly chosen support vectors.
Unbalanced sets with varying numbers of attributes could also be created easily. The simplest way to create unbalanced training sets with fewer attributes would be to remove the attributes from the ends of the vectors, as was done with the balanced sets. However, this would again raise the possibility of the first attributes being the most useful in distinguishing between classes. The better way to create unbalanced sets with varying numbers of attributes would be to randomly choose attributes to remove from the classes, in the same manner as suggested for balanced sets. This would be more difficult than removing attributes from the end, but still feasible.
Acknowledgements
The author would like to thank Dr. Ursula Wolz for suggesting the topic of the project and providing continuing discussion and support. The author would also like to thank Greg Porreca for his discussion of support vector machines, expert assistance, and availability for consultation. Also, Aaron Archer Waterman provided the programming support that allowed the project to be completed. Dr. Robert F. Murphy provided the data sets used in this project. The software used was the default configuration of the Java version of LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), written by Chih-Chung Chang and Chih-Jen Lin.
References
[1] M. V. Boland and R. F. Murphy (2001). A Neural Network Classifier Capable of Recognizing the Patterns of all Major Subcellular Structures in Fluorescence Microscope Images of HeLa Cells. Bioinformatics 17:1213-1223.
Appendix
LIBSVM options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/k)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 40)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-wi weight: set the parameter C of class i to weight*C in C-SVC (default 1)
-v n: n-fold cross validation mode