plik


ÿþInternational Journal of Computational Intelligence Research. ISSN 0973-1873 Vol.2, No. 1 (2006), pp. 100-104 © Research India Publications http://www.ijcir.info Using Support Vector Machine to Detect Unknown Computer Viruses Bo-yun Zhang1,2, Jian-ping Yin1, Jin-bo Hao1, Ding-xing Zhang1 and Shu-lin Wang1 1 School of Computer Science, National University of Defense Technology, Changsha 410073, China hnjxzby@yahoo.com.cn 2 Department of Computer Science, Hunan Public Security College, Changsha 410138, China In the following sections, we first describe related research Abstract: A novel method based on support vector machine in the area of computer viruses detection. Then we illustrate (SVM) is proposed for detecting computer virus. By utilizing the architecture of our detect model in section III. Section IV SVM, the generalizing ability of virus detection system is still introduces the classification algorithm. Section V details the good even the sample dataset size is small. First, the research method of extraction feature from program, and stating the progress of computer virus detection is recalled and algorithm of detection engine work procedure. Section VI shows the SVM taxonomy is introduced. Then the model of a virus implementation and experiment results. We state our detection system based on SVM and virus detection engine are presented respectively. An experiment using system API conclusion in Section VII. function call trace is given to illustrate the performance of this model. Finally, comparison of detection ability between the II. Related work above detection method and other is given. It is found that the detection system based on SVM needs less priori knowledge So far, there have been few attempts to use machine learning than other methods and can shorten the training time under the and data mining for the purpose of identifying new or same detection performance condition. unknown malicious code. In an early attempt, Lo et al. [2] conducted an analysis of Keywords: computer virus, API function calls, Support Vector several programs evidently by hand and identified tell-tale machine, virus detection. signs, which subsequently be used to filter new programs. Researchers at IBM's T.J.Watson Research Center have I. Introduction investigated neural networks for virus detection [3] and have Excellent technology exists for detecting known malicious incorporated a similar approach for detecting boot-sector executables. Software for virus detection has been quite viruses into IBM's Anti-Virus software. successful, and programs such as McAfee Virus Scan and More recently, instead of focusing on boot-sector viruses, Norton AntiVirus are ubiquitous. These programs search Schultz et al. [4] used data mining methods, such as naïve executable code for known patterns, and this method is Bayes, to detect malicious code. The Bloodhound technology problematic. One shortcoming is that one must obtain a copy of Symantec Inc., uses heuristics for detecting malicious code of a malicious program before extracting the pattern [5]. Szappanos [6] used code normalization techniques to necessary for its detection. Obtaining copies of new or detect polymorphic viruses. Normalization techniques unknown malicious programs usually entail them infecting or remove junk code & white spaces, and comments in programs attacking a computer system. before they generate virus signature. In this paper, we propose a novel method to detect computer To improve the performance of the detector mentioned viruses through SVM [1]. Our efforts to address this problem above, lots of malicious and benign codes as training dataset have resulted in a fielded application, built using techniques should be collected. And they would consume lots of times from statistical pattern recognition and machine learning. The when training the classifiers. Aid to solve these problems, the Viruses Classification System currently detects unknown SVM was utilized in our experiments. malicious executables code without removing any obfuscation. To our knowledge, the experiment is the first III. Model Structure time to established methods based on SVM applying to detect unknown computer viruses. We first describe a general framework for detecting malicious Support Vector Machine to Detect Unknown Computer Viruses 101 executable code. Figure 1 illustrates the proposed in which separates the positive from the negative examples. architecture. The framework is divided into 4 parts: (1) Where w is normal to the hyperplane, | b | / || w || being the application server, (2) Virtual Computer, (3) detection server, perpendicular distance from the hyperplane to the origin, and (4) virus detection firewall based on character code || w || , the Euclidean norm of w , and wÅ" x , the dot product scanning. Once detected, the operating system can be between vector w and vector x in feature space. The designed to observe the behavior of the computer viruses. optimal hyperplane should be a function with maximal margin This would require the system to contain a virtual between the vectors of the two classes , which subject to the environment or machine. So the virtual operating system- constraint as : VMWare[7] was used in our experiments. The computer virus l maxWa) = - yi±j yjKxi,xj) ( ( would be executed in the virtual environment to learn its "± 1"±i 2 i=1 behavior. In this environment, the virus would not destroy the l real detection system. st. yi =0, ±i "[0,C], i =1,...,l. (2) . "±i Before a file save to the application server, it will be scanned i=1 by the virus detection firewall. If the file is infected with virus where ±i is Lagrange multipliers, K (xi , xj ) is kernel then quarantine it. Otherwise if there is no malicious function, C is a constant. information about the file, it will be replicated 2 copies. Then, For a test sample x" we could use decision function : l one copy will be sent to the application server, another one f (x) = sgn( yiK (xi , x) + b) (3) "±i will be sent to the detect server based on SVM detection i=1 engine after extracting feature in the virtual computer. At the to determine which class it is. following stage, the SVM detector check the copy again. If the file is infected with unknown viruses, the application V. Viruses Detection Engine server will be remind to remove the copy from its application database. And then quarantine it in a special database or sent A. Basic hypothesis it to an expert to analyze it by hand. The area of coverage is characterized by the assumptions mentioned below. Assumption 1. Computer virus interacts with Operating Applicaition System at runtime. Server User Most types of malicious activities, such as accessing HUB network or file services, involve interactions with the OS. By Ntw a e orks However, some malicious activities, such as denying service User or corrupting data, can be done without interactions with the Router OS, virus that limits itself to such activities will not be User detected by our system. Assumption 2 In interacting with the Operating System, Detect Virtual Viruses use the Win32 APIs. Computer Server Instead of using the Win 32 APIs, it is possible to interact Figure 1. Architecture with the OS through the Windows NT native API functions. Our detection system can be extended to cover this type of interaction. IV. Classification algorithm Assumption 3. Once a infected program begins to execute, Virus detecting can be look as a binary classification problem. the infection procedure ofvirus must immediately usurp Our detecting model is based on SVM, which is a kind of control of the computer, whereas the other procedure of the machine learning method based statistical learning theory. virus may not always run. The advantage of this method is that its general capability can Assumption 4. The Win32 APIs used by software execut- be improved by using structural risk minimization principle. ables can be effectively monitored at runtime. That is to say, even using limited training sets, we can get a relatively small error rate on independent testing sets. In B. Sample extracting addition, since SVM is a convex problem, the local optimum Our first intuition into the problem was to extract information found by this method is also global optimum. Here we mainly from the PE executables that would dictate its behavior. discuss the method of calculating decision hyperplane and Professor Forrest [8] had studied thoroughly about the role of sample classification. First we label the training system call sequence of Unix OS in intrusion detection. data (x1, y1),...,(xl , yl ) " Rk ×{±1} , i = 1,...,l , yi "{+1, -1} , According to their studies, we choose the Windows API function calls as the main feature in our experiments. To xi " Rk . Suppose we have some hyperplane extract resource information from Windows executables we wÅ" x + b = 0 (1) used GNU s Bin Utils [9]. Because more and more sophisticated computer virus, which using polymorphic and Firewall 102 Bo-yun Zhang et. al Figure 2. (a) An malicious code snippet (b) API call trace (c) When obfuscation techniques foil the commercial virus scanner. window size k=4,obtaion short API call traces s1=35 36 78 Sometimes one cannot extract API function calls only using 79,s2=36 78 79 92,s3=78 79 92 132,s4=79 92 132 50. the method above, so a tracing tool --APISPY.EXE was designed in our experiments. When using k length slide windows to scan system call history by the program that contain malicious code, we can 1) Size of window get a set of short sequences which including normal and By monitor the behavior of each sample in the Virtual abnormal system calls. Since few activities are illegal, Computer, we could trace the API function calls of them. abnormal short sequences is only a small part of the whole After extracting the system call sequence , index the API short sequences. We use the following procedure to judge function in system call mapping file, then each function has a whether API call sequence is legal or not. index value, for example  35 assign to function  openfile( ) , After got short sequence of malicious program, we compare show as Fig 2(b). We save the numerical sequence correspond it with records in sample database, if it matches with any item, to the api function trace into a data file and slide a window of it will be deleted. Otherwise, the Hamming distance between size k across the trace, recording each unique sequence of normal samples will be calculated. The similarity between length k that is encountered. For example, if k=4, one get the two sequences can be computed using a matching rule that unique sequences show detail as Fig 2(c). determines how the two sequences are compared. The Short sequence of system call represent the order of system matching rule used here is based on Hamming distance, i.e. call by executing process, then, which value is the best for the difference between two sequences i and j is indicated by short sequence length? Wenke Lee and Steven A. the Hamming distance d(i, j) between them. For each new Hofmey[10] found that one can not get useful message from sequence i, we determine the minimal Hamming distance system call sequence when window size larger than 30. If take conditional entropy and computational consumption into dmin (i) between i and the set of normal sequences: consideration, short sequence length should be 6 or 7. dmin (i) = min{d(i, j),"normal sequence } j The dmin (i) value represents the strength of the anomalous 2) Extracting abnormal samples signal, i.e. how much it deviates from a known pattern. Note We decide parameter value of SVM through training, so both that this measure is not dependent on trace length and is still normal short sequence and abnormal sequence samples are amenable to the use of thresholds for binary decision making. needed. System call short sequence sample can be gotten by If the rate of anomalous to normal sequences is RA , then the scanning the history of system call using k length slide window, these samples are saved in a sample database. average complexity of computing dmin (i) per sequence is Generally, the record in sample database is not larger N(k -1)RA + (k -1)(1- RA ) ,which is O(k(RAN +1)) , where than| " |k , where " being API call sets,| " | , API function k is the size of window, N is the number of normal sample in number and k , the size of slide window. dataset. For a mismatched sequence i, we set thresholds on the lea ebx , eax values, It will be regarded as anomalous any sequence i for index API Function m ov ecx , ebx which dmin (i) e" D , where 1 d" D d" k being the threshold call subA (ecx) 35 OpenFile( ) value. It is say if a sequence i of length k is sufficiently test ecx , ecx 36 ReadFile( ) jz short loc_A different from all normal sequences, it can be flagged as 78 RegOpenKey ( ) call subB (ebx) anomalous. So we obtain empirically the abnormal dataset. 79 RegSetValue( ) jm p loc_B 92 RegCloseKey ( ) L oc_A: m ov esi , var3 132 MessageBox ( ) L oc_B : call MessageBox C. Detection Method 50 Send( ) call Send (var1, esi ) The virus detecting method presented in this paper is a (b) supervising learning algorithm. Firstly, we marked the API subA proc near call sequence, then got the parameters value of SVM by varx dword ptr 4 s1 Call O penFile ( security .txt ) training. To Check a file, we trace the API call sequence, then 35 36 78 79 92 132 50 m ov ebx , eax use k length slide window to divide the sequence into several Call ReadFile (ebx, varx) s2 slide short sequences, these short sequences can be judged by SVM ret classifier, abnormal short sequence will be marked, finally, 35 36 78 79 92 132 50 subB proc near we can judge whether the file contains virus or not based on s3 Call RegOpenKey the output of SVM classifier. m ov ebx , eax 35 36 78 79 92 132 50 In reality, it is likely to be impossible to collect all normal Call RegSetV alue s4 add edi , eax variations in behavior, so we must face the possibility that our Call RegCloseKey 35 36 78 79 92 132 50 normal database will provide incomplete coverage of normal ret behavior. If the normal were incomplete, false positives could be the result. Moreover, the inaccuracy of the SVM itself also " c) (a) needs to set some judge rules to improve the performance of Support Vector Machine to Detect Unknown Computer Viruses 103 the detecting systems. So we judge whether a file contains than SVM algorithm. This shows that SVM algorithm is fit to virus based on the number of abnormal API call short detect computer viruses when the viruses sample gathered is sequence. If the number is larger than a predefined threshold, difficult. the file has been infected by virus, otherwise not. We decide Sample space Training set Testing set the threshold value by training. Benign file 423 50 373 Malicious file 209 50 159 VI. Experiment Results Sum 632 100 532 Table1. Sample data in Experiment. We estimate our results over data set in table 1. The malicious executables were downloaded from http://vx.netlux.org and Training data set Testing data set http://www.cs. Columbia. edu / ids/mef/, the clean programs Number of Number of normal Number of Number of abnormal were gathered from a freshly installed Windows 2000 server traces abnormal traces normal traces traces machine running MS Office 2000 and labeled by a 496 242 2766 876 commercial virus scanner with the correct class Table 2. Distribution of the traces database used in the experiment label(malicious or benign) for our method. The pretreating procedure of experiment data details as follow step: 2 C False Negative False Positive à (1) Tracing API call sequences from viruses set and benign k = 6 k = 7 k = 6 k = 7 set. An api call tracing tool is programmed in our experiment. 50 10 3.21% 4.02% 5.66% 7.54% It can hook all API function calls in Windows 2000 server 100 1 4.82% 5.63% 6.28% 5.66% platform. (2) Disparting each API call sequence into short 200 0.5 6.97% 7.50% 10.06% 11.32% trace by k  the size of sliding window. (3) Identifying the Table 3. Experimental result of detection system abnormal short traces in the short sequence set of viruses. (4) Choose parts of normal and abnormal short trace as training data to train SVM, and the other as testing set. At last, we get VII. Conclusion the distribution of the short traces database used in training and testing in our experiment, show in table 2. We presented a method for detecting previously undetectable During the experiment, we use the software LIBSVM [11]. computer viruses. As our knowledge, this is the first time that To evaluate our system we were interested in several using support vector machine algorithm to detect malicious quantities: (1). False Negative, the number of malicious codes. We showed this model s detect accuracy by comparing executable examples classified as benign; (2). False Positives, our results with other learning algorithms. Experiment result the number of benign programs classified as malicious shows that the present method could effectively use to executables. discriminate normal and abnormal API function call traces. There are some common kernels mentioned in SVM, we The detection performance of the model is still good even the must decide which one to try first. Then the penalty parameter virus sample dataset size is small. C and kernel parameters are chosen. After compare with other kernel, at last we choose Radial Basic Function: Acknowledgment || x - y ||2 K ( x, y ) = exp(-2 ) This work supported by the National Natural Science à Foundation of China under Grant No.60373023 , Hunan 2 There are two parameters while using RBF kernel: 1/à Provincial Natural Science Foundation of China under Grant 2 No.04JJ6032 and the Scientific Research Fund of Hunan and C. It is not known beforehand which C and 1/à are the Provincial Education Department of China under Grant best for one problem. Consequently some kind of parameter No.05B072. The authors would like to thank Dr. Hui Xia of search must be done. So we try some variable group value of 2 Hokkaido University for his helpful comments. (1/à ,C) to test the classification performance of SVM. In our experiments, there are 100 files in training dataset, References and 532 files in testing dataset. The dimension of feature vector is k , which is the length of short API traces. We set the [1] Vapnik,V.N.: Statistaical learning theory. Springer, New value of threshold D is 3, then we test the detection engine York(1998) when k=6, 7 respectively. And the detail experiment result [2] Lo,R., Levitt,K., Olsson,R.: MCF: A Malicious Code shows in table 3 . In another experiment [12], we had used a Filter. Computers & Security.14(1995)541-566 algorithm based on Fuzzy Pattern Recognition [3] Tesauro,G., Kephart,J., Sorkin,G.: Neural networks for algorithm(FPR) to classify the data set in table 1. That computer virus recognition.IEEE Expert. 8(1996)5-6 algorithm had the lowest false positive rate, 4.45%. The [4] Schultz,M., Eskin,E., Zadok,E., Stolfo,S.: Data mining present method has the lowest false positive rate, 3.21%, methods for detection of new malicious executables. In: which has better detection rates than the algorithm based on Needham,R., Abadi M, (eds):. Proceedings of the 2001 FPR. Notice that the detection rates of these two methods is IEEE Symposium on Security and Privacy. Oakland, nearly equal, but the FPR algorithm use more training samples CA: IEEE Computer Society Press (2001) 38-49 104 Bo-yun Zhang et. al [5] Symantec. http://www. symantec. com / avcenter / [12] Zhang,B., Yin,J., Hao,J.: Using Fuzzy Pattern reference/ heuristc.pdf. (Last accessed: Jun 1.2004) Recognition to Detect Unknown Malicious Executables [6] Szappanos,G.: Are There Any Polymorphic Macro Code. In: Wang,L.,Jin,Y.(eds.):Fuzzy Systems and Viruses at ALL (and What to Do with Them).in Knowledge Discovery. LNAI,Vol.3613. Springer- Proceedings of the 12th International Virus Bulletin Verlag, Berlin Heidelberg New York(2005) 629-634 Conference(2001) [7] VMware. http://www.vmware.com. (Last accessed: 10, Author Biographies Nov. 2003) [8] Forrest,S., Hofmeyr, S. A., Somayaji, A.: Computer Boyun Zhang born in YongZhou, Hunan, China on 1972, immunology. Communications of the ACM. 10 (1997) received his B.S. degrees in physics education from Xiangtan 88 96 Normal University , Xiangtan, China, in 1994 and M.S. [9] Cygnus. http://sourceware. cygnus.com / cygwin (Last degree in applied computer science from Hunan University, accessed: 20, Dec. 2004) Changsha, China in 2002. He is currently pursuing his PhD [10] Lee,W., Dong,X.: Information-Theoretic measures for degree at School of Computer Science, National University of anomaly detection. In: Needham,R., Abadi M, (eds):. Defense Technology, Changsha, China. His current research Proceedings of the 2001 IEEE Symposium on Security interests include network security and machine learning. and Privacy. Oakland, CA: IEEE Computer Society Press (2001)130-143. [11] LIBSVM. http://www.csie.ntu.edu.tw/~cjlin/. (Last accessed: 18,Nov. 2004)

Wyszukiwarka

Podobne podstrony:
Universal Procedures to Protect Against Computer Viruses
Using Predators to Combat Worms and Viruses A Simulation Based Study
2006 07 in and Out Using Rcs Version Control to Manage Simple Scripts
OAEB Staining to Detect Apoptosis
ASP NET Module 3 Using Microsoft ADO NET to Access Data
Fitelson etal How not to detect design
Are computer viruses spread by the media
Infection, imitation and a hierarchy of computer viruses
Computer viruses demystified,2
Web Sites Hawk Instructions On Making Computer Viruses
Stochastic Features of Computer Viruses
The Case for Beneficial Computer Viruses and Worms
Measuring virtual machine detection in malware using DSD tracer
Using Verification Technology to Specify and Detect Malware
Using Markov Chains to Filter Machine morphed Variants of Malicious Programs

więcej podobnych podstron