INTRODUCTION:
   
Networks often fall prey to intrusion attempts by attackers. There are many ways these attackers may try to enter the system. They might:
          - break into a legitimate user's account and conduct their activities from there
          - try to disguise their activities as legitimate flows and evade network monitors
          - break into a system and create a 'backdoor' for easier access in the future
          - use non-standard ports in the hopes of breaking in undetected, or use standard ports for non-standard uses.
    Intrusion detection systems (IDS) are commonly classified according to either their data sources or their models of intrusion. When classified according to which intrusion model they follow, IDS are grouped as either anomaly detection or misuse detection systems. Anomaly detection systems analyze system information and use statistical techniques to search for behavior that is outside a certain "normal" range. Such systems must first compile data from normal usage and then compare new behavior patterns to this set of normal patterns. Misuse detection systems, on the other hand, use pattern matching techniques to search for behavior that matches a known pattern of intrusion activity. This is similar to virus detection software that searches a system for signatures of known viruses; like virus detection software, misuse detection systems need to be updated periodically with the signatures of new attacks as they arise.
    My project uses a mix of both approaches to detect intrusive behavior. Specifically, we would like to be able to detect the use of non-standard ports or the use of standard ports in non-standard ways. By analyzing distinguishing characteristics of network flows, such as packet size and interpacket time delay, we can construct profiles of what normal flows (tcp, ftp, icmp, etc) look like. Subsequent flows can then be analyzed according to the same criteria and classified against the known flows. If an unknown or "mystery" flow's characteristics do not match any known flows' characteristics, then the mystery flow can be flagged for further analysis. This would be an example of anomaly detection. If our database contains characteristics for flows known to be intrusive and a mystery flow matches one of these known flows, then we have a case of anomaly detection.
    The code presented here represents the basic pieces needed to build a complete, real-time intrusion detection system. There are four main programs: flowbin.c, pca.c, density.c, and classify.c. In the following sections, a brief background of intrusion detection will be presented, and the input, output, and function of each program will be discussed. I will conclude by presenting our results and commenting on sources of error and areas needing improvement.
INTRUSION DETECTION: PREVIOUS WORK AND BACKGROUND
    Amoroso defines intrusion detection to be "the process of identifying and responding to malicious activity targeted at computing and networking resources." In the past, intrusion detection systems that examine criteria very similar to ours have been developed. For example, Paxson and Zhang in their paper "Detecting Backdoors" examine packet size, directionality, and timing. However, their system searches specifically for interactive sessions. They developed a general purpose algorithm which classifies connections as interactive by testing packet size, timing, and directionality against preset criteria. They also developed a set of special purpose algorithms for identifying specific interactive protocols by filtering incoming packets based on signatures and patterns specific to those protocols. Our classifier also examines packet size, timing, and directionality, but we use our data to create profiles against which mystery flows are compared. In addition, our data is subjected to principal component analysis in order to determine which variables require further analysis.
    This project was originally started by Dr. Tom Dunigan and George Ostrouchov at ORNL. The algorthims and software were developed using the graphing language S-Plus. For my project I ported the code to C and attempted to make the classifier faster and more manageable. The ultimate goal is to create a classifier fast enough and accurate enough to classify flows in real time.
FLOW CAPTURE, ANALYSIS, AND CLASSIFICATION
    Flows are analyzed and certain characteristics of each packet, including packet size, packet interarrival time, flow direction (i.e. from the low numbered to high numbered port, or vice versa) and flow number, are recorded. The data used typically represents a three hour sample of weekday traffic (8:30 am - 11:30 am). The data was collected and filed separately for each flowtype. For example, for an ftp flow (flowtype 21), a file might be generated containing 350 different flows (or sessions), and each of these flows can have any number of packets with whch they are associated. Two data sets from separate days were examined. Data set 1 contained data for flows ftp-data, ftp, ssh, telnet, email, rlogin, irc, and icmp. Data set 2 contained data for flows ftp, ssh, telnet, rlogin, irc, and icmp.
Flow Analysis: flowbin.c
    Each flow file is analyzed. A flow file for a specific port can contain as few as ten to as many as several thousand flows. The packets in each flow are binned, and extremely short flows - for example, less than 10 packets - are discarded. The packets in each flow are then graphed according to three characteristics: packet size, packet interarrival time, and flow direction.
    The first dimension is packet size. Packet size generally ranges from 39 to 1500 bytes. Packets below a preset minimum are dropped, and packets above 1500 bytes are counted but flagged. Flowbin.c takes as input a size file that defines the ranges for packet bins. A sample size file might look like the following:
        39
        40
        41
        100
        575
        576
        1000
        1499
        1500
    This would result in the creation of ten bins. The first bin would contain packets of size 39, the second would contain packets of size 40, the third, packets of size 41, the fourth, packets of sizes 42 to 100, and so forth. The tenth bin would be reserved for packets exceeding 1500 bytes. Packets sizes that are seen frequently or have special significance are binned separately.
    The second dimension is packet interarrival time. Again, an input file provides the time ranges for each bin. A time file such as the following:
        -6
        -4
        -1.5
        0
        1
        2.5
        3
would result in eight bins. Like the size bins, the eighth bin would be reserved for time values larger than the maximum. The values in the time file are not the actual times in seconds, but are the logs of the times. Since the interarrival times are recorded to only six decimal places, the smallest possible time value is 0.000001, and thus, the smallest possible log value is -6. If a log value larger than 3 is found, this means an interarrival time of greater than 10,000 seconds was discovered. Since a packet is considered to be the start of a new flow if more than ten minutes has elapsed, flowbin.c should never encounter packets with such a large log time value. However, if it does, the packets are binned and flagged.
    The third dimension is flow direction. Consider a flow where packets are being sent between hosts A and B. There are four possible directions. The first is the case where a packet from host A follows a packet from the same host. The second would be a packet from host A following a packet from host B. The next case would be a packet from host B that follows a packet from host A. The last case would be a packet from host B that immediately follows a packet from the same host. To determine directionality, each packet is evaluated based on its direction and the direction of the packet preceeding it. Directionality is an important factor to examine, because it can be an indicator of the interactivity of a flow. For example, an ftp flow might involve downloading files, and therefore we would see a large number of packets coming from the same host, one after another. On the other hand, an irc or "chat" flow would involve packets going back and forth from one host to the other and would bin very differently.
    The output of flowbin.c is a 2-D file of flow numbers and bin values. Each flow has a set of size-time-direction bins associated with it. The value in the bins represents the fraction of total packets in that flow that fell into each bin. Files for row and column labels are also generated.
Prinicipal Component Analysis of Flow Characteristics: pca.c
    The code for pca.c was mainly written by F. Murtagh and was modified to work efficiently with our data. It uses lapack and blas routines to perform principal component analysis of the variables in each flowtype.
    Input to pca.c consists of a file containing bin values for all flows of each flowtype. For example, consider an ftp flow containing 350 flows. By writing out the size-time-direction graph in a linear manner, each flow would have 320 bin values (10 sizes x 8 times x 4 directions). These bins can be thought of as variables since they are identical for each flowtype.
    A file consisting of flowbin variables for many flowtypes (for example, six types: ports 21, 22, 23, 513, 6667, icmp) is generated and evaluated by pca.c. Prinicipal component analysis is used to find the three variables that show the greatest variation among all flowtypes. For every flow of each flowtype, a value is assigned for each of these three variables. The resulting output file is then evaluated in the next step, density.c.
Density Analysis of Prinicipal Components: density.c
    Each flow's 320 variables have now been reduced to the three most important variables using pca.c. Next, the flowtypes are analyzed one at a time using density.c. Density.c looks at the three variables as the dimensions of a 3-D graph and uses the given values as x,y,z coordinates to plot the location of each flow. For our earlier example of an ftp flow, the input to density.c would be a file of 350 lines (for the 350 flows) and each line would have 3 values for the x,y,z coordinates. Density.c would generate a 3-D graph containing 350 points. This graph could then be viewed at as a profile representing a typical ftp flow.
    At this time, we divide each dimension of the graph into 50 bins based on the range of data values. Since we are looking at 3 dimensions, density.c produces a graph with 125,000 bins. The graph is output in a linear manner. Since most of the values of the 125,000 bins will be zero, instead of writing all the values, it is much more time and space efficient to instead write out the non-zero values and their bin locations. A second file containing information such as total number of flows, total number of dimensions, and the minimum value in each dimension is also generated.
    By running density.c on each flowtype, we can generate profiles that represent what each flowtype should look like. In the next step, classify.c, we will take mystery flows and compare them, value by value, to the profiles of known flows.
    Note that this step is only needed when creating profiles for each flow. When a mystery flow is ready to be evaluated, only flowbin, pca, and classify need to be run.
Classification of Unknown Flows: classify.c
    Classify.c takes as input a pca-outputed file for a mystery flow and a list of files containing profiles of known flows. Classify takes the mystery flow data and compares the values against each of the known flow profiles. If the mystery flow and a known flow both have values in the same bin position, a counter is incremented. The output for classify.c is a file indicating what fraction of the mystery flow's packets ended up matching each known profile.
RESULTS
    The run-times for flowbin.c, density.c, and classify.c were extremely low, with most times ranging from less than one second to less than one minute. The step that took the longest was pca.c. This is understandable, considering the amount of computation required for principal component analysis of such high dimensional flows.
    The classification error rates were quite high when flows from two different data samples were compared. The reasons for this poor performance are discussed below.
flowbin.c
    Most run times were quite good, but this can be attributed to the fact that only a fraction of each input file was considered "good" - i.e. containing flows of 10 or more packets. Flow 21, at 39 MB, took more than twice as long as flow 22, which was larger at 44 MB. The reason for this is 98% of the data in flow 21 was usable, as compared to only 48% in flow 22.
| irc | rlogin | icmp | telnet | ftp | ssh | |
| Size (KB) | 34 | 411 | 17,869 | 24,643 | 38,915 | 44,004 |
| % Good Flows | 0.5% | 40% | 3.1% | 36.2% | 98.3% | 48.3% |
| Time (sec) | 0.06 | 0.31 | 21.69 | 17.13 | 65.17 | 28.62 |
    This program took the longest to execute, averaging 22 minutes when run on files of size 160 MB. Files of this size represented 6 different flow types .
density.c
    Times for density.c were excellent, with most times under one second. Note that this step is used only when creating profiles.
| rlogin | irc | ssh | telnet | ftp_data | ftp | icmp | ||
| Size (Kb) | 1.4 | 2 | 5 | 10 | 18 | 26 | 84 | 845 |
| Time (sec) | 0.01 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.1 | 0.7 |
    Figure 3 shows the analysis times for the classification of flows against various numbers of profiles. The classification time increased at a constant rate with increasing numbers of profiles. The analysis time doubled when the flow size increased from 100 KB to 1000 KB. The fast analysis times were obtained by reducing the input file size as much as possible. Namely, instead of writing the entire 125,000 bin profile, only non-zero values were included in the input file. Classification time increases at a constant rate as the number of flows to compare against increases. The time doubles as the mystery file size goes from the 100 KB range to the 1000 KB range.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| rlogin (1 KB) | 3.0 | 6.0 | 10.0 | 13.0 | 16.0 | 20.0 | 23.0 | 25.0 |
| ssh (5 KB) | 3.0 | 7.0 | 10.0 | 13.0 | 17.0 | 20.0 | 23.0 | 27.0 |
| ftp_data (18 KB) | 3.0 | 7.0 | 10.0 | 13.0 | 17.0 | 20.0 | 23.0 | 27.0 |
| ftp (26 KB) | 3.0 | 7.0 | 10.0 | 13.0 | 17.0 | 20.0 | 24.0 | 27.0 |
| icmp (84 KB) | 4.0 | 8.0 | 11.0 | 14.0 | 18.0 | 21.0 | 25.0 | 29.0 |
| email (845 KB) | 8.0 | 15.0 | 21.0 | 27.0 | 33.0 | 40.0 | 46.0 | 54.0 |
    While classify.c had very good run times, its performance as a classifier was poor. Two separate data sets were examined. Both were run through density.c, creating two different sets of profiles. When the flows in set 1 were compared against the profiles of set 1, excellent classification rates were obtained. The same held true for the classification of set 2 flows against set 2 profiles. However, when the two data sets were compared against each other, no flows were classified properly. In fact, most flows were misclassified. Figures 4 and 5 below show the classification of data set 1 against profile 1 and data set 2 against profile 2. The classification rate is quite good; looking at the diagonal of each figure, we can see that 89-100% of the flows in each flowtype matched the correct profile. For example, ftp flows match the ftp profile, telnet flows match the telnet profiles, and so forth. This is to be expected since we are classifying the data using their very own profiles.
| ftp_data | ftp | ssh | telnet | rlogin | irc | icmp | Total Flows | |||
| ftp_data | 0.99 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 322 | |
| ftp | 0.00 | 0.99 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 449 | |
| ssh | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 85 | |
| telnet | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 182 | |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 0.00 | 0.00 | 0.01 | 14376 | ||
| rlogin | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 24 | |
| irc | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 35 | |
| icmp | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1476 |
| ftp | ssh | telnet | rlogin | irc | icmp | Total Flows | ||
| ftp | 0.98 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 46200 | |
| ssh | 0.00 | 0.95 | 0.02 | 0.00 | 0.00 | 0.02 | 877 | |
| telnet | 0.00 | 0.06 | 0.89 | 0.00 | 0.00 | 0.00 | 1950 | |
| rlogin | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 40 | |
| irc | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 5 | |
| icmp | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 5711 |
    The next two figures show the classification of the two different data sets against one another. As shown by the diagonal of each figure, the classification rates are quite low. In fact, the highest classification rates are often in the wrong profile. For example, in figure 6, most flows are misclassified as being ftp flows, and in figure 7, most flows are misclassified as being email flows. The reason figure 6 has most flows being classified as ftp is because out of all of the set 2 profiles, the ftp profile has the largest number of points (46,200), so flows compared against it will be much more likely to have matching points. The same can be said for figure 7, where the largest profile is the email profile, which contains 14,376 points.
| ftp | ssh | telnet | rlogin | irc | icmp | Total Flows | ||
| ftp_data | 0.08 | 0.22 | 0.07 | 0.00 | 0.00 | 0.02 | 322 | |
| ftp | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 449 | |
| ssh | 0.11 | 0.02 | 0.04 | 0.01 | 0.00 | 0.02 | 85 | |
| telnet | 0.04 | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 | 182 | |
| 0.33 | 0.02 | 0.03 | 0.00 | 0.00 | 0.09 | 14376 | ||
| rlogin | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 24 | |
| irc | 0.03 | 0.09 | 0.06 | 0.00 | 0.00 | 0.11 | 35 | |
| icmp | 0.00 | 0.05 | 0.04 | 0.00 | 0.00 | 0.00 | 1476 |
| ftp_data | ftp | ssh | telnet | rlogin | irc | icmp | Total Flows | |||
| ftp | 0.00 | 0.00 | 0.00 | 0.00 | 0.62 | 0.00 | 0.00 | 0.00 | 46200 | |
| ssh | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.02 | 877 | |
| telnet | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.01 | 1950 | |
| rlogin | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.12 | 40 | |
| irc | 0.00 | 0.00 | 0.00 | 0.00 | 0.40 | 0.00 | 0.00 | 0.00 | 5 | |
| icmp | 0.00 | 0.00 | 0.01 | 0.00 | 0.19 | 0.00 | 0.00 | 1.00 | 5711 |
DISCUSSION AND FUTURE WORK
Poor Performace Due to Binary Binning
    When flow profiles are created, values are assigned to each bin in whic a flow point has fallen. Thus, a flow file with many flows will have a larger profile than one with fewer flows. As the mystery flow is compared to each profile, flow points are considered to match only if they are in the same bins. In other words, if two ftp flows have points that are close together but not in the same bins, those points will not be considered a match. When a flow is compared to a profile generated from a different data set, there is very little probability that the points will match exactly. In addition, large profiles will "swallow" any mystery flowpoints that fall in the same bins. Instead of binary binning, a system of weighted bins should be used to ensure that points that are close to the profile are considered.
Size and Time Optimization
    An important step in the classification process that needs to be addressed is the optimization of the size and time boundaries. The best size and time boundaries are those where all flow types showed the most difference from each other. Currently, the size and time boundaries are chosen by inspection. In the future, all possible combinations of boundaries should be tested to find the fastest and most effective set.
The Classifier is Content-free
    The algorithms used are not content specific. Because the packet length, interarrival time, and directionality are sufficient to classify a flow, the classifier does not need to waste time and resources analyzing the content of each packet. All the necessary information can be obtained quickly from the packet header. In addition, by avoiding content analysis, the classifier need not be concerned with encrypted flows or words that have been aliased to avoid detection.
Adaptivity
    The classifier is adaptive, because it can be retrained on additional data sets as they become available. Currently, retraining must take place by combining the new data sets with the old data sets and retraining from the start. There is a distinct disadvantage to this: namely, the amount of space used by saving all of the old data sets could prove to be enormous. In addition, retraining will take longer as the number of data sets grows larger. However, if one keeps all of the data sets it will be easy to choose which sets to use. Perhaps certain combinations will give better results. The obvious goal is to have a file of training results that can be easily updated by averaging the data from a new flow and all of the old flows, and not have to keep all old data sets.
Increased Number of Variables
    Currently, only three variables are examined - the three variables being those chosen using principal component analysis. The code has been written to handle any number of variables but has not been tested for such cases.
ACKNOWLEDGEMENTS
    I would like to thank Dr. Dunigan for his guidance and help throughout this project, and I would like to thank George Ostrouchov for his advice, feedback, and patience.
REFERENCES
(1) Abdulrahman, Salma. Network Intrusion Detection Systems. May, 2000.
(2) Amoroso, Edward. Intrusion Detection: An Introduction to Internet Surveillance, Correlation, Trace, Back, and Response. Intrusion.Net Books, 1999.
(3) Dunigan, Tom. Flow Characterization. 9/20/00
(4) Paxson, V. and Y. Zhang. Detecting Backdoors. Proc. 9th USENIX Security Symposium, August 2000.