Malware can be thought of as analogous to biological viruses, which are composed of large numbers of polypeptide proteins.1 The shape and function of each protein strand determine the functionality of that segment, much as a subroutine does in malware. The full combination of subroutines represents the malware organism, so this cyber organism can be considered a collection of polypeptides forming information-bearing (subroutine) protein structures. We propose to apply bioinformatics methods to analyze malware in binary data streams in real time, without knowing where in the data flow the worm or virus code starts or ends. We will explore whether laboratory (wet-lab bench) techniques can provide analogous analysis features for in vivo use, that is, on OC-48 (an optical-carrier transmission rate of up to 2488.32 Mbit/s) or faster data streams. Bioinformatics amino-acid sequencing methods offer a framework within which to tackle cyber virus organisms.
Current methods for malware detection and identification use static and dynamic features such as the name and size of the binary section, hash values, hostname, uniform resource locator (URL), registry and file access, and process injections. Many of these cannot easily be measured in real time because they require the entire data segment to be buffered and stored for post-processing and extraction. They are also not attributes unique to an individual piece of malware. As a result, the information fusion and decision architecture needed to discern viruses are both complicated and brittle. We propose instead to use entropy-characterizing features (such as the standard deviation, skewness, kurtosis, 6th- and 8th-order moments, and two fractal metrics) that exploit the fractal and statistically self-similar information structures within malware agents. Under the right conditions, in situ discrimination and classification are possible if each feature is calculated in real time from the streaming data.2,3
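As a rough illustration of the statistical features named above, the following minimal sketch computes the standard deviation, skewness, kurtosis, and standardized 6th- and 8th-order moments over one window of samples. The function name `moment_features`, the use of a fixed window, and the choice of standardized (rather than raw) moments are assumptions made for this example. The two fractal metrics are not specified in the text and are omitted.

```python
import math


def moment_features(window):
    """Return moment-based features for one window of numeric samples.

    Hypothetical helper: the article does not define its exact feature
    formulas, so standardized central moments are assumed here.
    """
    n = len(window)
    if n == 0:
        raise ValueError("window must not be empty")

    mean = sum(window) / n
    centered = [x - mean for x in window]
    variance = sum(c * c for c in centered) / n  # population variance
    std = math.sqrt(variance)

    def standardized(order):
        # k-th central moment divided by std**k; defined as 0 for a flat window.
        if std == 0.0:
            return 0.0
        return sum(c ** order for c in centered) / n / std ** order

    return {
        "std": std,
        "skewness": standardized(3),
        "kurtosis": standardized(4),
        "moment6": standardized(6),
        "moment8": standardized(8),
    }
```

For true streaming use, the same quantities could be maintained from running power sums instead of being recomputed for each window. That variant is left out here to keep the sketch short.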
By thinking of malware as an executable composed of binary 0s and 1s, we can decompose the malware organism into a 1D strand of such primary sequences through cleaving.4,5 We convert the 1D binary sequence into fractal form by changing the 0s into −1s and integrating by direct running summation. The result is a 2D conformation of the malware sequence, an analog of protein secondary structure (see Figure 1).
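The transformation just described (0s mapped to −1s, followed by a running sum) can be sketched in a few lines. The function names and the most-significant-bit-first ordering are assumptions for this example, since the text does not say how bits are taken from each byte.

```python
from itertools import accumulate


def bytes_to_bits(data):
    """Yield the bits of each byte, most significant bit first (assumed order)."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def fractal_walk(data):
    """Map bits 0/1 to steps -1/+1 and integrate with a running sum."""
    steps = (1 if bit else -1 for bit in bytes_to_bits(data))
    return list(accumulate(steps))
```

For example, `fractal_walk(b"\xf0")` climbs for four steps and then descends back to zero, giving `[1, 2, 3, 4, 3, 2, 1, 0]`.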
Figure 1. Fractal (1/f, where f is the fractal dimension) representation of random 0/1 sequence (top left), Slammer packet (top center), and Red Probe (top right) protein analogs. Ligated partitions of random sequence (middle left), Slammer packet (middle center), and Red Probe (middle right). 2D sodium dodecyl sulfate polyacrylamide gel electrophoresis for the random (bottom left), Slammer (bottom center), and Red Probe (bottom right).
These fractal-transformed data are then processed in a sliding, expanding window and cleaved into individual analogs of amino-acid residues. For each window we calculate a coefficient that indicates when the data structure within the expanding window is stationary or most closely resembles a random set. At that point, the data are cleaved from the stream as a trimmed amino-acid residue, isolated from the remaining strand (see Figure 1). Each residue is characterized using descriptive features such as window size and fractal integral (analogous to the isoelectric point, pI). Repeating this along the strand is analogous to performing Edman degradation to determine all residues present in the polypeptide-analog fractal function.
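The text does not give the cleaving coefficient, so the sketch below substitutes a simple stand-in: the net drift of the walk across the window, divided by the square root of the window length, which stays small when the window behaves like an unbiased random walk. The function name `cleave`, the parameters `min_size`, `max_size`, and `threshold`, and the drift coefficient itself are all hypothetical choices for illustration, not the method used in the article.

```python
import math


def cleave(walk, min_size=8, max_size=64, threshold=0.25):
    """Cut a running-sum walk into residue analogs with an expanding window.

    Assumed rule (not from the article): grow each window from min_size
    until |net drift| / sqrt(size) <= threshold, or until max_size or the
    end of the data is reached.
    """
    residues = []
    start = 0
    total = len(walk)
    while start < total:
        # Level of the walk just before this window; residues are re-based to it.
        base = walk[start - 1] if start > 0 else 0
        end = min(start + min_size, total)
        while end < total and end - start < max_size:
            size = end - start
            drift = abs(walk[end - 1] - base) / math.sqrt(size)
            if drift <= threshold:
                break  # window looks stationary enough: cleave here
            end += 1
        segment = [value - base for value in walk[start:end]]
        residues.append({
            "start": start,
            "size": len(segment),      # window-size feature
            "integral": sum(segment),  # fractal-integral (pI analog) feature
        })
        start = end
    return residues
```

Each returned entry carries the two descriptive features mentioned above, window size and fractal integral. A different stationarity coefficient can be swapped in by replacing the `drift` computation.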
Critical discriminating features are isolated using a mathematical analog of the 2D sodium dodecyl sulfate polyacrylamide gel electrophoresis (2D SDS-PAGE) process,6 which is used for virus purification and fractionation. Our version relies purely on entropy-based chromatography analogs. Figure 1 shows 2D SDS-PAGE results for the random, Slammer, and Red Probe protein analogs. To isolate marker proteins useful for discrimination, we use a method analogous to western blot and derive an antibody that recognizes the presence of the virus analogs in situ. We employ a data-model polynomial as a high-affinity-recognizing antibody, using the molecular weight and pI analog features as inputs. Like its biological counterpart, this classifier equation responds strongly to exact structural matches and has no affinity for other, random structures.
Finally, the residues from each known virus type can be converted into analytical equations (data models) and used as basis functions for matched filtering or analytical wavelet analysis in real time against live-streaming binary data. Similarity in scale space becomes an indicator of the possible presence of malware in the message flow.1 We will next use analytical, fractal-based data-modeling methods to derive anticipated binary sequences for (unknown-unknown) model-free malware detection.
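To make the matched-filtering idea concrete, the sketch below slides a known template over a longer sequence and scores each offset with a normalized cross-correlation. It works at a single scale only and does not attempt the wavelet or scale-space analysis mentioned above. The function names and the use of plain Pearson-style normalization are assumptions for this example.

```python
import math


def normalized_correlation(stream, template):
    """Return the correlation score of the template at every offset in stream."""
    m = len(template)
    if m == 0 or len(stream) < m:
        return []
    t_mean = sum(template) / m
    t_centered = [t - t_mean for t in template]
    t_norm = math.sqrt(sum(t * t for t in t_centered))

    scores = []
    for offset in range(len(stream) - m + 1):
        window = stream[offset:offset + m]
        w_mean = sum(window) / m
        w_centered = [w - w_mean for w in window]
        w_norm = math.sqrt(sum(w * w for w in w_centered))
        if t_norm == 0.0 or w_norm == 0.0:
            scores.append(0.0)  # a flat window or template carries no shape
        else:
            dot = sum(a * b for a, b in zip(w_centered, t_centered))
            scores.append(dot / (w_norm * t_norm))
    return scores


def best_match(stream, template):
    """Return (offset, score) of the strongest match, or None if no offsets."""
    scores = normalized_correlation(stream, template)
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

A score close to 1 at some offset indicates that the stream contains a segment with the same shape as the template. In practice the decision threshold would have to be tuned against known benign traffic.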
Licht Strahl Engineering Inc.
Holger Jaenisch earned his PhD in arts and sciences from Columbia Pacific University in 1990 for work developing a formal laser analogy using video feedback. He performed postdoctoral research at the University of Alabama's Huntsville Center for Applied Optics in CO2 laser modification for NASA's Laser Atmospheric Wind Sounder program. In 2009 he completed a DSc in astronomy at James Cook University and is currently pursuing an MSc in bioinformatics at Johns Hopkins University. His research interests include inverse data modeling, data fusion, and DNA/molecular computing. He owns Licht Strahl Engineering (LSEI consulting). He is a SPIE Life and Senior Member, has published over 60 papers, and holds laser- and fractal-related patents.