One of the things I have most enjoyed studying in my research is the spread of malicious data (be it a program or a piece of intentional misinformation) through a network. One practical reason for this fascination is that it plays a key role in both offensive and defensive security. Offensively, understanding the key nodes in a network allows you to figure out a path inside it. As a defender, you can use it as another tool to help decide where to allocate your all-too-limited resources for maximum effect.
Initially, the idea came about as a way to estimate the resilience of a self-replicating botnet, but the concept can be applied to any type of network. For simplicity I performed several Monte Carlo simulations, using NetworkX to define and analyze the network graphs (which I will describe more below). Each test selected the 5 top-scoring nodes for a given centrality measure. For each of these nodes (called 'patient 0' nodes), 200 Monte Carlo simulations were run, each lasting 1000 steps (evolutions). At the end, the total number of infected systems was tallied as a scaled percentage of the total nodes in the graph.
Network Setup
I started with a simple graph description and gradually added to the depth of the simulation. The first iteration of the network definition was simply a multi-edge directed graph (a NetworkX MultiDiGraph). Each node represents a system and each edge represents a directional communication channel between two systems. To be more accurate, nodes form into LAN-like clusters, with only a few nodes crossing into other LAN spaces. These nodes that cross LAN boundaries are known as "routers". Several other variables were set up to allow the network to form a semi-random topology at the onset of each test (a sketch of the generation follows the list):
- min_clusters (5) - The minimum number of LAN-like clusters to generate.
- max_clusters (10) - The maximum number of LAN-like clusters to generate.
- min_in_cluster (100) - The minimum number of Nodes in each LAN-like cluster.
- max_in_cluster (200) - The maximum number of Nodes in each LAN-like cluster.
- min_routers (1) - The minimum number of Nodes in a LAN-like cluster which bridge to another cluster.
- max_routers (5) - The maximum number of Nodes in a LAN-like cluster which bridge to another cluster.
- min_router_edges (1) - The minimum number of edges connecting routers to other clusters.
- max_router_edges (3) - The maximum number of edges connecting routers to other clusters.
- min_lan_edges (1) - The minimum number of edges connecting nodes within the same LAN-like cluster.
- max_lan_edges (10) - The maximum number of edges connecting nodes within the same LAN-like cluster.
- cluster_prefixes - A set of differentiators between LAN-like clusters.
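To make the parameters concrete, here is a minimal sketch of how such a semi-random topology could be generated. The parameter names mirror the list above, but the helper itself is hypothetical and not the author's exact implementation:

```python
import random
import networkx as nx

def generate_network(min_clusters=5, max_clusters=10,
                     min_in_cluster=100, max_in_cluster=200,
                     min_routers=1, max_routers=5,
                     min_router_edges=1, max_router_edges=3,
                     min_lan_edges=1, max_lan_edges=10,
                     cluster_prefixes=("lan",)):
    G = nx.MultiDiGraph()
    clusters = []
    for c in range(random.randint(min_clusters, max_clusters)):
        prefix = random.choice(cluster_prefixes)
        nodes = [f"{prefix}{c}-{i}"
                 for i in range(random.randint(min_in_cluster, max_in_cluster))]
        G.add_nodes_from(nodes)
        # Wire up intra-cluster communication channels (rough sketch;
        # self-loops are possible and not filtered here).
        for n in nodes:
            for _ in range(random.randint(min_lan_edges, max_lan_edges)):
                G.add_edge(n, random.choice(nodes))
        clusters.append(nodes)
    # Promote a few nodes per cluster to routers that bridge other clusters.
    for i, nodes in enumerate(clusters):
        others = [n for j, c in enumerate(clusters) if j != i for n in c]
        for r in random.sample(nodes, random.randint(min_routers, max_routers)):
            for _ in range(random.randint(min_router_edges, max_router_edges)):
                G.add_edge(r, random.choice(others))
    return G
```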
The next step was to define other roles for the nodes in the network. For simplicity's sake I kept it to 3 roles (a sketch of assigning them follows the list):
- The previously described router role, which connects to another LAN-like cluster.
- The User Node type represents systems which are in use by an end user (say, for email or web browsing).
- The Server type represents machines that are more likely to be contacted by other machines.
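One way to tag these roles as node attributes is shown below. The `assign_roles` helper and the user/server split ratio are my own assumptions for illustration; routers are taken to be the bridge nodes chosen during generation:

```python
import random

def assign_roles(G, routers, server_fraction=0.2):
    """Tag each node with a 'role' attribute: router, server, or user."""
    for n in G.nodes():
        if n in routers:
            G.nodes[n]["role"] = "router"
        elif random.random() < server_fraction:
            # server_fraction is a guess, not a figure from the article.
            G.nodes[n]["role"] = "server"
        else:
            G.nodes[n]["role"] = "user"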
The first display of the graph was nice, but to view simulation results I thought it would be better to reposition the nodes so I could see them more clearly. First I removed the edges, then I used a spring layout with the distance parameter k=1 to calculate a more evenly spaced placement (a short sketch follows the figures below).
Edges removed. The graph is already looking better.
The final position of the nodes after applying the k parameter.
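The repositioning itself is only a couple of lines. This sketch assumes `G` is the previously generated graph; `spring_layout` and `draw_networkx_nodes` are real NetworkX functions:

```python
import matplotlib.pyplot as plt
import networkx as nx

pos = nx.spring_layout(G, k=1)            # k sets the optimal node spacing
nx.draw_networkx_nodes(G, pos, node_size=15)
plt.show()                                # edges intentionally left undrawn
```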
Communication Parameters
As mentioned above, after roles were defined, the next step was to probabilistically define the communication between each possible combination of node types (a sketch of the selection follows the list):
- User -> User - 10%
- User -> Server - 70%
- User -> Router - 20%
- Server -> User - 5%
- Server -> Server - 85%
- Server -> Router - 10%
- Router -> User - 10%
- Router -> Server - 70%
- Router -> Router - 20%
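One straightforward way to draw from this table is with `random.choices` (Python 3.6+), which samples from a weighted distribution. The `COMM_PROBS` table and `pick_target_type` helper are illustrative names, not the author's code:

```python
import random

COMM_PROBS = {
    "user":   {"user": 0.10, "server": 0.70, "router": 0.20},
    "server": {"user": 0.05, "server": 0.85, "router": 0.10},
    "router": {"user": 0.10, "server": 0.70, "router": 0.20},
}

def pick_target_type(source_role):
    """Probabilistically select the type of node a source will contact."""
    weights = COMM_PROBS[source_role]
    return random.choices(list(weights), weights=weights.values())[0]
```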
Infection Parameters
Now that I can generate test networks and probabilistically simulate communication among the different node types, it is time to discuss the infection parameters, which control the behavior of the infectious agent. This behavior can be modeled after a theoretical piece of malicious code (as I am doing here) or using the infection statistics of a live threat. In either case, what is being described is the probability of the infection spreading. To do this I defined several parameters (collected into a code sketch after the list):
- inbound_infection (15%) - The probability of infecting a node which has initiated a connection with an infected node.
- outbound_infection (20%) - The probability of infecting a clean node, once contacted by an infected node.
- inbound_detection (5%) - The probability of an infection being noticed as it spreads via an inbound connection. This is analogous to the clean node which initiated the contact becoming aware of the infection on the sending machine. Inbound detection cleans the original infected machine, but does not necessarily protect the initiator from infection.
- outbound_detection (10%) - This is similar to inbound detection, but for outbound infections. This is analogous to an operator recognizing that the machine they have contacted is infected.
- proxy_infection (0%) - A special case where the infected machine doesn't have to be the start or end of a communication chain to cause an infection. This would be analogous to an infected proxy which could inject into traffic streams.
- proxy_detection (0%) - As with the other detection probabilities, this is the chance that the infection will be noticed by one of the uninfected nodes.
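Collected into one place, so a simulation step can reference them, the knobs look something like this (a sketch; the dict name is mine):

```python
INFECTION_PARAMS = {
    "inbound_infection":  0.15,
    "outbound_infection": 0.20,
    "inbound_detection":  0.05,
    "outbound_detection": 0.10,
    "proxy_infection":    0.00,
    "proxy_detection":    0.00,
}
```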
A post-infection graph for the out-degree centrality.
An aside for Python
To keep things somewhat scientific, I added an ID to each network I generated and stored the graph as a file along with the recorded stats. Now, for the fun Python part. There are dozens of ways to generate a random string in Python, but there is one I want to call attention to, as I think it is a fairly elegant way which utilizes a list comprehension:

import string
from random import choice

the_id = "".join([choice(string.ascii_lowercase) for i in range(32)])
This one-liner generates a pseudo-random ID of 32 lowercase alphabetic characters. Replacing string.ascii_lowercase with one of the other character collections in the string library (for example, string.ascii_letters, which contains both upper- and lowercase letters) will change the character depth (in this case, there are 26 possible characters).
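One small caveat worth noting: `random.choice` is fine for labeling test runs like this, but if an ID ever needs to be unguessable, the standard library's `secrets` module (Python 3.6+) is a drop-in alternative:

```python
import string
from secrets import choice  # cryptographically strong choice()

the_id = "".join(choice(string.ascii_lowercase) for _ in range(32))
```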
Simulation
Let's not forget that the point of all these parameters is to simulate a wide variety of unknown possible interactions. Your network is different than my test network, which is different than the network at my office. In a lot of cases, these differences go beyond the network topology to the very reason the network exists in the first place. When simulating such an unknown set of outcomes, a popular choice is a Monte Carlo simulation. From the Wikipedia article on Monte Carlo methods:

"Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution."

It is that last case we are interested in here. If we consider the final state of each network graph as a member of a set U, then we are randomly generating m possible outcomes (a subset of U). There are two primary controlling parameters for each Monte Carlo simulation: the number of simulations to run and the number of steps in each simulation.

As an example, let's look at another popular use for Monte Carlo simulation: game theory, specifically chess. When one first starts to study computational game theory, they will face a problem wherein some function needs to compare the potential outcomes associated with each move and choose the one it scores highest. In a game like chess, the endgame is easier to simulate because there are often few moves and short chains of moves until the game is concluded. In some scenarios it is even possible to use exhaustive search to find optimal conclusions. At the beginning of the game, however, there are an impractical number of possible outcomes to simulate all the way to the end using exhaustive search. In this case a popular option is to heuristically select some candidate moves, run X number of Monte Carlo simulations from each position, and aggregate the overall scores for each potential move.

The number of steps is important to the realism of the simulation. Continuing the chess example: running a Monte Carlo simulation for 1 of the 20 possible first moves for 2 steps (first move by White followed by Black's response) leads to 400 possible board configurations. There are 5,362 possible positions after White's second ply move, or 8,902 total positions after three ply moves. There are 71,852 possible positions or 197,742 total positions after four ply moves, and 809,896 possible positions or 4,897,256 total positions after 5 ply moves. As you can see, the more steps in a simulation, the more possible paths there could have been that were not followed: you can view the simulation after 5 moves as 1 of the 4,897,256 possibilities. I think this slide describes it well (not sure where I found it, sorry).
The last point under disadvantages is interesting. In this case I could have a lot more observables for a few networks: rather than randomly choosing network behavior, I could use packet analysis to generate a much more comprehensive model of network communication. The limitation on observables is an attempt to (hopefully) help my observations generalize. In fact, I have left the option to remove the communication probabilities between node types so that the simulations can take on an even wider spread of the potential outcomes. We will look at the difference in spread between a network with and without these communication restrictions later.
As mentioned previously, I selected the 5 best nodes for each centrality measure. For each node in this set, a series of 100 simulations was run, each simulating 200 steps. At each step, a random set of between min_comm and max_comm nodes was selected as communication initiators. For each of these nodes, an endpoint is randomly selected. When simulating communication by node type, the type of the end node is probabilistically selected first, and the endpoint is then randomly selected from the set of nodes of that type. The endpoint must be reachable by some shortest path, but since there is no guarantee the generated graph is strongly connected, such an endpoint may not exist. I set a maximum of 50 tries, meaning I would try to randomly select a reachable endpoint 50 times. If after those tries no endpoint was located, the program looks at what shortest paths exist from the start node and randomly selects the end of one of these as the endpoint. This sounds more complicated than it works out to be in practice. Hopefully the flow chart below, and the sketch that follows it, will clarify things a bit.
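A minimal sketch of that endpoint selection logic, ignoring the type-weighted step for brevity (the `pick_endpoint` helper is my own naming, not the author's code):

```python
import random
import networkx as nx

def pick_endpoint(G, start, max_tries=50):
    """Pick a reachable endpoint: random tries first, then a fallback."""
    candidates = list(G.nodes())
    for _ in range(max_tries):
        end = random.choice(candidates)
        if end != start and nx.has_path(G, start, end):
            return end
    # Fallback: enumerate every node reachable from start via a shortest
    # path and pick one (assumes start can reach at least one other node).
    reachable = nx.shortest_path(G, source=start)  # dict: target -> path
    reachable.pop(start, None)
    return random.choice(list(reachable))
```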
The yellow box is another function which handles determining whether or not the infection is spread from one node to another during this communication. It also handles the check to see if the infection is noticed by the uninfected node. Each time a node becomes infected, its state is changed from uninfected to infected and the infection counter is incremented. Every time a node is cleaned of infection, the state is returned to uninfected and the cleaned counter is incremented. Both of these also have a complementary metric: we can tally how many times a node cleans an infected node, and how many times an infected node infects a cleaned node.
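A hypothetical version of that yellow-box function is sketched below, using the INFECTION_PARAMS dict from earlier. The ordering of the detection and infection rolls, the `stats` counter names, and the omission of the (currently 0%) proxy case are all my assumptions:

```python
import random

def handle_communication(G, src, dst, params, stats):
    """Roll for infection spread and detection during one communication."""
    src_inf = G.nodes[src].get("infected", False)
    dst_inf = G.nodes[dst].get("infected", False)
    if src_inf and not dst_inf:
        # Outbound case: an infected initiator contacts a clean endpoint.
        if random.random() < params["outbound_detection"]:
            G.nodes[src]["infected"] = False
            stats["cleaned"] += 1
        if random.random() < params["outbound_infection"]:
            G.nodes[dst]["infected"] = True
            stats["infected"] += 1
    elif dst_inf and not src_inf:
        # Inbound case: a clean initiator contacts an infected endpoint.
        # Detection cleans the endpoint but, per the parameters above,
        # does not necessarily protect the initiator from infection.
        if random.random() < params["inbound_detection"]:
            G.nodes[dst]["infected"] = False
            stats["cleaned"] += 1
        if random.random() < params["inbound_infection"]:
            G.nodes[src]["infected"] = True
            stats["infected"] += 1
```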
After all the steps are completed in this fashion, the statistics for that simulation are calculated. Early next year I plan to release the analysis of these statistics.
Styling the Graphs
If you choose to run your own simulations, you may want to change the style of the graphs. There are several ways you can achieve this. The first is choosing an appropriate layout and distancing parameter, as I did with the spring layout and k=1. The next is to use the options NetworkX exposes for its Matplotlib drawing functions:

nodes = nx.draw_networkx_nodes(G, curr_pos, ax=ax_u, node_size=15, node_color=cs)

In the above code, I call the draw_networkx_nodes function with the graph I want to draw from (G). I pass in the current X,Y positions for each node (returned from the call to nx.spring_layout(G) earlier in the code). I then pass in an MPL axes object; if I skipped this, NetworkX would create a new axes object for me. Next I pass in the node size (which scales the marker area, like the s parameter of Matplotlib's scatter). Finally, I pass in a list of colors the same length as G.nodes(). There are many other available options you can use to fine-tune the display:
https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx_nodes.html#networkx.drawing.nx_pylab.draw_networkx_nodes
The default shape is a circle, and the default color is red. Here is a link to a list of named colors you can use for both node and edge coloration.
https://matplotlib.org/2.0.0/examples/color/named_colors.html
fig = plt.figure()
ax = fig.add_axes([.1, .1, .8, .8])
ax.set_aspect('equal')

The above code creates an axes object with an equal aspect ratio. The ax object is suitable for use with the previously mentioned draw_networkx_nodes function, meaning you can combine the two methods seamlessly.
Parallelization
One major advantage of Monte Carlo simulations lies in the fact that they can be distributed to different processes to complete. In general, parallelization of a process is simply dividing the work to be done among many workers capable of working independently, but whose results can be combined into one overall answer. There are several possible work distribution schemes one could choose. One method is to give each worker a patient 0 as a start point. Each worker returns the summarized results for its respective start point, and the main application aggregates the totals to deliver the simulation scores. In this example that would lead to 5 processes (one for each patient 0) for each centrality measure checked.

Another possible division of labor is to give each worker a centrality measure to test. In my situation this has the advantage of only requiring 4 workers to be managed from the main application. How to divide things up will be a matter of needs, development taste, and resources available. A sketch of the second scheme follows.
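Here is a minimal sketch of the one-worker-per-centrality-measure split using the standard library. `run_simulations` is a hypothetical function standing in for the simulation batch described above; the sketch also assumes a fork-based start method so workers inherit the module-level graph `G`:

```python
from multiprocessing import Pool
import networkx as nx

MEASURES = {
    "degree":      nx.degree_centrality,
    "out_degree":  nx.out_degree_centrality,
    "betweenness": nx.betweenness_centrality,
    "closeness":   nx.closeness_centrality,
}

def worker(name):
    # Each worker scores the shared graph with its measure, picks its
    # 5 patient-0 nodes, and runs its batch of simulations.
    scores = MEASURES[name](G)
    patient_zeros = sorted(scores, key=scores.get, reverse=True)[:5]
    return name, run_simulations(G, patient_zeros)  # hypothetical helper

if __name__ == "__main__":
    with Pool(len(MEASURES)) as pool:
        results = dict(pool.map(worker, MEASURES))
```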
Multi-Tiered Distributed Processing of Network Graphs.
In fact, I chose to go with a combined setup. I have a cluster of 5 computers I can use to split the task up: 1 system acts as the primary controller, and the other 4 systems as workers. After generating the network graph, I send a copy to each worker system. Each worker specializes in a different centrality measure, so it selects its own patient 0 population. Then, within each worker, a new thread is created to manage the simulation for each patient 0. The worker aggregates its threads' results and sends the data back to the main processor, which in turn aggregates it with the other workers' responses.