Thursday, June 21, 2018

Simulating Network Infections with Python and NetworkX


One of the things I have most enjoyed studying in my research is the spread of malicious data (be it a program or a piece of intentional misinformation) through a network. One practical reason for this fascination is that it plays a key role in both offensive and defensive security. Offensively, understanding the key nodes in a network allows you to figure out a path inside it. As a defender, you can use it as another tool to help decide where to allocate your all-too-limited resources for maximum effect.
Initially, the idea came about as a way to estimate the resilience of a self-replicating botnet, though the concept can be applied to any type of network. To keep things tractable, I performed several Monte Carlo simulations, using NetworkX to define and analyze the network graphs (which I will describe more below). Each experiment selected the 5 top-scoring nodes for a given centrality measure. For each of these nodes (called 'patient 0' nodes), 200 Monte Carlo simulations were run, each lasting 1000 steps (evolutions). At the end, the total number of infected systems remaining was tallied as a scaled percentage of the total nodes in the graph.
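To make the selection step concrete, here is a minimal sketch of how the patient 0 candidates could be chosen with NetworkX. The function name is mine, and out_degree_centrality stands in for whichever measure is under test:

import networkx as nx

# Pick the n top-scoring nodes for a given centrality measure.
def top_patient_zeros(G, centrality=nx.out_degree_centrality, n=5):
    scores = centrality(G)  # dict mapping node -> score
    return sorted(scores, key=scores.get, reverse=True)[:n]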

Network Setup

I started with a simple graph description and gradually added depth to the simulation. The first iteration of the network definition was simply a multi-edge directed graph (a NetworkX MultiDiGraph). Each node represents a system and each edge represents a directional communication channel between two systems. To be more realistic, nodes form into LAN-like clusters with only a few nodes crossing into other LAN spaces. These boundary-crossing nodes are known as "routers". Several other variables were set up to allow the network to form a semi-random topology at the onset of each test:
  • min_clusters (5) - The minimum number of LAN-like clusters to generate.
  • max_clusters (10) - The maximum number of LAN-like clusters to generate.
  • min_in_cluster (100) - The minimum number of Nodes in each LAN-like cluster.
  • max_in_cluster (200) - The maximum number of Nodes in each LAN-like cluster.
  • min_routers (1) - The minimum number of Nodes in a LAN-like cluster which bridge to another cluster.
  • max_routers (5) - The maximum number of Nodes in a LAN-like cluster which bridge to another cluster.
  • min_router_edges (1) - The minimum number of edges connecting routers to other clusters.
  • max_router_edges (3) - The maximum number of edges connecting routers to other clusters.
  • min_lan_edges (1) - The minimum number of edges connecting nodes within the same LAN-like cluster.
  • max_lan_edges (10) - The maximum number of edges connecting nodes within the same LAN-like cluster.
  • cluster_prefixes - A set of differentiators between LAN-like clusters. 
Changing these settings defines the variability in the networks being tested. Setting accompanying min/max variables (for example, min_routers and max_routers) to the same value has the effect of setting a constant. The defaults in brackets are what I had set at the time of writing.
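To make the generation procedure concrete, here is a rough sketch of how such a topology could be built from the parameters above. The construction details (naming, edge targets) are my own guesses, not the exact code:

import random
import networkx as nx

def build_network(p, cluster_prefixes):
    G = nx.MultiDiGraph()
    clusters = []
    for c in range(random.randint(p["min_clusters"], p["max_clusters"])):
        prefix = cluster_prefixes[c % len(cluster_prefixes)]
        size = random.randint(p["min_in_cluster"], p["max_in_cluster"])
        nodes = [f"{prefix}-{i}" for i in range(size)]
        G.add_nodes_from(nodes)
        for node in nodes:  # intra-LAN edges
            for _ in range(random.randint(p["min_lan_edges"], p["max_lan_edges"])):
                G.add_edge(node, random.choice(nodes))
        clusters.append(nodes)
    for nodes in clusters:  # promote a few nodes in each cluster to routers
        n_routers = random.randint(p["min_routers"], p["max_routers"])
        for router in random.sample(nodes, n_routers):
            for _ in range(random.randint(p["min_router_edges"], p["max_router_edges"])):
                other = random.choice([cl for cl in clusters if cl is not nodes])
                G.add_edge(router, random.choice(other))
    return G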

The next step was to define other roles for the nodes in the network. For simplicity's sake I kept it to 3 roles:
  • The previously described router role, which connects to another LAN-like cluster.
  • The User Node type represents systems which are in use by an end user (say, for email or web browsing).
  • The Server type represents machines that are more likely to be contacted by other machines. 
With the concept of roles defined, I was able to create some more variables to control the likelihood of different types of nodes interacting. For example, in my network definition, users are less likely to send messages to other user nodes; they are, on the other hand, very likely to contact server nodes. I also defined a probability distribution over the Server and User types: I set the probability of a node being a user to 70%, leaving a 30% chance of it being a server.
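A minimal sketch of that role assignment, assuming the routers were already identified during generation (names are mine):

import random

def assign_roles(G, routers, p_user=0.70):
    for node in G.nodes():
        if node in routers:
            G.nodes[node]["role"] = "router"
        else:
            # 70% chance of user, 30% chance of server
            G.nodes[node]["role"] = "user" if random.random() < p_user else "server"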
The first display of the graph was nice, but to view simulation results I thought it would be better to reposition the nodes so I could see them more clearly. First I removed the edges, then I used a spring layout with the distance parameter k=1 to calculate a more evenly spaced placement.
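In NetworkX terms, the repositioning is roughly this (assuming G is the generated graph):

import matplotlib.pyplot as plt
import networkx as nx

pos = nx.spring_layout(G, k=1)  # k is the optimal distance between nodes
nx.draw_networkx_nodes(G, pos, node_size=15)  # draw nodes only, no edges
plt.show()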

Edges removed. The graph is already looking better.

The final position of the nodes after applying the k parameter.
To understand how the nodes moved around to create the final position pictured above, I created an animation. That fun bit of code deserves its own blog post, and I will probably write it up...eventually.

Communication Parameters

As mentioned above, once roles were defined, the next step was to probabilistically define communication for each possible combination of node types (a sketch of one way to encode this table follows the list):
  • User -> User - 10%
  • User -> Server - 70%
  • User -> Router - 20%
  • Server -> User - 5%
  • Server -> Server - 85%
  • Server -> Router - 10%
  • Router -> User - 10%
  • Router -> Server - 70%
  • Router -> Router - 20%
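One way to encode this table is a nested dict of P(target type | source type) plus a weighted draw. This is a sketch of my own, and random.choices requires Python 3.6+:

import random

COMM_PROBS = {
    "user":   {"user": 0.10, "server": 0.70, "router": 0.20},
    "server": {"user": 0.05, "server": 0.85, "router": 0.10},
    "router": {"user": 0.10, "server": 0.70, "router": 0.20},
}

def pick_target_type(source_type):
    # Draw a target node type according to the source type's distribution.
    types, weights = zip(*COMM_PROBS[source_type].items())
    return random.choices(types, weights=weights)[0]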

Infection Parameters

Now that I can generate test networks and probabilistically simulate communication among the different node types, it is time to discuss the infection parameters, which control the behavior of the infectious agent. This behavior can be modeled on a theoretical piece of malicious code (as I do here) or on the infection statistics of a live threat. In either case, what is being described is the probability of the infection spreading. To do this I defined several parameters:
  • inbound_infection (15%) - The probability of infecting a node which has initiated a connection with an infected node.
  • outbound_infection (20%) - The probability of infecting a clean node, once contacted by an infected node.
  • inbound_detection (5%) - The probability of an infection being noticed as it spreads during an inbound infection. This is analogous to the clean node which initiated the contact becoming aware of the infection on the contacted machine. Inbound detection cleans the originally infected machine, but does not necessarily protect the initiator from infection.
  • outbound_detection (10%) - Similar to inbound detection, but for outbound infections. This is analogous to an operator recognizing that the machine they have contacted is infected.
  • proxy_infection (0%) - A special case where the infected machine doesn't have to be the start or end of a communication chain to cause an infection. This would be analogous to an infected proxy injecting into the traffic streams it carries.
  • proxy_detection (0%) - As with the other detection probabilities, this is the chance that the infection will be noticed by one of the uninfected nodes.
When an infection is noticed, the originally infected node is 'cleaned'. That is, its state is returned to uninfected and it no longer counts towards the score at the summation step. In the proxy case it is the receiving machine that is at risk of infection.
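Collected in code, the defaults above might look like this (a sketch; the variable name is mine and is reused in the later sketches):

INFECTION_PARAMS = {
    "inbound_infection":  0.15,  # clean initiator contacts an infected node
    "outbound_infection": 0.20,  # infected node contacts a clean node
    "inbound_detection":  0.05,
    "outbound_detection": 0.10,
    "proxy_infection":    0.00,  # infected middle hop injects into traffic
    "proxy_detection":    0.00,
}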


A post-infection graph for the out-degree centrality.

At this point, I have defined a network, the roles and communication within that network, and the parameters describing the spread of the contaminant. We are ready to start simulating...almost.

An aside for Python

To keep things somewhat scientific, I added an ID to each network I generated and stored the graph as a file along with the recorded stats. Now for the fun Python part. I have seen dozens of ways to generate a random string in Python, but there is one I want to call attention to, as I think it is a fairly elegant use of a list comprehension:

import string
from random import choice

the_id = "".join([choice(string.ascii_lowercase) for i in range(32)])

This one-liner generates a pseudo-random ID of 32 lowercase alphabetic characters. Replacing string.ascii_lowercase with one of the other character collections in the string library (for example, string.ascii_letters contains both upper- and lowercase letters) changes the size of the character pool (here, 26 possible characters, giving 26^32 possible IDs).

Simulation

Let's not forget that the point of all these parameters is to simulate a wide variety of unknown possible interactions. Your network is different from my test network, which is different from the network at my office. In many cases these differences go beyond network topology to the very reason the network exists in the first place. When simulating such an unknown set of outcomes, a popular choice is a Monte Carlo simulation. From the Wikipedia article on Monte Carlo methods:
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
It is that last case we are interested in here. If we consider the final state of each network graph as a member of a set U, then we are randomly generating m possible outcomes (a subset of U). There are two primary controlling parameters for each Monte Carlo simulation: the number of simulations to run and the number of steps in each simulation.

As an example, let's look at a different popular use for Monte Carlo simulation: game theory, and specifically chess. When one first studies computational game theory, one faces the problem of a function that must compare the potential outcomes of each candidate move and choose the one it scores highest. In a game like chess, the endgame is easier to simulate because there are often few legal moves and short chains of moves until the game concludes; in some scenarios it is even possible to use exhaustive search to find optimal conclusions. At the beginning of the game, however, there are an impractical number of possible outcomes to simulate all the way to the end using exhaustive search. A popular option in this case is to heuristically select some candidate moves, run a number of Monte Carlo simulations from each position, and aggregate the overall scores for each potential move.

The number of steps is important to the fidelity of the simulation. Continuing the chess example: after just 2 steps (White's first move followed by Black's response) there are already 400 possible board configurations. There are 5,362 possible positions (White's second ply move) or 8,902 total positions after two ply moves each; 71,852 possible positions or 197,742 total positions after four moves; and 809,896 possible positions or 4,897,256 total positions after 5 moves. The more steps in a simulation, the more possible paths there were that went unfollowed: you can view a single 5-move simulation as 1 of the 4,897,256 possibilities. I think this slide describes it well (not sure where I found it, sorry).


The last point under disadvantages is interesting. In this case I could gather many more observables for a few networks: rather than randomly choosing network behavior, I could use packet analysis to build a much more comprehensive model of network communication. The limitation on observables is an attempt to (hopefully) help my observations generalize. In fact, I have left the option to remove the communication probabilities between node types so that the simulations can take on an even wider spread of the potential outcomes. We will look at the difference in spread between a network with and without these communication restrictions later.

As mentioned previously, I selected the 5 best nodes for each centrality measure and, for each node in this set, ran the series of Monte Carlo simulations described above. At each step, a random number of nodes between min_comm and max_comm is selected as communication initiators. For each of these nodes an endpoint is randomly selected: the type of the end node is chosen probabilistically first, and the endpoint is then drawn at random from the set of nodes of that type. The endpoint must be reachable from the initiator along some path, and since there is no guarantee the generated graph is strongly connected, this property has to be checked. I set a maximum of 50 tries, meaning I would attempt to randomly select a reachable endpoint up to 50 times. If after those tries no endpoint is located, the program looks at which shortest paths exist from the start node and randomly selects the end of one of them as the endpoint. This sounds more complicated than it works out to be; a sketch follows, and hopefully the flow chart below will clarify things a bit.
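Here is a sketch of that endpoint selection logic. The names are mine, and the fallback uses NetworkX's single-source shortest paths:

import random
import networkx as nx

def pick_endpoint(G, start, candidates, max_tries=50):
    # Try up to max_tries random endpoints of the chosen type.
    for _ in range(max_tries):
        end = random.choice(candidates)
        if end != start and nx.has_path(G, start, end):
            return end
    # Fallback: pick the end of a random shortest path from the start node.
    reachable = list(nx.shortest_path(G, source=start))
    reachable.remove(start)
    return random.choice(reachable) if reachable else None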


The yellow box is another function, which determines whether the infection spreads from one node to the other during a communication; it also checks whether the infection is noticed by the uninfected node. Each time a node becomes infected, its state is changed from uninfected to infected and the infection counter is incremented. Each time a node is cleaned of infection, its state is returned to uninfected and the cleaned counter is incremented. Both of these have a complementary metric: we can also tally how many times a node cleans an infected node, and how many times an infected node infects a previously cleaned node.
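A sketch of what that function could look like, using the INFECTION_PARAMS dict from earlier; the exact ordering of the checks is my guess at the behavior described above:

import random

def process_contact(initiator, endpoint, state, stats, p):
    # state maps node -> True (infected) / False (clean); stats holds counters.
    if state[initiator] and not state[endpoint]:
        # Outbound: an infected initiator contacts a clean endpoint.
        if random.random() < p["outbound_infection"]:
            state[endpoint] = True
            stats["infections"] += 1
        if random.random() < p["outbound_detection"]:
            state[initiator] = False  # the infected node is cleaned
            stats["cleanings"] += 1
    elif not state[initiator] and state[endpoint]:
        # Inbound: detection cleans the contacted node, but does not
        # necessarily protect the initiator from infection.
        if random.random() < p["inbound_infection"]:
            state[initiator] = True
            stats["infections"] += 1
        if random.random() < p["inbound_detection"]:
            state[endpoint] = False
            stats["cleanings"] += 1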


After all the steps are completed in this fashion, the statistics for that simulation are calculated. Early next year I plan to release the analysis of these statistics.

Styling the Graphs

If you choose to run your own simulations, you may want to change the style of the graphs. There are several ways to achieve this. The first is choosing an appropriate layout and distance parameter, as I did with the spring layout and k=1. The next is to use NetworkX's drawing functions, which wrap Matplotlib:
nodes = nx.draw_networkx_nodes(G, curr_pos, ax=ax_u, node_size=15, node_color=cs)
In the above code, I call the draw_networkx_nodes function with the graph I want to draw from (G). I pass in a dictionary of current X,Y positions for each node (returned from the call to nx.spring_layout(G) earlier in the code). I then pass in an MPL axes object; if I skipped this, NetworkX would create a new axes object for me. Next I pass in the node size (in points squared, as with Matplotlib's scatter). Finally, I pass in a list of colors the same length as G.nodes(). There are many other available options you can use to fine-tune the display a bit:
https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx_nodes.html#networkx.drawing.nx_pylab.draw_networkx_nodes
The default shape is a circle, and the default color is red. Here is a link to a list of named colors you can use for both node and edge coloration.
https://matplotlib.org/2.0.0/examples/color/named_colors.html

One final way to style the graphs is through the Matplotlib pyplot module directly:
fig = plt.figure()
ax = fig.add_axes([.1, .1, .8, .8])
ax.set_aspect('equal')
The above code creates an axes object with an equal aspect ratio. The ax object is suitable for use with the previously mentioned draw_networkx_nodes function, meaning you can combine the two methods seamlessly.
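For instance, a minimal combination of the two approaches might look like this (assuming G is the generated graph):

import matplotlib.pyplot as plt
import networkx as nx

# Build an equal-aspect axes by hand, then draw into it with NetworkX.
fig = plt.figure()
ax = fig.add_axes([.1, .1, .8, .8])
ax.set_aspect('equal')
curr_pos = nx.spring_layout(G, k=1)
nx.draw_networkx_nodes(G, curr_pos, ax=ax, node_size=15, node_color='red')
plt.show()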

Parallelization

One major advantage of Monte Carlo simulations lies in the fact that they can be distributed to different processes to complete. In general, parallelizing a process is simply dividing the work among many workers capable of working independently, whose results can be combined into one overall answer. There are several possible work distribution schemes one could choose. One method is to give each worker a patient 0 as a start point. Each worker returns the summarized results for its respective start point, and the main application aggregates the totals to deliver the simulation scores. In this example that would lead to 5 processes (one for each patient 0) for each centrality measure checked.
Another possible division of labor is to give each worker a centrality measure to test. In my situation this has the advantage of requiring only 4 workers to be managed from the main application. How to divide things up is a matter of needs, development taste, and the resources available; a sketch of the first scheme is below.
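A minimal sketch of the first scheme using the standard library's multiprocessing module; run_simulations is an assumed module-level helper that runs the full batch of simulations for one start node and returns its summary statistics:

from multiprocessing import Pool

def score_patient_zeros(patient_zeros):
    # One process per patient 0; results are combined by the main application.
    with Pool(processes=len(patient_zeros)) as pool:
        results = pool.map(run_simulations, patient_zeros)
    return dict(zip(patient_zeros, results))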

Multi-Tiered Distributed Processing of Network Graphs

In fact, I chose to go with a combined setup. I have a cluster of 5 computers I can use to split up the task. I used 1 system as the primary controller and the other 4 systems as workers. After generating the network graph, I send a copy to each worker system. Each system specializes in a different centrality measure, so it selects its own patient 0 population. Then, within each worker system, a new thread is created to manage the simulations for each patient 0. The worker system aggregates its threads' results and sends the data back to the main processor, which in turn aggregates that with the other workers' responses.
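The worker side of that setup might be sketched like this; run_simulations and aggregate are assumed helpers, and the transport back to the controller is omitted:

from concurrent.futures import ThreadPoolExecutor

def worker(G, centrality, n=5):
    # Each worker owns one centrality measure and picks its own patient 0s.
    scores = centrality(G)
    patient_zeros = sorted(scores, key=scores.get, reverse=True)[:n]
    # One thread per patient 0, mirroring the description above.
    with ThreadPoolExecutor(max_workers=n) as ex:
        results = list(ex.map(lambda p0: run_simulations(G, p0), patient_zeros))
    return aggregate(patient_zeros, results)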

Conclusion

At this point, I have defined a way to generate a network to work with, an infectious agent to simulate, a model for simulating communication, and a method for aggregating the statistics. While the question of centrality choice is a primary interest, there are many other questions that can be explored using this framework. Including this modeling capability in routers and gateways could allow for better traffic analysis, letting analysts spend resources on the events that truly pose a risk. In the future, perhaps we will see this type of analysis become more common in both the defensive and offensive capabilities of infosec practitioners.
