Saturday, October 20, 2018

Functional Obfuscation in Python






As a researcher, I have always enjoyed exploring the technology and methodology of Reverse Engineering. One area that has particularly fascinated me is the way various systems try to resist reversing. Companies and criminals both use these methods in their efforts to protect their intellectual property and dissuade would-be attackers. In this post I will describe a few methods for protecting Python code from analysis which I think more developers should be aware of.

Background

First, let's go over a bit about the various ways one can distribute Python code.
  • The script method - This is the most basic form of distributing code to others. Simply pass the new source files to each system. This relies heavily on the other system having the proper environment already configured. It is generally only good for developers working together on a project.
  • The Compiled method - Python is an interpreted language, which usually means that there is no compilation involved. The interpreter can work off the source files. However, this is slow since it involves an additional translation step from high-level source to the bytecode the interpreter actually runs. To save time, you can have the bytecode stored in compiled Python (.pyc) files. These files are no longer human-readable, but that doesn't imply any form of obfuscation. There are tools which can fairly reliably recover source code constructs from .pyc files.
  • The Self-Distributed Package - Python has, for a very long time, supported a full set of distribution tools called distutils. This has been updated and superseded by setuptools. I will cover this in more detail later, but the key is that these tools allow you to configure how your code will get installed on other machines.
  • The Python Package Index - If you have ever installed a Python package using pip or easy_install, then you have used the Python Package Index (PyPI), the collection of modules published by the Python community. There are several great tutorials on publishing packages there, so I am not going to dive into great detail on this topic here.
  • Stand-Alone Apps - To be fair, this is really a combination of the compiled method and the self-distributed package method, but with all of the supporting files required to run on a given system. Examples of this type are Py2Exe, PyInstaller, and Py2App, all of which bundle your (often compiled) code with the required imported modules, system files, and a copy of the interpreter. This means your code can run on that operating system, even if Python is not installed.
Each of these methods has its own benefits and drawbacks. I encourage you to research each of them more deeply. The processes I will describe are just meant to get you started creating your own methods, implementations, and tools.
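
As a rough illustration of the compiled method, the sketch below writes a throwaway script to a temporary directory and compiles it to a .pyc file with the standard-library py_compile module. The file name hello.py and its contents are placeholders invented for this example. It also checks that the string literal from the source is still sitting in the compiled file, which is why a .pyc should not be mistaken for obfuscation.

```python
import importlib.util
import os
import py_compile
import tempfile

# Hypothetical one-line script used only for this demonstration
SOURCE = 'print("hello from compiled code")\n'

def compile_to_pyc(source_text):
    """Write source_text to a temp .py file, compile it, return (path, bytes)."""
    workdir = tempfile.mkdtemp()
    src_path = os.path.join(workdir, "hello.py")
    pyc_path = os.path.join(workdir, "hello.pyc")
    with open(src_path, "w") as handle:
        handle.write(source_text)
    # cfile chooses the output location; doraise turns compile errors into exceptions
    py_compile.compile(src_path, cfile=pyc_path, doraise=True)
    with open(pyc_path, "rb") as handle:
        return pyc_path, handle.read()

pyc_path, pyc_bytes = compile_to_pyc(SOURCE)
print(pyc_path, len(pyc_bytes), "bytes")
# The file starts with the interpreter's magic number...
print(pyc_bytes[:4] == importlib.util.MAGIC_NUMBER)
# ...and the original string literal is still present in the raw bytes
print(b"hello from compiled code" in pyc_bytes)
```

The resulting hello.pyc can be run directly with the same interpreter version that produced it.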

Obfuscation is an attempt to frustrate Static Analysis efforts. Static Analysis (SA) has long been the first (and sometimes only) line of defense against malware. Signature-based anti-virus products often use SA, looking for matches with known-malicious byte patterns or function-call heuristics, to decide whether an unknown sample may be malicious. Obfuscation makes this difficult or impossible to automate by hiding the malicious portions of code behind some method that makes its bytes appear one way at rest and then change to the correct information as needed at run time. The most basic example I can think of is encoding text. Base64-encoding the string "hello world" (followed by a newline) results in the string aGVsbG8gd29ybGQK. If we were to write a byte-matching pattern, in Yara for example, we might look for "68:65:6c:6c:6f:20:77:6f:72:6c:64" (the hello world bytes) but not "61:47:56:73:62:47:38:67:64:32:39:79:62:47:51:4b" (the encoded bytes). This brings up an important concept when reasoning about obfuscation: there are multiple interpretations to consider, one for automated analysts and one for human analysts.
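
The encoding example can be reproduced with a few lines of Python. This is only a sketch: it assumes the encoded string was produced from "hello world" followed by a newline (as a shell echo would emit), and the colon_hex helper is a name made up here to print bytes in the same colon-separated style.

```python
import base64

plain = b"hello world\n"            # assumed input, including the trailing newline
encoded = base64.b64encode(plain)   # the bytes as they would sit "at rest"

def colon_hex(data):
    """Render bytes as colon-separated hex pairs."""
    return ":".join(f"{byte:02x}" for byte in data)

signature = b"hello world"          # what a byte-matching rule would look for

print(encoded.decode("ascii"))
print(colon_hex(signature))
print(colon_hex(encoded))
print(signature in plain)           # the rule matches the plain text
print(signature in encoded)         # but finds nothing in the encoded form
print(base64.b64decode(encoded) == plain)  # decoding restores the original
```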

Basic Obfuscation

Once it is time to package your code for distribution, you can begin to assess what level of protection is required. Two common methods are minification and symbol renaming.
Minification rewrites the code in as few characters as it can, while still maintaining the semantic meaning. As an example, the code:
if is_day == 1:
    if num_dogs >= 1:
        do(feed_dogs)
is 60 characters (including newline and space characters). It can be rewritten:
if x==1 and y>=1: do(c);
which is only 24 characters, a 60% reduction. Of course, the readability of the code has gone down. It is no longer clear what the variables c, x, and y represent beyond the fact that x and y resolve to something that can be compared to integers. This is a very basic example, and several other good ones can be found. You aren't expected to know all the creative ways you may reduce your code; to accomplish this, I usually use PyMinifier. One great aspect of PyMinifier, if your concern is more about obfuscation and less about size reduction, is that you can choose to use longer random names which can include Unicode characters. This has the effect of making the source largely illegible, even without any other obfuscations.
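
The character counts, and the claim that the meaning is preserved, can be checked with a short script. This is a sketch only: the run helper and the stand-in do function (a list's append method) are assumptions made for the demonstration, not part of any minifier.

```python
# The two snippets from the text, held as strings so they can be measured and run
original = (
    "if is_day == 1:\n"
    "    if num_dogs >= 1:\n"
    "        do(feed_dogs)\n"
)
minified = "if x==1 and y>=1: do(c);"

reduction = 1 - len(minified) / len(original)
print(len(original), len(minified), f"{reduction:.0%}")

def run(snippet, names, is_day, num_dogs):
    """Execute a snippet with the given variable names and record calls to do()."""
    calls = []
    env = {names[0]: is_day, names[1]: num_dogs, names[2]: "feed", "do": calls.append}
    exec(snippet, env)
    return calls

# Both versions should behave the same for every combination of inputs
same = all(
    run(original, ("is_day", "num_dogs", "feed_dogs"), day, dogs)
    == run(minified, ("x", "y", "c"), day, dogs)
    for day in (0, 1)
    for dogs in (0, 1, 2)
)
print(same)
```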

Obfuscation via compilation

This one barely counts in my opinion, since Python helpfully comes with functions to disassemble its compiled code. That said, there are other benefits besides the obfuscation that make compiling the code before distribution more attractive. First, it saves time on the client side. In most circumstances the interpreter will find the modules your application loads and determine whether each is a source script or a Python byte-code file. If it is the former, the contents of the file are sent to Python's builtin compile() function (which turns it into the latter). The result is stored in the sys.modules table for later reference. You can sidestep the whole compile step simply by handling it prior to distribution. As the client goes to load the module, it will find the Python byte-code files and load them directly into the sys.modules table. Another benefit is that a compiled file can be smaller than its source code representation, particularly when the source is heavily commented, so your final deliverable may shrink as well. Finally, when combined with other forms of obfuscation, compilation has the effect of compounding the confusion.
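
To see how little compilation hides, here is a small sketch using the standard-library dis and marshal modules. The two-line SOURCE string is made up for the example; the marshal round trip stands in for reading the code object back out of a .pyc file.

```python
import dis
import marshal

# Hypothetical source that a developer might hope to hide by compiling it
SOURCE = "secret = 'hunter2'\nprint(secret.upper())\n"

code = compile(SOURCE, "<demo>", "exec")   # the same builtin the interpreter relies on
blob = marshal.dumps(code)                 # serialized form, as stored after a .pyc header
restored = marshal.loads(blob)             # what an analyst gets back from the bytes

# Variable names, attribute names, and constants all survive compilation
print(restored.co_names)
print(restored.co_consts)

# A readable instruction listing is one call away
dis.dis(restored)
```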

Function Reference Obfuscation

One of the most powerful features of Python (in my opinion) is the ability to use functions as data. Writing a function's name without the accompanying parentheses tells Python to return a reference to that function object instead of calling it.
foo.bar()  # Calls the function bar
print(foo.bar)  # Prints a reference to the function bar
b = foo.bar  # Make the variable b an alias for foo.bar
b()  # Calls foo.bar via the alias
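
For a version that runs as-is, the sketch below does the same thing with the standard-library math module standing in for foo. The describe helper and the handlers dictionary are names invented for this illustration.

```python
import math

def describe(func, value):
    """Call whatever function object was passed in and report its name."""
    return f"{func.__name__}({value}) = {func(value)}"

root = math.sqrt                                  # no parentheses, so nothing is called yet
handlers = {"root": root, "floor": math.floor}    # functions stored as ordinary dict values

print(root is math.sqrt)                  # the alias points at the very same object
print(describe(handlers["root"], 16.0))
print(describe(handlers["floor"], 2.7))
```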
There is another, related but slightly different, way to load a function: look it up by name from a string using getattr(module, 'function'). As a more concrete example, let's consider the Python builtin function exec(). Exec is widely considered a 'suspicious' function for a program to rely on. First, executing arbitrary statements which can be affected by users is risky for a benign application due to the potential for injection abuse. Second, exec shows up in a lot of bot-like malware to facilitate modular execution. For these reasons you may want to avoid directly referencing the exec() function.
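import builtins
from base64 import b64decode

guid = "ZXhlYwo="  # Base64 of "exec\n", disguised as an identifier
print("You are %s" % guid)  # GUID used as a string
# Decode the name, drop the trailing newline, and look it up among the builtins.
# The builtins module is used because __builtins__ is a dict, not a module,
# when this code runs inside an imported module.
f = getattr(builtins,
    b64decode(guid.encode()).decode("utf-8")[:-1])
f("print(\"Aaarrgh\")")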
The last line, f("print(\"Aaarrgh\")") is functionally equivalent to exec("print(\"Aaarrgh\")") but the getattr() and b64decode() functions are both very common and not likely to raise any automated analysis alarms.
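
As a rough way to see the effect on automated analysis, the sketch below runs a deliberately simplistic keyword check over a condensed, hypothetical copy of such a sample. The naive_scan function is a toy stand-in for a signature rule, not how any particular product works.

```python
import base64
import builtins

# Hypothetical, condensed source text of an obfuscated sample
SAMPLE_SOURCE = (
    'guid = "ZXhlYwo="\n'
    'f = getattr(builtins, b64decode(guid).decode("utf-8").strip())\n'
)

def naive_scan(source, keyword="exec"):
    """Toy signature check: flag the source if the keyword appears anywhere."""
    return keyword in source

# What the sample would resolve at run time
hidden_name = base64.b64decode("ZXhlYwo=").decode("utf-8").strip()
resolved = getattr(builtins, hidden_name)

print(naive_scan('exec("print(1)")'))   # a direct call is flagged
print(naive_scan(SAMPLE_SOURCE))        # the encoded reference is not
print(resolved is exec)                 # yet both end up at the same builtin
```

A human analyst who decodes the string sees through this immediately, which is the gap between the automated and human interpretations mentioned earlier.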

