Debugging a Xul Javascript Memory Corruption Bug in Firefox 3.5
November 2009
Introduction
Debugging complex software can be a difficult task. Even software as familiar as Firefox is often black boxes that few understand. This paper describes the process and mistakes I used to investigate a bug I discovered in the PL_DHashTableOperate function. This function is a memory management hash table that is used during garbage collection, among other things, inside XPCOM Cycle Collector.
Noticing the bug
This bug wasn’t discovered using elaborate fuzzing techniques. Rather, it was noticed while browsing a friend’s website:
The bug only presents in Firefox and only consistently (among systems I’ve tested) on Windows 7 (32 bit). Crashes are more likely when Google Toolbar is installed but that is not a requirement.
The page obviously utilizes a Google Maps and loads a plethora of data as seen above. After many markers are loaded the browser hangs and crashes with a memory read error near NULL. During the crash Firefox consumes in exess of 1.4GB of memory.
Debugging the Process
Firefox is a complex multithreaded application. Consuming large quantities of memory and/or crashing does directly point toward the problem. First we must consider our debugging environment. Initially, we were inclined to use Immunity Debugger (immdbg) as the layout and functionality are, in our opinion, superior to other options. Immdbg’s integrated Python is almost a must have. Unfortunately, Immdbg does not support recent debug symbols (PDB files). Given the complexity of Firefox, this is a significant handicap and we therefore opted to use Windbg. First to leverage this, remember to add Firefox’s debug server: http://symbols.mozilla.org/firefox
Next we attach the debugger and identify the runway thread:
Once we’ve identified the runaway thread, we obtain a stack trace.
The stack trace shows multiple calls to xul!AppendNodeTextNodeContentsRecurse. This seems like a good place to start. Appending generally implies a memory operation and we know our problem is related to heap allocations. Search for Xul, we find the XUL is Mozilla’s XML parsing library. Now we have a parser and a memory problem, which seems even more likely to be buggy.
We next use Immdbg to visualize the data being consumed by the thread:
The data corresponds to the restaurant information being loaded by the Google Maps API. This data is in a for loop and is XML. Looking at the Javascript we see:
In particular we note map_sidebar.appendChild() call. Appending map XML data seems very relevant to our bug so we begin testing. Unfortunately, it turns out this isn’t directly related to our bug. We will shortly provide an accurate stack trace of the problem, but our testing yielded some (mildly) notable results.
First, we tried excessively calling the appendChild() function to create a memory fault. This will sometimes cause Firefox to crash, but not with our fault. It also takes excessively long for the crash to occur (likely due to the small amount of data we are appending).
Observers might note the s string we’re building. We also thought we might be able to read out of bounds on markers to obtain unauthorized reads. This didn’t work. Javascript appears to alloc() memory for any marker[i].
Next we tried to manipulate the markers by going out of bounds and by inserting fixed data. This consumed memory but generally didn’t create a real problem.
Again, no notable effect.
Changing Directions
Next we observe the stack traces more thoroughly. We use Windbg’s !analyze and !exploitable features. These help some – mainly analyze.
We identify the problem in xul!PL_DHashtable. The problem occurs when ECX is passed in and used as a read address.
The calling function to PL_DHashTableOperate is xul!GCGraphBuilder::NoteRoot
A stack argument to NoteRoot is passed to PL_DHashTableOperate which uses it without bounds checking. Oddly, NoteRoot does a check on this value but still calls PL_DHashTableOperate.
Next we research the class a little more:
Looking at a chart of the relationships of the class we’re simply bewildered. Look at this:
How could such a thing not be buggy?
Tracing the problem is complicated. The libraries use C++, which often mangles call stacks. The objects use inheritance and the Garbage collector itself is a bit outside the scope of any individual class. It’s simply a related class that cleans up stray objects according to a complex tree algorithm.
Source
We obtain the source for the functions from Mozilla. The source may not be entirely accurate as it’s not guaranteed to be the build source.
AddXPConnectRoots iterates over Javascript contexts using the variable acx. It always calls NoteRoot. NoteRoot appears to do some validation, but looking at the disassembly we know any NULL void * root, should trigger the bug – as anything less than 0x0A should always continue to the vulnerable function. We’re not sure why the assert fails – the compiled code does some checking but not as the souce code indicates. We might very well be looking at different versions of code/binary.
Exploitation – Epic Fail?
Given the above, exploitation should consist of:
- Generate objects in javascript.
- Create a condition where the object is preselected for cleanup
- Trigger the Garbage Collector
- Delete the object.
- Crash
This is a race condition bug and we have not been able to produce reliable code to exploit it. I’m not sure if I will try further for a DoS. I have seen EIP over written (once in MANY debug iterations). I’ve also seen a few writes. This likely occurs do to quirks in the free-ing process and timing.
The details are complicated heap kung-fu. The GCGraphBuilder class implements the purple portion of the garbage cycle collector. Described as:
64 : // Purple nodes are *candidates* for being scanned. They are nodes we
65 : // haven't begun scanning yet because they're not old enough, or we're
66 : // still partway through the algorithm.
67 : //
68 : // XPCOM objects participating in garbage-cycle collection are obliged
69 : // to inform us when they ought to turn purple; that is, when their
70 : // refcount transitions from N+1 -> N, for nonzero N. Furthermore we
71 : // require that *after* an XPCOM object has informed us of turning
72 : // purple, they will tell us when they either transition back to being
73 : // black (incremented refcount) or are ultimately deleted.
Continues:
78 : // An XPCOM object is either scan-safe or scan-unsafe, purple-safe or
79 : // purple-unsafe.
80 : //
81 : // An object is scan-safe if:
82 : //
83 : // - It can be QI'ed to |nsXPCOMCycleCollectionParticipant|, though this
84 : // operation loses ISupports identity (like nsIClassInfo).
85 : // - The operation |traverse| on the resulting
86 : // nsXPCOMCycleCollectionParticipant does not cause *any* refcount
87 : // adjustment to occur (no AddRef / Release calls).
88 : //
89 : // An object is purple-safe if it satisfies the following properties:
90 : //
91 : // - The object is scan-safe.
92 : // - If the object calls |nsCycleCollector::suspect(this)|,
93 : // it will eventually call |nsCycleCollector::forget(this)|,
94 : // exactly once per call to |suspect|, before being destroyed.
95 : //
96 : // When we receive a pointer |ptr| via
97 : // |nsCycleCollector::suspect(ptr)|, we assume it is purple-safe. We
98 : // can check the scan-safety, but have no way to ensure the
99 : // purple-safety; objects must obey, or else the entire system falls
100 : // apart. Don't involve an object in this scheme if you can't
101 : // guarantee its purple-safety.
102 : //
103 : // When we have a scannable set of purple nodes ready, we begin
104 : // our walks. During the walks, the nodes we |traverse| should only
105 : // feed us more scan-safe nodes, and should not adjust the refcounts
106 : // of those nodes.
107 : //
108 : // We do not |AddRef| or |Release| any objects during scanning. We
109 : // rely on purple-safety of the roots that call |suspect| and
110 : // |forget| to hold, such that we will forget about a purple pointer
111 : // before it is destroyed. The pointers that are merely scan-safe,
112 : // we hold only for the duration of scanning, and there should be no
113 : // objects released from the scan-safe set during the scan (there
http://people.mozilla.com/~mnandigama/codecoverage_html/xpcom/base/nsCycleCollector.cpp.gcov.html
The exact problem occurs here:
PtrInfo*
GCGraphBuilder::AddNode(void *s, nsCycleCollectionParticipant *aParticipant
IF_DEBUG_CC_PARAM(PRUint32 aLangID)
)
{
PtrToNodeEntry *e = static_cast<PtrToNodeEntry*>(PL_DHashTableOperate(&mPtrToNodeMap, s, PL_DHASH_ADD));
PtrInfo *result;
if (!e->mNode) //BUG HERE. e may be null, but we access e+0x4
{
// New entry.
result = mNodeBuilder.Add(s, aParticipant
IF_DEBUG_CC_PARAM(aLangID)
);
if (!result) {
PL_DHashTableRawRemove(&mPtrToNodeMap, e);
Update: This bug – https://bugzilla.mozilla.org/show_bug.cgi?id=502687 also appears that the crash occurs on out of memory (OOM). Our testing and other reports indicate this bug can also appear when NOT OOM.
Lessons Learned
Going from random bug to some level of understanding in this case was difficult. A stack trace doesn’t really identify the problem. The source of the problem is in Javascript, which we have a lot of and it’s difficult to pinpoint the exact problem.
We laid out this paper, hopefully clearly. Our testing wasn’t so straightforward. Hopefully our process will help others. We suggest profiling the memory operations using runaway or following the heap to see what particular operations are most likely related to the memory corruption problem. Next do some simple searching in the Javascript to find possibly related strings. Next we suggest getting a solid stack trace and understanding of the disassembly to direct further testing. We largely failed at this. We approached the two ends (javascript and stack trace) as separate objectives and wasted much time.
If we were repeating a javascript bug, we would probably use a proxy such as WebScarab. Several times during the testing it would have been nice to modify javascript from the original offending website. A proxy with data meddling would have been beneficial. Simply copying the page and editing doesn’t work as the app is AJAX oriented.
Future
Like COM for windows. Sounds like fun.
Also we know of a bug in Flash that we’d like to leverage similar techniques to exploit.
Finally
Proposed fix (someone posted to Mozilla for bug 502687)
Bug 502687 - GCGraphBuilder::AddNode crashes on OOM.
diff --git a/xpcom/base/nsCycleCollector.cpp b/xpcom/base/nsCycleCollector.cpp
--- a/xpcom/base/nsCycleCollector.cpp
+++ b/xpcom/base/nsCycleCollector.cpp
@@ -1323,16 +1323,19 @@ GCGraphBuilder::~GCGraphBuilder()
}
PtrInfo*
GCGraphBuilder::AddNode(void *s, nsCycleCollectionParticipant *aParticipant
IF_DEBUG_CC_PARAM(PRUint32 aLangID)
)
{
PtrToNodeEntry *e = static_cast<PtrToNodeEntry*>(PL_DHashTableOperate(&mPtrToNodeMap, s, PL_DHASH_ADD));
+ if (!e)
+ return nsnull;
+
PtrInfo *result;
if (!e->mNode) {
// New entry.
result = mNodeBuilder.Add(s, aParticipant
IF_DEBUG_CC_PARAM(aLangID)
);
if (!result) {
PL_DHashTableRawRemove(&mPtrToNodeMap, e);
References
https://developer.mozilla.org/En/Using_the_Mozilla_symbol_server
http://doxygen.db48x.net/mozilla-full/html/dd/ded/nsCycleCollector_8cpp.html
https://developer.mozilla.org/en/XPConnect
https://developer.mozilla.org/en/Using_XPCOM_in_JavaScript_without_leaking
https://developer.mozilla.org/en/interfacing_with_the_xpcom_cycle_collector
http://en.wikipedia.org/wiki/XUL
https://developer.mozilla.org/En/XUL
https://wiki.mozilla.org/CrashKill/QA
http://doxygen.db48x.net/comm-central/html/classnsCycleCollectionTraversalCallback.html
https://bugzilla.mozilla.org/show_bug.cgi?id=502687