Friday, February 15, 2013

sscanf() Gotcha

Interesting gotcha I ran across using sscanf() today.

Here's the example scenario.  In my case, I was parsing a .mcs file and looking for pairs of characters that were hex representations of bytes of data.  That is, the string "FF001122" would translate to data = {0xFF, 0x00, 0x11, 0x22}.


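fstream stream;
stream.open(fileLoc);

char data[8] = {};   // one byte per pair of hex characters
char line[17];       // 16 hex characters plus the terminating '\0'
stream.getline( line, 17 );

// Parse each pair of hex characters into one element of data
for (int i=0; i<8; i++) {
    sscanf( line + (i*2), "%2x", &(data[i]) );
}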


Upon program exit, I get a stack corruption error around 'data'. Weird, right? We don't go out of bounds at any point on data, right?

Well, we don't... But sscanf() is a different story.

Let's step through the for loop.  We'll say line = "0011223344556677".

i=0.
data = { ?, ?, ?, ?, ?, ?, ?, ?}
Perform the sscanf.
data = { 00, 00, 00, 00, ?, ?, ?, ? }

See the problem yet?
We'll keep going:

i=1.
data = {00, 00, 00, 00, ?, ?, ?, ?}
Perform the sscanf.
data = {00, 11, 00, 00, 00, ?, ?, ?}

At the end, we have:

i=7.
data = {00, 11, 22, 33, 44, 55, 66, 77} 00, 00, 00 (out of bounds!!!)
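
The walk-through above can be reproduced without actually corrupting the stack. This is only a rough simulation, not the original program: the function name simulateWideWrites and the padded buffer are made up for illustration, and memcpy stands in for what sscanf does when it stores a whole unsigned int at each offset. The byte order shown in the walk-through assumes a little-endian machine with a 4-byte unsigned int.

```cpp
#include <cstring>

// Hypothetical simulation (not from the original post): mimics what
// sscanf("%2x") does to an 8-byte array by copying a whole unsigned int
// to each of the 8 offsets. `buf` must hold at least
// 7 + sizeof(unsigned int) bytes so the overrun lands in padding.
// Returns the index one past the last byte written.
std::size_t simulateWideWrites(unsigned char* buf) {
    const unsigned int values[8] = {0x00, 0x11, 0x22, 0x33,
                                    0x44, 0x55, 0x66, 0x77};
    std::size_t end = 0;
    for (std::size_t i = 0; i < 8; i++) {
        // Each store is sizeof(unsigned int) bytes wide, not 1 byte wide.
        std::memcpy(buf + i, &values[i], sizeof(unsigned int));
        end = i + sizeof(unsigned int);
    }
    return end;
}
```

Fill a 16-byte buffer with a marker value such as 0xAA, call simulateWideWrites on it, and the bytes just past index 7 no longer hold the marker. Those are the bytes that get trashed behind data in the real program.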

What's going on here is that sscanf() doesn't realize that data is an array of chars.  Since the "%x" conversion specifier was used, it assumes the pointer we passed points to an unsigned int, and it writes a whole unsigned int (4 bytes here) to that address.  It doesn't check, or ask.  Turns out there's a way to specify the size manually: the correct form to receive data back in char size (and thus not write out of bounds) is the format string "%2hhx" - that is, 2 hex chars read into one byte of data (the "hh" length modifier says the destination is a char, strictly an unsigned char for "%x").  If you look carefully, the scanf formatting table in the spec tells you that this is expected behavior, but it's not particularly outspoken about it.
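
A minimal sketch of the "%2hhx" version, assuming a C library that supports the C99 "hh" length modifier. The helper name parseHexPairs is hypothetical, and the destination is declared as unsigned char since that is the type "%hhx" expects.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the original post): parses up to `count`
// two-character hex pairs from `line` into `out`, one byte per pair.
// Returns the number of bytes successfully parsed.
int parseHexPairs(const char* line, unsigned char* out, int count) {
    const int len = static_cast<int>(std::strlen(line));
    int parsed = 0;
    for (int i = 0; i < count && i * 2 + 1 < len; i++) {
        // "hh" tells sscanf the destination is an unsigned char,
        // so it stores a single byte instead of an unsigned int.
        if (std::sscanf(line + i * 2, "%2hhx", &out[i]) != 1) {
            break;  // stop at the first pair that isn't valid hex
        }
        parsed++;
    }
    return parsed;
}
```

With this, parseHexPairs("FF001122", data, 4) fills data with 0xFF, 0x00, 0x11, 0x22 and touches nothing past data[3].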

So, keep an eye out, from your good friend Emily, wasting four hours of her day so you don't have to.

EDIT:
Turns out this doesn't always work, depending on your architecture and whether or not you're using Unicode. I ran this same code in a Unicode build and the 'hh' modifier doesn't help - sscanf still writes 4 bytes every time.  (The 'hh' modifier was only added in C99, so a C runtime library without C99 support may not honor it.)  Internet searches show that the best solution is to read into a temporary unsigned int and then copy its low byte into the array. Barf.
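
For what it's worth, the temporary-variable workaround can look something like this. It's a sketch under the same assumptions as the rest of the post; the helper name parseHexPairsViaTemp is made up. Plain "%2x" gets the unsigned int it expects, so it doesn't rely on "hh" support at all.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the original post): same job as before,
// but sscanf writes into a real unsigned int and we narrow it ourselves.
// Returns the number of bytes successfully parsed.
int parseHexPairsViaTemp(const char* line, unsigned char* out, int count) {
    const int len = static_cast<int>(std::strlen(line));
    int parsed = 0;
    for (int i = 0; i < count && i * 2 + 1 < len; i++) {
        unsigned int tmp = 0;  // exactly the type plain "%x" writes to
        if (std::sscanf(line + i * 2, "%2x", &tmp) != 1) {
            break;  // stop at the first pair that isn't valid hex
        }
        // Two hex digits never exceed 0xFF, so the narrowing is lossless.
        out[i] = static_cast<unsigned char>(tmp);
        parsed++;
    }
    return parsed;
}
```

The extra variable is ugly, but every write sscanf makes now lands in storage of the size it assumes.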
