Beware the string

.NET holds an enormous advantage over C++.

Well, okay, there are a few, but I’m thinking about one in particular: A single string type.

What’s not to love about that? Anybody who has done more than an hour’s worth of Windows programming in C++ should appreciate this feature. No more zero-terminated char* vs. length-prefixed char* vs. BSTR vs. wchar_t vs. CStringA vs. CStringW vs. CComBSTR. Just System.String. Hurray!

There’s one very specific thing not to love, however: The ease with which you can allocate a new one.

I’ve been working in an environment where performance is critical, and everything is managed code, for several years now. That might sound like an oxymoron, but our system can in fact beat the pants off all the popular native programming environments. The key to success? Thought and discipline.

We, in fact, love our single string type. And yet our team has learned (the hard way) that string allocations, while seemingly innocuous and small, spell certain death.

It may seem strange to pick on string. There are dozens of other objects you might allocate, like custom data types, arrays, lists, and whatnot. But there tend to be many core infrastructural pieces that deal with string manipulation, and if you build atop the wrong abstractions then things are sure to go wrong.

Imagine a web stack. It’s all about string parsing and processing. And anything to do with distributed processing of data is most likely going to involve strings at some level. Etc.

There are landmine APIs lurking out there, like String.Split and String.Substring. Even if you’ve got an interned string in hand (often rare in a server environment where strings are built from dynamically produced data), using these APIs will allocate boatloads of tiny little strings. And boatloads of tiny little strings means collections.

For example, imagine I just want to perform some action for each substring in a comma-delimited string. I could of course write it as follows:

string str = ...;
string[] substrs = str.Split(',');
foreach (string substr in substrs) {
    Process(substr);
}

Or I could write it as follows:

string str = ...;
int lastIndex = 0;
int commaIndex;
while ((commaIndex = str.IndexOf(',', lastIndex)) != -1) {
    Process(str, lastIndex, commaIndex);
    lastIndex = commaIndex + 1;
}
Process(str, lastIndex, str.Length); // don't forget the segment after the last comma

The latter certainly requires a bit more thought. That’s primarily because .NET doesn’t have an efficient notion of substring – creating one requires an allocation. But the performance difference is night and day. The first one allocates an array and individual substrings, whereas the second performs no allocations. If this is, say, parsing HTTP headers on a heavily loaded server, you bet it’s going to make a noticeable difference.
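Process in the snippets above is left abstract. As a sketch of what an allocation-free version might look like (the trimming logic and field-length return are my own assumptions, not from the original), it can consume (string, start, end) and read characters directly through the indexer:

```csharp
using System;

public class Parser
{
    // Hypothetical Process: trims and measures one field in place,
    // reading characters via the string indexer -- no Substring calls.
    public static int Process(string s, int start, int end)
    {
        while (start < end && s[start] == ' ') start++;   // skip leading spaces
        while (end > start && s[end - 1] == ' ') end--;   // skip trailing spaces
        return end - start;                               // field length, no allocation
    }

    static void Main()
    {
        string str = "alpha, beta ,gamma";
        int lastIndex = 0, commaIndex;
        while ((commaIndex = str.IndexOf(',', lastIndex)) != -1)
        {
            Console.WriteLine(Process(str, lastIndex, commaIndex));
            lastIndex = commaIndex + 1;
        }
        Console.WriteLine(Process(str, lastIndex, str.Length)); // final segment
    }
}
```

The entire parse touches only the original string and a few integers on the stack.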

Honestly, I’ve witnessed programs that should be I/O bound turn into programs that are compute-bound, simply due to use of inefficient string parsing routines across enormous amounts of data. (Okay, the developers also did other sloppy allocation-heavy things, but string certainly contributed.) Remember, many managed programs must compete with C++, where developers are accustomed to being more thoughtful about allocations in the context of parsing. Mainly because it’s such a pain in the ass to manage ad-hoc allocation lifetimes, versus in-place or stack-based parsing where it’s trivial.

"But gen0 collections are free," you might say. Sure, they are cheaper than gen1 and gen2 collections, but they are most certainly not free. Each collection is a linked list traversal that executes a nontrivial number of instructions and trashes your cache. It’s true that generational collectors minimize the pain, but they do not completely eliminate it. This, I think, is one of the biggest fallacies that plagues managed code to this day. Developers who treat the GC like their zero-cost scratch pad end up creating abstractions that poison the well for everybody.
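You can see the cost for yourself. Below is a small, illustrative harness (the helper and the sample HTTP request line are mine; exact collection counts will vary by runtime, GC mode, and heap configuration) that counts how many gen0 collections each parsing style triggers:

```csharp
using System;

public class Gen0Demo
{
    // Runs 'work' and returns how many gen0 collections it caused.
    public static int Gen0Delta(Action work)
    {
        GC.Collect();                       // start from a clean slate
        int before = GC.CollectionCount(0); // gen0 collections so far
        work();
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        string line = "GET /index.html HTTP/1.1";

        int splitting = Gen0Delta(() => {
            for (int i = 0; i < 1000000; i++)
                line.Split(' ');            // allocates an array + 3 strings per call
        });

        int indexing = Gen0Delta(() => {
            for (int i = 0; i < 1000000; i++) {
                int last = 0, space;
                while ((space = line.IndexOf(' ', last)) != -1)
                    last = space + 1;       // walks the fields, zero allocations
            }
        });

        Console.WriteLine("Split: " + splitting + " gen0 collections, IndexOf: " + indexing);
    }
}
```

On any configuration I'd expect the Split loop to trigger many collections and the IndexOf loop essentially none.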

Crank up .NET’s XmlReader and profile loading a modest XML document. You’ll be surprised to see that allocations during parsing add up to approximately 4X the document’s size. Many of these are strings. How did we end up in such a place? Presumably because whoever wrote these abstractions fell into the trap of believing that "gen0 collections are free." But also because layers upon layers of such things lie beneath.

It doesn’t have to be this way. String does, after all, have an indexer. And it’s type-safe! So in-place parsing at least won’t lead to buffer overruns. Sadly, I have concluded that few people, at least in the context of .NET, will write efficient string parsing code. The whole platform is written to assume that strings are available, and does not have an efficient representation of a transient substring. And of course the APIs have been designed to coax you into making copy after copy, rather than doing efficient text manipulation in place. Hell, even the HTTP and ASP.NET web stacks are rife with such inefficiencies.
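To make "in-place parsing with the indexer" concrete, here’s a minimal sketch (the ParseInt helper and the status-line example are mine, not from the post) that pulls an integer out of a larger string without ever materializing a substring:

```csharp
using System;

public class InPlace
{
    // Parses a non-negative decimal integer from s[start..end) using only
    // the indexer -- no Substring, no int.Parse, no allocations.
    public static int ParseInt(string s, int start, int end)
    {
        int value = 0;
        for (int i = start; i < end; i++)
        {
            char c = s[i];
            if (c < '0' || c > '9')
                throw new FormatException("digit expected at " + i);
            value = value * 10 + (c - '0');
        }
        return value;
    }

    static void Main()
    {
        // e.g. pulling the status code out of a response line in place
        string statusLine = "HTTP/1.1 200 OK";
        Console.WriteLine(InPlace.ParseInt(statusLine, 9, 12));
    }
}
```

And because the indexer is bounds-checked, a bad (start, end) pair throws rather than reading past the buffer.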

In certain arenas, doing all of this efficiently actually pays the bills. In other arenas, it doesn’t, and I suppose it’s possible to ignore all of this and let the GC chew up 30% or more of your program’s execution time without anybody noticing. I’m baffled that such software is written, but at the same time I realize that my expectations are out of whack with respect to common practice.

The moral of the story? Love your single string type. It’s a wonderful thing. But always remember: An allocation is an allocation; make sure you can afford it. Gen0 collections aren’t free, and software written to assume they are is easily detectable. String.Split allocates an array and a substring for each element within; there’s almost always a better way.

7 thoughts on “Beware the string”

  1. Tom

    In this example code:

    string str = ...;
    int lastIndex = 0;
    int commaIndex;
    while ((commaIndex = str.IndexOf(',', commaIndex)) != -1) {
        Process(substr, lastIndex, commaIndex);
        lastIndex = commaIndex + 1;
    }

    Is the first argument to Process() supposed to be str? I assume that it is, and it occurs to me that then inside of Process() you still have to substring the string to get the value you want to process on. How is that better than just splitting the comma-delimited string into a string array?

  2. Andrey Tamelo

    @Tom:

    You have the source string, and you also have the boundaries (start index, end index) to "extract" the substring. That is functionally equivalent to having a separately allocated substring, with the benefit of avoiding extra allocations. This saves you GC time. And as already pointed out, the difference can be huge in certain scenarios.
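For instance (a sketch; RangeEquals is a hypothetical helper, not a framework API), comparing a delimited field against a token needs no substring at all:

```csharp
using System;

public class RangeOps
{
    // Compares s[start..end) to token without allocating a substring.
    public static bool RangeEquals(string s, int start, int end, string token)
    {
        if (end - start != token.Length) return false;
        for (int i = 0; i < token.Length; i++)
            if (s[start + i] != token[i]) return false;
        return true;
    }

    static void Main()
    {
        string csv = "red,green,blue";
        // The middle field, s[4..9), compared in place:
        Console.WriteLine(RangeEquals(csv, 4, 9, "green"));
    }
}
```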

  3. alex

    Joe, perhaps the IndexOf clause should refer to lastIndex in:

    while ((commaIndex = str.IndexOf(',', commaIndex)) != -1)

    I would expect the compiler to catch the use of an uninitialized variable (if not the long string of zeroes). Maybe I am reading it wrong, or over-interpreting, but it looks like lastIndex was added to make the code clearer and the loop originally used commaIndex+1 (I remember pred() and succ() from Delphi for this). Is it really clearer to add the lastIndex variable, I wonder? It may help the conceptual clarity, but that very same clarity can hide potential errors by disconnecting the “compilers in our minds”, which in my case are slow and feeble, but essential.

    At a higher level, although the task here is rather simple, it is a lovely illustration of how our tools can get in the way of breaking a problem down into component tasks: in some ways the “better” code feels rather 1990 with the crafted loop (all it lacks is looping downwards to optimize for the CX register). Back here in the future we might prefer the first code fragment in terms of decoupling of responsibilities and, in an ideal world, on a huge string, performance, since the splitting could be distributed. In some sense we just re-implement a specific case of String.Split here.

    On a really huge string we might need to be flinging out the chunks of separated tokens from the splitters to the processors. (Of course we would also need to map or route chunks of string from the incoming stream to the splitters, and keep track of the indexes and processing around the breaks or overlaps between the chunks, which might be needed for more complex split criteria, extending the problem to audio or (moving) image segmentation.)

    The levels thing again Alex

  4. alex

    I like your transient substring. Consider processing DNA sequences (ATGC…); strings are the obvious choice (although the information has only about 4 bits per location), but strings really bite for splitting and joining primers etc. I remember interning was some help (each fragment can have a name), but there were a lot of small, essentially random strings.

    Consider that modern DNA sequencing machines dump this stuff out in gigabytes as small 100-character string fragments, and we split, compare “edit distances between strings”, splice, and walk our way through the graph of overlapping sequences, trying to optimize the path for various measures like fewest gaps, best match to a reference, etc. One ends up with arrays and sub-indexing (and in C# the stack allocation of array space). Strings are still useful for keys and a good way to pass things between parts of the processing, but the low-level machinery tends to be byte arrays. It’s an extreme version of your string split problem, but growing in importance.

    The representation of the sequences and higher-level structures (graphs of overlaps and similarities) is obviously similar to text processing. Using Linq on strings (like your splitting example) is very satisfying but suffers strongly from the lack of transient strings that you illustrate; however, there is (almost always) no permanent allocation apart from the return. Languages like R enforce such behavior, which might simplify things.

    Time to get back to work, A

  5. Jason Evans

    As with Tom’s comment, I’m unsure how to make use of the start and end index values to extract from the main string, without using either Substring(), or trying something like:

    var result = new List<string>();

    int lastIndex = 0;
    int commaIndex = 0;

    while ((commaIndex = input.IndexOf(',', lastIndex)) != -1) {
        int bufferLength = commaIndex - lastIndex;
        char[] buffer = new char[bufferLength];

        for (int idx = 0; idx < bufferLength; idx++)
            buffer[idx] = input[lastIndex + idx];

        lastIndex = commaIndex + 1;
        result.Add(new string(buffer));
    }

    if (lastIndex < input.Length)
        result.Add(input.Remove(0, lastIndex));

    I’ve checked the speed (using Stopwatch) and memory usage (via sos.dll) and I’m not seeing any benefit vs Split(). Thus I’m certain I’m not doing this correctly, so I’d appreciate it if someone could post a code example of how to use the more efficient method of splitting the string, and getting substrings via lastIndex and commaIndex.

    Cheers.

  6. George

    The idea is correct; the actual code implementation could be this:

    public void TestAlternativeStringSplit() {
        var result = new List<string>();
        string text = "a,aaa,bb,ccc";
        char separator = ',';

        int lastIndex = 0;
        int separatorIndex = 0;

        while ((separatorIndex = text.IndexOf(separator, separatorIndex)) != -1) {
            result.Add(text.Substring(lastIndex, separatorIndex - lastIndex));
            lastIndex = separatorIndex + 1;
            ++separatorIndex;
        }
        result.Add(text.Substring(lastIndex)); // Grab the last part
    }
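Wrapped up so it can be exercised (the method name and test string are mine), and with the caveat that this version still allocates every substring plus the list — the real allocation win only comes when the consumer takes (start, end) directly, as the post advocates:

```csharp
using System;
using System.Collections.Generic;

public class SplitCheck
{
    // George's loop, wrapped so the result can be compared against Split.
    public static List<string> AlternativeSplit(string text, char separator)
    {
        var result = new List<string>();
        int lastIndex = 0, separatorIndex = 0;
        while ((separatorIndex = text.IndexOf(separator, separatorIndex)) != -1)
        {
            result.Add(text.Substring(lastIndex, separatorIndex - lastIndex));
            lastIndex = separatorIndex + 1;
            ++separatorIndex;
        }
        result.Add(text.Substring(lastIndex)); // grab the last part
        return result;
    }

    static void Main()
    {
        var parts = AlternativeSplit("a,aaa,bb,ccc", ',');
        Console.WriteLine(string.Join("|", parts));
    }
}
```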

