So there really are quite substantial performance gains to be had by dropping to lower and lower abstraction levels. The biggest gain comes just from using Substring over String, but even the gains from Unicode scalars and UTF-8 are big enough to consider using them if possible.
Now how do we apply what we’ve learned to parsers?
Well, so far all of our string parsers have been defined on Substring, and we’ve used lots of string APIs such as removeFirst, prefix and range subscripting. As we have just seen in very clear terms, these operations can be a little slow on Substring because of the extra work that must be done to properly handle traversing over grapheme clusters and normalized characters. The time differences may not seem huge, measured in just a few microseconds, but if you are parsing a multi-megabyte file that can really add up.
So, let’s see what kind of performance gains can be had by switching some of our parsers to work with UTF-8 instead of Substring.
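To make the comparison concrete, here is a rough sketch of the same small parser written twice: once against Substring using prefix and removeFirst, and once against Substring.UTF8View using the same two operations on raw bytes. The function names parseIntFromSubstring and parseIntFromUTF8 are hypothetical and are not the parsers from this series; they only illustrate what changes when we drop down a level.

```swift
// Hypothetical helper (not from the series): consumes leading ASCII digits
// from a Substring. Both prefix(while:) and removeFirst(_:) walk the input
// one Character (grapheme cluster) at a time.
func parseIntFromSubstring(_ input: inout Substring) -> Int? {
    let digits = input.prefix(while: { $0.isASCII && $0.isNumber })
    guard let value = Int(digits) else { return nil }  // nil if empty or overflowing
    input.removeFirst(digits.count)
    return value
}

// Hypothetical helper (not from the series): the same logic on the UTF-8
// view. Here prefix(while:) and removeFirst(_:) step over single bytes, so
// no grapheme breaking or normalization work is needed.
func parseIntFromUTF8(_ input: inout Substring.UTF8View) -> Int? {
    let zero = UInt8(ascii: "0")
    let nine = UInt8(ascii: "9")
    let digits = input.prefix(while: { (zero...nine).contains($0) })
    guard !digits.isEmpty else { return nil }

    var value = 0
    for byte in digits {
        // Accumulate the decimal value, bailing out on overflow.
        let (scaled, overflowedMul) = value.multipliedReportingOverflow(by: 10)
        let (sum, overflowedAdd) = scaled.addingReportingOverflow(Int(byte - zero))
        if overflowedMul || overflowedAdd { return nil }
        value = sum
    }
    input.removeFirst(digits.count)
    return value
}

// Usage: both versions consume "42" and leave " apples" behind.
var text: Substring = "42 apples"
let fromSubstring = parseIntFromSubstring(&text)

var bytes = Substring("42 apples").utf8
let fromUTF8 = parseIntFromUTF8(&bytes)
```

The two functions have the same shape, which is the point: the API surface barely changes, but the UTF-8 version never has to reason about characters. The trade-off is that it only understands ASCII digits as bytes, so anything that needs Unicode-aware comparison would have to be handled explicitly.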