[Scheme-reports] Regarding the String & Character Spans Library

Jesse Towner

2015-02-03 06:17:00 UTC

Hello, I recently took an interest in the direction R7RS-large would be
taking with the strings library and so I began to investigate
CharacterSequencesCowan (
http://trac.sacrideo.us/wg/wiki/CharacterSpansCowan ) and its predecessor.
After mulling it over, I've come to the conclusion that the spans library
isn't really something I want nor need as a language user. While I
absolutely agree that R7RS-large needs a string library, and SRFI-13 needs
to be updated, I don't think it's the right primitive for the job, and
instead think using string cursors directly when needed is a better
approach.

Consider the added costs of type-checking to the string cursor procedures
in order to determine if the first input argument is either a span or a
string (string-cursor-start, string-cursor-end, string-cursor-ref,
string-cursor-next, string-cursor-prev, etc.). These procedures are going
to be called from your inner most loops and so it would be best if they
were as efficient as possible and didn't need to check types. The original
proposal of course had separate cursor procedures for both strings and
span, but even if this were the case, I still don't think spans are a good
idea.

Whenever you need to do some string processing, you're going to need to
start with actual string objects. Whether you use indices, cursors, or
spans, the results of this processing is likely going to need to be
converted back to string objects eventually so that it can be displayed to
the terminal, or written to a file/network port, or perhaps stored back in
some sort of runtime data structure. For this later case, your data
structures could simply store span objects that wrap newly created strings,
but this is unneeded baggage when say you only need to mutate strings on
occasion. And if I really want to store a cursor or pair of cursors into a
string, I could just just store the string and the cursors into my data
structure. I don't think it's worth the cost of adding a new span type and
a host of span-related functions to the language to support this edge case.

Now, let's consider how spans might work when doing some string processing.
Consider a pathname-split procedure that takes a string containing a file
path and splits it into its directory path, file name, and extension
components. Let's say the file path I'm splitting is
"~/src/utilities/frobnicate.scm", and so the expected components would be
"~/src/utilities/", "frobnicate" and ".scm". If I'm implementing this
procedure using spans, I would end up with three span objects, but the end
cursor from one span would need to be be duplicated as the start cursor for
the subsequent span. This requires more storage space for the duplicate
string (it's the same string for all spans) and cursor cons cells, than if
I were to just have variables for the cursors that I needed. It's just not
as tuned memory-wise as it could be. And in order to duplicate these
cursors across spans, I may have to call string-cursor-end on the previous
span, which then requires the previously mentioned type check and dispatch.
I can avoid all this if I use only string cursors, and the string cursor
procedures always expect string objects and so don't need to do any type
checking. Now, you might consider there might be some performance gains by
using spans so as not to create new substrings of the file path components.
But what am I going to do with them? I'm probably going to append the
filename and extension to a new directory path, or I'm going to take the
take directory path by itself, and I'm going to fire it off to a file
system procedure, or I'm going to save the string out to a file. So I need
to convert it to a string anyway. It would be better if the string library
just offered a string-join/cursor procedure that took six arguments, one
for each string, and two for each pair of cursors, for the substrings I
want to join for this use case.

If I'm working with very large strings, and want to have views into this
string to avoid creating copies, again I'm probably just going to use
string cursors directly to avoid the overhead of working with spans, and so
I would want a string library highly tuned for cursors. Alternatively, if
I'm mutating subsections of a very large string, I'd consider using a
different data structure altogether, such as a rope.

Now, I don't think the concept of a span is a bad idea, I just don't think
it's a good fit for Scheme. For example, the string_view template class was
recently accepted for standardization into a post-C++14 library TR, and
it's pretty much the same idea as the character span. The reason it works
well in C++ is largely because of C++'s compile-time polymorphism for
templates, the so the generic template functions you write will work with
any type that implements the same interface or concept, with essentially
zero runtime overhead. Furthermore, the compilers are very mature and can
inline everything, optimize out loop invariants, merge duplicated pointer
variables into the strings within objects allocated on the stack, and so on.

As was mentioned in the previous thread, Scheme is more monomorphic. And so
adding a new span type that acts like a string and trying to make it a
first-class citizen that interoperates well with the rest of the libraries
just puts more burden on language implementors.

I think a better approach would be along one of two paths:

1) Bring in most of SRFI-13, update it for consistency with the rest of the
R7RS language, and then create variants of the procedures with a suffix of
"/cursor" appended to their identifier names that work with string cursors
in place of indices. For example, you would have string-take/cursor and
string-drop/cursor procedures, but no string-take-right/cursor and
string-drop-right/cursor procedures, as the later two only make sense with
character counts from the end of a string.

Overall, this path would also help ease porting old code.

2) If the above is too large of an API surface, or if you really want to
force people to uses cursors over indices, bring in SRFI-13 but change all
of the procedures to work with cursors instead of indices. This is more or
less what Chibi Scheme's string library does or is in the process of
becoming from what I've seen.

Some final thoughts on string cursors: while I think they're significantly
better over the old paradigm, I still think there is room for improvement.
Cursors eliminate the O(n) overhead of using indices to be sure, but there
can still be a significant amount of constant time overhead due to the
decoding of the UTF-8 or UTF-16 code units into code points. This entirely
unnecessary when doing string or substring comparisons--you can just
compare the code unit sequences directly. I'm not decided on if it would
make sense to expose a couple of lower-level procedures to give users
access to the underlying code units pointed to by string cursors, or if
that's too much effort for the payoff. After all, if you really need the
performance for the string library, maybe it's better for implementors to
just write such procedures as builtin native code functions through the FFI?

Anyway, sorry for the long rant, but I hope it offers a constructive
critique.