[Scheme-reports] Requesting early review of a character sequence library

Discussion:

John Cowan

2014-12-18 01:49:02 UTC

I'd appreciate getting feedback on the pre-SRFI version of a library I
have designed for immutable shared character sequences called "spans".
It's available at <http://trac.sacrideo.us/wg/wiki/StringSlicesCowan>
(the term "string slice" is no longer used). Here are the current issues:

1) Is "character span" a satisfactory name?

2) Allow negative indices in make-span and subspan? Convenient, but
irregular.

3) Titlecase doesn't really fit; keep it?

4) Keep string trees?

5) Keep compatibility routines, possibly in a different package?

6) I have made the argument order of string-tabulate compatible with SRFI
1 list-tabulate rather than SRFI 13's string-tabulate; the discrepancy
was accidental. Revert to SRFI 13 argument order?

7) Currently string-mismatch returns a cursor into its second
argument. Should it return a cursor into the first argument, or into
both arguments? (In Chibi it makes no difference.)

8) Keep in-char-set?? It creates a dependency on SRFI 14, but is essential
for SRFI 13 emulation.

9) Bring back -ci variants of the prefix and suffix operations and
span-compare from SRFI 13?

--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
It was dreary and wearisome. Cold clammy winter still held sway in this
forsaken country. The only green was the scum of livid weed on the dark
greasy surfaces of the sullen waters. Dead grasses and rotting reeds loomed
up in the mists like ragged shadows of long-forgotten summers.
--LOTR, "The Passage of the Marshes"

Peter Bex

2014-12-18 08:50:10 UTC

Permalink

Post by John Cowan
I'd appreciate getting feedback on the pre-SRFI version of a library I
have designed for immutable shared character sequences called "spans".

This is great! A more efficient redesign of SRFI-13 was long overdue.
Here's my initial reply after a quick scan of the library.

Post by John Cowan
It's available at <http://trac.sacrideo.us/wg/wiki/StringSlicesCowan>

I'm a little worried about the duplication of -span and -string operations.
This seems confusing and causes the API surface to be pretty large.
I'd prefer to let all the "span" operations accept either a span or a
string, but that would probably slow things down considerably, again,
which defeats the point of this redesign.

Post by John Cowan
1) Is "character span" a satisfactory name?

Seems fine to me.

Post by John Cowan
2) Allow negative indices in make-span and subspan? Convenient, but
irregular.

Ugly and unSchemely.

Post by John Cowan
3) Titlecase doesn't really fit; keep it?

Too complicated and special-purpose and like you said, it doesn't fit;
drop it.

Post by John Cowan
4) Keep string trees?

I'm not sure the name is good, but the operation is potentially useful.
On the other hand, it doesn't fit so well with the design of the library;
conceptually, operating on string internals is a very different operation
than writing various objects (among which strings) to a port. It's a little
ill-defined too, especially the "otherwise, write-string-tree does nothing"
part. What about user-defined records? What about records with an associated
record printer? And so on. It's probably best to drop it.

Post by John Cowan
5) Keep compatibility routines, possibly in a different package?

Which compatibility routines are those? The remaining SRFI-13 ones?
That doesn't make much sense to me, unless you propose re-adding all the
variable argument versions as well. That's pretty ugly and defeats the
point of this library, IMHO.

Post by John Cowan
6) I have made the argument order of string-tabulate compatible with SRFI
1 list-tabulate rather than SRFI 13's string-tabulate; the discrepancy
was accidental. Revert to SRFI 13 argument order?

No; this library isn't (fully) compatible anyway. Let's fix any mistakes.
I'd rather see a brand new well-designed library that's completely
incompatible with SRFI-13 than a half-baked 70% compatible library.

Post by John Cowan
7) Currently string-mismatch returns a cursor into its second
argument. Should it return a cursor into the first argument, or into
both arguments? (In Chibi it makes no difference.)

Unsure. Maybe someone else has an idea about this.

Post by John Cowan
8) Keep in-char-set?? It creates a dependency on SRFI 14, but is essential
for SRFI 13 emulation.

I'd *really* prefer to get rid of the annoying interdependency between
SRFI-13 and SRFI-14 in this redesign. Besides, I don't see any point
in having it: character sets are not mentioned anywhere else.

If someone wants to implement just this library without intending to
implement SRFI-13, this procedure makes no sense. On the other hand,
if someone wants to implement SRFI-13, they can do it in terms of the
primitives in their Scheme implementation. This *would* cause some
unportability, but we still have cond-expand, and most current Schemes
ship SRFI-13 already anyway.

It may make more sense to write another "bridge" library between this
and (the new version of) SRFI-14. Passing a predicate lambda to search
for will likely be less efficient than having specialised searching
functions for characters, character sets etc.

On that note: I'd like to see */char variants of the searching functions.
And perhaps */char-list, which can be implemented efficiently by inlining
eqv? inside span-find/char-list for example. This would return a cursor
to the index of the first occurrance of any character in the list.

Post by John Cowan
9) Bring back -ci variants of the prefix and suffix operations and
span-compare from SRFI 13?

These would probably be useful.

Cheers,
Peter

--
http://www.more-magic.net

John Cowan

2014-12-18 13:13:39 UTC

Permalink

Peter Bex scripsit:

Thanks for the enthusiastic and early feedback!

Post by Peter Bex
I'd prefer to let all the "span" operations accept either a span or a
string, but that would probably slow things down considerably, again,
which defeats the point of this redesign.

Indeed. As I keep telling people, Scheme is relentlessly monomorphic
(even though some implementations, like Gauche and Kawa, have their
own private polymorphism).

Post by Peter Bex
conceptually, operating on string internals is a very different operation
than writing various objects (among which strings) to a port. It's a little
ill-defined too, especially the "otherwise, write-string-tree does nothing"
part.

I'll make that "takes an implementation-specific action".

Post by Peter Bex

Post by John Cowan
5) Keep compatibility routines, possibly in a different package?

Which compatibility routines are those? The remaining SRFI-13 ones?

No, the ones in the section marked "Compatibility" (bad name, I'll
change it). Basically the case-folding and comparison routines.
The sample implementations of these will have to reify their arguments
as strings and then go through the local string-upcase, string=?,
etc, to ensure compatibility, since R7RS-small allows a fair amount of
implementation-dependence here. On the other hand, span programming
without them wouldn't really fly.

I think we are stuck with them, and they can be made more efficient by
implementers who can allow access to the internal conversion/comparison
tables.

Post by Peter Bex

No; this library isn't (fully) compatible anyway. Let's fix any mistakes.

Agreed; issue closed.

Post by Peter Bex
I'd rather see a brand new well-designed library that's completely
incompatible with SRFI-13 than a half-baked 70% compatible library.

Well, I think we need some measure of compatibility, but I agree that we
don't need this much. Arguably it was SRFI 1 that made the mistake (procs
normally come first in argument order), but that's long past fixing.

Post by Peter Bex

Post by John Cowan
8) Keep in-char-set?? It creates a dependency on SRFI 14, but is essential
for SRFI 13 emulation.

That is the point of it; instead of making every procedure that accepts
a pred also accept a charset, you can do things like

(span-every (in-char-set? char-set:ascii) foo)

to verify that the span 'foo' is made entirely of ASCII characters,
without having to write an 'ascii?' predicate.

Arguably this should have been in SRFI 14, but I'm not currently interested
in replacing that SRFI.

Post by Peter Bex
If someone wants to implement just this library without intending to
implement SRFI-13, this procedure makes no sense.

"Emulation" was a poorly chosen term. I meant that if you want to do
things like the line above, you need this procedure. On the other
hand, it's a two-liner.

Post by Peter Bex
It may make more sense to write another "bridge" library between this
and (the new version of) SRFI-14. Passing a predicate lambda to search
for will likely be less efficient than having specialised searching
functions for characters, character sets etc.

I'll think about such a library.

Post by Peter Bex
On that note: I'd like to see */char variants of the searching functions.
And perhaps */char-list, which can be implemented efficiently by inlining
eqv? inside span-find/char-list for example. This would return a cursor
to the index of the first occurrance of any character in the list.

I'll think about those too.

Post by Peter Bex

Post by John Cowan
9) Bring back -ci variants of the prefix and suffix operations and
span-compare from SRFI 13?

These would probably be useful.

But expensive in terms of calling span->string, then string->fold, etc. etc.

--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
How they ever reached any conclusion at all is starkly unknowable
to the human mind. --"Backstage Lensman", Randall Garrett

Jonathan A Rees

2014-12-18 14:28:17 UTC

Permalink

Can you say what your goals are?

-- apologies for brevity / using handheld gizmo --

Post by John Cowan
I'd appreciate getting feedback on the pre-SRFI version of a library I
have designed for immutable shared character sequences called "spans".
It's available at <http://trac.sacrideo.us/wg/wiki/StringSlicesCowan>
1) Is "character span" a satisfactory name?
2) Allow negative indices in make-span and subspan? Convenient, but
irregular.
3) Titlecase doesn't really fit; keep it?
4) Keep string trees?
5) Keep compatibility routines, possibly in a different package?
6) I have made the argument order of string-tabulate compatible with SRFI
1 list-tabulate rather than SRFI 13's string-tabulate; the discrepancy
was accidental. Revert to SRFI 13 argument order?
7) Currently string-mismatch returns a cursor into its second
argument. Should it return a cursor into the first argument, or into
both arguments? (In Chibi it makes no difference.)
8) Keep in-char-set?? It creates a dependency on SRFI 14, but is essential
for SRFI 13 emulation.
9) Bring back -ci variants of the prefix and suffix operations and
span-compare from SRFI 13?
--
It was dreary and wearisome. Cold clammy winter still held sway in this
forsaken country. The only green was the scum of livid weed on the dark
greasy surfaces of the sullen waters. Dead grasses and rotting reeds loomed
up in the mists like ragged shadows of long-forgotten summers.
--LOTR, "The Passage of the Marshes"
_______________________________________________
Scheme-reports mailing list
http://lists.scheme-reports.org/cgi-bin/mailman/listinfo/scheme-reports

John Cowan

2014-12-19 00:27:45 UTC

Permalink

Post by Jonathan A Rees
Can you say what your goals are?

Here's the content of the Rationale section:

When SRFI 13 was defined in 1999, it was intended to provide
efficient string operations on both whole strings and substrings. At
that time, only Guile and T provided true shared copy-on-write
substrings, and SRFI 13 could not reasonably require them of a Scheme
implementation. Consequently, almost all the SRFI 13 procedures accept
optional start and end arguments for each of the string arguments,
indexing the beginning and the end of the substring(s) to be operated on.

Unfortunately, variable-arity procedures are often slow and may not
interact well with type checking in Schemes that provide it. In addition,
it is now fairly common to store strings internally as UTF-8 or UTF-16
code unit sequences, which means that indexing operations are often O(n)
rather than O(1), and string mutation can be extremely expensive if the
storage used for the string needs to be expanded and the implementation
does not use an indirect pointer to it (as in Chicken).

As for shared substrings, they are no more common in 2015 than they were
in 1999. Fortunately, however, since then it has become normal for Schemes
to provide user-defined records, and they are required by both R6RS and
R7RS. This makes it possible to portably provide a representation for
a segment of a string, provided the string is never mutated. The most
portable such record consists of a string and two indexes, but other
more efficient representations may be used instead.

This proposal, therefore, is intended to help move the practice of Scheme
programming away from mutable strings, string indexes, and SRFI 13,
while maintaining as much backward compatibility as is consistent with
these goals. It does not require any particular run-time efficiencies
from its procedures. The string procedures, as well as string-transform,
make it possible to migrate a code base gradually.

It is also possible to implement character spans as ropes (trees of
strings), which makes concatenation more efficient at the expense of
more complex cursor objects and/or slower conversion to strings. For
this reason, as well as for security and efficiency reasons, there is
no operation to retrieve an underlying string from a character span,
as there may be more than one such string or none at all. The operations
provided here (with the exception of those in the Compatibility section)
are entirely independent of the character repertoire supported by the
implementation. In particular, this means that the case-insensitive
procedures of SRFI 13 are excluded. There is also no provision for R6RS
normalization procedures or for an string->integer procedure that was
proposed for SRFI 13 but not included. These may appear in future SRFIs.

--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
Go, and never darken my towels again!
--Rufus T. Firefly

Per Bothner

2014-12-19 01:13:52 UTC

Permalink

I didn't find any mention of spans over mutable strings.
One can make it undefined, or an error; if the latter, is it one that needs to be caught?
(For example, each string and span could have a timestamp: the string timestamp
is increment when it is modified; a span timestamp copies that of the underlying
string when the span is created; a span is invalid if its timestamp is less than
the underlying string. Or just make it caveat programmer, which would be my preference.)

Is this proposed as part of a future Scheme specification? I really hope not.
I find the duplicate set of procedures a very ugly API, and would be strongly against
any language standard that has (for example) both string-any and span-any,
or even both span-ref and string-ref. What does this API give me that
substring/shared doesn't? Of course I know the answer: The span API can
be implemented portably without modifying the underlying string representation.
But so can substring/shared since a valid implementation is string-copy.

I see the use-case for this API: As a portable library with guaranteed
sharing, not requiring changing the string implementation. But I really don't
want to encourage people to program in this API. As a language designer and
implementor, I'd rather change the underlying string representation in my
implementation, and tell people to use substring/shared. That makes for a
much simpler and elegant language.

--
--Per Bothner
***@bothner.com http://per.bothner.com/

Alex Shinn

2014-12-19 01:45:08 UTC

Permalink

Post by Per Bothner
Is this proposed as part of a future Scheme specification? I really hope not.
I find the duplicate set of procedures a very ugly API, and would be strongly against
any language standard that has (for example) both string-any and span-any
[...]

+1

From the discussions I had seen I thought the general direction
for a new string API would be based on string-cursors, which have
been implemented and used for some time in multiple Schemes.
String cursors have the advantage that they are a different way to
access the same underlying datatype, so you don't need to
duplicate every API that works with textual data.

For special optimization cases something like this library may be
useful, but I'd like to see it implemented and used for a while before
we decide whether it's good for general purpose code, and further
whether to standardize it.

--
Alex

John Cowan

2014-12-19 04:54:01 UTC

Permalink

Post by Alex Shinn
From the discussions I had seen I thought the general direction
for a new string API would be based on string-cursors, which have
been implemented and used for some time in multiple Schemes.

I do provide the whole Chibi string-cursor API, I just provide a
higher level built on it that emulates strings, but with the
shareability/immutability advantages.

Post by Alex Shinn
For special optimization cases something like this library may be
useful, but I'd like to see it implemented and used for a while before
we decide whether it's good for general purpose code, and further
whether to standardize it.

Agreed, but SRFIs are not standards.

--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
My corporate data's a mess!
It's all semi-structured, no less.
But I'll be carefree / Using XSLT
On an XML DBMS.

Alex Shinn

2014-12-19 08:17:03 UTC

Permalink

Post by John Cowan

I do provide the whole Chibi string-cursor API, I just provide a
higher level built on it that emulates strings, but with the
shareability/immutability advantages.

The problem is you force a duplication of APIs. With a string API
you only have one type, and libraries may or may not provide
optional start/end arguments (probably yes when inputs may be
large as in SRFI-115, probably no for things like dealing with file
names). If you introduce string cursors you can use the same
APIs and either ignore the start/end arguments or convert if
the source and target are using different conventions. However,
if you introduce a character-span type you either need two versions
of every API or you need to convert from spans to strings, which
is potentially expensive.

Post by John Cowan
For special optimization cases something like this library may be

Post by Alex Shinn
useful, but I'd like to see it implemented and used for a while before
we decide whether it's good for general purpose code, and further
whether to standardize it.

Agreed, but SRFIs are not standards.

I was +1'ing Per's sentiment that this not be part of a standard.
From the lists you're sending it to it would seem this is intended for R7RS
large.

--
Alex

Peter Bex

2014-12-19 08:27:40 UTC

Permalink

Post by Alex Shinn

Post by Alex Shinn
For special optimization cases something like this library may be

Post by Alex Shinn
useful, but I'd like to see it implemented and used for a while before
we decide whether it's good for general purpose code, and further
whether to standardize it.

Agreed, but SRFIs are not standards.

I was +1'ing Per's sentiment that this not be part of a standard.

Post by Alex Shinn
From the lists you're sending it to it would seem this is intended for R7RS

large.

You make a strong argument, and I would like to express my agreement.
The duplication is too ugly to be standardized. I still believe the
optional start/end argument stuff is a problem to be solved, but not
in this way.

Cheers,
Peter

--
http://www.more-magic.net

John Cowan

2014-12-24 02:01:14 UTC

Permalink

Post by Alex Shinn
The problem is you force a duplication of APIs.

I hear you loud and clear, as Olin used to say in these
situations. I have abandoned the string API in favor of
just the span API, and moved the current description to
<http://trac.sacrideo.us/wg/wiki/CharacterSpansCowan>. However, I will
retain the string API in the sample implementation, so that those who
want it will have it.

Post by Alex Shinn
If you introduce string cursors you can use the same APIs and either
ignore the start/end arguments or convert if the source and target are
using different conventions.

Yes, though that conversion isn't free: you have to do an unbounded
amount of work to convert cursors to indexes or vice versa, and if
your target API does not accept either cursors or indexes, you end up
creating new strings anyway.

Post by Alex Shinn
However, if you introduce a character-span type you either need two
versions of every API or you need to convert from spans to strings,
which is potentially expensive.

No more so than in the cases above, at least when spans are implemented
as a string and two cursors. I have modified the language so as not to
suggest that other implementations of spans are equally appropriate.

--
weirdo: When is R7RS coming out?
Riastradh: As soon as the top is a beautiful golden brown and if you
stick a toothpick in it, the toothpick comes out dry.

John Cowan

2014-12-19 04:51:36 UTC

Permalink

I didn't find any mention of spans over mutable strings. One can make
it undefined, or an error; if the latter, is it one that needs to be
caught?

It's an error, which does not prohibit implementers from allowing it as
an extension.

Is this proposed as part of a future Scheme specification? I really hope not.

That decision is premature. Here the question is: if we are to have
such a thing, what should it look like?

I find the duplicate set of procedures a very ugly API, and would be
strongly against any language standard that has (for example) both
string-any and span-any, or even both span-ref and string-ref.

I don't usually talk about libraries in the SRFIs I write, but maybe it
makes sense to put the two sets into different libraries.

What does this API give me that substring/shared doesn't? Of
course I know the answer: The span API can be implemented portably
without modifying the underlying string representation. But so can
substring/shared since a valid implementation is string-copy.

Valid but inefficient. The idea here is to have something both portable
and efficient.

But I really don't want to encourage people to program in this
API. As a language designer and implementor, I'd rather change the
underlying string representation in my implementation, and tell people
to use substring/shared. That makes for a much simpler and elegant
language.

I agree, but most Schemers aren't in that position. And if true shared
strings are such a great idea, why are they confined to Guile alone?
(Guile does copy-on-write transparent sharing as well as explicit
sharing.) Because they are pain to implement? I suspect so.

--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
If I have seen farther than others, it is because I was standing on
the shoulders of giants. --Isaac Newton