Discussion: [Scheme-reports] vector-insert
Sascha Ziemann
2014-08-19 15:30:56 UTC
I am wondering if R7RS lacks a vector-insert.

If you want to write a vector-insert in R7RS, you have to:
- calculate the length of the new vector
- create a new vector with make-vector
- copy the old content with vector-copy to the beginning and the end
- copy the new content into the middle with vector-copy

This requires twice as many assignments as necessary, because make-vector
first initializes the new vector with unspecified values, and after that
every element of the new vector is assigned again by the vector-copy calls.
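
For illustration, a minimal sketch of such a portable vector-insert using
only R7RS-small procedures (the name vector-insert and its argument order
are just my choice for the example):

(define (vector-insert vec pos ins)
  ;; Return a new vector with the elements of INS spliced into VEC at POS.
  (let* ((n (vector-length vec))
         (m (vector-length ins))
         (result (make-vector (+ n m))))     ; every slot gets a throwaway fill
    (vector-copy! result 0 vec 0 pos)        ; head of the old content
    (vector-copy! result pos ins)            ; the inserted content
    (vector-copy! result (+ pos m) vec pos)  ; tail of the old content
    result))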

I think the only way to avoid this is to have a dedicated vector-insert.

Sascha
Taylan Ulrich Bayirli/Kammer
2014-08-19 16:15:52 UTC
Post by Sascha Ziemann
I am wondering if R7RS lacks a vector-insert.
This comes down to resizable vectors in general, no?

How useful of a data structure are they? If resizing/insertion is a
common operation, there might be more suitable data structures?

My only experience is with Apple's peculiar implementation of the
NSMutableArray Objective-C class, which falls back to internally using
something more akin to a hash table after all:
http://ridiculousfish.com/blog/posts/array.html

Also if I remember correctly, resizable vectors incur a general overhead
on access, in addition to implementation complexity. So they should
likely be a separate library, no?

Taylan
John Cowan
2014-08-19 16:56:01 UTC
Post by Taylan Ulrich Bayirli/Kammer
Post by Sascha Ziemann
I am wondering if R7RS lacks a vector-insert.
This comes down to resizable vectors in general, no?
I interpret the remarks about assignment as showing a desire to avoid
initializing a newly created vector only to clobber all of its locations
immediately thereafter. This is still only O(n), but the constant factor
plus the page churn, if the vector is large, may be undesirable.

A possible implementation-specific approach would be to treat
`make-vector` without a fill argument as turning on a bit which says
"Don't allow `vector-ref` (vel sim.) on this vector, and don't follow
its elements for GC purposes." Then the obvious vector-insert algorithm
would Just Work, provided there is some way of clearing the bit when
the programmer knows that every element has been set.
Post by Taylan Ulrich Bayirli/Kammer
How useful of a data structure are they? If resizing/insertion is a
common operation, there might be more suitable data structures?
See BuffersCowan for mutable resizable (within limits) vectors.
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
What is the sound of Perl? Is it not the sound of a [Ww]all that people
have stopped banging their head against? --Larry
Alan Watson
2014-08-19 17:10:15 UTC
Post by John Cowan
A possible implementation-specific approach would be to treat
`make-vector` without a fill argument as turning on a bit which says
"Don't allow `vector-ref` (vel sim.) on this vector, and don't follow
its elements for GC purposes." Then the obvious vector-insert algorithm
would Just Work, provided there is some way of clearing the bit when
the programmer knows that every element has been set.
If I understand correctly, when the first element is assigned to the new vector, surely the implementation would have to clear the bit and initialise the other elements?

I don’t doubt there are ways to optimise this, but I don’t think a naive implementation of this helps.

Regards,

Alan
John Cowan
2014-08-19 17:22:20 UTC
Post by Alan Watson
If I understand correctly, when the first element is assigned to the
new vector, surely the implementation would have to clear the bit and
initialise the other elements?
That would be the safe thing, yes, but R7RS does not explicitly require either
safety or unsafety. I would not want to standardize anything that can't be
made safe, but an unsafe implementation that can only be used in limited
ways (viz. you are not allowed to gc the source vector until the destination
vector is fully initialized) isn't necessarily unusable.
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
At the end of the Metatarsal Age, the dinosaurs abruptly vanished.
The theory that a single catastrophic event may have been responsible
has been strengthened by the recent discovery of a worldwide layer of
whipped cream marking the Creosote-Tutelary boundary. --Science Made Stupid
Pierpaolo Bernardi
2014-08-19 18:26:11 UTC
Post by John Cowan
Post by Alan Watson
If I understand correctly, when the first element is assigned to the
new vector, surely the implementation would have to clear the bit and
initialise the other elements?
That would be the safe thing, yes, but R7RS does not explicitly require either
safety or unsafety. I would not want to standardize anything that can't be
made safe, but an unsafe implementation that can only be used in limited
ways (viz. you are not allowed to gc the source vector until the destination
vector is fully initialized) isn't necessarily unusable.
I don't think this is an operation useful enough to spend brain cycles
on possible ways to optimize it. But, if we really need this, here's
my proposal:

(make-vector-from-pieces dim default init...)

where each of the inits specifies a range to be filled in the new
vector, and an existing vector with a starting point, to be copied in
that range. Ranges not included in any /init/ will be initialized
with /default/, as in make-vector.

This function can be implemented trivially in any Scheme, and an
implementor can supply a version for his implementation which avoids
double assignments in the newly created vector, and is also safe.

P.
John Cowan
2014-08-19 18:56:58 UTC
Post by Pierpaolo Bernardi
(make-vector-from-pieces dim default init...)
where each of the inits specifies a range to be filled in the new
vector, and an existing vector with a starting point, to be copied in
that range. Ranges not included in any /init/ will be initialized
with /default/, as in make-vector.
That's an interesting variant of vector-append. I've added it to
VectorsCowan, my proposed vector library (a superset of SRFI 43).
Post by Pierpaolo Bernardi
This function can be implemented trivially in any Scheme, and an
implementor can supply a version for his implementation which avoids
double assignments in the newly created vector, and is also safe.
Indeed.
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
--The Linux-nationale by Greg Baker
Pierpaolo Bernardi
2014-08-20 00:33:29 UTC
Post by John Cowan
Post by Pierpaolo Bernardi
(make-vector-from-pieces dim default init...)
where each of the inits specifies a range to be filled in the new
vector, and an existing vector with a starting point, to be copied in
that range. Ranges not included in any /init/ will be initialized
with /default/, as in make-vector.
That's an interesting variant of vector-append. I've added it to
VectorsCowan, my proposed vector library (a superset of SRFI 43).
Post by Pierpaolo Bernardi
This function can be implemented trivially in any Scheme, and an
implementor can supply a version for his implementation which avoids
double assignments in the newly created vector, and is also safe.
Indeed.
And here, attached below, is a sample implementation written in Racket,
plus pseudocode for a possible implementation-specific version.
(define vect #(a b c d e f g h i j))

(make-vector-from-pieces 16 'default
                         (range 0 5 vect 0)
                         (range 10 15 vect 5))
=> #(a b c d e default default default default default f g h i j default)

========
;; Tested only lightly. Do not trust this code.

#lang racket

(struct range
  (from to from-vector from-vector-start))

(define (make-vector-from-pieces dim default . inits)
  (define sorted-inits (sort inits range-<))
  (let ((vector (make-vector dim)))
    (let loop ((cur 0) (inits sorted-inits))
      (cond ((null? inits)
             ;; fill everything after the last range with the default
             (vector-fill vector cur dim default))
            (else
             (let ((init (first inits)))
               ;; fill the gap before this range, then copy the range in
               (vector-fill vector cur (range-from init) default)
               (vector-block-copy vector init)
               (loop (range-to init) (rest inits))))))
    vector))

(define (range-< range1 range2)
  (< (range-from range1)
     (range-from range2)))

(define (vector-block-copy vector range)
  (define to (range-to range))
  (define from-vector (range-from-vector range))
  (let loop ((cur (range-from range))
             (i (range-from-vector-start range)))
    (when (< cur to)
      (vector-set! vector cur (vector-ref from-vector i))
      (loop (add1 cur) (add1 i)))))

(define (vector-fill vector from to value)
  (let loop ((from from))
    (when (< from to)
      (vector-set! vector from value)
      (loop (add1 from)))))

#|
;; implementation-specific version
(define (make-vector-from-pieces dim default . inits)
  (define sorted-inits (sort inits range-<))
  (with-gc-off
   (lambda ()
     (let ((vector (make-uninitialized-vector dim)))
       (let loop ((cur 0) (inits sorted-inits))
         (cond ((null? inits)
                (vector-fill vector cur dim default))
               (else
                (let ((init (first inits)))
                  (vector-fill vector cur (range-from init) default)
                  (vector-block-copy vector init)
                  (loop (range-to init) (rest inits))))))
       vector))))
|#
Per Bothner
2014-08-20 21:31:42 UTC
Post by Taylan Ulrich Bayirli/Kammer
Also if I remember correctly, resizable vectors incur a general overhead
on access, in addition to implementation complexity. So they should
likely be a separate library, no?
Simple re-sizable vectors require an indirection that might otherwise
not be needed. This overhead is much less than using linked lists.

Many programs need to "build" a sequence, typically appending each element
at the end as it is produced. Using an array that gets reallocated (doubled)
when full is usually the most efficient representation. Having a standard
re-sizable (growable) vector type would simplify the programming.
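
A rough sketch of such a growable vector in portable R7RS terms (this is not
Kawa's API; all names here are invented, and only push-at-the-end is shown):

(define-record-type <gvector>
  (%make-gvector data size)
  gvector?
  (data gvector-data set-gvector-data!)
  (size gvector-size set-gvector-size!))

(define (make-gvector) (%make-gvector (make-vector 8) 0))

(define (gvector-push! gv x)
  (let ((data (gvector-data gv))
        (size (gvector-size gv)))
    (when (= size (vector-length data))
      ;; full: double the backing storage and copy the old elements over
      (let ((new (make-vector (* 2 (vector-length data)))))
        (vector-copy! new 0 data)
        (set-gvector-data! gv new)))
    (vector-set! (gvector-data gv) size x)
    (set-gvector-size! gv (+ size 1))))

(define (gvector->vector gv)
  (vector-copy (gvector-data gv) 0 (gvector-size gv)))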

I added string-append! to Kawa. It seems to be a good idea to also add vector-append!.
--
--Per Bothner
***@bothner.com http://per.bothner.com/
Vassil Nikolov
2014-08-21 00:19:48 UTC
Post by Per Bothner
...
Many programs need to "build" a sequence, typically appending each element
at the end as it is produced. Using an array that gets reallocated (doubled)
when full is usually the most efficient representation. Having a standard
re-sizable (growable) vector type would simplify the programming.
I added string-append! to Kawa. It seems to be a good idea to also add
vector-append!.
As a corroborating side note (for what it's worth), the designers of Common
Lisp saw fit to include VECTOR-PUSH and VECTOR-PUSH-EXTEND, which also cover
strings. Unlike NCONC for lists, however, Common Lisp does not offer a
primitive to append a whole (sub)vector destructively. This can be done with
a call to REPLACE [*].

Then Common Lisp has nothing like `vector-insert'. For the non-destructive
variant, one can compose CONCATENATE and SUBSEQ. The destructive variant can
be done with two calls to REPLACE [*].

_________
[*] possibly after calling ADJUST-ARRAY, and incrementing a fill pointer as
    needed

---Vassil.
--
Vassil Nikolov | Васил Николов | <***@pobox.com>

"Be careful how you fix what you don't understand." (Brooks 2010, 185)
Taylan Ulrich Bayirli/Kammer
2014-08-21 09:03:22 UTC
Post by Per Bothner
Post by Taylan Ulrich Bayirli/Kammer
Also if I remember correctly, resizable vectors incur a general
overhead on access, in addition to implementation complexity. So
they should likely be a separate library, no?
Simple re-sizable vectors require an indirection that might otherwise
not be needed. This overhead is much less than using linked lists.
For accessing elements, yes.
Post by Per Bothner
Many programs need to "build" a sequence, typically appending each
element at the end as it is produced. Using an array that gets
reallocated (doubled) when full is usually the most efficient
representation. Having a standard re-sizable (growable) vector type
would simplify the programming.
I think the typical way of doing this is to cons up a reverse list,
reverse! it, and use list->vector. If the sequence is built at once
then this should be fine.
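
For instance, a trivial sketch of that idiom (the example just collects
squares):

(define (squares-vector n)
  (let loop ((i 0) (acc '()))
    (if (= i n)
        (list->vector (reverse acc))   ; one reversal, one conversion at the end
        (loop (+ i 1) (cons (* i i) acc)))))

(squares-vector 5)  ; => #(0 1 4 9 16)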

If the sequence needs to be accessed and expanded together efficiently
then that doesn't work and a different data structure is needed. I
don't know how common of a situation that is, and how undesirable the
intrinsic overhead is for situations that don't need it.

One of the nice things about Scheme in my opinion is that it doesn't
overload much functionality into one API and instead offers specialized
APIs which allow for efficient code; that is in contrast to other "very
high level" or "scripting" languages which trade off much more
efficiency for programmer convenience. I suspect that if we force the
vector API to use resizable vectors, some users will end up wishing for
a static-size vector API (like fixnums). I don't know which use-case
should be marginalized. Perhaps the best thing to do is to go by what
most implementations currently do.

We could also leave it up to an implementation for now whether the
normal vector API should work on vectors created by a hypothetical
resizable vector API. Forcing their merge can be left to the future,
whereas breaking them apart later would be backwards incompatible.

Tell me if I'm bikeshedding. :-)

Taylan
Per Bothner
2014-08-21 16:08:54 UTC
Post by Taylan Ulrich Bayirli/Kammer
Post by Per Bothner
Simple re-sizable vectors require an indirection that might otherwise
not be needed. This overhead is much less than using linked lists.
For accessing elements, yes.
Constructing a vector by appending to the end is also likely to be more efficient
than constructing a list: Less memory allocation/gc; better memory locality; about
the same memory needs.
Post by Taylan Ulrich Bayirli/Kammer
Post by Per Bothner
Many programs need to "build" a sequence, typically appending each
element at the end as it is produced. Using an array that gets
reallocated (doubled) when full is usually the most efficient
representation. Having a standard re-sizable (growable) vector type
would simplify the programming.
I think the typical way of doing this is to cons up a reverse list,
reverse! it, and use list->vector. If the sequence is built at once
then this should be fine.
Using a growable vector will be more efficient, and less programming.
--
--Per Bothner
***@bothner.com http://per.bothner.com/
Kevin Wortman
2014-08-21 16:56:22 UTC
IMO there are enough use cases where the growing feature is unneeded and its
overhead is unwanted that the best compromise is to have both a raw
fixed-size array type (i.e. an R7RS-small vector), _and_ a higher-level
self-resizing array. It is not hard to write a self-resizing array on top of
a fixed-size array. That's what e.g. C++ and Java do.

Best regards,
Kevin Wortman
John Cowan
2014-08-21 17:12:46 UTC
Post by Kevin Wortman
IMO there are enough use cases where the growing feature is unneeded and
its overhead is unwanted that the best compromise is to have both a
raw fixed-size array type (i.e. an R7RS-small vector), _and_ a higher
level self-resizing array. It is not hard to write a self-resizing
array on top of a fixed-size array. That's what e.g. C++ and Java do.
+1

And if you want it to grow in the middle as well as the end, see BuffersCowan,
an API for (traditional) Emacs-style gap buffers.
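
For a rough idea of the traditional technique (this is not the BuffersCowan
API; all names here are invented, and growing the buffer when the gap is
exhausted is left out):

(define-record-type <gap-buffer>
  (%make-gap-buffer data gap-start gap-end)
  gap-buffer?
  (data gb-data)
  (gap-start gb-gap-start set-gb-gap-start!)
  (gap-end gb-gap-end set-gb-gap-end!))

;; Elements live at data[0, gap-start) and data[gap-end, capacity); the
;; region in between is the unused gap, so inserting at the gap is O(1).
(define (make-gap-buffer capacity)
  (%make-gap-buffer (make-vector capacity) 0 capacity))

;; Slide the gap so that it starts at element position POS.
(define (gb-move-gap! gb pos)
  (let ((data (gb-data gb))
        (start (gb-gap-start gb))
        (end (gb-gap-end gb)))
    (cond ((< pos start)     ; move elements [pos, start) to the gap's far side
           (vector-copy! data (- end (- start pos)) data pos start)
           (set-gb-gap-start! gb pos)
           (set-gb-gap-end! gb (- end (- start pos))))
          ((> pos start)     ; move elements just after the gap to its near side
           (vector-copy! data start data end (+ end (- pos start)))
           (set-gb-gap-start! gb pos)
           (set-gb-gap-end! gb (+ end (- pos start)))))))

(define (gb-insert! gb pos x)  ; assumes the gap is not yet exhausted
  (gb-move-gap! gb pos)
  (vector-set! (gb-data gb) (gb-gap-start gb) x)
  (set-gb-gap-start! gb (+ (gb-gap-start gb) 1)))
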
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
"Hacking is the true football." --F.W. Campbell (1863) in response to a
successful attempt to ban shin-kicking from soccer. Today, it's biting.
Sascha Ziemann
2014-08-22 07:47:10 UTC
Post by Kevin Wortman
IMO there are enough use cases where the growing feature is unneeded and
its overhead is unwanted that the best compromise is to have both a raw
fixed-size array type (i.e. an R7RS-small vector), _and_ a higher level
self-resizing array. It is not hard to write a self-resizing array on top
of a fixed-size array. That's what e.g. C++ and Java do.
Maybe we could come back to the original question of how to avoid unnecessary
initializations during the creation of a new vector from parts of other
vectors. The standard requires one to use make-vector and vector-copy, which
initialize the new vector twice.

Is there broad consensus that this is not worth caring about, because one
should disregard vectors in favour of gap buffers?

Sascha
John Cowan
2014-08-22 13:07:14 UTC
Post by Sascha Ziemann
Maybe we could come back to the original question of how to avoid unnecessary
initializations during the creation of a new vector from parts of other
vectors. The standard requires one to use make-vector and vector-copy, which
initialize the new vector twice.
I've added a vector-append-subvector procedure to VectorsCowan, my proposal
for a general vector library for R7RS-large. A Scheme is free to implement
this portably or using unsafe primitives.
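
For reference, a portable sketch of one way such a procedure might be
written (the vec/start/end-triples calling convention here is only a guess,
not necessarily what VectorsCowan specifies; caddr and cdddr come from
(scheme cxr)):

(define (vector-append-subvector . args)
  ;; ARGS is assumed to be a flat list of vec start end triples,
  ;; e.g. (v1 0 3 v2 2 5 ...).
  (define (piece-length a) (- (caddr a) (cadr a)))
  (let* ((total (let loop ((a args) (n 0))
                  (if (null? a) n (loop (cdddr a) (+ n (piece-length a))))))
         (result (make-vector total)))
    (let loop ((a args) (at 0))
      (if (null? a)
          result
          (begin
            (vector-copy! result at (car a) (cadr a) (caddr a))
            (loop (cdddr a) (+ at (piece-length a))))))))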
Post by Sascha Ziemann
Is there broad consensus that this is not worth caring about, because one
should disregard vectors in favour of gap buffers?
Gap buffers, or indirect vectors that can be swapped out, or any number
of other things.
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
I must confess that I have very little notion of what [s. 4 of the British
Trade Marks Act, 1938] is intended to convey, and particularly the sentence
of 253 words, as I make them, which constitutes sub-section 1. I doubt if
the entire statute book could be successfully searched for a sentence of
equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940
Vassil Nikolov
2014-08-22 18:01:44 UTC
Post by John Cowan
Post by Sascha Ziemann
Maybe we could come back to the original question of how to avoid
unnecessary initializations during the creation of a new vector from parts
of other vectors. The standard requires one to use make-vector and
vector-copy, which initialize the new vector twice.
I've added a vector-append-subvector procedure to VectorsCowan, my proposal
for a general vector library for R7RS-large.
Is it thinkable here to advise the user simply to call `vector-append' and
rely on an optimizing compiler to avoid unnecessary copying of the
subvectors that are apparently constructed afresh just to be thrown away
almost immediately?

---Vassil.
--
Vassil Nikolov | Васил Николов | <***@pobox.com>

"Be careful how you fix what you don't understand." (Brooks 2010, 185)
John Cowan
2014-08-22 18:37:47 UTC
Post by Vassil Nikolov
Is it thinkable here to advise the user simply to call `vector-append'
and rely on an optimizing compiler to avoid unnecessary copying of
the subvectors that are apparently constructed afresh just to be
thrown away almost immediately?
If you like, certainly. But SRFI 43 (and R7RS-small, which incorporates
parts of it) have lots of "start end" argument pairs precisely to
avoid the need for such optimization. The alternative would be shared
subvectors, but that would need to be a primitive capability.
--
John Cowan http://www.ccil.org/~cowan ***@ccil.org
My corporate data's a mess!
It's all semi-structured, no less.
But I'll be carefree / Using XSLT
On an XML DBMS.
Vassil Nikolov
2014-08-22 20:54:27 UTC
Post by John Cowan
Post by Vassil Nikolov
Is it thinkable here to advise the user simply to call `vector-append' and
rely on an optimizing compiler to avoid unnecessary copying of the
subvectors that are apparently constructed afresh just to be thrown away
almost immediately?
If you like, certainly. But SRFI 43 (and R7RS-small, which incorporates
parts of it) have lots of "start end" argument pairs precisely to avoid the
need for such optimization.
Quite so, and these are very useful indeed, but this approach (sometimes)
involves its own trade-offs. For example, if we are content with a
two-argument append (concatenate) operation, it is (reasonably) easy to add
optional parameters start1, end1, start2, and end2. If, however, we insist
on being able to append an arbitrary number of arguments in a single call,
I don't know whether either of the following is obviously better than the
other: to complicate [*] or proliferate the interfaces, or to pass the buck
on to the compiler.

To a large degree, this is a matter of language design style (or
"philosophy").

_________
[*] If we wish to be able to use the append operation easily with a reduce
    (fold) operation, say, that would constrain the interface of the append
    operation with regard, e.g., to additional parameters such as start-end
    pairs.
Post by John Cowan
The alternative would be shared subvectors, but that would need to be a
primitive capability.
True, and a different kettle of fish, with its own usefulness and cost. I
would argue that such a feature should be considered independently of
("orthogonally to") the others.

---Vassil.
--
Vassil Nikolov | Васил Николов | <***@pobox.com>

"Be careful how you fix what you don't understand." (Brooks 2010, 185)
Kevin Wortman
2014-08-23 00:52:57 UTC
Post by Sascha Ziemann
Maybe we could come back to the original question of how to avoid unnecessary
initializations during the creation of a new vector from parts of other
vectors. The standard requires one to use make-vector and vector-copy, which
initialize the new vector twice.
Is there broad consensus that this is not worth caring about, because one
should disregard vectors in favour of gap buffers?
If your implementation supports SRFI 43 (
http://srfi.schemers.org/srfi-43/srfi-43.html ) or similar, then you can
use vector-unfold or vector-unfold-right to create and initialize a vector
all at once.
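
For example, a one-pass vector-insert (the name is hypothetical) in terms of
SRFI 43's vector-unfold could look roughly like this:

(define (vector-insert vec pos ins)
  ;; Supply each element exactly once; whether the underlying fill of
  ;; make-vector is actually avoided is up to the implementation.
  (let ((m (vector-length ins)))
    (vector-unfold
     (lambda (i)
       (cond ((< i pos)       (vector-ref vec i))           ; head of VEC
             ((< i (+ pos m)) (vector-ref ins (- i pos)))   ; inserted block
             (else            (vector-ref vec (- i m)))))   ; tail of VEC
     (+ (vector-length vec) m))))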

IMO, small constant-factor costs, like the time to initialize an array, are
not a big concern in Scheme code. For reference, a quick test shows my
2-year-old laptop can memset approx. 3.7 GB/sec. So this cost is almost
certainly dominated by other factors. In cases where this kind of thing
makes a user-visible make-or-break difference, I'd probably need to write
assembly or low-level C anyway. So I prefer for Scheme code to do the Right
Thing even at the expense of some small constant factors here and there. I
would not go so far as to say there is broad consensus on this position.

Best regards,
Kevin Wortman
Vassil Nikolov
2014-08-23 22:28:55 UTC
Post by Kevin Wortman
...
IMO, small constant-factor costs, like the time to initialize an array, are
not a big concern in Scheme code. For reference, a quick test shows my
2-year-old laptop can memset approx. 3.7 GB/sec. So this cost is almost
certainly dominated by other factors. In cases where this kind of thing
makes a user-visible make-or-break difference, I'd probably need to write
assembly or low-level C anyway. So I prefer for Scheme code to do the Right
Thing even at the expense of some small constant factors here and there. I
would not go so far as to say there is broad consensus on this position.
I agree with the spirit of this statement. Nevertheless, I can't resist
making a pedantic note. The rate given above, 3.7 GB/s, is on the order of
1 word/ns; in other words, initializing a million-element vector would take
on the order of a million CPU cycles. This might be significant in the
proverbial inner loop. In other words, your CPUage will indeed vary...

---Vassil.
--
Vassil Nikolov | Васил Николов | <***@pobox.com>

"Be careful how you fix what you don't understand." (Brooks 2010, 185)