more about build-index+cl-ppcre branch & encodings



Hello Steve,

Yes, you are right, sorry, so it's not broken.

The SCL uses the character position so that a file can be recoded while still using the same offsets, and also to support
file-position on encodings in which byte offsets do not increase monotonically with character position, such as compressed and
encrypted formats.

Positioning is very fast when the target lies within the buffer, perhaps faster than a byte-offset implementation, but otherwise it
may require some or all of the file to be re-read, which could be very slow.  If speed is critical, it is only a small extra step to
open a binary stream, seek to the location, and then encapsulate it in a character stream.  The SCL optimizes fixed-width encodings
and avoids re-reading if the buffer can be used.  A fixed-width encoding may be a good choice, and is necessary when writing at
random positions.

Using a byte offset also makes buffering problematic.  For example:

;; Write 1000 copies of U+1234, which occupies three octets in UTF-8.
(with-open-file (ostream "ctest.txt" :direction :output
                         :external-format #+clisp "utf-8" #-clisp :utf-8)
  (dotimes (i 1000)
    (write-char (code-char #x1234) ostream)))

;; Read one character, unread it, and check that file-position is restored.
(with-open-file (stream "ctest.txt" :direction :input
                        :external-format #+clisp "utf-8" #-clisp :utf-8)
  (let ((p0 (file-position stream))
        (ch (read-char stream)))
    (unread-char ch stream)
    (let ((p0* (file-position stream)))
      (if (eql p0* p0) "Ok" "Broken"))))

SCL: Ok
CLISP: Broken
CMUCL (Unicode): Broken

Needless to say, the use of a character offset for the Maxima info documents suits the SCL, but it could also work with byte offsets,
by opening a binary stream and then positioning and encapsulating it in a character stream, or by just reading a chunk of bytes and
converting them to characters.
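
As a rough illustration of the last option (reading a chunk of bytes and converting them), here is a minimal sketch that uses only
standard Common Lisp.  The helper names decode-utf-8 and read-text-at-byte-offset are made up for this example and are not part of
Maxima or the SCL.  It assumes a Unicode-capable Lisp, well-formed UTF-8, and an offset and length that fall on character
boundaries; a real implementation would want error checking, or the implementation's own octets-to-string conversion.

```lisp
;; Sketch only: byte-offset access to a UTF-8 file.
;; DECODE-UTF-8 and READ-TEXT-AT-BYTE-OFFSET are hypothetical helper names.

(defun decode-utf-8 (octets)
  "Decode a vector of well-formed UTF-8 octets into a string (no validation)."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (let* ((b0 (aref octets i))
                      ;; Number of continuation octets implied by the lead octet.
                      (extra (cond ((< b0 #x80) 0)
                                   ((< b0 #xE0) 1)
                                   ((< b0 #xF0) 2)
                                   (t 3)))
                      ;; Payload bits carried by the lead octet.
                      (code (case extra
                              (0 b0)
                              (1 (logand b0 #x1F))
                              (2 (logand b0 #x0F))
                              (t (logand b0 #x07)))))
                 ;; Each continuation octet contributes its low six bits.
                 (dotimes (k extra)
                   (setf code (logior (ash code 6)
                                      (logand (aref octets (+ i k 1)) #x3F))))
                 (write-char (code-char code) out)
                 (incf i (1+ extra)))))))

(defun read-text-at-byte-offset (path offset length)
  "Return the text decoded from up to LENGTH octets starting at byte OFFSET of PATH."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    ;; On a binary stream, FILE-POSITION counts octets.
    (file-position in offset)
    (let* ((octets (make-array length :element-type '(unsigned-byte 8)))
           (end (read-sequence octets in)))
      (decode-utf-8 (subseq octets 0 end)))))
```

With the ctest.txt written above, each character occupies three octets, so (read-text-at-byte-offset "ctest.txt" 300 6) should
return the two characters at character offsets 100 and 101 without reading the first 300 bytes.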

Regards
Douglas Crosher

On 03/03/11 13:55, Steve Haflich wrote:
> Please read the ANS about file position.  There are good reasons it is
> defined the way it is.  If it were strictly defined to operate on
> character position rather than allowing monotonic octet position, it
> would be impossible to seek to a particular place on a
> variable-width-character stream without rereading the entire stream (or
> maintaining some complex binary tables).
> _______________________________________________
> Maxima mailing list
> Maxima at math.utexas.edu
> http://www.math.utexas.edu/mailman/listinfo/maxima
>