more about build-index+cl-ppcre branch & encodings
From: Raymond Toy
Date: Wed, 02 Mar 2011 21:52:08 -0500
On 3/2/11 5:59 PM, Douglas Crosher wrote:
>
> Using a byte offset to position in a character file, exploiting broken
> implementations of 'file-offset, does not seem a good approach. At
> any time the broken implementations could correct 'file-offset and
> Maxima would then look up the wrong location.
>
> The SCL, and it would also seem CMUCL, do correctly position in a
> character file so currently return text from the wrong location.
I'm surprised this works in CMUCL. FILE-POSITION in CMUCL is the octet
position, not the character position. For variable-length encodings,
how do you position by character other than by, more or less, reading
every character?
Could be a bug in CMUCL.
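For illustration, a minimal sketch of positioning by character the slow way, by reading every character up to the offset. READ-CHARS-AT is a hypothetical helper, not code from Maxima or from either branch, and the :external-format argument is an assumption, since the accepted values differ between implementations.

```lisp
;; Hypothetical helper, not taken from Maxima: return up to N-CHARS characters
;; starting at character offset START, without relying on FILE-POSITION.
(defun read-chars-at (path start n-chars &key (external-format :default))
  (with-open-file (in path :direction :input
                           :external-format external-format)
    ;; Skip START characters one at a time; stop early at end of file.
    (loop repeat start
          while (read-char in nil nil))
    ;; Collect up to N-CHARS characters, fewer if the file ends first.
    (with-output-to-string (out)
      (loop for i from 0 below n-chars
            for ch = (read-char in nil nil)
            while ch
            do (write-char ch out)))))
```

This costs time proportional to START on every lookup, but the offset is always counted in characters, whatever FILE-POSITION happens to return.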
>
> It is frustrating that 'file-position is inconsistent with the number
> of characters read in some CL implementations. It would seem like a
> bug, but is easy to work around. Leo's code reads the entire file
> into a string and then extracts characters from the string using a
> character offset, avoiding 'file-position. Alternatively a
> 'read-char loop could be used for broken implementations.
I have a vague memory that the original info system did read in the
entire file. This was the version before Robert created the current system.
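A rough sketch of the read-the-whole-file approach, assuming the file fits comfortably in memory. FILE-TO-STRING and EXTRACT-TEXT are hypothetical names used only for illustration. They are not Leo's code or the old info system, and the :external-format value is again an assumption.

```lisp
;; Hypothetical helpers, not taken from Maxima.
(defun file-to-string (path &key (external-format :default))
  "Read the whole file at PATH into one string, character by character."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (with-output-to-string (out)
      (loop for ch = (read-char in nil nil)
            while ch
            do (write-char ch out)))))

(defun extract-text (text start n-chars)
  "Return up to N-CHARS characters of TEXT beginning at character offset START."
  (let* ((begin (min start (length text)))
         (end (min (length text) (+ begin n-chars))))
    (subseq text begin end)))
```

With the file held as a string, offsets are plain character indices, so an index built this way does not depend on what FILE-POSITION returns.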
Ray