more about build-index+cl-ppcre branch & encodings

Subject: more about build-index+cl-ppcre branch & encodings
From: Douglas Crosher
Date: Thu, 03 Mar 2011 09:59:56 +1100
Using a byte offset to position in a character file, exploiting broken implementations of 'file-position, does not seem a good
approach.  At any time the broken implementations could correct 'file-position and Maxima would then look up the wrong location.
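
To illustrate why a byte offset and a character offset diverge in a UTF-8 file, here is a minimal sketch.  The function name utf-8-octet-count is hypothetical (it is not part of Maxima), and the sketch assumes 'char-code returns Unicode code points, which may not hold in every implementation.

```lisp
;; Hypothetical helper, for illustration only.
;; Assumes CHAR-CODE returns a Unicode code point.
(defun utf-8-octet-count (string &key (end (length string)))
  "Number of octets the first END characters of STRING occupy in UTF-8."
  (loop for i below end
        for code = (char-code (char string i))
        sum (cond ((< code #x80) 1)      ; ASCII
                  ((< code #x800) 2)     ; e.g. accented Latin letters
                  ((< code #x10000) 3)
                  (t 4))))

;; "funci" + o-acute + "n": 7 characters, but 8 octets in UTF-8.
(let ((word (concatenate 'string "funci" (string (code-char #xF3)) "n")))
  (list (length word) (utf-8-octet-count word)))
;; => (7 8)
```

Every non-ASCII character before the target shifts the byte offset further from the character offset, so the two only agree in pure ASCII text.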

The SCL, and it would also seem CMUCL, do correctly position by character in a character file, so when given the byte offsets from the index they currently return text from the wrong location.

It is frustrating that 'file-position is inconsistent with the number of characters read in some CL implementations.  It would seem
like a bug, but is easy to work around.  Leo's code reads the entire file into a string and then extracts characters from the string
using a character offset, avoiding 'file-position.  Alternatively, a 'read-char loop could be used for broken implementations.
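
Both workarounds could be sketched roughly as follows.  This is not Leo's actual code; the function names and argument lists are hypothetical, and the external-format designator is left to the caller because it differs between implementations (for example "utf-8" in CLISP and :utf-8 elsewhere, as in the test below).

```lisp
;; Hypothetical helpers, for illustration only.

(defun read-file-as-string (path external-format)
  "Return the whole contents of PATH, decoded, as one string."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (with-output-to-string (out)
      (loop for ch = (read-char in nil nil)
            while ch
            do (write-char ch out)))))

(defun text-at-char-offset (text start length)
  "Return up to LENGTH characters of TEXT from character offset START."
  (let ((end (length text)))
    (subseq text (min start end) (min (+ start length) end))))

(defun skip-and-read (path external-format start length)
  "Skip START characters with READ-CHAR, then read up to LENGTH characters.
Never calls FILE-POSITION."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (loop repeat start do (read-char in nil nil))
    (let* ((buffer (make-string length))
           (filled (read-sequence buffer in)))
      (subseq buffer 0 filled))))
```

The first approach costs memory proportional to the file size but makes repeated lookups cheap; the second reads only as far as needed but rescans from the start on every lookup.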

I note that CLISP and SBCL seem to have issues here, while the SCL and it would also seem CMUCL can correctly position in a 
character file.  Here's a test:

(let ((text1 (make-string 100))
      (text2 (make-string 100))
      (file "doc/info/es.utf8/maxima.info-1")
      (pos 22071))
  ;; Position with FILE-POSITION, then read 100 characters.
  (with-open-file (in file :direction :input
                           :external-format #+clisp "utf-8" #-clisp :utf-8)
    (file-position in pos)
    (read-sequence text1 in :start 0 :end 100))
  ;; Skip POS characters with READ-CHAR, then read 100 characters.
  (with-open-file (in file :direction :input
                           :external-format #+clisp "utf-8" #-clisp :utf-8)
    (dotimes (i pos) (read-char in))
    (read-sequence text2 in :start 0 :end 100))
  (if (string= text1 text2) "OK" "Broken"))

SCL: OK
CMUCL: OK
CLISP: Broken
SBCL: Broken

Separate documents and indexes are being generated for each codeset (iso-8859-1, utf-8), but I assume that most CL implementations 
could read in a utf-8 file and then write out the text in the current codeset, and this would make the build smaller and easier, and 
also allow the user more flexibility.  If all the supported languages could use iso-8859-1 then this could also be used to store the 
info documents.  Were there any issues in doing this?
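
As a rough sketch of what reading in one codeset and writing in another could look like: the function transcode-file below is hypothetical, and the external-format designators are an assumption since their names vary by implementation.  What happens to a character that the output codeset cannot represent (an error or a substitution) is also implementation-dependent.

```lisp
;; Hypothetical helper, for illustration only.
;; IN-FORMAT and OUT-FORMAT are implementation-specific external-format
;; designators supplied by the caller.
(defun transcode-file (in-path in-format out-path out-format)
  "Copy IN-PATH to OUT-PATH character by character, decoding with
IN-FORMAT and encoding with OUT-FORMAT."
  (with-open-file (in in-path :direction :input
                              :external-format in-format)
    (with-open-file (out out-path :direction :output
                                  :if-exists :supersede
                                  :external-format out-format)
      (loop for ch = (read-char in nil nil)
            while ch
            do (write-char ch out)))))
```

Note that any byte offsets recorded against the utf-8 file would not carry over to the transcoded copy, whereas character offsets would.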

Regards
Douglas Crosher

On 03/03/11 08:29, Robert Dodier wrote:
> OK, when I launch xterm with
> LC_ALL=foo LANG=foo xterm
> and then run Maxima 5.21.1 in that, describe text
> (titles and content) is displayed correctly in both
> ISO-8859 and UTF-8 locales.
>
> What was Ray's original proposal? I don't remember.
>
> At any rate, it occurs to me now that it seems possible to use
> CL-PPCRE to construct the index, but use the existing
> code to display stuff. The one wrinkle is that the existing
> index has a byte offset + character length (i.e. not both byte counts
> nor both character counts). That's to accommodate Lisp -- FILE-POSITION wants
> a byte count, and READ wants a character count.
>
> FWIW
>
> Robert Dodier
>
> On 3/2/11, Leo Butler <l.butler at ed.ac.uk> wrote:
>>
>>
>> On Wed, 2 Mar 2011, Robert Dodier wrote:
>>
>> <  I've updated my sandbox to revision 9c49048 and built Maxima.
>> <  I'm seeing the same behavior today as I did a day or two ago;
>> <  titles & content is displayed correctly in ISO-8859 locales,
>> <  in UTF-8 locales, titles are correct and content is messed up.
>> <
>> <  I guess that the encoding for the content is set incorrectly.
>> <  I don't know how the encoding for the titles could be correct
>> <  and the content incorrect.
>>
>> Because they use different functions to write their output.
>> The output to *standard-output* is being written with the
>> wrong encoding for you (but not me). Could you try Ray's
>> cmucl fix, please.
>>
>> <
>> <  As it happens, the code for the existing describe system
>> <  in src/cl-info.lisp doesn't bother with encodings at all;
>> <  it falls on the Lisp implementation to figure out the encoding.
>> <  That scheme displays titles & content correctly in ISO-8859
>> <  and UTF-8 locales so far as I know.
>>
>> It would be nice if you would test this supposition, so we
>> can know for certain.
>>
>> <  That suggests that the encoding stuff in src/build-index.lisp
>> <  could be simplified. Just a guess.
>>
>>   And now we go full circle back to Ray's initial idea.
>>
>>   Leo
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>