Here's some commentary from doc/info/build_index.pl.
Make of it what you will.
# (1.1a) Scan the *.info-* files for unit separator characters;
# those mark the start of each texinfo node.
# Build a hash table which associates the node name with the filename
# and byte offset (NOT character offset) of the unit separator.
#
# Do NOT use the indirect table + tag table (generated by makeinfo),
# because those tables give character offsets; we want byte offsets.
# It is easier to construct a byte offset table by hand,
# rather than attempting to fix up the character offsets.
# (Which are strange anyway.)
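
For illustration only, here is a minimal Python sketch of the scan those comments describe. It is not the Perl script itself: the function name build_node_index and the assumed header layout ("File: ...,  Node: NAME,  ...") are my own assumptions.

```python
import re

UNIT_SEPARATOR = b"\x1f"
# Assumed node header layout: "File: x.info,  Node: Name,  Next: ..."
NODE_RE = re.compile(rb"Node:\s*([^,\n]+)")


def build_node_index(paths):
    """Map node name -> (filename, byte offset of the unit separator)."""
    index = {}
    for path in paths:
        # Binary mode, so every position is a byte offset, never a
        # character offset.
        with open(path, "rb") as f:
            data = f.read()
        pos = data.find(UNIT_SEPARATOR)
        while pos != -1:
            # The header line normally follows the separator and a newline.
            header_end = data.find(b"\n", pos + 2)
            if header_end == -1:
                header_end = len(data)
            match = NODE_RE.search(data[pos + 1:header_end])
            if match:
                name = match.group(1).decode("utf-8").strip()
                index[name] = (path, pos)
            pos = data.find(UNIT_SEPARATOR, pos + 1)
    return index
```

Separators that are not followed by a node header (the tag table, for example) are skipped because their first line has no "Node:" field.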
best, Robert Dodier
On 12/6/10, Robert Dodier <robert.dodier at gmail.com> wrote:
> When I wrote the perl script to compute the offsets,
> there was some Lisp-inspired craziness about byte offsets vs character
> offsets.
> I seem to recall that the offsets are character offsets,
> but there is something else which is a byte count ... maybe the
> amount of stuff to read is a number of bytes, not a number of characters.
>
> I'm pretty sure I did try reading some files with multibyte characters
> (Spanish & Portuguese, I guess) so I wouldn't throw out the
> existing offset/count stuff just yet. Unfortunately I won't
> have time to investigate for a few days, maybe someone else
> can take a look at it.
>
> best, Robert Dodier
>
> On 12/6/10, Raymond Toy <toy.raymond at gmail.com> wrote:
>> On 12/6/10 11:37 AM, Leo Butler wrote:
>>>
>>> On Mon, 6 Dec 2010, Raymond Toy wrote:
>>>
>>> < On 12/6/10 1:10 AM, Robert Dodier wrote:
>>> < > Yeah, I see the problem with the incorrect indexing too.
>>> < > Could be looking in the correct file at the incorrect offset,
>>> < > or the incorrect file at the correct offset, or
>>> < > both the file and offset are incorrect. I didn't
>>> < > look at it carefully.
>>> < I don't read Perl very well, but could the problem be that
>>> < build_index.pl is reading the info files with a utf-8 encoding? This is
>>> < the right encoding, but won't that totally mess up the index in
>>> < maxima-index.lisp? I'm pretty sure the indices in maxima-index.lisp are
>>> < octet offsets, not character offsets.
>>>
>>> I was inclined to believe this, but I don't think the problem is here.
>>> I re-wrote the build_index.pl to use the right encoding (and speed it
>>> up), but this doesn't affect the problem.
>>>
>>> Indeed, if you open maxima.info-1 in an emacs buffer, put point at
>>> (point-min) and (goto-char 288618), you will arrive in the middle
>>> of the `expand' documentation. So the char vs. byte counts are quite
>>> close. Accessing online help for `expand'
>>> puts you in the midst of the docstring for `example'.
>> But, from looking at read-info-text in cl-info.lisp, the octet count has
>> to be exact because read-info-text moves to the exact offset in the file
>> and reads some number of octets. So, close isn't enough. From tracing
>> read-info-text on "? expand", the offset is 33623, but the documentation
>> for expand starts at offset 288346.
>>
>> (33623 was obtained from maxima-index.lisp.)
>>
>> So calling read-info-text with the correct offset produces the correct
>> documentation (more or less).
>>> Even more peculiarly, ? expandwrt displays the same string as ? expand,
>>> but the offsets differ.
>> Because maxima-index.lisp says the offsets are the same.
>>>
>>> Based on all this, I tend to think the problem lies in the lisp
>>> function reading the info files.
>> You are also correct about this. read-info-text opens the file with
>> some default encoding. I'm not exactly sure what file-position does in
>> various lisps for encoded files. If file-position moves to the
>> specified octet, then that's ok. But then we use read-sequence.
>> Read-sequence doesn't support any kind of encoding, so the returned
>> string will probably be messed up.
>>
>> I think what we need to do here is open the file as a binary file of
>> octets, move to the correct offset and read in the desired number of
>> octets into an array. Then this array needs to be converted to a string
>> using the correct encoding. (Most lisps have some kind of
>> octets-to-string function.)
>>
>> Does this make sense to you?
>>
>> Ray
>
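
Ray's proposal above (open the file as octets, move to the offset, read a number of octets, then decode) might look roughly like this in Python. This is only a sketch of the idea, not the code in cl-info.lisp; the name read_octets_as_text and the utf-8 default are assumptions.

```python
def read_octets_as_text(path, byte_offset, byte_count, encoding="utf-8"):
    """Read byte_count octets starting at byte_offset, then decode them."""
    # Binary mode: seek() and read() both count octets, so multibyte
    # characters earlier in the file cannot shift the position.
    with open(path, "rb") as f:
        f.seek(byte_offset)
        octets = f.read(byte_count)
    # Decoding happens only after the exact octets are in hand. This
    # raises UnicodeDecodeError if the span cuts a multibyte character.
    return octets.decode(encoding)
```

With this approach the offset and the count in the index both have to be octet counts. A character offset would land short of the intended text by one octet for every extra byte of each multibyte character before it.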