Texinfo / parse-info stuff



Mario Rodriguez <biomates at telefonica.net> writes:
> As far as I know, and if I am not mistaken, both latin1 and utf-8 use
> the same character encoding for non-special characters. I think that it
> doesn't matter if files are saved in any of these encodings as long as
> you don't use special characters. In fact, some time ago we had some
> files in info/es saved in latin1 while others were in utf-8. Now, all of
> them are saved in utf-8.

I think doing that is incorrect. The build system (even before I started
hacking on it!) carefully assumes that the contents of info/es are
encoded as latin1, then it has an explicit transcoding step that copies
them to info/es.utf8 and converts to UTF-8 as it goes.
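
A minimal sketch of what such a transcoding step does, not the actual
build rule. The function name and the "*.texi" pattern are assumptions
made for illustration, and the sketch leaves any @documentencoding line
in the copied files untouched.

```python
from pathlib import Path


def transcode_tree(src_dir, dst_dir, pattern="*.texi"):
    """Copy files from src_dir to dst_dir, re-encoding latin1 as UTF-8.

    Hypothetical helper: the name and the file pattern are illustrative.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        # Decoding as latin1 never fails: every byte value maps to a character.
        text = path.read_bytes().decode("latin-1")
        out = dst / path.name
        out.write_bytes(text.encode("utf-8"))
        written.append(out)
    return written
```

Note that if a source file is already UTF-8, this step silently
double-encodes it, which is why the source directory has to contain
what the build assumes it contains.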

As far as I could tell, from a pretty careful reading of how things
worked, UTF-8 encoded files in info/es are just wrong. So I converted
the UTF-8 special characters in info/{es,de,pt,pt_BR} to latin1. To find
the list of files to edit, I basically called the chardet program on
each texinfo file and, if it didn't reckon the result was ASCII, I
opened it up in Emacs (which impressively always guessed the encoding
correctly) and resaved it as latin1.
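
The search for files to edit can be approximated without chardet. The
sketch below is a simpler stand-in using only the Python standard
library, not what chardet does internally: it reports files that are
not pure ASCII and says whether their bytes happen to be valid UTF-8.
The function names and the "*.texi" pattern are assumptions.

```python
from pathlib import Path


def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit encoding, possibly latin1; this check cannot tell which.
        return "other"


def files_to_check(root, pattern="*.texi"):
    """List (path, kind) for every matching file that is not pure ASCII."""
    result = []
    for path in sorted(Path(root).rglob(pattern)):
        kind = classify(path.read_bytes())
        if kind != "ascii":
            result.append((path, kind))
    return result
```

Files reported as 'utf-8' are the ones that would need resaving as
latin1. The check is a heuristic: a latin1 file can, in rare cases,
also be a valid UTF-8 byte sequence.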

In case you're thinking what I'm thinking... I agree that storing latin1
copies of everything is more than a little crazy, but (as Ray pointed
out to me) we need to do this for documentation to work if you have a
non-Unicode Lisp (e.g. GCL) and a latin1 terminal. Since UTF-8 can
represent a strict superset of latin1, it makes more sense to have the
originals written in latin1 and then transcoded automatically to utf-8
than the other way around.
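
The asymmetry is easy to check. In this short sketch (the helper name
is made up for the example) every latin1 byte survives a round trip
through UTF-8, while some characters that UTF-8 can encode have no
latin1 representation at all.

```python
# All 256 possible byte values, interpreted as latin1 text.
all_latin1 = bytes(range(256))
text = all_latin1.decode("latin-1")

# latin1 -> UTF-8 -> latin1 loses nothing.
assert text.encode("utf-8").decode("utf-8") == text
assert text.encode("latin-1") == all_latin1


def fits_latin1(s: str) -> bool:
    """True if every character of s can be written in latin1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

For instance, fits_latin1("par\u00e1metros") holds, but
fits_latin1("\u20ac") (the euro sign) does not.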

> The error reported above is due to a typo when writing the word
> 'parámetros', it should be 'par@'ametros'. For the same reason, the
> German word 'Teilausdrücke' should be fixed as 'Teilausdr@"ucke'.
> Sometimes we forget that we are writing a texinfo document.
>
> I am not an expert in character encoding, but I suspect that using only
> non-special characters is safer, at least for European languages.

Yep, I agree with that, but there's no problem as long as we use the
encoding we claim we're using :-) (note that the info/es/maxima.texi
file has "@documentencoding ISO-8859-1" as its second non-comment
line...)
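
One way to test "we use the encoding we claim we're using" is sketched
below, under stated assumptions: the function names are hypothetical,
and the test is only a heuristic. It reads the @documentencoding line
and flags a file that claims latin1 but whose non-ASCII bytes form
valid UTF-8.

```python
import codecs
import re

# Matches the encoding name on an "@documentencoding NAME" line.
_DECL = re.compile(rb"^@documentencoding[ \t]+(\S+)", re.MULTILINE)


def declared_encoding(data: bytes):
    """Return the declared encoding name, or None if there is none."""
    m = _DECL.search(data)
    return m.group(1).decode("ascii", "replace") if m else None


def looks_mislabelled(data: bytes) -> bool:
    """True if the file claims latin1 but looks like non-ASCII UTF-8."""
    enc = declared_encoding(data)
    if enc is None or data.isascii():
        return False  # nothing declared, or nothing that could disagree
    try:
        name = codecs.lookup(enc).name  # normalises aliases such as "latin1"
    except LookupError:
        return False  # encoding name unknown to Python; not judged here
    if name != "iso8859-1":
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so plausibly genuine latin1
    return True
```

A pure-ASCII file passes whatever it declares, which fits the
observation that most files contain no non-7-bit characters.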

Most files didn't have any non-7-bit characters, which is why the
documentation hasn't been obviously massively broken.

Rupert