Universal read_data function



Again, I forgot to include the Maxima mailing list. I think I must have
sometimers...

-------- Original Message --------
Subject: 	Re: Universal read_data function
Date: 	Thu, 02 Jun 2011 08:57:11 -0700
From: 	Paul Bowyer <pbowyer at olynet.com>
To: 	Edwin Woollett <woollett at charter.net>



On 06/01/2011 11:05 AM, Edwin Woollett wrote:
>  On May 31, Paul Bowyer wrote:
>  ------------------------
>  My reason for thinking I needed to use Windows text file standards in
>  the data files was because I copy/pasted them from your email
>  messages. If I were creating them from scratch on a Linux box, I'd opt
>  for the default LF that is standard for Linux text files.
>
>  When I try to write utility functions, I try to make them robust so
>  they don't fail when things aren't absolutely perfect. It made sense
>  to me to handle the case where Windows text file standards were used
>  since you were working on a Windows machine. I wasn't trying to be a
>  nuisance by continually marking up your code and I hope I didn't upset
>  you, but please forgive me if I did.
>
>  Paul
>  -------------------------------------
>  Hi Paul,
>
>  I never get upset, and can only be flattered by your interest in
>  my faltering efforts at Maxima code.
>
>  The current version of read_data (which has changed: see
>  below) cares not a whit about end of line chars, so that should
>  never be the issue here. The important thing is that the
>  file to be read does not contain spurious extra end of
>  line chars, and that is why I advise looking at the file with
>  a utility such as notepad2, which clearly shows up the
>  locations and types of end of line chars (shift+control+9)
>  (which is a toggle).
>
>  (By the way, when you write data to a stream opened
>  with openw, using printf as in the manual examples,
>  the end of line chars are LF (unix).)
>
>  The NEW version allows the 'data-sep-string' to be "text",
>  (which is a hack), in which case all lines are read
>  in as strings without splitting, as is appropriate for
>  a purely text file which contains spaces and punctuation
>  marks.
>
>  A related change: if the four arg version is used,
>  supplying a list of line numbers such as [2,4],
>  those lines are read into separate sublists,
>  each as one string for the whole line,
>  with no splitting.
>
>  ---------------------------------------
>  The present complete syntax and code are then:
>  -------------------------------------------------------------
>  /*********** read_data  ****************************/
>   /*  if only a file name is given, then  the
>     data separators can be an arbitrary mixture
>     of spaces and commas, but the commas are
>     converted to spaces, so strings with spaces
>     will choke the code if you only provide the
>     filename, or you provide (filename," ").
>
>
>
>     syntax: read_data(filename,data-sep-string,mult,line-list)
>
>       with ";" for example in second slot,
>          and false in third slot.
>       (mult is set to true by default.)
>
>       The data separator string can be anything
>       recognised by split, and the boolean parameter
>       mult is used by split.
>
>
>       In addition, the data-sep-string can be "text",
>       in which case *all* lines of the stream are read
>       in as individual strings.
>
>       Thus the syntax read_data(filename,"text") does
>       no line splitting.
>
>      The most complicated four arg syntax has the
>      form
>        read_data (filename, " ", true, [2,4] )
>
>      for example, where for split line data items,
>      (ie., not lines 2 and 4) space is being used
>      as the data separator, but lines 2 and 4 should
>      be read into separate sublists as a whole as
>      one string for the whole line, doing no splitting
>      for lines 2 and 4.
>            */
>
>
>  /* new 5-29 */
>
>  read_data([%v]) :=
>     block ([%s,%r,%l,%filename,%dsep,%mult:true,
>                  %mix:false,  %whole:[],%ln],
>
>      %filename : part (%v,1),
>
>      if not stringp (%filename)
>        then ( disp (" file name must be a Maxima string "),
>               return (false)),
>
>     if not file_search (%filename) then
>       (disp (" file not found "),return (false)),
>
>     if length (%v) = 1 then %mix : true
>        else if length(%v) = 2 then %dsep : part (%v,2)
>        else if length (%v) = 3
>               then (%dsep : part (%v,2), %mult : part (%v,3))
>        else
>       (%dsep : part (%v,2), %mult : part (%v,3),%whole : part(%v,4)),
>
>
>
>     %s : openr (%filename),
>     %r : [],
>     %ln : 0,
>
>     while (%l : readline(%s)) # false do
>        ( %ln : %ln + 1,
>          if %dsep = "text" then
>             %r : cons (%l,%r)
>          else if not lfreeof (%whole,%ln) then
>             %r : cons (%l,%r)
>          else if %mix then
>             %r : cons (map(parse_string, split(ssubst (" ",",",%l))), %r)
>          else %r : cons (map(parse_string, split(%l,%dsep,%mult)), %r)),
>
>     close (%s),
>     reverse (%r))$
>  ------------------------------------------------
>
>  Ted
>
>
Hi Ted:

I tried your latest code shown above on "ndata2.dat", which I re-copied
from your email of "05/29/2011 12:53 PM" (using Thunderbird) into kwrite
and saved without modifications. Because of the way printfile listed the
data, there was a blank line between the two data lines, and because of
the way I copy/pasted, there was only a single LF char at the end of the
file.

(%i3) printfile ("ndata2.dat")$

2 , 4.8, -3/4, "xyz", -2.8e-9

3 22.2  7/8 "abc" 4.4e10

By the way, the CRs that I had in my copies of the data files that I
used for my previous testing had to be manually entered using Okteta,
because they weren't present in the copy/paste data for ndata2.dat. I
must have gotten my facts turned around when I stated that the CRs
showed up as a result of the copy/paste operation.

Anyway, using your code shown above and running:

printfile("/home/pfb/ndata2.dat")$
trace( parse_string );
read_data("/home/pfb/ndata2.dat");
untrace( parse_string );

results in this output:

2 , 4.8, -3/4, "xyz", -2.8e-9
3 22.2  7/8 "abc" 4.4e10
(%o36) [parse_string]
1" Enter "parse_string["2"]
1" Exit  "parse_string2
1" Enter "parse_string["4.8"]
1" Exit  "parse_string4.8
1" Enter "parse_string["-3/4"]
1" Exit  "parse_string(-3)/4
1" Enter "parse_string[""xyz""]
1" Exit  "parse_string"xyz"
1" Enter "parse_string["-2.8e-9"]
1" Exit  "parse_string-2.8*10^-9
1" Enter "parse_string[]
stdin:1:incorrect syntax: Premature termination of input at $.
(%o38) [parse_string]

Including one or two extra lines of code in your function gives some
protection against erroneous entries, such as those introduced by
copy/paste or by hand-typed entry.
If I were writing this function, I'd do it this way:
------------------------------------------------------------------------------
read_data([%v]) :=
    block ([%s,%r,%l,%filename,%dsep,%mult:true,
                 %mix:false,  %whole:[],%ln],

     %filename : part (%v,1),

     if not stringp (%filename)
       then ( disp (" file name must be a Maxima string "),
              return (false)),

    if not file_search (%filename) then
      (disp (" file not found "),return (false)),

    if length (%v) = 1 then %mix : true
       else if length(%v) = 2 then %dsep : part (%v,2)
       else if length (%v) = 3
              then (%dsep : part (%v,2), %mult : part (%v,3))
       else
      (%dsep : part (%v,2), %mult : part (%v,3),%whole : part(%v,4)),



    %s : openr (%filename),
    %r : [],
    %ln : 0,

    while (%l : readline(%s)) # false do
       ( %ln : %ln + 1,

         /* Added the blank-line check below and its enclosing parens.
            Together with the optional line that follows, this avoids
            problems with blank lines and CRs in the file. */

         /* Enable this line if you're concerned about CRs at line ends:
         %l : strim(" ", ssubst(" ", ascii(13), %l ) ), */

         if %l # "" then  /*Check for blank line*/
         (
           if %dsep = "text" then
            %r : cons (%l,%r)
           else if not lfreeof (%whole,%ln) then
            %r : cons (%l,%r)
           else if %mix then
            %r : cons (map(parse_string, split(ssubst (" ",",",%l))), %r)
           else %r : cons (map(parse_string, split(%l,%dsep,%mult)), %r)
         )
       ),
    close (%s),
    reverse (%r));
------------------------------------------------------------------------------

and again running:

printfile("/home/pfb/ndata2.dat")$
trace( parse_string );
/*read_data("/home/pfb/ndata1.dat"," ",true,[4]);*/
read_data("/home/pfb/ndata2.dat");
untrace( parse_string );

results in this output:

2 , 4.8, -3/4, "xyz", -2.8e-9
3 22.2  7/8 "abc" 4.4e10
(%o45) [parse_string]
1" Enter "parse_string["2"]
1" Exit  "parse_string2
1" Enter "parse_string["4.8"]
1" Exit  "parse_string4.8
1" Enter "parse_string["-3/4"]
1" Exit  "parse_string(-3)/4
1" Enter "parse_string[""xyz""]
1" Exit  "parse_string"xyz"
1" Enter "parse_string["-2.8e-9"]
1" Exit  "parse_string-2.8*10^-9
1" Enter "parse_string["3"]
1" Exit  "parse_string3
1" Enter "parse_string["22.2"]
1" Exit  "parse_string22.2
1" Enter "parse_string["7/8"]
1" Exit  "parse_string7/8
1" Enter "parse_string[""abc""]
1" Exit  "parse_string"abc"
1" Enter "parse_string["4.4e10"]
1" Exit  "parse_string4.4*10^10
(%o46) [[2,4.8,-3/4,"xyz",-2.8*10^-9],[3,22.2,7/8,"abc",4.4*10^10]]
(%o47) [parse_string]

I didn't do any testing on other data files at this time.
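
For further testing, a file with a blank line and a CR-terminated line
could be generated from within Maxima itself. This is only a minimal
sketch: the path "/tmp/rd_test.dat" and the names tf and ts are made up
for illustration, and it assumes the read_data definition above has
already been loaded.

```maxima
/* Sketch: build a small test file containing a blank line and a
   line ending in CR+LF, then read it back.
   The path and the variable names tf, ts are hypothetical. */
tf : "/tmp/rd_test.dat"$
ts : openw (tf)$
printf (ts, "1 2 3~%")$                  /* ordinary LF-terminated line */
printf (ts, "~%")$                       /* blank line */
printf (ts, "4 5 6~a~%", ascii (13))$    /* CR written before the LF */
close (ts)$

printfile (tf)$
read_data (tf);
```

With the blank-line check in place, the empty second line should simply
be skipped. Whether the trailing CR on the last line also needs the
optional strim/ssubst line enabled is worth checking with
trace (parse_string) as above.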

Paul