SEQIO -- A Package for Sequence File I/O 


CHANGES - Changes to the SEQIO Package
**************************************

This file lists the changes made when going from one version to the
next. It should be detailed enough that you won't need to go through
the rest of the documentation to find out what's new. 



Changes from Version 1.2 to Version 1.3
***************************************

Minor Changes
=============

 o Added a new example program, example4. (Version 1.2.2) 
 o Fixed a bug that kept seqfentry from returning the correct entry
   text when mmap'ing was used. (Version 1.2.2) 
 o Added the definition of FILENAME_MAX to fmtseq and idxseq, to
   maintain compatibility with SunOS 4.1.1. (Version 1.2.2) 
 o Changed genbank_annotate and pir_annotate to be a little bit
   more robust. (Version 1.2.2) 
 o Removed the getpagesize system call from the package, since
   Solaris doesn't support it. (Version 1.2.1) 
 o Fixed an uninitialized variable bug in databank_fast_read (one
   that my version of gcc didn't catch). (Version 1.2.1)

Changes from Version 1.1 to Version 1.2
***************************************

New Formats/Porting and Format Changes
======================================

Added the GCG format 

Added the GCG-* format specification, covering the GCG forms of the
GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF and IG/Stanford formats.

Added the MSF Multiple Sequence Format 

Added BLASTN/BLASTP/BLASTX program output format 

Added handling for NID and PID identifiers in the GenBank and EMBL
formats (although, since neither format's release notes explicitly
define a PID/PI line, no such line is output by the package).

New Programs and Program Changes
================================

Added idxseq, a database indexing program 

Added a number of example programs. 

Changed the name of keyword to grepseq. 

Extended the fmtseq program in the following ways: 

 o Added support for the GCG, GCG-*, MSF and BLAST output
   formats
      (This support includes "no loss" conversions between the
      non-GCG and GCG forms of the GCG-* formats) 
 o Added a run mode capability using the `-mode' option and a
   user-created "fmtseq" BIOSEQ entry. This gives the user the
   ability to set and unset multiple options at once. 
 o Added the `-split' option for non-GCG output, so that the user
   can produce a set of output files whose contents correspond to
   the input files given to it (i.e., so the input file contents of
   "gbbct.seq" get converted and output into a corresponding file
   "gbbct.fasta"). 
 o Extended the `-split' option for GCG output, so that each entry
   is output in its own, individual file (whose name is the entry
   identifier string followed by the `-split' extension). 
 o Added a `-long' option, which performs the file conversions so
   that each input entry's header text appears as a comment in the
   converted entry. 
 o Added a `-skipempty' option to the Pretty-print format, so that
   lines containing only gap characters are not output (making
   multiple alignments of things like the BLAST output much easier
   to read). 
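
The file naming that `-split' performs for non-GCG output can be
pictured with a small sketch. It only illustrates the "gbbct.seq" to
"gbbct.fasta" mapping described above and is not the code fmtseq
uses; the helper name split_output_name and its arguments are made up
for this example.

```c
#include <assert.h>
#include <string.h>

/*
 * Build an output file name by replacing the extension of `input'
 * with `ext'.  Hypothetical helper, not part of SEQIO.
 * Returns 0 on success, -1 if the result does not fit in `out'.
 */
int split_output_name(const char *input, const char *ext,
                      char *out, size_t outlen)
{
    const char *slash, *dot;
    size_t stemlen;

    assert(input != NULL && ext != NULL && out != NULL);

    /* Only look for a '.' in the final path component. */
    slash = strrchr(input, '/');
    dot = strrchr(slash ? slash + 1 : input, '.');
    stemlen = dot ? (size_t)(dot - input) : strlen(input);

    /* stem + '.' + extension + terminating NUL must fit. */
    if (stemlen + 1 + strlen(ext) + 1 > outlen)
        return -1;

    memcpy(out, input, stemlen);
    out[stemlen] = '.';
    strcpy(out + stemlen + 1, ext);
    return 0;
}
```

An input name without an extension simply gets the new extension
appended.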

New Capabilities of the SEQIO Package
=====================================

Added the ability for the user to specify single entries of a file,
either by entry position, by byte offset or by entry identifier.

Added the ability for the user to specify single entries of a database,
using the database identifiers and random access of the database
entries. 

The BIOSEQ environment variable can now take a full PATH-like
specification, specifying more than one BIOSEQ file. 
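
As a rough illustration of what a PATH-like value involves, the
sketch below walks through such a list one element at a time. It
assumes the elements are separated by ':' as in the Unix PATH
variable (this file does not state the separator), and
next_path_element is a hypothetical helper, not a SEQIO function.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the next element of a PATH-like list into `out' and return a
 * pointer to the remainder of the list, or NULL when the list is
 * exhausted.  Assumption: elements are separated by ':'.
 */
const char *next_path_element(const char *list, char *out, size_t outlen)
{
    size_t full, len;

    assert(out != NULL && outlen > 0);
    if (list == NULL || *list == '\0')
        return NULL;

    full = strcspn(list, ":");      /* length of this element */
    len = full < outlen ? full : outlen - 1;  /* truncate if too long */
    memcpy(out, list, len);
    out[len] = '\0';

    list += full;                   /* step over the element ... */
    if (*list == ':')
        list++;                     /* ... and its separator */
    return list;
}
```

A caller would typically start from getenv("BIOSEQ") and call the
helper in a loop until it returns NULL, trying each BIOSEQ file in
turn.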

BIOSEQ entries can now have multiline information fields. 

Data Structure Changes
======================

Added the fields `rawlen' and `fragstart' to the SEQINFO structure 

Removed the `mainid' and `mainacc' fields from the SEQINFO structure
and moved all identifiers into `idlist'. 

New Functions
=============

char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer) 

 Added the `get' version of `seqfrawseq', because its lack was
 annoying. 

int seqffragstart(SEQFILE *sfp) 

 This SEQINFO access function returns the starting position of a
 fragment sequence (if the sequence is a fragment and the starting
 position is known). 

int seqfrawlen(SEQFILE *sfp) 

 This SEQINFO access function returns the length of the raw
 sequence. 

int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly) 

 This function constructs a "oneline description" of a sequence,
 based on the information in the SEQINFO structure. 

int seqfputs(SEQFILE *sfp, char *s, int len) 

 This function outputs a string on the output stream opened for
 the SEQFILE structure. 

int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen) 

 This function takes an entry in the non-GCG form of one of the
 GCG-* formats and outputs the GCG form of that entry. 

int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen) 

 This function takes an entry in the GCG form of one of the GCG-*
 formats and outputs the non-GCG form of that entry. 

char *bioseq_matchinfo(char *fieldname, char *fieldvalue) 

 This function finds the database whose BIOSEQ entry contains
 an information field with the given field name and field value. 

int seqfisafile(char *filename) 

 This function tests whether the string given to it is an existing file,
 even when the string includes a single entry access specification.

int seqfcangcgify(char *format) 

 This function tests whether the given format is one of the GCG-*
 formats. 

int seqfbytepos(SEQFILE *sfp) 

 This function returns the byte offset of the current entry in the
 current file. 

void seqfsetperror(void (*perr_fn)(char *)) 

 This function sets the "print error" function the package uses to
 perform all of its error printing. 
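
 The general pattern behind a replaceable "print error" function is
 a stored function pointer. The sketch below is a self-contained
 stand-in, not the package's implementation: demo_setperror mirrors
 only the signature given above, and demo_report, capture_error and
 last_error are invented for the example. Treating a NULL argument
 as "restore the default" is also an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Default handler: print the message to stderr. */
static void default_perror(char *msg)
{
    fputs(msg, stderr);
}

/* Currently installed handler (stand-in for the package's state). */
static void (*current_perror)(char *) = default_perror;

/* Same signature as seqfsetperror, but NOT the real function. */
void demo_setperror(void (*perr_fn)(char *))
{
    current_perror = perr_fn ? perr_fn : default_perror;
}

/* Every error message is routed through the installed handler. */
void demo_report(char *msg)
{
    assert(msg != NULL);
    current_perror(msg);
}

/* Example replacement handler: remember the last message. */
char last_error[256];

void capture_error(char *msg)
{
    strncpy(last_error, msg, sizeof last_error - 1);
    last_error[sizeof last_error - 1] = '\0';
}
```

 A program with its own logging or a graphical interface would
 install a handler like capture_error once at startup, so that no
 message is written directly to the terminal.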

Function Changes
================

SEQFILE *seqfopen(char *filename, char *mode, char *format) 

 Seqfopen now automatically reads the first entry of the file, so
 the format of a file is always determined when seqfopen returns.
 It also now supports single entry access to a file's entries. 

int seqftruelen(SEQFILE *sfp)

 This function now always returns the "true" length of the current
 sequence, ignoring any alignment or notational characters. 

char *seqfmainid(SEQFILE *sfp, int newbuffer)
char *seqfmainacc(SEQFILE *sfp, int newbuffer) 

 These two functions are no longer simple access functions to the
 SEQINFO structure (since their corresponding fields were
 removed from the structure). Now, they access information from
 the `idlist' field to construct the "main" identifier and "main"
 accession number. 

int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char
*newcomment, int flag) 

 This function now takes a SEQFILE structure, instead of a stdio
 FILE structure, as its first parameter. The format parameter has
 also been removed, since the SEQFILE structure specifies what
 format the given entry must be in. 

char *bioseq_info(char *dbspec, char *fieldname) 

 A special case has been added to this function, in that when the
 fieldname is "Root", the root directory of the database's BIOSEQ
 entry is now returned. Thus, no information field with the name
 "Root" can appear in a BIOSEQ entry. (Ok, it can appear there,
 but there's no way to access the information from it.) 



Changes from Version 1.0 to Version 1.1
***************************************

New Formats/Porting and Format Changes
======================================

Added PHYLIP Interleaved and Sequential file formats 

Added the Clustalw file format 

Added FASTA/TFASTA/SSEARCH/LFASTA/LALIGN/ALIGN program
output format 

Reimplemented the NBRF format, now that I found out where the
documentation was. 

Ported it to Windows NT/95
Successfully compiled it on Solaris
Successfully compiled it using g++ 

New Programs
============

Added fmtseq, the file format conversion program 

Added keyword, a program to search for keyword/motif matches 

Data Structure Changes
======================

Added fields `mainid' and `mainacc' to the SEQINFO structure 

 So, now the identifiers in an entry are split up into these two
 fields plus `idlist'. The `mainid' field gets the main identifier, the
 `mainacc' field gets the main accession number, and `idlist' gets
 all of the other identifiers. 

New Functions
=============

char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer) 

 Returns the "raw" sequence given in the entry, which includes
 any alignment or structural notation characters in addition to the
 sequence itself. Typically, `seqfsequence' extracts only the
 alphabetic characters, whereas `seqfrawseq' extracts all
 characters except whitespace and digits. See "format.doc" for the
 full details. 
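
 The difference between the two extractions can be sketched with two
 small filters. These follow only the "typical" behaviour described
 above (the exact rules depend on the format, see "format.doc");
 extract_sequence and extract_rawseq are hypothetical names, not
 SEQIO functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Keep only alphabetic characters (typical `seqfsequence' result).
 * Returns the number of characters stored, excluding the NUL. */
size_t extract_sequence(const char *entry, char *out, size_t outlen)
{
    size_t n = 0;

    assert(entry != NULL && out != NULL && outlen > 0);
    for (; *entry != '\0' && n + 1 < outlen; entry++)
        if (isalpha((unsigned char)*entry))
            out[n++] = *entry;
    out[n] = '\0';
    return n;
}

/* Keep everything except whitespace and digits (typical
 * `seqfrawseq' result), so gap and notation characters survive. */
size_t extract_rawseq(const char *entry, char *out, size_t outlen)
{
    size_t n = 0;

    assert(entry != NULL && out != NULL && outlen > 0);
    for (; *entry != '\0' && n + 1 < outlen; entry++)
        if (!isspace((unsigned char)*entry) &&
            !isdigit((unsigned char)*entry))
            out[n++] = *entry;
    out[n] = '\0';
    return n;
}
```

 For the input line "  1 ACGT--ACGT 10", the first filter keeps
 "ACGTACGT" and the second keeps "ACGT--ACGT".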

char *seqfmainid(SEQFILE *sfp, int newbuffer)
char *seqfmainacc(SEQFILE *sfp, int newbuffer) 

 Access functions for the new information fields `mainid' and
 `mainacc'. 

void seqfsetidpref(SEQFILE *sfp, char *idprefix)
void seqfsetdbname(SEQFILE *sfp, char *dbname)
void seqfsetalpha(SEQFILE *sfp, char *alphabet) 

 Sets the identifier prefix, database name and sequence alphabet
 for the sequences read in using the given SEQFILE structure. 

int seqfisaformat(char *format) 

 Tests a format string to see if it's a supported file format. 

int seqffmttype(char *format) 

 Returns a type information value about the given format (see
 "format.doc" for the details about the format types). 

int seqfcanwrite(char *format) 

 Can the package output entries in that format? 

int seqfcanannotate(char *format) 

 Can the package annotate entries in that format? 

int bioseq_check(char *dbspec) 

 Does the database search specification refer to a known
 database? Is there a BIOSEQ entry for it? 

int seqfsetpretty(SEQFILE *sfp, int value) 

 When outputting entries in the Plain, FASTA, NBRF or
 IG/Stanford formats, this specifies whether to add spaces to make
 the sequence look prettier or not. 

 By default, the output operations look at the sequence being
 output, and only add spaces when the sequence is DNA, RNA or
 Protein and when there are no non-alphabetic characters in the
 sequence. 

Minor Changes
=============

 o Removed "#include <unistd.h>" since it was not needed 
 o Fixed a bug in the bioseq_parse directory reading code (it now
   skips entries "." and "..") 
 o Replaced "strerror(errno)" with "sys_errlist[errno]" 
 o Replaced S_ISREG and S_ISDIR with their macro expansions 
 o Changed the error macros so that the return argument is the
   complete return command, instead of just the return value 
 o Made my own versions of toupper, strcasecmp, strncasecmp 
 o Changed the `fasta_read' and `fasta_getinfo' functions so that
   any lines beginning with ';' that occur before any of the
   sequence are considered as part of the entry header and are
   added to the comment lines when filling in the SEQINFO fields. 
 o Changed `seqfsetidpref' and `add_id' to convert all identifier
   prefixes to lowercase. 
 o Made some minor changes to the FASTA, NBRF and IG/Stanford
   putseq functions, rearranging where the main identifier and main
   accession number are placed in an outputted entry. 
 o Added a stripflag argument to parse_comment and
   add_comment so that spaces won't be stripped from comments
   in some formats. 
 o Added prototypes to all of the functions declared just before the
   file_table. 
 o Added `extern "C" {' and `}' ifdef'ed inside `__cplusplus' at the
   beginning and end of "seqio.h". 
 o Created typedefs for the two enum's in the INTSEQFILE
   structure, to be compatible with g++ compilation. 
 o Added explicit conversions for all assignments involving void *
   variables. 
 o In seqfopendb, the format and idprefix information field values
   are now tested to see if they contain valid values. 
 o Fixed a bug in fasta_read, nbrf_read and stanford_read which
   allowed the current file pointer to move past the end of the read
   file buffer (which causes a seg fault when mmap buffers are being
   used). 
 o The access functions for the SEQINFO fields have been
   collapsed into a bunch of stub functions and intseq_field[123]. 
 o Added format specific variables to the INTSEQFILE structure,
   which are used by the NBRF, PHYLIP, Clustalw and
   FASTA-output formats. 
 o Fixed GenBank, PIR, EMBL, Swiss-Prot and NBRF output
   functions so that accession number lists don't overflow past the
   line length. 
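
For readers unfamiliar with the `extern "C"' item above, the usual
shape of such a guard is sketched here. The function
demo_entry_count is a placeholder standing in for the real
prototypes in "seqio.h", which are not reproduced.

```c
#include <assert.h>

/* ---- top of the header ---- */
#ifdef __cplusplus
extern "C" {            /* seen only by a C++ compiler such as g++ */
#endif

/* Placeholder prototype; the real header declares the seqf*
 * functions here. */
int demo_entry_count(const int *lengths, int n);

#ifdef __cplusplus
}                       /* closes the extern "C" block */
#endif
/* ---- bottom of the header ---- */

/* Definition, normally kept in a separate .c file: count the
 * entries whose length is greater than zero. */
int demo_entry_count(const int *lengths, int n)
{
    int i, count = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (lengths[i] > 0)
            count++;
    return count;
}
```

A C compiler never defines `__cplusplus', so it skips both guarded
lines, while a C++ compiler gives the enclosed declarations C
linkage and can link against the C-compiled object code.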


James R. Knight, knight@cs.ucdavis.edu
July 8, 1996 
