SEQIO -- A Package for Sequence File I/O 


README - Readme File for the SEQIO Package
******************************************



The SEQIO package is a set of C functions which can read and write
biological sequence files formatted using various file formats and which
can be used to perform database searches on biological databases. All
of the code is packaged together into a single file, making it easy to
incorporate into your programs. Here are the files included in the
SEQIO package distribution. 

 o seqio.c - The code 
 o seqio.h - The header file 
 o bioseq.txt - An example BIOSEQ file which contains default
                descriptions for a number of databases
                (see "user.doc" for more information on BIOSEQ files) 

 o doc/seqio.doc - The main documentation describing the SEQIO interface 
 o doc/quickref.doc - A Quick Reference Guide to the interface 
 o doc/programr.doc - A "How-To" Guide for using the SEQIO package
                      (canonical examples and programming tips, issues in
                       porting SEQIO to a new machine) 
 o doc/user.doc - Documentation for the users of your programs,
                  rather than for you, the user of SEQIO
                  (short descriptions of file formats, how to specify
                   database searches, and so on) 
 o doc/format.doc - Documentation describing the specific criteria
                    SEQIO uses when parsing and outputting the different
                    file formats. 

 o Makefile - A simple makefile for seqio.o, fmtseq and the examples
 o fmtseq.c - A file conversion program 
 o idxseq.c - A database indexing program 
 o example1.c - A simple keyword searching program 
 o example2.c - A sequence information display program 
 o example3.c - A feature extraction program 
 o grepseq.c - A fixed-width motif searching program 
 o typeseq.c - A sequence output (`cat' or `fetch') program 
 o wcseq.c - A sequence/entry counting program 

 o doc/fmtseq.doc - Documentation for fmtseq 
 o doc/idxseq.doc - Documentation for idxseq 
 o doc/examples.doc - Documentation for the examples 

 o html/seqio_toc.html - A table of contents for the HTML pages 
                         (useful as the main page of a local WWW copy of
                          the documentation) 
 o html/* - All of the *.doc files in HTML format (with crosslinks). 

 o README - This file 
 o CHANGES - A list of changes and bug fixes made to the code
             and documentation 
 o TODO - What will come next (that I know about) 



Installation Notes
******************

To install the programs associated with the package, and to set up your
system to use those programs, perform the following steps. 

 1. Uncompress (using gunzip) and untar the release. This will
   create a sub-directory "seqio-1.2" below where you untar it. 
 2. Enter the sub-directory and run make to compile all of the
   programs. The makefile included in the release is very simple,
   but since the code itself should be portable across platforms, it
   does not have to be complex. The one thing you might
   have to customize is the compiler name and options. The
   makefile is configured to use the gcc compiler. If you do not have
   gcc, then edit the CC and CFLAGS makefile variables for the C
   (or C++) compiler you do have. The only flag really necessary for
   the compilation is the optimization flag (it will make a difference
   in the programs' running time). 
 3. To install the programs elsewhere, copy "fmtseq", "idxseq",
   "grepseq", "typeseq" and "wcseq" to the executable directory.
   These are the only programs that really have the potential to be
   considered useful application programs. 
 4. If you have support for local documentation on the Web, then
   either create a link to the file "html/seqio_toc.html", or copy all of
   the files in the "html" directory and create the link to
   "seqio_toc.html" in the destination directory. 
 5. Create a BIOSEQ file describing all of your databases (an
   example is given in "bioseq.txt"), and, if you want to allow single
   entry access to the entries of those databases, run the "idxseq"
   program on each of them. Tell users of the programs to include
   that filename in the list of files named by their BIOSEQ
   environment variable. 
 6. Enjoy. 

Using the SEQIO Package Itself
==============================

To be able to use the package itself, you should be familiar with reading
and writing files using the C stdio package and with doing dynamic
allocation of memory using malloc and free. To use the SEQIO package
in your program, simply copy the files "seqio.c" and "seqio.h" to your
program directory, include the header file in any program files that use
the SEQIO package, and compile the package along with your program.
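
As a rough gauge of the C background assumed above, here is a minimal
sketch that uses only stdio, malloc, realloc and free. It makes no SEQIO
calls, and "read_all" is a hypothetical helper name chosen for
illustration, not part of the package. It reads the rest of an open
stream into a dynamically grown buffer that the caller must free.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read everything remaining on `fp` into a malloc'ed, NUL-terminated
 * buffer.  Returns the buffer (the caller must free it) and stores the
 * byte count in *len_out.  Returns NULL on allocation or read failure.
 * NOTE: read_all is a hypothetical helper, not a SEQIO function.
 */
char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0, n;
    char *buf, *tmp;

    assert(fp != NULL && len_out != NULL);

    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (len == cap - 1) {          /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {                  /* fread stopped on an error */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

A caller would open a file with fopen, pass the stream to read_all, and
free the returned buffer when finished. If code like this reads
comfortably, the memory-handling conventions described in the SEQIO
documentation should pose no difficulty.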

At this point in time, the SEQIO package has been tested using gcc on
Unix systems running SunOS, Solaris, Ultrix and IRIX and on Windows NT,
and using g++ on Ultrix. The code has been written to the ANSI C
standard, so you will need an ANSI C/C++ compiler in order to compile
the package. I suggest turning on optimization when compiling the
SEQIO package, as it will significantly improve the package's
efficiency. Also, compiling the package may take several minutes, as
the code is around 20,000 lines. (This will get shorter in a later
version; of course, I keep saying that every version.) 

If you plan to use this package and wish to receive notices about
updates and bug fixes, please send mail to knight@cs.ucdavis.edu. In
that mail, specify whether you just want a notice about a new version of
the package, or you want the patch file or complete release
automatically sent to you.
(NOTE: If you see ANYTHING you think is either wrong, or should be
changed, please let me know. If it is wrong, I'll fix it. If I think it isn't, I'll
tell you why, and also tell you how you can get what you want.) 

Any use of the SEQIO package should be accompanied by
acknowledgements and copyright notices in the documentation of any
software developed using the package or derived from the package.
Something along the lines of: 

 This software uses the SEQIO package for reading and writing
 sequences. Copyright (c) 1996 by James Knight at Univ. of
 California, Davis. 

Any papers describing software using the SEQIO package, or whose
results were significantly aided by the use of the SEQIO package
(except when the use was internal to a larger program), should include
an acknowledgement and citation. The citation should be something
like: 

 Knight, James. "SEQIO: A C Package for Reading and Writing
 Sequences," distributed by the author. 

(As soon as I get a paper out about the package, this will become a
reference to the paper.) 



Author and Acknowledgements
***************************

 James Knight
 Dept. of Computer Science
 Univ. of California, Davis
 Davis, CA 95616
 E-mail: knight@cs.ucdavis.edu
 WWW-Site: http://wwwcsif.cs.ucdavis.edu/~knight 

Send any bug reports, new database/file-format information,
comments, complaints or extension requests to knight@cs.ucdavis.edu.

This work was supported foremost by Dan Gusfield at UCDavis, by
grant DE-FG03-90ER60999 from the Department of Energy and by the
Aspen Center for Physics. 

My thanks to Don Gilbert for collecting descriptions of the various
formats and including them with his "readseq" program. I never used
his code, but the `Formats' file was quite useful in writing the package,
and I did look through his code when writing "fmtseq". Thanks also to
Russell Malmberg who stuck with all of my attempts to port the
package to Windows NT/95 until it finally compiled and ran. Thanks to
Kay Hofmann for describing the MSF format in a detailed enough form
for implementation. 



COPYRIGHT NOTICE
****************

In this version, the following copyright notice holds for the SEQIO
package, its documentation and the fmtseq and idxseq programs. All of
the example programs are public domain, and can be used and
rewritten without any acknowledgements (although it would be the
polite thing to do). 

Please note however that in a future version, some programs added to
the release may have a more restrictive copyright (those programs will
be restricted to non-commercial use because of the original sources
used to derive the programs). However, the SEQIO package, fmtseq,
idxseq and the example programs will always be freely available for
commercial or non-commercial use, now and into the future. 

The copyright for the SEQIO package, its documentation and the
fmtseq and idxseq programs: 

  Copyright (c) 1996 by James Knight at Univ. of California, Davis

  Permission to use, copy, modify, distribute and sell this software
  and its documentation is hereby granted, subject to the following
  restrictions and understandings:

    1) Any copy of this software or any copy of software derived
       from it must include this copyright notice in full.

    2) All materials or software developed as a consequence of the
       use of this software or software derived from it must duly
       acknowledge such use, in accordance with the usual standards
       of acknowledging credit in academic research.

    3) The software may be used freely by anyone for any purpose,
       commercial or non-commercial.  That includes, but is not
       limited to, its incorporation into software sold for a profit
       or the development of commercial software derived from it.
 
    4) This software is provided AS IS with no warranties of any
       kind.  The author shall have no liability with respect to the
       infringement of copyrights, trade secrets or any patents by
       this software or any part thereof.  In no event will the
       author be liable for any lost revenue or profits or other
       special, indirect and consequential damages. 


James R. Knight, knight@cs.ucdavis.edu
June 29, 1996 
