-------------------------------------------------------------------------------
POPE Pipeline Version 1.0
by Victor Missirian
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
Software Requirements for Pipeline:
-------------------------------------------------------------------------------
1. Python 2.6.1 (may work for earlier versions too)
-------------------------------------------------------------------------------
2. R version 2.11, with R package Hash (version 2.0.1)
-------------------------------------------------------------------------------
3. the Tophat mRNA-seq aligner (version 1.0.13)
and the Bowtie read aligner (version 0.12.5).
Tophat depends on Bowtie.
-------------------------------------------------------------------------------
4. sh
I tested the pipeline on "GNU bash, version 3.2.48(1)-release
(x86_64-apple-darwin10.0)", though it should work for many
other versions/implementations.  The POPE pipeline is set up
to use whichever shell is at the path '/bin/sh'.
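
The requirements above can be checked before a run.  The following is only a
minimal sketch, not part of the pipeline: the executable names in
REQUIRED_TOOLS are assumptions (this README names the tools, not their
binaries), it does not check version numbers, and it is written for Python 3
rather than the Python 2.6.1 that the pipeline itself targets.

```python
import os
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["python", "R", "tophat", "bowtie"]


def find_missing_tools(tools=REQUIRED_TOOLS, shell_path="/bin/sh"):
    """Return the required executables that cannot be found."""
    # shutil.which searches the directories listed in PATH.
    missing = [tool for tool in tools if shutil.which(tool) is None]
    # POPE uses whichever shell is at /bin/sh, so check that path directly.
    if not (os.path.isfile(shell_path) and os.access(shell_path, os.X_OK)):
        missing.append(shell_path)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```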


-------------------------------------------------------------------------------
List of Included Files:
-------------------------------------------------------------------------------
POPE.sh
pipeline_helper_functions.sh
generate_SNP_file_from_reads_data.sh
get_SNPs_per_locus.py
create_in_silico_hybrid.sh
read_count_frequency_distribution.sh
mRNA-seq_pipeline.sh
get_difference_between_SNP_files.py
get_SNPs_between_the_two_other_alleles_from_two_SNP_files.py
get_SNP_file_with_negative_calls.py
samtools2SNP.sh
vcf2SNP.py
read.count.frequency.distribution.R
create_read_counts_file_for_all_replicates.py
create_final_output_file.py
statistics_per_locus.py
DE.fishers.exact.test.R
read.count.frequency.distribution.helper.functions.R
upper.quartile.normalization.R
assign_hybrid_expression_patterns.sh
assign.hybrid.expression.patterns.R
assign.expression.patterns.helper.functions.R
DE.study.global.variables.R
statistics.per.locus.file.helper.functions.R
DE.study.helper.functions.R
get.statistics.per.locus.information.for.hybrid.and.its.two.parents.R
filter_loci_by_genome_mapping_bias.sh
get_good_orthologs.py
genome_mapping_bias.sh
randomly_print_n_sequence_names_from_fasta_file_to_output_file.py
print_n_random_reads_of_length_x_from_subset_of_sequences_in_fasta_file_to_output_file.py
split_sam_hits_file.py
statistics_per_locus_for_multiple_separate_analyses.py
get.genome.mapping.bias.R
get_orthologs_with_low_genome_mapping_bias.py
filter_loci_by_genome_mapping_bias_ratio.py
process_gene_models_helper_functions.py
process_SNP_file__module.py
read_in_SNP_data__module.py
read_alignment_statistics_per_locus__module.py
fasta_file_helper_functions.py
sequence_processing__module.py
locus__helper_functions.py
chromosome_helper_functions__module.py
get_SNPs_for_read__module.py
README

-------------------------------------------------------------------------------
Instructions To Run Pipeline:
-------------------------------------------------------------------------------
1. Make sure that you have all the required input files (listed in a
separate section).
-------------------------------------------------------------------------------
2. Download the Bowtie index for your target organism into the directory
'bowtie_indices'.
-------------------------------------------------------------------------------
3. On the command line, move into the directory containing this README.
-------------------------------------------------------------------------------
4. Assign reads to alleles/loci and detect differential expression

	./POPE.sh <ref_parent> <other_parent> <top_level_reads_directory>
		<read_alignment_method> <bowtie_index_basename> 
		<genomic_reference_fasta_file_for_bwa>
		<gene_model_annotations_GFF3_file> <SNP_filename>
		<cutoff_for_filtering_SNPs_that_are_just_outside_exon_boundaries>
		<negative_SNP_call_coverage_threshold> <pvalue_threshold>
		<reads_processing_output_directory>
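
Because POPE.sh takes twelve positional arguments, it is easy to pass them in
the wrong order.  The sketch below (Python 3, not part of the pipeline)
assembles the command from named settings in the order given in the usage
line above.  Every value in example_settings is hypothetical and for
illustration only, apart from 'tophat' and the "a_thaliana" basename, which
are mentioned elsewhere in this README.

```python
# Argument names in the order shown in the POPE.sh usage line.
POPE_ARGUMENT_ORDER = [
    "ref_parent",
    "other_parent",
    "top_level_reads_directory",
    "read_alignment_method",
    "bowtie_index_basename",
    "genomic_reference_fasta_file_for_bwa",
    "gene_model_annotations_GFF3_file",
    "SNP_filename",
    "cutoff_for_filtering_SNPs_that_are_just_outside_exon_boundaries",
    "negative_SNP_call_coverage_threshold",
    "pvalue_threshold",
    "reads_processing_output_directory",
]


def build_pope_command(settings):
    """Return the POPE.sh command as a list, with arguments in order."""
    missing = [name for name in POPE_ARGUMENT_ORDER if name not in settings]
    if missing:
        raise ValueError("missing POPE.sh arguments: " + ", ".join(missing))
    return ["./POPE.sh"] + [str(settings[n]) for n in POPE_ARGUMENT_ORDER]


# Hypothetical values; replace every one of them with your own.
example_settings = {
    "ref_parent": "parentA",
    "other_parent": "parentB",
    "top_level_reads_directory": "reads",
    "read_alignment_method": "tophat",
    "bowtie_index_basename": "a_thaliana",
    "genomic_reference_fasta_file_for_bwa": "genome.fa",
    "gene_model_annotations_GFF3_file": "annotations/genes.gff3",
    "SNP_filename": "snps.txt",
    "cutoff_for_filtering_SNPs_that_are_just_outside_exon_boundaries": 0,
    "negative_SNP_call_coverage_threshold": 5,
    "pvalue_threshold": 0.05,
    "reads_processing_output_directory": "reads_processing_output",
}

command = build_pope_command(example_settings)
print(" ".join(command))
# To run it from the directory containing this README:
#   import subprocess
#   subprocess.check_call(command)
```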

-------------------------------------------------------------------------------
5. Identify loci with low genome mapping bias

	./filter_loci_by_genome_mapping_bias.sh <num_sampled_reads> 
		<num_reads_threshold> <sequencing_depth>
		<max_num_allelic_reads_threshold>
		<mapping_bias_ratio_threshold> <read_length>
		<bowtie_index_basename> <condition_list>
		<sequence_fasta_file_list>
		<sequence_feature_annotations_file_list>
		<GMB_estimation_output_directory>
		<filtered_loci_output_file>

-------------------------------------------------------------------------------
6. Adjust the genome mapping bias ratio threshold (Optional)

The user can quickly generate the set of loci for a new genome mapping bias
ratio threshold by running

	./filter_loci_by_genome_mapping_bias_ratio.py
		<genome_mapping_bias_ratio_file> <mapping_bias_ratio_threshold>
		<filtered_loci_output_file>

where <genome_mapping_bias_ratio_file> is
<GMB_estimation_output_directory>/genome_mapping_bias_ratio_for_each_locus_with_strong_orthology,
using the <GMB_estimation_output_directory> created in Step 5.

-------------------------------------------------------------------------------
7. Run the POPE core analysis methods to detect cis/trans effects and
additivity/non-additivity

	./assign_hybrid_expression_patterns.sh
		<ref_parent_statistics_per_locus_file_array>
		<other_parent_statistics_per_locus_file_array>
		<hybrid_parent_statistics_per_locus_file_array>
		<filtered_loci_file> <fold_change_threshold> <pvalue_threshold>
		<method> <number_of_points_to_sample>
		<hybrid_expression_pattern_output_directory>

where the first three arguments were generated in
<reads_processing_output_directory> by Step 4 and
<filtered_loci_file> is the file <filtered_loci_output_file> generated
in Step 5 (or in Step 6, if you adjusted the threshold).

-------------------------------------------------------------------------------
8. Look at the results

The final results for categorizing loci in terms of the presence or absence of
cis effects, trans effects, and additivity are listed in four separate files
in the directory <hybrid_expression_pattern_output_directory> that was created
in Step 7.
-------------------------------------------------------------------------------


-------------------------------------------------------------------------------
Descriptions of Selected Input Files:
-------------------------------------------------------------------------------

	<ref_parent> and <other_parent> are the names of two parental
	genotypes for which we have data on the reciprocal hybrids. 
	The choice of <ref_parent>* and <other_parent> must be 
	consistent between the runs of the two scripts mentioned above.

		* <ref_parent> does not necessarily have to be the parent which
		  is closest to the genomic reference sequence.  The
		  designations <ref_parent> and <other_parent> are used only
		  to keep track of which parent is which, both in the code
		  and in the SNP data files.

	<top_level_reads_directory> should contain one reads directory for each
	of the two parents and their reciprocal hybrids, named <ref_parent>,
	<other_parent>, real_hybrid_<ref_parent>_<other_parent>, and
	real_hybrid_<other_parent>_<ref_parent>.
	Each reads directory for a given condition <c> should contain
	a separate single-ended reads file for each biological replicate,
	named condition_<c>_replicate_<rep_number>.
	*No two reads files should ever be given the same name*

	<read_alignment_method> currently must be set to 'tophat'.

	<bowtie_index_basename> is the bowtie index basename of your target
	organism and it must match a basename in the directory 'bowtie_indices'
	(e.g. "a_thaliana" for Arabidopsis thaliana
	and "o_sativa" for Oryza sativa).



	Arguments <bowtie_index_basename>, <SNP_filename>, and
	<gene_model_annotations_GFF3_file> are described further in the
	next sections.
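
The layout of <top_level_reads_directory> described above can be checked
before running POPE.sh.  The following is a minimal sketch (Python 3, not
part of the pipeline).  It assumes that <c> in each reads filename is the
name of the enclosing reads directory, that <rep_number> is a whole number,
and that reads files carry no file extension; none of these details is
stated explicitly in this README.

```python
import os
import re


def expected_read_directories(ref_parent, other_parent):
    """Names of the four reads directories for two parents and their hybrids."""
    return [
        ref_parent,
        other_parent,
        "real_hybrid_%s_%s" % (ref_parent, other_parent),
        "real_hybrid_%s_%s" % (other_parent, ref_parent),
    ]


def check_reads_layout(top_level_dir, ref_parent, other_parent):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    seen = {}  # reads filename -> directory in which it was first seen
    for condition in expected_read_directories(ref_parent, other_parent):
        directory = os.path.join(top_level_dir, condition)
        if not os.path.isdir(directory):
            problems.append("missing reads directory: " + directory)
            continue
        # Assumed naming: condition_<directory name>_replicate_<digits>
        pattern = re.compile(
            "condition_" + re.escape(condition) + r"_replicate_\d+")
        for filename in sorted(os.listdir(directory)):
            if not pattern.fullmatch(filename):
                problems.append("unexpected filename: "
                                + os.path.join(directory, filename))
            # No two reads files may share a name, even across directories.
            if filename in seen:
                problems.append("duplicate reads filename %s in %s and %s"
                                % (filename, seen[filename], directory))
            else:
                seen[filename] = directory
    return problems
```

For example, check_reads_layout("reads", "parentA", "parentB") returns an
empty list when all four directories exist and every file in them follows
the assumed naming scheme (the directory and parent names here are
hypothetical).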
 
-------------------------------------------------------------------------------

2. The Bowtie index for the genome of the organism under study.  It is a
directory named "<bowtie_index_basename>.ebwt", stored within the parent
directory 'bowtie_indices' in the main pipeline directory.  For some organisms,
such as Arabidopsis thaliana, the directory can be found already generated on
the website for Bowtie, but for other organisms, such as Oryza sativa, it may
need to be generated by running the Bowtie command 'bowtie-build' on the
chromosome reference sequences.

-------------------------------------------------------------------------------

3. A SNP file, <SNP_filename>, with no header*, where each line has the
format:

		chromosome <tab> position <tab> refbase <tab> otherbase

The chromosome names in the SNP file must match the chromosome names returned 
by running the command

	bowtie-inspect --names <bowtie_index_basename>

from within the directory <bowtie_index_basename>.ebwt

* The SNP file may optionally contain a single-line header in which the
second tab-delimited field is either 'pos' or 'position'.
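
The SNP file format above can be read with a few lines of code.  This is a
minimal sketch (Python 3), not the parser used by the pipeline; the function
name is made up for illustration.  It assumes that positions are whole
numbers and that blank lines can be skipped, neither of which is stated in
this README.

```python
def parse_snp_lines(lines):
    """Yield (chromosome, position, refbase, otherbase) for each SNP line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are tolerated
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                "line %d: expected 4 tab-delimited fields, found %d"
                % (line_number, len(fields)))
        chromosome, position, refbase, otherbase = fields
        # Optional single-line header: second field is 'pos' or 'position'.
        if line_number == 1 and position in ("pos", "position"):
            continue
        yield chromosome, int(position), refbase, otherbase
```

Typical use would be to open the file and iterate over it, for example
"with open(snp_filename) as handle: snps = list(parse_snp_lines(handle))".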

-------------------------------------------------------------------------------

4. A GFF3 file in a special format

	A specific <gene_model_annotations_GFF3_file> for each
	of the supported organisms (Oryza sativa (MSU v6.1), Arabidopsis
	thaliana (TAIR 9), and Populus trichocarpa) must be placed
	in the subdirectory 'annotations'.

	Currently, these GFF3 files are not provided with the pipeline, but
	they are available on request by emailing me at
	<vmissirian@ucdavis.edu>.

	Other organisms may not be supported by the pipeline.

The provided GFF3 files should already be set up so that the chromosome names
match the chromosome names returned by running the command

	bowtie-inspect --names <bowtie_index_basename>

from within the directory <bowtie_index_basename>.ebwt
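
If you prepare your own annotation or SNP file, the chromosome names can be
compared against the index names.  The following is a minimal sketch
(Python 3, not part of the pipeline).  It assumes that you have saved the
output of 'bowtie-inspect --names' as text with one name per line, and that
the chromosome name is the first tab-delimited column of each non-comment
line, as in standard GFF3 and in the SNP file format above.  The function
names are made up for illustration.

```python
def names_from_bowtie_inspect(output_text):
    """Parse saved 'bowtie-inspect --names' output (one name per line)."""
    return set(line.strip() for line in output_text.splitlines()
               if line.strip())


def first_column_names(lines):
    """Collect the first tab-delimited column of every non-comment line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF3 directives, comments, and blank lines
        names.add(line.rstrip("\r\n").split("\t", 1)[0])
    return names


def unknown_sequence_names(file_names, index_names):
    """Names used in the file that are absent from the Bowtie index."""
    return sorted(set(file_names) - set(index_names))
```

An empty result from unknown_sequence_names means that every chromosome name
in the file also appears in the index.  A SNP file with the optional header
line would report the header's first field as unknown.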

-------------------------------------------------------------------------------
Pipeline Outputs:
-------------------------------------------------------------------------------

1. "POPE.sh" writes one results file for each of several
comparisons of interest, including

<parent_A> vs. <parent_B>		[parent-specific expression]
<hybrid_AB> vs. <hybrid_BA>		[expression differences between 
					 the two reciprocal hybrids]

<hybrid_AB> vs. <parent_A>		[test for hybrid expression patterns,
<hybrid_AB> vs. <parent_B>		 such as high-parent, low-parent,
<hybrid_AB> vs. <in_silico_hybrid>	 additive, over-dominant, and
					 under-dominant, for both reciprocal
<hybrid_BA> vs. <parent_A>		 hybrids]
<hybrid_BA> vs. <parent_B>
<hybrid_BA> vs. <in_silico_hybrid>

where <in_silico_hybrid> is a simulated average of the two parents
<parent_A> and <parent_B> and has a name of the form
"in_silico_hybrid_<parent_A>_<parent_B>".

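The comparisons listed above can be enumerated for a given pair of parents.
This is a minimal sketch (Python 3) that only builds the pairs of names; it
does not reproduce how POPE.sh names its results files.  It assumes that
<hybrid_AB> and <hybrid_BA> use the "real_hybrid_" directory names described
for <top_level_reads_directory>, and the list above says "including", so the
pipeline may produce further comparisons.

```python
def list_comparisons(parent_a, parent_b):
    """Return the (condition, condition) pairs listed in this README."""
    hybrid_ab = "real_hybrid_%s_%s" % (parent_a, parent_b)  # assumed name
    hybrid_ba = "real_hybrid_%s_%s" % (parent_b, parent_a)  # assumed name
    in_silico = "in_silico_hybrid_%s_%s" % (parent_a, parent_b)
    comparisons = [
        (parent_a, parent_b),    # parent-specific expression
        (hybrid_ab, hybrid_ba),  # differences between reciprocal hybrids
    ]
    # Each reciprocal hybrid is tested against both parents and against
    # the in silico hybrid.
    for hybrid in (hybrid_ab, hybrid_ba):
        comparisons.append((hybrid, parent_a))
        comparisons.append((hybrid, parent_b))
        comparisons.append((hybrid, in_silico))
    return comparisons


for left, right in list_comparisons("parentA", "parentB"):
    print("%s vs. %s" % (left, right))
```
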
2. 

-------------------------------------------------------------------------------
Pipeline Notes and Assumptions:
-------------------------------------------------------------------------------

1. The pipeline currently supports only Oryza sativa (MSU v6.1) and
Arabidopsis thaliana (TAIR 9).

-------------------------------------------------------------------------------
Contact:
-------------------------------------------------------------------------------
Please report any problems to me at vmissiri@gmail.com

