UNC Systems Genetics
Collaborative Cross Genomes
On this page we provide full genomic sequences for the Collaborative Cross (CC) mouse strains in the form of FASTA files for the 19 autosomes, the sex chromosomes (X and Y), and the mitochondrial genome (M). They can be used as reference sequences for high-throughput short-read alignment, or for any other comparative genomic analysis.
Each genome comes with a companion MOD file, which can be used to remap coordinates from the FASTA sequences back to reference coordinates (currently NCBI37/mm9). This is essential, since all gene and genomic annotations are relative to the reference. MOD files are genome and version specific, so a FASTA file and its MOD file should always be downloaded together as a set.
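The kind of coordinate remapping a MOD file makes possible can be illustrated with a minimal Python sketch. It does not read the actual MOD format. The `(ref_pos, kind, size)` indel tuples and the function names `build_blocks` and `strain_to_ref` are assumptions made up for this illustration. The idea is to cut the strain genome into gap-free blocks that line up with the reference and then look up the block containing a given strain position.

```python
from bisect import bisect_right


def build_blocks(ref_length, indels):
    """Turn a sorted list of indels into gap-free aligned blocks.

    indels: (ref_pos, kind, size) tuples, 0-based, sorted, non-overlapping.
      kind "i": `size` extra strain bases inserted before reference base ref_pos
      kind "d": `size` reference bases starting at ref_pos are absent in the strain
    Returns a list of (strain_start, ref_start, length) blocks.
    NOTE: this tuple layout is hypothetical, not the real MOD file format.
    """
    blocks = []
    strain_pos = 0
    ref_pos = 0
    for pos, kind, size in indels:
        if pos > ref_pos:
            # Bases shared by strain and reference up to the next indel.
            blocks.append((strain_pos, ref_pos, pos - ref_pos))
            strain_pos += pos - ref_pos
            ref_pos = pos
        if kind == "i":
            strain_pos += size  # strain has bases the reference lacks
        elif kind == "d":
            ref_pos += size     # reference has bases the strain lacks
        else:
            raise ValueError(f"unknown indel kind: {kind!r}")
    if ref_pos < ref_length:
        blocks.append((strain_pos, ref_pos, ref_length - ref_pos))
    return blocks


def strain_to_ref(blocks, strain_pos):
    """Map a 0-based strain position to the reference, or None if the
    position falls inside an insertion or past the last block."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, strain_pos) - 1
    if i < 0:
        return None
    strain_start, ref_start, length = blocks[i]
    offset = strain_pos - strain_start
    return ref_start + offset if offset < length else None


# Toy example: a 20-base reference with a 2-base insertion before
# reference base 5 and a 3-base deletion starting at reference base 10.
blocks = build_blocks(20, [(5, "i", 2), (10, "d", 3)])
```

In this toy example, strain positions 5 and 6 are inserted bases with no reference counterpart, so `strain_to_ref` returns `None` for them, while positions downstream of the deletion are shifted by the net indel length.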
We supply two types of genomes: sequenced and imputed. Sequenced genomes result from direct DNA sequencing at a minimum of 30x coverage, followed by an iterative alignment process. Imputed genomes are derived from genotype data, where we first construct a haplotype mosaic and then assemble an imputed genome using segments of DNA sequence from known DNA contributors.
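The assembly step for imputed genomes can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the pipeline used to build the files on this page: the mosaic is given as contiguous half-open `(start, end, contributor)` intervals, all contributor sequences share one coordinate system (that is, they differ only by SNPs), and the names `assemble_imputed`, `mosaic`, and `contributor_seqs` are hypothetical.

```python
def assemble_imputed(mosaic, contributor_seqs):
    """Stitch an imputed sequence from a haplotype mosaic.

    mosaic: list of (start, end, contributor) half-open intervals that
            tile the chromosome from position 0 without gaps.
    contributor_seqs: dict mapping contributor name -> sequence string.
    ASSUMPTION: all contributor sequences use the same coordinates
    (no indels), which real genomes do not satisfy.
    """
    pieces = []
    expected_start = 0
    for start, end, contributor in mosaic:
        if start != expected_start:
            raise ValueError(f"mosaic has a gap or overlap at {start}")
        pieces.append(contributor_seqs[contributor][start:end])
        expected_start = end
    return "".join(pieces)


# Toy example with two made-up contributors, "A" and "B".
toy_seqs = {"A": "AAAAAAAAAA", "B": "CCCCCCCCCC"}
toy_mosaic = [(0, 4, "A"), (4, 10, "B")]
imputed = assemble_imputed(toy_mosaic, toy_seqs)
```

With real data the contributors also differ by insertions and deletions, so segment boundaries would have to be tracked in reference coordinates and translated per contributor.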
The CC founder strain genomes are derived from variant data and BAM files provided by the Wellcome Trust Sanger Institute, and described by Keane et al. Statistics on the number of variants relative to the reference sequence are also provided.
Last update: 2012-11-08
Last update: 2012-11-15
The variants are derived from the SNP and indel data sets at ftp://ftp-mouse.sanger.ac.uk/REL-1111-SNPs/ .
Last update: 2012-11-19
We provide a suite of tools that simplify the incorporation of our pseudogenomes into standard analysis and HiSeq pipelines.
Lapels is used to remap pseudogenome alignments, in the form of a BAM file, back to the reference sequence. This entails the removal of all indels (by modifying the CIGAR string; the underlying sequence is unaltered) and adjustments to the starting positions of the fragment and its mate. Lapels also annotates the number and types (SNPs, insertions, and deletions) of sequence variants seen in each read.
The input consists of the BAM file of the pseudogenome alignment and the MOD file associated with the FASTA sequences used in the alignment. (Please download the MOD and FASTA files together.)
The output is a BAM file with corrected read positions, CIGAR strings, and annotated tags. It has been tested to be compatible with downstream tools, such as IGV (using the reference genome) and Cufflinks (using any reference-based transcript library).
The code and documentation are hosted at http://code.google.com/p/lapels/.
Please report bugs or give suggestions at http://code.google.com/p/lapels/issues/list .
Suspenders merges the results of multiple alignments (BAM files) applied to the same set of reads. It is used when working with F1 and RIX crosses, where we suggest performing separate alignments to each parental genome. Suspenders then effectively merges and annotates these separate BAM files into a single consensus BAM file.
When a read maps to the same genomic location in both alignments, only one copy of the read is output. Where there are differences in either mapping positions or multiplicity of reads, Suspenders determines the most likely alignment and source genome for the read, which is sent to the output BAM file. When there is no significant difference between the alignments, all multiple mappings are output.
The code for Suspenders is available at http://code.google.com/p/suspenders/.
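The three cases described above can be laid out as a short Python sketch. It is not the Suspenders implementation. It assumes both alignments have already been remapped to reference coordinates, and it uses edit distance as a stand-in for whatever criterion Suspenders applies to decide the most likely alignment. The `Hit` type, the `merge_read` function, and the origin labels "A", "B", and "both" are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str
    pos: int            # reference coordinate after remapping
    edit_distance: int  # ASSUMED scoring criterion, lower is better


def merge_read(hits_a, hits_b):
    """Merge one read's hits from two parental alignments.

    Returns a list of (hit, origin) pairs, origin in {"A", "B", "both"}.
    """
    by_loc_a = {(h.chrom, h.pos): h for h in hits_a}
    by_loc_b = {(h.chrom, h.pos): h for h in hits_b}

    # Case 1: same location(s) in both alignments -> output one copy.
    if by_loc_a and by_loc_a.keys() == by_loc_b.keys():
        return [(by_loc_a[loc], "both") for loc in sorted(by_loc_a)]

    # Case 2: alignments differ -> keep the parent with the better score.
    best_a = min((h.edit_distance for h in hits_a), default=float("inf"))
    best_b = min((h.edit_distance for h in hits_b), default=float("inf"))
    if best_a < best_b:
        return [(h, "A") for h in hits_a if h.edit_distance == best_a]
    if best_b < best_a:
        return [(h, "B") for h in hits_b if h.edit_distance == best_b]

    # Case 3: no clear winner -> output every distinct mapping.
    merged = [(h, "A") for h in hits_a]
    merged += [(h, "B") for h in hits_b if (h.chrom, h.pos) not in by_loc_a]
    return merged


same = merge_read([Hit("chr1", 100, 0)], [Hit("chr1", 100, 1)])
better = merge_read([Hit("chr1", 100, 0)], [Hit("chr2", 500, 3)])
tied = merge_read([Hit("chr1", 100, 1)], [Hit("chr2", 500, 1)])
```

Here `same` holds a single hit with origin "both", `better` keeps only the hit from parent A, and `tied` keeps both mappings because neither alignment is clearly preferable.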
S. Huang, C.-Y. Kao, L. McMillan, and W. Wang. Transforming genomes using mod files with applications. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013. [link]
J. Holt, S. Huang, L. McMillan, and W. Wang. Read annotation pipeline for high-throughput sequencing data. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013. [link]