UNC Systems Genetics

Collaborative Cross Genomes

On this page we provide full genomic sequences for the Collaborative Cross (CC) mouse strains in the form of FASTA files for the 19 autosomes, sex chromosomes (X and Y), and mitochondria (M). They can be used as reference sequences for high-throughput short-read alignments, or for any other comparative genomic analyses.

Each genome comes with a companion MOD file, which can be used to remap coordinates from the FASTA sequences back to reference coordinates (currently NCBI37/mm9). This is essential because all gene and genomic annotations are given relative to the reference. MOD files are specific to a genome and version, so each FASTA file and its MOD file should always be downloaded together as a set.
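To illustrate why a MOD file is needed, the following minimal sketch shows how insertions and deletions shift coordinates between a pseudogenome and the reference. The `Indel` record and the `to_reference` function are hypothetical names introduced for illustration only; they do not reflect the actual MOD file layout or the code in Lapels.

```python
from typing import List, NamedTuple


class Indel(NamedTuple):
    """Hypothetical indel record; NOT the real MOD file layout."""
    ref_pos: int  # 0-based reference position where the event occurs
    kind: str     # "ins": bases added before ref_pos; "del": reference bases removed
    length: int


def to_reference(pseudo_pos: int, indels: List[Indel]) -> int:
    """Map a 0-based pseudogenome position to a 0-based reference position.

    Positions that fall inside an inserted segment have no reference
    counterpart; this sketch anchors them to the reference base that
    follows the insertion.
    """
    offset = 0  # pseudogenome coordinate minus reference coordinate so far
    for indel in sorted(indels, key=lambda i: i.ref_pos):
        start = indel.ref_pos + offset  # where the event sits in the pseudogenome
        if pseudo_pos < start:
            break  # the query position lies before this event
        if indel.kind == "ins":
            if pseudo_pos < start + indel.length:
                return indel.ref_pos  # inside the inserted bases
            offset += indel.length
        elif indel.kind == "del":
            offset -= indel.length
        else:
            raise ValueError(f"unknown indel kind: {indel.kind!r}")
    return pseudo_pos - offset
```

For example, with a 2-base insertion before reference position 2 and a 1-base deletion at reference position 5, pseudogenome position 7 maps back to reference position 6.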

We supply two types of genomes: sequenced and imputed. Sequenced genomes result from direct DNA sequencing at a minimum of 30x coverage, followed by an iterative alignment process. Imputed genomes are derived from genotype data: we first construct a haplotype mosaic and then assemble an imputed genome using segments of DNA sequence from the known DNA contributors.
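The idea of assembling an imputed genome from a haplotype mosaic can be sketched as follows. This is a toy illustration, not the procedure used to build the files on this page: the `assemble_imputed` function and the `(start, end, founder)` segment layout are assumptions, and the sketch pretends that all contributor sequences share one coordinate system, which does not hold once indels are present.

```python
from typing import Dict, List, Tuple

# Hypothetical mosaic segment: half-open interval [start, end) and the
# name of the contributing strain for that interval.
Segment = Tuple[int, int, str]


def assemble_imputed(mosaic: List[Segment], founders: Dict[str, str]) -> str:
    """Stitch a chromosome together from contributor sequences.

    Simplifying assumption: every contributor sequence uses the same
    coordinates, so a segment can be copied by plain slicing.
    """
    pieces = []
    expected_start = 0
    for start, end, founder in mosaic:
        if start != expected_start or end <= start:
            raise ValueError("mosaic segments must be contiguous and non-empty")
        pieces.append(founders[founder][start:end])
        expected_start = end
    return "".join(pieces)
```

With two toy contributors, `assemble_imputed([(0, 3, "A"), (3, 8, "B")], {"A": "AAAAAAAA", "B": "CCCCCCCC"})` copies the first three bases from `A` and the remaining five from `B`.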

 

CC Founders

The CC founder strain genomes are derived from variant data and BAM files provided by the Wellcome Trust Sanger Institute, and described by Keane et al. Statistics on the number of variants relative to the reference sequence are also provided.

Last update: 2012-11-08


Strain         Pseudogenome Statistics   Type        Downloads
A/J                                      sequenced   FASTA (734 MB), MOD (23 MB)
C57BL/6J                                 sequenced   FASTA (734 MB), MOD (<1 MB)
129S1/SvImJ                              sequenced   FASTA (734 MB), MOD (25 MB)
NOD/ShiLtJ                               sequenced   FASTA (734 MB), MOD (23 MB)
NZO/HlLtJ                                sequenced   FASTA (734 MB), MOD (24 MB)
CAST/EiJ                                 sequenced   FASTA (734 MB), MOD (92 MB)
PWK/PhJ                                  sequenced   FASTA (734 MB), MOD (89 MB)
WSB/EiJ                                  sequenced   FASTA (734 MB), MOD (34 MB)
MD5 checksums

 

CC Strains

Last update: 2012-11-15


Strain         Pseudogenome Statistics   Type        Downloads
OR867                                    imputed     FASTA (734 MB), MOD (36 MB)
OR1237                                   imputed     FASTA (734 MB), MOD (32 MB)
OR1515                                   imputed     FASTA (734 MB), MOD (34 MB)
OR3154                                   imputed     FASTA (734 MB), MOD (32 MB)
OR3252                                   imputed     FASTA (734 MB), MOD (28 MB)
OR559                                    imputed     FASTA (734 MB), MOD (38 MB)
IL16211                                  imputed     FASTA (734 MB), MOD (28 MB)
IL16188                                  imputed     FASTA (734 MB), MOD (31 MB)
OR477                                    imputed     FASTA (734 MB), MOD (28 MB)
OR5489                                   imputed     FASTA (734 MB), MOD (35 MB)
AU8041                                   imputed     FASTA (734 MB), MOD (36 MB)
AU8043                                   imputed     FASTA (734 MB), MOD (33 MB)
MD5 checksums

Sanger Strains

We also present pseudogenomes of other mouse strains from the Mouse Genomes Project at the Wellcome Trust Sanger Institute.

The variants are derived from the SNP and indel data sets at ftp://ftp-mouse.sanger.ac.uk/REL-1111-SNPs/

Last update: 2012-11-19


Strain         Pseudogenome Statistics   Type        Downloads
129P2/OlaHsd                             sequenced   FASTA (734 MB), MOD (33 MB)
129S5SvEvBrd                             sequenced   FASTA (734 MB), MOD (29 MB)
AKR/J                                    sequenced   FASTA (734 MB), MOD (31 MB)
BALB/cJ                                  sequenced   FASTA (734 MB), MOD (27 MB)
C3H/HeJ                                  sequenced   FASTA (734 MB), MOD (31 MB)
C57BL/6NJ                                sequenced   FASTA (734 MB), MOD (<1 MB)
CBA/J                                    sequenced   FASTA (734 MB), MOD (31 MB)
DBA/2J                                   sequenced   FASTA (734 MB), MOD (30 MB)
FVB/NJ                                   sequenced   FASTA (734 MB), MOD (25 MB)
LP/J                                     sequenced   FASTA (734 MB), MOD (32 MB)
SPRET/EiJ                                sequenced   FASTA (734 MB), MOD (193 MB)
MD5 checksums

 

Pseudogenome Tools

We provide a suite of tools that simplify the incorporation of our pseudogenomes into standard analysis and high-throughput sequencing pipelines.

[Lapels]

Lapels is used to remap pseudogenome alignments, in the form of a BAM file, back to the reference sequence. This entails the removal of all indels (by modifying the CIGAR string; the underlying read sequence is unaltered) and adjustments to the starting positions of the fragment and its mate. Lapels also annotates the number and types (SNPs, insertions, and deletions) of sequence variants seen in each read.
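For readers less familiar with CIGAR strings, the sketch below parses one and reports how many reference and read bases the alignment consumes, following the general SAM convention. It is a generic illustration of why an indel changes the CIGAR string and the reference span while leaving the read sequence untouched; it is not taken from Lapels, and the function names are made up for this example.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '10M2I5M' into (length, op) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar: str) -> int:
    """Number of read bases described by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)
```

In `3S10M2I5M1D4M`, for instance, the insertion (`2I`) adds read bases without advancing along the reference, while the deletion (`1D`) does the opposite.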

The input consists of the BAM file of the pseudogenome alignment and the MOD file associated with the FASTA sequences used in that alignment. (Please download the MOD and FASTA files together.)

The output is a BAM file with corrected read positions, CIGAR strings, and annotated tags. It has been tested to be compatible with downstream tools, such as IGV (using the reference genome) and Cufflinks (using any reference-based transcript library).

The code for Lapels is written in Python and can be downloaded as a tarball from here. It requires the pysam library and the argparse library.

The code and documentation are hosted at http://code.google.com/p/lapels/.

Please report bugs or make suggestions at http://code.google.com/p/lapels/issues/list.

 

[Suspenders]

Suspenders merges the results of multiple alignments (BAM files) applied to the same set of reads. It is used when working with F1 and RIX crosses, where we suggest performing separate alignments to each parental genome. Suspenders then effectively merges and annotates these separate BAM files into a single consensus BAM file.

When a read maps to the same genomic location in both alignments, only one copy is output. Where the alignments differ in mapping position or in the number of mappings, Suspenders determines the most likely alignment and source genome for the read, and that alignment is written to the output BAM file. When there is no significant difference between the alignments, all of the multiple mappings are output.
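The three cases described above can be sketched as a small decision function. This is a hypothetical illustration, not the Suspenders implementation: the `Hit` record, the `merge_hits` function, the labels, and in particular the use of mismatch counts to decide which alignment is "most likely" are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical alignment record in reference coordinates."""
    chrom: str
    pos: int
    mismatches: int


def _dedupe(hits: List[Hit]) -> List[Hit]:
    """Keep one hit per locus, preferring the one with fewer mismatches."""
    by_locus: Dict[Tuple[str, int], Hit] = {}
    for hit in hits:
        key = (hit.chrom, hit.pos)
        if key not in by_locus or hit.mismatches < by_locus[key].mismatches:
            by_locus[key] = hit
    return sorted(by_locus.values())


def merge_hits(maternal: List[Hit], paternal: List[Hit]) -> Tuple[str, List[Hit]]:
    """Merge one read's hits from two parental alignments.

    Assumption: fewer mismatches means a more likely alignment.
    """
    mat_loci = {(h.chrom, h.pos) for h in maternal}
    pat_loci = {(h.chrom, h.pos) for h in paternal}
    if mat_loci == pat_loci:
        # Same location(s) in both alignments: output a single copy.
        return "both", _dedupe(maternal + paternal)
    best_mat = min((h.mismatches for h in maternal), default=float("inf"))
    best_pat = min((h.mismatches for h in paternal), default=float("inf"))
    if best_mat < best_pat:
        return "maternal", _dedupe(maternal)
    if best_pat < best_mat:
        return "paternal", _dedupe(paternal)
    # No clear winner: keep every distinct mapping.
    return "ambiguous", _dedupe(maternal + paternal)
```

The returned label stands in for the source-genome annotation that the merged BAM file would carry for each read.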

Suspenders is written in Python. It requires the pysam and argparse libraries.

The code for Suspenders is available at http://code.google.com/p/suspenders/.

 

Publications

S. Huang, C.-Y. Kao, L. McMillan, and W. Wang. Transforming genomes using mod files with applications. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013. [link]

J. Holt, S. Huang, L. McMillan, and W. Wang. Read annotation pipeline for high-throughput sequencing data. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013. [link]


