Compgen Tool Suite

Collaborative Cross Genomes

On this page we provide full genomic sequences for the Collaborative Cross (CC) mouse strains in the form of FASTA files for the 19 autosomes, sex chromosomes (X and Y), and mitochondria (M). They can be used as reference sequences for high-throughput short-read alignments, or for any other comparative genomic analyses.

Each genome comes with a companion MOD file, which can be used to remap coordinates from the FASTA sequences back to reference coordinates (currently NCBI37/mm9). This is essential because all gene and genomic annotations are relative to the reference. MOD files are genome- and version-specific, so each MOD file should always be downloaded together with its FASTA file as a set.
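The remapping idea can be illustrated with a minimal sketch. It does not parse the actual MOD file format; the edit tuples `(ref_pos, kind, length)` and the function names `build_blocks` and `pseudo_to_ref` are hypothetical and exist only for this example. It assumes 0-based coordinates, non-overlapping edits sorted by reference position, and that SNPs need no entry because they do not shift coordinates.

```python
from bisect import bisect_right


def build_blocks(edits, ref_length):
    """Turn a sorted list of indel edits into coordinate blocks.

    edits: hypothetical (ref_pos, kind, length) tuples, 0-based, where
      'd' means reference bases [ref_pos, ref_pos + length) are absent
          from the pseudogenome, and
      'i' means `length` extra bases sit in the pseudogenome just before
          reference base ref_pos.
    Returns (pseudo_start, ref_start, kind) tuples, where kind is
    'm' (matched stretch) or 'i' (inserted stretch).
    """
    blocks = []
    ref_cursor = 0
    pseudo_cursor = 0
    for ref_pos, kind, length in edits:
        if ref_pos > ref_cursor:
            # Unchanged stretch between the previous edit and this one.
            blocks.append((pseudo_cursor, ref_cursor, "m"))
            pseudo_cursor += ref_pos - ref_cursor
            ref_cursor = ref_pos
        if kind == "d":
            ref_cursor += length  # skipped in the pseudogenome
        elif kind == "i":
            blocks.append((pseudo_cursor, ref_cursor, "i"))
            pseudo_cursor += length  # extra pseudogenome bases
        else:
            raise ValueError(f"unknown edit kind: {kind!r}")
    if ref_cursor < ref_length:
        blocks.append((pseudo_cursor, ref_cursor, "m"))
    return blocks


def pseudo_to_ref(blocks, pseudo_pos):
    """Map a 0-based pseudogenome position to a reference position.

    Inserted bases have no reference coordinate of their own, so this
    sketch anchors them to the next reference base.
    """
    if pseudo_pos < 0:
        raise ValueError("position must be non-negative")
    starts = [block[0] for block in blocks]
    pseudo_start, ref_start, kind = blocks[bisect_right(starts, pseudo_pos) - 1]
    if kind == "i":
        return ref_start
    return ref_start + (pseudo_pos - pseudo_start)


# A 2-base deletion at reference position 5 and a 3-base insertion
# before reference position 10, on a 20-base toy reference.
blocks = build_blocks([(5, "d", 2), (10, "i", 3)], ref_length=20)
print(pseudo_to_ref(blocks, 5))
```

In this toy case, pseudogenome position 5 lands on reference position 7, because the two deleted reference bases are skipped.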

We supply two types of genomes: sequenced and imputed. Sequenced genomes result from direct DNA sequencing at a minimum of 30x coverage, followed by an iterative alignment process. Imputed genomes are derived from genotype data: we first construct a haplotype mosaic and then assemble an imputed genome from segments of DNA sequence taken from the known DNA contributors.
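The assembly step for an imputed genome can be sketched as follows. This is a toy illustration, not the pipeline used to build the genomes on this page: the function name `assemble_imputed` and the `(start, end, founder)` mosaic layout are hypothetical, and it assumes that all contributor sequences share one coordinate system, which real founder genomes with indels do not.

```python
def assemble_imputed(mosaic, founder_seqs):
    """Stitch an imputed sequence together from founder segments.

    mosaic: hypothetical list of (start, end, founder) half-open
        intervals that tile the chromosome from position 0 without gaps.
    founder_seqs: dict mapping founder name to its sequence string.
    """
    pieces = []
    expected_start = 0
    for start, end, founder in mosaic:
        if start != expected_start:
            raise ValueError(f"mosaic has a gap or overlap at {start}")
        # Copy this interval from the contributing founder.
        pieces.append(founder_seqs[founder][start:end])
        expected_start = end
    return "".join(pieces)


# Toy founders and a two-segment mosaic.
founders = {"A/J": "AAAAAAAAAA", "CAST/EiJ": "CCCCCCCCCC"}
mosaic = [(0, 4, "A/J"), (4, 10, "CAST/EiJ")]
print(assemble_imputed(mosaic, founders))
```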

CC Founders

The CC founder strain genomes are derived from variant data and BAM files provided by the Wellcome Trust Sanger Institute, and described by Keane et al. Statistics on the number of variants relative to the reference sequence are also provided.

Last update: 2012-11-08

Strain         Pseudogenome Statistics   Type        Downloads
A/J                                      sequenced   FASTA (734 MB), MOD (23 MB)
C57BL/6J                                 sequenced   FASTA (734 MB), MOD (<1 MB)
129S1/SvImJ                              sequenced   FASTA (734 MB), MOD (25 MB)
NOD/ShiLtJ                               sequenced   FASTA (734 MB), MOD (23 MB)
NZO/HlLtJ                                sequenced   FASTA (734 MB), MOD (24 MB)
CAST/EiJ                                 sequenced   FASTA (734 MB), MOD (92 MB)
PWK/PhJ                                  sequenced   FASTA (734 MB), MOD (89 MB)
WSB/EiJ                                  sequenced   FASTA (734 MB), MOD (34 MB)

MD5 checksums

CC Strains

Last update: 2012-11-15

Strain         Pseudogenome Statistics   Type        Downloads
OR867                                    imputed     FASTA (734 MB), MOD (36 MB)
OR1237                                   imputed     FASTA (734 MB), MOD (32 MB)
OR1515                                   imputed     FASTA (734 MB), MOD (34 MB)
OR3154                                   imputed     FASTA (734 MB), MOD (32 MB)
OR3252                                   imputed     FASTA (734 MB), MOD (28 MB)
OR559                                    imputed     FASTA (734 MB), MOD (38 MB)
IL16211                                  imputed     FASTA (734 MB), MOD (28 MB)
IL16188                                  imputed     FASTA (734 MB), MOD (31 MB)
OR477                                    imputed     FASTA (734 MB), MOD (28 MB)
OR5489                                   imputed     FASTA (734 MB), MOD (35 MB)
AU8041                                   imputed     FASTA (734 MB), MOD (36 MB)
AU8043                                   imputed     FASTA (734 MB), MOD (33 MB)

MD5 checksums

Sanger Strains

We also present pseudogenomes of other mouse strains from the Mouse Genomes Project at the Wellcome Trust Sanger Institute.

The variants are derived from the SNP and indel data sets at ftp://ftp-mouse.sanger.ac.uk/REL-1111-SNPs/ .

Last update: 2012-11-19

Strain         Pseudogenome Statistics   Type        Downloads
129P2/OlaHsd                             sequenced   FASTA (734 MB), MOD (33 MB)
129S5SvEvBrd                             sequenced   FASTA (734 MB), MOD (29 MB)
AKR/J                                    sequenced   FASTA (734 MB), MOD (31 MB)
BALB/cJ                                  sequenced   FASTA (734 MB), MOD (27 MB)
C3H/HeJ                                  sequenced   FASTA (734 MB), MOD (31 MB)
C57BL/6NJ                                sequenced   FASTA (734 MB), MOD (<1 MB)
CBA/J                                    sequenced   FASTA (734 MB), MOD (31 MB)
DBA/2J                                   sequenced   FASTA (734 MB), MOD (30 MB)
FVB/NJ                                   sequenced   FASTA (734 MB), MOD (25 MB)
LP/J                                     sequenced   FASTA (734 MB), MOD (32 MB)
SPRET/EiJ                                sequenced   FASTA (734 MB), MOD (193 MB)

MD5 checksums

Pseudogenome Tools

We provide a suite of tools that simplify the incorporation of our pseudogenomes into standard analysis and high-throughput sequencing pipelines.

[Lapels]

Lapels is used to remap pseudogenome alignments, in the form of a BAM file, back to the reference sequence. This entails removing all indels by modifying the CIGAR string (the underlying read sequence is unaltered) and adjusting the starting positions of the fragment and its mate. Lapels also annotates each read with the number and types (SNPs, insertions, and deletions) of sequence variants it contains.
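A minimal sketch of the CIGAR adjustment is shown below. It is not the Lapels implementation: the function name `remap_read` and the per-base lookup table are hypothetical, positions are 0-based, and the read is assumed to align to the pseudogenome without gaps. Pseudogenome bases that have no reference counterpart become `I` operations, skipped reference bases become `D` operations, and the read sequence itself is never touched.

```python
from itertools import groupby


def remap_read(pseudo_start, read_length, pseudo_to_ref):
    """Re-express a gapless pseudogenome alignment in reference terms.

    pseudo_to_ref: hypothetical lookup where entry p holds the reference
        position of pseudogenome base p, or None if that base was
        inserted relative to the reference.
    Returns (ref_start, cigar); ref_start is None if no base maps.
    """
    ops = []          # one CIGAR operation letter per base or gap
    ref_start = None
    prev_ref = None
    for p in range(pseudo_start, pseudo_start + read_length):
        r = pseudo_to_ref[p]
        if r is None:
            ops.append("I")  # read base absent from the reference
            continue
        if ref_start is None:
            ref_start = r    # first base with a reference coordinate
        elif r > prev_ref + 1:
            # Reference bases missing from the pseudogenome.
            ops.extend("D" * (r - prev_ref - 1))
        ops.append("M")
        prev_ref = r
    # Run-length encode the per-base operations into a CIGAR string.
    cigar = "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
    return ref_start, cigar


# Toy lookup: reference bases 5-6 are deleted in the pseudogenome and
# three bases are inserted before reference base 10.
table = [0, 1, 2, 3, 4, 7, 8, 9, None, None, None] + list(range(10, 20))
print(remap_read(3, 10, table))
```

In this toy case a 10-base read starting at pseudogenome position 3 is reported at reference position 3 with the CIGAR string 2M2D3M3I2M; the `M` and `I` lengths still sum to the read length.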

The input consists of the BAM file of pseudogenome alignments and the MOD file associated with the FASTA sequences used in the alignment. (Please download the MOD and FASTA files together.)

The output is a BAM file with corrected read positions, CIGAR strings, and annotation tags. It has been tested for compatibility with downstream tools such as IGV (using the reference genome) and Cufflinks (using any reference-based transcript library).

The code for Lapels is written in Python and can be downloaded as a tarball from here. It requires the pysam and argparse libraries.

The code and documentation are hosted at http://code.google.com/p/lapels/.

Please report bugs or make suggestions at http://code.google.com/p/lapels/issues/list .

[Suspenders]

Suspenders merges the results of multiple alignments (BAM files) applied to the same set of reads. It is used when working with F1 and RIX crosses, where we suggest performing separate alignments to each parental genome. Suspenders then effectively merges and annotates these separate BAM files into a single consensus BAM file.

When a read maps to the same genomic location in both alignments, only one copy is output. Where the alignments differ in mapping position or in the number of mappings, Suspenders determines the most likely alignment and source genome for the read and writes that alignment to the output BAM file. When there is no significant difference between the alignments, all of the mappings are output.

Suspenders is written in Python. It requires the pysam and argparse libraries.

The code for Suspenders is available at http://code.google.com/p/suspenders/.
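The three cases above can be sketched with a toy decision rule. This is an illustration only, not Suspenders' actual logic: the function name `merge_read`, the `(chrom, pos, mismatches)` hit layout, and the use of mismatch count as the measure of "most likely" are all assumptions made for the example. Hits are assumed to be in reference coordinates already, with no duplicate locations within one list.

```python
def merge_read(hits_a, hits_b, label_a="A", label_b="B"):
    """Merge one read's candidate mappings from two parental alignments.

    hits_a, hits_b: hypothetical lists of (chrom, pos, mismatches).
    Returns a sorted list of (chrom, pos, source) records, where source
    is label_a, label_b, or "both".
    """
    locs_a = sorted((c, p) for c, p, _ in hits_a)
    locs_b = sorted((c, p) for c, p, _ in hits_b)
    if locs_a == locs_b:
        # Same location(s) in both alignments: emit each one once.
        return [(c, p, "both") for c, p in locs_a]

    best_a = min((h[2] for h in hits_a), default=None)
    best_b = min((h[2] for h in hits_b), default=None)
    if best_b is None or (best_a is not None and best_a < best_b):
        winner, best, label = hits_a, best_a, label_a
    elif best_a is None or best_b < best_a:
        winner, best, label = hits_b, best_b, label_b
    else:
        # No difference under this toy score: keep every mapping.
        merged = {}
        for source, hits in ((label_a, hits_a), (label_b, hits_b)):
            for c, p, _ in hits:
                merged[(c, p)] = "both" if (c, p) in merged else source
        return [(c, p, s) for (c, p), s in sorted(merged.items())]
    # Keep only the winner's best-scoring mapping(s).
    return sorted((c, p, label) for c, p, m in winner if m == best)


print(merge_read([("chr1", 100, 0)], [("chr1", 100, 1)]))
print(merge_read([("chr1", 100, 0)], [("chr2", 500, 2)]))
print(merge_read([("chr1", 100, 1)], [("chr2", 500, 1)]))
```

The three calls exercise the three cases in turn: an identical location, a clear winner, and a tie in which both mappings are kept.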

Publications

S. Huang, C.-Y. Kao, L. McMillan, and W. Wang. Transforming genomes using mod files with applications. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013.

J. Holt, S. Huang, L. McMillan, and W. Wang. Read annotation pipeline for high-throughput sequencing data. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013.
