Before you build your own classification model, please run GeneScissors with the default model to make sure everything works correctly.
Because different populations and different RNA-seq setups can cause misalignment errors to behave differently, a new classification model is generally needed when you apply GeneScissors to a new dataset. If your RNA-seq read length is 100 bp and your samples are F1 mice, you can probably use the default learning model supplied with GeneScissors. If your dataset is different, consider training your own model to identify the suspicious transcripts.
First, you need to simulate RNA-seq datasets that mimic the properties of the real datasets you want to analyze (e.g. the read length and the population genome). You can use our RNA-seq simulator or Flux Simulator.
Then, you need to prepare the label file, which includes all genes and transcripts that are expressed in the simulation datasets. The label file includes six columns separated by tabs: Ensembl Transcript Id, Abundance Level, Ensembl Gene Id, Gene Name, and Transcript Name.
Each row of bam_file_list should now contain two tab-separated columns: the BAM file and the abundance file of the same individual. Then, you should run
python factory/manager.py --filelist bam_file_list --reftable reference.table --simulation
This will generate all the commands needed to build the training datasets. After you have run all of them, you should run
python shop/generate_learning_model.py
It will generate the classification model and save it to learning_model_all in the root folder of GeneScissors.
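The two-column bam_file_list used in simulation mode can be written by hand, but a short script may help when there are many individuals. The following is only a sketch and is not part of GeneScissors: the helper name write_bam_file_list and the example file names are hypothetical, and it assumes Python 3.

```python
import csv
from pathlib import Path


def write_bam_file_list(pairs, out_path="bam_file_list"):
    """Write one tab-separated row per individual: BAM file, abundance file.

    `pairs` is an iterable of (bam_path, abundance_path) tuples.
    This helper is hypothetical and not shipped with GeneScissors.
    """
    out_path = Path(out_path)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for bam, abundance in pairs:
            writer.writerow([str(bam), str(abundance)])
    return out_path


if __name__ == "__main__":
    # Placeholder file names; replace them with your own simulated data.
    write_bam_file_list([
        ("sim_individual1.bam", "sim_individual1.abundance.txt"),
        ("sim_individual2.bam", "sim_individual2.abundance.txt"),
    ])
```

The resulting file can then be passed to factory/manager.py through --filelist as shown above.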
The training data output is stored in a folder named simulation_data, so please remember to remove the outputs in that folder before you build a new model.
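Clearing the old outputs can be scripted. The sketch below is not part of GeneScissors: the function name clear_training_outputs is hypothetical, and it assumes Python 3 and that simulation_data sits under the directory you pass as root (adjust root if your folder lives elsewhere). It empties the folder but keeps the folder itself.

```python
import shutil
from pathlib import Path


def clear_training_outputs(root=".", folder="simulation_data"):
    """Delete everything inside root/folder and return the number of entries removed.

    Assumption: the training outputs live directly under `root`;
    the folder itself is left in place.
    """
    target = Path(root) / folder
    if not target.is_dir():
        return 0  # nothing to clean
    removed = 0
    for entry in target.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed


if __name__ == "__main__":
    count = clear_training_outputs(".")
    print("removed %d entries from simulation_data" % count)
```

Deletion is permanent, so copy any outputs you still need before running it.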