I have a GFF file whose contents look like this (tab-separated):
# start gene 1Chr.g1
1Chr AUGUSTUS gene 3636 5916 0.1 + . ID=1Chr.g1
1Chr AUGUSTUS transcript 3636 5916 0.1 + . ID=1Chr.g1.t1;Parent=1Chr.g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . + . Parent=1Chr.g1.t1
1Chr AUGUSTUS exon 3636 3913 . + . Parent=1Chr.g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . + 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS intron 3914 3995 1 + .
1Chr AUGUSTUS CDS 3760 3913 1 + 0 ID=1Chr.g1.t1.cds;Parent=1Chr.g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . + 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . + . Parent=1Chr.g1.t1
# start gene 1Chr.g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=1Chr.g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=1Chr.g2.t1;Parent=1Chr.g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=1Chr.g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=1Chr.g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=1Chr.g2.t1.cds;Parent=1Chr.g2.t1
# start gene 2Chr.g1
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=2Chr.g1
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=2Chr.g1.t1;Parent=2Chr.g1
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=2Chr.g1.t1.cds;Parent=2Chr.g1.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=2Chr.g1.t1
# start gene 2Chr.g2
2Chr AUGUSTUS gene 22876 31223 0.04 + . ID=2Chr.g2
2Chr AUGUSTUS transcript 22876 31223 0.04 + . ID=2Chr.g2.t1;Parent=2Chr.g2
2Chr AUGUSTUS transcription_start_site 22876 22876 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 22876 23456 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 23515 24451 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS start_codon 23519 23521 . + 0 Parent=2Chr.g2.t1
I want to replace the gene IDs (1Chr.g1, 1Chr.g2, 2Chr.g1, and 2Chr.g2) with sequential IDs, starting at g1 and running to the last gene (g4 in this case).
Expected output:
# start gene g1
1Chr AUGUSTUS gene 3636 5916 0.1 + . ID=g1
1Chr AUGUSTUS transcript 3636 5916 0.1 + . ID=g1.t1;Parent=g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . + . Parent=g1.t1
1Chr AUGUSTUS exon 3636 3913 . + . Parent=g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . + 0 Parent=g1.t1
1Chr AUGUSTUS intron 3914 3995 1 + .
1Chr AUGUSTUS CDS 3760 3913 1 + 0 ID=g1.t1.cds;Parent=g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . + 0 Parent=g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . + . Parent=g1.t1
# start gene g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=g2.t1;Parent=g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=g2.t1.cds;Parent=g2.t1
# start gene g3
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=g3
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=g3.t1;Parent=g3
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=g3.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=g3.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=g3.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=g3.t1.cds;Parent=g3.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=g3.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=g3.t1
# start gene g4
2Chr AUGUSTUS gene 22876 31223 0.04 + . ID=g4
2Chr AUGUSTUS transcript 22876 31223 0.04 + . ID=g4.t1;Parent=g4
2Chr AUGUSTUS transcription_start_site 22876 22876 . + . Parent=g4.t1
2Chr AUGUSTUS exon 22876 23456 . + . Parent=g4.t1
2Chr AUGUSTUS exon 23515 24451 . + . Parent=g4.t1
2Chr AUGUSTUS start_codon 23519 23521 . + 0 Parent=g4.t1
I wrote the following bash script, but it is too slow. I timed it: each sed call takes about 1 second, so with 28000 iterations it would take about 8 hours.
Is there a more efficient way to do this?
awk '$3 == "gene"' $1 |cut -f9 |grep -o "=.*" |sed -e 's/=//g' >LIST.txt
COUNTER=0
cat LIST.txt | while read line; do
COUNTER=$(expr $COUNTER + 1)
echo "sed -i 's/$line/g$COUNTER/g' $1" |bash
done
rm LIST.txt
Another problem: it also generates a file named sedTG45, which is very annoying.
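For comparison, a single pass with awk would avoid rerunning sed over the whole file once per gene. The following is only a minimal sketch, not a tested solution for the full data set. It assumes that every gene is introduced by a "# start gene <ID>" comment or a gene line, that all features of a gene follow its gene line, and that sub-feature IDs are the gene ID plus a dot-separated suffix (as in 1Chr.g1.t1.cds). The function name renumber_genes is hypothetical.

```shell
# Renumber gene IDs as g1, g2, ... in one pass; reads a GFF file, writes to stdout.
renumber_genes() {
    awk -F '\t' -v OFS='\t' '
    # Map an original gene ID to its sequential ID, assigning a new one on first sight.
    function newid(old) {
        if (!(old in map)) map[old] = "g" (++n)
        return map[old]
    }
    # Comment lines: rewrite "# start gene <ID>" and remember the current gene.
    /^#/ {
        if (match($0, /^# start gene[ \t]+/)) {
            cur = substr($0, RLENGTH + 1)
            sub(/[ \t\r]+$/, "", cur)
            $0 = "# start gene " newid(cur)
        }
        print; next
    }
    # Gene lines: read the original ID from the ID= attribute in column 9.
    $3 == "gene" && NF >= 9 {
        cur = $9
        sub(/^.*ID=/, "", cur); sub(/;.*$/, "", cur)
        newid(cur)
    }
    # Lines with attributes: replace values equal to the gene ID or starting with "<ID>.".
    NF >= 9 && cur != "" {
        m = split($9, attr, ";")
        out = ""
        for (i = 1; i <= m; i++) {
            eq = index(attr[i], "=")
            key = substr(attr[i], 1, eq); val = substr(attr[i], eq + 1)
            if (val == cur) val = map[cur]
            else if (index(val, cur ".") == 1) val = map[cur] substr(val, length(cur) + 1)
            out = out (i > 1 ? ";" : "") key val
        }
        $9 = out
    }
    { print }
    ' "$@"
}
```

It could be called as renumber_genes input.gff > output.gff (both file names are placeholders). Because it compares whole attribute values against the current gene ID rather than substituting text globally, 1Chr.g1 should not be confused with 1Chr.g10, and since nothing is edited in place, no sed temporary files are created.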