I have a table of 9M rows, snp_collection.tab with empty values in several columns:
snpID freq_bin ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding
10:10001753 7 ENSG00000224788 CELF2 FALSE FALSE
10:10001794 41 ENSG00000224788 CELF2 FALSE FALSE
10:100023489 10 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:100025128 45 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:10002975 1 ENSG00000224788 CELF2 FALSE FALSE
Which for simplcity, looks like this:
This is messing up things when I later parse it using awk to produce a more useful table for my purposes:
Later step which gets messed up from dirty input:
tail -n+2 broadSNPs.tab | gawk -F'[:\t]' -vOFS="\t" '{ print "chr"$1, ($2 - 1), $2, $3, $18, $19, $20, $21}' > broadSNPs.bed
Here, the empty spaces aren't respected by awk, and rows with empty spaces are offset in the output.
So, I'm trying to fill in those empty spaces before awk.
Using the advice given here, Filling space/tab separated, empty columns with 0, I've done:
sed 's/\t\t/\tNA\t/g' snp_collection.tab > broadSNPs.tab
I have Also tried replacing the \t's with ctrl+v+tab, but alas the output remains the same as i the original snp_collection.tab, with the empty spaces still just empty.
What am I missing? Thanks!
Desired output:
snpID freq_bin ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding
10:10001753 7 ENSG00000224788 NA NA CELF2 FALSE FALSE
10:10001794 41 ENSG00000224788 NA NA CELF2 FALSE FALSE
10:100023489 10 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:100025128 45 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:10002975 1 ENSG00000224788 NA NA CELF2 FALSE FALSE
With this fixed input, my awk command shouldn't produce a nonsensical output.