A nucleic acid or amino acid sequence can be seen as composed of a number of possibly overlapping k-mers or words of length k, for a certain k ≥ 1. The k-mer composition of a sequence is given by the frequency with which each possible k-mer occurs within the sequence. The 1-mer composition is related to the GC content of a DNA sequence, and the 2-mer, 3-mer, and 4-mer compositions are also known as the di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions of a DNA sequence. For example, the di-nucleotide composition of TATAAT is given by one occurrence of AA, two ocurrences of AT, and two ocurrences of TA.
Write pseudocode, Python code, and C++ code for the word composition problem. The program must implement and use the word composition function in the pseudocode, which must be iterative and is not allowed to perform input/output operations. Make two submissions, including the pseudocode as a comment to both the Python and the C++ code.
Input
The input is a string s (a genomic sequence) over the alphabet Σ={A,C,G,T} and an integer k with 1 ≤ k ≤ ||s||.
Output
The output is a sorted list of all the k-mers of s and their frequencies.
Input
TATAAT 1
Output
A 3 T 3
Input
TATAAT 2
Output
AA 1 AT 2 TA 2
Input
TATAAT 3
Output
AAT 1 ATA 1 TAA 1 TAT 1
Input
TATAAT 4
Output
ATAA 1 TAAT 1 TATA 1
Input
TATAAT 5
Output
ATAAT 1 TATAA 1
Input
TATAAT 6
Output
TATAAT 1