Word composition X95544


Statement
 

pdf   zip

html

A nucleic acid or amino acid sequence can be seen as composed of a number of possibly overlapping k-mers or words of length k, for a certain k ≥ 1. The k-mer composition of a sequence is given by the frequency with which each possible k-mer occurs within the sequence. The 1-mer composition is related to the GC content of a DNA sequence, and the 2-mer, 3-mer, and 4-mer compositions are also known as the di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions of a DNA sequence. For example, the di-nucleotide composition of TATAAT is given by one occurrence of AA, two ocurrences of AT, and two ocurrences of TA.

Write pseudocode, Python code, and C++ code for the word composition problem. The program must implement and use the word composition function in the pseudocode, which must be iterative and is not allowed to perform input/output operations. Make two submissions, including the pseudocode as a comment to both the Python and the C++ code.

Input

The input is a string s (a genomic sequence) over the alphabet Σ={A,C,G,T} and an integer k with 1 ≤ k ≤ ||s||.

Output

The output is a sorted list of all the k-mers of s and their frequencies.

Public test cases
  • Input

    TATAAT
    1
    

    Output

    A 3
    T 3
    
  • Input

    TATAAT
    2
    

    Output

    AA 1
    AT 2
    TA 2
    
  • Input

    TATAAT
    3
    

    Output

    AAT 1
    ATA 1
    TAA 1
    TAT 1
    
  • Input

    TATAAT
    4
    

    Output

    ATAA 1
    TAAT 1
    TATA 1
    
  • Input

    TATAAT
    5
    

    Output

    ATAAT 1
    TATAA 1
    
  • Input

    TATAAT
    6
    

    Output

    TATAAT 1
    
  • Information
    Author
    Gabriel Valiente
    Language
    English
    Official solutions
    C++ Python
    User solutions
    C++ Python