
Patrice Koehl 
New Algorithms for Biological Sequence Analyses 
Software: SeqKernelThe C++ source code for SeqKernel is available under LGPL license.
Source code: WSeqKernel.cpp Input substitution matrix: BLOSUM62.txt Optimized subsitution matrix: OPTIM10.txt To compile and run the code:
Overview: alignmentfree method for sequence alignmentThe order in which amino acids appear defines the primary sequence of a protein. Amino acids are usually labeled using a oneletter code, and sequences are correspondingly represented as strings of letters. This representation has proved very valuable, especially in the context of sequence alignment that are performed using stringmatching algorithms. Those methods represent the overwhelming majority of methods used for sequence comparisons in bioinformatics. When comparing two sequences, they proceed in two steps, first the generation of the alignment between the two sequences, then the derivation of a statistical score for that alignment. They rely on a weighting scheme to measure the cost of matching pairs of amino acids. Many such weights have been proposed, from substitution matrices derived from evolution models such as the PAM matrices and the BLOSUM matrices, to matrices that capture physicochemical properties of amino acids. Using this score, an alignment is derived following a dynamic programming algorithm, either the local method of Smith and Waterman, or the global method of Needleman and Wunsch. This alignment is then scored by summing the individual weights of the matching pairs of amino acids and adding penalties for the presence of gaps. It should be noted that this score is not a metric in sequence space. Statistical methods have been developed to assess the significance of such scores, both for gapped and nongapped alignments. Such statistical scores are widely used for the identification of homologous sequences, or for fold recognition. It has been shown however that those scores are efficient for both tasks when sequences with high similarities can be identified, but often fail for dissimilar pairs. Extensions to pairwise alignment methods have been proposed to alleviate those problems, such as those based on multiple sequence alignments and profiles, and those based on Hidden Markov Models. While those show improved sensitivity, they remain prone to the problems related to the construction and use of alignments. To overcome the limitations of the alignmentbased methods described above, many "alignmentfree" methods have been developed over the past three decades. We have developed our own version of such an alignment free alternative that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al, Found. Comput. Math., 2013,14:951984), we have developed our own version, SeqKernel. SeqKernel: A string kernel for alignmentfree sequence comparisonThe string kernel considered here is inspired by the convolution string kernels introduced by D. Haussler (University of California, Santa Cruz; 1999. UCSCRL9910.), adapted by Saigo et al (Saigo H, Vert JP, Ueda N, Akutsu T. Bioinformatics. 2004;20:16821689.) as the local alignment kernels, and later by Smale and coworkers (Shen et al, Found. Comput. Math., 2013,14:951984). We provide here the key elements of its construction. Readers should refer to the original papers for a more detailed presentation, notably for the proofs of the mathematical properties that are relevant to kernels in general. Notations Let $X$ be a finite set of size $n$. A kernel K is a symmetric function from $X\times X$ to $\mathbb{R}$ such that the Gram matrix $G$ of size $n\times n$ defined by $G(i,j)=K(x_i,x_j)$ is symmetric, positive, and definite. Let $\mathcal{A}$ be the set of the standard twenty amino acids found in proteins. A protein sequence $S$ is a string of elements from $\mathcal{A}$. We note $S$ the length of $S$. A kernel for amino acid pairs Let $SA$ be a function from $\mathcal{A}\times \mathcal{A}$ to $\mathbb{R}$, such that $SA(i,j)$ measures the similarity of the amino acids $i$ and $j$, and let $SM$ be the Gram matrix associated to $SA$. Examples of $SM$ include the matrices representing the raw data of any BLOSUM matrices, namely the raw counts of how often amino acid $i$ is substituted by amino acid $j$ in a set of selected protein sequence alignments that is then normalized by considering its row sums $P(i)$: $$P(i) = \sum_{j=1}^{20} SM(i,j) $$ $$SM2(i,j) = \frac{SM(i,j)}{P(i)p(j)} $$ We have checked that when $SM$ is a raw count BLOSUM matrix, then $SM2$ is symmetric, positive, and definite. Note that when $SM$ is such a raw count matrix, the corresponding BLOSUM matrix $BL$ is defined as $BL(i,j) = round(log_2(SM2(i,j)))$, where $round$ is the function that rounds a real number to its nearest integer. Given a strictly positive real number $\beta$, we define the function $K_1: \mathcal{A}\times \mathcal{A} \rightarrow \mathbb{R}$ as: $$K_1(i,j) = SM2(i,j)^{\beta}$$ $K_1$ is a kernel function on $\mathcal{A}\times \mathcal{A}$, as long as $SM2$ is symmetric, positive, definite, and $\beta$ is strictly positive. A kernel for comparing two strings of the same length. Let $k$ be a strictly positive integer and let $\mathcal{A}^k$ be the $k$th Cartesian power of $\mathcal{A}$. An element of $\mathcal{A}^k$ is a string of $k$ letters taken from $\mathcal{A}$, it is usually referred to as a kmer, or a kgram. Let $u^k=(u_1,\ldots,u_k)$ and $v^k=(v_1,\ldots,v_k)$ be two kmers in $\mathcal{A}^k$. The function $K_2^k$ defined by: $$K_2^k(u^k,v^k) = \prod_{i=1}^k K_1(u_i,v_i)$$ is a kernel on $\mathcal{A}^k$, the set of all kmers. We note that $K_2^k$ is a convolution kernel. A kernel for computing protein sequence similarity. Let $S=(s_1,\ldots,s_n)$ and $T=(t_1,\ldots,t_m)$ be two protein sequences; both are strings, with $S \in \mathcal{A}^n$ and $T \in \mathcal{A}^m$. Let $u^k$ and $v^k$ be substrings of length $k$ (i.e. kmers) of $S$ and $T$ respectively. $u^k$ and $v^k$ are considered contiguous, i.e. we do not allow gaps. There are therefore $nk+1$ and $mk+1$ such substrings in $S$ and $T$, respectively. We define
where the summation extends to $p=\min(n,m)$, $n$ and $m$ the lengths of the two sequences that are compared, and $\omega_k$ a weight. We set the weights $\omega_k$ to: $$\omega(k)=\frac{1}{(nk+1)(mk+1)}$$ Finally, we define the correlation kernel $\hat{K}_3$ as:$$\hat{K}_3(S,T) = \frac{K_3(S,T)}{ \sqrt{K_3(S,S)K_3(T,T)}}$$ We make the following remarks:
