Commit fa299231 authored by ClovisG

doc upgrade

parent 04726373
......@@ -181,9 +181,12 @@ summary {
<p>Here is some documentation about methods, software, and libraries you can use for your project.</p>
<div id="biopython" class="section level2">
<p><a href="">BioPython</a> is a standard library for common bioinformatics tasks. In particular, it is useful for reading and writing standard file formats, as detailed below.</p>
<p>To install, use <code>pip3</code>:</p>
<pre><code>pip3 install biopython</code></pre>
<div id="read-files-in-standard-format-fasta" class="section level3">
<h3>Read files in standard format (FASTA)</h3>
<p>Most of the sequence files in bioinformatics are in a very simple format called <a href="">FASTA</a>.</p>
......@@ -216,8 +219,7 @@ summary {
for i,record in enumerate(Bio.SeqIO.parse(&quot;path/to/my/file.fasta&quot;,&quot;fasta&quot;)):
    if i%1000==0:
<div id="illumina-sequencing" class="section level2">
......@@ -225,6 +227,65 @@ for i,record in enumerate(Bio.SeqIO.parse(&quot;path/to/my/file.fasta&quot;,&quo
<p><a href="">Illumina</a> is the main technology used nowadays for sequencing, although <a href="">Nanopore</a> is on the rise. Illumina produces short reads (sequence fragments of 150-250 bp) compared to Nanopore (10-100 kbp), but the data it produces is reliable: there are usually &lt;1% sequencing errors. Moreover, those errors have nice statistical properties: as a rough approximation, we can consider them to be uniformly distributed and to consist only of substitutions, i.e. one letter replaced by another (Nanopore errors are mainly insertions and deletions).</p>
<p>The number of reads covering a given position in the genome is well approximated by a Poisson distribution with parameter <span class="math inline">\(\lambda\)</span> equal to the sequencing depth, i.e. the total quantity of data you get as output (in bases) divided by the length of the genome.</p>
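<p>As an illustration of the Poisson approximation above, here is a minimal sketch using only the Python standard library. The function names and the read and genome sizes are hypothetical, chosen only for the example.</p>

```python
import math


def sequencing_depth(total_bases: int, genome_length: int) -> float:
    """Sequencing depth (lambda): sequenced bases divided by genome length."""
    return total_bases / genome_length


def poisson_pmf(k: int, lam: float) -> float:
    """P(coverage == k) under a Poisson model with parameter lam > 0.

    Computed in log space so that large k does not overflow.
    """
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def prob_coverage_below(k: int, lam: float) -> float:
    """P(coverage < k): chance that a position is covered by fewer than k reads."""
    return sum(poisson_pmf(x, lam) for x in range(k))


# Hypothetical run: 1,000,000 reads of 150 bp on a 5 Mbp genome.
depth = sequencing_depth(1_000_000 * 150, 5_000_000)
print("depth:", depth)
print("P(position not covered):", poisson_pmf(0, depth))
print("P(coverage < 10):", prob_coverage_below(10, depth))
```

<p>This kind of estimate helps to decide whether a given depth is enough for most positions to be covered by several reads.</p>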
<div id="databases-and-on-line-tools" class="section level2">
<h2>Databases and on-line tools</h2>
<div id="blastx" class="section level3">
<p><a href=";PROGRAM=blastx&amp;PAGE_TYPE=BlastSearch&amp;BLAST_SPEC=">blastx</a> searches the NCBI public database of protein sequences for proteins that are encoded by the nucleotide sequences provided as input.</p>
<div id="hmmer" class="section level3">
<p><a href="">HMMer</a> lets you search databases of known protein sequences, in particular Pfam (see below).</p>
<div id="pfam" class="section level3">
<p><a href="">Pfam</a> is a database of protein families (i.e. proteins that share a common ancestor and are expected to play similar roles in the organisms in which they are expressed).</p>
<p>You can retrieve pre-computed multiple sequence alignments of known protein families (in the <strong>Alignments</strong> tab in the menu on the left). These protein sequence alignments can be used for many purposes, and in particular for protein structure prediction (see following section).</p>
<div id="protein-structure-prediction" class="section level2">
<h2>Protein structure prediction</h2>
<div id="predicting-contact-map-from-multiple-sequence-alignment-msa" class="section level3">
<h3>Predicting contact map from multiple sequence alignment (MSA)</h3>
<p>There are 20 standard amino acids. They have various physicochemical properties, such as size, charge, and aromaticity.</p>
<p>The structure of a protein, and therefore the interactions between the amino acids of its sequence, determines its function in the organism. Amino acids that face each other in the 3D structure interact: we say they are in <em>contact</em>. The structure of the protein is stable mainly because of electrostatic interactions between amino acids in contact.</p>
<p>Therefore, if one amino acid mutates, the amino acids in contact with it have to compensate for the mutation. For example, if two amino acids X and Y in contact are positively and negatively charged respectively, and X mutates into a negatively charged amino acid, then Y has to mutate into a positively charged one.</p>
<p>By aligning multiple related protein sequences and computing the mutual information between the positions of the alignment, one can infer the possible contacts. The mutual information is defined as follows:</p>
<span class="math inline">\(MI(i,j) = \sum_{a,b} f_{i,j,a,b} \log_2\frac{f_{i,j,a,b}}{f_{i,a}.f_{j,b}}\)</span>
<p>Where <span class="math inline">\(f_{i,j,a,b}\)</span> is the frequency of the event “having amino acid <span class="math inline">\(a\)</span> at position <span class="math inline">\(i\)</span> and amino acid <span class="math inline">\(b\)</span> at position <span class="math inline">\(j\)</span>”, and <span class="math inline">\(f_{i,a}\)</span> and <span class="math inline">\(f_{j,b}\)</span> are the corresponding marginals. If there are a lot of changes at position <span class="math inline">\(i\)</span> and if those changes are correlated with changes at position <span class="math inline">\(j\)</span>, then the mutual information will be high. If positions <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> are independent, then the joint frequency equals the product of the marginals (<span class="math inline">\(f_{i,j,a,b} = f_{i,a}.f_{j,b}\)</span>) and the mutual information is 0.</p>
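<p>The definition above can be sketched in plain Python as follows. This is only an illustration: the function name <code>mutual_information</code> and the toy alignments are assumptions, the MSA is taken to be a list of equal-length strings, and gap characters are treated like any other symbol.</p>

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa is a list of aligned sequences (strings of equal length);
    i and j are 0-based column indices.
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    counts_i = Counter(seq[i] for seq in msa)               # marginal at i
    counts_j = Counter(seq[j] for seq in msa)               # marginal at j
    mi = 0.0
    for (a, b), count in pair_counts.items():
        f_ab = count / n
        f_a = counts_i[a] / n
        f_b = counts_j[b] / n
        mi += f_ab * math.log2(f_ab / (f_a * f_b))
    return mi


# Toy alignments (made up for the example).
correlated = ["AK", "AK", "DE", "DE"]   # column 1 always follows column 0
independent = ["AK", "AE", "DK", "DE"]  # every combination seen equally often
print(mutual_information(correlated, 0, 1))
print(mutual_information(independent, 0, 1))
```

<p>Pairs <span class="math inline">\((a,b)\)</span> that never occur contribute nothing to the sum, so the loop only visits observed pairs.</p>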
<p>A simple approach when there are enough sequences in an MSA is to suppose that:</p>
<span class="math inline">\(MI(i,j)&gt;\tau \Rightarrow \text{contact between } i \text{ and } j\)</span>
<p>where <span class="math inline">\(\tau\)</span> is a threshold that has to be tuned.</p>
<p>As some columns tend to predict too many contacts (many correlations arise from inherited mutations rather than from compensatory effects), the following usual correction is applied:</p>
<span class="math inline">\(MI(i,j) = MI(i,j) - \frac{1}{N}\sum\limits_{k}(MI(k,j)+MI(i,k))\)</span>
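<p>A minimal sketch of this correction followed by the threshold rule could look like the code below. It assumes that the mutual information values are stored in a square list of lists, that <span class="math inline">\(N\)</span> is the number of positions, and that the sum runs over every <span class="math inline">\(k\)</span> as written in the formula. The function names are hypothetical.</p>

```python
def correct_mi(mi):
    """Apply MI(i,j) - (1/N) * sum_k (MI(k,j) + MI(i,k)) to every entry.

    mi is an N x N matrix (list of lists); N is assumed to be the
    number of positions. A new matrix is returned, mi is not modified.
    """
    n = len(mi)
    row_sums = [sum(row) for row in mi]                              # sum_k MI(i,k)
    col_sums = [sum(mi[k][j] for k in range(n)) for j in range(n)]   # sum_k MI(k,j)
    return [
        [mi[i][j] - (col_sums[j] + row_sums[i]) / n for j in range(n)]
        for i in range(n)
    ]


def predict_contacts(mi, tau):
    """Return the pairs (i, j), with i before j, whose score is above tau."""
    n = len(mi)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if mi[i][j] > tau]


# Toy matrix (made up): only positions 0 and 1 are correlated.
toy = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
corrected = correct_mi(toy)
print(predict_contacts(corrected, tau=0.0))
```

<p>The threshold <code>tau</code> still has to be tuned on the corrected values, since the correction shifts all scores.</p>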
<div id="ft-comar" class="section level3">
<p>This software generates a 3D structure from a contact map.</p>
<p>You can download it here: <a href="">FT-COMAR</a>.</p>
<p>The contact map has to be in the following format:</p>
<p>Where the first line is the number <span class="math inline">\(N\)</span> of amino acids in the sequence, and the following <span class="math inline">\(N^2\)</span> lines indicate whether there is a <strong>contact</strong> (1) or <strong>no contact</strong> (0) between amino acid <span class="math inline">\(i\)</span> and amino acid <span class="math inline">\(j\)</span>. The order of the lines corresponds to the lexicographic order: (1,1), (1,2),…, (1,N), (2,1), (2,2),…, (2,N), …, (N,N).</p>
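<p>As a sketch, a list of predicted contacts could be turned into such a file with a few lines of Python. This assumes that each of the <span class="math inline">\(N^2\)</span> lines holds a single 0 or 1, that contacts are symmetric, and that a position is not in contact with itself unless listed; please check these assumptions against the FT-COMAR documentation. The function name is hypothetical.</p>

```python
def format_contact_map(contacts, n):
    """Build the text of a contact map file for n amino acids.

    contacts is an iterable of 1-based pairs (i, j). Assumptions: one
    0/1 digit per line, and contacts are stored symmetrically.
    """
    contact_set = set()
    for i, j in contacts:
        contact_set.add((i, j))
        contact_set.add((j, i))
    lines = [str(n)]
    # Lexicographic order: (1,1), (1,2), ..., (1,N), (2,1), ..., (N,N).
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            lines.append("1" if (i, j) in contact_set else "0")
    return "\n".join(lines) + "\n"


# Toy example (made up): 3 amino acids, one contact between 1 and 3.
text = format_contact_map([(1, 3)], 3)
with open("my_contacts.lst", "w") as handle:
    handle.write(text)
```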
<p>Here is an example for getting the 3D structure (in PDB format) associated with the contacts listed in <code>my_contacts.lst</code>:</p>
<pre><code>path/to/FT-COMAR my_contacts.lst 9 0 test.pdb</code></pre>
<p>If you want to visualize the obtained 3D structure, you can use rasmol by calling:</p>
<p><code>rasmol test.pdb</code></p>