Biological Data Mining

BioDig: Finding Homologous Genes

Published on September 21, 2012, in BioDig.

1. NCBI HomoloGene

link: http://www.ncbi.nlm.nih.gov/HomoloGene/HTML/homologene_buildproc.html

Proteins from input organisms -> blastp -> find DNA sequences of the proteins -> synteny -> maximization of global score

2.  Online blog 1

link: http://www.personal.psu.edu/zuz17/blogs/psu_life/2011/02/understand-ucsc-netchain-alignment-1.html

3. paper1

link: http://genome.cshlp.org/content/11/5/803.full

“Computational Inference of Homologous Gene Structures in the Human Genome”

 

4. paper2

link: http://www.sciencemag.org/content/320/5875/486.full

Eukaryotic genomes differ in the degree to which genes remain on corresponding chromosomes (synteny) and in corresponding orders (collinearity) over time (1). For example, most eutherian (placental mammal) orders have incurred only moderate reshuffling of chromosomal segments since descent from common ancestors ∼130 million years ago (2). Indeed, karyotype evolution along major vertebrate lineages appears to have been slow since an inferred whole-genome duplication occurred ∼500 million years ago (3). Accordingly, accurate identification of orthologs across eutherian taxa is relatively routine, and deduction of synteny and collinearity is often straightforward with best-in-genome criteria (4), identifying one-to-one best matching chromosomal regions in pairwise genome comparisons.

5. Through evolutionary analysis

link: http://genome.cshlp.org/content/8/3/163.full

6. Ensembl Gene Homolog prediction method

link: http://www.ensembl.org/info/docs/compara/homology_method.html

 

http://www.ensembl.org/info/website/news.html

ProteinTrees and homologies (all species)

GeneTrees (protein-coding) with new/updated genebuilds and assemblies

  • Clustering using hcluster_sg
  • Multiple sequence alignments using MCoffee or Mafft
  • Phylogenetic reconstruction using TreeBeST
  • Homology inference
  • Pairwise gene-based dN/dS scores for high coverage species pairs only (both on orthologues and paralogues)
  • GeneTree stable ID mapping
  • Per family gene dynamics using CAFE

ncRNAtrees and homologies (all species)

  • Classification based on Rfam models
  • Multiple sequence alignments with Infernal
  • Phylogenetic reconstruction using RAxML
  • Phylogenetic reconstruction using FastTree2 and RAxML-Light for very big families
  • Additional multiple sequence alignments with Prank (w/ genomic flanks)
  • Additional phylogenetic reconstruction using PhyML and NJ
  • Phylogenetic tree merging using TreeBeST
  • Per family gene dynamics using CAFE
  • Homology inference

7. UCSC

(chimpanzee as example)

“The RNAs were aligned against the chimp genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 25% base identity with the genomic sequence were kept. ”
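The filtering rule in the quote can be sketched in a few lines of Ruby. This is only an illustration, not UCSC's code: the `Alignment` struct and its fields are hypothetical, "an alignment of less than 15%" is assumed to mean the percentage of the RNA that aligned, and "within 0.5% of the best" is assumed to mean 0.5 percentage points of base identity.

```ruby
# Hypothetical record for one RNA-to-genome alignment (not a blat data structure).
Alignment = Struct.new(:rna_id, :coverage, :identity)

MIN_COVERAGE = 15.0 # assumed: percent of the RNA aligned; lower is discarded
MIN_IDENTITY = 25.0 # percent base identity floor from the quoted text
NEAR_BEST    = 0.5  # assumed: percentage points below the best identity

# Keep, per RNA, the alignments whose identity is close to that RNA's best hit.
def filter_alignments(alignments)
  alignments
    .reject { |a| a.coverage < MIN_COVERAGE }
    .group_by(&:rna_id)
    .flat_map do |_rna_id, hits|
      best = hits.map(&:identity).max
      hits.select { |a| a.identity >= best - NEAR_BEST && a.identity >= MIN_IDENTITY }
    end
end

# Made-up example values.
hits = [
  Alignment.new("rnaA", 90.0, 99.2),
  Alignment.new("rnaA", 85.0, 98.9),
  Alignment.new("rnaA", 80.0, 97.0),
  Alignment.new("rnaB", 10.0, 99.9)
]
filter_alignments(hits).each { |a| puts [a.rna_id, a.coverage, a.identity].join("\t") }
```

With these made-up values, the third rnaA hit falls outside the 0.5-point window and rnaB is dropped for low coverage.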

Software

1. GenScan

2. TreeBeST

http://treesoft.sourceforge.net/treebest.shtml

Used in Ensembl homolog gene prediction

3. MCScanX

http://chibba.pgml.uga.edu/mcscan2/

4. Mercator

http://www.biostat.wisc.edu/~cdewey/mercator/

 

Database

1. PhylomeDB

http://phylomedb.org/

2. OMA Browser

http://omabrowser.org/cgi-bin/gateway.pl

3. eggNOG

http://eggnog.embl.de/version_3.0/

 Review

The quest for orthologs: finding the corresponding gene across genomes

 

Key words: synteny, homolog, ortholog, paralog


BioDig: High-throughput Sequencing Data Visualization

Published on August 28, 2012, in BioDig.

Desktop App:

1. IGV

http://www.broadinstitute.org/igv/

BAM files need to be sorted and indexed.

2. Samtools

http://samtools.sourceforge.net/

Specifically for SAM or BAM files

3. Savant Genome Browser

http://genomesavant.com/savant/index.php
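The IGV note above says BAM files need to be sorted and indexed. A minimal sketch of that preparation step follows. It assumes samtools 1.x is on the PATH (`samtools sort -o`; the older 0.1.x releases used `samtools sort in.bam out_prefix` instead). The helper name `igv_prep_commands` and the `.sorted.bam` naming are illustrative choices, not part of any tool.

```ruby
# Build (but do not run) the samtools commands that prepare a BAM file for IGV.
# Assumes samtools 1.x syntax; the output file name is an arbitrary convention.
def igv_prep_commands(bam_path)
  sorted = bam_path.sub(/\.bam\z/, "") + ".sorted.bam"
  [
    ["samtools", "sort", "-o", sorted, bam_path], # coordinate-sort the reads
    ["samtools", "index", sorted]                 # write the index next to the sorted file
  ]
end

igv_prep_commands("sample.bam").each { |cmd| puts cmd.join(" ") }

# To actually run them:
#   igv_prep_commands("sample.bam").each { |cmd| system(*cmd) or abort("failed: #{cmd.join(' ')}") }
```

Returning argument arrays rather than one shell string keeps file names with spaces safe when they are passed to `system`.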

 

Web Service:

Note: For large raw sequencing datasets (>550M), uploading the data to a web service is impractical.

1. UCSC Genome Browser

http://genome.ucsc.edu/

2. Galaxy

https://main.g2.bx.psu.edu/


BioData: Downloading Fungal Genome and Annotation Files

Published on August 24, 2012, in BioData.

1. Download from Ensembl

http://fungi.ensembl.org/info/data/ftp/index.html

1.1 Species included so far

Ashbya gossypii (Ashbya gossypii)
Aspergillus clavatus (Aspergillus clavatus)
Aspergillus flavus (Aspergillus flavus)
Aspergillus fumigatus (Aspergillus fumigatus)
Aspergillus fumigatusa1163 (Aspergillus fumigatusa1163)
Aspergillus nidulans (Aspergillus nidulans)
Aspergillus niger (Aspergillus niger)
Aspergillus oryzae (Aspergillus oryzae)
Aspergillus terreus (Aspergillus terreus)
Botryotinia fuckeliana (Botryotinia fuckeliana)
Fusarium oxysporum (Fusarium oxysporum)
Gaeumannomyces graminis (Gaeumannomyces graminis)
Gibberella moniliformis (Gibberella moniliformis)
Gibberella zeae (Gibberella zeae)
Leptosphaeria maculans (Leptosphaeria maculans)
Magnaporthe oryzae (Magnaporthe oryzae)
Magnaporthe poae (Magnaporthe poae)
Mycosphaerella graminicola (Mycosphaerella graminicola)
Nectria haematococca (Nectria haematococca)
Neosartorya fischeri (Neosartorya fischeri)
Neurospora crassa (Neurospora crassa)
Phaeosphaeria nodorum (Phaeosphaeria nodorum)
Puccinia graminis (Puccinia graminis)
Puccinia triticina (Puccinia triticina)
Saccharomyces cerevisiae (Saccharomyces cerevisiae)
Schizosaccharomyces pombe (Schizosaccharomyces pombe)
Sclerotinia sclerotiorum (Sclerotinia sclerotiorum)
Trichoderma virens (Trichoderma virens)
Tuber melanosporum (Tuber melanosporum)
Ustilago maydis (Ustilago maydis)

1.2 file types

DNA FASTA Files

cDNA FASTA Files

protein FASTA Files

GTF Files

1.3 Example: download the S. pombe gene annotation file

ftp://ftp.ensemblgenomes.org/pub/fungi/release-15/gtf/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v1.15.gtf.gz
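Once a GTF file like the one above is downloaded, each line has nine tab-separated columns, the last holding `key "value";` attributes. The sketch below parses one such line. The `GtfRecord` struct and the sample line are made up for illustration and are not taken from the S. pombe file.

```ruby
# Hypothetical container for the nine GTF columns.
GtfRecord = Struct.new(:seqname, :source, :feature, :start, :stop,
                       :score, :strand, :frame, :attributes)

# Parse one GTF line; returns nil for comment or blank lines.
def parse_gtf_line(line)
  return nil if line.start_with?("#") || line.strip.empty?

  cols = line.chomp.split("\t", 9)
  raise ArgumentError, "expected 9 columns, got #{cols.size}" unless cols.size == 9

  # Attributes look like: gene_id "X"; transcript_id "Y";
  attrs = cols[8].scan(/(\S+)\s+"([^"]*)"/).to_h
  GtfRecord.new(cols[0], cols[1], cols[2], Integer(cols[3]), Integer(cols[4]),
                cols[5], cols[6], cols[7], attrs)
end

# Made-up example line (not real annotation data).
sample = "I\tprotein_coding\texon\t1000\t1500\t.\t+\t.\t" \
         "gene_id \"GENE0001\"; transcript_id \"GENE0001.1\";"
rec = parse_gtf_line(sample)
puts "#{rec.feature} #{rec.seqname}:#{rec.start}-#{rec.stop} #{rec.attributes['gene_id']}"
```

For the gzipped download, the lines could be streamed with Ruby's standard `Zlib::GzipReader.open(path) { |gz| gz.each_line { |l| parse_gtf_line(l) } }` after `require "zlib"`.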

2. Download S. pombe from PomBase

http://www.pombase.org/

2.1 GFF file

ftp://ftp.sanger.ac.uk/pub/yeast/pombe/GFF/

Comment: poor release control

3. Reference

Comparative Functional Genomics of the Fission Yeasts

 


BioDig: Digging Repeat Elements In The Genome

Published on August 24, 2012, in BioDig.

Software

1. RepeatMasker

http://www.repeatmasker.org/RMDownload.html

“RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.”

Troubleshooting: http://www.biostars.org/post/show/9174/error-invoking-makeblastdb/

 

2. RepeatModeler

http://www.repeatmasker.org/RepeatModeler.html

“RepeatModeler is a de-novo repeat family identification and modeling package. ”

3. COSEG

http://www.repeatmasker.org/COSEGDownload.html

“COSEG is a program which automatically identifies repeat subfamilies using significant co-segregating ( 2-3 bp ) mutations. ”

 

Database

1. UCSC Genome Browser Tables

http://genome.ucsc.edu/cgi-bin/hgTables?command=start

group: Variation And Repeat -> track:  RepeatMasker / Simple Repeats / Microsatellite

 

 


BioDig: Transcription Factor Binding Sites Motif Enrichment Analysis

Published on August 16, 2012, in BioDig.

Software:

1. HOMER

http://biowhat.ucsd.edu/homer/

2. MEME

http://meme.sdsc.edu/meme/intro.html

MEME is a well-designed suite for motif enrichment analysis.

3. Motif Finding (DWE)

http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=analysisMotifDWEForm

(produced an error when run)

4. RSAT

http://rsat.ccb.sickkids.ca/

Algorithms:

1.

 

Motif Data Structure

1. PSSMs

2.
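To make item 1 concrete, a PSSM (position-specific scoring matrix) can be built from aligned binding sites and used to score candidate windows. The sketch below is a generic textbook-style construction, not the method of any tool listed here. The uniform background of 0.25, the pseudocount of 1, and the example sites are all assumptions chosen for illustration.

```ruby
BASES = %w[A C G T].freeze

# Build a log-odds PSSM (one Hash per position) from equal-length aligned sites.
# Assumes a uniform background and an additive pseudocount.
def build_pssm(sites, pseudocount: 1.0, background: 0.25)
  width = sites.first.length
  raise ArgumentError, "sites must share one length" unless sites.all? { |s| s.length == width }

  total = sites.size + pseudocount * BASES.size
  (0...width).map do |pos|
    counts = Hash.new(0)
    sites.each { |s| counts[s[pos]] += 1 }
    BASES.to_h { |b| [b, Math.log2((counts[b] + pseudocount) / total / background)] }
  end
end

# Sum the per-position log-odds for a window the same width as the PSSM.
def score(pssm, window)
  window.chars.each_with_index.sum { |base, i| pssm[i].fetch(base) }
end

# Slide over the sequence and return [offset, score] of the best window.
def best_hit(pssm, sequence)
  width = pssm.size
  (0..sequence.length - width)
    .map { |i| [i, score(pssm, sequence[i, width])] }
    .max_by { |_offset, s| s }
end

# Made-up example sites and sequence.
pssm = build_pssm(%w[TATAAT TATGAT TACAAT TATACT])
offset, s = best_hit(pssm, "GGCGTATAATGCC")
puts "best window at offset #{offset}, score #{s.round(2)}"
```

The pseudocount keeps unseen bases from producing a log of zero, at the cost of slightly flattening the matrix when few sites are available.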

 Online Motif Search

1. MotifMap

http://motifmap.ics.uci.edu/

 

 


BioDig: Introduction To Data Mining In NCBI GEO Database

Published on August 8, 2012, in BioDig.

Currently, NCBI GEO (Gene Expression Omnibus) is the best place to find published microarray and high-throughput sequencing data. This includes microarray gene expression data, sequencing-based gene expression data (RNA-seq), ChIP-seq, and DNA methylation profiling data based on microarrays or sequencing (such as RRBS and Methyl-seq). GEO also provides services for customized analysis, such as a t-test to find differentially expressed genes between user-defined control and treatment datasets (e.g., GEO2R).
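To illustrate the kind of control-versus-treatment comparison mentioned above, here is a minimal sketch of a Welch t-statistic for one gene. It is not GEO2R's implementation, the expression values are made up, and it stops at the statistic: turning it into a p-value needs a t-distribution, which is left out.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# Unbiased sample variance (divides by n - 1).
def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

# Welch's t-statistic: difference of means over the unpooled standard error.
def welch_t(control, treatment)
  se = Math.sqrt(sample_variance(control) / control.size +
                 sample_variance(treatment) / treatment.size)
  (mean(treatment) - mean(control)) / se
end

# Made-up log2 expression values for two hypothetical genes.
genes = {
  "geneA" => { control: [5.1, 4.9, 5.0], treatment: [7.0, 7.2, 6.8] },
  "geneB" => { control: [5.0, 5.2, 4.8], treatment: [5.1, 4.9, 5.0] }
}
genes.each do |name, g|
  puts "#{name}\tt = #{welch_t(g[:control], g[:treatment]).round(2)}"
end
```

A large absolute t (geneA here) suggests a shift between groups, while a value near zero (geneB) does not. With real data, multiple-testing correction across genes would also be needed.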

Methods to Access GEO Programmatically:

1. GEOquery

R package

http://bioconductor.org/packages/1.8/bioc/html/GEOquery.html

2. GEOmetadb

http://gbnci.abcc.ncifcrf.gov/geo/

R package:

http://bioconductor.org/packages/2.2/bioc/html/GEOmetadb.html

Ref:

Strategies to Explore Functional Genomics Data Sets in NCBI’s GEO Database

Data Mining in GEO and Beyond

3. Signature Based Dataset Searching

4. GEO Data Analysis Tools

Tool 1: Find Gene

Tool 2: Compare 2 Sets of Samples

Tool 3: Cluster Heatmaps

Tool 4: Experiment Design and Value Distribution

 


BioDig: Ruby Client Scripts For DAVID Webservice

Published on August 1, 2012, in BioDig.

DAVID is a well-known gene enrichment analysis service (free for academic users). They recently released a new web service that lets users run gene enrichment analysis automatically through SOAP, with the interface described by a WSDL.

Current limitations of this service:

A job with more than 3000 genes to generate gene or term cluster report will not be handled by DAVID due to resource limit.
No more than 200 jobs in a day from one user or computer.
DAVID Team reserves right to suspend any improper uses of the web service without notice.
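The two numeric limits above can be checked on the client side before a job is submitted. The sketch below is purely illustrative: the helper name `david_limit_problems` and the idea of tracking `jobs_today` locally are assumptions, and DAVID enforces its own limits regardless.

```ruby
MAX_CLUSTER_GENES = 3000 # from the limits listed above (cluster reports)
MAX_JOBS_PER_DAY  = 200  # from the limits listed above

# Return a list of human-readable problems; an empty list means "looks fine".
# jobs_today is a count the caller keeps locally (hypothetical bookkeeping).
def david_limit_problems(gene_ids, jobs_today)
  problems = []
  if gene_ids.size > MAX_CLUSTER_GENES
    problems << "#{gene_ids.size} genes exceeds #{MAX_CLUSTER_GENES} for cluster reports"
  end
  if jobs_today >= MAX_JOBS_PER_DAY
    problems << "daily quota of #{MAX_JOBS_PER_DAY} jobs already used"
  end
  problems
end

puts david_limit_problems(%w[31741_at 31734_at], 12).inspect
```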

 

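Here is a Ruby script that uses this service:

Requirement: savon (gem install savon). The script is written against the Savon 1.x API (client.request); Savon 2 replaced this with client.call.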

 

#!/usr/bin/env ruby
# This content is released under MIT License
# copyright (c) 2012 Gangcai Xie <www.biodm.com>
require 'savon'
url="http://david.abcc.ncifcrf.gov/webservice/services/DAVIDWebService?wsdl"
client=Savon.client(url)
#list all SOAP actions exposed by the WSDL
client.wsdl.soap_actions

#authenticate with the email address registered with DAVID
response=client.request :authenticate do
	soap.body={
		:email=>'user@example.com' #your email address
	}
end

inputIds ='31741_at,31734_at,32696_at,37559_at,41400_at,35985_at,39304_g_at,41438_at,35067_at,32919_at,35429_at,36674_at,967_g_at,36669_at,39242_at,39573_at,39407_at,33346_r_at,40319_at,2043_s_at,1788_s_at,36651_at,41788_i_at,35595_at,36285_at,39586_at,35160_at,39424_at,36865_at,2004_at,36728_at,37218_at,40347_at,36226_r_at,33012_at,37906_at,32872_at,989_at,32718_at,36957_at,32645_at,37628_at,33825_at,35687_at,32779_s_at,34493_at,31564_at,887_at,34712_at,32897_at,34294_at,41365_at,41446_f_at,34375_at,875_g_at,41099_at,919_at,38970_s_at,39159_at,34184_at,1018_at,38032_at,35956_s_at,35536_at,34562_at,1867_at,35957_at,39519_at,41657_at,38491_at,652_g_at,35776_at,34989_at,33455_at,39950_at,37723_at,31977_at,38629_at,34581_s_at,36210_g_at,35120_at,41532_at,37889_at,1332_f_at,40540_at,41105_s_at,1919_at,37542_at,39698_at,36711_at,36809_at,1167_s_at,31648_at,32364_at,40792_s_at,38685_at,41358_at,32931_at,35294_at,39870_at,38654_at,257_at,39071_at,35606_at,41726_at,33094_s_at,32405_at,1432_s_at,33698_at,408_at,39748_at,1953_at,36100_at,36101_s_at,1372_at,35314_at,40790_at,2030_at,179_at,1852_at,259_s_at,38024_at,35376_f_at,41779_at,39232_at,41159_at,40365_at,31626_i_at,40385_at,35613_at,37506_at,38207_at,887_at,600_at,1461_at,38691_s_at,1267_at,1177_at,1125_s_at,2036_s_at,31615_i_at,37283_at,40954_at,31758_at,36960_at,33143_s_at,37048_at,38538_at,1005_at,34963_at,39408_at,32464_at,706_at,1276_g_at,164_at,41445_at,40735_at,1891_at,1258_s_at,40856_at,1911_s_at,31562_at,32359_at,274_at,1804_at,41387_r_at,848_at,41499_at,39448_r_at,34537_at,36459_at,35500_at,37139_at,612_s_at,32133_at,39757_at,37629_at,38463_s_at,568_at,749_at,1939_at,38018_g_at,1857_at,32699_s_at,40661_at,1994_at,38373_g_at,33893_r_at,1388_g_at,35345_at,1385_at,36615_at,1263_at,37385_at,1774_at,37233_at,39753_at,32626_at,35915_at,35714_at,31669_s_at,36519_at,40473_at,1750_at,33751_at,37831_at,35472_at,41825_at,34666_at,35471_g_at,31888_s_at,37722_s_at,35414_s_at,39750_at,35726_at,37662_at,33802_at,352_at,31737
_at,37938_at,36161_at,31558_at,34475_at,37223_at,38953_at,37857_at,189_s_at,41169_at,33092_at,38660_at,40895_g_at,37146_at,1936_s_at,38860_at,40210_at,41180_i_at,31586_f_at,33366_at,31521_f_at,762_f_at,1124_at,36009_at,41111_at,36749_at,37310_at,31522_f_at,35768_at,39421_at,39967_at,35992_at,38356_at,39331_at,34145_at,35378_at,199_s_at,35966_at,1866_g_at,37377_i_at,37378_r_at,833_at,31586_f_at,38062_at,34981_at,1569_r_at,1548_s_at,41446_f_at,36999_at,34226_at,33385_g_at,36173_r_at,1007_s_at,35149_at,38671_at,1973_s_at,37724_at,37317_at,33829_at,36532_at,39372_at,41717_at,38221_at,37418_at,33120_at,136_at,33492_at,1602_at,41505_r_at,41736_g_at,37862_at,31859_at,40913_at,35956_s_at,32193_at,1148_s_at,1244_at,38684_at,37440_at,32186_at,1242_at,39503_s_at,224_at,38374_at,36018_at,36603_at,33288_i_at,33662_at,33555_at,33539_at,430_at,471_f_at,1369_s_at,35372_r_at,38089_at,40310_at,41106_at,41216_r_at,32815_at,37463_r_at,33470_at,40522_at,1463_at,1743_s_at,1895_at,32583_at,35440_g_at,1091_at'
inputIds = inputIds.delete("\n") #strip the line break that splits the ID 31737_at inside the literal above
idType = 'AFFYMETRIX_3PRIME_IVT_ID'
listName = 'make_up'
listType = 0

response=client.request :addList do
   soap.body={
   :inputid=>inputIds,
   :idtype=>idType,
   :listname=>listName,
   :listtype=>listType
   }
end

response=client.request :getDefaultCategoryNames
#response.body[:get_default_category_names_response][:return]

thd=0.1 #threshold
count = 2
chart=client.request :getChartReport do
  soap.body={
    :pvalue=>thd,
    :count=>count
  }
end

#chart_return=chart.body[:get_chart_report_response][:return] #old way

chart_return=chart.to_hash[:get_chart_report_response][:return] # return is an array



#each item is a hash with keys: [:ease_bonferroni, :afdr, :benjamini, :bonferroni, 
# :category_name, :ease, :fisher, :fold_enrichment, :gene_ids, :id, :list_hits, 
# :list_name, :list_totals, :percent, :pop_hits, :pop_totals, :rfdr, :scores, :term_name, :"@xmlns:xsi", :"@xsi:type"]

choose_types=[:term_name,:category_name,:list_hits,:fold_enrichment,:ease,:bonferroni,:gene_ids]
type_new_names=%w(Term_Name Category Hits EnrichFold Pvalue Bonferroni Hits_Genes)
#output the result

outfile=File.new("test.tsv","w")
outfile.puts type_new_names.join("\t")
chart_return.each do |item|
  values=choose_types.collect{|type| item[type]}
  outfile.puts values.join("\t")
end
outfile.close


BioQuestions: How different is the human genome from mouse genome?

Published on July 27, 2012, in BioQuestions.

1. Gene Levels

 

2. Protein Levels

 

3. Genome Sequence

 

4. Chromosomes


BioGraphics: High-throughput Transcriptome Profiling Visualization

Published on July 26, 2012, in BioGraphics.

Preface: This section covers methods for visualizing large-scale transcriptome data, such as RNA-seq and microarray data.

1. PCA

1.1 multi-dimensional data

 

2. hierarchical clustering

2.1 multi-dimensional data

 

3. heatmap

3.1 multi-dimensional data

 

4. MA plot

4.1 can only visualize two-dimensional data (such as wild-type vs. mutant)

 

5. UCSC Genome Browser

5.1 can only visualize part of the genome (such as one gene or one cluster of neighboring genes) at a time

5.2 easy to integrate with known databases (such as RefSeq and data from the ENCODE project)

 

6. Density plot

6.1 multi-dimensional data, but not too many samples (preferably fewer than 6)

 

7. Boxplot

7.1 used for multi-dimensional data
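For the MA plot in item 4, the two plotted quantities per gene are M (the log2 ratio of the two conditions) and A (the mean log2 intensity). A minimal sketch of that transformation follows. The pseudocount of 1, used to avoid taking the log of zero, and the example counts are assumptions for illustration.

```ruby
# Compute MA-plot coordinates for paired expression values.
# M = log2(mutant) - log2(wild type); A = mean of the two log2 values.
# A pseudocount (assumed here to be 1) guards against log2(0).
def ma_values(wild_type, mutant, pseudocount: 1.0)
  wild_type.zip(mutant).map do |wt, mt|
    log_wt = Math.log2(wt + pseudocount)
    log_mt = Math.log2(mt + pseudocount)
    { m: log_mt - log_wt, a: (log_wt + log_mt) / 2.0 }
  end
end

# Made-up counts for three genes.
ma_values([7, 15, 31], [15, 15, 7]).each do |point|
  puts "A=#{point[:a]}\tM=#{point[:m]}"
end
```

Plotting A on the x-axis and M on the y-axis then shows whether fold changes depend on overall expression level, which is why the plot is limited to one pair of conditions at a time.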


BioDig: Tools for microRNA Expression Profiling Based on Small-RNA High-throughput Sequencing

Published on July 25, 2012, in BioDig.

1. miRDeep2

link to miRDeep2

mapping, quantification and prediction

language: Perl

2. miRExpress

link to miRExpress

 

3. miReep

link to MiReep

Last update: 2009-07-17

4. miRanalyzer

link to miRanalyzer

Last update: April 26, 2011

supported input: FASTA or sequences

5. miRtools

link to miRTools

provide web service

maximum input: 20M

6. DSAP

link to DSAP

provide web service

maximum file size: 300Mb

input files: fastq or sequencing tags

7. MiReNA

link to MiReNA

8. miRNAkey

link to miRNAkey

Current Reviews:

Performance comparison and evaluation of software tools for microRNA deep-sequencing data analysis

notes:

The simple and intuitive way may be the best one.