Splice life to understand yourself: 2008

Dec 29, 2008

What is Obama's Logo?

I talked with one of my friends a moment ago.
I said, "You should change; you can have a better life."
He said, "I am waiting for the big change; I should watch it silently. What is Obama's logo?"

I said in my heart, "It's hope, the rising sun, the circle of life, the circle of change."
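In Chinese, the above is called "故弄玄虚" (deliberately making things sound mysterious). Isn't this just the principle of the Bagua diagram, which has existed for thousands of years? It is a pity that fewer and fewer young Chinese people pay attention to Chinese culture these days. Foreigners use it without citing it, which I condemn!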

Dec 25, 2008

A reasonably thorough table of next-gen-seq software (reposted from SEQanswers)

A reasonably thorough table of next-gen-seq software available in the commercial and public domain

Integrated solutions
* CLCbio Genomics Workbench - de novo and reference assembly of Sanger, 454, Solexa, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, browser and other features. Runs on Windows, Mac OS X and Linux.
* NextGENe - de novo and reference assembly of Illumina and SOLiD data. Uses a novel Condensation Assembly Tool approach where reads are joined via "anchors" into mini-contigs before assembly. Requires Win or MacOS.
* SeqMan Genome Analyser - Software for Next Generation sequence assembly of Illumina, 454 Life Sciences and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Early release commercial software. Compatible with Windows® XP X64 and Mac OS X 10.4.

Align/Assemble to a reference
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Link to discussion thread here. Written by Ben Langmead and Cole Trapnell.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* EULER - Short read assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research).
* Exonerate - Various forms of alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GMAP - GMAP (Genomic Mapping and Alignment Program) for mRNA and EST Sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* MOSAIK - Reference guided aligner/assembler. Written by Michael Strömberg at Boston College.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre.
* MUMmer - MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg - most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* RMAP - Assembles 20 - 64 bp Solexa reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Works like ELAND, can do 3 or more bp mismatches and also INDELs. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto.
* Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC. Paper is here.
* SOAP - SOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. Author is Ruiqiang Li at the Beijing Genomics Institute. C++ for Unix.
* SSAHA - SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SXOligoSearch - SXOligoSearch is a commercial platform offered by the Malaysian based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web Portal. OS independent.

de novo Align/Assemble
* MIRA2 - MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de-novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa and Sanger data. Linux OS required.
* SHARCGS - De novo assembly of short reads. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.
* SSAKE - Version 2.0 of SSAKE (23 Oct 2007) can now handle error-rich sequences. Authors are René Warren, Granger Sutton, Steven Jones and Robert Holt from Canada's Michael Smith Genome Sciences Centre. Perl/Linux.
* VCAKE - De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.
* Velvet - Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Needs about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).

SNP/Indel Discovery
* ssahaSNP - ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence. Highly repetitive elements are filtered out by ignoring those kmer words with high occurrence numbers. More tuned for ABI Sanger reads. Developers are Adam Spargo and Zemin Ning from the Sanger Centre. Compaq Alpha, Linux-64, Linux-32, Solaris and Mac.
* PolyBayesShort - A re-incarnation of the PolyBayes SNP discovery tool developed by Gabor Marth at Washington University. This version is specifically optimized for the analysis of large numbers (millions) of high-throughput next-generation sequencer reads, aligned to whole chromosomes of model organism or mammalian genomes. Developers at Boston College. Linux-64 and Linux-32.
* PyroBayes - PyroBayes is a novel base caller for pyrosequences from the 454 Life Sciences sequencing machines. It was designed to assign more accurate base quality estimates to the 454 pyrosequences. Developers at Boston College.

Genome Annotation/Genome Browser/Alignment Viewer/Assembly Database
* STADEN - Includes GAP4. GAP5 once completed will handle next-gen sequencing data. A partially implemented test version is available here
* EagleView - An information-rich genome assembler viewer. EagleView can display a dozen different types of information including base quality and flowgram signal. Developers at Boston College.
* XMatchView - A visual tool for analyzing cross_match alignments. Developed by Rene Warren and Steven Jones at Canada's Michael Smith Genome Sciences Centre. Python/Win or Linux.
* SAM - Sequence Assembly Manager. Whole Genome Assembly (WGA) Management and Visualization Tool. It provides a generic platform for manipulating, analyzing and viewing WGA data, regardless of input type. Developers are Rene Warren, Yaron Butterfield, Asim Siddiqui and Steven Jones at Canada's Michael Smith Genome Sciences Centre. MySQL backend and Perl-CGI web-based frontend/Linux.

ChIP-Seq/BS-Seq
* FindPeaks - Performs analysis of ChIP-Seq experiments. It uses a naive algorithm for identifying regions of high coverage, which represent Chromatin Immunoprecipitation enrichment of sequence fragments, indicating the location of a bound protein of interest. Original algorithm by Matthew Bainbridge, in collaboration with Gordon Robertson. Current code and implementation by Anthony Fejes. Authors are from Canada's Michael Smith Genome Sciences Centre. Java/OS independent. Latest versions available as part of the Vancouver Short Read Analysis Package.
* CHiPSeq - Program used by Johnson et al. (2007) in their Science publication
* BS-Seq - The source code and data for the "Shotgun Bisulphite Sequencing of the Arabidopsis Genome Reveals DNA Methylation Patterning" Nature paper by Cokus et al. (Steve Jacobsen's lab at UCLA). POSIX.
* SISSRs - Site Identification from Short Sequence Reads. BED file input. Raja Jothi @ NIH. Perl.
* QuEST - Quantitative Enrichment of Sequence Tags. Sidow and Myers Labs at Stanford. From the 2008 publication Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. (C++)
**See also this thread for ChIP-Seq, until I get time to update this list.

Alternate Base Calling
* Rolexa - R-based framework for base calling of Solexa data. Project publication
* Alta-cyclic - "a novel Illumina Genome-Analyzer (Solexa) base caller"

Dec 24, 2008

Merry Christmas!

Merry Christmas! Made by BP.

Bioinformatics never dies, just bioinformaticians fade away.

Bioinformatics never dies, just bioinformaticians fade away.---Ege of Systems biology.

Bioinformatics has become too central to biology to be left to specialist bioinformaticians. Biologists are all bioinformaticians now.

Opinion
Bioinformatics: alive and kicking
Lincoln D Stein

Found online at http://genomebiology.com/2008/9/12/114

Dec 4, 2008

What is a friend in life?

A thread I saw on DXY today. I like these words.

Although you walk through the dark valley of Ur life,
please feel no fear.
I just want to tell you
I will be with you,
.....always......

You can have my words.
As long as you need me,
I will be there.

Nov 10, 2008

Note: Reading large tables into R

Reading large tables from text files into R is possible but knowing a few tricks will make your life a lot easier and make R run a lot faster.

First, read the help page for 'read.table'. It contains many hints on how to read in large tables. Of course, help pages tend to be a little confusing, so I'll try to distill the relevant details here.

The following options to 'read.table()' can affect R's ability to read large tables:

colClasses

This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make 'read.table' run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set 'colClasses = "numeric"'. If the columns are all different classes, or perhaps you just don't know, then you can have R do some of the work for you.

You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called "datatable.txt", I can read in the first 5 rows and determine the column classes from that:

tab5rows <- read.table("datatable.txt", header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)

Always try to use 'colClasses'; it will make a big difference.

nrows

Specifying the 'nrows' argument doesn't necessarily make things go faster, but it can help a lot with memory usage. R doesn't know how many rows it's going to read in, so it first makes a guess, and then when it runs out of room it allocates more memory. The constant allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory.

Of course, you may not know how many rows your table has. The easiest way to find this out is to use the 'wc' command in Unix. If you run 'wc datatable.txt' in Unix, it will report the number of lines in the file (the first number). You can then pass this number to the 'nrows' argument of 'read.table()'.

If you can't use 'wc' for some reason, but you know that there are definitely fewer than, say, N rows, then you can specify 'nrows = N' and things will still be okay. A mild overestimate for 'nrows' is better than none at all.
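
If 'wc' is not available, the line count can also be obtained inside R. The following is only a minimal sketch and not part of the original note: 'count_lines' is a hypothetical helper name, and the chunk size of 10000 lines is an arbitrary assumption. It reads the file in chunks so that the whole file is never held in memory at once.

```r
# Hypothetical helper (assumption, not from the original note):
# count the lines of a text file without loading it all at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always release the connection
  n <- 0L
  repeat {
    # Read the next chunk; an empty result means end of file.
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Example use (file name taken from the text above):
# n_lines <- count_lines("datatable.txt")
# tabAll  <- read.table("datatable.txt", header = TRUE, nrows = n_lines)
```

The count includes the header line, so it slightly overestimates the number of data rows, which is harmless for 'nrows'.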

comment.char

If your file has no comments in it (e.g. lines starting with '#'), then setting 'comment.char = ""' will sometimes make 'read.table()' run faster.

I tested this method with the command 'file <- read.table("quant-norm.pm-gcbg.plier.pca-select.summary.txt", header = TRUE, skip = 80, colClasses = classes, sep = "\t", comment.char = "", nrows = 1236087)'. With these three parameters set, memory usage dropped by about 50%; without them, R ran out of memory. The test machine ran Linux with 4 GB of RAM.
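
The three options can be combined in one small wrapper. This is a hedged sketch rather than a definitive implementation: 'read_large_table' is a hypothetical name, and guessing the classes from the first 5 rows is an assumption that can fail (for example, a column that looks like "integer" in the first rows but contains decimals later will cause an error).

```r
# Hypothetical wrapper (assumption): guess column classes from a few rows,
# then read the full table with colClasses, nrows and comment.char set.
read_large_table <- function(path, nrows, sep = "", skip = 0) {
  # Read only the first 5 rows to infer the class of each column.
  sample_rows <- read.table(path, header = TRUE, sep = sep, skip = skip,
                            nrows = 5, comment.char = "")
  classes <- sapply(sample_rows, class)

  # Read the whole table using the inferred classes.
  # 'nrows' may be a mild overestimate of the true row count.
  read.table(path, header = TRUE, sep = sep, skip = skip,
             colClasses = classes, nrows = nrows, comment.char = "")
}

# Example use (file name taken from the text above; row count is illustrative):
# tabAll <- read_large_table("datatable.txt", nrows = 100000)
```

If the class guess turns out to be wrong for a column, reading more sample rows or setting that column's class by hand is the simplest fix.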

Nov 5, 2008

Simplify the biological data

Cite: Scientia's Glorikian said this week that it is unlikely that any one technology, be it qPCR, arrays, FISH, or second-gen sequencing, will win out in the end when it comes to molecular diagnostics.

I agree, and I think that proteomics data should serve as a complement.

Oct 21, 2008

Made by BP

Oct 19, 2008

Missions of April Youth?

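Am I one of the "April Youth"? Am I looking up at the starry sky? Do our country and society really have no expectations of us? We must stay clear-headed, yet remain calm.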

Oct 16, 2008

Where am I in spirit?

Kierkegaard's three modes of existence -- the aesthetic, the ethical, and the religious. I am in the second stage.

Søren Aabye Kierkegaard (5 May 1813 – 11 November 1855) was a prolific 19th-century Danish philosopher and theologian. Kierkegaard strongly criticized both the Hegelianism of his time and what he saw as the empty formalities of the Danish church.

From wiki: http://en.wikipedia.org/wiki/S%C3%B8ren_Kierkegaard

Oct 15, 2008

Personal genomics and medicine resources

Some companies
https://www.23andme.com/
http://www.navigenics.com/
http://www.decode.com/
http://www.knome.com/home/
http://www.dnadirect.com/
http://www.pacificbiosciences.com/

http://www.hgbiochip.com
http://www.yigene.com/

Bioinformatics companies for NGS
http://www.genomequest.com/

Public personal genomics
http://thepersonalgenome.com/
http://www.yourgenome.org/

A good news website on personal genomics
http://www.eyeondna.com/

Resources for participatory medicine and predictive medicine
http://mydaughtersdna.org/
http://www.framinghamheartstudy.org/risk/hrdcoronary.html

An aggregator for pathway databases
http://www.pathguide.org/

Two popular medical informatics systems:
VISTA http://www.virec.research.va.gov/DataSourcesName/VISTA/VISTA.htm
the clinical health services framework implemented by the Boston-based CareGroup

From genotyping to complete genome analysis.

Kabasiji key resource

http://a.kavkiskey.com/new.html
Useful resource.

Oct 14, 2008

History and philosophy

Recently I have come to enjoy reading books on history and philosophy very much. I have learned something from those stories. Humans live in nature and belong to nature. We should love nature.

When will my book on Wang Yang-Ming arrive? Wang Yang-Ming was a Ming Chinese idealist Neo-Confucian philosopher, official, educator, calligrapher and general. Please see http://en.wikipedia.org/wiki/Wang_Yangming to learn about him.

Bird Hang is my first English name, I like it.

When I enrolled at my university in 1998, I played basketball with my classmates almost every day under the sunshine. I remember that I was very confident in my shooting at that time, and they gave me the nickname "Bird".

I can't fly and I am not a bird at all. The name comes from Larry Bird of the Celtics in the NBA. Ten years have gone by, and Dr. Chen has now reminded me that I am "Bird". Additionally, I was happy to hear from B that he took part in the Celtics' championship parade in Boston. I like the atmosphere there.

Splice life from points to system

Life is a mystery to me. I have tried my best to understand life for about three decades. It is just like a cake: you should cut it and taste it. In your mouth, you will find that it is not only a small point but also a huge system.

We should respect the works of God and human nature.

Splice life to understand yourself