Heyo! This week we’re lucky enough to have a guest post by fellow MSTP, Vincent Laufer, GS3. Here’s his account of adventures in genomics and computer programming at the NCBI Hackathon earlier this month. Computer whizzes, gene whizzes, and whizzes of any sort, read on and enjoy. -Paige
In late June, I got an email from a friend, Anna, detailing something called the “NCBI hackathon.” Intrigued, I read through the information at: http://www.ncbi.nlm.nih.gov/news/05-28-2015-genomics-hackathon-august/. Now, to many of you, this may conjure images of grown, bearded men in ninja turtles T-shirts eating cold pizza and wearing thick glasses, and it may sound like something you want nothing to do with. To me, however, it sounded fantastic. I spoke to my PhD mentor, Lou Bridges, and sent in the application form that day. Two weeks later, I learned from the NCBI, or National Center for Biotechnology Information (which is part of the National Library of Medicine at the NIH), that I had been selected for the team on “push-button filtering of VCF files.”
Scarcely 4 weeks after that, I was packing for my trip to Bethesda, Maryland. And, although I still did not know if I should be dressing in a ninja turtle T-shirt or a suit and tie (I went with a happy medium of the two), I did know enough to be excited, and I was not disappointed…
I arrived on the morning of Monday the 3rd, and in total about 50 people from industry, graduate and postdoctoral training, the NIH, and elsewhere filed in. We were divided into 6 teams, with diverse goals such as education, RNA-seq normalization, genomic DNA bioinformatics processing, etc. After introductions, we broke out into our teams. My team started by talking through what we planned to do, then divided up the workflow. To explain this workflow I’ll have to take a step back and talk a little bit about how we currently sequence a human genome.
Right now, we use technology based on “short read sequencing” almost everywhere. Briefly, you extract genomic DNA from a sample, like blood, massively amplify that DNA, chop it up into short bits, and read the sequence of each bit. You then align those short reads to a reference genome that is as accurate as we can make it, for now.
The file containing all the short bits, before they are aligned, is a FASTQ file. Once you align them all, you can stack them vertically, with every read assigned to a position. That is a rough idea of a BAM file. Finally, you can compare all the reads that cover a specific position in a BAM file. If you are confident that something is a real variant and not an artifact, you can “call” the variant. In effect, this means, “I believe there is a real variant at chr2:30494823.” This goes into a Variant Call Format file, or VCF file. This format is what the field has converged on as the standard, and it is the file you do most of your analysis (as opposed to preparation or QC) on.
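To make the idea of “calling” a variant from stacked reads a bit more concrete, here is a toy sketch in Python. It is not what our pipeline or any real variant caller does: the function name `call_variant` and both thresholds are assumptions chosen for illustration, and real callers use statistical models that account for base and mapping quality.

```python
from collections import Counter


def call_variant(ref_base, pileup, min_depth=8, min_fraction=0.2):
    """Return the most common non-reference base if it looks real, else None.

    `pileup` is the list of bases that the aligned reads show at one position.
    The thresholds are arbitrary illustrative values, not recommended settings.
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads at this position to say anything

    # Count only the bases that disagree with the reference genome.
    counts = Counter(base for base in pileup if base != ref_base)
    if not counts:
        return None  # every read agrees with the reference

    alt_base, alt_count = counts.most_common(1)[0]
    # Call the variant only if enough of the reads support it.
    if alt_count / depth >= min_fraction:
        return alt_base
    return None


# Ten reads stacked over one position whose reference base is "A".
print(call_variant("A", list("AAAAGGGGGA")))
```

With half of the reads showing a G, the sketch calls the variant; with a single stray G out of ten reads, it treats the mismatch as a likely artifact and returns `None`.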
So the first 3 steps of our pipeline were to automate this process. We take any of the three files, and convert it, ultimately, into a VCF file. We then take the VCF file, and annotate it using a program called snpEff (http://snpeff.sourceforge.net/). What I mean is, we say, this variant, the one called at chr4:40249428, is actually in a gene, and it is a missense variant, and its rsID is rs3949842, and it has a SIFT score of 2, and it has a conservation score of 7.2, etc. etc.
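For a feel of what annotation does to the file itself, here is a minimal sketch that appends key=value pairs to the INFO column (the eighth tab-separated column) of one VCF data line. The helper `annotate_vcf_line`, the keys `EFFECT` and `CONS`, and the REF, ALT, QUAL, and DP values in the example line are all made up for illustration; snpEff writes its own, richer annotation fields, which this does not reproduce.

```python
def annotate_vcf_line(line, annotations):
    """Append key=value annotations to the INFO column of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th fixed column of a VCF record
    extra = ";".join(f"{key}={value}" for key, value in annotations.items())
    # "." means the INFO column was empty, so replace it instead of appending.
    fields[7] = extra if info in (".", "") else f"{info};{extra}"
    return "\t".join(fields)


# Hypothetical record at the position mentioned above.
line = "chr4\t40249428\trs3949842\tA\tG\t50\tPASS\tDP=30"
annotated = annotate_vcf_line(line, {"EFFECT": "missense_variant", "CONS": 7.2})
print(annotated.split("\t")[7])
```

The point is only that annotation enriches each variant in place, so everything downstream still sees an ordinary VCF file.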
The next steps we took were really interesting and novel. We took the VCF file and converted it to a JSON object, which is an object with a standard syntax that can be processed using a variety of web tools. We then uploaded this genomic information – in the form of a JSON object – into Solr (https://en.wikipedia.org/wiki/Apache_Solr) (think of it like a kind of search engine). This enables you to view the genomic information you uploaded (in whatever form you uploaded it) in the browser, indexed and searchable.
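The VCF-to-JSON step can be sketched with nothing but the Python standard library. This is my own simplified illustration, not the code our team wrote: the function names are hypothetical, the example record is invented around the position used earlier, only the eight fixed VCF columns are kept, and the upload to Solr is left out entirely.

```python
import json

# The eight fixed columns that start every VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Split 'KEY=VALUE;FLAG' into a dict; flags without '=' become True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def vcf_line_to_doc(line):
    """Turn one VCF data line into a JSON-serializable dict."""
    # zip stops after the eight fixed columns, so sample columns are dropped.
    record = dict(zip(VCF_COLUMNS, line.rstrip("\n").split("\t")))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    record["INFO"] = parse_info(record["INFO"])
    return record


# Hypothetical record; only the position comes from the text above.
line = "chr2\t30494823\t.\tC\tT\t60\tPASS\tDP=25;DB"
print(json.dumps(vcf_line_to_doc(line), indent=2))
```

Once every variant is a small JSON document like this, a search engine can index each field, which is what makes queries such as “all variants on chr2” quick to answer.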
So, each of us had a part of this pipeline to code – mine was the variant annotation. We worked on it day and night for three days, taking time out to socialize on Tuesday night and during the day a bit. I met extremely accomplished people in several fields of bioinformatics, people such as Dr. Sean Davis of the NCI and Lukas Wagner of the NCBI, peers in graduate study, researchers making the transition from postdoctoral fellow to junior faculty, and many others. At the end of the three days, we all presented our projects. I was stunned at the quality and scope of the projects that my colleagues were able to put together in that short time – and what’s more, it is all open-source and available on GitHub: https://github.com/DCGenomics?tab=repositories, so you can stand agape alongside me.
Departing from Bethesda, I had so much to take away from the experience: new tools I had learned to use like Snakemake (https://bitbucket.org/johanneskoester/snakemake/wiki/Home) and PyCharm (https://www.jetbrains.com/pycharm/), new contacts and colleagues, and most importantly a new outlook: that together we can build the informatic infrastructure necessary to break some of the most recalcitrant problems that have persisted in medical science.