Dr Jamie Alnasir - Ph.D. Projects at Royal Holloway

Ph.D. Projects at Royal Holloway University of London (RHUL)

Hercules (Apache Spark) k-mer Transcriptomics Analysis System

Jan 2014 - May 2017

Developed as part of the research undertaken for my Ph.D. thesis, this system performs deep analysis of short-read RNA-seq transcriptomics data. It works at the exon level (the non-reducible unit of genetic information) and examines k-mers (DNA/RNA sequences of length k) to quantify bias in the distribution of mapped reads, which makes it possible to assess the quality of datasets. It is designed to scale to extremely large, high-throughput datasets. For example, as of January 2017, the Sequence Read Archive (SRA) alone, which stores DNA/RNA sequencing data, contained over 9 Petabases (9.377 x 10^15 bases, over 9 quadrillion letters, roughly a Petabyte of data) in over 30,000 experimental studies.

Quantifying Transcriptomic Read Distribution Across Exons

Application of Map-Reduce - Summarising k-mer Counts by Position
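The idea of summarising k-mer counts by position with map-reduce can be illustrated with a minimal sketch. It uses plain Python rather than Apache Spark and is not the Hercules implementation: the read tuples `(exon_id, start_offset, sequence)`, the choice `K = 3` and the function names are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

K = 3  # k-mer length (illustrative choice, not taken from Hercules)


def map_read(read):
    """Map step: emit ((position, k-mer), 1) for every k-mer in one mapped read.

    `read` is a hypothetical (exon_id, start_offset, sequence) tuple, where
    start_offset is the read's start position relative to the exon start.
    """
    _exon_id, start, seq = read
    return [((start + i, seq[i:i + K]), 1) for i in range(len(seq) - K + 1)]


def reduce_counts(acc, pair):
    """Reduce step: sum the counts of identical (position, k-mer) keys."""
    key, count = pair
    acc[key] += count
    return acc


def kmer_counts_by_position(reads):
    """Run the map step over all reads, then fold the pairs into one Counter."""
    pairs = (pair for read in reads for pair in map_read(read))
    return reduce(reduce_counts, pairs, Counter())


# Toy reads (made-up data): positions are pooled across exons.
reads = [
    ("exon1", 0, "ACGTA"),
    ("exon1", 1, "CGTAC"),
    ("exon2", 0, "ACGTT"),
]
counts = kmer_counts_by_position(reads)
for (position, kmer), n in sorted(counts.items()):
    print(position, kmer, n)
```

In a Spark setting the same two steps would typically correspond to a `flatMap` followed by a `reduceByKey`; a strongly uneven spread of counts across positions is the kind of non-uniformity the analysis aims to quantify.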

Alnasir, J. & Shanahan, H. P. (2017). Transcriptomics: Quantifying non-uniform read distribution using MapReduce. International Journal of Foundations of Computer Science, 29(8). [ Read ]
Alnasir, J., & Shanahan, H. (2017). A novel method to detect bias in Short Read NGS RNA-seq data. Journal of Integrative Bioinformatics, 14(3). [ Read ]
Alnasir, J., & Shanahan, H. P. (2015). Transcriptomics on Spark Workshop - Introducing Hercules - an Apache Spark MapReduce algorithm for quantifying non-uniform gene expression. CloudTech'16, Marrakech, Morocco. [ Read ]
Alnasir, J. (2016). Computer Science Post-graduate Research Colloquium 2016. Royal Holloway, pg 5 [ Read ]

Investigation into metadata annotation in the SRA (Sequence Read Archive) DNA database repository

Oct 2013 - May 2015

Background: It is important to understand the challenges and limitations of data production, metadata annotation and deposition in high-throughput sequencing. The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. There are a series of protocol steps to be followed in the preparation of samples for next-generation sequencing. The quantification of bias in a number of protocol steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be determined. Our recent investigation of the Sequence Read Archive (SRA) revealed that the annotation of key sequencing steps in submissions to the database remains inadequate.

NGS (Next-Generation Sequencing) Steps

NGS Process Workflow

Results: We examined the experimental metadata of the public Sequence Read Archive (SRA) repository to ascertain the level of annotation of important sequencing steps in submissions to the database. We used SQL queries against the SRAdb SQLite database (generated by the Bioconductor consortium) to search for keywords that commonly occur in key preparatory protocol steps. Partitioned over studies, 7.10%, 5.84% and 7.57% of all records had at least one keyword for fragmentation, ligation and enrichment, respectively. Only 4.06% of all records, partitioned over studies, had keywords for all three protocol steps (5.58% of all SRA records).
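A keyword search of this kind can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration, not the queries used in the study: the `records` table, its `protocol` column, the keyword lists and the toy rows are all assumptions, and the sketch does not reproduce the partitioning over studies.

```python
import sqlite3

# Illustrative keywords per protocol step (assumptions, not the study's lists).
KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "shear"],
    "ligation": ["ligat", "adapter"],
    "enrichment": ["enrich", "pcr", "amplif"],
}


def step_clause(words):
    """Build a parenthesised OR of LIKE tests and the matching bound parameters."""
    clause = " OR ".join("LOWER(protocol) LIKE ?" for _ in words)
    return f"({clause})", [f"%{w}%" for w in words]


def percent_annotated(conn, steps):
    """Percentage of records whose protocol text matches every step in `steps`."""
    clauses, params = [], []
    for step in steps:
        clause, step_params = step_clause(KEYWORDS[step])
        clauses.append(clause)
        params.extend(step_params)
    where = " AND ".join(clauses)
    matched = conn.execute(
        f"SELECT COUNT(*) FROM records WHERE {where}", params
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    return 100.0 * matched / total


# Hypothetical in-memory table standing in for the real metadata database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (accession TEXT, protocol TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        ("R1", "DNA was fragmented by sonication, adapters ligated, PCR enriched"),
        ("R2", "Sheared DNA"),
        ("R3", None),  # no protocol annotation at all
        ("R4", "Standard Illumina protocol"),
    ],
)

for step in KEYWORDS:
    print(step, percent_annotated(conn, [step]))
print("all three", percent_annotated(conn, list(KEYWORDS)))
```

Records with a missing (NULL) protocol field never match a `LIKE` test, so they count towards the total but not towards the annotated fraction.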

Levels of NGS Metadata Annotation in the SRA

Conclusions: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.

Metadata Annotation Poster

Alnasir, J., & Shanahan, H. P. (2015). Investigation into the annotation of protocol sequencing steps in the sequence read archive. GigaScience, 4(1), 23. [ Read ]
Alnasir, J. (2014). Computer Science Post-graduate Research Colloquium 2014. Royal Holloway, pg 20 [ Read ]

PDB-Hadoop (Apache) Framework for Running Legacy Computational Biology Tools

Jan 2014 - May 2017

Developed as part of the research undertaken for my Ph.D. thesis, PDB-Hadoop is an Apache Hadoop framework for running legacy computational biology tools.

PDB-Hadoop Framework Architecture

PDB-Hadoop Poster

Alnasir, J. & Shanahan, H. P. (2018). The application of Hadoop in Structural Bioinformatics. Oxford University Press: Briefings in Bioinformatics, pp1-10. [ Read ]
Alnasir, J. & Shanahan, H., (2015). PDB-Hadoop: Parallelising user applications on the protein databank using Apache Hadoop. Poster session presented at 3DSig Structural Bioinformatics and Computational Biophysics 2015, Dublin, Ireland. [ Read ]
Alnasir, J. (2015). Applying Apache Hadoop, Hive and Map Reduce to Legacy Systems and Applications. 1hr Conference talk at MSTI (Mediterranean Space of Technology and Innovation) Innovation week 2015. ENSIAS (École nationale supérieure d'informatique et d'analyse des systèmes), Rabat, Morocco. [ Programme ]
Alnasir, J. (2014). Computer Science Post-graduate Research Colloquium 2015. Royal Holloway, pg 9 [ Read ]