Ph.D. Projects at Royal Holloway University of London (RHUL)
Hercules (Apache Spark) k-mer Transcriptomics Analysis System, Jan 2014 - May 2017
Developed as part of the research for my Ph.D. thesis, this system performs deep analysis of short-read RNA-seq transcriptomics data. It works at the exon level (the non-reducible unit of genetic information) and examines k-mers (DNA/RNA sequences of length k) to quantify bias in the distribution of mapped reads, which makes it possible to assess the quality of datasets. It is designed to scale to extremely large, high-throughput datasets. For example, as of January 2017, the Sequence Read Archive (SRA) alone, which stores DNA/RNA sequencing data, contained over 9 Petabases (9.377 x 10^15 bases, over 9 quadrillion letters, roughly a Petabyte of data) across more than 30,000 experimental studies.
Quantifying Transcriptomic Read Distribution Across Exons
Application of Map-Reduce - Summarising k-mer Counts by Position
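As a rough illustration of summarising k-mer counts by position in a map-reduce style, the sketch below uses plain Python rather than Apache Spark, so it is a stand-in for the idea and not the Hercules implementation. The function names, the (read, start) input format and the toy reads are all assumptions made for this example.

```python
from collections import Counter
from functools import reduce


def map_read(read, start, k):
    """Map step: emit ((position, k-mer), 1) for every k-mer in one mapped read.

    `start` is an assumed 0-based offset of the read within its exon.
    """
    return [((start + i, read[i:i + k]), 1) for i in range(len(read) - k + 1)]


def reduce_counts(acc, pair):
    """Reduce step: sum the counts that share the same (position, k-mer) key."""
    key, count = pair
    acc[key] += count
    return acc


def kmer_counts_by_position(mapped_reads, k):
    """Summarise k-mer counts by position over an iterable of (read, start) pairs."""
    emitted = (
        pair
        for read, start in mapped_reads
        for pair in map_read(read, start, k)
    )
    return reduce(reduce_counts, emitted, Counter())


# Toy input: three short reads mapped to the same exon (made-up data).
reads = [("ACGTAC", 0), ("CGTACG", 1), ("ACGTAC", 0)]
counts = kmer_counts_by_position(reads, k=3)
```

Keying on the (position, k-mer) pair keeps the map step independent per read, which is the property that lets the same pattern be distributed across many workers.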
Investigation into metadata annotation in the SRA (Sequence Read Archive) DNA database repository, Oct 2013 - May 2015
Background: It is important to understand the challenges and limitations of data production, metadata annotation and deposition in high-throughput sequencing. The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. There are a series of protocol steps to be followed in the preparation of samples for next-generation sequencing. The quantification of bias in a number of protocol steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be determined. Our recent investigation of the Sequence Read Archive (SRA) revealed that the annotation of key sequencing steps in submissions to the database remains inadequate.
NGS (Next-Generation Sequencing) Steps
NGS Process Workflow
Results: We examined the experimental metadata of the public Sequence Read Archive (SRA) repository to ascertain the level of annotation of important sequencing steps in submissions to the database. We ran SQL queries against the SRAdb SQLite database generated by the Bioconductor consortium, searching for keywords that commonly occur in key preparatory protocol steps. Partitioned over studies, 7.10%, 5.84% and 7.57% of all records had at least one keyword for fragmentation, ligation and enrichment, respectively. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records).
Levels of NGS Metadata Annotation in the SRA
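The keyword search described in the Results can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the `records` table, its `protocol` column, the keyword lists and the four sample rows are all hypothetical and do not reproduce the SRAdb schema or the keywords used in the study. It also counts plain rows and does not partition over studies.

```python
import sqlite3

# Hypothetical keyword lists; the study's actual keywords are not given here.
KEYWORDS = {
    "fragmentation": ["fragment", "sonicat", "nebuliz"],
    "ligation": ["ligat"],
    "enrichment": ["enrich", "pcr"],
}


def step_clause(column, keywords):
    """Build a SQL expression that is true if any keyword occurs in `column`."""
    return "(" + " OR ".join(f"{column} LIKE ?" for _ in keywords) + ")"


def annotation_percentages(conn, table="records", column="protocol"):
    """Return the percentage of rows mentioning each step, and all steps together."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    result = {}
    all_clauses, all_params = [], []
    for step, words in KEYWORDS.items():
        clause = step_clause(column, words)
        params = [f"%{w}%" for w in words]
        hits = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {clause}", params
        ).fetchone()[0]
        result[step] = 100.0 * hits / total
        all_clauses.append(clause)
        all_params.extend(params)
    # Rows that mention every step: AND the per-step clauses together.
    hits = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE " + " AND ".join(all_clauses),
        all_params,
    ).fetchone()[0]
    result["all_steps"] = 100.0 * hits / total
    return result


# Toy in-memory database standing in for the real metadata (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (accession TEXT, protocol TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        ("R1", "DNA was fragmented, adapters ligated, library enriched by PCR"),
        ("R2", "Fragmentation by sonication only"),
        ("R3", None),
        ("R4", "standard protocol"),
    ],
)
percentages = annotation_percentages(conn)
```

SQLite's LIKE is case-insensitive for ASCII by default, and rows with a NULL protocol never match, so unannotated records count towards the total but not towards any step.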
Conclusions: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
Metadata Annotation Poster
PDB-Hadoop (Apache) Framework for Running Legacy Computational Biology Tools, Jan 2014 - May 2017
Developed as part of the research undertaken in my Doctoral Ph.D thesis, PDB-Hadoop
PDB-Hadoop Framework Architecture
Copyright © 1999-2023 Dr. Jamie J. Alnasir, all rights reserved.