I am experienced in building and using distributed systems, which form a
significant part of my PhD. In particular, I work with conventional
batch-scheduled Linux clusters running IBM's Platform LSF
(Load Sharing Facility) as well as newer
technologies such as Apache Hadoop and Spark clusters. I've presented
my work in this area at various conferences; in 2015 these included events in
Morocco, Ireland and at home in the UK at my research institution (Royal
Holloway, University of London), where I also assist with teaching the
"Large scale data analysis and storage" module of the MSc Datascience
postgraduate degree.
I've also built and maintain my own Linux clusters - a batch-scheduler cluster
as well as an Apache Hadoop/Spark cluster using commodity hardware - which I
use for prototyping and for research and development.
Hadoop & MapReduce
I presented a framework that uses Apache Hadoop to parallelise non-MapReduce
scientific applications at July 2015's 3D-Sig
(Conference on Structural Biology and Computational Biophysics), which was part
of ISMB (Intelligent Systems for Molecular Biology), held in Dublin.
This framework applies technologies and techniques
commonly used in "big data" to applications in the scientific domain.
Jointly conceived by Dr. Hugh Shanahan (my PhD supervisor) and me, and
implemented by me, the project aims to help structural biologists
parallelise their applications using Hadoop without writing complex MapReduce
code. The use case relies on Hadoop Streaming and allows end users to run their
jobs against the Protein Data Bank of X-ray crystallographic and NMR (Nuclear
Magnetic Resonance) models of macromolecular structures without having to
rewrite their existing applications for MapReduce.
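To illustrate the general idea, here is a minimal sketch of a Hadoop Streaming mapper written in Python. It is not the framework's actual code: the function names, the key format and the assumption that each input line is a path to one structure file are all hypothetical, chosen only to show how an unmodified command-line tool can be wrapped so that Hadoop distributes the work.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: wrap an unmodified command-line tool."""
import os
import subprocess
import sys


def structure_id(path):
    """Derive a record key from a file name (hypothetical key format)."""
    return os.path.splitext(os.path.basename(path))[0]


def map_records(lines, command):
    """Run `command <path>` for each input path; yield 'key<TAB>value' records."""
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank input lines
        result = subprocess.run(
            command + [path],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Report failures on stderr so they do not pollute the records.
            print(f"failed: {path}", file=sys.stderr)
            continue
        # Collapse the tool's output to one line so it fits a single record.
        value = " ".join(result.stdout.split())
        yield f"{structure_id(path)}\t{value}"


def main(command, stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: paths arrive on stdin, records leave on stdout."""
    for record in map_records(stdin, command):
        stdout.write(record + "\n")
```

In a real job the script would call `main()` with the existing tool's command line and be handed to Hadoop Streaming through its `-mapper` option. The tool itself stays untouched, since Streaming only requires that the mapper read lines from standard input and write tab-separated key/value lines to standard output.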
Along with my PhD supervisor I co-supervised an MSc
student on her project to make the Protein Data Bank available through a Hive
database, queried with the Hive Query Language, for access by Python applications.
As part of my PhD I've been researching and using Apache Spark for
the analysis of Next-Generation Sequencing data. In particular, I'm
researching and developing methods for use in pipelines for
transcriptomic analysis that employ MapReduce on Spark for efficient, high