Jamie Alnasir experienced in Distributed Computing solutions (Platform LSF, Apache Hadoop, Spark and MapReduce)

Distributed Computing experience/projects

I am experienced in the construction and use of distributed systems, which form a significant part of my PhD work. In particular, I use conventional batch-scheduled Linux clusters running IBM's Platform LSF (Load Sharing Facility), as well as newer technologies such as Apache Hadoop and Spark clusters. I have presented my work in this area at several conferences; in 2015 these included events in Morocco and Ireland, and at home in the UK at my research institution (Royal Holloway, University of London), where I assist with the teaching and delivery of the "Large scale data analysis and storage" module of the MSc Data Science postgraduate degree (see Teaching).

I have also built, and maintain, my own Linux clusters - a batch-scheduler cluster as well as an Apache Hadoop/Spark cluster, both on commodity hardware - which serve prototyping, research and development purposes.

Apache Hadoop & MapReduce

At July 2015's 3D-Sig meeting (Structural Bioinformatics and Computational Biophysics), which was part of the ISMB (Intelligent Systems for Molecular Biology) conference held in Dublin, I presented a framework that uses Apache Hadoop to parallelise non-MapReduce scientific applications [ view submission abstract ]. The framework applies technologies and techniques commonly used in "big data" to applications in the scientific domain. Jointly conceived by myself and Dr. Hugh Shanahan (my PhD supervisor) and implemented by me, the project aims to help structural biologists parallelise their applications with Hadoop without writing complex MapReduce code. The use case employs Hadoop Streaming and allows end users to run their jobs against the Protein Data Bank (PDB) of X-ray crystallographic and NMR (Nuclear Magnetic Resonance) models of macromolecular structures without having to rewrite their existing applications for MapReduce.
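To illustrate the general idea of wrapping an existing program with Hadoop Streaming, here is a minimal Python sketch of a mapper. It is not the framework's actual code. It assumes that each input line holds the path of one structure file and that the unmodified tool accepts that path as its last argument. The function name `map_records` and the convention of passing the tool's command line as the mapper's first argument are hypothetical, chosen only for this example.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: run an unmodified tool once per input record."""
import shlex
import subprocess
import sys


def map_records(lines, tool_argv):
    """Yield (key, value) pairs, one per non-empty input line.

    Assumption: each line is the path of one structure file. The existing
    tool is invoked as tool_argv + [path]; its stdout, flattened onto a
    single line, becomes the value.
    """
    for line in lines:
        path = line.strip()
        if not path:
            continue
        result = subprocess.run(
            tool_argv + [path],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode != 0:
            # stderr output ends up in the Hadoop task logs.
            print(f"tool failed on {path}", file=sys.stderr)
            continue
        # Collapse whitespace so the value stays on one output line.
        yield path, " ".join(result.stdout.split())


def main(argv):
    # Hypothetical convention: argv[1] is the tool's command line.
    if len(argv) < 2:
        print("usage: mapper.py '<tool command>'", file=sys.stderr)
        return
    tool_argv = shlex.split(argv[1])
    for key, value in map_records(sys.stdin, tool_argv):
        # Hadoop Streaming treats the text before the first tab as the key.
        print(f"{key}\t{value}")


if __name__ == "__main__":
    main(sys.argv)
```

A script of this kind would typically be handed to Hadoop Streaming through its `-mapper` option, with `-input` pointing at a text file that lists the structure file paths. The existing application is run as-is, which is the point of the approach: no MapReduce code has to be written for the tool itself.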

Apache Hive

Together with my PhD supervisor, I co-supervised an MSc student's project to make the Protein Data Bank available through an Apache Hive database, queryable with HiveQL (the Hive Query Language), for access by Python applications.

Apache Spark

As part of my PhD I have been researching and using Apache Spark for the analysis of Next-Generation Sequencing data. In particular, I am researching and developing methods for transcriptomic analysis pipelines that employ MapReduce on Spark for efficient, high-performance computing.