Publications & Research 🔬

All of my published research is listed here in reverse chronological order. If you are interested in reading a specific publication, please reach out to me at contact@sanjaysoundarajan.dev for the full text.

Publicly Available Imaging Datasets for Age-related Macular Degeneration: Evaluation according to the Findable, Accessible, Interoperable, Reusable (FAIR) Principles
Nayoon Gim, Alina Ferguson, Marian Blazes, Sanjay Soundarajan, et al.
Experimental Eye Research
DOI: 10.1016/j.exer.2025.110342
Abstract
Age-related macular degeneration (AMD), a leading cause of vision loss among older adults, affects more than 200 million people worldwide. With no cure currently available and a rapidly increasing prevalence, emerging approaches such as artificial intelligence (AI) and machine learning (ML) hold promise for advancing the study of AMD. The effective utilization of AI and ML in AMD research is highly dependent on access to high-quality and reusable clinical data. The Findable, Accessible, Interoperable, Reusable (FAIR) principles, published in 2016, provide a framework for sharing data that is easily usable by both humans and machines. However, it is unclear how these principles are implemented with regard to ophthalmic imaging datasets for AMD research. We evaluated openly available AMD-related datasets containing optical coherence tomography (OCT) data against the FAIR principles. The assessment revealed that none of the datasets were fully compliant with FAIR principles. Specifically, compliance rates were 5% for Findable, 82% for Accessible, 73% for Interoperable, and 0% for Reusable. The low compliance rates can be attributed to the relatively recent emergence of these principles and the lack of established standards for data and metadata formatting in the AMD research community. This article presents our findings and offers guidelines for adopting FAIR practices to enhance data sharing in AMD research.
Citation
@article{GIM2025110342, title = {Publicly Available Imaging Datasets for Age-related Macular Degeneration: Evaluation according to the Findable, Accessible, Interoperable, Reusable (FAIR) Principles}, journal = {Experimental Eye Research}, pages = {110342}, year = {2025}, issn = {0014-4835}, doi = {10.1016/j.exer.2025.110342}, url = {https://www.sciencedirect.com/science/article/pii/S0014483525001137}, author = {Nayoon Gim and Alina Ferguson and Marian Blazes and Sanjay Soundarajan and Aydan Gasimova and Yu Jiang and Clarissa Sanchez Gutiérrez and Luca Zalunardo and Giulia Corradetti and Tobias Elze and Naoto Honda and Nadia Waheed and Anne Marie Cairns and M. Valeria Canto-Soler and Amitha Dolmalpally and Mary Durbin and Daniela Ferrara and Jewel Hu and Prashant Nair and Aaron Y. Lee and Srinivas R. Sadda and Tiarnan D.L. Keenan and Bhavesh Patel and Cecilia S. Lee}, keywords = {AMD, Artificial Intelligence, Machine Learning, FAIR Data, Data Sharing, Data Reuse, OCT Dataset}, abstract = {Age-related macular degeneration (AMD), a leading cause of vision loss among older adults, affecting more than 200 million people worldwide. With no cure currently available and a rapidly increasing prevalence, emerging approaches such as artificial intelligence (AI) and machine learning (ML) hold promise for advancing the study of AMD. The effective utilization of AI and ML in AMD research is highly dependent on access to high-quality and reusable clinical data. The Findable, Accessible, Interoperable, Reusable (FAIR) principles, published in 2016, provide a framework for sharing data that is easily usable by both humans and machines. However, it is unclear how these principles are implemented with regards to ophthalmic imaging datasets for AMD research. We evaluated openly available AMD-related datasets containing optical coherence tomography (OCT) data against the FAIR principles. 
The assessment revealed that none of the datasets were fully compliant with FAIR principles. Specifically, compliance rates were 5% for Findable, 82% for Accessible, 73% for Interoperable, and 0% for Reusable. The low compliance rates can be attributed to the relatively recent emergence of these principles and the lack of established standards for data and metadata formatting in the AMD research community. This article presents our findings and offers guidelines for adopting FAIR practices to enhance data sharing in AMD research.}}
AI-READI: rethinking AI data collection, preparation and sharing in diabetes research and beyond
AI-READI Consortium
Nature Metabolism
DOI: 10.1038/s42255-024-01165-x
Abstract
Here, we introduce Artificial Intelligence Ready and Equitable Atlas for Diabetes Insights (AI-READI), a multidisciplinary data-generation project designed to create and share a multimodal dataset optimized for artificial intelligence research in type 2 diabetes mellitus.
Citation
@Article{Baxter2024-ez, title = {{AI-READI}: rethinking {AI} data collection, preparation and sharing in diabetes research and beyond}, author = {Baxter, Sally L and de Sa, Virginia R and Ferryman, Kadija and Jain, Prachee and Lee, Cecilia S and Li-Pook-Than, Jennifer and Liu, T Y Alvin and Owen, Julia P and Patel, Bhavesh and Yu, Qilu and Zangwill, Linda M and Bahmani, Amir and Chute, Christopher G and Edberg, Jeffrey C and Hurst, Samantha and Ishikawa, Hiroshi and Lee, Aaron Y and McGwin, Gerald and McWeeney, Shannon and Nebeker, Camille and Owsley, Cynthia and Singer, Sara J and Adib, Riddhiman and Adibuzzaman, Mohammad and Alavi, Arash and Ashley, Catherine and Baer, Adrienne and Benton, Erik and Blazes, Marian and Cohen, Aaron and Cordier, Benjamin and Crist, Katie and Cuddy, Colleen and Gasimova, Aydan and Gim, Nayoon and Hong, Stephanie and Kim, Trina and Lin, Wei-Chun and Mitchell, Jessica and Ngadisastra, Caitlyn and Patronilo, Victoria and Shaffer, Jamie and Soundarajan, Sanjay and Zhao, Kevin and Drolet, Caroline and Lucero, Abigail and Matthies, Dawn and Pittock, Hanna and Watkins, Kate and York, Brittany and Amankwa, Charles E and Bangudi, Monique and Haboudal, Nada and Hallaj, Shahin and Heinke, Anna and Huang, Lingling and Kalaw, Fritz Gerald P and Karsolia, Apoorva and Khazaei, Hadi and Mohammed, Muna and Simpkins, Kyongmi and Wang, Xujing and Consortium, A I-R E A D I and Committee, Writing and Investigators, Principal and Research, Technical and Staff, Clinical and Managers, Project and {Interns} and Scientists, Nih Program}, abstract = {Here, we introduce Artificial Intelligence Ready and Equitable Atlas for Diabetes Insights (AI-READI), a multidisciplinary data-generation project designed to create and share a multimodal dataset optimized for artificial intelligence research in type 2 diabetes mellitus.}, journal = {Nature Metabolism}, month = nov, year = 2024, doi = {10.1038/s42255-024-01165-x}}
SODA: Software to Support the Curation and Sharing of FAIR Autonomic Nervous System Data
Christopher Marroquin, Jacob Clark, Dorian Portillo, Sanjay Soundarajan, Tram Ngo and Bhavesh Patel
Journal of Open Source Software
DOI: 10.21105/joss.06140
Abstract
SODA (Software to Organize Data Automatically) is a free, open-source, cross-platform desktop application that assists researchers in preparing and sharing their autonomic nervous system (ANS) related data according to the guidelines developed by the National Institutes of Health (NIH)'s Stimulating Peripheral Activity to Relieve Conditions (SPARC) Program. By combining intuitive user interfaces with automation, SODA streamlines the process of implementing the SPARC guidelines, which can otherwise be challenging and/or time-consuming for researchers.
Citation
@Article{Marroquin2024, doi = {10.21105/joss.06140}, url = {https://doi.org/10.21105/joss.06140}, year = {2024}, publisher = {The Open Journal}, volume = {9}, number = {100}, pages = {6140}, author = {Christopher Marroquin and Jacob Clark and Dorian Portillo and Sanjay Soundarajan and Tram Ngo and Bhavesh Patel}, title = {SODA: Software to Support the Curation and Sharing of FAIR Autonomic Nervous System Data}, journal = {Journal of Open Source Software} }
Clinical Dataset Structure: A Universal Standard for Structuring Clinical Research Data and Metadata
Bhavesh Patel, Sanjay Soundarajan, Aydan Gasimova, Nayoon Gim, Jamie Shaffer and Aaron Y Lee
ARVO Annual Meeting Abstract
Abstract
During clinical research studies, multiple modalities of data are typically collected such as surveys, vitals, and eye images. There is currently no consensus on how to structure such multimodal data into a consistently organized dataset that is easily reusable by humans and machines in line with the FAIR (Findable, Accessible, Interoperable, Reusable) Principles. We addressed this issue in the Artificial Intelligence Ready and Equitable Atlas for Diabetes Insights (AI-READI) project by developing the Clinical Dataset Structure (CDS), a standard approach for organizing multimodal clinical research data and metadata at the root level.
Making Biomedical Research Software FAIR: Actionable Step-by-step Guidelines with a User-support Tool
Bhavesh Patel, Sanjay Soundarajan, Hervé Ménager and Zicheng Hu
Nature / Scientific Data
DOI: 10.1038/s41597-023-02463-x
Abstract
Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles tailored for research software have been proposed by the FAIR for Research Software (FAIR4RS) Working Group. They provide a foundation for optimizing the reuse of research software. The FAIR4RS principles are, however, aspirational and do not provide practical instructions to the researchers. To fill this gap, we propose in this work the first actionable step-by-step guidelines for biomedical researchers to make their research software compliant with the FAIR4RS principles. We designate them as the FAIR Biomedical Research Software (FAIR-BioRS) guidelines. Our process for developing these guidelines, presented here, is based on a re-classification of the FAIR4RS principles and a thorough review of current practices in the field. To support researchers, we have also developed a tool that streamlines the process of implementing these guidelines. This tool is incorporated in FAIRshare, a free and open-source software application aimed at simplifying the curation and sharing of FAIR biomedical data and software through user-friendly interfaces and automation. Details about this tool are also presented.
Citation
@Article{Patel2023, author={Patel, Bhavesh and Soundarajan, Sanjay and Ménager, Hervé and Hu, Zicheng}, title={Making Biomedical Research Software FAIR: Actionable Step-by-step Guidelines with a User-support Tool}, journal={Scientific Data}, year={2023}, month={Aug}, day={23}, volume={10}, number={1}, pages={557}, issn={2052-4463}, doi={10.1038/s41597-023-02463-x},url={https://doi.org/10.1038/s41597-023-02463-x}}
SPARClink: an interactive tool to visualize the impact of the SPARC program
Sanjay Soundarajan, Sachira Kuruppu, Ashutosh Singh, Jongchan Kim and Monalisa Achalla
F1000Research
DOI: 10.12688/f1000research.75071.1
Abstract
The National Institutes of Health (NIH) Stimulating Peripheral Activity to Relieve Conditions (SPARC) program seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. SPARC-funded researchers are generating rich datasets from neuromodulation research that are curated and shared according to FAIR (Findable, Accessible, Interoperable, and Reusable) guidelines and are accessible to the public on the SPARC data portal. Keeping track of the utilization of these datasets within the larger research community is a feature that will benefit data-generating researchers in showcasing the impact of their SPARC outcomes. This will also allow the SPARC program to display the impact of the FAIR data curation and sharing practices that have been implemented. This manuscript provides the methods and outcomes of SPARClink, our web tool for visualizing the impact of SPARC, which won second prize at the 2021 SPARC FAIR Codeathon. With SPARClink, we built a system that automatically and continuously finds newly published SPARC scientific outputs (datasets, publications, protocols) and the external resources referring to them. SPARC datasets and protocols are queried using publicly accessible REST application programming interfaces (APIs, provided by Pennsieve and Protocols.io) and stored in a publicly accessible database. Citation information for these resources is retrieved using the NIH RePORTER API and the National Center for Biotechnology Information (NCBI) Entrez system. A novel knowledge graph-based structure was created to visualize the results of these queries and showcase the impact that the FAIR data principles can have on the research landscape when they are adopted by a consortium.
Citation
@Article{ 10.12688/f1000research.75071.1, AUTHOR = { Soundarajan, S and Kuruppu, S and Singh, A and Kim, J and Achalla, M}, TITLE = {SPARClink: an interactive tool to visualize the impact of the SPARC program [version 1; peer review: 1 approved with reservations]}, JOURNAL = {F1000Research}, VOLUME = {11}, YEAR = {2022}, NUMBER = {124}, DOI = {10.12688/f1000research.75071.1}}
A comprehensive and high-performance motif finding approach on heterogeneous systems
Sanjay Soundarajan
Thesis
Abstract
Unknown regulatory motif finding on DNA sequences is a crucial task for understanding gene expression and the task requires accuracy and efficiency. We propose DMF, a combinatorial approach that uses hash-based heuristics to skip unnecessary computations while retaining the maximum accuracy. Parallelized versions of our DMF approach, called PDMF, have been developed to use CPU, GPU and heterogeneous computing architectures in order to achieve the maximum performance. PDMF also incorporates SIMD instructions to further accelerate the task of unknown motif search. Our experimental results show that the multicore version (PDMFm) achieved 8.87x speedup over DMF. The GPU version (PDMFg) achieved a 41.48x and 9.95x average speedup over the serial version and PDMFm, respectively. Our SIMD enhanced heterogeneous approach (PDMFh) achieved a 3.42x speedup over our fastest GPU model (PDMFg1). The proposed approach was tested for performance against popular approximate and suffix tree-based approaches with various sized real-world datasets and the experimental results showed that the proposed approach achieved the maximum accuracy within a practical time bound for motif lengths 6~14.
Citation
Soundarajan, Sanjay. A comprehensive and high-performance motif finding approach on heterogeneous systems. California State University, Fresno. ProQuest Dissertations Publishing, 2020.
CPU-GPU Collaborated Computation Models for Biological Sequence Alignment with Mirror-Based Work Load Balancing
Sanjay Soundarajan, Michelle Salomon and Jin H. Park
2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS)
DOI: 10.1109/icpads47876.2019.00043
Abstract
Biological sequence alignment has been used in many application areas of computational biology and bioinformatics, and both non-heuristic and heuristic algorithms have been developed and used. Although high accuracy is guaranteed, non-heuristic approaches, such as the Smith-Waterman algorithm, are not widely used in practice due to their quadratic time complexity. However, recent technological developments in HPC systems have led researchers to propose diverse approaches for accelerating the Smith-Waterman algorithm, including GPU-based approaches. In this paper, we propose efficient CPU-GPU collaborated computation models for protein sequence alignment based on the Smith-Waterman algorithm to exploit the maximum efficiency of a heterogeneous system with multicore processor(s) and GPU(s). The proposed approach implements the full, BLAST-compatible functionality of local sequence alignment and uses efficient strategies for work load balancing and for exploiting the maximum performance of the heterogeneous system. Our experimental results showed that the best CPU-GPU collaborated computation model outperforms the corresponding serial and basic GPU computation models with 30.5x and 2.78x speedups, respectively, on a system with two Xeon E5-2670 processors and a Tesla M2075 GPU.
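For readers unfamiliar with the Smith-Waterman algorithm mentioned in this abstract, here is a minimal serial sketch of its scoring recurrence. It is a textbook illustration only, not the CPU-GPU models from the paper: the function name, the simple match/mismatch scores, and the linear gap penalty are all assumptions chosen for brevity (protein alignment would normally use a substitution matrix).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    Runs in O(len(a) * len(b)) time, the quadratic cost the abstract refers to.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical toy sequences.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows of the table are kept in memory because this sketch returns the score alone; recovering the alignment itself would require the full table and a traceback step.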
Citation
@inproceedings{Soundarajan_2019, doi = {10.1109/icpads47876.2019.00043}, url = {https://doi.org/10.1109%2Ficpads47876.2019.00043}, year = 2019, month = {dec}, publisher = {{IEEE}}, author = {Sanjay Soundarajan and Michelle Salomon and Jin H. Park}, title = {{CPU}-{GPU} Collaborated Computation Models for Biological Sequence Alignment with Mirror-Based Work Load Balancing}, booktitle = {2019 {IEEE} 25th International Conference on Parallel and Distributed Systems ({ICPADS})}}
Demystifying Transportation Using Big Data Analytics
Fletcher Trueblood, David Rodriguez, Jese Hernandez, Michelle Salomon, Sanjay Soundarajan and Matin Pirouz
2019 International Conference on Computational Science and Computational Intelligence (CSCI)
DOI: 10.1109/csci49370.2019.00240
Abstract
With the ever-growing generation and collection of data, there are ample opportunities to extract useful information from big data. The transportation industry, particularly taxi companies, is a significant contributor to this data age. This research analyzes a voluminous 2016 taxi dataset from the City of Chicago to find impactful transportation trends for determining city hotspots based on time and location. Customer satisfaction was used as a way of deciding which taxi companies need to look at improving their customer service. Linear regression models were used to estimate tips relative to the distance traveled and the time taken. The haversine distance was utilized to pair the latitude and longitude coordinates of drop-offs and their next pickup. To maximize drivers' earnings, information on tips was combined with an analysis of the average distance to a driver's next fare. Stakeholders, customers, and transportation authorities can use the results of this analysis to plan better commute patterns.
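As a brief illustration of the haversine distance mentioned in this abstract, the sketch below computes the great-circle distance between a drop-off and the next pickup. The function name, the Earth radius constant, and the sample coordinates are assumptions made for the example and are not taken from the paper or its dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # Hypothetical drop-off and next-pickup coordinates (latitude, longitude).
    dropoff = (41.8781, -87.6298)
    next_pickup = (41.9742, -87.9073)
    print(round(haversine_km(*dropoff, *next_pickup), 1), "km")
```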
Citation
@inproceedings{Trueblood_2019, doi = {10.1109/csci49370.2019.00240}, url = {https://doi.org/10.1109%2Fcsci49370.2019.00240}, year = 2019, month = {dec}, publisher = {{IEEE}}, author = {Fletcher Trueblood and David Rodriguez and Jese Hernandez and Michelle Salomon and Sanjay Soundarajan and Matin Pirouz}, title = {Demystifying Transportation Using Big Data Analytics}, booktitle = {2019 International Conference on Computational Science and Computational Intelligence ({CSCI})}}
PDMF: Parallel Dictionary Motif Finder on Multicore and GPU
Michelle Salomon, Sanjay Soundarajan and Jin H. Park
2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
DOI: 10.1109/hpcc/smartcity/dss.2019.00031
Abstract
Unknown motif finding in a set of DNA sequences is an important step in understanding the functionality of a group of genes, and it requires accuracy and efficiency. We propose and present high-performance computation models for accelerating an efficient combinatorial approach to motif finding, which uses tree-based bypassing and hash-based heuristics to skip unnecessary computations while keeping the maximum accuracy. The computation models are designed for multicore processors and GPUs and implemented with OpenMP and CUDA, respectively. To achieve the maximum efficiency, we also developed efficient heterogeneous computation models, in which multicore processor(s) and GPU collaborate. The collection of the resulting products is named PDMF and was tested on a couple of HPC systems for performance. Our experimental results showed that the multicore version (PDMFm) achieved average speedups of 4.63x and 8.87x over the serial version on systems with 4 cores and 16 cores, respectively. The GPU version (PDMFg) achieved average speedups of 41.48x and 9.95x over the serial version and PDMFm, respectively, on a system with a 4-core host CPU and a GPU. The best heterogeneous version showed a ~1.4x speedup over the baseline GPU version.
Citation
@inproceedings{Salomon_2019, doi = {10.1109/hpcc/smartcity/dss.2019.00031}, url = {https://doi.org/10.1109%2Fhpcc%2Fsmartcity%2Fdss.2019.00031}, year = 2019, month = {aug}, publisher = {{IEEE}}, author = {Michelle Salomon and Sanjay Soundarajan and Jin H. Park}, title = {{PDMF}: Parallel Dictionary Motif Finder on Multicore and {GPU}}, booktitle = {2019 {IEEE} 21st International Conference on High Performance Computing and Communications; {IEEE} 17th International Conference on Smart City; {IEEE} 5th International Conference on Data Science and Systems ({HPCC}/{SmartCity}/{DSS})}}
Efficient Branch and Bound Motif Finding with Maximum Accuracy based on Hashing
Sanjay Soundarajan, Michelle Salomon and Jin H. Park
2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC)
DOI: 10.1109/ccwc.2019.8666485
Abstract
The problem of finding unknown motifs in DNA sequences has been studied extensively over the last three decades, and a variety of solution approaches have appeared in the literature. We present an efficient combinatorial approach for finding the optimum motif(s) with the maximum accuracy based on a couple of hash-based heuristics, which greatly reduce the search space and the unnecessary computations of the traditional tree-based branch and bound mechanism. The proposed approach was tested for performance with real-world datasets of various sizes, and our experimental results showed that it achieved the maximum accuracy within a practical time, even better than some popular approximate and suffix tree-based approaches in most of the cases we tested.
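To give a sense of the search problem, the sketch below shows an exhaustive baseline with a simple early-exit bound. It assumes the common "median string" formulation (find the k-mer minimizing the total Hamming distance to its best match in every sequence), which may differ from the exact formulation in the paper, and it is not the hash-based method the paper proposes. All function names are hypothetical.

```python
from itertools import product


def hamming(s, t):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(s, t))


def distance_to_sequence(pattern, seq):
    """Smallest Hamming distance between pattern and any window of seq."""
    k = len(pattern)
    return min(hamming(pattern, seq[i:i + k]) for i in range(len(seq) - k + 1))


def median_string(sequences, k):
    """Return (pattern, total_distance) for the best k-mer over ACGT.

    Assumes every sequence has length >= k. Enumerates all 4**k candidates,
    so it is only practical for small k.
    """
    best_pattern, best_total = None, float("inf")
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        total = 0
        for seq in sequences:
            total += distance_to_sequence(pattern, seq)
            if total >= best_total:
                break  # bound: this candidate can no longer beat the best
        if total < best_total:
            best_pattern, best_total = pattern, total
    return best_pattern, best_total


if __name__ == "__main__":
    # Hypothetical toy input with a planted motif.
    print(median_string(["TTACGTAA", "GGACGTCC", "ACGTTTTT"], 4))
```

The 4**k growth of the candidate set is why pruning strategies such as branch and bound matter for longer motifs.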
Citation
@inproceedings{Soundarajan_2019, doi = {10.1109/ccwc.2019.8666485}, url = {https://doi.org/10.1109%2Fccwc.2019.8666485}, year = 2019, month = {jan}, publisher = {{IEEE}}, author = {Sanjay Soundarajan and Michelle Salomon and Jin H. Park}, title = {Efficient Branch and Bound Motif Finding with Maximum Accuracy based on Hashing}, booktitle = {2019 {IEEE} 9th Annual Computing and Communication Workshop and Conference ({CCWC})}}
A Gaze-Based Virtual Keyboard Using a Mouth Switch for Command Selection
Sanjay Soundarajan and Hubert Cecotti
2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
DOI: 10.1109/embc.2018.8512929
Abstract
Portable eye-trackers provide an efficient way to access a user's point of gaze on a computer screen. Thanks to eye-tracking, gaze-based virtual keyboards can be developed by taking into account constraints related to the gaze detection accuracy. In this paper, we propose a new gaze-based virtual keyboard where all the letters can be accessed directly through a single command. In addition, we propose a USB mouth switch that is directly connected through a computer mouse, with the mouth switch replacing the left click button. This approach is intended to tackle the Midas touch problem with eye-tracking for people who are severely disabled. The performance is evaluated on 10 participants by comparing the following three conditions: gaze detection with mouth switch, gaze detection with dwell time by considering the distance to the closest command, and gaze detection within the surface of the command box. Finally, a workload assessment using the NASA-TLX test was conducted on the different conditions. The results revealed that the proposed approach with the mouth switch provides better performance in terms of typing speed (36.6 ± 8.4 letters/minute) compared to the other conditions, and a high acceptance as an input device.
Citation
@inproceedings{Soundarajan_2018, doi = {10.1109/embc.2018.8512929}, url = {https://doi.org/10.1109%2Fembc.2018.8512929}, year = 2018, month = {jul}, publisher = {{IEEE}}, author = {S. Soundarajan and H. Cecotti}, title = {A Gaze-Based Virtual Keyboard Using a Mouth Switch for Command Selection}, booktitle = {2018 40th Annual International Conference of the {IEEE} Engineering in Medicine and Biology Society ({EMBC})}}