Research Spotlight:
Dr. Liliana Florea

Dr. Liliana Florea is an Associate Professor of Medicine in the McKusick-Nathans Institute of Genetic Medicine.
Her research focuses on developing algorithms, computational models, and software tools for analyzing sequencing data to characterize genes and their variations and help infer molecular mechanisms of diseases.
Q: What initially inspired your interest in computational biology and the intersection of sequencing technologies and molecular mechanisms?
LF: “Having started as a mathematician and theoretical computer scientist, I was fascinated by the application of computational methods to create advances in other fields, from biology to economics to natural languages. As a newly minted graduate student, I joined Webb Miller’s lab at Penn State, which was seminally engaged in designing the first algorithms for analyzing data from the sequencing of the human genome.”
Q: Your lab develops algorithms and software tools to analyze next-generation sequencing data. Could you describe your approach to designing computational methods, and how do you ensure they remain adaptable to new advancements in sequencing technologies?
LF: “We start with a good understanding of the problem and data. They determine the class of methods we employ, such as string comparison algorithms, graph-based or statistical and deep learning models, and the data structures best suited to represent the data. We add heuristics to address cases that the primary model does not capture. We write modular code that allows different algorithm components to be separately optimized and adapted to evolving data characteristics, such as read length, abundance and sequencing error rate, or to new and more efficient algorithms. Lastly, generational changes in sequencing technologies, such as from Sanger to short Illumina reads to long ONT reads, and from bulk to single cell, result in vastly different characteristics of the data that demand new approaches and classes of algorithms.”
Q: What specific challenges do you face in adapting your computational methods to complex biological data, such as large-scale sequencing datasets?
LF: “Scalability is one of our primary design considerations. We employ compact data structures, such as sequence indices and splice graph representations of genes, and develop efficient algorithms. Further, we develop methods to globally and simultaneously organize and analyze data across all samples for large multi-sample datasets, leveraging their similarities to generate compact models. Lastly, we use access to multi-core computing environments, particularly the multi-architecture Rockfish system, to distribute computation across multiple cores.”
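For readers unfamiliar with the splice graph representation mentioned above, the idea can be illustrated with a minimal sketch. This is not the lab's implementation: the function names, the exon coordinates, and the choice to model exons as nodes and observed splice junctions as edges are all assumptions made for illustration. The point is only that transcripts sharing exons are stored once in the graph instead of once per transcript.

```python
from collections import defaultdict


def build_splice_graph(transcripts):
    """Build a toy splice graph (hypothetical helper, for illustration only).

    transcripts: list of transcripts, each an ordered list of exons,
                 where an exon is a (start, end) tuple.
    Returns (nodes, edges): the set of distinct exons, and a mapping from
    each exon to the set of exons it is spliced to.
    """
    nodes = set()
    edges = defaultdict(set)
    for exons in transcripts:
        nodes.update(exons)
        # Consecutive exons in a transcript are joined by a splice junction.
        for upstream, downstream in zip(exons, exons[1:]):
            edges[upstream].add(downstream)
    return nodes, edges


def enumerate_paths(edges, source, sink):
    """List every exon path from source to sink (candidate isoforms)."""
    if source == sink:
        return [[source]]
    paths = []
    for nxt in sorted(edges.get(source, ())):
        for tail in enumerate_paths(edges, nxt, sink):
            paths.append([source] + tail)
    return paths


# Two made-up transcripts of one gene; the second skips the middle exon.
transcripts = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
nodes, edges = build_splice_graph(transcripts)
isoforms = enumerate_paths(edges, (100, 200), (500, 600))
```

In this toy example the two transcripts list five exons in total, but the graph stores only three exon nodes and three junctions, and both isoforms can still be recovered as paths through it.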