Research Spotlight:
Dr. Liliana Florea

Dr. Liliana Florea is an Associate Professor of Medicine in the McKusick-Nathans Institute of Genetic Medicine.

Her research focuses on developing algorithms, computational models, and software tools for analyzing sequencing data to characterize genes and their variations and help infer molecular mechanisms of diseases.

Q: What initially inspired your interest in computational biology and the intersection of sequencing technologies and molecular mechanisms?

LF: “Having started as a mathematician and theoretical computer scientist, I was fascinated by the application of computational methods to create advances in other fields, from biology to economics to natural languages. As a newly minted graduate student, I joined Webb Miller’s lab at Penn State, which was seminally engaged in designing the first algorithms for analyzing data from the sequencing of the human genome.”

Q: Your lab develops algorithms and software tools to analyze next-generation sequencing data. Could you describe your approach to designing computational methods, and how do you ensure they remain adaptable to new advancements in sequencing technologies?

LF: “We start with a good understanding of the problem and data. They determine the class of methods we employ, such as string comparison algorithms, graph-based or statistical and deep learning models, and the data structures best suited to represent the data. We add heuristics to address cases that the primary model does not capture. We write modular code that allows different algorithm components to be separately optimized and adapted to evolving data characteristics, such as read length, abundance and sequencing error rate, or to new and more efficient algorithms. Lastly, generational changes in sequencing technologies, such as from Sanger to short Illumina reads to long ONT reads, and from bulk to single cell, result in vastly different characteristics of the data that demand new approaches and classes of algorithms.”

Q: What specific challenges do you face in adapting your computational methods to complex biological data, such as large-scale sequencing datasets?

LF: “Scalability is one of our primary design considerations. We employ compact data structures, such as sequence indices and splice graph representations of genes, and develop efficient algorithms. Further, we develop methods to globally and simultaneously organize and analyze data across all samples for large multi-sample datasets, leveraging their similarities to generate compact models. Lastly, we use access to multi-core computing environments, particularly the multi-architecture Rockfish system, to distribute computation across multiple cores.”

Q: How have advancements in computational tools and technology directly impacted your research in bioinformatics?

LF: “New and more efficient algorithms and data structures, as well as advances in computer technologies, have collectively made it possible to process increasingly large datasets. We now routinely analyze datasets of hundreds of genomic data samples, or train complex deep-learning models of genes and other genomic features.”

Q: How has the field of computational genetics and bioinformatics evolved throughout your career, and how have these changes shaped your research approach?

LF: “Scalability is one factor. As a graduate student, I developed one of the first algorithms to align a gene segment (EST) to the genomic sequence containing its gene. A single RNA-seq sample now contains tens of millions of reads that are searched against an entire genome or group of genomes, and there are tens to thousands of samples in a dataset. Then, there are integration and new scientific horizons. Over the years, my research has crossed from aligning sequences, to integrating millions of alignments to produce models of genes and transcripts, to using multi-omics data to create deep learning models of genes and genomic features and interpret them into biological knowledge.”

Q: What are the next steps or future directions for your research, and what emerging trends in computational biology are you most excited about exploring?

LF: “We are excited about our new research building deep learning models of genes and splicing, including for transposable elements’ splicing-in, or exonization. Such models will further allow us to characterize the extent and contribution of these elements to population variation and diseases. Our group also continues to build tools for alternative splicing characterization from massive collections of RNA sequence data produced with different technologies and to identify genetic determinants of splicing in health and disease.”

Q: What is the impact or value of Rockfish in your research?

LF: “Having access to Rockfish’s comprehensive array of resources has been transformative for our research, whether we performed memory-intensive algorithm development on the large memory machines, used hundreds of standard queue cores to swiftly and efficiently process extensive collections of data, or took advantage of the GPU resources to train our deep learning models. Our research would not have been possible without these resources and the professional, responsive and timely support that the ARCH staff has always provided.”
Learn more about Florea Lab Research Group here