AI and Data Science
Gleaning the meaning from terabytes of data.
Technology to drive breakthrough science
A single high-throughput functional genomics, proteomics, sequencing, metabolomics or transcriptome profiling experiment can generate terabytes of raw data. The challenge for scientists is to convert these strings of numbers into biological insight. Applied Artificial Intelligence and Data Science aims to separate measurement noise from the observations that genuinely tell us something about the organism. In the study of the biology of aging, a successful bioinformatics analysis uses high-throughput data to make predictions about genes and molecules that mediate lifespan or other age-associated traits. These predictions then feed back to the laboratory, where they are tested in additional rounds of experiments. Professor David Furman and the AI and Data Science team offer a collaborative service to meet this goal, supporting low-level manipulation, normalization and statistical analysis, and high-level post-processing of multiple types of biological datasets, as well as data integration, transfer, and retrieval.
David Furman, PhD, Faculty Director
Henry Huang, Data Scientist
Kevin Perez, Data Scientist
The AI and Data Science Core is a resource for users who want to survey the genome-scale effect of a perturbation of interest on a laboratory organism, usually with respect to gene expression (RNA-seq), chromatin modifications (ChIP-seq, ATAC-seq, DamID), DNA sequence of the microbiome (16S sequencing) or the host, or protein abundance (proteomics). Before users generate data, they come to the core for advice on experimental design, including replicate structure and sequencing coverage. When appropriate, we code and run power simulations at this stage to justify the sample size and the expected outcomes of a given experiment. (Sample size justification of this type is now required in the proposals for all NIH grants.)
Once an experiment is complete and the data are in hand, we work with the user to establish the driving questions of the project and then set out a statistically appropriate analysis plan. This could include quality control measurements, reproducibility assessment, genome assembly and annotation, genomic variation calling, motif search and discovery, gene and transcript variant estimation, allelic expression level estimation, normalization between samples, differential abundance analysis, chromosomal structure and localization profiles, peak calling, functional genomic analysis, and network analysis. Depending on what the project needs, we write custom code, run existing packages, and prepare formatted results files and visualizations for the user.
Hardware used by the core currently includes various desktops and laptops for user interface, an 8-core Mac Pro and a 32-core Super Server (Sysorex, Inc.) for computation, and the Institute-wide fileserver for storage.
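As an illustration of the kind of power simulation mentioned above, the following is a minimal Python sketch, not the core's actual code. It assumes a single measured feature (for example, the log-scale expression of one gene) that is normally distributed with a known standard deviation, and it uses a two-sided z-test as a simplified stand-in for whatever test the real analysis plan would specify. The function name `simulate_power` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(n_per_group, effect, sigma=1.0, alpha=0.05,
                   n_sims=2000, seed=0):
    """Estimate power by Monte Carlo for a two-group comparison.

    Assumptions (illustrative only): normally distributed measurements,
    a known standard deviation `sigma`, equal group sizes, and a
    two-sided z-test on the difference in group means.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means.
    std_err = sigma * (2 / n_per_group) ** 0.5

    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / std_err
        if abs(z) > z_crit:
            rejections += 1
    # Power is the fraction of simulated experiments that reject the null.
    return rejections / n_sims


if __name__ == "__main__":
    # Compare candidate replicate numbers for a hypothetical effect size.
    for n in (3, 6, 12):
        print(f"n = {n:2d} per group: estimated power = "
              f"{simulate_power(n, effect=1.0):.2f}")
```

Running the sketch for several candidate replicate numbers shows how estimated power grows with sample size, which is the basis of a sample size justification. A real simulation would replace the normal model and z-test with the count distribution and the statistical test planned for the actual experiment.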
Technological innovation in the service of scientific advancement
Our cutting-edge technologies support the Institute’s goals and put the newest capabilities in the hands of our scientists.