Welcome to Fly Longitudinal NeuroDegenerative Atlas
Overview
Welcome to the Fly Longitudinal NeuroDegenerative Atlas (FLYNDA), a web application that provides interactive access to comprehensive transcriptomic and proteomic datasets from transgenic Drosophila models of neurodegenerative diseases.
FLYNDA integrates longitudinal data from fly models expressing human proteins linked to Alzheimer's disease (Tauopathy and Aβ42), Parkinson's disease (α-Synuclein), and Huntington's disease (N-terminal truncated and full-length Huntingtin). Our platform offers user-friendly, customizable visualization tools, enabling researchers to explore and compare gene expression levels across these disease models over time.
By identifying fly orthologs of human genes of interest, users can analyze expression patterns and gain insights into disease mechanisms, facilitating the discovery of potential therapeutic targets.
- Number of genes analyzed in Bulk RNA-seq: 17,459
- Number of proteins analyzed in Proteomics: 6,128
- Average number of genes detected per sample in scRNA-seq: 13,415
Quick Guide to Fly Longitudinal NeuroDegenerative Atlas
Search Results: (Please Select Row(s) from the Table below for detailed information and 'Data Visualization')
Detail Info of Selected Fly Genes: (Please Select Row(s) from the Table above)
This study uses longitudinal Drosophila models expressing human proteins (amyloid, tau, alpha-synuclein, huntingtin) to investigate various neurological diseases, including Alzheimer’s, Parkinson’s, and Huntington’s diseases.
Drosophila Models
UAS-Aβ42, UAS-αSyn, UAS-Tau (2N4R isoform), UAS-FL-HTT, and UAS-NT-HTT animals were crossed with the pan-neuronal expression driver elav-GAL4 to generate experimental animals with the following genotypes:
1) elav-Gal4/+;UAS-Aβ42/+, 2) elav-Gal4/+;UA-Tau2N4R/+, 3) elav-Gal4/+;UAS-αSyn/+, 4) elav-Gal4/+;UAS-FL-HTT/+, 5) elav-Gal4/+;UAS-NT-HTT/+
Elav-Gal4/+ animals crossed with a scrambled UAS line were used as controls in all analyses. A separate, “non-driver” w1118/+, control was also included in parallel. Expression of wild-type 2N4R human Tau and wild-type human α-synuclein (insect codon optimized) was achieved using Bloomington Drosophila Stock Center (BDSC) strains #51363 and #51375, respectively. Expression of amyloid-β42 fused to the Argos secretion peptide was achieved using the previously published line M17A on Chr.III. Adult flies were raised at 25°C and aged to 2, 5, 7, 10, 14, and 21 days post-eclosion. Due to the longer survival time of Tau and control flies, they were both additionally aged to days 28 and 42, whereas day 57 was achieved in controls only.
RNA-Sequencing Data Description
Three RNA-sequencing biological replicates per genotype per time point were generated. Each replicate consisted of 100 heads collected from flash frozen, age-matched virgin female flies. mRNA was extracted using TRIzol (#15596026, Invitrogen) followed by DNAse treatment. A minimum of 500 nanograms of total DNase-treated extract RNA were used per replicate. Samples were prepared using the Illumina TruSeq Stranded mRNA Library Prep Kit. Samples were then sequenced on an Illumina NovaSeq 6000 with 350 bp paired-end reads. Raw reads were aligned to Drosophila reference genome dm6 r6.06. Genes with an average read count < 50 across all samples were excluded. Preparation of sequencing libraries, RNA-sequencing, and alignment was performed by the New York Genome Center. Differential expression was calculated with DESeq2 v1.34.0, as in prior work, using genotype and age as linear regression covariates and using the LRT as significance test. Significance was set for a Benjamini-Hochberg adjusted p-value < 0.05. The DESeq2 counts function (normalized = TRUE) was implemented in order to generate normalized abundance for plotting transcript counts per timepoint.
Proteomics Data
Five biological replicates per genotype per time point were generated. Tandem mass tag mass-spectrometry proteomics was performed following previously published protocols (Johnson et. al., Nat. Neurosci. 2022). Homogenates were prepared from adult fly heads, consisting of 50 heads per biological replicate. Protein isoforms and samples with > 30% missingness were removed, remaining missing values were imputed using minimum value divided by 2, isoforms mapping the same protein were collapsed by sum, and global internal standard (GIS) batch normalization was performed (protein intensity / batch GIS intensity). The preprocessed dataset includes quantitation of 6,128 unique proteins; these normalized values were used for plotting protein counts per timepoint. For differential expression, regression models were constructed using protein expression as an outcome, genotype as a predictor, while including age as a covariate: expression ~ genotype + age. Significance was computed using the Likelihood-ratio test (LRT) compared against the base model: expression ~ age. A cutoff of Benjamini-Hochberg adjusted p-value < 0.05 was set for false discovery correction.
Single-cell RNA-seq Data
Three biological replicates per genotype were generated, and all animals were aged to 10 days post-eclosion (21 samples total). Each replicate consisted of 16-18 intact, dissected brains from female flies, and were enzymatically dissociated into a single cell suspension as described in Davie et al., 2018. Single cell libraries were prepared per manufacturer’s protocol for the Chromium Single Cell Gene Expression 3’ v3.1 kit (10x Genomics) by the BCM Single Cell Genomics Core. Completed libraries were sequenced by the Baylor Genomic and RNA Profiling Core using the Illumina NovaSeq 6000 platform with a minimum depth of 300,000,000 reads per sample. Illumina BCL files were demultiplexed into FASTQ files by calling the Cell Ranger 4.0.0 mkfastq function. FASTQ files were aligned to the Drosophila reference genome (BDGP6.22.98) and gene counts were quantified using the Cell Ranger 4.0.0 count pipeline. Given the 10x recovery rate estimations, the cell calling algorithm in Cell Ranger was applied by setting the --expect-cells parameter in count to 10,000 per library. Filtered count matrices were loaded into Seurat in R for additional quality control and downstream analyses. Cells were removed from the data object if the number of unique genes per cell were less than 200 or greater than 3,000, or if the proportion of mitochondrial reads per cell was greater than 20%. [One NT-HTT library was an outlier and was subsequently removed from our analyses.]
DoubletFinder was applied per library to predict and remove heterotypic doublets. For each library, artificial doublets were generated from the existing data. Principal component analysis (PCA) was performed after merging the real and artificial data and a distance matrix was generated with the first 40 PCs to compute the proportion of artificial K-nearest-neighbors (pANN) for each cell. PC neighborhood size (pK) for computing pANN was estimated for each library as previously described (McGinnis et al., 2019). The number of suspected doublets per library was estimated and cells were ranked by pANN for removal. Total doublet proportion Y for each library was computed based on a custom linear equation of the input-to-multiplet stimation provided by the 10x Chromium documentation: y = 7.589 × 10-6 x + 5.272 × 10-4, x being the number of recovered intact cells after the initial filtering criteria described above. The linear equation was generated based on recovery estimations in the manufacturer’s protocol. Adjustment of the estimated doublet proportion for undetectable homotypic doublets was applied in DoubletFinder by using the Seurat clustering classifications at resolution = 2 as described below.
Gene expression was first normalized independently per library using a regularized negative binomial regression approach as implemented by SCTransform (Hafemeister & Satija, 2019). 5,000 highly variable features (HVG) were used for normalization while accounting for percent mitochondrial reads. Variable features were defined and ranked by computing the variance of standardized gene counts after loess-based adjustment of mean-variance relationships (Stuart et al., 2019). Residuals of the fitted regression models were used as normalized gene expression values for HVGs. All libraries normalized via SCTransform were integrated using the reciprocal principal component analysis (rPCA) pipeline in Seruat to correct for batch effects and facilitate identification of similar cell identities across conditions. Highly ranked HVGs shared across all libraries were used as integration features. The 3 elav-Gal4 “driver controls” were merged to create an integration reference, and integration anchors across libraries (correspondences of the selected features between cells) were computed over the first 100 rPCA dimensions in the combined dataset and then used to inform the subsequent integration and grouping of cells. Normalization of gene counts used in differential expression analysis, cell cluster marker gene computation, cell identity annotation, and other applications directly comparing gene expression levels between cell clusters were computed separately on the non-integrated gene expression data using the NormalizeData function in Seurat. In brief, for each gene in each cell, unique molecular identifiers (UMI) were divided by the sum UMIs in that cell, multiplied by a scalar (10,000), and log transformed. However, cell cluster membership was defined using the integrated dataset.
Cell clustering of the integrated dataset was performed using Seurat. In brief, we first used the FindNeighbors function to construct a nearest-neighbor (SNN) graph of the integrated dataset. Next, we used the FindClusters function to identify clusters of cells using a shared nearest-neighbor modularity optimization based clustering algorithm. Seeing as different clustering parameters can yield variable numbers of clusters, we performed a grid search of 64 different combinations of clustering parameters. We varied the number of principal components between 80 and 150 (step size of 10) in creating the SNN graph and varied the clustering resolution from 1 to 8 (step size of 1).
The grid search yielded between 179 to 368 clusters. To evaluate the clustering, we assessed the clusters’ biological relevance by comparing to published bulk-transcriptomes from 52 purified Drosophila brain neuronal types. We determined the similarity of expression profiles by calculating the Pearson correlation coefficient for each purified cell-type and the average log-normalized non-integrated gene expression of each cluster. The most variable genes found during clustering were used for the correlation calculation. For each pair of clustering parameters, we counted how many of the purified neuronal cell types match to 1 to 5 clusters. Specifically, for the top 6 most highly-correlated clusters for each neuronal type, we tested whether the difference in correlation value between a cluster and the subsequent cluster was greater than 0.05, as has been described by a prior publication [Ozel et al., 2020]. We chose two sets of clustering parameters that both maximized the number of clusters matches and the number of clusters and annotated both independently. We first annotated the dataset after it was clustered with 150 principal components at resolution 1, which yielded 202 clusters and 63 annotated unique cell types. We also annotated the dataset after it was clustered with 120 principal components at resolution 2, which yielded 239 clusters and 67 annotated unique cell types. Given the greater number of annotations in the latter option, we proceeded with these clusters for downstream analysis. a'/b'-KC, adPN/IPN, LC6, and LPLC2 were the 4 additional cell types we were able to annotate.
To assign clusters to specific cell types, we utilized three methods of annotation. First, we compared the average log-normalized non-integrated gene expression of each cluster to published bulk transcriptomes of purified neuronal populations. We calculated the Pearson correlation coefficient using the top 10 cluster markers – the genes ranked among the top 10 most differentially expressed genes on the basis of |log fold change| for at least one cluster when compared to all other clusters [Ozel et al., 2020]. Second, we used four published single-cell datasets [TauRW dataset 2023, Ozel et al., 2020, Davie et al., 2018, Li et al., 2022] as references and performed cell-label transfer using Seurat’s FindTransferAnchors and TransferData functions. In brief, pairs of similar cells between the reference and query dataset were identified using a mutual nearest neighbor approach after projecting the new dataset onto the reference dataset in PCA-reduced space. The 5,000 most variable genes in the new dataset were used for the dimensional reduction. Each cell was assigned a score and a predicted label from the reference dataset. Lastly, we utilized a trained two-layer neural network classifier of adult optic lobe cell-types [Ozel et al., 2020]. Non-integrated log-normalized expression values of ~500 marker genes were used as the input to train the classifier. Seeing as the cell label transfer and neural network classifier predicted labels on per cell basis, we assigned whole cluster identity based on the most abundant population of cells in that cluster. To assign final identities to clusters, we examined the concordance between our various annotating methods; for a cluster to receive an annotation, at least two of the methods were required to predict the cluster as concordantly.
Given that our integrated dataset contained libraries from disease model flies, we considered gene expression values might be affected by non-wildtype fly models. Additionally, the published transcriptome references, both single-cell atlases and bulk-sequenced purified neuronal types, used for annotation were derived from non-disease flies. We thus performed the same grid search with the same clustering parameters described prior on cells only from control flies (elav Gal4 and w118). The grid search on control fly cells, of which there 74,248, revealed that purified neuronal cells correlated just as well the entire dataset compared to control cells only.
Differential expression between disease (AB42, aSyn, Tau, FL-HTT, or NT-HTT) and driver control for each defined cell cluster was calculated using MAST v1.20.0, with sample batch as a covariate. Significance was set for a Benjamini-Hochberg adjusted p-value < 0.05.
Acknowledgement Statement
These data were generated from Drosophila models/analyses by Drs. Joshua Shulman, Zhandong Liu, and Juan Botas from Baylor College of Medicine. RNA-sequencing were generated by Drs. Joshua Shulman and Juan Botas at the Baylor College of Medicine (U01AG046161), and proteomics/analysis performed by Drs. Joshua Shulman, Juan Botas, Nicholas Seyfried, and Duc Duong from Emory University (U01AG061357;R01AG057339). The grid search and annotation for Single-cell RNA-seq analysis was performed by Justin Dhindsa.