Transposable elements are driving rapid adaptation of Enterococcus faecium
Bacterial genomic datasets compiled for this studySeveral bacterial genomic datasets were compiled for this study, and different selections of these datasets were used, as appropriate, for the analyses described herein. We refer to these datasets as: ‘NCBI complete genomes’, ‘NCBI long-read assemblies’, ‘NCBI short-read E. faecium assemblies’, ‘SHC long-read Enterococcus isolates’ and ‘SHC long-read metagenomes’. We also define the ‘high-contiguity Enterococcus genomes’, which consist of Enterococcus genomes from (1) NCBI complete genomes, (2) NCBI long-read assemblies, and (3) SHC long-read Enterococcus isolates. Datasets were combined without deduplication.To aid readers in interpreting our manuscript and for transparency, Supplementary Table 1 contains (1) a detailed description of bacterial genomic datasets compiled for this study, (2) descriptions of any additional filtering for specific analyses, and (3) a description of how these datasets are used in each main figure and Extended Data figure.NCBI long-read assemblies: long-read dataset curation, preprocessing, assembly and quality controlCandidate isolate long-read sequencing datasets were identified on NCBI with the following search criteria on 29 February 2024: ((((Enterococcus[Organism]) OR (Staphylococcus[Organism] OR (Klebsiella [Organism]) OR (Acinetobacter [Organism]) OR (Pseudomonas [Organism]) OR (Enterobacter [Organism]) OR (Escherichia [Organism]) OR (Salmonella[Organism]) OR (Streptococcus[Organism])) AND ‘oxford nanopore’[Platform])). Datasets for Shigella and Bordetella were queried separately on 28 October 2024 in the same way. Metadata for these accessions was downloaded with the Sequence Read Archive (SRA) Run Selector interface. Datasets with less than 50 Mb of total sequencing or less than 2 kb average read length were not included for further analysis. Raw reads were downloaded using SRA tools prefetch and fasterq-dump, and assembled with flye56 (v.2.9) using the --nano-raw method. Genome taxonomy was verified with GTDB-Tk57 (v.2.4) with database release v.220.Assemblies were then filtered for appropriate contiguity (≤20 contigs), genome size (±2 s.d. per genus), coverage (≥40×) and with identified species-level taxonomy. We excluded the GTDB-Tk taxonomy result for the Shigella datasets because the GTDB-Tk database does not include Shigella genomes and thus cannot classify them beyond E. coli. Finally, we filtered the set of genomes to include only species with more than ten samples in this and the NCBI complete genomes (see ‘NCBI complete genomes’ below) combined. The final set of genomes with assembly and taxonomy information is available in Supplementary Table 2. We refer to this dataset in the manuscript as NCBI long-read assemblies. Relevant source code and data can be found in our repository (‘Code Availability’).NCBI complete genomesThe NCBI command line tool ncbi-datasets-cli (v.16.22.1)58 was used to download all available NCBI genomes with the ‘complete’ assembly level from the following taxa on 3 July 2024: Enterobacter, Staphylococcus, Klebsiella, Acinetobacter, Pseudomonas, Enterococcus, Escherichia, Salmonella, Streptococcus, Shigella and Bordetella. For example, the following command was used to download Enterococcus assemblies: datasets download genome taxon enterococcus --assembly-level complete --include genome,gff3 --dehydrated --filename enterococcus.zip. Note, this command collects a reference to the relevant genomic data. Data must be downloaded using the ‘rehydrate’ command. After rehydrating, duplicate assemblies were removed, preferring the RefSeq over GenBank as the source database. Assemblies were then filtered according to the same contiguity, size, coverage and sample count thresholds as above, and the final set is likewise available in Supplementary Table 2. We refer to this dataset in the manuscript as NCBI complete genomes. Relevant source code and data can be found in our repository (‘Code Availability’).SHC long-read Enterococcus isolates: collection and sequencingA total of 279 bloodstream infection isolates identified by biotyping with a MALDI-TOF based Bruker Biotyper as E. faecalis (n = 170) or E. faecium (n = 109) by the Stanford Health Care Clinical Microbiology Laboratory between the periods 1 October 2020 to 31 December 2021 and 1 January 2023 to 31 December 2023 were collected under a protocol approved by the Stanford University Institutional Review Board (Protocol no. 64591; Principal Investigator: A. Bhatt, Committee IRB-61). Informed consent was obtained from all human research participants. An additional eight isolates collected in 2024 identified as atypical Enterococcus as above were stored as glycerol stocks. Metadata, including date, site of collection and relevant patient information, were extracted. Each isolate was cultured aerobically for approximately 16 h with shaking at 37 °C in 3 ml Brain Heart Infusion (BHI) broth (Sigma-Aldrich). Then, a 1:100 subculture into 10 ml BHI broth was prepared. Subcultures were allowed to reach an OD of approximately 1.0. The subculture was centrifuged at 3,000g for 10 min at room temperature, resuspended in PBS, then centrifuged again. The PBS was poured off and the subsequent pellet was resuspended in 1 ml 1× Zymo DNA/RNA Shield (Zymo, catalogue no. R1100-250) and shipped to Plasmidsaurus for Oxford Nanopore Technologies (ONT) ‘Standard Bacterial Genome with DNA extraction’ sequencing. After inspection of the assemblies, one E. faecium isolate (EF_B_40) was mischaracterized as E. faecalis and one E. faecalis isolate (FM_B_40) was mischaracterized as E. faecium, on the basis of biotyping. Of the 170 E. faecalis isolates, 167 were confirmed as E. faecalis. Of the 109 E. faecium isolates, 107 were confirmed as E. faecium with standard quality control performed by Plasmidsaurus. Finally, assemblies were filtered according to the same contiguity and coverage thresholds as above, resulting in 166 E. faecalis and 100 E. faecium genomes, respectively. We refer to this dataset in the manuscript as SHC long-read Enterococcus isolates.IS countingISEScan59 (v1.7.2.3) was run with default parameters. Each resulting .csv file was used to count IS elements per sample, per contig and per IS family. Summary results from NCBI long-read assemblies and NCBI complete genomes are included in Supplementary Table 2, and summary results from SHC long-read Enterococcus isolates (E. faecium and atypical Enterococcus) are included in Supplementary Table 3.Assembly annotationGenomes were annotated with Bakta60 (v.1.9.1–v.1.10.3) using database v.5.1 with the following flags --skip-tmrna --skip-ncrna --skip-ncrna-region --skip-crispr --skip-sorf --skip-gap --keep-contig-headers. Output files, including .ffn, .faa and .gff3, were used for subsequent analyses.Determination of IS flanking regions and boundariesTo investigate the sequence features of high-copy IS elements in E. faecium we used custom scripts to identify the IS ends, yielding transposase flanking and boundary regions. From the analysis described in the preceding sections, we identified 17 IS transposases with at least one copy per genome on average. Of these, we exclude the three IS3 family sequences as they are annotated as two coding sequences (CDSs) (orfA/B). The resulting 14 sequences comprise 30,847 copies across the 593 E. faecium genomes surveyed. In brief, we extracted 500 bp of flanking sequence on either side of the transposase CDS. For a given transposase variant, as their CDS is identical, we expect the IS flanks to be highly conserved. However, as we are considering variants from many similar genomes, we must account for identity by descent where sequence context beyond the true IS boundary is repeated. Therefore, we first deduplicate the flanking sequence. Then, per transposase variant, a multiple sequence alignment was generated for the set of deduplicated flanks with MAFFT61 (v.7.526). After collapsing positions whose consensus is ‘GAP’ (representing low-frequency insertions), per base entropy was calculated as bits (log2-scale) with scores approaching 0 representing highly conserved sites. The flank is determined as the (longest) region of continuous low entropy (<0.5), exempting short (≤3 bp) high-entropy segments representing SNVs or short indels. Sequence information for these IS, including their consensus flank lengths, is displayed in Extended Data Fig. 4c. Relevant source code can be found in our repository (‘Code Availability’).Multiple sequence alignment of IS transposase ORFsTransposase protein sequences from the most common IS elements in E. faecium (Fig. 1c) were aligned using MAFFT (v.7.526) with default parameters. This excludes IS3 elements as they are not encoded as single ORFs. A maximum-likelihood phylogeny was inferred with IQ-TREE2 (ref. 62) (v.2.2.0) under the VT+F+G4 substitution model, selected according to the Bayesian information criterion, with 1,000 ultrafast bootstrap replicates. Branch lengths represent the expected number of amino acid substitutions per site. The resulting tree was visualized with the R package ggtree (v.3.10.1).Frequency and distribution of IS transposase variants in E. faecium
For the identification of unique IS transposase ORF variants and quantification of their frequencies in E. faecium, we analysed the 593 E. faecium genomes from NCBI complete genomes and SHC Enterococcus isolates. We excluded the NCBI long-read assemblies from this analysis because only 5 out of 174 datasets (Extended Data Fig. 1) were sequenced with the more recent R10.4.1 nanopore chemistry. Earlier-generation ONT data are known to exhibit higher rates of homopolymer errors63, which do not notably hamper hidden-Markov-model-based approaches such as ISEScan but would complicate accurate counting of exact copies of transposase nucleotide sequences.Genome annotations were performed using Bakta, and the resulting GFF files were used to extract all CDS features longer than 500 nucleotides and labelled as ‘transposase’. These were then consolidated into a non-redundant transposase database, enabling us to quantify the frequency of each unique transposase nucleotide sequence across the dataset. We identified the six most abundant IS transposase families—ISL3, IS30, IS256, IS6, IS3 and IS110—and visualized their distribution (Extended Data Fig. 4d). Relevant source code can be found in our repository (‘Code Availability’).geNomad prediction of plasmid contigsTo predict whether a contig is probably from a chromosome or plasmid, we used geNomad64 (v.1.8.0, v.1.6 DB) with the end-to-end workflow. Contigs predicted as chromosomal with a score of >0.6 were included in analyses as chromosomes. Contigs predicted as a plasmid with a score of >0.7 were included in analyses as plasmid. Contigs not reaching those thresholds were considered undetermined and excluded from analyses where plasmid predictions were necessary (Fig. 1c and Extended Data Fig. 3). Information about the number of contigs reaching these thresholds for each genome from the NCBI complete genomes and NCBI long-read assemblies are included in Supplementary Table 2 and are visualized in Extended Data Fig. 3i.IS counts among ESKAPEE lineagesTo control for the possibility that certain lineages drive species-level results in Fig. 1b, IS counts were measured among ESKAPEE taxa lineages. All pairwise genomic distances were calculated within each ESKAPEE taxon using Mash65 (v.2.3) with 100,000 (-s), length 21 k-mer (-k) sketches. Genomes were then grouped into clusters defined by <0.01 pairwise distance across all members (Extended Data Fig. 2).IS expansion in enterococciGenomes from the high-contiguity Enterococcus collection and the 32 reference genomes described by Lebreton and colleagues29 (2,134 in total) were classified phylogenetically with de_novo_wf from GTDB-Tk v.2.4.0, Release 220. Following Lebreton and colleagues29, the term g__Vagococcus was used to define the outgroup. A table summarizing the IS counts as defined by ISEScan for each genome, as well as the Newick and tree.dist files output by de_novo_wf were processed with custom scripts. In addition to four genomes being excluded by GTDB-Tk, genomes were assigned to a representative by the minimum phylogenetic distance. A total of 18 genomes were excluded for having a minimum distance >0.01, yielding 2,112 assigned genomes. IS counts per family were summarized for each set of genomes assigned to a given representative. In total, 785 genomes were assigned to the representatives EnGen0015, Aus0004 and EnGen0043, defined as E. lactis, clade A1 (healthcare-associated) and clade A2, respectively. To visualize counts per IS family across enterococci (Fig. 2b), the mean count of each IS family for the 32 sets of genomes was calculated and a file containing these values was uploaded to iToL66. For better visualization of Enterococcus, Lactococcus garvieae was excluded from the tree. Relevant source code and data can be found in our repository (‘Code Availability’).ISL3 counts across E. faecium lineagesTo investigate whether certain lineages of E. faecium were enriched for ISL3 copy, ISL3 copy was associated with a phylogenetic tree of all E. faecium genomes from the high-contiguity Enterococcus collection. In brief, a pangenome was generated with panaroo67 (v.1.5.2, --core_threshold 0.99). The resulting core genome alignment was input into iqtree2 (v.2.2.0) to build a phylogenetic tree using a GTR+G substitution model and 1,000 bootstraps. The maximum-likelihood tree was visualized in iToL. The tree was midpoint rooted between E. lactis and clades A1 and A2. For each of the 785 genomes, ISL3 copy number as measured previously by ISEScan was uploaded to iToL. Furthermore, we assigned STs with multi-locus sequence typing (MLST)68 (v.2.23.0; available at https://github.com/tseemann/mlst) to each genome using the scheme efaecium. Clade information was uploaded to iToL. In addition to E. lactis and clade A2 E. faecium, common STs (17, 18, 80, 117, 203 and 1,421) were also visualized.Correlation of antibiotic resistance and IS copy in E. faecium
To predict antibiotic resistance genes (ARGs), AMRFinder69 (v.4.0.23, DB 2025-07-16.1) was run on 785 E. faecium genomes from the high-contiguity Enterococcus collection with additional parameters -plus and -organism ‘Enterococcus faecium’. Hits were deduplicated by selecting the highest identity per ORF coordinate. Then, unique ARGs per genome were counted. ARGs were then related to IS counts per genome; 65 genomes predicted to be E. lactis as defined above in ‘IS expansion in enterococci’ were considered a single ‘Non-clade A1/A2’ category. The remaining 720 predicted clade A1/A2 genomes were converted into a categorical variable on the basis of IS copy, for each of the six most common IS families as in Extended Data Fig. 4d. The genomes were separated into quintiles, with the lowest quintile as Q1 (fewest IS) and the highest quintile as Q5 (most IS).Defining human-infectious NFF Enterococcus
NFF Enterococcus causing human infection were based on the Australian Enterococcus Surveillance Outcome Program30. During the 10 years from 2013 to 2022, 11,681 Enterococcal bacteremia isolates were collected and identified. Any Enterococcus species responsible for at least ten bacteremia isolates (average of one per year) in that collection is considered here as human-infectious. These species are E. faecalis (6,319), E. faecium (4,712), E. casseliflavus (198), E. gallinarum (166), E. avium (107), E. raffinosus (60), E. hirae (45), E. lactis (29) and E. durans (27).NCBI short-read E. faecium assemblies: curation of isolate datasets for ISL3 expansion timelineA list of accessions was obtained by querying the NCBI e-utilities application programming interface, resulting in a set of 21,376 Illumina-sequenced E. faecium genomic whole-genome sequencing data with available collection year. This set of accessions was filtered to paired-end data with at least around 20× estimated coverage and then downsampled to ≤500 samples per year, resulting in a final set of 6,747 samples. Next, we ran custom Nextflow70 pipelines to process the samples. In brief, raw reads were downloaded, deduplicated, trimmed, coverage-filtered/downsampled (20–150×) and assembled, and then FastANI71 (versus clade A1 [Aus0004], clade A2 [EnGen0015] and E. lactis [EnGen0043] references) and CheckM2 (ref. 72) were run on each assembly. Finally, assemblies were filtered: (1) FastANI closest reference = A1, min_dist ≤ 0.015 and ≥97% average nucleotide identity (ANI); (2) genome size within µ ± 2 s.d.; (3) CheckM2 ≥ 99% completeness and ≤5% contamination; (4) contig N50 ≥ 25 kb. This resulted in 5,422 final short-read assemblies; we refer to this dataset in the manuscript as NCBI short-read E. faecium assemblies. Relevant source code and data can be found in our repository (‘Code Availability’).NCBI short-read S. epidermidis assemblies: curation of isolate datasets for IS expansion timelineAs with E. faecium, a list of accessions was obtained querying the NCBI e-utilities application programming interface, resulting in the set of 5,938 Illumina-sequenced S. epidermidis genomic whole-genome sequencing data with available collection year. To enrich for hospital-adapted isolates and avoid the potential confounding influence of skin commensals, we excluded isolates with sampling metadata indicating superficial body sites (for example ‘skin’, ‘scalp’, ‘hand’, ‘ear’). We also required ≥99% ANI to the S. epidermidis RefSeq reference (GCF_006094375.1) and applied additional quality control thresholds analogous to those in our E. faecium analysis. This resulted in 1,604 final short-read assemblies after applying all filtering steps. Relevant source code and data can be found in our repository (‘Code Availability’).Coverage-based IS count estimates in short-read dataTo quantify IS abundance in these short-read assemblies, we first annotated IS present on individual contigs with ISEScan. Contigs shorter than (600 × number of predicted IS on the contig) were not counted to ensure contigs that are putatively only part of an IS are not counted. This limits the inclusion of noisy coverage data from very short contigs, as well as the possibility of double counting fragmented IS elements. This probably also leads to conservative estimates of IS count, which we prefer to the alternative. For each assembly, chromosomal coverage was approximated as the length-weighted median coverage of contigs exceeding 5 kb in length. We then calculated a relative coverage for each contig by dividing the contig coverage by the chromosomal coverage. The number of IS copies on each contig was multiplied by its estimated coverage, and these values were summed subsequently across all included contigs to obtain the final per family IS count per assembly.To ensure our IS counting heuristic from fragmented assemblies was appropriate, especially to avoid differential bias towards certain IS families, we benchmarked our approach against highly contiguous genome IS counts. We generated paired short- and long-read data for 117 samples (37 E. faecium; 80 E. faecalis) from our SHC collection and compared IS counts per family and per species (Fig. 3b and Extended Data Fig. 5). These results show that our IS counting heuristic is reasonably accurate across IS families and that this is not specific to E. faecium. The results also highlight the following caveats of the heuristic: (1) it tends towards being slightly conservative (due partially to ignoring short noisy contigs); (2) it is less precise, although not less accurate, for higher-copy IS families and (3) it does not perform as well for IS6, which tends to be plasmid-borne (Fig. 3b). This result is not surprising because of the greater difficulty in resolving repetitive/plasmid sequences in short-read data. As a result, we do not include analyses of IS6 elements from short-read-based assemblies.To control for the possibility changing IS abundance over time could be explained by recent isolates being disproportionately clinical in origin, whereas older isolates might derive from nonclinical sources, we evaluated several IS families. Healthcare-associated E. faecium strains were reported previously to be enriched for IS256, IS30 and IS3 family elements. Consequently, if this kind of sampling bias were driving our findings, we would expect all IS families associated with clinical strains to follow the same temporal pattern (that is, low in older and high in more recent samples).Relevant source code and data can be found in our repository (‘Code Availability’).Trendline of IS copy number over timeTo discern whether there is a general trend in changing IS copy number over time (raw data shown in Extended Data Fig. 5), while limiting short-term fluctuations, we computed the average genomic IS copy number per year over a 5-year sliding window (95% confidence interval, mean ± t × s.e.), and plotted a trendline using locally weighted scatterplot smoothing (LOWESS) (smoothing fraction 0.25). Relevant source code can be found in our repository (‘Code Availability’).PopPUNK clustering of E. faecalis and E. faecium genomes for PanGraph inputTo characterize structural variation among a set of nearly clonal genomes, we returned to the SHC long-read Enterococcus isolates as described above. With PopPUNK73 (v.2.6.7), we clustered 722 E. faecium genomes from the high-contiguity Enterococcus collection with the v.2 full E. faecium PopPUNK database downloaded from https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Enterococcus_faecium_v2_full.tar.bz2. These genomes are the 720 clade A genomes as in Fig. 2a, with FM003 and FM057 included. They were excluded previously for coverage lower than our 40× threshold (39× and 31×). We chose to include them here because they are both single-contig chromosomal assemblies. Clustering was done with the poppunk_assign command with the additional flag of run-qc. PopPUNK’s poppunk_visualize command was used to generate a Newick file, viz_core_NJ.nwk, describing the resulting phylogeny. The output tree was visualized with iToL. We assigned sequence types with MLST (v.2.23.0) to each genome using the scheme efaecium. The SHC hospital-endemic lineage corresponds to ST117 and formed a single cluster. The 66 genomes in the cluster, all from the SHC collection, were selected for further analysis. Similarly, the 1,083 E. faecalis genomes from NCBI complete and SHC long-read Enterococcus isolates were assigned to the v.2 full E. faecalis PopPUNK database downloaded from https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Enterococcus_faecalis_v2_full.tar.bz2. None were removed by quality control. We attempted to assign STs as above using the scheme efaecalis. Here the largest cluster corresponding to ST179 contained 57 genomes, 44 of which were SHC isolates. The 44 genomes from the SHC collection were selected for further analysis.Filtering genomes and stitching contigs for input into PanGraphFor each of the SHC genomes in those clusters, we extracted the contigs predicted to be from the chromosome by geNomad as described before. As PanGraph74 requires single-contig inputs, we filtered for genomes with five or fewer contigs, yielding 66 E. faecium and 44 E. faecalis genomes. Multi-contig genomes were stitched together using 2 kb ‘N’ spacers in the order output by Flye.PanGraph analysis of structural variation in the SHC E. faecium ST117 and E. faecalis ST179 clustersFor each of the two clusters defined above, we adapted the ‘Structural genome evolution’ workflow as defined by Molari and colleagues39 and found at https://github.com/mmolari/structural-evo.In brief, the genomes were processed with PanGraph (v.0.7.1) to build a pangenome graph, which defines all ‘core blocks’ (regions of shared homology across all genomes). Regions of structural variation are captured as edges between consecutive core blocks, and the set of sequences that can occur between two core blocks is defined as a ‘junction’. PanGraph was rerun on each junction (with its flanking core blocks), and the resulting ‘subgraph’ defines each unique arrangement of homologous blocks through the junction as a unique ‘junction path’. Thus, for a given junction, the set of unique junction paths neatly describes the extent of structural sequence variation occurring in the cluster for a given structurally variant locus.Next, the set of junction subgraphs was analysed with a custom Python workflow: first, we confirmed that the spacers (2 kb ‘N’ as defined above) always occurred fully in the accessory region of junctions. This was expected, as disruptions in assembly contiguity are probably due to structural variation within the sample population itself. We removed any spacer-containing path from downstream analysis as their true sequence length and genotype is unclear. Next, for each junction, the workflow enumerates the distinct paths within the subgraph and characterizes the accessory genome of each non-empty path, using the Bakta-derived annotations (described above) for each source genome (Supplementary Table 4). In ST117, the most common type of junction is a ‘binary’ IS insertion. In that case, the junction comprises two paths, in which one path is empty (that is, there is no accessory region between the core blocks), and the other path contains only an IS element transposase annotation. IS-only junctions containing more than two paths and/or several IS families also occurred. Fewer large and more complex junctions occurred as well, including large indels and phages, although we did not characterize these in detail. Relevant source code and data can be found in our repository (‘Code Availability’).Mobile genetic elements have been reported previously as important to E. faecalis virulence. The analysis of E. faecalis ST179 served as a control for rates of structural variation over the same period in a related organism with similar hospital epidemiology and taxonomic relatedness. The E. faecalis samples were collected from the same environment, location and time frame, and were processed and analysed in an identical manner (Supplementary Table 4).For compatibility with our high-performance computing cluster, the PanGraph workflow was adapted to work for Snakemake (v.8.18.2). The workflow was run per cluster with existing installations of ISEScan (v.1.7.2.3), DefenseFinder75 (v.1.3.0), geNomad (v.1.8.0) and IntegronFinder76 (v.2.0.5).Generation of circos plots of IS variants in PanGraph clustersThe isolate with earliest collection date, EF006 for the ST179 E. faecalis cluster and FM001 for the ST117 E. faecium cluster, was chosen as the reference coordinate system. The set of IS-only junctions was filtered to ‘backbone-only’ junctions. In other words, IS insertions occurring in junctions not shared by all genomes (for example, within other junctions or between core blocks that are inverted for some genomes) were not shown, as their position on the core backbone is ambiguous. Each junction was given a fixed width of 3 kb, as smaller sizes were not visible on the plot. Relevant source code and data can be found in our repository (‘Code Availability’).ISL3 gene disruption and intergenic distance analysis in ST117The E. faecium ST117 PanGraph analysis defined 100 ISL3 junctions (that is, IS-only junctions containing only ISL3 family transposase annotation(s)) (Fig. 4c). Inspection of these junctions revealed that the vast majority are indeed parsimoniously explained by insertions of one or more ISL3 elements in the region. To identify candidates for intra-genic disruptions, we considered junctions where an annotated gene in the ‘empty’ path (lacking the ISL3 insertion) overlapped fully or partially with the accessory region of a ‘filled’ path. A junction was counted as gene-disruptive only if it contained such an overlapping gene and if that gene’s annotation was also missing or truncated in the filled path (Fig. 4h). The remaining ISL3 junctions were presumed to be intergenic. Of these, we considered every filled (ISL3-containing) path for every junction and extracted the distance from the 3′ end of the transposase to the next nearest gene, as well as its relative orientation. Paths that contained several accessory ISL3 transposases were counted only if they were all oriented the same way, otherwise the orientation was considered ambiguous and the path was excluded from analysis (Fig. 4i). As a control, the same analyses were done on the 51 IS30, IS256, IS110, IS3 and IS6 junctions as a combined set (‘Other’ in Fig. 4h,i).
E. faecium domination longitudinal sample curationTo identify a set of metagenomic samples that were (1) clinically relevant, (2) longitudinal and (3) technically suitable for assembling complete, closed E. faecium genomes directly, we selected longitudinal cases from our laboratory’s biobank of stool samples from people undergoing HCT with high relative abundance of E. faecium. As previously described44,45, these stool samples were collected under protocols approved by the Stanford University Institutional Review Board (Protocol nos. 8903; principal investigator: D. Miklos, co-investigator: A. Bhatt and T. Andermann, Committee IRB-3 and 42053; principal investigator: A. Bhatt, Committee IRB-8). Informed consent was obtained from all human research participants. On the basis of previous metagenomic sequencing results, samples with at least 10% relative abundance of E. faecium were considered. Any patients with at least two such samples were then identified and those samples ‘dominated’ (defined as ≥10% relative abundance) by E. faecium were selected for long-read metagenomic sequencing, yielding 34 samples from 14 patients (range, two to five samples per patient).Metagenomic high molecular weight DNA extractionTo extract high molecular weight DNA from stool samples in preparation for metagenomic long-read sequencing, we used the QIAamp PowerFecal Pro DNA Kit (Qiagen, catalogue no. 51804). In brief, working in a biosafety hood, from each stool sample kept on dry ice, roughly 200 mg stool (two to three biopsy punches) were added to the bead beating tube. Bead beating tubes were added in sets of 12 to a Vortex-Genie 2 with the horizontal Vortex Adapter described in the extraction kit manufacturer’s instructions and vortexed at maximum speed for 12 min at room temperature. All subsequent steps were consistent with manufacturer instructions using centrifugation. The final DNA was eluted in 100 µl C6 buffer and incubated at room temperature for 15 min before centrifugation. DNA concentration was measured by NanoDrop. DNA was then cleaned up with the Genomic DNA Clean & Concentrator-10 kit (Zymo Research, catalogue no. D4011) according to manufacturer instructions for large genomic DNA. In brief, 200 µl of DNA binding buffer was added to 100 µl of DNA eluate from before. DNA was eluted in 25 µl nuclease-free water. After clean-up, by Nanodrop, DNA concentrations ranged from 7.5 ng µl−1 to 768.7 ng µl−1 per sample. Two samples had both DNA concentrations below 25 ng µl−1 and purity values (OD260/OD230) less than 1.5 and were not sequenced. DNA quality was confirmed with Genomic DNA ScreenTape Analysis (Agilent, catalogue nos. 5067-5365 and 5067-5366) on the Agilent 4150 TapeStation system. Samples had peak concentrations at DNA lengths greater than 20 kb. The loss of one sample, p11580_03 led to a singleton, but we continued with sequencing of p11580_06. Therefore 32 samples from 14 patients (range 1 to 4) were sequenced as described below.SHC long-read metagenomes: ONT sequencing, basecalling, quality control and assemblyTwo pooled libraries containing 16 samples each were prepared with the Oxford Nanopore Native Barcoding 96-kit (catalogue no. SQK-NBD114.96) according to manufacturer’s instructions. In brief, 400 ng of input DNA per sample was prepared, diluting the sample in nuclease-free water if necessary. DNA concentration was confirmed for each sample with the Qubit DNA High Sensitivity kit at each checkpoint. On the basis of previous use of this approach for metagenomic sequencing of stool samples in our laboratory, and on TapeStation results, we expected our fragment length to be between 7 kb and 10 kb and therefore loaded 50 fmol of each library. Libraries were loaded onto separate R10.4.1 PromethION flow cells (FLO-PRO114M) according to the manufacturer’s instructions and sequenced simultaneously on a PromethION 2 solo instrument. We anticipated assembling complete E. faecium genomes with at least 40× coverage. Therefore, we estimated that to reach 40× coverage of a 3-Mb E. faecium genome making up 10% relative abundance of the sample, we would need 1.2 Gb of sequencing per sample. To increase the likelihood of reaching this threshold for each sample and accounting for library imbalance, we let sequencing proceed until both libraries reached at least 2.5 Gb per sample of total sequencing. We obtained 60 Gb and 40 Gb of sequencing for each library, respectively.Raw.pod5 files were transferred to our computing cluster and basecalling was performed with Oxford Nanopore’s open-source Dorado basecaller (v.0.7.3) available at https://github.com/nanoporetech/dorado. The following command was used: dorado basecaller —kit-name SQK-NBD114-96 -b 1024 —trim ‘all’ sup@v5.0.0 > .bam. Output bam files were demultiplexed and emitted as.fastq files with: dorado basecaller —kit-name SQK-NBD114-96 -b 1024 —trim ‘demux’. Of 32 sample barcodes, 31 had at least 1 Gb of sequencing after demultiplexing. One barcode corresponding to p11569_01 had less than 10 Mb of sequencing and was not considered further. Unfortunately, this made p11569_02 a singleton. Ultimately, 31 samples from 14 patients (range 1 to 4) were included for metagenomic assembly (Supplementary Table 5).NanoPlot77 (v.1.41.6) was used to evaluate sequencing quality per sample with default parameters. All 31 samples had median read quality of at least Q22.1 (22.1–24.2). Two samples, p11463_06 and p11580_06 had 9,150 and 11,220 reads, respectively. Other samples had at least 56,643 reads (56,643–659,377). The minimum read N50 for these 29 samples was 7.77 kb. Metagenomic assembly was performed with Flye78 (v.2.9.2) on all 31 sequenced samples using the following command: flye --nano-hq —meta —genome-size 200 m —keep-haplotypes —no-alt-contigs (Supplementary Table 5). We refer to this dataset in the manuscript as the SHC long-read metagenomes.Taxonomic profiling of assembled contigs with sourmashTo identify E. faecium contigs from the metagenomic assemblies, all contigs 5 kb or longer were profiled taxonomically with sourmash79 (v.4.8.11) using the successive commands: sourmash sketch dna -p k = 51,abund –singleton .fa and sourmash scripts fastmultigather -k 51 -m DNA –threshold-bp 5000 .zip gtdb-rs214-reps.k51.zip. All contigs predicted as g_Enterococcus;s_faecium longer than 2.5 Mb, and with at least 80× coverage were considered ‘complete’ chromosomal contigs. One such contig was identified for 28 of 31 (p11463_06, p11569_02, p11580_06 did not) successfully sequenced samples as described above, which resulted in 12 longitudinal cases (Extended Data Fig. 8).Split k-mer analysis of E. faecium assembliesTo determine cases of ‘isogenic’ E. faecium chromosomes such that within strain heterogeneity and structural variation could be assessed clearly, the relatedness of the 28 longitudinal E. faecium chromosomal contigs was considered with SKA80 (v.1.0). All pairwise distances and clusters of highly related contigs were identified using the following commands: first, ska alleles -f inputs.txt and then, ska distance -f skfs.txt -o results -S -s 40, where inputs.txt contained paths to the 28 E. faecium chromosomal contig fasta files. Patients with longitudinal samples with fewer than ten SNPs per megabase or higher than 99.999% identity over time were considered for longitudinal within-patient analyses. This threshold excluded p11342.Within-sample and within-patient structural variant callingTo assess within-sample E. faecium population heterogeneity, reads from each sample were aligned to their own metagenome assembly using ngmlr81 (v.0.2.7) and default ONT sequencing presets. Structural variants were called with Sniffles2 (ref. 82) (v.2.6.0), which was run in both normal and mosaic modes to capture both dominant and subclonal variants. The outputs from these two modes were combined and any duplicate variants were removed, as well as any variants mapping to contigs other than the E. faecium chromosome and variants smaller than 100 nucleotides. The resulting per sample structural variant counts are shown in Extended Data Fig. 9f (Supplementary Table 6).To determine whether ISL3 variants persisted or became fixed across sequential isolates from the same patient, we performed longitudinal variant calling: for each patient, the E. faecium chromosome from the sample collected at the earliest time point was chosen as the reference. Reads from each time point were mapped back to the reference metagenome assembly and passed as input to Sniffles2 as above (Supplementary Table 6). To look for structural variants caused by IS element transpositions, all identified structural variants mapping to the E. faecium chromosome were screened to include only insertions or deletions in the 750–3,000 bp size range, as this corresponds approximately to the length of IS elements. For each candidate variant, the predicted sequence was extracted and analysed with ISEScan to confirm the presence of IS elements. The resulting IS variant calls were compiled, and patterns of persistence or fixation were examined by comparing the frequencies and presence or absence of these IS-containing variants across each patient’s time points.p11568 ISL3 structural variant neighbourhood coverageTo measure the changing population frequency of the two identified ISL3 structural variants in the four sequenced p11568 samples over time, ONT reads from each of p11568 t1–t4 were aligned against the t4 E. faecium contig with minimap2 (ref. 83) with the flags -x asm20, -c and --eqx. Per base read depth was then counted for each position with samtools84 (v.1.21) using the command: samtools depth -a -Q 10 -b t4.bed _sorted.bam > .coverage.txt. Poor quality alignments are removed by the -Q 10 flag. First, the mean alignment depth across the entire t4 E. faecium contig for each sample was calculated. Then, relative depth per base was calculated by dividing the depth at each position by the mean depth per sample. The t4 E. faecium contig was annotated with Bakta as described above and the neighbourhood of each ISL3 structural variant was visualized.
E. faecium isolation from stoolA single punch biopsy of stool samples from the first (t1) and last (t4) time points from p11568 were suspended in 300 µl PBS. Each sample was serially diluted tenfold into 900 µl PBS to 1:10,000. Then, 30 µl of the 1:10,000 dilutions were plated onto BHI agar and incubated overnight at 37 °C. The next day, six colonies from each plate were selected and identified by biotyping with a MALDI-TOF based Bruker Biotyper per manufacturer instructions. Each identified E. faecium colony was streaked to isolation on a BHI agar plate. To confirm the insertion genotypes, overnights of isolates were prepared in 3 ml BHI broth and incubated with shaking at 37 °C. Overnights were diluted 1:1,000 into 10-ml subcultures and grown to an optical density of approximately 1.0. Cell pellets were then prepared according to Plasmidsaurus ‘Standard Bacterial Genome with DNA extraction’ specifications. The resulting genomes were annotated with Bakta as described above and the lack of both ISL3 insertions in t1, and the presence of both ISL3 insertions in t4 were confirmed in the assemblies.RNA-seq of E. faecium isolatesFrom glycerol stocks of both p11568 t1 and t4 confirmed to have the insertional genotypes (Fig. 5), four overnight cultures were prepared in BHI as before. Each overnight culture was diluted 1:100 into 5 ml of BHI and subcultured for around 3 h until OD600 reached 0.4; 5-ml microtubes (AXYGEN, catalogue no. MCT-500-C) were prepared with 0.5 ml of quenching solution. The quenching solution contained 90% (v/v) 100% ethanol and 10% (v/v) saturated phenol for RNA extraction (pH 4–5). Microtubes were stored at −20 °C before quenching occurred. After OD600 reached 0.4, 4 ml of each culture was promptly quenched. The microtube was vortexed until the solution was well mixed. All microtubes were centrifuged at 10,000g for 5 min at 4 °C. The supernatant was poured off. The pellet was then lysed in 260 µl of a prepared mastermix containing 25:1 PBS:10 mg ml−1 lysozyme (Sigma-Aldrich, catalogue no. L3790-10X1ML). Each pellet was resuspended in lysis buffer and then incubated at 37 °C for 30 min while rocking on a BenchRocker. Then, 30 µl of 20% SDS was added followed by another round of incubation at 37 °C. Then, 1.5 ml of Trizol was added and mixed well by pipette. All subsequent steps were performed at room temperature. The solution was incubated for 10 min. Then, 0.5 ml of chloroform was added and ten vigorous complete inversions of the tube were made before 3 min of incubation. After incubation, each tube was centrifuged at 12,000g for 10 min. Carefully avoiding any non-aqueous layer, approximately 800 µl of the aqueous phase was transferred to a clean 2-ml Eppendorf tube. An equal volume of 100% ethanol was added and mixed by pipette until emulsions were no longer visible. RNA was then purified using the Zymo RNA Clean & Concentrator-25 kit (catalogue no. R1018) according to manufacturer instructions. RNA was eluted in 25 µl nuclease-free water. RNA concentration was measured by NanoDrop and quality was confirmed by Bioanalyzer with the RNA 6000 Pico Kit (Agilent; catalogue no. 5067-1513). RNA was then stored at −80 °C and sequenced with Novogene’s Prokaryotic Total RNA sequencing service (NovaSeq PE150, 3G raw data per sample).Analyses of RNA-seq dataRaw RNA-seq reads were trimmed with Trim Galore, available at https://github.com/FelixKrueger/TrimGalore/, using the command trim_galore --quality 30 --length 60 --paired 1.fq.gz _2.fq.gz --retain_unpaired. Bowtie2 (ref. 85) (v.2.5.4) was used to align processed reads against the p11568 t4 isolate reference genome with default parameters. On the basis of a poor alignment fraction (20.99%) compared with the other replicates (minimum: 97.63%), the fourth replicate from the t1 isolate (t1_4) was excluded. To get the per base depth of reads, samtools depth was used as before except no quality threshold was implemented. This meant reads mapping equally well to several locations were distributed randomly. Furthermore, a feature count matrix filtering alignments against the t4 isolate reference genome with mapping quality < 10 was generated with bedtools (v.2.27.1) multicov using the command bedtools multicov -p -q 10 -bams _sorted.bam _sorted.bam. The resulting feature count matrix, along with the Bakta annotation of the t4 isolate genome, were used to perform differential expression analysis with the DESeq2 (ref. 86) package (v.1.42.1) in R. Only features with at least one count in both t1 (n = 3) and t4 (n = 4) were considered for differential expression. The adjusted P value and log2(fold change) for each of the 2,668 resulting features were visualized as a volcano plot. Thresholds of 1 × 10−25 and |log2(fold change)| ≥ 2 were used to define differentially expressed genes.
E. faecium growth and antibiotic dose–response assaysE. faecium isolates were grown as described above. Antibiotic susceptibility testing was performed as described previously46, with some modifications. In brief, each strain was grown overnight (approximately 18 h) to stationary phase in BHI medium at 37 °C. Overnight cultures were then diluted to OD600 = 0.001 in Mueller-Hinton broth (Sigma-Aldrich) and incubated with twofold serial dilutions of TMP (Sigma-Aldrich) and SMX (Fisher) at the indicated concentrations. A ratio of 1:20 (TMP:SMX) was maintained for all treatments. Folic acid (Fisher) and folinic acid (Sigma-Aldrich) were supplemented at 1 µg ml−1 where indicated. Cultures were incubated at 37 °C with continuous shaking in technical duplicate (150 μl volume per well) in a flat bottom 96-well microtiter plate (Corning) and sealed with a Breathe-Easy membrane (Diversified Biotech). Absorbance (600 nm) was monitored every 10 min for 24 h using an Agilent BioTek Synergy H1 microplate reader. Growth was calculated as the percent area under the curve (AUC) relative to a control with no antibiotics added using the R package gcplyr87.Data visualizationInitial plots were generated by custom Python (Plotly) and R (ggplot) code. Plots were organized and refined with AffinityDesigner v.2.5.7.ScriptingCustom bash, Python, R and Nextflow scripts referenced in the methods section are available in the GitHub repository (‘Code Availability’). All scripts were run in R v.4.3.2. Python scripts were run using v.3.10 with polars v.1.20.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Comments (0)