The file MMETSP_zenodo_3247846_uniclust90_2018_08_seed_valid_taxids.tar.gz is a taxonomic database (seqTaxDb) composed of two resources:

1. uniclust90 seed proteins.
   Downloaded from: http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/
   Reference:
   Mirdita et al. 2016. Uniclust databases of clustered and deeply annotated protein sequences and alignments. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw1081

2. MMETSP proteins.
   Downloaded from: https://zenodo.org/record/3247846#.XhRw0nJKiUl in January 2020.
   References:
   Johnson et al. 2019. Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. https://academic.oup.com/gigascience/article/8/4/giy158/5241890
   Keeling et al. 2014. The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): Illuminating the Functional Diversity of Eukaryotic Life in the Oceans through Transcriptome Sequencing. http://dx.doi.org/10.1371/journal.pbio.1001889

Each MMETSP sample was mapped to its NCBI taxid using the BioSample data for BioProject PRJNA231566, downloaded from NCBI in XML format (11.Dec.2019). The uniclust90 database already contains NCBI taxids.

The NCBI hierarchical taxonomy was downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz (14.01.2020), and any sequences from the uniclust90 database with a deleted taxid were removed.

The two resources were then processed and merged to create a seqTaxDb in MMseqs2/MetaEuk format (https://github.com/soedinglab/MMseqs2/wiki#terminology) with a total of 88,022,300 entries.
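The removal of sequences with deleted taxids can be illustrated with a minimal Python sketch. It is not the script used to build this database, whose exact procedure is not documented here. The sketch assumes the deleted taxids are taken from delnodes.dmp, the file in the NCBI taxdump archive that lists deleted nodes one per line as "taxid<TAB>|". The function names, the sequence identifiers, and the deleted taxids 111 and 222 are hypothetical toy values.

```python
import io


def read_deleted_taxids(handle):
    """Collect taxids from a delnodes.dmp-style stream (one 'taxid<TAB>|' per line)."""
    deleted = set()
    for line in handle:
        # Keep only the text before the first '|' and strip tabs/newlines.
        field = line.split("|", 1)[0].strip()
        if field:
            deleted.add(int(field))
    return deleted


def drop_deleted(seq_to_taxid, deleted):
    """Return a copy of the mapping without entries whose taxid was deleted."""
    return {seq: taxid for seq, taxid in seq_to_taxid.items() if taxid not in deleted}


# Toy stand-in for delnodes.dmp (assumption: real data would be read from the
# file extracted from taxdump.tar.gz).
delnodes = io.StringIO("111\t|\n222\t|\n")

# Hypothetical sequence-to-taxid mapping; seqB carries a deleted taxid.
mapping = {"seqA": 9606, "seqB": 111, "seqC": 2759}

kept = drop_deleted(mapping, read_deleted_taxids(delnodes))
print(kept)
```

With real data, the StringIO object would be replaced by an open handle on delnodes.dmp, and the mapping would come from the taxid annotations of the uniclust90 sequences.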