- **Swissprot** - Protein sequence data in XML.
Details about each dataset can be found in the README file in the
corresponding subdirectory.
## Statistics
The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.
Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
--- | --- | --- | --- | --- | ---
Swissprot | 556196 | 862 | 101 | 48286 | 1
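
As an illustration of how the columns above can be derived, here is a minimal Python sketch that computes them for a list of trees in bracket notation. It is not the code in the `statistics` subdirectory. The function names `parse_labels` and `summarize` are hypothetical, and the sketch assumes one tree per string, of the form `{a{b}{c}}`, with literal braces inside labels escaped by a backslash.

```python
from statistics import mean


def parse_labels(tree: str) -> list[str]:
    """Return the node labels of one bracket-notation tree, e.g. '{a{b}{c}}'.

    Assumption: a backslash escapes the following character inside a label.
    """
    labels: list[str] = []
    current: list[str] = []
    inside = False  # True while we are reading a label
    i = 0
    while i < len(tree):
        ch = tree[i]
        if ch == "\\" and i + 1 < len(tree):
            # Escaped character: keep it as part of the label.
            current.append(tree[i + 1])
            i += 2
            continue
        if ch == "{":
            # A new node starts; the label read so far belongs to its parent.
            if inside:
                labels.append("".join(current))
            current = []
            inside = True
        elif ch == "}":
            # A node closes; if it had no children, its label ends here.
            if inside:
                labels.append("".join(current))
            current = []
            inside = False
        else:
            current.append(ch)
        i += 1
    return labels


def summarize(trees: list[str]) -> dict[str, float]:
    """Compute the table columns for a non-empty list of trees."""
    per_tree = [parse_labels(t) for t in trees]
    sizes = [len(labels) for labels in per_tree]
    distinct = {label for labels in per_tree for label in labels}
    return {
        "number_of_trees": len(trees),
        "avg_tree_size": mean(sizes),
        "min_tree_size": min(sizes),
        "max_tree_size": max(sizes),
        "distinct_labels": len(distinct),
    }


if __name__ == "__main__":
    print(summarize(["{a{b}{c}}", "{a}", "{x{y{z}}{a}}"]))
```

Tree size here means the number of nodes, so each label (including an empty one) counts once.
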
Each dataset and its corresponding scripts reside in a separate directory named after the dataset.
Each dataset has a script `` that downloads the
raw data, converts it to bracket notation, and sorts the trees by size.
The `utilities` directory holds tools shared by multiple datasets.
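
The final step of the pipeline described above, sorting by tree size, can be sketched as follows. This is a simplified illustration, not the repository's implementation. The names `tree_size` and `sort_by_size` are hypothetical, and the sketch assumes one tree per string in bracket notation (for example `{a{b}{c}}`), with a backslash escaping literal braces inside labels.

```python
def tree_size(tree: str) -> int:
    """Count the nodes of a bracket-notation tree.

    Every unescaped '{' opens exactly one node. Assumption: a backslash
    escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in tree:
        if escaped:
            escaped = False  # this character was escaped; skip it
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(trees: list[str]) -> list[str]:
    """Return the trees ordered by ascending node count.

    sorted() is stable, so trees of equal size keep their input order.
    """
    return sorted(trees, key=tree_size)


if __name__ == "__main__":
    for tree in sort_by_size(["{a{b}{c}}", "{a}", "{x{y}}"]):
        print(tree_size(tree), tree)
```

This version keeps all trees in memory, which is only practical for small inputs; larger datasets would need an external sort.
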
## RAM requirements and runtime estimates
For each dataset, we list the required RAM and the estimated
runtime where relevant. The RAM requirements go up to **60GB** for the Swissprot dataset.
The script `utilities/` has a hard-coded 10GB buffer size,
but it also runs on machines with less RAM.
## Expected output
Each output dataset must satisfy the following requirements:
