README.md +15 −2

@@ -15,11 +15,11 @@ Currently we support the following datasest:
 - **Swissprot** - Protein sequence data in XML.

 The details about each dataset can be found in the README files in the
-datasets subdirectories.
+corresponding subdirectories.

 ## Statistics

-The `statistics` subdirectory contains scripts to summirize the data. The table
+The `statistics` subdirectory contains scripts to summarize the data. The table
 below shows the statistics of our datasets.

 Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1
 Each dataset and its corresponding scripts belong to a separate directory with
 a name identifying the dataset. There is a script `download_prepare.sh` for
 each dataset, that downloads the raw data, converts to bracket notation, and
 sorts by tree size.

 The `utilities` directory holds tools common for multiple datasets.

+## RAM requirements and runtime estimates
+
+For each dataset, if necessary, we list the required RAM memory and estimated
+runtime. The RAM requirements go up to **60GB** for the Swissprot dataset.
+
+The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
+However, it executes on machines with less RAM too.
+
 ## Expected output

 Each output dataset must satisfy the following requirements:
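
As a rough illustration of the columns in the statistics table (number of trees, average, minimum, and maximum tree size, number of distinct labels), the sketch below computes them for a small list of trees. It is not the repository's `statistics` script. It assumes that bracket notation means `{label{child}{child}...}`, one tree per line, with no braces inside labels, and that tree size is the number of nodes. The names `DatasetStats`, `tree_size_and_labels`, and `dataset_stats` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    num_trees: int
    avg_size: float
    min_size: int
    max_size: int
    distinct_labels: int


def tree_size_and_labels(line: str) -> tuple[int, set[str]]:
    """Return (node count, set of labels) for one tree.

    Assumption: the tree looks like {label{child}{child}...} and labels
    never contain '{' or '}'.
    """
    size = 0
    labels: set[str] = set()
    i = 0
    while i < len(line):
        if line[i] == "{":
            size += 1  # every opening brace starts exactly one node
            j = i + 1
            # the label runs up to the next brace of either kind
            while j < len(line) and line[j] not in "{}":
                j += 1
            labels.add(line[i + 1:j])
            i = j
        else:
            i += 1  # closing brace: nothing to count
    return size, labels


def dataset_stats(lines) -> DatasetStats:
    """Aggregate per-tree sizes and labels over an iterable of lines."""
    sizes: list[int] = []
    all_labels: set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size, labels = tree_size_and_labels(line)
        sizes.append(size)
        all_labels |= labels
    if not sizes:
        raise ValueError("no trees found")
    return DatasetStats(
        num_trees=len(sizes),
        avg_size=sum(sizes) / len(sizes),
        min_size=min(sizes),
        max_size=max(sizes),
        distinct_labels=len(all_labels),
    )
```

Calling `dataset_stats(open(path))` on a prepared dataset file would stream it line by line, so only the label set and the list of sizes are held in memory. A file whose labels contain escaped braces would need a more careful parser than this one.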