Commit 8c94e595 authored by Mateusz Pawlik

Documenting.

parent 0c22efd2
@@ -15,11 +15,11 @@ Currently we support the following datasest:
- **Swissprot** - Protein sequence data in XML.
The details about each dataset can be found in the README files in the
-datasets subdirectories.
+corresponding subdirectories.
## Statistics
-The `statistics` subdirectory contains scripts to summirize the data. The table
+The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.
Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1
Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.
Each dataset has a script `download_prepare.sh` that downloads the
raw data, converts it to bracket notation, and sorts the trees by size.
The `utilities` directory holds tools common for multiple datasets.
## RAM requirements and runtime estimates
For each dataset, we list the required RAM and the estimated runtime
where necessary. The RAM requirements go up to **60GB** for the Swissprot dataset.
The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
However, it also runs on machines with less RAM.
## Expected output
Each output dataset must satisfy the following requirements:
......
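The README text in this diff describes sorting trees by size and reporting per-dataset statistics. A minimal sketch of both steps follows. It is not the repository's implementation. It assumes one tree per input string, bracket notation of the form `{a{b}{c}}`, and that a backslash escapes a brace inside a label. The function names `tree_size`, `sort_by_size`, and `summarize` are hypothetical.

```python
from typing import Dict, List


def tree_size(bracket_tree: str) -> int:
    """Count nodes in a bracket-notation tree.

    Assumption: every unescaped '{' opens exactly one node, and a
    backslash escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in bracket_tree:
        if escaped:
            escaped = False  # skip the escaped character
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(trees: List[str]) -> List[str]:
    """Return the trees ordered by ascending node count (stable sort)."""
    return sorted(trees, key=tree_size)


def summarize(trees: List[str]) -> Dict[str, float]:
    """Compute the size columns of the statistics table for a non-empty list."""
    sizes = [tree_size(t) for t in trees]
    return {
        "trees": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }


if __name__ == "__main__":
    sample = ["{a{b}{c}}", "{x}", "{a{b}}"]
    print(sort_by_size(sample))
    print(summarize(sample))
```

This in-memory version only suits small inputs. For datasets at the scale mentioned in the README, an external sort such as the one in `utilities/sort_dataset.sh` is the more likely fit.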