Commit 8c94e595 authored by Mateusz Pawlik

Documenting.

parent 0c22efd2
@@ -15,11 +15,11 @@ Currently we support the following datasest:
 - **Swissprot** - Protein sequence data in XML.
 The details about each dataset can be found in the README files in the
-datasets subdirectories.
+corresponding subdirectories.
 ## Statistics
-The `statistics` subdirectory contains scripts to summirize the data. The table
+The `statistics` subdirectory contains scripts to summarize the data. The table
 below shows the statistics of our datasets.
 Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1
 Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.
+There is a script `download_prepare.sh` for each dataset, that downloads the
+raw data, converts to bracket notation, and sorts by tree size.
+The `utilities` directory holds tools common for multiple datasets.
+## RAM requirements and runtime estimates
+For each dataset, if necessary, we list the required RAM memory and estimated
+runtime. The RAM requirements go up to **60GB** for the Swissprot dataset.
+The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
+However, it executes on machines with less RAM too.
 ## Expected output
 Each output dataset must satisfy the following requirements:
......
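The added README text says each `download_prepare.sh` ends by sorting the trees by size. As a rough illustration of that last step only, here is a minimal Python sketch. It is not the repository's `utilities/sort_dataset.sh`, and it rests on several assumptions not stated in the diff: one tree per line, bracket notation of the form `{a{b}{c}}`, one node per unescaped `{`, backslash as the escape character inside labels, and ascending order. The names `tree_size` and `sort_by_tree_size` are hypothetical.

```python
def tree_size(bracket_tree: str) -> int:
    """Count nodes, assuming one node per unescaped opening brace."""
    size = 0
    escaped = False
    for ch in bracket_tree:
        if escaped:
            escaped = False   # this character belongs to a label
        elif ch == "\\":
            escaped = True    # assumed escape convention: skip the next character
        elif ch == "{":
            size += 1
    return size


def sort_by_tree_size(lines):
    """Return the non-empty lines ordered by ascending tree size (stable sort)."""
    trees = [line.strip() for line in lines if line.strip()]
    return sorted(trees, key=tree_size)


if __name__ == "__main__":
    sample = ["{a{b}{c}}", "{x}", "{p{q}}"]
    for tree in sort_by_tree_size(sample):
        print(tree_size(tree), tree)
```

This version holds every tree in memory, so it only suits small samples. It does not reproduce the bounded 10GB buffer that the README attributes to the real script.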