Documenting. (8c94e595) · Commits · Mateusz Pawlik / ted-datasets

README.md

+15 −2

		@@ -15,11 +15,11 @@ Currently we support the following datasets:
		- Swissprot - Protein sequence data in XML.

		The details about each dataset can be found in the README files in the
		-datasets subdirectories.
		+corresponding subdirectories.

		## Statistics

		-The `statistics` subdirectory contains scripts to summirize the data. The table
		+The `statistics` subdirectory contains scripts to summarize the data. The table
		below shows the statistics of our datasets.

		Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
		@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1

		Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.

		+Each dataset has a script `download_prepare.sh` that downloads the raw data,
		+converts it to bracket notation, and sorts the trees by size.
		+
		+The `utilities` directory holds tools shared by multiple datasets.
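
The sorting step that `download_prepare.sh` performs can be illustrated with a minimal sketch. It is not the repository's implementation: it assumes one tree per line in bracket notation (for example `{a{b}{c}}`), that tree size means the number of nodes, and that braces inside labels are escaped with a backslash. The names `tree_size` and `sort_by_size` are hypothetical.

```python
import re

# Assumption: every node starts with an unescaped "{"; a "\{" inside a
# label is treated as part of the label and does not start a node.
_OPEN_BRACE = re.compile(r"(?<!\\)\{")


def tree_size(tree: str) -> int:
    """Return the number of nodes, counted as unescaped opening braces."""
    return len(_OPEN_BRACE.findall(tree))


def sort_by_size(trees):
    """Return the trees ordered from smallest to largest."""
    return sorted(trees, key=tree_size)


trees = ["{a{b}{c}}", "{x}", "{p{q}}"]
print(sort_by_size(trees))  # ['{x}', '{p{q}}', '{a{b}{c}}']
```

Because `sorted` is stable, trees of equal size keep their original relative order. This sketch holds the whole list in memory, so it may not be suitable for the larger datasets.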

		+## RAM requirements and runtime estimates
		+
		+For each dataset, where necessary, we list the required RAM and the estimated
		+runtime. The RAM requirements go up to 60 GB for the Swissprot dataset.
		+
		+The script `utilities/sort_dataset.sh` has a hard-coded 10 GB buffer size.
		+However, it also runs on machines with less RAM.
		+
		## Expected output

		Each output dataset must satisfy the following requirements: