# Datasets for tree edit distance experiments

This repository contains all resources needed to acquire datasets for experimenting with tree edit distance algorithms. **We do not store the datasets**, only the scripts to obtain and prepare them.

## Datasets description

Currently we support the following datasets:

- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the PennTreeBank format.
- **Swissprot** - Protein sequence data in XML.

Details about each dataset can be found in the README file in the corresponding subdirectory.

## Statistics

The `statistics` subdirectory contains scripts to summarize the data. The table below shows the statistics of our datasets.

Dataset   | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano   | 299             | 166            | 2             | 2105          | 592
DBLP      | 3934134         | 25             | 8             | 2986          | 14664605
Python    | 150000          | 946            | 1             | 46481         | 3523697
Sentiment | 9645            | 37             | 3             | 103           | 19470
Swissprot | 556196          | 862            | 101           | 48286         | 11439467

## Repository organisation

Each dataset and its corresponding scripts live in a separate directory named after the dataset. Each dataset has a script `download_prepare.sh` that downloads the raw data, converts it to bracket notation, and sorts the trees by size.

The `utilities` directory holds tools shared by multiple datasets.

## RAM requirements and runtime estimates

For each dataset, where necessary, we list the required RAM and the estimated runtime. The RAM requirements go up to **60GB** for the Swissprot dataset.

The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size. It nevertheless runs on machines with less RAM.

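
Because the larger datasets are memory-hungry, size statistics like those in the table above can be gathered in a single streaming pass that never holds more than one tree in memory. The following is only a minimal sketch, not the code in the `statistics` subdirectory. It assumes a prepared file with one tree per line, curly-brace bracket notation such as `{a{b}{c}}`, and backslash-escaped braces inside labels; the names `tree_size`, `size_stats`, and `SizeStats` are hypothetical.

```python
from typing import Iterable, NamedTuple


class SizeStats(NamedTuple):
    trees: int
    avg_size: float
    min_size: int
    max_size: int


def tree_size(line: str) -> int:
    """Count nodes, assuming every node opens with an unescaped '{'."""
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            # Previous character was a backslash: skip this one.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def size_stats(lines: Iterable[str]) -> SizeStats:
    """Compute tree count, average, min and max size in one pass."""
    count = 0
    total = 0
    smallest = None
    largest = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        size = tree_size(line)
        count += 1
        total += size
        smallest = size if smallest is None else min(smallest, size)
        largest = size if largest is None else max(largest, size)
    if count == 0:
        raise ValueError("no trees found")
    return SizeStats(count, total / count, smallest, largest)
```

Passing an open file object (for example `size_stats(open(path))` for some prepared dataset file) iterates line by line, so memory use stays constant. Counting distinct labels is deliberately left out, since it requires keeping every label seen so far in memory.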

## Expected output

Each output dataset must satisfy the following requirements:

- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.

## TODO

- [ ] Explicitly break ties in sorting: first size, then bracket notation strings lexicographically. Example: `sort -k1,1nr -k2 data`
- [ ] Publish snapshots of raw data.
- [ ] Add where the raw data should be fetched from: original source, our snapshot.
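
The tie-breaking order proposed in the first TODO item (size first, then the bracket notation string lexicographically) could look roughly like the sketch below. It is an in-memory illustration only, not a replacement for `utilities/sort_dataset.sh`. It assumes curly-brace bracket notation such as `{a{b}{c}}` with backslash-escaped braces in labels; the names `tree_size` and `sort_trees` and the `descending` flag are hypothetical, and since the required sort direction is not stated here, both directions are offered.

```python
from typing import Iterable, List


def tree_size(tree: str) -> int:
    """Count nodes, assuming every node opens with an unescaped '{'."""
    size = 0
    escaped = False
    for ch in tree:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_trees(trees: Iterable[str], descending: bool = False) -> List[str]:
    """Sort by tree size; ties are ordered by the bracket string itself."""
    # First pass: lexicographic order (by Unicode code point).
    ordered = sorted(trees)
    # Second pass: Python's sort is stable, also with reverse=True,
    # so trees of equal size keep their lexicographic order.
    ordered.sort(key=tree_size, reverse=descending)
    return ordered
```

With `descending=True` the result is largest trees first with ties in ascending string order, which mirrors the intent of the `sort -k1,1nr -k2` example. Python compares strings by code point, so the tie order may differ from a locale-aware `sort`.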