This repository contains all resources to acquire datasets for experimenting
on tree edit distance algorithms.
**We do not store the datasets**, only the scripts to obtain and prepare them.
## Datasets description
Currently we support the following datasets:
- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the Penn Treebank format.
- **Swissprot** - Protein sequence data in XML.
Details about each dataset can be found in the README file in the
corresponding dataset subdirectory.
The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.
Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano | 299 | 166 | 2 | 2105 | 592
DBLP | 3934134 | 25 | 8 | 2986 | 14664605
Python | 150000 | 946 | 1 | 46481 | 3523697
Sentiment | 9645 | 37 | 3 | 103 | 19470
Swissprot | 556196 | 862 | 101 | 48286 | 11439467
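
As a rough illustration of how the columns above could be computed, here is a minimal sketch. It is not the code from the `statistics` subdirectory. It assumes trees are written in curly-brace bracket notation such as `{a{b}{c}}` and that labels contain no unescaped braces. The names `tree_labels` and `summarize` are hypothetical.

```python
import re

# Assumption: a node is written as "{label...}" and labels contain no braces.
LABEL_RE = re.compile(r"\{([^{}]*)")


def tree_labels(tree):
    """Return the labels of all nodes of one tree, in document order."""
    return LABEL_RE.findall(tree)


def summarize(trees):
    """Compute the table columns for an iterable of bracket-notation trees."""
    sizes = []
    labels = set()
    for tree in trees:
        node_labels = tree_labels(tree)
        sizes.append(len(node_labels))  # tree size = number of nodes
        labels.update(node_labels)
    if not sizes:
        raise ValueError("cannot summarize an empty dataset")
    return {
        "trees": len(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "distinct_labels": len(labels),
    }
```

For a dataset file, `summarize(line.strip() for line in open(path))` would process one tree per line without loading the whole file at once, although the set of distinct labels still has to fit in memory.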
## Repository organisation
Each dataset and its corresponding scripts live in a separate directory named after the dataset.
## Expected output
Each output dataset must satisfy the following requirements:
- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.
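
The requirements above can be sketched as a small helper that validates and orders trees before they are written out. This is only an illustration under stated assumptions: curly-brace bracket notation such as `{a{b}{c}}`, labels without unescaped braces, and ascending order by size (the sort direction is not specified above). The names `tree_size`, `is_single_tree`, and `to_output` are hypothetical.

```python
def tree_size(tree):
    """Number of nodes, assuming every node opens with exactly one '{'."""
    return tree.count("{")


def is_single_tree(line):
    """Check that the line holds exactly one well-bracketed tree."""
    if not line.startswith("{"):
        return False
    depth = 0
    for i, ch in enumerate(line):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # The root must close on the very last character.
                return i == len(line) - 1
    return False  # the root bracket was never closed


def to_output(trees):
    """Return the dataset text: one tree per line, sorted by size."""
    trees = list(trees)
    for tree in trees:
        if not is_single_tree(tree):
            raise ValueError(f"not a single bracket-notation tree: {tree!r}")
    # Assumption: ascending order; sorted() is stable, so ties keep input order.
    ordered = sorted(trees, key=tree_size)
    return "".join(tree + "\n" for tree in ordered)
```

Writing the result with `open(path, "w").write(to_output(trees))` would then produce a single text file in the expected layout.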