Mateusz Pawlik committed
# Datasets for tree edit distance experiments

This repository contains the resources needed to acquire datasets for
experiments with tree edit distance algorithms.

**We do not store the datasets**, only the scripts to obtain and prepare them.

## Datasets description

We currently support the following datasets:
- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the PennTreeBank format.
- **Swissprot** - Protein sequence data in XML.

Details about each dataset can be found in the README file in that
dataset's subdirectory.

## Repository organisation

Each dataset and its corresponding scripts live in a separate directory named after the dataset.

## Expected output

Each output dataset must satisfy the following requirements:
- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.
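
The requirements above can be checked with a short script. The following is a minimal sketch, not part of this repository, and it rests on several assumptions: trees use curly-brace bracket notation such as `{a{b}{c}}`, labels contain no braces, tree size means the number of nodes, and "sorted" means non-decreasing size. The names `tree_size` and `check_dataset` are hypothetical.

```python
def tree_size(line: str) -> int:
    """Return the node count of one bracket-notation tree.

    Assumes one node per '{' and labels without braces (both assumptions).
    Raises ValueError if the line is not a single balanced tree.
    """
    depth = 0
    size = 0
    for ch in line:
        if ch == "{":
            if depth == 0 and size > 0:
                raise ValueError("more than one root on the line")
            depth += 1
            size += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets")
        elif depth == 0:
            raise ValueError("text outside the root node")
    if depth != 0 or size == 0:
        raise ValueError("unbalanced brackets or empty line")
    return size


def check_dataset(lines) -> bool:
    """Return True if sizes are non-decreasing; raise on a malformed tree."""
    sizes = [tree_size(line.rstrip("\n")) for line in lines]
    return all(a <= b for a, b in zip(sizes, sizes[1:]))
```

Because `check_dataset` accepts any iterable of lines, an open file object can be passed to it directly.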