Mateusz Pawlik's avatar
Mateusz Pawlik committed
# Datasets for tree edit distance experiments

This repository contains the resources needed to acquire datasets for
experiments with tree edit distance algorithms.

**We do not store the datasets**, only the scripts to obtain and prepare them.

## Datasets description

Currently, we support the following datasets:
- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the Penn Treebank format.
- **Swissprot** - Protein sequence data in XML.

Details about each dataset can be found in the README file in that
dataset's subdirectory.

## Statistics

Dataset   | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|-------------------------- 
Bolzano   | 299             | 166            | 2             | 2105          | 592
DBLP      | 3934134         | 25             | 8             | 2986          | 14664605
Python    | 150000          | 946            | 1             | 46481         | 3523697
Sentiment | 9645            | 37             | 3             | 103           | 19470
Swissprot | 556196          | 862            | 101           | 48286         | 11439467
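
The size columns above could be recomputed from a prepared dataset file with a sketch like the following. It is not the script used to produce the table. It assumes that bracket notation wraps every node in curly braces (for example `{a{b}{c}}`), that tree size means the number of nodes, and that a backslash escapes a brace inside a label. The names `node_count` and `size_statistics` are hypothetical, and the number of distinct labels is not covered.

```python
from statistics import mean


def node_count(tree: str) -> int:
    """Count nodes, assuming each node opens with one unescaped '{'."""
    count = 0
    skip_next = False
    for ch in tree:
        if skip_next:
            # Character after a backslash is part of a label (assumption).
            skip_next = False
        elif ch == "\\":
            skip_next = True
        elif ch == "{":
            count += 1
    return count


def size_statistics(trees):
    """Return tree count and average, minimum, and maximum tree size."""
    sizes = [node_count(t) for t in trees if t.strip()]
    return {
        "trees": len(sizes),
        "avg": mean(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }
```

Since the expected output is one tree per line, an open file object can be passed directly as `trees`.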

## Repository organisation

Each dataset and its scripts live in a separate directory named after the dataset.

## Expected output

Each output dataset must satisfy the following requirements:
- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.
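
The requirements above can be checked with a small script. The following is a minimal sketch, not part of the repository's tooling. It assumes that bracket notation encloses every node in curly braces (for example `{a{b}{c}}`), that a backslash escapes a brace inside a label, that size means the number of nodes, and that the sort order is ascending. The names `tree_size` and `is_sorted_by_size` are hypothetical.

```python
def tree_size(line: str) -> int:
    """Return the node count of one bracket-notation tree.

    Raises ValueError if the line is not a single well-formed tree.
    """
    depth = 0
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            # Character after a backslash belongs to a label (assumption).
            escaped = False
            continue
        if depth == 0 and (size > 0 or ch != "{"):
            # Text outside the root, or a second root on the same line.
            raise ValueError("line must contain exactly one tree")
        if ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
            size += 1
        elif ch == "}":
            depth -= 1
    if depth != 0 or size == 0:
        raise ValueError("unbalanced or empty tree")
    return size


def is_sorted_by_size(lines) -> bool:
    """Check that trees appear in non-decreasing order of size."""
    sizes = [tree_size(line.rstrip("\n")) for line in lines]
    return all(a <= b for a, b in zip(sizes, sizes[1:]))
```

Passing an open dataset file to `is_sorted_by_size` validates both the notation and the ordering in one pass. If the datasets are meant to be sorted in descending order instead, the comparison needs to be flipped.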