# Datasets for tree edit distance experiments

This repository contains the resources needed to acquire datasets for
experimenting with tree edit distance algorithms.

**We do not store the datasets**, only the scripts to obtain and prepare them.

## Datasets description

Currently we support the following datasets:
- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the PennTreeBank format.
- **Swissprot** - Protein sequence data in XML.

Details about each dataset can be found in the README file in that dataset's
subdirectory.

## Statistics

The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.

Dataset   | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|-------------------------- 
Bolzano   | 299             | 166            | 2             | 2105          | 592
DBLP      | 3934134         | 25             | 8             | 2986          | 14664605
Python    | 150000          | 946            | 1             | 46481         | 3523697
Sentiment | 9645            | 37             | 3             | 103           | 19470
Swissprot | 556196          | 862            | 101           | 48286         | 11439467
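
As a rough illustration of how the columns above could be computed from a prepared dataset file, here is a minimal sketch. It is not the code in the `statistics` subdirectory. The function name `summarize` is hypothetical, and the sketch assumes curly-brace bracket notation (for example `{a{b}{c}}`) with labels that contain no braces.

```python
import re

# Assumption: each node is written as "{label ...children...}" and labels
# never contain "{" or "}". Every "{" therefore opens exactly one node,
# and the text up to the next brace is that node's label.
NODE = re.compile(r"\{([^{}]*)")


def summarize(lines):
    """Compute per-dataset statistics from an iterable of tree lines."""
    sizes = []
    distinct_labels = set()
    for line in lines:
        found = NODE.findall(line)
        if not found:
            continue  # skip blank lines
        sizes.append(len(found))
        distinct_labels.update(found)
    return {
        "trees": len(sizes),
        "avg_size": sum(sizes) / len(sizes) if sizes else 0,
        "min_size": min(sizes, default=0),
        "max_size": max(sizes, default=0),
        "distinct_labels": len(distinct_labels),
    }
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `summarize(open("dataset.bracket"))`, where the file name is a placeholder.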

## Repository organisation

Each dataset and its corresponding scripts reside in a separate directory named after the dataset.

## Expected output

Each output dataset must satisfy the following requirements:
- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.
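
The requirements above can be checked mechanically. The following is a minimal sketch, not part of this repository: the names `tree_size` and `check_output` are hypothetical, bracket notation is assumed to use curly braces (for example `{a{b}{c}}`) with a backslash escaping a literal brace inside a label, and "sorted by size" is assumed to mean non-decreasing node count.

```python
def tree_size(line):
    """Return the number of nodes of one tree in bracket notation.

    Raises ValueError if the line is not exactly one well-formed tree.
    """
    depth = 0
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            escaped = False  # assumed convention: "\{" is part of a label
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            if depth == 0 and size > 0:
                raise ValueError("more than one root on a line")
            depth += 1
            size += 1  # every opening brace starts one node
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing brace")
        elif depth == 0:
            raise ValueError("text outside the root node")
    if depth != 0 or size == 0:
        raise ValueError("unbalanced or empty tree")
    return size


def check_output(lines):
    """Validate an iterable of lines and return the list of tree sizes."""
    sizes = [tree_size(line.rstrip("\n")) for line in lines]
    if any(a > b for a, b in zip(sizes, sizes[1:])):
        raise ValueError("trees are not sorted by size")
    return sizes
```

Passing an open file handle to `check_output` validates the file line by line. A blank line is reported as an empty tree, in keeping with the one-tree-per-line requirement.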