# DBLP

## Short description

Computer science bibliography. The dataset can be downloaded as a single large XML file.

## Conversion details

- Each tag becomes a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their keys.
- Attribute-nodes come before the content-node.
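
The rules above can be illustrated with a minimal sketch. This is not the code of the conversion script: it uses the standard-library `xml.etree.ElementTree` instead of lxml to stay self-contained, the function name `to_bracket` and the sample record are hypothetical, and it assumes that nested tags follow the content-node and that labels contain no braces.

```python
import xml.etree.ElementTree as ET


def to_bracket(element):
    """Return a bracket-notation string for one XML element (illustrative only)."""
    parts = ["{", element.tag]
    # Attribute-nodes come first, ordered by key; the value is a child of the key.
    for key in sorted(element.attrib):
        parts.append("{" + key + "{" + element.attrib[key] + "}}")
    # The content-node comes after the attribute-nodes.
    text = (element.text or "").strip()
    if text:
        parts.append("{" + text + "}")
    # Nested tags become child nodes (assumed here to follow the content-node).
    for child in element:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


# Hypothetical record, made up for this example.
record = ET.fromstring(
    '<article mdate="2020-01-01" key="journals/x/Doe20">'
    "<author>Jane Doe</author><title>A Title</title></article>"
)
print(to_bracket(record))
# {article{key{journals/x/Doe20}}{mdate{2020-01-01}}{author{Jane Doe}}{title{A Title}}}
```

Note that `key` sorts before `mdate`, so the line starts with `{article{key{`, which is the same shape as the `{www{key{` pattern used further below to filter homepage entries.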

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**For repeatability, the scripts download a specific version of the data (hardcoded in ``download.sh`` and ``dblp-to-bracket.py``).**

Execute to download all necessary files.
```bash
./download.sh
```

Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on an i5 laptop.**
```bash
python dblp_to_bracket.py
```

**Execute to extract only a random subset of lines (100000 in this example).**
```bash
python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
```
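
How `random_lines.py` picks its lines is not described here. As one possible approach, the sketch below draws `k` lines uniformly at random in a single pass using reservoir sampling; the function name `sample_lines` and the `seed` parameter are hypothetical and not taken from the script.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable of lines."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir


# Example on in-memory data; a file object can be passed the same way.
subset = sample_lines((f"{{tree{{{i}}}}}\n" for i in range(1000)), 10, seed=42)
print(len(subset))
```

Reservoir sampling keeps only `k` lines in memory, which matters for an input as large as the full dataset. The order of the returned lines is not the input order.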

Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh dblp.bracket
```
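
For intuition about what sorting by tree size means: in bracket notation every node opens with one `{`, so the node count of a tree equals the number of opening braces on its line. The sketch below is an assumption about the idea, not the contents of `sort_dataset.sh`. The helper names are hypothetical, ascending order is assumed, and labels are assumed to contain no literal braces.

```python
def tree_size(bracket_line):
    """Count the nodes of a tree in bracket notation (one '{' per node)."""
    return bracket_line.count("{")


def sort_by_tree_size(lines):
    """Return the lines ordered from smallest to largest tree (stable sort)."""
    return sorted(lines, key=tree_size)


trees = ["{a{b{c}}{d}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# ['{a}', '{a{b}}', '{a{b{c}}{d}}']
```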

Execute to remove the homepage entries.
```bash
sed '/{www{key{/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```
or
```bash
awk '!/{www{key{/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```

**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
```bash
./tidy-up.sh
```

## Troubleshooting

- Encoding
  In case of an encoding error, follow the steps on this webpage (in German): [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).