Commit b90f74e1 authored by Mateusz Pawlik

dblp: First commit of entire workflow.

parent c43603ce
# DBLP
## Short description
A computer science bibliography. It can be downloaded as a single large XML file.
## Conversion details
- Each tag becomes a node labeled with the tag's name.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag-node, with the *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their key values.
- Attribute-nodes come before the content-node.
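The rules above can be illustrated on a tiny record. The following is a minimal sketch only: it uses the standard-library `xml.etree.ElementTree` instead of the lxml SAX handler used in this workflow, the function name `to_bracket` and the sample record are made up for illustration, and it ignores text that follows a nested tag (tail text).

```python
import xml.etree.ElementTree as ET


def to_bracket(element):
    """Convert one XML element into bracket notation (illustrative sketch)."""
    # Each tag becomes a node labeled with the tag's name.
    out = "{" + element.tag
    # Attribute-nodes come first, ordered by key: {key{value}}.
    for key, value in sorted(element.attrib.items()):
        out += "{" + key + "{" + value + "}}"
    # The content-node comes after the attribute-nodes.
    if element.text and element.text.strip():
        out += "{" + element.text.strip() + "}"
    # Nested tags become child nodes.
    for child in element:
        out += to_bracket(child)
    return out + "}"


# Hypothetical record, not taken from the DBLP data.
sample = (
    '<article mdate="2017-01-01" key="a/1">'
    "<author>A. Author</author><title>T</title>"
    "</article>"
)
print(to_bracket(ET.fromstring(sample)))
# {article{key{a/1}}{mdate{2017-01-01}}{author{A. Author}}{title{T}}}
```

Note that `key` is emitted before `mdate` even though it appears second in the input, because attribute-nodes are ordered by key.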
## Dependencies
- **lxml** (XML processor for Python)
http://lxml.de/installation.html
- **Python3**
https://www.python.org/downloads/
- **wget**
- **gzip**
## Steps
**For repeatability, the workflow downloads a specific version of the data (hardcoded in ``download.sh`` and ``dblp_to_bracket.py``).**
Execute to download all necessary files.
```bash
./download.sh
```
Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on an i5 laptop.**
```bash
python dblp_to_bracket.py
```
Sample the dataset. **We perform a join on a subset only.**
```bash
python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
```
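``random_lines.py`` reads the whole input file into memory before sampling. If that becomes a problem for a large ``dblp.bracket``, one alternative is reservoir sampling, which keeps only the sampled lines in memory. The sketch below is not part of this workflow; the function name `reservoir_sample` and its parameters are hypothetical, and unlike `random.sample` it returns fewer than `k` lines (instead of raising an error) when the input is shorter than `k`.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k lines chosen uniformly at random from an iterable (Algorithm R)."""
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Replace a reservoir entry with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = line
    return sample


# Usage on an in-memory stand-in for the lines of a file.
fake_lines = ("{tree%d}\n" % i for i in range(1000))
picked = reservoir_sample(fake_lines, 10, random.Random(0))
print(len(picked))
```

With a real file, the iterable would simply be the open file object, so lines are streamed one at a time.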
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh dblp_random_100k.bracket
```
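``sort_dataset.sh`` measures tree size as the number of opening brackets in a line, since every node contributes exactly one ``{``. A rough Python equivalent is sketched below; the names `tree_size` and `sort_by_tree_size` are made up for illustration, and trees of equal size keep their input order here, which may differ from how `sort -n` breaks ties.

```python
def tree_size(line):
    """Number of nodes in a bracket-notation tree: one '{' per node."""
    return line.count("{")


def sort_by_tree_size(lines):
    """Return the lines ordered from the smallest tree to the largest."""
    return sorted(lines, key=tree_size)


# Hypothetical trees with 3, 1 and 2 nodes.
trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# ['{a}', '{a{b}}', '{a{b}{c}}']
```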
**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
```bash
./tidy-up.sh
```
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler
# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class DBLPContentHandler(ContentHandler):
    def __init__(self):
        self.bn = ""
    # Open tag.
    def startElementNS(self, name, qname, attributes):
        uri, localname = name
        self.bn += "{" + localname
        d = dict(attributes)
        # Sort the attributes by their keys.
        for key, value in sorted(d.items(), key = lambda element : element[0][1]):
            self.bn += "{" + key[1] + "{" + value + "}}"
    # Close tag.
    def endElementNS(self, name, qname):
        self.bn += "}"
    # Tag content.
    def characters(self, data):
        self.bn += "{" + data + "}"
print("--- Loading DBLP dataset.")
dblp_parser = etree.XMLParser(load_dtd=True, remove_blank_text=True)
dblp_data_tree = etree.parse('dblp-2017-11-01.xml', dblp_parser)
root = dblp_data_tree.getroot()
# Output files.
dblp_bracket = open('dblp.bracket', 'w')
print("--- Processing each child of DBLP's root.")
tree_id = 0
for child in root:
    tree_id += 1
    # Printing simple progress.
    if tree_id % 100000 == 0:
        print("- Tree %s" % (tree_id))
    handler = DBLPContentHandler()
    lxml.sax.saxify(child, handler)
    dblp_bracket.write(handler.bn + "\n")
print("--- Closing output files.")
dblp_bracket.close()
#!/bin/bash
# Download the XML file.
wget -v http://dblp.dagstuhl.de/xml/release/dblp-2017-11-01.xml.gz
# Download the checksum.
wget -v http://dblp.dagstuhl.de/xml/release/dblp-2017-11-01.xml.gz.md5
# Verify the checksum.
md5sum -c dblp-2017-11-01.xml.gz.md5
# Download the DTD file.
wget -v http://dblp.dagstuhl.de/xml/release/dblp-2017-08-29.dtd
# Extract the XML file.
gzip -d dblp-2017-11-01.xml.gz
import sys
import random
number = int(sys.argv[1])
filename = sys.argv[2]
with open(filename) as f:
    lines = f.readlines()
linestoprint = random.sample(range(len(lines)), number)
for ln in linestoprint:
    print(lines[ln], end='')
#!/bin/bash
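# Prefix each line with its number of opening brackets (the tree size), sort numerically, then drop the prefix.
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"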
#!/bin/bash
# Delete all downloaded files.
rm *xml*
rm *.dtd