Commit b90f74e1 authored by Mateusz Pawlik

dblp: First commit of entire workflow.

parent c43603ce
## Short description
DBLP computer science bibliography. The full dataset can be downloaded as a single large XML file.
## Conversion details
- Each tag becomes a node labeled with the tag's name.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are sorted by their keys.
- Attribute-nodes come before the content-node.
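The rules above can be illustrated with a minimal sketch. It uses the standard-library `xml.etree.ElementTree` instead of the SAX handler from this commit, and the helper name `to_bracket` and the sample record are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def to_bracket(element):
    """Convert an XML element into a bracket-notation string (illustrative sketch)."""
    # The tag name is the node label.
    parts = ["{" + element.tag]
    # Attribute-nodes: sorted by key, placed before the content-node.
    for key, value in sorted(element.attrib.items()):
        parts.append("{" + key + "{" + value + "}}")
    # Content-node: the tag's text becomes a child node (blank text is skipped).
    if element.text and element.text.strip():
        parts.append("{" + element.text + "}")
    # Nested tags become child nodes, in document order.
    for child in element:
        parts.append(to_bracket(child))
        if child.tail and child.tail.strip():
            parts.append("{" + child.tail + "}")
    parts.append("}")
    return "".join(parts)


# Hypothetical record, not taken from the DBLP data.
sample = (
    '<article mdate="2017-05-28" key="journals/x/Y17">'
    "<author>A. Author</author><title>A Title</title></article>"
)
print(to_bracket(ET.fromstring(sample)))
# {article{key{journals/x/Y17}}{mdate{2017-05-28}}{author{A. Author}}{title{A Title}}}
```

Note that `key` precedes `mdate` in the output even though the XML lists `mdate` first, because attribute-nodes are sorted by key.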
## Dependencies
- **lxml** (XML processor for Python)
- **Python3**
- **wget**
- **gzip**
## Steps
**For repeatability, the workflow downloads a specific version of the data, dblp-2017-11-01 (hardcoded in ```` and ````).**
Execute to download all necessary files.
Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on an i5 laptop.**
Sample the dataset. **We perform a join on a subset only.**
python 100000 dblp.bracket > dblp_random_100k.bracket
Execute to sort the dataset by tree size.
./ dblp_random_100k.bracket
**(Optional)** Execute to delete all downloaded files. Only the output dataset files remain.
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler

# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class DBLPContentHandler(ContentHandler):
    def __init__(self):
        # NOTE: the original attribute name is missing from this page;
        # `bracket_string` is an assumed name.
        self.bracket_string = ""

    # Open tag.
    def startElementNS(self, name, qname, attributes):
        uri, localname = name
        self.bracket_string += "{" + localname
        d = dict(attributes)
        # Sort the attributes by their keys (keys are (uri, localname) pairs).
        for key, value in sorted(d.items(), key = lambda element : element[0][1]):
            self.bracket_string += "{" + key[1] + "{" + value + "}}"

    # Close tag.
    def endElementNS(self, name, qname):
        self.bracket_string += "}"

    # Tag content.
    def characters(self, data):
        self.bracket_string += "{" + data + "}"

print("--- Loading DBLP dataset.")
dblp_parser = etree.XMLParser(load_dtd=True, remove_blank_text=True)
dblp_data_tree = etree.parse('dblp-2017-11-01.xml', dblp_parser)
root = dblp_data_tree.getroot()

# Output files.
dblp_bracket = open('dblp.bracket', 'w')

print("--- Processing each child of DBLP's root.")
tree_id = 0
for child in root:
    tree_id += 1
    # Printing simple progress.
    if tree_id % 100000 == 0:
        print("- Tree %s" % (tree_id))
    # A fresh handler per record, so each output line holds exactly one tree.
    handler = DBLPContentHandler()
    lxml.sax.saxify(child, handler)
    dblp_bracket.write(handler.bracket_string + "\n")

print("--- Closing output files.")
dblp_bracket.close()
# Download the XML file.
wget -v
# Download the checksum.
wget -v
# Verify the checksum.
md5sum -c dblp-2017-11-01.xml.gz.md5
# Download the DTD file.
wget -v
# Extract the XML file.
gzip -d dblp-2017-11-01.xml.gz
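The checksum step above relies on `md5sum -c`. On systems without that tool, a rough Python equivalent might look like the sketch below. It assumes the `.md5` file uses the usual `<hex digest>  <filename>` layout; the helper names `md5_of` and `matches_checksum_file` are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(data_path, md5_path):
    """Compare a file's digest with the first field of an md5sum-style file."""
    with open(md5_path) as f:
        # Assumption: the first whitespace-separated field is the hex digest.
        expected = f.read().split()[0]
    return md5_of(data_path) == expected.lower()
```

For example, `matches_checksum_file('dblp-2017-11-01.xml.gz', 'dblp-2017-11-01.xml.gz.md5')` would have to run before the `gzip -d` step, since decompression replaces the `.gz` file.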
import sys
import random

number = int(sys.argv[1])
filename = sys.argv[2]

with open(filename) as f:
    lines = f.readlines()

linestoprint = random.sample(range(len(lines)), number)
for ln in linestoprint:
    print(lines[ln], end='')
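The sampling script above draws a different subset on every run. Since the workflow aims for repeatability, one possible variation is to seed the random generator, as in the sketch below. The function `sample_lines` and its `seed` parameter are hypothetical and not part of this commit.

```python
import random


def sample_lines(lines, number, seed=None):
    """Return `number` distinct lines chosen at random.

    With a fixed `seed`, the same input yields the same sample on every run.
    """
    rng = random.Random(seed)  # local generator; leaves the global state untouched
    return rng.sample(lines, number)
```

As with `random.sample` in the script, this raises `ValueError` if `number` exceeds the number of available lines.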
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"
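The pipeline above treats the number of `{` characters on a line as the tree size, because every node in bracket notation opens with exactly one `{`. The same idea can be sketched in Python; the function name `sort_by_tree_size` is hypothetical. Unlike `sort -n`, Python's sort is stable, so trees of equal size keep their input order and ties may come out in a different order than from the shell pipeline.

```python
def sort_by_tree_size(lines):
    """Sort bracket-notation trees by node count (number of '{' characters)."""
    return sorted(lines, key=lambda line: line.count("{"))


trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# ['{a}', '{a{b}}', '{a{b}{c}}']
```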
# Delete all downloaded files.
rm *xml*
rm *.dtd