Mateusz Pawlik / ted-datasets
Commit deabcccb, authored Oct 23, 2018 by Mateusz Pawlik
Finalized dblp and tested the entire pipeline.
Parent: 635388e3
Changes: 3 files
dblp/download_prepare.sh @ deabcccb
@@ -18,9 +18,14 @@ gzip -d dblp-2017-11-01.xml.gz
 # Convert XML to bracket notation.
 ./dblp_to_bracket.py
 
+# Remove 'www' entries.
+awk '!/{www{key{homepages/' dblp.bracket > dblp_no_www.bracket
+
 # Sort the dataset.
-./../utilities/sort_dataset.sh dblp.bracket
+./../utilities/sort_dataset.sh dblp_no_www.bracket > dblp_no_www_sorted.bracket
 
 # Tidy up.
-# rm *xml*
-# rm *.dtd
+rm *xml*
+rm *.dtd
+rm dblp.bracket
+rm dblp_no_www.bracket
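The new filter step in download_prepare.sh drops every line that contains `{www{key{homepages/`. A minimal sketch of that behavior follows, under two assumptions: the sample trees are hypothetical (the real dblp.bracket content may look different), and `filter_www` is an illustrative helper name that does not exist in the repository. The sketch uses awk's `index()` for a literal substring match instead of the regex from the commit, because some awk implementations give `{` a special meaning inside a regex.

```shell
# Hypothetical sample trees, one per line; not taken from the real dataset.
sample='{article{key{journals/x/A1}}{title{First}}}
{www{key{homepages/x/Smith}}{title{Home Page}}}
{inproceedings{key{conf/y/B2}}{title{Second}}}'

# Keep only lines that do NOT contain the fixed string "{www{key{homepages/".
# index() returns 0 when the substring is absent, so the pattern is true
# exactly for the lines we want to keep, and awk prints them by default.
filter_www() {
  awk 'index($0, "{www{key{homepages/") == 0'
}

printf '%s\n' "$sample" | filter_www
```

On this sample the `www` line is removed and the two remaining trees keep their original order.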
dblp/tidy-up.sh (deleted, mode 100644 → 0) @ 635388e3
-#!/bin/bash
-# Delete all downloaded files.
-rm *xml*
-rm *.dtd
utilities/sort_dataset.sh @ deabcccb
@@ -32,4 +32,4 @@
 # NOTE: We substract the escaped brackets because they're part of node labels.
 #
 cat $input | awk '{print gsub("{","{")-gsub("\\\\{","\\{"), $0}' | \
-sort -n --buffer-size=4G | cut -d' ' -f2-
+sort -n --buffer-size=10G | cut -d' ' -f2-
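The pipeline in sort_dataset.sh prefixes each line with its count of unescaped opening brackets, sorts numerically on that prefix, and then cuts the prefix off again. The sketch below illustrates the same idea on a tiny, hypothetical sample. `sort_by_size` is an illustrative name, not a function from the repository. The regexes are written as bracket expressions (`[{]`) so that `{` is unambiguously literal, and `--buffer-size` is omitted because the sample is only three lines long.

```shell
# Hypothetical sample trees; the second-to-last one has an escaped bracket
# inside a node label, which must not be counted.
sample='{a{b}{c}}
{x}
{a{\{b\}}}'

sort_by_size() {
  # gsub() returns the number of replacements; replacing each match with "&"
  # (the matched text) counts occurrences without changing the line.
  awk '{
    total   = gsub(/[{]/, "&")     # every opening bracket
    escaped = gsub(/\\[{]/, "&")   # opening brackets preceded by a backslash
    print total - escaped, $0
  }' |
    sort -n |          # order by the numeric prefix
    cut -d' ' -f2-     # drop the prefix, keep the original line
}

printf '%s\n' "$sample" | sort_by_size
```

The three sample lines have 3, 1, and 2 unescaped opening brackets respectively, so the output order is `{x}`, then `{a{\{b\}}}`, then `{a{b}{c}}`.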