Mateusz Pawlik / ted-datasets
Commit deabcccb, authored Oct 23, 2018 by Mateusz Pawlik
Finalized dblp and tested the entire pipeline.
parent 635388e3
Showing 3 changed files with 9 additions and 9 deletions
dblp/download_prepare.sh +8 -3
dblp/tidy-up.sh +0 -5
utilities/sort_dataset.sh +1 -1
dblp/download_prepare.sh
@@ -18,9 +18,14 @@ gzip -d dblp-2017-11-01.xml.gz
 # Convert XML to bracket notation.
 ./dblp_to_bracket.py
+# Remove 'www' entries.
+awk '!/{www{key{homepages/' dblp.bracket > dblp_no_www.bracket
 # Sort the dataset.
-./../utilities/sort_dataset.sh dblp.bracket
+./../utilities/sort_dataset.sh dblp_no_www.bracket > dblp_no_www_sorted.bracket
 # Tidy up.
-# rm *xml*
-# rm *.dtd
+rm *xml*
+rm *.dtd
+rm dblp.bracket
+rm dblp_no_www.bracket
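The 'www' filtering step added above can be tried on a small sample. This is only a sketch: the three sample trees are hypothetical and merely imitate the `{www{key{homepages/` prefix that the commit's awk pattern targets. It uses `index()` for a literal substring test instead of the commit's regex, since how awk treats an unescaped `{` inside a regex can vary between implementations; that swap is an assumption made for portability, not part of the commit.

```shell
# Hypothetical sample in bracket notation, one tree per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{article{key{journals/x/A1}}{title{T1}}}
{www{key{homepages/b/Bob}}{author{Bob}}}
{inproceedings{key{conf/y/C1}}{title{T2}}}
EOF

# Keep only lines that do NOT contain the literal 'www' homepage marker.
filtered=$(awk 'index($0, "{www{key{homepages/") == 0' "$sample")
printf '%s\n' "$filtered"

rm -f "$sample"
```

With this sample, the second line is dropped and the article and inproceedings trees pass through unchanged.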
dblp/tidy-up.sh deleted 100644 → 0
-#!/bin/bash
-# Delete all downloaded files.
-rm *xml*
-rm *.dtd
utilities/sort_dataset.sh
@@ -32,4 +32,4 @@
 # NOTE: We substract the escaped brackets because they're part of node labels.
 #
 cat $input | awk '{print gsub("{","{")-gsub("\\\\{","\\{"), $0}' | \
-sort -n --buffer-size=4G | cut -d' ' -f2-
+sort -n --buffer-size=10G | cut -d' ' -f2-
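The sort pipeline in this hunk prefixes each tree with its node count (opening brackets minus escaped ones), sorts numerically on that prefix, then cuts the prefix off again. A minimal sketch of the same idea on a hypothetical three-line sample follows. It writes the patterns as `/[{]/` and `/\\[{]/` with `&` as the replacement, an assumed equivalent of the commit's `gsub` calls that avoids a bare `{` in the regex. It also omits `--buffer-size`, which the commit raises from 4G to 10G, because the sample is tiny.

```shell
# Hypothetical sample; the third tree has an escaped bracket inside a label.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{a{b}{c}{d}}
{a}
{a{b\{c}}
EOF

# gsub returns the number of matches; "&" puts each match back unchanged.
# Node count = all '{' minus escaped '\{', printed in front of the tree.
sorted=$(awk '{ print gsub(/[{]/, "&") - gsub(/\\[{]/, "&"), $0 }' "$sample" \
  | sort -n \
  | cut -d' ' -f2-)
printf '%s\n' "$sorted"

rm -f "$sample"
```

The node counts here are 4, 1 and 2, so the trees come out in the order `{a}`, `{a{b\{c}}`, `{a{b}{c}{d}}`.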