Commit f5ccdceb authored by Mateusz Pawlik

Documenting DBLP: RAM and runtime estimates.

parent 791e1282
@@ -28,51 +28,27 @@ billions of pair.
 https://www.python.org/downloads/
 - **wget**
 - **gzip**
-- **awk**
-## Steps
+## RAM requirements
-**For repeatability, it downloads a specific version of the data (hardcoded in ``download.sh`` and ``dblp-to-bracket.py``).**
-Execute to download all necessary files.
-```bash
-./download.sh
-```
-Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on i5 laptop machine.**
+The current way of processing DBLP dataset requires **16GB** of RAM memory.
-```bash
-python dblp_to_bracket.py
-```
-**Execute to get a subset only.**
+## Steps
-```bash
-python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
-```
-Execute to sort the dataset by tree size.
-```bash
+**For repeatability, it downloads a specific version of the data (hardcoded in
+``download.sh`` and ``dblp-to-bracket.py``).**
-./sort_dataset.sh dblp.bracket
-```
-Execute to remove the homepage entries.
+Execute the following to download and prepare the dataset.
 ```bash
-sed '/{www{key{homepages/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
+./download_prepare.sh
-```
-or
-```bash
-awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
 ```
+## Estimated time
-**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
+On an Intel Xeon 2.40GHz CPU, it takes around **15min**.
-```bash
-./tidy-up.sh
-```
 ## Troubleshooting
 ### Encoding
-In case of encoding error follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
-## Estimated time
-Partially listed in Steps. Total time to be measured.
\ No newline at end of file
+In case of encoding error follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
\ No newline at end of file
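The README lines removed by this commit dropped homepage entries with `sed` or `awk`. For readers who want that step without either tool, here is a minimal Python sketch of the same filter. It assumes the bracket file holds one tree per line and that homepage records contain the literal text `{www{key{homepages`, as the removed commands imply. The names `HOMEPAGE_MARKER` and `drop_homepage_entries` are hypothetical and not part of the repository, and the diff does not show whether `download_prepare.sh` performs this step.

```python
from typing import Iterable, Iterator

# Literal marker copied from the sed/awk commands in the removed README lines.
HOMEPAGE_MARKER = "{www{key{homepages"


def drop_homepage_entries(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not contain the homepage marker.

    Hypothetical helper: mirrors `sed '/{www{key{homepages/d'` under the
    assumption that each input line holds exactly one tree.
    """
    for line in lines:
        if HOMEPAGE_MARKER not in line:
            yield line


# Made-up sample trees, for illustration only.
sample = [
    "{article{key{journals/x/A1}}{title{T}}}\n",
    "{www{key{homepages/a/B}}{author{B}}}\n",
]
kept = list(drop_homepage_entries(sample))
```

To run it on a file, pass an open file object as `lines` and write the yielded lines to the output file. Lines are streamed one at a time, so the filter itself does not need to hold the dataset in memory.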