python/README.md +6 −2

@@ -24,13 +24,17 @@ https://www.sri.inf.ethz.ch/py150
 - **sort**
 - **cut**

+## RAM requirements
+
+The current way of processing the DBLP dataset requires **??GB** of RAM.
+
 ## Steps

-Execute to download all necessary files, convert them into bracket notation and sort them.
+Execute the following to download and prepare the dataset.

 ```bash
 ./download_prepare.sh
 ```

 ## Estimated time

-To be measured.
\ No newline at end of file
+On an Intel Xeon 2.40GHz CPU, it takes around **??min**.
\ No newline at end of file
python/download_prepare.sh +29 −14

 #!/bin/bash

-# file: download_prepare.sh
+# The MIT License (MIT)
+# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
 #
-# Program: Downloads and prepares data containing abstract syntax trees
-# from python programs. (https://www.sri.inf.ethz.ch/py150)
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
 #
-# Author: Thomas Huetter
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.

-# download abstract syntax trees
+# Download the raw dataset.
 wget http://files.srl.inf.ethz.ch/data/py150.tar.gz

-# extract abstract syntax trees
+# Extract the dataset.
 tar -xzf py150.tar.gz

-# convert ast to bracket notation
-python3 parse_json.py --inputfile python100k_train.json > python_ast.bracket
+# Convert the training dataset to bracket notation.
+python3 parse_json.py --inputfile python100k_train.json > python.bracket

-# convert ast to bracket notation
-python3 parse_json.py --inputfile python50k_eval.json >> python_ast.bracket
+# Convert the evaluation dataset to bracket notation and append to training.
+python3 parse_json.py --inputfile python50k_eval.json >> python.bracket

-# sort the trees ascending by their size
-./sort_dataset.sh python_ast.bracket
+# Sort the trees by size.
+./../utilities/sort_dataset.sh python.bracket > python_sorted.bracket
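The last step of the script sorts the converted trees by size. The contents of `sort_dataset.sh` are not part of this diff, so the following is only a minimal Python sketch of the idea, not the script's actual logic. It assumes one tree per line, that "size" means the number of nodes (one per opening brace), that the order is ascending (as the removed comment stated), and that literal braces inside labels are backslash-escaped. The names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(bracket):
    """Count the nodes of a tree in bracket notation.

    Assumption: every unescaped '{' opens exactly one node, and a
    backslash escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in bracket:
        if escaped:
            escaped = False      # skip the escaped character
        elif ch == "\\":
            escaped = True       # next character is a literal
        elif ch == "{":
            size += 1
    return size


def sort_by_size(lines):
    """Return the non-empty lines sorted ascending by node count."""
    trees = [line.strip() for line in lines if line.strip()]
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted(trees, key=tree_size)


if __name__ == "__main__":
    sample = ["{a{b}{c}}", "{a}", "{a{b}}"]
    for tree in sort_by_size(sample):
        print(tree_size(tree), tree)
```

Sorting in memory like this is only practical for small files; the README lists **sort** and **cut** as requirements, which suggests the real script relies on those tools instead.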
python/parse_json.py +28 −11

-#!/usr/bin/env python
-'''
-File name: parse_json.py
-Author: Thomas Huetter
-Program: Reads Python and Javascript Abstract Syntax Trees and converts
-them into bracket notation.
-(https://www.sri.inf.ethz.ch/py150)
-'''
+#!/usr/bin/env python3
+
+# The MIT License (MIT)
+# Copyright (c) 2017 Thomas Hütter.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.

 import argparse
 import json

-# recursively traverses the tree and converts it step by step
-# into bracket notation
+# Reads Python and Javascript Abstract Syntax Trees and converts them into
+# bracket notation.
+
+# Recursively traverse the tree and convert it into bracket notation.
 def print_tree(json_tree, index):
     print('{' + json_tree[index]['type'], end='')
     if 'value' in json_tree[index]:

@@ -25,7 +42,7 @@ def print_tree(json_tree, index):
         print_tree(json_tree, child)
     print('}', end='')

-# parse input argurments
+# Parse input arguments.
 parser = argparse.ArgumentParser()
 parser.add_argument("--inputfile", type=str,
     help="path to input files containing line separated JSON ASTs")
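The diff collapses the middle of `print_tree`, so how a node's `value` is emitted and how its children are looked up is not visible here. As a rough, self-contained sketch of the recursive conversion, the version below builds a string instead of printing. It assumes each input line is a JSON list of nodes, that child positions are stored under a `children` key, that type and value are joined with `:`, and that braces in labels are backslash-escaped; all four are assumptions, and `escape` and `to_bracket` are hypothetical names rather than functions from `parse_json.py`.

```python
import json


def escape(text):
    # Hypothetical helper: backslash-escape characters that would otherwise
    # be read as tree structure in bracket notation.
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(nodes, index=0):
    """Return the subtree rooted at nodes[index] in bracket notation."""
    node = nodes[index]
    label = node["type"]
    if "value" in node:
        # Assumption: the separator between type and value is ':'.
        label += ":" + str(node["value"])
    parts = ["{", escape(label)]
    # Assumption: children are given as indices into the same node list.
    for child in node.get("children", []):
        parts.append(to_bracket(nodes, child))
    parts.append("}")
    return "".join(parts)


if __name__ == "__main__":
    line = ('[{"type": "Module", "children": [1]}, '
            '{"type": "Expr", "children": [2]}, '
            '{"type": "Num", "value": "1"}]')
    print(to_bracket(json.loads(line)))  # prints {Module{Expr{Num:1}}}
```

Returning a string rather than printing makes the function easy to test. Very deep trees could still hit Python's recursion limit, in which case an explicit stack would be the safer choice.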