added script to download and prepare sentiment data (19f39918) · Commits · Mateusz Pawlik / ted-datasets

sentiment/download_prepare.sh

0 → 100755

+29 −0

		#!/bin/bash
		# file: download_prepare.sh
		#
		# Program: Downloads and prepares the sentiment dataset.
		#
		# Author: Thomas Huetter

		# create target folder (if missing) and change into it
		mkdir -p sentiment
		cd sentiment || exit 1

		# download the data files
		wget https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip

		# unzip the data (the archive extracts into the folder trees)
		unzip trainDevTestTrees_PTB.zip

		# remove zip file
		rm -f trainDevTestTrees_PTB.zip

		# change to unzipped folder
		cd trees

		# prepare data for file sentiment.bracket
		# convert dev.txt and train.txt into UTF-8 format | replace ( by { | replace ) by } | prefix each line with its number of "{" (equivalent to number of nodes) | sort by that number | strip the prefix
		iconv -f ISO-8859-1 -t "UTF-8" dev.txt train.txt | sed -e 's/(/{/g' | sed -e 's/)/}/g' | awk '{print gsub("{","{"), $0}' | sort -n | cut -d' ' -f2- > ../sentiment.bracket

		# go back to the sentiment folder
		cd ..
		No newline at end of file
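
Review note: the conversion pipeline can be tried on a small input without downloading the archive. The sketch below is not part of the commit. The function name `to_sorted_brackets` and the two sample trees are made up for illustration. It assumes the input is already UTF-8, so the `iconv` step is left out, and it counts `{` with the bracket expression `[{]` instead of the string `"{"` used in the script.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): applies the same steps as the
# script's pipeline to trees read from stdin, minus the iconv conversion.
to_sorted_brackets() {
  # ( -> { and ) -> }, then prefix every line with its number of "{"
  # (gsub returns the number of replacements), sort numerically on that
  # prefix, and finally cut the prefix off again.
  sed -e 's/(/{/g' -e 's/)/}/g' \
    | awk '{ print gsub(/[{]/, "{"), $0 }' \
    | sort -n \
    | cut -d' ' -f2-
}

# Two made-up trees: the first has 3 nodes, the second has 1.
result=$(printf '%s\n' '(3 (2 good) (2 film))' '(2 fine)' | to_sorted_brackets)
printf '%s\n' "$result"
# prints:
# {2 fine}
# {3 {2 good} {2 film}}
```

The smaller tree comes first because `sort -n` orders the lines by the node count that `awk` prepends.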