If you need to test out your big data tools, below is a useful set of scripts that I have used in the past for AWS EMR and Redshift:
Install make and git
sudo yum install make git -y
Install the tpch-kit (and gcc, which is needed to compile it)
git clone https://github.com/gregrahn/tpch-kit
cd tpch-kit/dbgen
sudo yum install gcc -y
Compile the tpch-kit
make OS=LINUX
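If the build worked, you should now have the dbgen and qgen binaries sitting in this directory. Quick sanity check:
ls -l dbgen qgen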
Go home
cd ~
Now make your emr data
mkdir emrdata
Tell tpch to use this dir
export DSS_PATH=$HOME/emrdata
cd tpch-kit/dbgen
Now run dbgen in verbose mode, generating the orders and lineitem tables (-T o) at scale factor 10 (roughly 10GB of data)
./dbgen -v -T o -s 10
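The generated .tbl files should now be in the emrdata directory; it's worth checking their sizes before uploading:
ls -lh $HOME/emrdata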
Move the data to an S3 bucket
cd $HOME/emrdata
aws s3api create-bucket --bucket andrewbakerbigdata --region af-south-1 --create-bucket-configuration LocationConstraint=af-south-1
aws s3 cp $HOME/emrdata s3://andrewbakerbigdata/emrdata --recursive
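To confirm the upload landed, list the bucket contents with totals:
aws s3 ls s3://andrewbakerbigdata/emrdata/ --recursive --human-readable --summarize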
cd $HOME
mkdir redshiftdata
Tell tpch to use this dir
export DSS_PATH=$HOME/redshiftdata
Now make your Redshift data
cd tpch-kit/dbgen
Now run dbgen in verbose mode, generating the orders and lineitem tables (-T o) at scale factor 40 (roughly 40GB of data)
./dbgen -v -T o -s 40
These are big files, so let's find out how big they are and split them
Count lines
cd $HOME/redshiftdata
wc -l orders.tbl
Now split orders into 15m lines per file (at scale factor 40, orders has roughly 60 million rows, so expect around four files)
split -d -l 15000000 -a 4 orders.tbl orders.tbl.
Now split lineitem the same way, into 60m lines per file
wc -l lineitem.tbl
split -d -l 60000000 -a 4 lineitem.tbl lineitem.tbl.
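You should now have numbered chunks (orders.tbl.0000, orders.tbl.0001, etc., and the same for lineitem). A quick check of what was produced:
ls -lh $HOME/redshiftdata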
Now clean up the master files
rm orders.tbl
rm lineitem.tbl
Move the split data to an S3 bucket
aws s3 cp $HOME/redshiftdata s3://andrewbakerbigdata/redshiftdata --recursive
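From here you can load each table into Redshift with a COPY per table. A rough sketch is below, assuming you've already created the TPC-H tables in your cluster and attached an IAM role with read access to the bucket (the role ARN is a placeholder); note that dbgen writes a trailing | at the end of every row, which you may need to strip or handle when loading:
COPY orders
FROM 's3://andrewbakerbigdata/redshiftdata/orders.tbl.'
IAM_ROLE 'arn:aws:iam::<your-account-id>:role/<your-redshift-role>'
DELIMITER '|'
REGION 'af-south-1';
The key prefix after FROM matches all the orders.tbl.* chunks, so Redshift loads them in parallel across slices; repeat the same pattern for lineitem.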