TPC-H Test Data for AWS Redshift and EMR: Setup Guide
Generating TPC-H test data for AWS Redshift and EMR requires the open-source dbgen tool from the TPC organization. Download and compile dbgen, then run it with a scale factor flag to produce flat files. Upload the generated files to S3, then use Redshift's COPY command or an EMR Hive or Spark job to load the data into your target environment.
If you need to test out your big data tools below is a useful set of scripts that I have used in the past for aws emr and redshift the below might be helpful:
install git
sudo yum install make git -y
install the tpch-kit
git clone https://github.com/gregrahn/tpch-kit
cd tpch-kit/dbgen
sudo yum install gcc -y
Compile the tpch kit
make OS=LINUX
Go home
cd ~
Now make your emr data
mkdir emrdata
Tell tcph to use the this dir
export DSS_PATH=$HOME/emrdata
cd tpch-kit/dbgen
Now run dbgen in verbose mode, with tables (orders), 10gb data size
./dbgen -v -T o -s 10
move the data to a s3 bucket
cd $HOME/emrdata
aws s3api create-bucket -- bucket andrewbakerbigdata --region af-south-1 --LocationConstraint=af-south-1
aws s3 cp $HOME/emrdata s3://andrewbakerbigdata/emrdata --recursive
cd $HOME
mkdir redshiftdata
Tell tcph to use the this dir
export DSS_PATH=$HOME/redshiftdata
Now make your redshift data
cd tpch-kit/dbgen
Now run dbgen in verbose mode, with tables (orders), 40gb data size
./dbgen -v -T o -s 40
These are big files, so lets find out how big they are and split them
Count lines
cd $HOME/redshiftdata
wc -l orders.tbl
Now split orders into 15m lines per file
split -d -l 15000000 -a 4 orders.tbl orders.tbl.
Now split line items
wc -l lineitem.tbl
split -d -l 60000000 -a 4 lineitem.tbl lineitem.tbl.
Now clean up the master files
rm orders.tbl
rm lineitem.tbl
move the split data to a s3 bucket
aws s3 cp $HOME/redshiftdata s3://andrewbakerbigdata/redshiftdata --recursive