How to import IMDB database using csvkit and copy

#dev, #docker

Start your server and basic dependencies

If you don’t have a PostgreSQL server running on your machine you can run using Docker:

$ docker run -v imdb:/var/lib/postgresql/data --name imdbpg --rm postgres

We will need python and wget. If you are using the command above before installing the dependencies you have to run:

$ docker exec -it imdbpg bash
# apt-get update

And install the dependencies using apt-get:

# apt-get --yes install wget python-dev python-pip postgresql-server-dev-all

Step 1: Install csvkit

Install csvkit using pip:

# pip install csvkit

Step 2: Download the tsv.gz files

You can use a regex to download only the files which name contains tsv.gz:

# wget -A "*tsv.gz" --mirror "https://datasets.imdbws.com/"
# gunzip datasets.imdbws.com/*gz

Step 3: Import the data

# su - postgres
$ createdb imdb
$ for file in /datasets.imdbws.com/*tsv
do
  table_name=$(basename $file .tsv)
  head -n 20 $file | csvsql --no-constraints --table $table_name | sed "s/\(DECIMAL\|BOOLEAN\)/varchar/g" | psql imdb
  psql imdb -c "copy \"${table_name}\" from '${file}' delimiter E'\t'"
done

Notice that I replaced decimal and boolean for varchar. In my case, I need the database just to take a look at it but I’ve got problems with some columns. This is not the best way, but for me, it is enough.

If you care about constraints and column types you can create your own schema looking at each tsv file or using csvsql just to get the basic schema. Anyway, you will import using PostgreSQL copy function.

And that is it 😉.

Posts in this series

References