How to import IMDB database using csvkit and copy
Start your server and basic dependencies
If you don't have a PostgreSQL server running on your machine you can run using Docker:
$ docker run -v imdb:/var/lib/postgresql/data --name imdbpg --rm postgres
We will need python
and wget
. If you are using the command above before installing the dependencies you have to run:
$ docker exec -it imdbpg bash
# apt-get update
And install the dependencies using apt-get
:
# apt-get --yes install wget python-dev python-pip postgresql-server-dev-all
Step 1: Install csvkit
Install csvkit using pip
:
# pip install csvkit
Step 2: Download the tsv.gz files
You can use a regex to download only the files which name contains tsv.gz
:
# wget -A "*tsv.gz" --mirror "https://datasets.imdbws.com/"
# gunzip datasets.imdbws.com/*gz
Step 3: Import the data
# su - postgres
$ createdb imdb
$ for file in /datasets.imdbws.com/*tsv
do
table_name=$(basename $file .tsv)
head -n 20 $file | csvsql --no-constraints --table $table_name | sed "s/\(DECIMAL\|BOOLEAN\)/varchar/g" | psql imdb
psql imdb -c "copy \"${table_name}\" from '${file}' delimiter E'\t'"
done
Notice that I replaced decimal
and boolean
for varchar
. In my case, I need the database just to take a look at it but I've got problems with some columns. This is not the best way, but for me, it is enough.
If you care about constraints and column types you can create your own schema looking at each tsv
file or using csvsql
just to get the basic schema. Anyway, you will import using PostgreSQL copy
function.
And that is it 😉.