Building a Relational Database from a CSV file using PostgreSQL
Dataset
The data is from the Superstore Sales dataset, which can be found here. This dataset is stored as a csv file and contains example sales transaction data. Metadata is also shared with the file.
Metadata
Row ID: Unique ID for each row.
Order ID: Unique Order ID for each Customer.
Order Date: Order Date of the product.
Ship Date: Shipping Date of the Product.
Ship Mode: Shipping Mode specified by the Customer.
Customer ID: Unique ID to identify each Customer.
Customer Name: Name of the Customer.
Segment: The segment where the Customer belongs.
Country: Country of residence of the Customer.
City: City of residence of the Customer.
State: State of residence of the Customer.
Postal Code: Postal Code of every Customer.
Region: Region where the Customer belongs.
Product ID: Unique ID of the Product.
Category: Category of the product ordered.
Subcategory: Subcategory of the product ordered.
Product Name: Name of the Product
Sales: Sales of the Product.
Quantity: Quantity of the Product.
Discount: Discount provided.
Profit: Profit/Loss incurred.
Creating a database
CREATE DATABASE superstore_db;
Loading the data to the PostgreSQL
Using the internal Query Tool in pgAdmin, we create a table with an SQL query. Then we populate the table by importing the csv file.
CREATE TABLE superstore(
Row_ID INT,
Order_ID VARCHAR(30),
Order_Date DATE,
Ship_Date DATE,
Ship_Mode VARCHAR(20),
Customer_ID VARCHAR(20),
Customer_Name VARCHAR(50),
Segment VARCHAR(20),
Country VARCHAR(100),
City VARCHAR(100),
State VARCHAR(100),
Postal_Code INT,
Region VARCHAR(10),
Product_ID VARCHAR(20),
Category VARCHAR(30),
Subcategory VARCHAR(30),
Product_Name VARCHAR(300),
Sales NUMERIC,
Quantity INT,
Discount NUMERIC,
Profit NUMERIC);
Creating an Entity Relationship Diagram (ERD)
Basically, the information contained in the data can be divided into four entities: Customers, Products, Shipment, and Sales.
While examining the address information, we see that there is more than one shipping address for each customer. Since there aren't any data logs about the updates on the customer data, we create a separate table for shipment details including the address information for each order. Besides we create a separate table with the customer_id, customer_name, and segment columns.
Creating the tables
CREATE TABLE customers (
customer_id VARCHAR(10),
customer_name VARCHAR(50),
segment VARCHAR(20),
CONSTRAINT pk_customer_id PRIMARY KEY(customer_id)
);
CREATE TABLE shipment (
order_id VARCHAR(20),
ship_date DATE,
ship_mode VARCHAR(20),
country VARCHAR(100),
city VARCHAR(100),
state VARCHAR(100),
postal_code INT,
region VARCHAR(10),
customer_id VARCHAR(10),
CONSTRAINT pk_order_id PRIMARY KEY(order_id),
CONSTRAINT fk_customer_id FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
CREATE TABLE products (
product_id VARCHAR(20),
product_name VARCHAR(300),
category VARCHAR(30),
subcategory VARCHAR(30),
CONSTRAINT pk_product_id PRIMARY KEY(Product_ID)
);
CREATE TABLE sales (
order_id VARCHAR(20),
product_id VARCHAR(20),
order_date DATE,
payment NUMERIC,
quantity INT,
discount NUMERIC,
profit NUMERIC
);
Inserting records into the tables
Populating the customers table :
INSERT INTO customers
SELECT
s.customer_id,
customer_name
segment
FROM superstore AS s
INNER JOIN
( SELECT
customer_id, max(row_id) AS row_id
FROM superstore
GROUP BY customer_id) AS sub
ON s.row_id = sub.row_id;
Populating the shipment table :
INSERT INTO shipment
SELECT
order_id,
ship_date,
ship_mode,
country,
city,
state,
postal_code,
region,
customer_id
FROM superstore
GROUP BY
order_id,
ship_date,
ship_mode,
country,
city,
state,
postal_code,
region,
customer_id
ORDER BY
order_id,
country,
city,
state;
Populating the products table :
INSERT INTO products
SELECT
product_id,
product_name,
category,
subcategory
FROM superstore
GROUP BY
product_id,
product_name,
category,
subcategory;
Populating the sales table :
INSERT INTO sales
SELECT
order_id,
product_id,
order_date,
sales AS payment,
quantity,
discount,
profit
FROM superstore;
Adding a surrogate key to the sales table
We'll add a surrogate key to the sales table because the existing attributes are not really suited as the primary key.
ALTER TABLE sales
ADD COLUMN sales_id serial,
ADD CONSTRAINT sales_id_pk PRIMARY KEY(sales_id),
ADD CONSTRAINT order_id_pk FOREIGN KEY(order_id) REFERENCES shipment(order_id),
ADD CONSTRAINT product_id_pk FOREIGN KEY(product_id) REFERENCES products(product_id);