IntermediateSQL

Structured Query Language

User only need to specify what data they want rather than the strategy on how to query the data

Three part of SQL

  • Data Manipulate Language(DML)
  • Data Definition Language(DDL)
  • Data Control Language(DCL)

SQL is based on bags(duplicate) rather then sets(no-duplicate)

Basic Syntax

1
2
3
4
SELECT column1,column2,... FROM Relation WHERE predicate1,...;
# it's OK to use lowercase key word
# However, use capital for key words is convention
select column1,column2,... from Relation where predicate1,...;

Aggregation+Group by

Aggregate: return a single value from a bag of tuples
1
2
3
4
5
AVG(col);
MIN(col);
MAX(col);
SUM(col);
COUNT(col);
Group by : Project tuples into subset and calculate aggregate against each subset
1
GROUP BY

group by

String/Date/Time Operation

In mysql string is case-insensitive while in sql-92 it is case sensitive

Like
1
2
3
4
# match any substring : %
# match exactly one character : _
SELECT * FROM student AS s WHERE s.login LIKE '%@cs';
SELECT * FROM enrolled AS e WHERE e.cid LIKE '15-7_1';
String functions

Each DBMS have its own implementation

String concatenation
1
2
'a' || 'b' # in sql-92
CONCAT('a','b') # in MySQL
Date Operation

differs widely between different DBMS

Output Control + Redirection

Redirect the query result to another new table

INTO
1
SELECT DISTINCT cid INTO newtable FROM enrolled;
INSERT
1
2
INSERT INTO existTable (SELECT DISTINCT cid FROM enrolled);
# query result must generate the same columns as the to be inserted table
Order by
1
2
3
# order the results by a specified order
SELECT sid,grade FROM enrolled WHERE cid='15-721' ORDER BY grade ;
SELECT sid,grade FROM enrolled WHERE cid='15-721' ORDER BY grade [DESC | ASC];
Limit

limit the number returned by querying

1
2
SELECT * FROM student LIMIT 5; # get the first 5 students
SELECT * FROM student LIMIT 5 OFFSET 2; # get student 2~7

Nested Queries

IN/ANY/ALL/EXISTS
1
2
3
SELECT name FROM student WHERE sid IN(SELECT sid FROM enrolled);
SELECT name FROM student WHERE sid ALL(SELECT sid FROM enrolled);
# IN (first query)

Common Table Expression

create temporary table in large query

1
WITH sname AS (SELECT 1) SELECT * FROM sname;

recursion can be implemented via common table expression

Window Function

performs a “sliding” calculation across a set of tuples are related.


SQL from CS50

1
2
3
4
5
6
7
8
9
# CRUD
# Create
CREATE , INSERT
# Retrieve
SELECT
# Update
UPDATE
# Delete
DELETE
1
CREATE TABLE tablename (columnname TYPE, ...);
1
2
3
4
5
DISTINCT()
AVG()
MAX()
UPPER()
...
1
SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ... LIMIT ...;
1
SELECT LOWER(TRIM(title)),COUNT(LOWER(TRIM(title))) FROM shows GROUP BY LOWER(TRIM(title)) ORDER BY COUNT(TRIM(title)) DESC LIMIT 10;
1
INSERT INTO tablename (columnname , ...) VALUES(valuee,...);
1
UPDATE tablename SET column1 = value1, column2 = value2, ... WHERE ...;
1
DELETE FROM table_name WHERE condition;

Main Datatype

1
2
3
4
5
BLOB # binary stuff
INTEGER # int
NUMERIC
REAL # float
TEXT

INDEX

you can build a B-tree in memory via INDEX in SQL

USER ? Placeholder to avoid SQLI attack

Comment in SQL

1
-- some comment after `--`

Connect with Python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import sqlite3 as sql
open('new.db','wb').close()
con = sql.connect('new.db')
cur = con.cursor()
cur.execute("CREATE TABLE shows (id INTEGER , title TEXT , PRIMARY KEY(id))")
cur.execute("CREATE TABLE genres (show_id INTEGER , genre TEXT , FOREIGN KEY(show_id) REFERENCES shows(id))")

import csv

with open('favorites.csv','rt') as file:
reader = csv.DictReader(file)
cot = 1
for row in reader:
title = row['title'].strip().lower()
cmd = 'INSERT INTO shows (id , title) VALUES({} , "{}")'.format(cot , title)
cur.execute(cmd)
for genre in row['genres'].split():
cmd = 'INSERT INTO genres (show_id , genre) VALUES( {} , "{}")'.format(cot , genre)
cur.execute(cmd)
cot += 1
con.commit() # Don't forget to commit !
con.close()
print("OK")

Homework 1

link

The homework contains 10 questions in total and is graded out of 100 points. For each question, you will need to construct a SQL query that fetches the desired data from the SQLite DBMS. It will likely take you approximately 6-8 hours to complete the questions.

1
2
3
4
5
6
7
8
9
10
11
12
13
$ mkdir placeholder
$ cd placeholder
$ touch q1_sample.sql\
q2_string_function.sql\
q3_northamerican.sql\
q4_delaypercent.sql\
q5_aggregates.sql\
q6_discontinued.sql\
q7_order_lags.sql\
q8_total_cost_quartiles.sql\
q9_youngblood.sql\
q10_christmas.sql
$ cd ..
1
$ wget https://15445.courses.cs.cmu.edu/fall2021/files/northwind-cmudb2021.db.gz

schema

Q1

List all Category Names ordered alphabetically.

1
SELECT CategoryName FROM Category ORDER BY CategoryName;

Q2

Get all unique ShipNames from the Order table that contain a hyphen '-'.

Details: In addition, get all the characters preceding the (first) hyphen. Return ship names alphabetically. Your first row should look like Bottom-Dollar Markets|Bottom

1
SELECT DISTINCT ShipName,SUBSTR(ShipName , 1, INSTR(ShipName , '-') - 1) FROM 'Order' WHERE ShipName LIKE '%-%' ORDER BY ShipName ASC;
1
2
INSTR() -- Find a substring in a string and returns an integer indicating the position of the first occurrence of the substring.
SUBSTR() -- Extract and returns a substring with a predefined length starting at a specified position in a source string

Q3

Details: You should print the Order Id, ShipCountry, and another column that is either 'NorthAmerica' or 'OtherPlace' depending on the Ship Country.
Order by the primary key (Id) ascending and return 20 rows starting from Order Id 15445 Your output should look like 15445|France|OtherPlace or 15454|Canada|NorthAmerica

1
2
3
4
5
6
7
SELECT Id,ShipCountry,
CASE
WHEN ShipCountry IN ('USA', 'Mexico','Canada')
THEN 'NorthAmerica'
ELSE 'OtherPalce'
END
FROM 'Order' WHERE Id >= 15445 ORDER BY ID ASC LIMIT 20 ;

Q4

For each Shipper, find the percentage of orders which are late.

Details: An order is considered late if ShippedDate > RequiredDate. Print the following format, order by descending precentage, rounded to the nearest hundredths, like United Package|23.44

This one is too complex for me at the time

  • Nested selection
  • Table Joining
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
-- First get order count
SELECT ShipVia, COUNT(ShipVia) FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia GROUP BY O.ShipVia;

-- Then get later count
SELECT ShipVia,COUNT(ShipVia) FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia WHERE O.ShippedDate > O.RequiredDate GROUP BY O.ShipVia;

-- Finally calculate the percentage

SELECT ROUND(late*100.0/cnt,2) AS percent FROM

(SELECT ShipVia, COUNT(ShipVia) AS cnt FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia GROUP BY O.ShipVia) AS cntT

INNER JOIN

(SELECT ShipVia,COUNT(ShipVia) AS late FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia WHERE O.ShippedDate > O.RequiredDate GROUP BY O.ShipVia) AS lateT

ON cntT.ShipVia=lateT.ShipVia;

-- Add the CompanyName and reorder table

SELECT CompanyName , ROUND(late*100.0/cnt,2) AS percent FROM

(SELECT ShipVia, COUNT(ShipVia) AS cnt FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia GROUP BY O.ShipVia) AS cntT

INNER JOIN

(SELECT ShipVia,COUNT(ShipVia) AS late FROM Shipper AS S INNER JOIN 'Order' AS O ON S.Id=O.ShipVia WHERE O.ShippedDate > O.RequiredDate GROUP BY O.ShipVia) AS lateT

ON cntT.ShipVia=lateT.ShipVia

INNER JOIN 'Shipper' AS S ON S.Id=cntT.ShipVia
ORDER BY percent DESC;

Q5

Details: Get the number of products, average unit price (rounded to 2 decimal places), minimum unit price, maximum unit price, and total units on order for categories containing greater than 10 products.
Order by Category Id. Your output should look like Beverages|12|37.98|4.5|263.5|60

You often use the HAVING clause with the GROUP BY clause. The GROUP BY clause groups a set of rows into a set of summary rows or groups. Then the HAVING clause filters groups based on a specified condition.

1
2
3
4
5
6
7
8
9
10
11
SELECT
column_1,
column_2,
aggregate_function (column_3)
FROM
table
GROUP BY
column_1,
column_2
HAVING
search_condition;
1
2
3
4
5
-- First, get categoryId whose size >= 10
SELECT CategoryId , COUNT(*) AS C FROM Product GROUP BY CategoryId HAVING C>=10;

-- Then, join it with Category Table
SELECT CategoryName , C FROM (SELECT CategoryId , COUNT(*) AS C FROM Product GROUP BY CategoryId HAVING C>=10) AS P INNER JOIN Category ON P.CategoryId=Category.Id;