Databases Pt. 2

Guest lecture by Bradley Dice, [@bdice](https://github.com/bdice/).

Example, continued: IMBD, Index of Marvel data with Bradley Dice

The code below sets up the movie database that we created last time, all in one cell.

In [1]:
import sqlite3
import os

database = 'movie_ratings.sqlite'

# Removes the database if it already exists
if os.path.exists(database):
    os.remove(database)

with sqlite3.connect(database) as conn:
    # Here we create a primary key, and use "NOT NULL" to prevent inserting invalid data
    conn.execute(
        """CREATE TABLE Movies
        (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL, year INTEGER NOT NULL)""")
    conn.executemany(
        """INSERT INTO Movies (title, year) VALUES (?,?)""",
        [('Iron Man', 2008),
         ('The Incredible Hulk', 2008),
         ('Iron Man 2', 2010),
         ('Thor', 2011),
         ('Captain America: The First Avenger', 2011),
         ('The Avengers', 2012),
         ('Iron Man 3', 2013),
         ('Captain America: The Winter Soldier', 2014),
         ('Avengers: Age of Ultron', 2015),
         ('Captain America: Civil War', 2016),
         ('Doctor Strange', 2016),
         ('Black Panther', 2018),
         ('Avengers: Infinity War', 2018),
        ])
    conn.execute(
        """CREATE TABLE Actors
        (id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        character TEXT NOT NULL,
        age REAL NOT NULL)""")
    conn.executemany(
        """INSERT INTO Actors (first_name, last_name, character, age) VALUES (?,?,?,?)""",
        [('Robert', 'Downey Jr.', 'Iron Man', 53),
         ('Chris', 'Evans', 'Captain America', 37),
         ('Scarlett', 'Johansson', 'Black Widow', 33),
         ('Samuel L.', 'Jackson', 'Nick Fury', 69),
         ('Benedict', 'Cumberbatch', 'Dr. Strange', 42),
         ('Brie', 'Larson', 'Captain Marvel', 29),
         ('Chadwick', 'Boseman', 'Black Panther', 40)
        ])
    conn.execute(
        """CREATE TABLE MovieActors
        (id INTEGER PRIMARY KEY AUTOINCREMENT, movie_id INTEGER NOT NULL, actor_id INTEGER NOT NULL)""")
    conn.executemany(
        """INSERT INTO MovieActors (movie_id, actor_id) VALUES (?,?)""",
        [(1, 1), (2, 1), (3, 1), (6, 1), (7, 1), (9, 1), (10, 1), (13, 1), (5, 2), (6, 2), (8, 2), (9, 2), (10, 2), (13, 2),
        (3, 3), (6, 3), (8, 3), (9, 3), (10, 3), (13, 3), (1, 4), (3, 4), (4, 4), (5, 4), (6, 4), (8, 4), (9, 4), (13, 4),
        (11, 5), (13, 5), (10, 7), (12, 7), (13, 7)])

Using pandas 🐼 to help us:

In [2]:
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

with sqlite3.connect(database) as conn:
    df = pd.read_sql('SELECT * FROM Movies', conn)
    display(df)
    df.hist('year')
id title year
0 1 Iron Man 2008
1 2 The Incredible Hulk 2008
2 3 Iron Man 2 2010
3 4 Thor 2011
4 5 Captain America: The First Avenger 2011
5 6 The Avengers 2012
6 7 Iron Man 3 2013
7 8 Captain America: The Winter Soldier 2014
8 9 Avengers: Age of Ultron 2015
9 10 Captain America: Civil War 2016
10 11 Doctor Strange 2016
11 12 Black Panther 2018
12 13 Avengers: Infinity War 2018
../_images/notebooks_lecture18_databases_4_1.png

Review of table contents

The next three cells print the table content for Movies, Actors, and MovieActors.

In [3]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT * FROM Actors""", conn)
df
Out[3]:
id first_name last_name character age
0 1 Robert Downey Jr. Iron Man 53.0
1 2 Chris Evans Captain America 37.0
2 3 Scarlett Johansson Black Widow 33.0
3 4 Samuel L. Jackson Nick Fury 69.0
4 5 Benedict Cumberbatch Dr. Strange 42.0
5 6 Brie Larson Captain Marvel 29.0
6 7 Chadwick Boseman Black Panther 40.0
In [4]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT * FROM Movies""", conn)
df
Out[4]:
id title year
0 1 Iron Man 2008
1 2 The Incredible Hulk 2008
2 3 Iron Man 2 2010
3 4 Thor 2011
4 5 Captain America: The First Avenger 2011
5 6 The Avengers 2012
6 7 Iron Man 3 2013
7 8 Captain America: The Winter Soldier 2014
8 9 Avengers: Age of Ultron 2015
9 10 Captain America: Civil War 2016
10 11 Doctor Strange 2016
11 12 Black Panther 2018
12 13 Avengers: Infinity War 2018
In [5]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT * FROM MovieActors""", conn)
df
Out[5]:
id movie_id actor_id
0 1 1 1
1 2 2 1
2 3 3 1
3 4 6 1
4 5 7 1
5 6 9 1
6 7 10 1
7 8 13 1
8 9 5 2
9 10 6 2
10 11 8 2
11 12 9 2
12 13 10 2
13 14 13 2
14 15 3 3
15 16 6 3
16 17 8 3
17 18 9 3
18 19 10 3
19 20 13 3
20 21 1 4
21 22 3 4
22 23 4 4
23 24 5 4
24 25 6 4
25 26 8 4
26 27 9 4
27 28 13 4
28 29 11 5
29 30 13 5
30 31 10 7
31 32 12 7
32 33 13 7

JOINing [STRIKEOUT:forces] tables

This demonstrates a basic JOIN operation.

In [6]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql(
        """SELECT Movies.title, Actors.character FROM MovieActors
        JOIN Movies ON MovieActors.movie_id = Movies.id
        JOIN Actors ON MovieActors.actor_id = Actors.id""", conn)
df
Out[6]:
title character
0 Iron Man Iron Man
1 The Incredible Hulk Iron Man
2 Iron Man 2 Iron Man
3 The Avengers Iron Man
4 Iron Man 3 Iron Man
5 Avengers: Age of Ultron Iron Man
6 Captain America: Civil War Iron Man
7 Avengers: Infinity War Iron Man
8 Captain America: The First Avenger Captain America
9 The Avengers Captain America
10 Captain America: The Winter Soldier Captain America
11 Avengers: Age of Ultron Captain America
12 Captain America: Civil War Captain America
13 Avengers: Infinity War Captain America
14 Iron Man 2 Black Widow
15 The Avengers Black Widow
16 Captain America: The Winter Soldier Black Widow
17 Avengers: Age of Ultron Black Widow
18 Captain America: Civil War Black Widow
19 Avengers: Infinity War Black Widow
20 Iron Man Nick Fury
21 Iron Man 2 Nick Fury
22 Thor Nick Fury
23 Captain America: The First Avenger Nick Fury
24 The Avengers Nick Fury
25 Captain America: The Winter Soldier Nick Fury
26 Avengers: Age of Ultron Nick Fury
27 Avengers: Infinity War Nick Fury
28 Doctor Strange Dr. Strange
29 Avengers: Infinity War Dr. Strange
30 Captain America: Civil War Black Panther
31 Black Panther Black Panther
32 Avengers: Infinity War Black Panther

Types of JOINs

The default when you just write JOIN is also called INNER JOIN, but there are also LEFT JOIN, RIGHT JOIN, and OUTER JOIN. There are different kinds of JOINs, explained nicely in this resource. The types of JOINs can be represented as Venn diagrams, visualized on this website. visual-join.png

LEFT JOIN example

Note that this LEFT JOIN leaves us with a NaN for Brie Larson, whose movie Captain Marvel has not been released yet.

In [7]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT Actors.first_name, Actors.last_name, MovieActors.movie_id
    FROM Actors LEFT JOIN MovieActors ON Actors.id = MovieActors.actor_id""", conn)
df
Out[7]:
first_name last_name movie_id
0 Robert Downey Jr. 1.0
1 Robert Downey Jr. 2.0
2 Robert Downey Jr. 3.0
3 Robert Downey Jr. 6.0
4 Robert Downey Jr. 7.0
5 Robert Downey Jr. 9.0
6 Robert Downey Jr. 10.0
7 Robert Downey Jr. 13.0
8 Chris Evans 5.0
9 Chris Evans 6.0
10 Chris Evans 8.0
11 Chris Evans 9.0
12 Chris Evans 10.0
13 Chris Evans 13.0
14 Scarlett Johansson 3.0
15 Scarlett Johansson 6.0
16 Scarlett Johansson 8.0
17 Scarlett Johansson 9.0
18 Scarlett Johansson 10.0
19 Scarlett Johansson 13.0
20 Samuel L. Jackson 1.0
21 Samuel L. Jackson 3.0
22 Samuel L. Jackson 4.0
23 Samuel L. Jackson 5.0
24 Samuel L. Jackson 6.0
25 Samuel L. Jackson 8.0
26 Samuel L. Jackson 9.0
27 Samuel L. Jackson 13.0
28 Benedict Cumberbatch 11.0
29 Benedict Cumberbatch 13.0
30 Brie Larson NaN
31 Chadwick Boseman 10.0
32 Chadwick Boseman 12.0
33 Chadwick Boseman 13.0
In [8]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT Actors.first_name, Actors.last_name,
    MAX(Movies.year) AS most_recent_release_year, Movies.title
    FROM Actors JOIN MovieActors ON Actors.id = MovieActors.actor_id
    JOIN Movies ON MovieActors.movie_id = Movies.id
    GROUP BY Actors.last_name""", conn)
df
Out[8]:
first_name last_name most_recent_release_year title
0 Chadwick Boseman 2018 Black Panther
1 Benedict Cumberbatch 2018 Avengers: Infinity War
2 Robert Downey Jr. 2018 Avengers: Infinity War
3 Chris Evans 2018 Avengers: Infinity War
4 Samuel L. Jackson 2018 Avengers: Infinity War
5 Scarlett Johansson 2018 Avengers: Infinity War

More Derived Columns

Here, we make a derived column name by concatenating first_name with a space and last_name. The syntax for other concatenation in other SQL databases can differ, e.g. with the function CONCAT.

In [9]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT *,
    Actors.first_name || " " || Actors.last_name AS name FROM MovieActors
    JOIN Movies ON MovieActors.movie_id = Movies.id
    JOIN Actors ON MovieActors.actor_id = Actors.id""", conn)
df
Out[9]:
id movie_id actor_id id title year id first_name last_name character age name
0 1 1 1 1 Iron Man 2008 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
1 2 2 1 2 The Incredible Hulk 2008 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
2 3 3 1 3 Iron Man 2 2010 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
3 4 6 1 6 The Avengers 2012 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
4 5 7 1 7 Iron Man 3 2013 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
5 6 9 1 9 Avengers: Age of Ultron 2015 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
6 7 10 1 10 Captain America: Civil War 2016 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
7 8 13 1 13 Avengers: Infinity War 2018 1 Robert Downey Jr. Iron Man 53.0 Robert Downey Jr.
8 9 5 2 5 Captain America: The First Avenger 2011 2 Chris Evans Captain America 37.0 Chris Evans
9 10 6 2 6 The Avengers 2012 2 Chris Evans Captain America 37.0 Chris Evans
10 11 8 2 8 Captain America: The Winter Soldier 2014 2 Chris Evans Captain America 37.0 Chris Evans
11 12 9 2 9 Avengers: Age of Ultron 2015 2 Chris Evans Captain America 37.0 Chris Evans
12 13 10 2 10 Captain America: Civil War 2016 2 Chris Evans Captain America 37.0 Chris Evans
13 14 13 2 13 Avengers: Infinity War 2018 2 Chris Evans Captain America 37.0 Chris Evans
14 15 3 3 3 Iron Man 2 2010 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
15 16 6 3 6 The Avengers 2012 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
16 17 8 3 8 Captain America: The Winter Soldier 2014 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
17 18 9 3 9 Avengers: Age of Ultron 2015 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
18 19 10 3 10 Captain America: Civil War 2016 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
19 20 13 3 13 Avengers: Infinity War 2018 3 Scarlett Johansson Black Widow 33.0 Scarlett Johansson
20 21 1 4 1 Iron Man 2008 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
21 22 3 4 3 Iron Man 2 2010 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
22 23 4 4 4 Thor 2011 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
23 24 5 4 5 Captain America: The First Avenger 2011 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
24 25 6 4 6 The Avengers 2012 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
25 26 8 4 8 Captain America: The Winter Soldier 2014 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
26 27 9 4 9 Avengers: Age of Ultron 2015 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
27 28 13 4 13 Avengers: Infinity War 2018 4 Samuel L. Jackson Nick Fury 69.0 Samuel L. Jackson
28 29 11 5 11 Doctor Strange 2016 5 Benedict Cumberbatch Dr. Strange 42.0 Benedict Cumberbatch
29 30 13 5 13 Avengers: Infinity War 2018 5 Benedict Cumberbatch Dr. Strange 42.0 Benedict Cumberbatch
30 31 10 7 10 Captain America: Civil War 2016 7 Chadwick Boseman Black Panther 40.0 Chadwick Boseman
31 32 12 7 12 Black Panther 2018 7 Chadwick Boseman Black Panther 40.0 Chadwick Boseman
32 33 13 7 13 Avengers: Infinity War 2018 7 Chadwick Boseman Black Panther 40.0 Chadwick Boseman

Visualizing Data with Pandas

So how long have these actors been doing Marvel movies, anyway?

In [10]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT MAX(Movies.year) - MIN(Movies.year) AS years_of_service,
    Actors.first_name || " " || Actors.last_name AS name FROM MovieActors
        JOIN Movies ON MovieActors.movie_id = Movies.id
        JOIN Actors ON MovieActors.actor_id = Actors.id
        GROUP BY name
        ORDER BY years_of_service DESC""", conn)
    df.plot.bar(x='name', y='years_of_service')
../_images/notebooks_lecture18_databases_18_0.png

And how old were they when they started?

In [11]:
with sqlite3.connect(database) as conn:
    df = pd.read_sql("""SELECT Actors.age - (MAX(Movies.year) - MIN(Movies.year)) as first_movie_age,
    Actors.first_name || " " || Actors.last_name as name FROM MovieActors
        JOIN Movies ON MovieActors.movie_id = Movies.id
        JOIN Actors ON MovieActors.actor_id = Actors.id
        GROUP BY name
        ORDER BY first_movie_age DESC""", conn)
    df.plot.bar(x='name', y='first_movie_age')
../_images/notebooks_lecture18_databases_20_0.png

We can visualize this data in many ways, including this graph. (Note that Brie Larson appears, who was absent from our earlier plots because her movie, Captain Marvel, hasn’t been released yet.)

In [12]:
# Fancy demo, requires networkx
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Create a networkx Graph object
G = nx.Graph()

# Fetch data and fill the graph
with sqlite3.connect(database) as conn:
    movies = conn.execute("""SELECT title FROM Movies""")
    G.add_nodes_from([movie[0] for movie in movies], bipartite=0)
    actors = conn.execute("""SELECT first_name || " " || last_name FROM Actors""")
    G.add_nodes_from([actor[0] for actor in actors], bipartite=1)
    edges = conn.execute("""SELECT Movies.title, Actors.first_name || " " || Actors.last_name FROM MovieActors
        JOIN Movies ON MovieActors.movie_id = Movies.id
        JOIN Actors ON MovieActors.actor_id = Actors.id""")
    G.add_edges_from(edges)

# Plot the graph
plt.figure(figsize=(16, 12))
pos = nx.kamada_kawai_layout(G)
nx.draw_networkx_nodes(G, pos, nodelist=[n for n, d in G.nodes(data=True) if d['bipartite']==0], node_color='gold')
nx.draw_networkx_nodes(G, pos, nodelist=[n for n, d in G.nodes(data=True) if d['bipartite']==1], node_color='blue')
nx.draw_networkx_labels(G, pos, font_size=14)
nx.draw_networkx_edges(G, pos)
plt.axis('off')
plt.show()
../_images/notebooks_lecture18_databases_22_0.png

Observation: Avengers movies are near the center! What else could we learn from this data?

What if we wanted to add ratings? What kind of mapping is needed? What table columns would we use?

Common problems in designing relational databases

Update anomaly

758px-Update_anomaly.svg.png ### Insertion anomaly Insertion_anomaly.png ### Deletion anomaly 776px-Deletion_anomaly.svg.png

Normal Form

There are rules to help us design relational databases! They’re called “normal form.” This is a term worth looking up if you need to design a database.

Chinook Sample Database

This is a sample database named after winds in Canada, which is a joke from another sample database called Northwind. I picked this because it’s rich and demonstrates a professional, production-level database schema. image.png

In [13]:
!echo "Removing previously downloaded database." && rm -f Chinook.sqlite
!echo "Downloading database." && curl https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite > Chinook.sqlite
!echo "Done. Check that the file exists:"
!ls "Chinook.sqlite"
Removing previously downloaded database.
Downloading database.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1042k  100 1042k    0     0   176k      0  0:00:05  0:00:05 --:--:--  313k
Done. Check that the file exists:
Chinook.sqlite
In [14]:
database = 'Chinook.sqlite'

def get_results(sql, *args, **kwargs):
    with sqlite3.connect(database) as conn:
        df = pd.read_sql(sql, conn, *args, **kwargs)
    return df

def show_results(sql, *args, **kwargs):
    display(get_results(sql, *args, **kwargs))

show_results("""SELECT * FROM sqlite_master""")
type name tbl_name rootpage sql
0 table Album Album 2 CREATE TABLE [Album]\n(\n [AlbumId] INTEGER...
1 table Artist Artist 3 CREATE TABLE [Artist]\n(\n [ArtistId] INTEG...
2 table Customer Customer 4 CREATE TABLE [Customer]\n(\n [CustomerId] I...
3 table Employee Employee 7 CREATE TABLE [Employee]\n(\n [EmployeeId] I...
4 table Genre Genre 9 CREATE TABLE [Genre]\n(\n [GenreId] INTEGER...
5 table Invoice Invoice 10 CREATE TABLE [Invoice]\n(\n [InvoiceId] INT...
6 table InvoiceLine InvoiceLine 12 CREATE TABLE [InvoiceLine]\n(\n [InvoiceLin...
7 table MediaType MediaType 14 CREATE TABLE [MediaType]\n(\n [MediaTypeId]...
8 table Playlist Playlist 15 CREATE TABLE [Playlist]\n(\n [PlaylistId] I...
9 table PlaylistTrack PlaylistTrack 16 CREATE TABLE [PlaylistTrack]\n(\n [Playlist...
10 index sqlite_autoindex_PlaylistTrack_1 PlaylistTrack 17 None
11 table Track Track 19 CREATE TABLE [Track]\n(\n [TrackId] INTEGER...
12 index IPK_Album Album 21 CREATE UNIQUE INDEX [IPK_Album] ON [Album]([Al...
13 index IPK_Artist Artist 22 CREATE UNIQUE INDEX [IPK_Artist] ON [Artist]([...
14 index IPK_Customer Customer 23 CREATE UNIQUE INDEX [IPK_Customer] ON [Custome...
15 index IPK_Employee Employee 24 CREATE UNIQUE INDEX [IPK_Employee] ON [Employe...
16 index IPK_Genre Genre 26 CREATE UNIQUE INDEX [IPK_Genre] ON [Genre]([Ge...
17 index IPK_Invoice Invoice 27 CREATE UNIQUE INDEX [IPK_Invoice] ON [Invoice]...
18 index IPK_InvoiceLine InvoiceLine 28 CREATE UNIQUE INDEX [IPK_InvoiceLine] ON [Invo...
19 index IPK_MediaType MediaType 29 CREATE UNIQUE INDEX [IPK_MediaType] ON [MediaT...
20 index IPK_Playlist Playlist 30 CREATE UNIQUE INDEX [IPK_Playlist] ON [Playlis...
21 index IPK_PlaylistTrack PlaylistTrack 31 CREATE UNIQUE INDEX [IPK_PlaylistTrack] ON [Pl...
22 index IPK_Track Track 32 CREATE UNIQUE INDEX [IPK_Track] ON [Track]([Tr...
23 index IFK_AlbumArtistId Album 33 CREATE INDEX [IFK_AlbumArtistId] ON [Album] ([...
24 index IFK_CustomerSupportRepId Customer 34 CREATE INDEX [IFK_CustomerSupportRepId] ON [Cu...
25 index IFK_EmployeeReportsTo Employee 36 CREATE INDEX [IFK_EmployeeReportsTo] ON [Emplo...
26 index IFK_InvoiceCustomerId Invoice 37 CREATE INDEX [IFK_InvoiceCustomerId] ON [Invoi...
27 index IFK_InvoiceLineInvoiceId InvoiceLine 38 CREATE INDEX [IFK_InvoiceLineInvoiceId] ON [In...
28 index IFK_InvoiceLineTrackId InvoiceLine 39 CREATE INDEX [IFK_InvoiceLineTrackId] ON [Invo...
29 index IFK_PlaylistTrackTrackId PlaylistTrack 40 CREATE INDEX [IFK_PlaylistTrackTrackId] ON [Pl...
30 index IFK_TrackAlbumId Track 41 CREATE INDEX [IFK_TrackAlbumId] ON [Track] ([A...
31 index IFK_TrackGenreId Track 42 CREATE INDEX [IFK_TrackGenreId] ON [Track] ([G...
32 index IFK_TrackMediaTypeId Track 43 CREATE INDEX [IFK_TrackMediaTypeId] ON [Track]...

Table structures for music data

In [15]:
show_results("""PRAGMA table_info(Track)""")  # sqlite specific
show_results("""PRAGMA table_info(Genre)""")
show_results("""PRAGMA table_info(Album)""")
show_results("""PRAGMA table_info(Artist)""")
cid name type notnull dflt_value pk
0 0 TrackId INTEGER 1 None 1
1 1 Name NVARCHAR(200) 1 None 0
2 2 AlbumId INTEGER 0 None 0
3 3 MediaTypeId INTEGER 1 None 0
4 4 GenreId INTEGER 0 None 0
5 5 Composer NVARCHAR(220) 0 None 0
6 6 Milliseconds INTEGER 1 None 0
7 7 Bytes INTEGER 0 None 0
8 8 UnitPrice NUMERIC(10,2) 1 None 0
cid name type notnull dflt_value pk
0 0 GenreId INTEGER 1 None 1
1 1 Name NVARCHAR(120) 0 None 0
cid name type notnull dflt_value pk
0 0 AlbumId INTEGER 1 None 1
1 1 Title NVARCHAR(160) 1 None 0
2 2 ArtistId INTEGER 1 None 0
cid name type notnull dflt_value pk
0 0 ArtistId INTEGER 1 None 1
1 1 Name NVARCHAR(120) 0 None 0

Generating a typical music library layout

In [16]:
show_results("""SELECT Track.Name, Album.Title, Artist.Name, Genre.Name FROM Track
JOIN Genre on Track.GenreId = Genre.GenreId
JOIN Album on Track.AlbumId = Album.AlbumId
JOIN Artist on Album.ArtistId = Artist.ArtistId""")
Name Title Name Name
0 For Those About To Rock (We Salute You) For Those About To Rock We Salute You AC/DC Rock
1 Balls to the Wall Balls to the Wall Accept Rock
2 Fast As a Shark Restless and Wild Accept Rock
3 Restless and Wild Restless and Wild Accept Rock
4 Princess of the Dawn Restless and Wild Accept Rock
5 Put The Finger On You For Those About To Rock We Salute You AC/DC Rock
6 Let's Get It Up For Those About To Rock We Salute You AC/DC Rock
7 Inject The Venom For Those About To Rock We Salute You AC/DC Rock
8 Snowballed For Those About To Rock We Salute You AC/DC Rock
9 Evil Walks For Those About To Rock We Salute You AC/DC Rock
10 C.O.D. For Those About To Rock We Salute You AC/DC Rock
11 Breaking The Rules For Those About To Rock We Salute You AC/DC Rock
12 Night Of The Long Knives For Those About To Rock We Salute You AC/DC Rock
13 Spellbound For Those About To Rock We Salute You AC/DC Rock
14 Go Down Let There Be Rock AC/DC Rock
15 Dog Eat Dog Let There Be Rock AC/DC Rock
16 Let There Be Rock Let There Be Rock AC/DC Rock
17 Bad Boy Boogie Let There Be Rock AC/DC Rock
18 Problem Child Let There Be Rock AC/DC Rock
19 Overdose Let There Be Rock AC/DC Rock
20 Hell Ain't A Bad Place To Be Let There Be Rock AC/DC Rock
21 Whole Lotta Rosie Let There Be Rock AC/DC Rock
22 Walk On Water Big Ones Aerosmith Rock
23 Love In An Elevator Big Ones Aerosmith Rock
24 Rag Doll Big Ones Aerosmith Rock
25 What It Takes Big Ones Aerosmith Rock
26 Dude (Looks Like A Lady) Big Ones Aerosmith Rock
27 Janie's Got A Gun Big Ones Aerosmith Rock
28 Cryin' Big Ones Aerosmith Rock
29 Amazing Big Ones Aerosmith Rock
... ... ... ... ...
3473 October Song Frank Amy Winehouse Pop
3474 What Is It About Men Frank Amy Winehouse Pop
3475 Help Yourself Frank Amy Winehouse Pop
3476 Amy Amy Amy (Outro) Frank Amy Winehouse Pop
3477 Slowness Carried to Dust (Bonus Track Version) Calexico Alternative
3478 Prometheus Overture, Op. 43 Beethoven: Symphony No. 6 'Pastoral' Etc. Otto Klemperer & Philharmonia Orchestra Classical
3479 Sonata for Solo Violin: IV: Presto Bartok: Violin & Viola Concertos Yehudi Menuhin Classical
3480 A Midsummer Night's Dream, Op.61 Incidental Mu... Mendelssohn: A Midsummer Night's Dream Philharmonia Orchestra & Sir Neville Marriner Classical
3481 Suite No. 3 in D, BWV 1068: III. Gavotte I & II Bach: Orchestral Suites Nos. 1 - 4 Academy of St. Martin in the Fields, Sir Nevil... Classical
3482 Concert pour 4 Parties de V**les, H. 545: I. P... Charpentier: Divertissements, Airs & Concerts Les Arts Florissants & William Christie Classical
3483 Adios nonino South American Getaway The 12 Cellists of The Berlin Philharmonic Classical
3484 Symphony No. 3 Op. 36 for Orchestra and Sopran... Górecki: Symphony No. 3 Adrian Leaper & Doreen de Feis Classical
3485 Act IV, Symphony Purcell: The Fairy Queen Roger Norrington, London Classical Players Classical
3486 3 Gymnopédies: No.1 - Lent Et Grave, No.3 - Le... The Ultimate Relexation Album Charles Dutoit & L'Orchestre Symphonique de Mo... Classical
3487 Music for the Funeral of Queen Mary: VI. "Thou... Purcell: Music for the Queen Mary Equale Brass Ensemble, John Eliot Gardiner & M... Classical
3488 Symphony No. 2: III. Allegro vivace Weill: The Seven Deadly Sins Kent Nagano and Orchestre de l'Opéra de Lyon Classical
3489 Partita in E Major, BWV 1006A: I. Prelude J.S. Bach: Chaconne, Suite in E Minor, Partita... Julian Bream Classical
3490 Le Sacre Du Printemps: I.iv. Spring Rounds Prokofiev: Symphony No.5 & Stravinksy: Le Sacr... Berliner Philharmoniker & Herbert Von Karajan Classical
3491 Sing Joyfully English Renaissance The King's Singers Classical
3492 Metopes, Op. 29: Calypso Szymanowski: Piano Works, Vol. 1 Martin Roscoe Classical
3493 Symphony No. 2, Op. 16 - "The Four Temperamen... Nielsen: The Six Symphonies Göteborgs Symfoniker & Neeme Järvi Classical
3494 24 Caprices, Op. 1, No. 24, for Solo Violin, i... Great Recordings of the Century: Paganini's 24... Itzhak Perlman Classical
3495 Étude 1, In C Major - Preludio (Presto) - Liszt Liszt - 12 Études D'Execution Transcendante Michele Campanella Classical
3496 Erlkonig, D.328 Great Recordings of the Century - Shubert: Sch... Gerald Moore Classical
3497 Concerto for Violin, Strings and Continuo in G... Locatelli: Concertos for Violin, Strings and C... Mela Tenenbaum, Pro Musica Prague & Richard Kapp Classical
3498 Pini Di Roma (Pinien Von Rom) \ I Pini Della V... Respighi:Pines of Rome Eugene Ormandy Classical
3499 String Quartet No. 12 in C Minor, D. 703 "Quar... Schubert: The Late String Quartets & String Qu... Emerson String Quartet Classical
3500 L'orfeo, Act 3, Sinfonia (Orchestra) Monteverdi: L'Orfeo C. Monteverdi, Nigel Rogers - Chiaroscuro; Lon... Classical
3501 Quintet for Horn, Violin, 2 Violas, and Cello ... Mozart: Chamber Music Nash Ensemble Classical
3502 Koyaanisqatsi Koyaanisqatsi (Soundtrack from the Motion Pict... Philip Glass Ensemble Soundtrack

3503 rows × 4 columns

Artists with the most tracks

In [17]:
get_results("""SELECT COUNT(Track.Name) as NumTracks, Artist.Name FROM Track
JOIN Album on Track.AlbumId = Album.AlbumId
JOIN Artist on Album.ArtistId = Artist.ArtistId
GROUP BY Artist.Name ORDER BY NumTracks DESC""").head(10)
Out[17]:
NumTracks Name
0 213 Iron Maiden
1 135 U2
2 114 Led Zeppelin
3 112 Metallica
4 92 Deep Purple
5 92 Lost
6 67 Pearl Jam
7 57 Lenny Kravitz
8 56 Various Artists
9 53 The Office

More table structures for sales data

In [18]:
show_results("""PRAGMA table_info(Customer)""")
show_results("""PRAGMA table_info(Invoice)""")
show_results("""PRAGMA table_info(InvoiceLine)""")
cid name type notnull dflt_value pk
0 0 CustomerId INTEGER 1 None 1
1 1 FirstName NVARCHAR(40) 1 None 0
2 2 LastName NVARCHAR(20) 1 None 0
3 3 Company NVARCHAR(80) 0 None 0
4 4 Address NVARCHAR(70) 0 None 0
5 5 City NVARCHAR(40) 0 None 0
6 6 State NVARCHAR(40) 0 None 0
7 7 Country NVARCHAR(40) 0 None 0
8 8 PostalCode NVARCHAR(10) 0 None 0
9 9 Phone NVARCHAR(24) 0 None 0
10 10 Fax NVARCHAR(24) 0 None 0
11 11 Email NVARCHAR(60) 1 None 0
12 12 SupportRepId INTEGER 0 None 0
cid name type notnull dflt_value pk
0 0 InvoiceId INTEGER 1 None 1
1 1 CustomerId INTEGER 1 None 0
2 2 InvoiceDate DATETIME 1 None 0
3 3 BillingAddress NVARCHAR(70) 0 None 0
4 4 BillingCity NVARCHAR(40) 0 None 0
5 5 BillingState NVARCHAR(40) 0 None 0
6 6 BillingCountry NVARCHAR(40) 0 None 0
7 7 BillingPostalCode NVARCHAR(10) 0 None 0
8 8 Total NUMERIC(10,2) 1 None 0
cid name type notnull dflt_value pk
0 0 InvoiceLineId INTEGER 1 None 1
1 1 InvoiceId INTEGER 1 None 0
2 2 TrackId INTEGER 1 None 0
3 3 UnitPrice NUMERIC(10,2) 1 None 0
4 4 Quantity INTEGER 1 None 0

Total Spending by State/Country

In [19]:
get_results("""SELECT SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Spending, Customer.State, Customer.Country
FROM InvoiceLine
JOIN Invoice ON InvoiceLine.InvoiceId = Invoice.InvoiceId
JOIN Customer ON Invoice.CustomerId = Customer.CustomerId
GROUP BY Customer.State, Customer.Country ORDER BY Spending DESC""")
Out[19]:
Spending State Country
0 195.10 None France
1 156.48 None Germany
2 115.86 CA USA
3 114.86 SP Brazil
4 112.86 None United Kingdom
5 90.24 None Czech Republic
6 77.24 None Portugal
7 75.26 None India
8 75.24 ON Canada
9 47.62 TX USA
10 46.62 None Chile
11 45.62 None Hungary
12 45.62 Dublin Ireland
13 43.62 IL USA
14 43.62 UT USA
15 42.62 WI USA
16 42.62 None Austria
17 41.62 None Finland
18 40.62 VV Netherlands
19 39.62 None Norway
20 39.62 FL USA
21 39.62 QC Canada
22 39.62 WA USA
23 38.62 None Sweden
24 38.62 BC Canada
25 37.62 None Argentina
26 37.62 None Belgium
27 37.62 None Denmark
28 37.62 None Poland
29 37.62 None Spain
30 37.62 AB Canada
31 37.62 AZ USA
32 37.62 DF Brazil
33 37.62 MA USA
34 37.62 MB Canada
35 37.62 NS Canada
36 37.62 NSW Australia
37 37.62 NT Canada
38 37.62 NV USA
39 37.62 NY USA
40 37.62 RJ Brazil
41 37.62 RM Italy

Subqueries

This is a subquery, where we nest multiple SELECT statements. We must name the parenthesized subquery, and here we call it Subquery. Then in the parent SELECT statement, we can access all of the subquery columns via Subquery.*.

In [20]:
show_results("""SELECT Subquery.* FROM
(SELECT Artist.Name, COUNT(Artist.Name) as ArtistCount, Customer.State, Customer.Country
FROM InvoiceLine
JOIN Invoice ON InvoiceLine.InvoiceId = Invoice.InvoiceId
JOIN Customer ON Invoice.CustomerId = Customer.CustomerId
JOIN Track ON InvoiceLine.TrackId = Track.TrackId
JOIN Album ON Track.TrackId = Album.AlbumId
JOIN Artist ON Artist.ArtistId = Album.ArtistId
GROUP BY Artist.Name, Customer.State, Customer.Country) AS Subquery
""")
Name ArtistCount State Country
0 AC/DC 1 None Germany
1 AC/DC 1 RM Italy
2 Academy of St. Martin in the Fields & Sir Nevi... 1 None United Kingdom
3 Academy of St. Martin in the Fields & Sir Nevi... 1 SP Brazil
4 Academy of St. Martin in the Fields Chamber En... 1 MB Canada
5 Accept 1 None Germany
6 Accept 1 DF Brazil
7 Accept 1 NT Canada
8 Adrian Leaper & Doreen de Feis 1 NV USA
9 Aerosmith 1 RM Italy
10 Aisha Duo 1 SP Brazil
11 Alanis Morissette 1 None Norway
12 Amy Winehouse 1 None United Kingdom
13 Amy Winehouse 1 NV USA
14 Anne-Sophie Mutter, Herbert Von Karajan & Wien... 1 MB Canada
15 Antal Doráti & London Symphony Orchestra 1 None France
16 Antônio Carlos Jobim 1 None Norway
17 Antônio Carlos Jobim 1 NT Canada
18 Apocalyptica 1 DF Brazil
19 Apocalyptica 1 RM Italy
20 Audioslave 1 None Norway
21 Audioslave 1 SP Brazil
22 BackBeat 1 None Norway
23 Battlestar Galactica (Classic) 1 None Sweden
24 Berliner Philharmoniker & Hans Rosbaud 1 NV USA
25 Berliner Philharmoniker & Herbert Von Karajan 1 None United Kingdom
26 Berliner Philharmoniker & Herbert Von Karajan 1 SP Brazil
27 Billy Cobham 1 RM Italy
28 Black Label Society 1 DF Brazil
29 Black Label Society 1 NT Canada
... ... ... ... ...
171 The Clash 1 CA USA
172 The Cult 1 CA USA
173 The Doors 1 None Czech Republic
174 The Office 1 MB Canada
175 The Police 1 FL USA
176 The Rolling Stones 1 None Czech Republic
177 The Rolling Stones 1 None Sweden
178 The Rolling Stones 1 FL USA
179 The Rolling Stones 1 MA USA
180 The Tea Party 1 FL USA
181 The Who 1 FL USA
182 Tim Maia 1 None Czech Republic
183 Tim Maia 1 None Sweden
184 Titãs 1 TX USA
185 U2 4 None France
186 U2 1 None Germany
187 U2 1 None Sweden
188 U2 2 RJ Brazil
189 U2 2 TX USA
190 UB40 1 None Sweden
191 UB40 1 TX USA
192 Van Halen 2 None France
193 Van Halen 1 TX USA
194 Various Artists 1 None Belgium
195 Various Artists 1 None France
196 Various Artists 1 NT Canada
197 Velvet Revolver 1 RJ Brazil
198 Vinícius De Moraes 1 None Sweden
199 Yehudi Menuhin 1 SP Brazil
200 Zeca Pagodinho 1 Dublin Ireland

201 rows × 4 columns

Total Spending by Artist

This uses a subquery to get the sales by track, and then joins that to the Artists table so we can see the total sales by artist.

In [21]:
show_results("""SELECT Artist.Name as ArtistName, SUM(TrackSales.Sales) AS ArtistSales FROM (
SELECT Track.TrackId, Track.AlbumId, Track.Name, SUM(InvoiceLine.UnitPrice * InvoiceLine.Quantity) AS Sales
FROM Track JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
GROUP BY Track.TrackId
) AS TrackSales
JOIN Album on TrackSales.AlbumId = Album.AlbumId
JOIN Artist on Album.ArtistId = Artist.ArtistId
GROUP BY Artist.Name ORDER BY ArtistSales DESC
""")
ArtistName ArtistSales
0 Iron Maiden 138.60
1 U2 105.93
2 Metallica 90.09
3 Led Zeppelin 86.13
4 Lost 81.59
5 The Office 49.75
6 Os Paralamas Do Sucesso 44.55
7 Deep Purple 43.56
8 Faith No More 41.58
9 Eric Clapton 39.60
10 R.E.M. 38.61
11 Creedence Clearwater Revival 36.63
12 Queen 36.63
13 Battlestar Galactica (Classic) 35.82
14 Guns N' Roses 35.64
15 Titãs 33.66
16 Green Day 32.67
17 Pearl Jam 31.68
18 Kiss 30.69
19 Van Halen 28.71
20 Various Artists 28.71
21 Red Hot Chili Peppers 26.73
22 Chico Buarque 26.73
23 Heroes 25.87
24 Lenny Kravitz 25.74
25 Chico Science & Nação Zumbi 24.75
26 Battlestar Galactica 23.88
27 Ozzy Osbourne 23.76
28 Smashing Pumpkins 23.76
29 Tim Maia 23.76
... ... ...
135 Academy of St. Martin in the Fields, John Birc... 0.99
136 Adrian Leaper & Doreen de Feis 0.99
137 Antal Doráti & London Symphony Orchestra 0.99
138 Berliner Philharmoniker & Hans Rosbaud 0.99
139 Calexico 0.99
140 Dread Zeppelin 0.99
141 Edo de Waart & San Francisco Symphony 0.99
142 Emanuel Ax, Eugene Ormandy & Philadelphia Orch... 0.99
143 English Concert & Trevor Pinnock 0.99
144 Felix Schmidt, London Symphony Orchestra & Raf... 0.99
145 Göteborgs Symfoniker & Neeme Järvi 0.99
146 Habib Koité and Bamada 0.99
147 Julian Bream 0.99
148 Kent Nagano and Orchestre de l'Opéra de Lyon 0.99
149 Luciana Souza/Romero Lubambo 0.99
150 Martin Roscoe 0.99
151 Maurizio Pollini 0.99
152 Michele Campanella 0.99
153 Orchestra of The Age of Enlightenment 0.99
154 Otto Klemperer & Philharmonia Orchestra 0.99
155 Philharmonia Orchestra & Sir Neville Marriner 0.99
156 Roger Norrington, London Classical Players 0.99
157 Royal Philharmonic Orchestra & Sir Thomas Beecham 0.99
158 Scholars Baroque Ensemble 0.99
159 Sir Georg Solti & Wiener Philharmoniker 0.99
160 The 12 Cellists of The Berlin Philharmonic 0.99
161 The King's Singers 0.99
162 The Posies 0.99
163 Yehudi Menuhin 0.99
164 Yo-Yo Ma 0.99

165 rows × 2 columns

Searching Databases

Using the keyword LIKE, we can search for strings that are partial matches. Using the percent sign % is kind of like using an asterisk * in the terminal, it just fills in one or more other characters.

In [22]:
show_results("""SELECT * FROM Artist WHERE Name LIKE (?)""", params=('%foo%',))
show_results("""SELECT * FROM Artist WHERE Name LIKE (?)""", params=('bar%',))
ArtistId Name
0 84 Foo Fighters
ArtistId Name
0 48 Barão Vermelho
1 224 Barry Wordsworth & BBC Concert Orchestra

Scientific databases

Acquiring quality data with the information you might need for particular research questions is still very hard. Many databases found online are of low quality, or small breadth / depth. Here are some databases of various kinds of materials and bioactive molecules. In many cases, domain scientists/engineers and data scientists must collaborate to make the most out of limited information.

Hosting a production database

If reliability is important and/or you have some money and want to get going quickly, use a cloud provider like Amazon Web Services or Google Cloud. - https://aws.amazon.com/products/databases/ - https://cloud.google.com/sql/

Exploring more on your own

To learn how databases work on your own, install the mysql-server package on a server you have access to, or follow instructions for a LAMP (Linux, Apache, MySQL, PHP) server. The second link below has demonstrations for all kinds of other databases I mentioned previously, including several NoSQL databases. - https://www.linode.com/docs/web-servers/lamp/install-lamp-stack-on-ubuntu-18-04/ - https://www.linode.com/docs/databases/

Administrative interfaces

To administer MySQL databases, I really like PHPMyAdmin. It’s how I learned to use databases, because its web interface shows the SQL commands when you click on the buttons. - https://www.phpmyadmin.net/