Databases

Guest lecture by Bradley Dice, [@bdice](https://github.com/bdice/).

Why databases?

Tons of applications: managing digital content (websites), product inventories, transactions, health records… structure, security, and speed are all important.

  • What if the size of your data exceeds the memory of the system?
    • Out-of-core algorithms.
  • What if your data needs to be accessed by servers around the world?
    • Split it up by geography.
  • Text files and spreadsheets can only do so much.

Databases administrators have to think about software, hardware, data structures, permissions and roles, backups, reports, optimization, and migration. This lecture will focus on the “user” side, especially how to get information out of a SQL database for analysis.

What can we expect from most databases? CRUD + ACID.

Operations you can perform: CRUD. When interfacing with a database, you can expect to have these functions available. - Create: Add new data. - Read: Fetch existing data. - Update: Edit existing data. - Delete: Remove data.

Guarantees for how the database will act: ACID. These properties ensure databases are protected against hardware failures and software errors. - Atomicity: Transactions will be completed fully, or not at all. - Consistency: Transactions must leave the database in a valid state. - Isolation: Transactions performed concurrently must act the same as transactions performed sequentially. - Durability: Transactions must be permanent (i.e. stored on a disk, not in memory).

Types of databases

  • Relational and non-relational.
  • These generally correspond to “SQL-like” and “NoSQL-like.”

Relational Databases

Programmers interface with relational databases through Structured Query Language (SQL). ### Examples: Open Source: - MySQL - PostgreSQL (aka Postgres) - SQLite (for local, application-level storage)

Commercial: - Oracle Database (aka Oracle) - Microsoft SQL Server - DB2 (IBM) - Microsoft Access, though casual users might not even realize it

Non-relational Databases

Sometimes the structure of relational databases can be limiting. What if some properties are defined for one object but not another? There are a few types of non-relational databases including column stores (e.g. Cassandra), document stores (e.g. MongoDB), key-value stores (e.g. Redis), and graph databases (e.g. Neo4j).

Examples:

Open Source: - MongoDB - Apache Cassandra - Redis - Apache CouchDB - Neo4j

Python Example with sqlite3

Reminder: triple-double-quotes """like this""" are just another form of Python string, and you can include single or double quotes inside. This is useful for complicated SQL statements.

In [1]:
import sqlite3
import os

database = 'test.sqlite'

# Removes the database if it already exists
if os.path.exists(database):
    os.remove(database)

Create data with the CREATE TABLE and INSERT statements. There are many data types, including TEXT, INTEGER, REAL, BLOB, and more. See this reference list of common SQL data types.

In [2]:
with sqlite3.connect(database) as conn:
    cursor = conn.cursor()
    cursor.execute(
        """CREATE TABLE People
        (first_name TEXT, last_name TEXT, job TEXT, age NUMERIC)""")
    cursor.execute(
        """INSERT INTO People VALUES (?,?,?,?)""",
        ('Bradley', 'Dice', 'Guest Lecturer', 25))
In [3]:
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT * FROM People""")
    print(list(results))
[('Bradley', 'Dice', 'Guest Lecturer', 25)]

Note: We always use the “prepared statement” syntax (?,?,?,?) instead of formatting the string directly to avoid SQL Injection!

We can also just use conn directly, with a shorter syntax that creates cursors automatically. Also, we use executemany with a list of tuples, to perform many actions at once.

In [4]:
# Let's add lots of people
famous_actors = [
    ('Robert', 'Downey Jr.', 'Iron Man', 53),
    ('Chris', 'Evans', 'Captain America', 37),
    ('Scarlett', 'Johansson', 'Black Widow', 33),
    ('Samuel', 'Jackson', 'Nick Fury', 69),
    ('Benedict', 'Cumberbatch', 'Dr. Strange', 42),
    ('Brie', 'Larson', 'Captain Marvel', 29),
    ('Chadwick', 'Boseman', 'Black Panther', 40),
]
# Yes, I'm a Marvel fan
with sqlite3.connect(database) as conn:
    conn.executemany("""INSERT INTO People VALUES (?,?,?,?)""", famous_actors)

Read data with the SELECT statement.

In [5]:
with sqlite3.connect(database) as conn:
    cursor = conn.cursor()
    cursor.execute(
        """SELECT * FROM People""")
    print(cursor.fetchall())
[('Bradley', 'Dice', 'Guest Lecturer', 25), ('Robert', 'Downey Jr.', 'Iron Man', 53), ('Chris', 'Evans', 'Captain America', 37), ('Scarlett', 'Johansson', 'Black Widow', 33), ('Samuel', 'Jackson', 'Nick Fury', 69), ('Benedict', 'Cumberbatch', 'Dr. Strange', 42), ('Brie', 'Larson', 'Captain Marvel', 29), ('Chadwick', 'Boseman', 'Black Panther', 40)]

Update data with the UPDATE statement. We forgot a middle initial! Here we use WHERE to identify rows that match a set of criteria.

In [6]:
with sqlite3.connect(database) as conn:
    conn.execute(
        """UPDATE People SET first_name = ? WHERE first_name = ? AND last_name = ?""",
        ("Samuel L.", "Samuel", "Jackson"))

We can also take user input and return dynamic data:

In [7]:
age = 45 # input('People older than:')
with sqlite3.connect(database) as conn:
    results = conn.execute(
        """SELECT first_name, age FROM People WHERE age >= ?""",
        (age,))
    for r in results:
        print(r)
('Robert', 53)
('Samuel L.', 69)

We can return specific columns by listing them after SELECT, and filter the rows with WHERE.

In [8]:
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT first_name, last_name FROM People WHERE last_name = ?""", ("Jackson",))
    print(list(results))
[('Samuel L.', 'Jackson')]

SQL also lets you make complex selections, groupings, and filterings. Here is an example.

In [9]:
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT COUNT(*) FROM People""")
    print('Total count of people:', list(results))
    results = conn.execute(
        """SELECT SUBSTR(first_name, 1, 1) AS first_letter, COUNT(*) FROM People GROUP BY first_letter""")
    print('Count by first letters of first names:', list(results))
    results = conn.execute("""SELECT AVG(age) FROM People""")
    print('Average age of people:', list(results))
    results = conn.execute("""SELECT SUM(age) FROM People""")
    print('Summed ages of people:', list(results))
Total count of people: [(8,)]
Count by first letters of first names: [('B', 3), ('C', 2), ('R', 1), ('S', 2)]
Average age of people: [(41.0,)]
Summed ages of people: [(328,)]

There are tons of functions for math, string manipulations, date/time manipulations, and more. If you want to learn to do something in particular, just Google it. StackExchange is a great place to find SQL examples. The full reference manual for MySQL is here: https://dev.mysql.com/doc/refman/8.0/en/

In [10]:
# Show us the guts of the database! This command is SQLite-specific.
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT * FROM sqlite_master WHERE type = 'table'""")
    print(list(results))
[('table', 'People', 'People', 2, 'CREATE TABLE People\n        (first_name TEXT, last_name TEXT, job TEXT, age NUMERIC)')]

The last operation is delete. Here’s how we delete data, also called “dropping” rows/columns/tables.

In [11]:
with sqlite3.connect(database) as conn:
    conn.execute("""DELETE FROM People WHERE first_name = ?""", ("Bradley",))
    results = conn.execute("""SELECT COUNT(*) FROM People""")
    print('Total count of people after removing Bradley:', list(results))
    # You can't rename or remove columns in sqlite, but this is how you would do it in most SQL databases:
    #conn.execute("""ALTER TABLE people DROP COLUMN age""")
    conn.execute("""DROP TABLE People""")
    print('The table "people" has been dropped.')
Total count of people after removing Bradley: [(7,)]
The table "people" has been dropped.

The story of Bobby Tables. image.png

…so what does it mean to be “relational”?

Now we’re going to dive into some of the most important parts of SQL: keys and relationships. Think about your UMich ID card. It gives you a unique identifying number that isn’t your name (there could be many Jane Smiths). It is a permanent number and won’t ever change, and nobody else will ever have the same ID number. Similarly, most databases need a primary key.

If we specified a name every time, it would be hard to find-and-replace. The database can guarantee that every row has an identifier for us, which improves searchability and ensures good performance for comparisons.

We will create multiple tables and JOIN them to understand how one type of data (e.g. actors) is connected to another (e.g. movies). This is how huge social networks associate photos to your account, associate tagged friends to photos, track friend/follow relationships, and more.

Goal: We would like to know which actors have been in the most Marvel movies, and how long they’ve been acting in Marvel films. To do this, we need a way to connect actors with movies.

Example: IMBD, Index of Marvel data with Bradley Dice

First, we create a table Movies to store the movie titles and years. We will use a primary key id.

In [12]:
database = 'movie_ratings.sqlite'

# Removes the database if it already exists
if os.path.exists(database):
    os.remove(database)

with sqlite3.connect(database) as conn:
    # Here we create a primary key, and use "NOT NULL" to prevent inserting invalid data
    conn.execute(
        """CREATE TABLE Movies
        (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL, year INTEGER NOT NULL)""")
    conn.executemany(
        """INSERT INTO Movies (title, year) VALUES (?,?)""",
        [('Iron Man', 2008),
         ('The Incredible Hulk', 2008),
         ('Iron Man 2', 2010),
         ('Thor', 2011),
         ('Captain America: The First Avenger', 2011),
         ('The Avengers', 2012),
         ('Iron Man 3', 2013),
         ('Captain America: The Winter Soldier', 2014),
         ('Avengers: Age of Ultron', 2015),
         ('Captain America: Civil War', 2016),
         ('Doctor Strange', 2016),
         ('Black Panther', 2018),
         ('Avengers: Infinity War', 2018),
        ])

Let’s read back that data and make sure it worked as expected.

In [13]:
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT * FROM Movies""")
    print(list(results))
[(1, 'Iron Man', 2008), (2, 'The Incredible Hulk', 2008), (3, 'Iron Man 2', 2010), (4, 'Thor', 2011), (5, 'Captain America: The First Avenger', 2011), (6, 'The Avengers', 2012), (7, 'Iron Man 3', 2013), (8, 'Captain America: The Winter Soldier', 2014), (9, 'Avengers: Age of Ultron', 2015), (10, 'Captain America: Civil War', 2016), (11, 'Doctor Strange', 2016), (12, 'Black Panther', 2018), (13, 'Avengers: Infinity War', 2018)]
In [14]:
with sqlite3.connect(database) as conn:
    results = conn.execute("""SELECT * FROM Movies ORDER BY year DESC""")
    print(list(results))
[(12, 'Black Panther', 2018), (13, 'Avengers: Infinity War', 2018), (10, 'Captain America: Civil War', 2016), (11, 'Doctor Strange', 2016), (9, 'Avengers: Age of Ultron', 2015), (8, 'Captain America: The Winter Soldier', 2014), (7, 'Iron Man 3', 2013), (6, 'The Avengers', 2012), (4, 'Thor', 2011), (5, 'Captain America: The First Avenger', 2011), (3, 'Iron Man 2', 2010), (1, 'Iron Man', 2008), (2, 'The Incredible Hulk', 2008)]

Now we’ll make a table for Actors.

In [15]:
with sqlite3.connect(database) as conn:
    conn.execute(
        """CREATE TABLE Actors
        (id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        character TEXT NOT NULL,
        age REAL NOT NULL)""")
    conn.executemany(
        """INSERT INTO Actors (first_name, last_name, character, age) VALUES (?,?,?,?)""",
        [('Robert', 'Downey Jr.', 'Iron Man', 53),
         ('Chris', 'Evans', 'Captain America', 37),
         ('Scarlett', 'Johansson', 'Black Widow', 33),
         ('Samuel L.', 'Jackson', 'Nick Fury', 69),
         ('Benedict', 'Cumberbatch', 'Dr. Strange', 42),
         ('Brie', 'Larson', 'Captain Marvel', 29),
         ('Chadwick', 'Boseman', 'Black Panther', 40)
        ])
    # ...and print the results
    results = conn.execute("""SELECT * FROM Actors""")
    print(list(results))
[(1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (2, 'Chris', 'Evans', 'Captain America', 37.0), (3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (5, 'Benedict', 'Cumberbatch', 'Dr. Strange', 42.0), (6, 'Brie', 'Larson', 'Captain Marvel', 29.0), (7, 'Chadwick', 'Boseman', 'Black Panther', 40.0)]

We need a name for the table mapping between these. This is a many-to-many mapping. Actors can star in multiple movies, and movies can have multiple actors. Other mappings could be one-to-one, one-to-many, etc.

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

We shouldn’t name this table with the reserved word Cast (see CAST). Another common naming scheme for these just concatenates the names: MovieActors, so we’ll use that instead.

In [16]:
with sqlite3.connect(database) as conn:
    conn.execute(
        """CREATE TABLE MovieActors
        (id INTEGER PRIMARY KEY AUTOINCREMENT, movie_id INTEGER NOT NULL, actor_id INTEGER NOT NULL)""")
    conn.executemany(
        """INSERT INTO MovieActors (movie_id, actor_id) VALUES (?,?)""",
        [(1, 1), (2, 1), (3, 1), (6, 1), (7, 1), (9, 1), (10, 1), (13, 1), (5, 2), (6, 2), (8, 2), (9, 2), (10, 2), (13, 2),
        (3, 3), (6, 3), (8, 3), (9, 3), (10, 3), (13, 3), (1, 4), (3, 4), (4, 4), (5, 4), (6, 4), (8, 4), (9, 4), (13, 4),
        (11, 5), (13, 5), (10, 7), (12, 7), (13, 7)])
    # ...and print the results
    results = conn.execute("""SELECT * FROM MovieActors""")
    print(list(results))
[(1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 6, 1), (5, 7, 1), (6, 9, 1), (7, 10, 1), (8, 13, 1), (9, 5, 2), (10, 6, 2), (11, 8, 2), (12, 9, 2), (13, 10, 2), (14, 13, 2), (15, 3, 3), (16, 6, 3), (17, 8, 3), (18, 9, 3), (19, 10, 3), (20, 13, 3), (21, 1, 4), (22, 3, 4), (23, 4, 4), (24, 5, 4), (25, 6, 4), (26, 8, 4), (27, 9, 4), (28, 13, 4), (29, 11, 5), (30, 13, 5), (31, 10, 7), (32, 12, 7), (33, 13, 7)]

JOINing [STRIKEOUT:forces] tables

This demonstrates a JOIN operation. There are different kinds of JOINs, explained nicely in this resource.

In [17]:
with sqlite3.connect(database) as conn:
    results = conn.execute(
        """SELECT * FROM MovieActors
        JOIN Movies ON MovieActors.movie_id = Movies.id
        JOIN Actors ON MovieActors.actor_id = Actors.id""")
    print(list(results))
[(1, 1, 1, 1, 'Iron Man', 2008, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (2, 2, 1, 2, 'The Incredible Hulk', 2008, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (3, 3, 1, 3, 'Iron Man 2', 2010, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (4, 6, 1, 6, 'The Avengers', 2012, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (5, 7, 1, 7, 'Iron Man 3', 2013, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (6, 9, 1, 9, 'Avengers: Age of Ultron', 2015, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (7, 10, 1, 10, 'Captain America: Civil War', 2016, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (8, 13, 1, 13, 'Avengers: Infinity War', 2018, 1, 'Robert', 'Downey Jr.', 'Iron Man', 53.0), (9, 5, 2, 5, 'Captain America: The First Avenger', 2011, 2, 'Chris', 'Evans', 'Captain America', 37.0), (10, 6, 2, 6, 'The Avengers', 2012, 2, 'Chris', 'Evans', 'Captain America', 37.0), (11, 8, 2, 8, 'Captain America: The Winter Soldier', 2014, 2, 'Chris', 'Evans', 'Captain America', 37.0), (12, 9, 2, 9, 'Avengers: Age of Ultron', 2015, 2, 'Chris', 'Evans', 'Captain America', 37.0), (13, 10, 2, 10, 'Captain America: Civil War', 2016, 2, 'Chris', 'Evans', 'Captain America', 37.0), (14, 13, 2, 13, 'Avengers: Infinity War', 2018, 2, 'Chris', 'Evans', 'Captain America', 37.0), (15, 3, 3, 3, 'Iron Man 2', 2010, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (16, 6, 3, 6, 'The Avengers', 2012, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (17, 8, 3, 8, 'Captain America: The Winter Soldier', 2014, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (18, 9, 3, 9, 'Avengers: Age of Ultron', 2015, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (19, 10, 3, 10, 'Captain America: Civil War', 2016, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (20, 13, 3, 13, 'Avengers: Infinity War', 2018, 3, 'Scarlett', 'Johansson', 'Black Widow', 33.0), (21, 1, 4, 1, 'Iron Man', 2008, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (22, 3, 4, 3, 'Iron Man 2', 2010, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (23, 4, 4, 4, 'Thor', 2011, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (24, 5, 4, 5, 'Captain America: The First Avenger', 2011, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (25, 6, 4, 6, 'The Avengers', 2012, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (26, 8, 4, 8, 'Captain America: The Winter Soldier', 2014, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (27, 9, 4, 9, 'Avengers: Age of Ultron', 2015, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (28, 13, 4, 13, 'Avengers: Infinity War', 2018, 4, 'Samuel L.', 'Jackson', 'Nick Fury', 69.0), (29, 11, 5, 11, 'Doctor Strange', 2016, 5, 'Benedict', 'Cumberbatch', 'Dr. Strange', 42.0), (30, 13, 5, 13, 'Avengers: Infinity War', 2018, 5, 'Benedict', 'Cumberbatch', 'Dr. Strange', 42.0), (31, 10, 7, 10, 'Captain America: Civil War', 2016, 7, 'Chadwick', 'Boseman', 'Black Panther', 40.0), (32, 12, 7, 12, 'Black Panther', 2018, 7, 'Chadwick', 'Boseman', 'Black Panther', 40.0), (33, 13, 7, 13, 'Avengers: Infinity War', 2018, 7, 'Chadwick', 'Boseman', 'Black Panther', 40.0)]