Questions about anything?
Today we're going to continue our discussion of relational databases, and practice with SQL.
I've posted the assignment for Monday, including asking you to propose and start work on your final project.
Over the next few weeks we'll be browsing through a variety of topics, with less depth than some of what we've done earlier. Please do spend time working on your projects each week. Remember that we have tutors who can help, or you can make appointments to chat with me as needed.
Since this session was remote and I know that some you are out sick, I've recorded the class - you can find a link to the youtube video over on the video page (if you're logged in).
Consider the following python code:
students = ["Mary", "Sally", "Bob"]
staff = ["Harold", "Stephanie"]
Here the information is organized into structures where the collection (i.e. students) has "within" it data whose length can vary.
This is what we're used to use in a typical programming language, and it works fine ... up to a point. But if the length of the list gets big (millions or more), changing it or searching it gets slow and/or memory expensive.
It turns out that we can instead organize our information into "tables", each table looking like an excel spreadsheet. Instead of putting the data into the collection, we instead put a collection id into each data row. So that data would instead take something like this form:
Role
====
id name
-- ----
1 student
2 staff
Person (name of table)
======
id name role_id (columns i.e. properties)
-- ---- ----
1 Mary 1
2 Sally 1
3 Bob 1
4 Harold 2
5 Stephanie 2
Then to find for example all the students, we search for all the rows in the Person table that have role_id=1.
This approach generalizes to data sets with many different connections between the tables - not just "one to many" relationships (like that one) but also "many to many" relationships (like students and courses in a registrar's database.)
This is how data sets are commonly organized for data science, web development, geoinformatics, and many other use cases.
The relations can get quite elaborate.
Note that individual tables may be stored as pure text files in CSV format, particularly for transfer between different types of software.
To make this more concrete, I'll work through a specific example.
The SQL dialect I'll use is sqlite3; your can read its syntax documentation. For our purposes there are only a few types of statements that we need: TABLE, INSERT, UPDATE, and SELECT.
Let's create a registrar database with
students : John Smith, Jane Doe, Wendy Rogers, Bob Bradly, River Song.
courses :
Intro CS (Learn to program.)
Algorithms (Solve Stuff.)
Graphics (Shaders!)
Data Science (Better living with numbers.)
and put John in Intro CS & Algorithms, Jane in to Graphics, River into Compilers & Data Science, and everyone into Graphics.
Our first task is to actually do this with sqlite. I'll create the database on jupyter.bennington, using this registrar.sql file. (The steps are also described in that file.)
Once the database is set up, we can query it, either with more SQL commands, or with SQL commands issued from within a program.
Here's an example of accessing registrar.db from within python : registrar.py. Notice that you should always insert variables into the SQL using "parameter substion" syntax (i.e. a ? symbol); see the docs.
Now let's work through the complications that arise when you want to query the database:
What courses is River taking?
We'll work through this together, using the SELECT statement ... typically the trickiest of the SQL syntax.
Combining information from several tables is where the syntax can get tricky. The options are :
There are a number of types of joins (left, right, inner, outer), differing mostly in what to do for data which is in one of the two tables but not the other.
A few language quirks :
.tables
. Most of these are for information or settings.SELECT name FROM Person where name="Sally"
. SELECT Person.name from Person JOIN Company where Company.name="IBM" and Person.name LIKE "John %";
After trying to work this out in class, we can look at my answers to the "What courses is River taking?" question in registrar_answer.sql.
For Monday, I'd like you to try some exercises like this yourself, working through some at either/or
If there's time, we can try some in class from the wikibooks exercises.