Extract The Data¶
We've learned how to create a database, insert data and then update or delete it.
We've also seen how to view the content of a table using pandas.
But we've only seen how to display the entire contents of a single table. What if we only want a subset of the records? What if we want to extract a set based on records in more than one table?
It's time to learn about the SELECT statement.
First, let's connect to our database, empty both tables and create records for three flights.
We'll need to enable foreign keys again as this setting applies to the connection:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('sqlite:///flight.db')
connection = engine.connect()
connection.execute("PRAGMA foreign_keys=on")
connection.execute("DELETE FROM flights")
sql = """
INSERT INTO flights(name, country_code, latitude, longitude)
VALUES
('hab1', 'GB', 51.50722, 0),
('hab2', 'GB', 53.46667, -2.23333),
('hab3', 'US', 40.7648, -73.9808)
"""
connection.execute(sql)
pd.read_sql('flights', connection)
name | country_code | latitude | longitude | |
---|---|---|---|---|
0 | hab1 | GB | 51.50722 | 0.00000 |
1 | hab2 | GB | 53.46667 | -2.23333 |
2 | hab3 | US | 40.76480 | -73.98080 |
And then generate 20 randomised readings for each flight:
from collections import namedtuple
from random import uniform, randint
Reading = namedtuple('Reading', 'flight, ts, temp, pressure, humidity')
readings = [
Reading(
flight=flight,
ts=f'2015-01-01 09:{str(i+1).zfill(2)}:00',
temp=round(uniform(23, 27), 1),
pressure=randint(1020, 1025),
humidity=randint(30, 50))
for flight in ['hab1', 'hab2', 'hab3']
for i in range(20)
]
sql = """
INSERT INTO readings
(flight, ts, temp, pressure, humidity)
VALUES
(?, ?, ?, ?, ?)
"""
for reading in readings:
values = (reading.flight, reading.ts, reading.temp, reading.pressure, reading.humidity)
connection.execute(sql, values)
Let's extract and display the readings for flight hab2.
We've seen previously that we can pass a table name to read_sql
. Now, we'll see that we can also pass a SELECT statement:
SELECT flight, ts, temp, pressure, humidity
FROM readings
WHERE flight = 'hab2'
The SELECT clause defines the columns we want to see.
The FROM clause defines the table(s) we want to query.
We've seen the WHERE clause used previously in INSERT, UPDATE and DELETE statements. It applies a filter for us to define the records we want returned.
Let's run that query:
sql = """
SELECT flight, ts, temp, pressure, humidity
FROM readings
WHERE flight = 'hab2'
"""
pd.read_sql(sql, connection)
flight | ts | temp | pressure | humidity | |
---|---|---|---|---|---|
0 | hab2 | 2015-01-01 09:01:00 | 25.9 | 1023 | 40 |
1 | hab2 | 2015-01-01 09:02:00 | 25.4 | 1025 | 40 |
2 | hab2 | 2015-01-01 09:03:00 | 24.2 | 1025 | 49 |
3 | hab2 | 2015-01-01 09:04:00 | 24.8 | 1024 | 48 |
4 | hab2 | 2015-01-01 09:05:00 | 25.4 | 1025 | 31 |
5 | hab2 | 2015-01-01 09:06:00 | 24.8 | 1023 | 42 |
6 | hab2 | 2015-01-01 09:07:00 | 24.1 | 1022 | 48 |
7 | hab2 | 2015-01-01 09:08:00 | 25.3 | 1024 | 42 |
8 | hab2 | 2015-01-01 09:09:00 | 25.3 | 1025 | 34 |
9 | hab2 | 2015-01-01 09:10:00 | 24.2 | 1021 | 33 |
10 | hab2 | 2015-01-01 09:11:00 | 25.4 | 1022 | 44 |
11 | hab2 | 2015-01-01 09:12:00 | 23.5 | 1022 | 34 |
12 | hab2 | 2015-01-01 09:13:00 | 24.5 | 1020 | 37 |
13 | hab2 | 2015-01-01 09:14:00 | 26.0 | 1022 | 33 |
14 | hab2 | 2015-01-01 09:15:00 | 25.4 | 1023 | 41 |
15 | hab2 | 2015-01-01 09:16:00 | 23.2 | 1025 | 39 |
16 | hab2 | 2015-01-01 09:17:00 | 24.1 | 1020 | 36 |
17 | hab2 | 2015-01-01 09:18:00 | 26.8 | 1023 | 49 |
18 | hab2 | 2015-01-01 09:19:00 | 23.8 | 1023 | 38 |
19 | hab2 | 2015-01-01 09:20:00 | 25.1 | 1022 | 44 |
Let's try another example where we might be interested only in the temperature of readings taken before 09:15:
sql = """
SELECT temp
FROM readings
WHERE
flight = 'hab2'
AND ts < '2015-01-01 09:15:00'
"""
pd.read_sql(sql, connection)
temp | |
---|---|
0 | 25.9 |
1 | 25.4 |
2 | 24.2 |
3 | 24.8 |
4 | 25.4 |
5 | 24.8 |
6 | 24.1 |
7 | 25.3 |
8 | 25.3 |
9 | 24.2 |
10 | 25.4 |
11 | 23.5 |
12 | 24.5 |
13 | 26.0 |
But what if we wanted to see the first few readings for any flights that took place in the UK?
We'll use a JOIN clause to join the readings and flights tables together
SELECT readings.*
FROM readings JOIN flights
ON readings.flight = flights.name
WHERE
flights.country_code = 'GB'
AND readings.ts < '2015-01-01 09:05:00'
Here, we're telling the database engine that, for any record in the readings table, it can use the value in the 'flight' column to find a record in the flights table by matching on the 'name' column.
It's common in SQL to use aliases for table names in queries using joins:
SELECT r.*
FROM readings AS r JOIN flights AS f
ON r.flight = f.name
WHERE
f.country_code = 'GB'
AND r.ts < '2015-01-01 09:05:00'
We can also include fields from both tables in the select clause:
SELECT r.flight, r.ts, r.temp, r.pressure, f.longitude, f.latitude
FROM readings AS r JOIN flights AS f
ON r.flight = f.name
WHERE
f.country_code = 'GB'
AND r.ts < '2015-01-01 09:05:00'
Let's try running that last example:
sql = """
SELECT r.flight, r.ts, r.temp, r.pressure, f.longitude, f.latitude
FROM readings AS r JOIN flights AS f
ON r.flight = f.name
WHERE
f.country_code = 'GB'
AND r.ts < '2015-01-01 09:05:00'
"""
pd.read_sql(sql, connection)
flight | ts | temp | pressure | longitude | latitude | |
---|---|---|---|---|---|---|
0 | hab1 | 2015-01-01 09:01:00 | 24.7 | 1024 | 0.00000 | 51.50722 |
1 | hab1 | 2015-01-01 09:02:00 | 23.0 | 1024 | 0.00000 | 51.50722 |
2 | hab1 | 2015-01-01 09:03:00 | 24.1 | 1025 | 0.00000 | 51.50722 |
3 | hab1 | 2015-01-01 09:04:00 | 26.5 | 1020 | 0.00000 | 51.50722 |
4 | hab2 | 2015-01-01 09:01:00 | 25.9 | 1023 | -2.23333 | 53.46667 |
5 | hab2 | 2015-01-01 09:02:00 | 25.4 | 1025 | -2.23333 | 53.46667 |
6 | hab2 | 2015-01-01 09:03:00 | 24.2 | 1025 | -2.23333 | 53.46667 |
7 | hab2 | 2015-01-01 09:04:00 | 24.8 | 1024 | -2.23333 | 53.46667 |