Aggregation¶
SQL's SELECT statement can also perform aggregation on sets of records.
Let's connect to our database:
In [1]:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('sqlite:///flight.db')
connection = engine.connect()
connection.execute("PRAGMA foreign_keys=on")
Out[1]:
<sqlalchemy.engine.result.ResultProxy at 0x116f91c18>
We're expecting have three records in our flights table and sixty readings.
Let's see if that's correct:
In [3]:
pd.read_sql("SELECT COUNT(*) FROM flights", connection)
Out[3]:
COUNT(*) | |
---|---|
0 | 3 |
In [4]:
pd.read_sql("SELECT COUNT(*) FROM readings", connection)
Out[4]:
COUNT(*) | |
---|---|
0 | 60 |
COUNT is one of SQL's aggregation functions. We also have SUM, MAX, MIN and AVG available.
Let's try some examples:
In [9]:
sql = """
SELECT MAX(pressure)
FROM readings
WHERE flight = 'hab1'
"""
pd.read_sql(sql, connection)
Out[9]:
MAX(pressure) | |
---|---|
0 | 1025 |
In [8]:
sql = """
SELECT AVG(temp)
FROM readings
WHERE flight = 'hab3'
"""
pd.read_sql(sql, connection)
Out[8]:
AVG(temp) | |
---|---|
0 | 25.485 |
We can also group our record set before aggregation using a GROUP BY clause.
Let's check that our sixty readings is comprised of twenty for each flight:
In [11]:
sql = """
SELECT flight, COUNT(*)
FROM readings
GROUP BY flight
"""
pd.read_sql(sql, connection)
Out[11]:
flight | COUNT(*) | |
---|---|---|
0 | hab1 | 20 |
1 | hab2 | 20 |
2 | hab3 | 20 |
Or find the average pressure reading for each flight:
In [14]:
sql = """
SELECT flight, AVG(pressure)
FROM readings
GROUP BY flight
"""
pd.read_sql(sql, connection)
Out[14]:
flight | AVG(pressure) | |
---|---|---|
0 | hab1 | 1022.00 |
1 | hab2 | 1022.95 |
2 | hab3 | 1022.15 |
There's nothing to stop us joining multiple tables in an aggregation query.
We can also use aliases in the column names.
Let's find the maximum and minimum temperature reading by country:
In [17]:
sql = """
SELECT
f.country_code,
MAX(r.temp) AS max_temp,
MIN(r.temp) AS min_temp
FROM readings AS r JOIN flights AS f
ON r.flight = f.name
GROUP BY f.country_code
"""
pd.read_sql(sql, connection)
Out[17]:
country_code | max_temp | min_temp | |
---|---|---|---|
0 | GB | 26.9 | 23.0 |
1 | US | 26.8 | 23.5 |