Bulk Loading
Previously, we inserted sixty records into the readings table, but often we have much bigger sets of data to deal with.
The flight-data.csv file contains data for over 200,000 readings. Let's learn how to load that into our database.
First we connect:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('sqlite:///flight.db')
connection = engine.connect()
connection.execute("PRAGMA foreign_keys=on")
<sqlalchemy.engine.result.ResultProxy at 0x10be48ac8>
And then we load the contents of the file into a pandas dataframe.
The file contains more columns than we are using, so we'll specify which ones we want to import:
flight_data = pd.read_csv(
    'flight-data.csv',
    usecols=['timestamp', 'temp_h', 'pressure', 'humidity',
             'accel_x', 'accel_y', 'accel_z'])
flight_data.head()
| | temp_h | humidity | pressure | accel_x | accel_y | accel_z | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 25.108898 | -1.364401 | 0.000000 | -0.031118 | 0.266690 | 0.945787 | 2015-09-17 12:17:32.518121 |
| 1 | 25.237246 | -1.375071 | 1011.770020 | -0.028465 | 0.265963 | 0.940182 | 2015-09-17 12:17:32.601389 |
| 2 | 25.218910 | -2.299459 | 1011.776123 | -0.027982 | 0.267175 | 0.941157 | 2015-09-17 12:17:32.681060 |
| 3 | 25.072227 | -1.684385 | 1011.782715 | -0.027741 | 0.265721 | 0.941401 | 2015-09-17 12:17:32.757071 |
| 4 | 25.072227 | -1.684385 | 1011.783203 | -0.028465 | 0.266448 | 0.939939 | 2015-09-17 12:17:32.833635 |
Two of the column names don't quite match what we've used in our table.
Let's fix that:
flight_data = flight_data.rename(columns={'temp_h': 'temp', 'timestamp': 'ts'})
flight_data.head()
| | temp | humidity | pressure | accel_x | accel_y | accel_z | ts |
|---|---|---|---|---|---|---|---|
| 0 | 25.108898 | -1.364401 | 0.000000 | -0.031118 | 0.266690 | 0.945787 | 2015-09-17 12:17:32.518121 |
| 1 | 25.237246 | -1.375071 | 1011.770020 | -0.028465 | 0.265963 | 0.940182 | 2015-09-17 12:17:32.601389 |
| 2 | 25.218910 | -2.299459 | 1011.776123 | -0.027982 | 0.267175 | 0.941157 | 2015-09-17 12:17:32.681060 |
| 3 | 25.072227 | -1.684385 | 1011.782715 | -0.027741 | 0.265721 | 0.941401 | 2015-09-17 12:17:32.757071 |
| 4 | 25.072227 | -1.684385 | 1011.783203 | -0.028465 | 0.266448 | 0.939939 | 2015-09-17 12:17:32.833635 |
There are some errors in the humidity readings due to sensor glitches (for example, you can see some negative figures in the first few rows).
Let's clip any out-of-range values so they fall between 0 and 100:
flight_data['humidity'] = flight_data['humidity'].clip(0, 100)
flight_data.head()
| | temp | humidity | pressure | accel_x | accel_y | accel_z | ts |
|---|---|---|---|---|---|---|---|
| 0 | 25.108898 | 0.0 | 0.000000 | -0.031118 | 0.266690 | 0.945787 | 2015-09-17 12:17:32.518121 |
| 1 | 25.237246 | 0.0 | 1011.770020 | -0.028465 | 0.265963 | 0.940182 | 2015-09-17 12:17:32.601389 |
| 2 | 25.218910 | 0.0 | 1011.776123 | -0.027982 | 0.267175 | 0.941157 | 2015-09-17 12:17:32.681060 |
| 3 | 25.072227 | 0.0 | 1011.782715 | -0.027741 | 0.265721 | 0.941401 | 2015-09-17 12:17:32.757071 |
| 4 | 25.072227 | 0.0 | 1011.783203 | -0.028465 | 0.266448 | 0.939939 | 2015-09-17 12:17:32.833635 |
Finally, there is no flight name.
Let's assign all of these readings to flight hab1:
flight_data['flight'] = 'hab1'
flight_data.head()
| | temp | humidity | pressure | accel_x | accel_y | accel_z | ts | flight |
|---|---|---|---|---|---|---|---|---|
| 0 | 25.108898 | 0.0 | 0.000000 | -0.031118 | 0.266690 | 0.945787 | 2015-09-17 12:17:32.518121 | hab1 |
| 1 | 25.237246 | 0.0 | 1011.770020 | -0.028465 | 0.265963 | 0.940182 | 2015-09-17 12:17:32.601389 | hab1 |
| 2 | 25.218910 | 0.0 | 1011.776123 | -0.027982 | 0.267175 | 0.941157 | 2015-09-17 12:17:32.681060 | hab1 |
| 3 | 25.072227 | 0.0 | 1011.782715 | -0.027741 | 0.265721 | 0.941401 | 2015-09-17 12:17:32.757071 | hab1 |
| 4 | 25.072227 | 0.0 | 1011.783203 | -0.028465 | 0.266448 | 0.939939 | 2015-09-17 12:17:32.833635 | hab1 |
Now we are ready to insert our data into our readings table. Let's try the same technique we used to load the sixty records in Chapter 3.
There, we defined an INSERT statement using bound parameters, then looped over our dataset, executing that statement once for each record.
We'll do that again now, but with the loop inside a function so that we can time it.
NOTE If you are using the online service, it is best not to run this example as it will take a significant length of time.
sql = """
INSERT INTO readings
(flight, ts, temp, pressure, humidity,
accel_x, accel_y, accel_z)
VALUES
(?, ?, ?, ?, ?, ?, ?, ?)
"""
def load_data(connection, data):
    for row in data.itertuples():
        connection.execute(sql, (
            row.flight, row.ts, row.temp, row.pressure,
            row.humidity, row.accel_x, row.accel_y, row.accel_z
        ))
connection.execute("DELETE FROM readings")
%timeit -n1 -r1 load_data(connection, flight_data)
5min 19s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
To understand why this technique takes so long to run, we need to understand database transactions.
When you ask a DBMS to make a change to a table using an INSERT, UPDATE or DELETE statement, it will guarantee:
- Atomicity - The entire change will either succeed or fail. It will never partially complete.
- Consistency - Your data will always be consistent. Your table constraints will never be violated.
- Isolation - Nobody else will see any of your changes until they have all completed.
- Durability - Once your changes are complete, they have been entirely written to the underlying storage.
These guarantees are known as ACID transactions.
You can make as many changes as you want within a single transaction. If, for example, it is important that records inserted into two tables must either succeed or fail together (e.g. an invoice header and invoice detail table in an accounting system), then it would be sensible to make those inserts within a single transaction.
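As a purely illustrative sketch (the invoices and invoice_lines tables below are hypothetical and not part of our flight database), grouping two related inserts into one transaction looks like this, using the begin method we'll meet in a moment:

with connection.begin():
    # Both inserts commit together when the block exits without error;
    # if either fails, neither row is written.
    connection.execute(
        "INSERT INTO invoices (id, customer) VALUES (?, ?)", (1, 'ACME'))
    connection.execute(
        "INSERT INTO invoice_lines (invoice_id, amount) VALUES (?, ?)", (1, 99.50))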
Unless told otherwise, SQLite creates a transaction for each and every INSERT, UPDATE and DELETE statement. Because it has to guarantee durability, it has to write to the database file for every statement. This repeated disk writing is the reason our first example is so slow.
Instead, let's explicitly specify that all our inserts should take place in a single transaction. That way the DBMS only has to commit to disk once for the entire task.
We specify a transaction using our connection object's begin method:
def load_data(connection, data):
    with connection.begin():
        for row in data.itertuples():
            connection.execute(sql, (
                row.flight, row.ts, row.temp, row.pressure,
                row.humidity, row.accel_x, row.accel_y, row.accel_z
            ))
connection.execute("DELETE FROM readings")
%timeit -n1 -r1 load_data(connection, flight_data)
11.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
You should see a huge drop in the amount of time it takes to execute this second example.
NOTE The DELETE statement in the previous example is outside the transaction so that we have a fair time comparison with the first example. However, this means that, if any problem occurs with the inserts, the table will be left empty because the delete will already have succeeded. In the real world, it's likely that you would want the delete to be included in the same transaction as the inserts.
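For reference, a minimal sketch of that safer variant might look like the following (the reload_data name is just illustrative; it reuses the sql statement defined above):

def reload_data(connection, data):
    with connection.begin():
        # The DELETE joins the same transaction, so a failed insert
        # rolls everything back and the existing rows are preserved.
        connection.execute("DELETE FROM readings")
        for row in data.itertuples():
            connection.execute(sql, (
                row.flight, row.ts, row.temp, row.pressure,
                row.humidity, row.accel_x, row.accel_y, row.accel_z
            ))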
But we can do better. At the moment, we are using a for loop to execute one statement per row. Instead, let's build a list of parameter tuples with a list comprehension and pass the whole list to a single execute call:
def load_data(connection, data):
    data = [
        (row.flight, row.ts, row.temp, row.pressure,
         row.humidity, row.accel_x, row.accel_y, row.accel_z)
        for row in data.itertuples()
    ]
    with connection.begin():
        connection.execute(sql, data)
connection.execute("DELETE FROM readings")
%timeit -n1 -r1 load_data(connection, flight_data)
2.75 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
On my laptop, that's now 100x faster than our first example and 4x faster than the second.
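As a quick sanity check (optional, but reassuring after a bulk load), we can count the rows now in the table; the figure should match the number of rows in our dataframe:

# Count the rows we just inserted.
connection.execute("SELECT COUNT(*) FROM readings").fetchone()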
Finally, although this tutorial focuses on learning SQL, it is worth noting that a pandas dataframe can be loaded directly into a database table using the to_sql method.
Let's see how that performs:
connection.execute("DELETE FROM readings")
%timeit -n1 -r1 flight_data.to_sql('readings', connection, if_exists='append', index=False)
5.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
By default, to_sql will place all of the insert statements into a single transaction, so this is far more efficient than our first example.
Pandas will also iterate over our dataframe more efficiently than the for loop in our second example.
However, pandas generates an insert statement for each row and passes those to the DBMS for execution. Our third example is faster because we used a single SQL statement with bound parameters: we pass that one statement to the DBMS, along with all of the relevant data, and ask it to handle the rows as best it can. A DBMS is designed to handle exactly that sort of task efficiently, so it beats having pandas generate the insert statements itself.
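If you do want to stay with to_sql, recent pandas versions (0.24 and later) accept chunksize and method arguments that batch many rows into each INSERT, which can reduce the per-row overhead. This is only a sketch, and the chunk size here is an assumption chosen to stay under SQLite's default limit on bound parameters per statement (999 in older builds):

# Sketch: batch many rows into each INSERT statement (pandas >= 0.24).
# 100 rows x 8 columns = 800 bound parameters per statement, which stays
# below SQLite's default limit.
flight_data.to_sql(
    'readings', connection,
    if_exists='append', index=False,
    chunksize=100, method='multi')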