Django Annotate: 7 Examples to Supercharge your Querysets
Want to feel powerful? Learn how to use the .annotate()
method on your querysets. While your less experienced colleagues write multiple queries and for loops to get the result they are looking for, you can get your queryset to do all the heavy-lifting in a single step.
What do annotations do?
Add additional fields to each result of the queryset, calculated from other fields
Group results, similar to SQLs
GROUP BY
expression.
Applying .annotate()
on an unevaluated queryset will apply the annotation to all results. If you evaluate the queryset using .values()
or values_list()
first and apply .annotate()
, then this will have a grouping effect.
We will cover many examples of both kinds of annotation.
Annotate vs Aggregate
What is the difference between annotating and aggregating a queryset? Annotations will apply the expression to all results in the queryset, whereas aggregates will condense the data into one result.
Formula 1 Examples
I have a Django application that has imported historic Formula 1 data into a relational database (I used PostgreSQL but SQLite is fine). I am going to use .annotate()
to generate some interesting statistics about the drivers.
I used the Ergast Developer API to get my data. Rather than call the API, I downloaded their csv tables of historic data and wrote a script to ingest the data into my database.
I ingested the data into six tables: Race, Circuit, Result, Driver, Constructor, Status. The tables are linked
A schema for the database used in the examples.
Annotations are an advanced skill for Django ORM users. If you'd like to build your understanding of Django queries (filter
, exclude
, distinct
, values
, values_list
etc), then I would recommend my post on Django ORM queries, where you will learn through several F1 examples.
1. Basic Annotate
In this example we will count the number of times each status appears in the Race Result table.
What are the most common reasons for drivers to drop out of a race?
I want to know what the most common reasons are for a driver to not finish a race. The Status
table has the names of each outcome (e.g. "engine failure", "collision"). Every row in the Result
table has the outcome of the race for that driver.
I can use .annotate
to find the number of times each status has occurred.
This query will produce the Top 10 most common outcomes:
from stats.models import Status
from django.db.models import Count
top_10_outcomes = (
Status.objects.annotate(occurences=Count("results"))
.order_by("-occurences")
.values_list("status", "occurences")[:10]
)
This is the result:
<QuerySet [('Finished', 6952), ('+1 Lap', 3812), ('Engine', 2005), ('+2 Laps', 1590), ('Accident', 1041), ('Did not qualify', 1025), ('Collision', 829), ('Gearbox', 804), ('Spun off', 791), ('+3 Laps', 731)]>
The .annotate()
counted how many times each status appeared in the in Result
table.
You can filter your data before annotating it
The results also include statuses that imply the race was finished (e.g. "Finished" and "+3 laps"). We can filter the data first before applying the annotation.
from django.db.models import Count, Q
top_10_dnf_outcomes = (
Status.objects.exclude(Q(status="Finished") | Q(status__icontains="lap"))
.annotate(occurences=Count("results"))
.order_by("-occurences")
.values_list("status", "occurences")[:10]
)
The annotate still works after applying .filter()
, or .exclude()
. Using Q
expressions, the equivalent to OR
in SQL, we can include the "Finished" status or any status that has "lap" in the name.
<QuerySet [('Engine', 2005), ('Accident', 1041), ('Did not qualify', 1025), ('Collision', 829), ('Gearbox', 804), ('Spun off', 791), ('Suspension', 431), ('Did not prequalify', 331), ('Transmission', 321), ('Electrical', 315)]>
Now we can see that engine failure is the most common cause of a DNF (Did Not Finish), followed by accidents, with more than 1000 occurences.
2. Concat
What are the driver's full names?
I want the full names of each driver. Using .annotate
, I can concatenate the first and last names of each driver into a new field called full_name
.
Driver.objects.annotate(
full_name=Concat ("first_name", Value(" "), "last_name")
).order_by("full_name")
\don't put a space after "Concat" like I have. This was only to prevent my website from mistaking it for SQL injection.*
This is the result:
<QuerySet [<Driver: Adolf Brudes>, <Driver: Adolfo Cruz>, <Driver: Adrian Sutil>, <Driver: Adrián Campos>, <Driver: Aguri Suzuki>, <Driver: Al Herman>, <Driver: Al Keller>, <Driver: Al Pease>,...>
3. Adding Conditions (Case & When example)
We can use Case and When inside our annotations too add a condition to put the last name before the surname.
You can add conditions to your annotations, which is a way to apply IF
statements to your annotations.
For example, Zhou Guanyu is a Chinese driver who prefers to go by his surname before his first name. In the previous example, his full name would be calculated as "Guanyu Zhou". We can use Case
and When
in our annotation to fix this.
drivers = (
models.Driver.objects.annotate(
full_name=Case(
When(
nationality="Chinese",
then=Concat ("last_name", Value(" "), "first_name"),
),
default=Concat ("first_name", Value(" "), "last_name"),
)
)
.order_by("-full_name")
.values_list("full_name", "nationality")[4:10]
)
This is the result:
<QuerySet [('Zsolt Baumgartner', 'Hungarian'), ('Zhou Guanyu', 'Chinese'), ('Yves Cabantous', 'French'), ('Yuki Tsunoda', 'Japanese'), ('Yuji Ide', 'Japanese'), ('Yannick Dalmas', 'French')]>
4. How to Group Querysets with Django ORM
How many times did each driver in the 2021 season not finish a race?
In the first 3 examples, the annotation was applied to every result in the queryset. But what do you do if you want to use Django ORM to group your data, similar to SQL's GROUP BY
?
With .annotate()
, it is possible to group data, end up with fewer results than the original queryset. However, you will have to evaluate it using .values()
first to get the grouping effect.
In this example, I want to know which driver in the 2021 season had the most DNFs (they didn't finish a race). We can use .annotate()
to return every 2021 with their total number of DNFs for that season.
We can tell if a driver didn't finish a race if the finishing_position
field of the Result
table is null.
This is how we can do it:
driver_dnfs = (
models.Result.objects.filter(race__year=2021, finishing_position__isnull=True)
.values(
"driver",
)
.annotate(dnfs=Count("driver"))
.order_by("-dnfs")
.values_list("driver", "dnfs")
)
This queryset takes the Result
table (one row per driver, per race) and counts the number of rows for each driver where the finishing_position
doesn't contain a number (if they have a finishing position, then they completed the race).
These are the results:
<QuerySet [(853, 6), (847, 5), (822, 4), (852, 4), (842, 3), (840, 3), (839, 3), (854, 3), (849, 3), (830, 3), (20, 3), (4, 2), (844, 2), (815, 2), (8, 2), (841, 1), (1, 1), (846, 1), (817, 1)]>
In each tuple, the first element is the ID of the driver. The second element is the number of DNFs that driver had.
We can include more driver information in our result:
driver_dnfs = (
models.Result.objects.filter(race__year=2021, finishing_position__isnull=True)
.values(
"driver",
)
.annotate(dnfs=Count("driver"))
.order_by("-dnfs")
.values_list("driver__first_name", "driver__last_name", "dnfs")
)
This is the result:
<QuerySet [('Nikita', 'Mazepin', 6), ('George', 'Russell', 5), ('Yuki', 'Tsunoda', 4), ('Valtteri', 'Bottas', 4), ('Lance', 'Stroll', 3), ('Max', 'Verstappen', 3), ('Esteban', 'Ocon', 3), ('Mick', 'Schumacher', 3), ('Pierre', 'Gasly', 3), ('Sebastian', 'Vettel', 3), ('Nicholas', 'Latifi', 3), ('Kimi', 'Räikkönen', 2), ('Charles', 'Leclerc', 2), ('Sergio', 'Pérez', 2), ('Fernando', 'Alonso', 2), ('Lewis', 'Hamilton', 1), ('Lando', 'Norris', 1), ('Antonio', 'Giovinazzi', 1), ('Daniel', 'Ricciardo', 1)]>
Now we can see that Nikita Mazepin had the most DNFs in 2021.
5. Sum Example
How many points did each driver score in the 2021 driver championship?
By annotating a queryset of the results table, we can generate the leaderboard for the 2021 F1 championship.
For any readers not familiar with Formula 1, the 2021 driver championship was equally exciting and controversial. Lewis Hamilton and Max Verstappen were neck-and-neck and there was one race left to decide who was going to win the championship. For most of the race, Lewis Hamilton was expected to win, but a last-minute safety car narrowed the gap, giving Verstappen a chance to overtake and take the championship.
This query is going to generate the leaderboard for the 2021 driver championship.
We are going to do this by taking all the race results, where the race year was 2021 and take the sum of the points scored for every driver. Note, that there is a result in the Result
table per driver and per race.
Similar to the previous query, we will apply .values()
on the queryset before applying .annotate
to group the results by driver.
# Which driver scored the most points in 2021?
from django.db.models import Sum
driver_championship_2021 = (
Result.objects.filter(race__year=2021)
.values("driver")
.annotate(total_points=Sum("points_scored"))
.order_by("-total_points")
.values_list("driver__first_name", "driver__last_name", "total_points")
)
The Sum
expression adds all values in the points_scored
for all 2021 race results for each driver.
These are the results:
<QuerySet [('Max', 'Verstappen', 389.0), ('Lewis', 'Hamilton', 386.0), ('Valtteri', 'Bottas', 219.0), ('Sergio', 'Pérez', 190.0), ('Carlos', 'Sainz', 164.0), ('Lando', 'Norris', 160.0), ('Charles', 'Leclerc', 159.0), ('Daniel', 'Ricciardo', 114.0), ('Pierre', 'Gasly', 110.0), ('Fernando', 'Alonso', 81.0), ('Esteban', 'Ocon', 74.0), ('Sebastian', 'Vettel', 43.0), ('Lance', 'Stroll', 34.0), ('Yuki', 'Tsunoda', 32.0), ('George', 'Russell', 16.0), ('Kimi', 'Räikkönen', 10.0), ('Nicholas', 'Latifi', 7.0), ('Antonio', 'Giovinazzi', 3.0), ('Nikita', 'Mazepin', 0.0), ('Robert', 'Kubica', 0.0), '...(remaining elements truncated)...']>
We can take this further by adding the annotation to get the driver's full name. However, we need to tweak it because we are querying the Result
table rather than the Driver
table.
driver_championship_2021 = (
models.Result.objects.filter(race__year=2021)
.values("driver")
.annotate(
total_points=Sum("points_scored"),
full_name=Concat ("driver__first_name", Value(" "), "driver__last_name"),
)
.order_by("-total_points")
.values_list("full_name", "total_points")
)
6. Count with conditions
How many point scoring races did each 2021 driver have?
In F1, drivers score points if they finish the race in the top 10. I will use Count
and Sum
to count the number of races for each driver where they scored points.
I could filter the queryset for results where points_scored
is more than 0 before the annotation or I could filter inside the annotation.
# How many point scoring races did each 2021 driver have?
from django.db.models import Count, Q, Sum, Value
from django.db.models.functions import Concat
points_scoring_races = (
models.Result.objects.filter(race__year=2021)
.values("driver")
.annotate(
full_name=Concat ("driver__first_name", Value(" "), "driver__last_name"),
points_scoring_races=Count("driver", filter=Q(points_scored__gt=0)),
total_points=Sum("points_scored"),
)
.order_by("-points_scoring_races", "-total_points")
.values("full_name", "points_scoring_races")
)
This is the result:
<QuerySet [{'full_name': 'Lewis Hamilton', 'points_scoring_races': 20}, {'full_name': 'Carlos Sainz', 'points_scoring_races': 20}, {'full_name': 'Lando Norris', 'points_scoring_races': 20}, {'full_name': 'Max Verstappen', 'points_scoring_races': 19}, {'full_name': 'Charles Leclerc', 'points_scoring_races': 18}, {'full_name': 'Sergio Pérez', 'points_scoring_races': 16}... >
From the results, we can see that Lewis Hamilton, Carlos Sainz and Lando Norris each had 20 races where they scored points.
We did this by only counting the race result if the points_scored
was more than 0:
points_scoring_races=Count("driver", filter=Q(points_scored__gt=0))
What is Q?
Q allows you to express a condition for filtering data (docs). You can also use them in queryset filters to combine multiple conditions with AND / OR logic.
7. ExpressionWrapper Example
Show me the age of each driver.
In this example, I will show you how to use Django's .annotate()
to calculate a person's age from their date of birth.
The database contains the date of birth in the database. We can use .annotate()
to use the date of birth to give the driver's age in years.
Our annotation needs to perform some arithmetic. We need to find the difference between today's date and the driver's date of birth and convert it into years.
The problem is, .annotate()
only accepts instances of the Expression class. Recall that in the names example, we had to use Value(" ")
just to insert a space between names.
We need to express our calculation of the driver's age as an instance of Expression
. All the classes we have been using so far: Count
, Value
, Concat
etc are subclasses of Expression
.
There are more Expression
subclasses we can use to calculate the driver's age.
We are going to use ExpressionWrapper
.
This is the query. Credit to LowerDeez on Github, whose code (source) this query has been adapted from.
import datetime
from django.db.models import ExpressionWrapper, DateField, IntegerField, F
from django.db.models.functions import ExtractDay
driver_ages = (
Driver.objects.annotate(
full_name=full_name_expression,
age=ExpressionWrapper(
ExtractDay(
ExpressionWrapper(
Value(datetime.datetime.now()) - F("date_of_birth"),
output_field=DateField(),
)
)
/ 365.25,
output_field=IntegerField(),
),
)
.order_by("-date_of_birth")
.values_list("full_name", "age")[:10]
)
What is happening here? Let's start in the middle and work our way out.
In this ExpressionWrapper
, we find the difference between today's date and the driver's date of birth:
ExpressionWrapper(
Value(datetime.datetime.now()) - F("date_of_birth"),
output_field=DateField(),
)
If you subtract two dates in Python, you get a timedelta
object. We can extract the number of days out of the delta using ExtractDay
.
ExtractDay(
ExpressionWrapper(
Value(datetime.datetime.now()) - F("date_of_birth"),
output_field=DateField(),
)
)
Finally, we can divide the days by 365.25 to get the years passed:
age=ExpressionWrapper(
ExtractDay(
ExpressionWrapper(
Value(datetime.datetime.now()) - F("date_of_birth"),
output_field=DateField(),
)
)
/ 365.25,
output_field=IntegerField(),
)
When using ExpressionWrapper
, you must specify a value for output_field
. This specifies how the result will be stored in the queryset.
What is F?
F expressions are used to represent database fields inside expressions (docs).
Conclusion
Django developers can use .annotate()
to add fields to querysets and/or group the data.
Annotations are different to aggregations, as annotations are applied to every result in the queryset.
Data can be grouped by applying values()
to the queryset before applying .annotate()
.
Annotations are powerful because they can create complex queries using multiple tables. The alternative is making multiple queries and writing your own code to fuse the data together. Annotations can handle the complexity in a single step.
To learn how to annotate querysets, you must also learn how to use expressions. In this tutorial, we covered Count
, Sum
, Value
, Case
, When
and ExpressionWrapper
. Functions give us more flexibility. We used Concat
and ExtractDay
in this tutorial.