Six Months Later: How Scala Changed The Game For Us

We've been using Slick's Lifted Embedding DSL for relational database access at Movio for a while now (see our previous blog Slick for database access in Scala), and have learnt a lot along the way. This post runs through some patterns we have found useful, and a few things to watch out for.

First, let's take a quick look at the schema definition we will be using in the examples (based on the schema used in the Slick documentation). This is how we would have written it when we first started:

  object Tables {
    val suppliers = TableQuery[SuppliersT]
    val coffees = TableQuery[CoffeesT]
  }

  case class Suppliers(
    id: Int, name: String, street: String,
    city: String, state: String, zip: String
  )
  class SuppliersT(tag: Tag) extends Table[Suppliers](tag, "SUPPLIERS") {
    def id = column[Int]("SUP_ID", O.PrimaryKey)
    def name = column[String]("SUP_NAME")
    def street = column[String]("STREET")
    def city = column[String]("CITY")
    def state = column[String]("STATE")
    def zip = column[String]("ZIP")

    def * = (id, name, street, city, state, zip) <>
      ((Suppliers.apply _).tupled, Suppliers.unapply)
  }

  case class Coffees(id: Int, name: String, supID: Int, price: Double)
  class CoffeesT(tag: Tag) extends Table[Coffees](tag, "COFFEES") {

    def id = column[Int]("ID", O.PrimaryKey, O.AutoInc)
    def name = column[String]("COF_NAME")
    def supID = column[Int]("SUP_ID")
    def price = column[Double]("PRICE")

    def * = (id, name, supID, price) <> ((Coffees.apply _).tupled, Coffees.unapply)

    def supplier = foreignKey("SUP_FK", supID, Tables.suppliers)(_.id)
  }

The main question we had, going from examples to production code, was where to put the table query vals. There are a few options, but we initially settled on collecting them in dedicated objects.

With a small schema, this pattern wasn't bad. As our schema grew, however, we quickly ended up with case classes spread out over multiple files, and multiple collections of table queries in different objects. Things got confusing when a given piece of code needed to span these different groups.

Eventually someone came up with the smart - and in hindsight, obvious - idea of putting the table definitions and queries in the case classes' companion objects.

  case class Suppliers(
    id: Int, name: String, street: String,
    city: String, state: String, zip: String
  )
  object Suppliers {
    class SuppliersT(tag: Tag) extends Table[Suppliers](tag, "SUPPLIERS") {
      def id = column[Int]("SUP_ID", O.PrimaryKey)
      def name = column[String]("SUP_NAME")
      def street = column[String]("STREET")
      def city = column[String]("CITY")
      def state = column[String]("STATE")
      def zip = column[String]("ZIP")

      def * = (id, name, street, city, state, zip) <>
        ((Suppliers.apply _).tupled, Suppliers.unapply)
    }
    val table = TableQuery[SuppliersT]
  }

  case class Coffees(id: Int, name: String, supID: Int, price: Double)
  object Coffees {
    class CoffeesT(tag: Tag) extends Table[Coffees](tag, "COFFEES") {

      def id = column[Int]("ID", O.PrimaryKey, O.AutoInc)
      def name = column[String]("COF_NAME")
      def supID = column[Int]("SUP_ID")
      def price = column[Double]("PRICE")

      def * = (id, name, supID, price) <> ((Coffees.apply _).tupled, Coffees.unapply)

      def supplier = foreignKey("SUP_FK", supID, Suppliers.table)(_.id)
    }
    val table = TableQuery[CoffeesT]
  }

We find this pattern much easier to manage. The table definitions and queries are all at a consistent location relative to the case class. And since there's no need to try to group table queries together, there's no penalty when code crosses what would otherwise have been group boundaries.

Now for something a bit meatier. Slick doesn't always generate the queries you would expect. For example, if you want to count the number of rows in the Coffees table, you could do this:

  Coffees.table.length

And expect something like:

SELECT count(1)
FROM "COFFEES";

The Slick documentation even appears to say that's what you'll get. However, note that most of the examples are actually annotated as "simplified". What you actually get is:

SELECT s18.s17
FROM (
  SELECT count(1) AS s17
  FROM (
    SELECT s15.`ID` AS s19,
      s15.`COF_NAME` AS s20,
      s15.`SUP_ID` AS s21,
      s15.`PRICE` AS s22
    FROM `COFFEES` s15
    ) s24
  ) s18;

You can check this yourself with the .selectStatement method:

  Coffees.table.length.selectStatement

So, what is this query actually doing? First, it's selecting all the columns in our "*" definition for all rows, then counting the selected rows, and finally selecting the count. This will get the correct answer, but it is probably less efficient than our expected query, depending on how your database engine executes queries. It's also not particularly easy to read in your logs.

There are a bunch of reasons why queries are not always what you expect, and the Slick team appears to be working on improving the situation for 3.0, but the upshot is that for now you will likely need to check and optimise your queries at some point. Imagine some air quotes when reading "optimise", as determining what constitutes an optimised query can be difficult. I find it to be a balancing act between three factors:

  1. Code readability,
  2. SQL readability,
  3. Query performance.

Typically, you have to pick two. For particularly complex queries, you may have to pick one. Note that how query performance is impacted often depends significantly on the database you are running. For example, unnecessary sub-selects can severely impact the performance of MySQL, but have negligible impact on InfiniDB. SQL readability can be tied to query performance, but it isn't always; sometimes you will need to decrease readability to optimise performance.

With that in mind, let's get back to counting our coffees. This will get us our expected query:

  Coffees.table.groupBy(_ ⇒ 0).map(_._2.length)

This seems a little strange, but all becomes clear when you think about how the SQL count function works. Count is an aggregate function - it works on groups of rows, counting the number of rows in each group. For example:

SELECT COF_NAME,
  count(1)
FROM "COFFEES"
GROUP BY COF_NAME;

This groups all the coffees by name, and counts the coffees for each name.

Given that count works on groups of rows, how does our expected query work? We haven't grouped by anything, so what is count counting?

When you use an aggregate function and don't specify how to group the rows, SQL implicitly groups all the rows into a single group and applies the aggregate function to it.

Slick's DSL can't do this implicitly, but we can do it explicitly by grouping by a constant - in this case "0". All rows will end up in the same group, and we can then map it as we would any other groupBy.
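The same behaviour can be seen with plain Scala collections, whose groupBy mirrors SQL's. This is an analogy only - ordinary collection code, not Slick - but it shows why grouping by a constant lets an aggregate run over every row:

```scala
// Plain Scala (not Slick): collections' groupBy mirrors SQL's GROUP BY.
val prices = Seq(7.99, 8.99, 9.99)

// Grouping by a constant puts every element under the single key 0,
// so aggregating that one group aggregates over all the rows.
val grouped = prices.groupBy(_ => 0)
val count = grouped.map { case (_, rows) => rows.length }.head
// count == 3, and grouped has exactly one group
```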

Note that Slick is smart enough to remove the group by in the SQL. This is what it generates for our improved coffee count query:

SELECT count(1)
FROM `COFFEES` s16;

This "group by constant" trick is also how you to perform multiple aggregations in the same query. For example, getting the max price as well as the count:

  Coffees.table.groupBy(_ ⇒ 0).map {
    case (_, rows) ⇒ (rows.length, rows.map(_.price).max)
  }

Generates:

SELECT count(1),
  max(s20.`PRICE`)
FROM `COFFEES` s20;

Another interesting case is explicit joins. For example, if we wanted an inner join on our Coffees and Suppliers tables, we could do this:

 for {
   (c, s) ← Coffees.table innerJoin Suppliers.table on (_.supID === _.id)
 } yield (c.name, s.name)

We are expecting SQL like this:

SELECT c.COF_NAME,
  s.SUP_NAME
FROM COFFEES c
INNER JOIN SUPPLIERS s ON c.SUP_ID = s.SUP_ID;

What we actually get is:

SELECT s3.s16,
  s4.s24
FROM (
  SELECT s27.`COF_NAME` AS s16,
    s27.`SUP_ID` AS s17
  FROM `COFFEES` s27
  ) s3
INNER JOIN (
  SELECT s29.`SUP_ID` AS s20,
    s29.`SUP_NAME` AS s24
  FROM `SUPPLIERS` s29
  ) s4 ON s3.s17 = s4.s20;

Slick is first selecting the columns it needs for each table, and then joining the results.

The two sides of a join may need to be wrapped in subselects - for example, when they contain group by clauses - but in many cases this isn't necessary. We can get around it, for inner joins at least, by using an implicit inner join instead:

  for {
    c ← Coffees.table
    s ← Suppliers.table if c.supID === s.id
  } yield (c.name, s.name)

Generating:

SELECT s17.`COF_NAME`,
  s19.`SUP_NAME`
FROM `COFFEES` s17,
  `SUPPLIERS` s19
WHERE s17.`SUP_ID` = s19.`SUP_ID`;

Or, even better, making use of our foreign key to construct the implicit inner join for us:

  for {
    c ← Coffees.table
    s ← c.supplier
  } yield (c.name, s.name)

The above examples are quite simple, but in real queries you can often get combinations of issues resulting in very complex SQL. In my experience, these can generally be pulled apart and each issue tackled piecemeal.

Finally, all this brings up the question of why we keep using Lifted Embedding if it causes these issues. We originally planned to fall back to Slick's plain SQL support for complex queries, but quickly found that we didn't want to give up the benefits Lifted Embedding provides. The main ones are:

  1. Type Safety,
  2. Syntax Safety,
  3. Conditional Query Construction.

The first two basically mean that (provided your Slick schema definition matches the actual DB schema) Lifted Embedding ensures, at compile time, that both the types in your queries and the query syntax itself are correct. Not only does this massively reduce the range of issues you need to worry about when testing, it also allows IDEs (if you are so inclined) to support you while writing queries.
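As a toy illustration of the idea - this is not Slick's actual machinery, just a sketch of the principle - a column that carries its value type in a type parameter can only be compared against values of that type, so a mismatch is rejected before the code ever runs:

```scala
// Toy "lifted" column: the value type A travels with the column.
case class Col[A](name: String) {
  // Accepts only values of the column's declared type A.
  def ===(value: A): String = s"$name = $value"
}

val price = Col[Double]("PRICE")
val clause = price === 8.99
// price === "expensive"  // rejected at compile time: found String, required Double
```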

The third point refers to being able to vary your queries at runtime (beyond simply swapping out variables). The Lifted Embedding DSL allows you to do this without resorting to string concatenation, while still type- and syntax-checking everything.

Imagine you have a query that retrieves a list of coffees, based on a set of optional filters. Some filters work on the columns of the Coffees table, others on the Suppliers table. Using the DSL you can optionally include where clauses, and even joins, based on what filters are selected at runtime.

For example, this query optionally filters coffees on supplier id:

  type CoffeesQuery = Query[Coffees.CoffeesT, Coffees, Seq]

  def optionallyFilterBySupplier(supplier: Option[Int])
    (coffees: CoffeesQuery): CoffeesQuery =
    supplier match {
      case Some(supplier) ⇒ coffees.filter(_.supID === supplier)
      case None           ⇒ coffees
    }

  val supplierFilter: Option[Int] = ???

  optionallyFilterBySupplier(supplierFilter)(Coffees.table)

The optionallyFilterBySupplier function takes an optional supplier id to filter by, and a query that selects coffees. If no supplier filter is provided (i.e. None), the query is returned unchanged; otherwise a where clause is added to filter on the supplier.

No supplier filter:

SELECT s17.`ID`,
  s17.`COF_NAME`,
  s17.`SUP_ID`,
  s17.`PRICE`
FROM `COFFEES` s17;

Filtered to supplier "5":

SELECT s18.`ID`,
  s18.`COF_NAME`,
  s18.`SUP_ID`,
  s18.`PRICE`
FROM `COFFEES` s18
WHERE s18.`SUP_ID` = 5;

This query optionally filters coffees on their supplier's cities, only joining on the Suppliers table if the filter is provided:

  type CoffeesQuery = Query[Coffees.CoffeesT, Coffees, Seq]

  def optionallyFilterByCity(city: Option[String])
    (coffees: CoffeesQuery): CoffeesQuery =
    city match {
      case Some(city) ⇒
        for {
          c ← coffees
          s ← c.supplier if s.city === city
        } yield c
      case None ⇒ coffees
    }

  val cityFilter: Option[String] = ???

  optionallyFilterByCity(cityFilter)(Coffees.table)

No city filter:

SELECT s17.`ID`,
  s17.`COF_NAME`,
  s17.`SUP_ID`,
  s17.`PRICE`
FROM `COFFEES` s17;

Filtered to city "Auckland":

SELECT s28.`ID`,
  s28.`COF_NAME`,
  s28.`SUP_ID`,
  s28.`PRICE`
FROM `COFFEES` s28,
  `SUPPLIERS` s30
WHERE (s30.`SUP_ID` = s28.`SUP_ID`)
  AND (s30.`CITY` = 'Auckland');

These two filters, along with any others you care to write, can be applied together, creating a pipeline that constructs a query based on any number of parameters. We use this kind of pattern extensively, and it is just the tip of the iceberg when it comes to what the DSL allows.
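To make the composition concrete without a database, here is the same pipeline pattern sketched over plain collections. The case class and filter names mirror the examples above, but CoffeesQuery here is a stand-in Seq, not a Slick query, and optionallyFilterByMaxPrice is a hypothetical extra filter:

```scala
case class Coffee(id: Int, name: String, supID: Int, price: Double)

// Stand-in for a Slick query: a plain sequence of rows.
type CoffeesQuery = Seq[Coffee]

// Each optional filter either narrows the query or passes it through untouched.
def optionallyFilterBySupplier(supplier: Option[Int])(coffees: CoffeesQuery): CoffeesQuery =
  supplier.fold(coffees)(id => coffees.filter(_.supID == id))

def optionallyFilterByMaxPrice(max: Option[Double])(coffees: CoffeesQuery): CoffeesQuery =
  max.fold(coffees)(p => coffees.filter(_.price <= p))

val all: CoffeesQuery = Seq(
  Coffee(1, "Colombian", 101, 7.99),
  Coffee(2, "French_Roast", 49, 8.99),
  Coffee(3, "Espresso", 150, 9.99)
)

// Fold any number of filter stages over the base query;
// None at a stage means "no restriction" there.
val steps: Seq[CoffeesQuery => CoffeesQuery] = Seq(
  optionallyFilterBySupplier(Some(101)),
  optionallyFilterByMaxPrice(None)
)
val result = steps.foldLeft(all)((query, step) => step(query))
// result contains only the supplier-101 coffee
```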

Slick's Lifted Embedding DSL isn't without its eccentricities, but once you are aware of them they can largely be mitigated. Identifying and optimising problematic queries can be time-consuming, but it has paid dividends for us, both in query debugging (i.e. improved SQL readability) and in performance. Its compile-time checking and query-construction power make it an indispensable tool.
