Do you recall a situation when, at the beginning of a project, your team, after many days of intense discussion, finally established coding standards that everyone agreed to follow? You thoroughly discussed rules for capitalization, spacing, indentation, line length, and so on. You nearly came to blows over whether the comma in a SELECT query belongs at the beginning or the end of the line. And then, after a week or two… no one cared.
I bet so. We all do.
But does this mean we shouldn’t establish these standards? That we should give up and accept that code will always be messy? Not at all! We just need to find a way to enforce them. And no, I don’t mean using the threat of firing people for not following those rules, at least… not right away.
git hooks
Git hooks, natively built into Git, are at the heart of the solution presented in this article. These hooks are nothing more than scripts located in each repository, by default in the .git/hooks folder, that are invoked each time a particular event occurs. They are disabled by default, and enabling one simply involves removing the ".sample" suffix from its filename; no additional tools or plugins are required.
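For example, enabling the pre-commit hook boils down to renaming a single file (shown here for a Unix-like shell):

# list the sample hooks that ship with every repository
ls .git/hooks

# enable the pre-commit hook by dropping the ".sample" suffix
mv .git/hooks/pre-commit.sample .git/hooks/pre-commit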
After activation, a hook will start to react to the events mentioned earlier and trigger the script placed inside the file.
OK, but what events exactly are we talking about? Let's look at a few examples.
- pre-commit: Triggered before a commit is created. Often used for code style checks, linting, or running tests to ensure the commit adheres to standards.
- post-commit: Triggered immediately after a commit is completed. Can be used for tasks like triggering notifications.
- pre-push: Triggered before a push to a remote repository. Useful for running additional checks or tests before changes are pushed.
Let’s now attempt to activate the pre-commit hook and run a sample ‘Hello World’ script. The following Bash script displays the text “Hello, world!” sequentially in green, red, and yellow colors.
#!/bin/bash

# Define colors
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Print "Hello, world!" in green
echo -e "${GREEN}Hello, world!${NC}"

# Print "Hello, world!" in red
echo -e "${RED}Hello, world!${NC}"

# Print "Hello, world!" in yellow
echo -e "${YELLOW}Hello, world!${NC}"

# Exit with success status
exit 0
The effect looks as follows: three "Hello, world!" lines, one in each of the colors.
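If you want to reproduce this yourself, make sure the hook file is executable and trigger it with a commit; an empty commit is enough (a quick sanity check, not something you'd keep in history):

chmod +x .git/hooks/pre-commit
git commit --allow-empty -m "testing the pre-commit hook"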
It seems like we have the solution handed to us on a silver platter. All we have to do now is write a few thousand lines of code that will consistently check whether the proposed changes in the commit adhere to the rules we’ve established.
Oh no, wait, it’s not that simple after all.
pre-commit framework
Fortunately, our team isn't the first in the world to decide to enforce code standardization in a project, and the vast majority of the solutions needed to accomplish this task are readily available, typically under open-source licenses.
The tool definitely worth knowing in the context of our problem is the pre-commit framework (link: pre-commit) – "a multi-language package manager for pre-commit hooks". This framework redefines the way we work with pre-commit hooks: instead of creating scripts manually for each hook, we define a configuration file specifying the locations of ready-to-use "projects".
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-added-large-files
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.5
    hooks:
      - id: ruff
  - repo: https://github.com/numpy/numpydoc
    rev: v1.7.0
    hooks:
      - id: numpydoc-validation
        files: src
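With the configuration above saved as .pre-commit-config.yaml in the repository root, a typical setup, assuming the framework is installed from PyPI, takes three commands:

# install the framework
pip install pre-commit

# generate the .git/hooks/pre-commit script from the configuration file
pre-commit install

# optionally, run all configured hooks against the entire repository
pre-commit run --all-files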
The pre-commit install step is what populates the .git/hooks/pre-commit file: instead of writing the hook ourselves, the framework generates a simple script that activates it and invokes the projects specified in the configuration file.
#!/usr/bin/env bash
# File generated by pre-commit: https://pre-commit.com
# ID: 138fd403232d2ddd5efb44317e38bf03

# start templated
INSTALL_PYTHON='C:\Users\tomek\AppData\Local\Programs\Python\Python311\python.exe'
ARGS=(hook-impl --config=.pre-commit-config.yaml --hook-type=pre-commit)
# end templated

HERE="$(cd "$(dirname "$0")" && pwd)"
ARGS+=(--hook-dir "$HERE" -- "$@")

if [ -x "$INSTALL_PYTHON" ]; then
    exec "$INSTALL_PYTHON" -mpre_commit "${ARGS[@]}"
elif command -v pre-commit > /dev/null; then
    exec pre-commit "${ARGS[@]}"
else
    echo '`pre-commit` not found. Did you forget to activate your virtualenv?' 1>&2
    exit 1
fi
From this moment on, every time we invoke the git commit command, all hooks specified in the configuration file will be called and will validate the changes we want to commit to the branch.
This framework enables us to easily and quickly reuse many ready-made and proven solutions such as code validation, linting, or automatic documentation generation. The list of supported hooks is available at https://pre-commit.com/hooks.html.
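Two commands worth knowing in day-to-day work: a single hook from the configuration can be run on demand, and in an emergency the checks can be bypassed altogether (sparingly, please):

# run one selected hook against all files
pre-commit run ruff --all-files

# skip all hooks for a single commit using git's built-in escape hatch
git commit --no-verify -m "hotfix"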
examples
Allow me to now introduce my favorite hooks, which speed up my work in data engineering projects on a daily basis.
black (link: black) & ruff (link: ruff-pre-commit)
Two of the most popular tools for Python linting and code formatting, which help us keep the code consistent across the project. A significant advantage is that in many cases they can fix our code automatically, requiring only our approval of the changes.
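A minimal configuration entry covering both tools might look as follows; the rev values below are examples only, so pin whatever versions are current in your project:

- repo: https://github.com/psf/black
  rev: 24.4.2 # example version
  hooks:
    - id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.3.5 # example version
  hooks:
    - id: ruff
      args: [--fix] # let ruff apply its automatic fixes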
numpydoc (link: numpydoc)
This hook ensures that docstrings in committed files conform to the numpydoc standard. It allows us to specify a list of docstring rules that we want to enforce.
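For illustration, a docstring following the numpydoc conventions that this validation checks against looks roughly like this (a minimal sketch; which sections are mandatory depends on the rules you enable):

def add(a: int, b: int) -> int:
    """Add two integers.

    Parameters
    ----------
    a : int
        First addend.
    b : int
        Second addend.

    Returns
    -------
    int
        Sum of ``a`` and ``b``.
    """
    return a + b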
sqlfluff (link: sqlfluff)
A highly configurable SQL linter, adaptable to various dialects (T-SQL, Databricks, Hive, …) and customizable in its settings.
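A sketch of the corresponding configuration entry, using the lint and fix hooks published in the sqlfluff repository (the rev is an example, and passing the dialect as an argument is one possible way to set it):

- repo: https://github.com/sqlfluff/sqlfluff
  rev: 3.0.5 # example version
  hooks:
    - id: sqlfluff-lint
      args: [--dialect, tsql] # pick the dialect your project uses
    - id: sqlfluff-fix
      args: [--dialect, tsql]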
summary
The Git hooks mechanism, along with the pre-commit framework, is very easy to install and configure and doesn't require the user to have any knowledge of scripting languages. Using them in a project provides powerful support in the daily effort to ensure the quality of the code being developed.