SciPy 2023

DataJoint: Bringing databases back into data science
07-14, 13:55–14:25 (America/Chicago), Zlotnik Ballroom

Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data that supports data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows. We will showcase the elegance of the relational data model and its versatility through neuroscience research examples. We will also introduce the DataJoint SciViz library, which enables scientists to build web apps for data visualization and unlocks further potential for data-driven discovery.


Research teams work on complex scientific data with many contributors. They execute quickly evolving and complex computational pipelines around such data. This requires a systematic approach to structuring data with clarity and transparency, linking it with distributed computation. Relational databases solve many of these problems; they support data integrity and facilitate queries in large, collaborative repositories. However, working with relational databases through SQL from Python can be awkward. As a result, many data scientists have dismissed relational databases and missed out on their great capabilities. Enter DataJoint, an open-source framework designed explicitly for managing scientific data.

DataJoint uses a relational database system as its backend but relies on Python programming constructs to define and query the database, similar to object-relational mappers commonly used in web development. It is designed from the ground up to support complex data and distributed computations, making it an ideal tool for data scientists.

One of the most significant advantages of DataJoint is that it allows you to design complex databases directly from a Jupyter notebook. It provides its own sublanguage for defining database schemas that capture relationships between data elements, and it renders those schemas as diagrams for convenient navigation. DataJoint also provides a convenient query language that reduces the complexity of SQL SELECT statements to an algebra of five operators. Data operations are well integrated with other data science tools such as NumPy and pandas.
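To give a feel for what a small query algebra looks like, here is a minimal sketch in plain Python over lists of dicts. It is not the DataJoint API and needs no database. The abstract does not name the five operators, so the sketch assumes the familiar relational set of restriction, join, projection, aggregation, and union. The function names, the `mice` and `sessions` tables, and their attributes are all hypothetical.

```python
# Plain-Python illustration of five relational operators.
# Assumption: this is NOT the DataJoint API; all names here are hypothetical.

def restrict(rows, **conditions):
    """Keep rows whose attributes equal all of the given values."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def join(left, right):
    """Natural join: pair rows that agree on every shared attribute name."""
    result = []
    for lrow in left:
        for rrow in right:
            shared = lrow.keys() & rrow.keys()
            if all(lrow[k] == rrow[k] for k in shared):
                result.append({**lrow, **rrow})
    return result

def project(rows, *attrs):
    """Keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result

def aggregate(rows, group_by, **computations):
    """Group rows by one attribute and compute a summary per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group_by], []).append(r)
    return [
        {group_by: key, **{name: fn(members) for name, fn in computations.items()}}
        for key, members in groups.items()
    ]

def union(a, b):
    """Rows present in either input, without duplicates."""
    result = list(a)
    result.extend(r for r in b if r not in result)
    return result

mice = [
    {"mouse_id": 1, "sex": "F"},
    {"mouse_id": 2, "sex": "M"},
]
sessions = [
    {"mouse_id": 1, "session": 1, "duration_min": 30},
    {"mouse_id": 1, "session": 2, "duration_min": 45},
    {"mouse_id": 2, "session": 1, "duration_min": 20},
]

# Sessions recorded from female mice: a restriction followed by a join.
female_sessions = join(restrict(mice, sex="F"), sessions)

# Number of sessions per mouse: an aggregation.
session_counts = aggregate(sessions, "mouse_id", n_sessions=len)
```

Because each operator takes tables and returns a table, the operators compose freely, which is what lets a handful of them stand in for much of what a hand-written SELECT statement expresses.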

Most importantly, DataJoint makes computations a first-class citizen in its data model. Computational dependencies are encoded as part of the database design, so the database schema serves to specify the computational data pipeline and workflow.
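The idea that a schema can double as a workflow specification can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions, not DataJoint code: the table names are hypothetical, and the dependency mapping stands in for the relationships a real schema would declare.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each table maps to the set of tables it depends on.
# In a real schema these links would come from the table definitions.
dependencies = {
    "Session": set(),
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
    "Behavior": {"Session"},
    "TuningCurve": {"SpikeSorting", "Behavior"},
}

# A valid computation order follows directly from the declared dependencies.
compute_order = list(TopologicalSorter(dependencies).static_order())

def pending(table, done):
    """True if the table is not yet computed but all its upstream tables are."""
    return table not in done and dependencies[table] <= done

# Given what has been computed so far, find the tables that can run next.
done = {"Session", "Recording"}
ready = sorted(t for t in dependencies if pending(t, done))
```

Here `ready` contains the two tables whose inputs are complete, and they could run in parallel since neither depends on the other. No separate workflow file is needed, because the order of work is derived from the same structure that describes the data.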

DataJoint has been in continuous development and use for about 14 years and is currently used in approximately a hundred research labs. A rich collection of standardized workflows, DataJoint Elements, is being developed by the research community.

In this talk, we will introduce the basic principles of scientific databases, including how to create a database, how to visualize its structure, how to enter and delete data, and how to define and execute computational dependencies. We will also showcase examples from past and current neuroscience projects. For large-scale computations, DataJoint can be combined with job orchestration tools for scalable computing.

Furthermore, we will introduce the new DataJoint SciViz library that provides a low-code approach for creating websites for data visualization to show off your work. DataJoint has become a part of the data science tool stack for working with scientific databases, providing the full rigor of relational databases for maintaining data integrity and consistency, especially in dynamic collaborative projects.

Finally, we will share some glimpses of our future developments and invite diverse teams to contribute and collaborate, making DataJoint an even more powerful tool for managing scientific data. With DataJoint, scientists can bring relational databases into the modern era of data science and streamline their data management and computational workflows.

Dimitri Yatsenko has a PhD in Neuroscience (Baylor College of Medicine) and a Master's in Computational Engineering and Science (University of Utah). As CEO at DataJoint, he leads a team of scientists and engineers to develop tools for analyzing and managing neuroscience data for advanced collaborative projects. He serves as Principal Investigator on NIH grants to develop open-source software and a cloud platform supporting standardized data pipelines for common types of neuroscience experiments.