07-09, 10:45–11:15 (US/Pacific), Room 318
Climate models generate a lot of data - and this can make it hard for researchers to efficiently access and use the data they need. The solutions of yesteryear include standardised file structures, SQLite databases, and just knowing where to look. All of these work - to varying degrees - but can leave new users scratching their heads. In this talk, I'll outline how ACCESS-NRI built tooling around Intake and Intake-ESM to make it easy for climate researchers to access available data, share their own, and avoid rewriting the same custom scripts to work with the data their experiments generate.
Looking at a folder on a command line for the first time, scratching your head, and writing a bunch of for loops, regexes or globs to interrogate some data is a common rite of passage for new PhD students working with climate data. Unfortunately, this rite of passage can be slow, duplicate effort, and produce suboptimal results.
Enter Intake, and the Intake-ESM plugin – Python packages that allow us to systematically index and catalog the outputs of Earth System Models (ESMs), as well as other similarly structured datasets. Using these tools, we’re able to efficiently generate a single catalog, allowing researchers to seamlessly access and analyse petabytes of data.
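To make the idea of indexing concrete, here is a minimal sketch using only the Python standard library. It is not the Intake or Intake-ESM API, and it is not how the ACCESS-NRI catalog is built; the filename convention, the `build_index` and `search` helpers, and the example file names are all hypothetical. It only illustrates the underlying pattern: parse the metadata out of the files once, store it as rows, and then query the rows instead of re-globbing the filesystem.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical filename convention: <model>_<experiment>_<variable>_<year>.nc
FILENAME_PATTERN = re.compile(
    r"(?P<model>[^_]+)_(?P<experiment>[^_]+)_(?P<variable>[^_]+)_(?P<year>\d{4})\.nc$"
)


def build_index(root):
    """Walk `root` once and record one metadata row per matching file."""
    rows = []
    for path in sorted(Path(root).rglob("*.nc")):
        match = FILENAME_PATTERN.match(path.name)
        if match:  # files that do not follow the convention are skipped
            rows.append({**match.groupdict(), "path": str(path)})
    return rows


def search(index, **criteria):
    """Return the rows whose metadata equals every keyword given."""
    return [
        row for row in index
        if all(row[key] == value for key, value in criteria.items())
    ]


# Demo on empty placeholder files (made-up names, no real data).
with tempfile.TemporaryDirectory() as root:
    for name in (
        "modelA_historical_tas_2000.nc",
        "modelA_historical_pr_2000.nc",
        "modelB_ssp585_tas_2050.nc",
    ):
        (Path(root) / name).touch()

    index = build_index(root)
    hits = search(index, variable="tas")
    hit_names = [Path(row["path"]).name for row in hits]
    print(hit_names)
```

The regex and the directory walk live in one place and run once; every later question becomes a call to `search`. A real catalog adds much more on top of this, such as persistent storage, richer metadata, and loading the matched files into analysis-ready objects.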
In this talk, I’ll outline:
- The difficulties of writing your own code to collate and analyse climate data output.
- How the Intake ecosystem can abstract away this issue, freeing up scientists to do science.
- What’s necessary to make the tools your users want and need – and how to make your solutions easy to adopt.
Expect to learn:
- How we leverage Intake and Intake-ESM to make Australia’s climate data tractable to researchers.
- The challenges and pitfalls of maintaining a catalog comprising over 1000 datasets from different sources.
- How the plugin architecture of Intake makes it possible for us to hide the necessary complexity, providing a simple and consistent interface to scientific users, who want to find climate datasets, not become data engineers.
- How we expose these tools to the Australian (and global) climate community, making data access easier, workflows faster, and results more reproducible.
- How we train users in these tools, collect feedback, and produce the features they need and want.
This talk is aimed at:
- The international climate science community.
- People looking to index and share their large datasets.
- People interested in reproducibility for scientific workflows with large and complex datasets.
Source Code:
- https://github.com/ACCESS-NRI/access-nri-intake-catalog
- https://github.com/ACCESS-NRI/intake-dataframe-catalog
Charles is a Research Software Engineer at ACCESS-NRI, where he works in the Model Evaluation and Diagnostics team, helping make it easier to access and analyse climate data. He previously worked in Air Quality, where he produced tools to analyse air pollution data, and has a PhD in Oceanography.
When not in front of a computer, he enjoys routinely injuring himself in a variety of sports.