07-10, 16:05–16:35 (US/Pacific), Ballroom
With datasets growing in both complexity and volume, the demand for more efficient data processing has never been higher. Pandas and NetworkX, the go-to Python libraries for tabular and graph data processing, are very popular for their ease of use and flexibility. However, they often struggle to keep pace with the demands of large-scale data analysis.
This talk introduces new open-source GPU accelerators for Pandas and NetworkX from the NVIDIA RAPIDS project and demonstrates how to enable them in your workflows for large speedups – up to 150x in Pandas and 600x in NetworkX – without code changes.
Many data science workflows include Pandas and NetworkX for good reasons: they are well understood and well documented, easy to use, flexible, and backed by a large community of users ready to assist. However, as datasets continue to grow, these libraries can sometimes struggle to perform. In the case of NetworkX, a large but realistic dataset can cause algorithms such as betweenness centrality to run for hours [1].
Solutions to these problems exist, but they often require additional libraries with different interfaces, which take time both to learn and to integrate into a workflow. GPUs are an excellent option for users who need significant performance increases, but they too can require code changes that, if not made carefully, break workflows that must continue to run in CPU-only environments.
In this talk, we will demonstrate how to use two of the newest components in the NVIDIA RAPIDS suite – cudf.pandas and nx-cugraph – to GPU-accelerate data science workflows that use Pandas and/or NetworkX, without requiring any code changes. Users can run their workflows on CPU-only systems as they always have, and then realize large performance increases when running on a system where a GPU is available.
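As a rough illustration of what "without code changes" can look like in practice, the sketch below builds the command line and environment for launching an unmodified script, with or without the accelerators. It is not the talk's own material. The script name workflow.py and both helper functions are hypothetical. The `python -m cudf.pandas` launcher is the documented way to run a script under cudf.pandas. The NX_CUGRAPH_AUTOCONFIG variable is an assumption that holds only for recent nx-cugraph releases; older releases were enabled through NetworkX backend settings instead.

```python
import importlib.util
import os
import sys


def accelerators_installed():
    """Report which accelerator packages are importable in this environment.

    This only checks that the packages are installed; it does not verify
    that a usable GPU is actually present.
    """
    return {
        "cudf": importlib.util.find_spec("cudf") is not None,
        "nx_cugraph": importlib.util.find_spec("nx_cugraph") is not None,
    }


def build_launch(script, use_cudf, use_nx_cugraph, base_env=None):
    """Return (command, environment) for running an unmodified script.

    Hypothetical helper: the script itself is never edited, only the way
    it is launched changes.
    """
    env = dict(os.environ if base_env is None else base_env)
    if use_nx_cugraph:
        # Assumption: recent nx-cugraph releases read this variable to
        # dispatch supported NetworkX calls to the GPU backend.
        env["NX_CUGRAPH_AUTOCONFIG"] = "True"
    if use_cudf:
        # Run the script through the cudf.pandas module launcher.
        command = [sys.executable, "-m", "cudf.pandas", script]
    else:
        # Plain CPU run: the usual interpreter invocation.
        command = [sys.executable, script]
    return command, env


found = accelerators_installed()
command, env = build_launch("workflow.py", found["cudf"], found["nx_cugraph"])
print(" ".join(command))
```

The resulting command and environment could be passed to subprocess.run. On a machine without the RAPIDS packages, the helper falls back to the ordinary interpreter invocation, so the same script runs in both settings.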
cudf.pandas and nx-cugraph, like every other project in the RAPIDS suite, are open-source software licensed under the Apache 2.0 license. Both can be installed with pip or conda, or built from source, which is available at https://github.com/rapidsai/cudf for cudf.pandas and https://github.com/rapidsai/cugraph for nx-cugraph.
This talk will show example workflows that involve loading, cleaning, and analyzing a large graph dataset using Pandas and NetworkX. We will run the same workflow on both a CPU-only system and a system with GPUs and compare the performance of each.
We will explain the different approaches used to implement cudf.pandas and nx-cugraph, as well as their features, caveats, and best practices for using them effectively.
This talk is intended for data scientists who want to take advantage of GPUs to speed up their data science workflows, but do not want to rewrite their existing code, learn new libraries, or sacrifice the ability to run on non-GPU systems. By using cudf.pandas and nx-cugraph, you can easily switch between CPU and GPU systems and enjoy the benefits of both.