ArcGIS Blog

Enhancing support for Parquet files

By Nana Dei and Diana Muresan

ArcGIS Pro 3.5 includes new and updated capabilities for geospatial visualization and analysis. One feature of this release is the enhanced support for Apache Parquet files. Parquet is a cloud-friendly columnar storage format optimized for efficient data access and analytics. It is highly compressible and self-describing, with embedded metadata and column-level statistics allowing for selective reading and quicker analysis. Widely adopted in cloud analytics workflows, Parquet is used by frameworks like Apache Spark to enable parallel processing in distributed environments.
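To make the idea of column-level statistics and selective reading more concrete, here is a toy Python model. It is not Parquet's actual on-disk layout or any library's API. It simply assumes data is split into row groups that each record a minimum and maximum per column, and shows how a reader could skip whole row groups that cannot satisfy a filter. The names RowGroup, make_row_group, and scan_greater_than, and the sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy row group: values stored per column, plus min/max per column."""
    columns: Dict[str, List[float]]
    stats: Dict[str, Tuple[float, float]]


def make_row_group(columns: Dict[str, List[float]]) -> RowGroup:
    # Statistics are computed once, at "write" time, and stored with the group.
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return RowGroup(columns=columns, stats=stats)


def scan_greater_than(groups: List[RowGroup], column: str, threshold: float):
    """Return (matching values, number of row groups whose values were read)."""
    matches: List[float] = []
    groups_read = 0
    for group in groups:
        _, col_max = group.stats[column]
        if col_max <= threshold:
            continue  # the stored maximum rules this group out; skip its values
        groups_read += 1
        matches.extend(v for v in group.columns[column] if v > threshold)
    return matches, groups_read


groups = [
    make_row_group({"population": [120.0, 340.0, 90.0]}),
    make_row_group({"population": [5000.0, 7200.0, 6100.0]}),
    make_row_group({"population": [15.0, 60.0, 33.0]}),
]
values, groups_read = scan_greater_than(groups, "population", 1000.0)
print(values, groups_read)  # [5000.0, 7200.0, 6100.0] 1
```

Only one of the three row groups is read in this example, because the stored maximums of the other two are below the threshold.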

Given Parquet’s increasing use across industries, ArcGIS Pro 3.5 supports it natively with folder connections and cloud storage connections, allowing you to visualize, query, and analyze Parquet data directly within ArcGIS.

You can work with Parquet files using two methods:

  • Create a multifile feature connection (MFC) to analyze or visualize one or more Parquet files with the same schema. This method allows you to treat multiple Parquet files as a single dataset, which is particularly useful for analyzing large volumes of data. This option has been supported since the ArcGIS Pro 3.3 release as part of the GeoAnalytics Desktop toolbox.
Create Multifile Feature Connection geoprocessing tool
  • Create a folder connection to the location where a Parquet file is stored. The location can be a local folder, a network share, or an Amazon Simple Storage Service (S3) bucket. Then, add the file to the map to visualize, query, and analyze its features. This method is new in ArcGIS Pro 3.5 and is the focus of the rest of this article.
Add a Parquet file to map

The data is added as a new feature layer to the map. Behind the scenes, ArcGIS Pro creates a local cache of the data.

Parquet file as a feature layer in the map

Capabilities with Parquet files

With the new drag-and-drop functionality, you can do the following:

  • Visualize data—Quickly add a Parquet file to a map or scene for immediate visualization with support for changing the symbology. For point datasets with more than 10,000 rows, the layer will draw as bins, allowing you to see patterns and trends in your data.
Map with geosquare bins

You can then use the Binning context menu on the ribbon to change some of its properties, such as the scale threshold or summary statistics, or you can completely disable binning.

The Binning option on the ribbon
  • Query data—Perform spatial queries on the data contained in a Parquet file.
Query Parquet data
  • Analyze data—Use the map layer that is pointing to the Parquet file as an input in the geoprocessing tool or ArcPy script.
Perform geospatial analysis
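The binning behavior described in the Visualize data item above can be pictured with a small Python sketch. This is not how ArcGIS Pro implements binning. It only illustrates the general idea of aggregating points into square cells and counting them. The cell size, the function names, and the sample coordinates are assumptions; only the 10,000-row threshold comes from the text.

```python
from collections import Counter
from math import floor
from typing import Iterable, Tuple

BIN_THRESHOLD = 10_000  # row count above which the article says points draw as bins


def should_draw_as_bins(row_count: int) -> bool:
    """Apply the article's rule: more than 10,000 rows means bins."""
    return row_count > BIN_THRESHOLD


def square_bin_counts(points: Iterable[Tuple[float, float]], cell_size: float) -> Counter:
    """Count points per square cell; keys are (column, row) cell indexes."""
    counts: Counter = Counter()
    for x, y in points:
        counts[(floor(x / cell_size), floor(y / cell_size))] += 1
    return counts


# Hypothetical sample coordinates in arbitrary map units.
points = [(0.5, 0.5), (0.9, 0.1), (1.2, 0.3), (-0.4, 2.7)]
counts = square_bin_counts(points, cell_size=1.0)
print(dict(counts))  # {(0, 0): 2, (1, 0): 1, (-1, 2): 1}
print(should_draw_as_bins(len(points)))  # False
```

Each cell's count is the kind of summary statistic a bin can display in place of thousands of overlapping point symbols.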

For more information about support of Parquet files, check out the FAQs page.

How caching works

When you access data from a Parquet file in ArcGIS Pro, the software creates a local cache on disk to enhance performance.

Create a cache file

This caching occurs when you:

  • add the data to a map or scene
  • open the Fields view
  • access the Properties dialog box of the Parquet file in the Catalog pane

The time it takes to create the cache and the size of the local cache both depend on the amount of data stored in the Parquet file. These local caches are created per user and per machine, so each user has a dedicated cache that improves the efficiency of data queries and map navigation.

If the source Parquet file is modified, ArcGIS Pro automatically re-creates the cache to reflect the changes. Smaller caches (1 GB or less) are automatically deleted if not accessed within 30 days, while larger caches are retained due to the time required to create them.
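The retention rule above can be restated as a short Python predicate. This is only a paraphrase of the rule as described in this article, not ArcGIS Pro's internal logic. Reading "1 GB" as 1024³ bytes is an assumption, and the function name and sample dates are hypothetical.

```python
from datetime import datetime, timedelta

SMALL_CACHE_LIMIT_BYTES = 1024 ** 3  # assumption: "1 GB" interpreted as 2**30 bytes
MAX_IDLE = timedelta(days=30)        # from the article: not accessed within 30 days


def eligible_for_cleanup(size_bytes: int, last_accessed: datetime, now: datetime) -> bool:
    """True when a cache is small (1 GB or less) and idle for more than 30 days."""
    is_small = size_bytes <= SMALL_CACHE_LIMIT_BYTES
    is_idle = (now - last_accessed) > MAX_IDLE
    return is_small and is_idle


now = datetime(2025, 6, 1)
print(eligible_for_cleanup(200 * 1024 ** 2, datetime(2025, 4, 1), now))  # True: small and idle
print(eligible_for_cleanup(5 * 1024 ** 3, datetime(2025, 4, 1), now))    # False: larger caches are kept
print(eligible_for_cleanup(200 * 1024 ** 2, datetime(2025, 5, 20), now)) # False: accessed recently
```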

Optimize Parquet file access

You can use the CreateParquetCache ArcPy function in Python to pre-generate the cache for a large Parquet file, which saves time by avoiding on-the-fly cache generation when adding data to a map in ArcGIS Pro. When a cache is generated for a Parquet file, subsequent additions of that file to a map by the same user will automatically tap into the cache, reducing loading time and enhancing data accessibility. To optimize Parquet file access, complete the following steps:

1. Store the Parquet file in a designated folder for data storage and processing.

2. To create the Parquet cache, use the CreateParquetCache function in a script to automate reading the Parquet file and creating its cache. This script can be run outside of ArcGIS Pro at any time, ensuring the cache is ready before you access the Parquet file in ArcGIS Pro.

Optimize Parquet file expression

3. Once the local cache is created, open ArcGIS Pro, locate the Parquet file, and add it to the map.

ArcGIS Pro automatically uses the pre-created cache, so the layer loads without waiting for the cache to be built.
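The pre-caching step can be sketched as a small batch script. The article names the CreateParquetCache function, but the original script is not reproduced here, so treat the exact call as an assumption: this sketch assumes the function is exposed as arcpy.CreateParquetCache and accepts the Parquet file path as its only required argument. Check the ArcPy reference for the actual signature. The helper names find_parquet_files, build_caches, and main are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def find_parquet_files(folder: Path) -> List[Path]:
    """Return the .parquet files directly inside folder, sorted by path."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".parquet"
    )


def build_caches(paths: Iterable, create_cache: Callable[[str], object]) -> Dict[str, str]:
    """Call create_cache for each path and record 'ok' or the error text."""
    results: Dict[str, str] = {}
    for path in paths:
        try:
            create_cache(str(path))
            results[str(path)] = "ok"
        except Exception as exc:  # keep going so one bad file does not stop the batch
            results[str(path)] = f"failed: {exc}"
    return results


def main(folder: str) -> None:
    # arcpy is only available in an ArcGIS Pro Python environment,
    # so it is imported here rather than at module level.
    import arcpy

    # Assumption: arcpy.CreateParquetCache(path) is the call shape.
    results = build_caches(find_parquet_files(Path(folder)), arcpy.CreateParquetCache)
    for path, status in results.items():
        print(path, status)
```

Calling main with the folder from step 1 would run the batch in an ArcGIS Pro Python environment. Passing the cache function in as a parameter keeps the file discovery and error handling testable on a machine without ArcPy.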

The road ahead

Currently, ArcGIS Pro does not support publishing web layers directly from Parquet file data to ArcGIS Online or ArcGIS Enterprise. Additionally, cached Parquet file data cannot be included in packages, such as map packages or project packages. However, these functionalities are potential enhancements that may be considered for future software releases, enabling more seamless sharing and collaboration of geospatial data.

Share your feedback and suggestions on the Esri Community page. We hope you find this enhancement valuable and look forward to seeing how you utilize Parquet files in ArcGIS Pro.

Check out the What’s new in ArcGIS Pro 3.5 documentation for more details on new functionality in this release!
