Understanding JSON to Parquet Conversion
Apache Parquet is a columnar storage format widely used in data engineering. Unlike JSON, which is row-oriented and schema-less, Parquet stores data column-by-column with an explicit schema and efficient compression. Converting JSON to a Parquet schema is the first step before writing Parquet files in your data pipeline.
This tool generates two outputs: a Parquet message schema (similar to protobuf format) and ready-to-run Python code using the pyarrow library.
JSON to Parquet Type Mapping
JSON Type → Parquet / pyarrow Type
─────────────────────────────────────────
string → pa.string() or pa.large_string()
integer number → pa.int64()
float number → pa.float64()
boolean → pa.bool_()
null → pa.null() (or a nullable field)
ISO date string → pa.timestamp('us') (auto-detected)
array → pa.list_(inner_type)
nested object → pa.struct([...fields])
Using the Generated Python Code
- Paste your JSON object or array into the input panel.
- Configure options: enable timestamp detection for ISO date strings, nullable for optional fields.
- Click "Generate Parquet Schema" to get both the schema definition and Python code.
- Copy the Python code and run it in an environment with pyarrow installed (pip install pyarrow).
- The generated code reads data.json and writes output.parquet; edit the filenames as needed.
Real-World Use Cases
Loading JSON into Snowflake, BigQuery, or Redshift
Data warehouses like Snowflake, BigQuery, and Amazon Redshift natively support Parquet as a load format. Converting your JSON data to Parquet before loading dramatically reduces load time and storage costs because columnar formats compress far better than row-oriented JSON. Use this tool to generate the schema, then adapt the pyarrow code to read your JSON source and write the Parquet file.
# After generating your output.parquet:
# BigQuery: bq load --source_format=PARQUET dataset.table output.parquet
# Snowflake: PUT file://output.parquet @my_stage; COPY INTO my_table...
Data Lake Ingestion with Apache Spark or Hive
Parquet is the de facto standard format for data lakes built on HDFS, S3, or Azure Data Lake Storage. When ingesting JSON data from APIs or log files into a data lake, converting to Parquet using the generated schema ensures the data is strongly typed and partitionable. The generated pa.schema() can also serve as a reference when writing an equivalent PySpark StructType.
Schema Documentation and API Contract Validation
Data engineers often need to document the schema of data flowing through pipelines. The Parquet message schema output provides a human-readable, versioned schema definition that can be committed to a repository alongside your pipeline code, making schema changes visible in code reviews.
Frequently Asked Questions
Q: What is Parquet and why is it better than JSON for data pipelines?
A: Apache Parquet is a columnar storage format. Unlike JSON (row-oriented), Parquet stores all values for a given column together, enabling much better compression (10x or more over JSON) and faster analytical queries that only read the needed columns. This is why data warehouses and analytics engines prefer it.
Q: Why does the tool use the first object in an array to infer the schema?
A: Parquet requires a fixed schema declared up front. The tool samples the first record to infer field types. If your dataset has missing fields in some records, enable the "Mark fields nullable" option to generate pa.field(..., nullable=True) for all fields, which tolerates missing values.
Q: What does "Detect ISO timestamps" do?
A: When enabled, strings matching the ISO 8601 pattern (e.g., "2024-01-15T10:30:00Z") are mapped to pa.timestamp('us') instead of pa.string(). This gives you a proper datetime column in Parquet, enabling time-based filtering and range queries in your data warehouse.
Q: What is "large_string" vs "string" in pyarrow?
A: pa.string() uses 32-bit offsets (supports strings up to ~2 GB total per column chunk). pa.large_string() uses 64-bit offsets and is needed for very large text columns. For most JSON data, the default pa.string() is sufficient.
Q: Can I use this schema with Apache Spark instead of pyarrow?
A: Yes. The Parquet message schema is framework-agnostic. For PySpark, translate the type mapping manually to StructType and StructField. The type correspondences are straightforward: pa.int64() → LongType(), pa.float64() → DoubleType(), pa.string() → StringType(), etc.