First, I want to thank you for this great library!

I need to merge hundreds of small Parquet files into bigger ones. Sadly, they do not all share the same schema (e.g. some have missing columns), nor is the schema known at compile time.
I am wondering what the most efficient way would be to get only the schema of a Parquet file.
Currently I am looking at the first RowParquetRecord, but it might contain NullValues.
Furthermore, I am interested in whether there is a complete list of how to properly map Scala types to fields, like this: Types.primitive(INT32, OPTIONAL).as(LogicalTypeAnnotation.dateType()).named("Birthday")
Thanks
Parquet4s doesn't expose the file schema in its own API (that is something that could be added). However, you can easily access it by calling the original Java API that Parquet4s uses under the hood. Check org.apache.parquet.hadoop.ParquetFileReader, e.g.:
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.MessageType

val reader = ParquetFileReader.open(inputFile, readerOptions)
try {
  val schema: MessageType = reader.getFileMetaData.getSchema
  // ...
} finally reader.close()
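Building on that, the approach above can be extended to both of the original questions: reading only the footer of each file and unioning the resulting schemas handles files with missing columns, and the same Types builder shows the common Scala-type-to-field mappings. This is a minimal sketch, assuming parquet-hadoop (and its Hadoop dependencies) on the classpath; mergedSchemaOf, the field names, and the exact set of mappings are illustrative, not part of Parquet4s:

```scala
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.InputFile
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
import org.apache.parquet.schema.Type.Repetition.OPTIONAL

// Read only the footer metadata of a file to get its schema,
// without touching any row data.
def schemaOf(file: InputFile): MessageType = {
  val reader = ParquetFileReader.open(file)
  try reader.getFileMetaData.getSchema
  finally reader.close()
}

// Union the per-file schemas so that files with missing columns
// still fit into one merged schema.
def mergedSchemaOf(files: Seq[InputFile]): MessageType =
  files.map(schemaOf).reduce(_ union _)

// Common Scala-type-to-field mappings, built with the same Types builder
// (all OPTIONAL here, so an absent column simply yields a null value):
val schema: MessageType = Types.buildMessage()
  .addField(Types.primitive(INT32, OPTIONAL).named("anInt"))      // Int
  .addField(Types.primitive(INT64, OPTIONAL).named("aLong"))      // Long
  .addField(Types.primitive(BOOLEAN, OPTIONAL).named("aBool"))    // Boolean
  .addField(Types.primitive(DOUBLE, OPTIONAL).named("aDouble"))   // Double
  .addField(Types.primitive(BINARY, OPTIONAL)
    .as(LogicalTypeAnnotation.stringType()).named("aString"))     // String
  .addField(Types.primitive(INT32, OPTIONAL)
    .as(LogicalTypeAnnotation.dateType()).named("birthday"))      // java.time.LocalDate
  .named("merged")
```

Marking every field OPTIONAL is deliberate here: when the input files have differing columns, a required field that is missing from one file would make that file unreadable under the merged schema.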