AWS Logo
Menu

Nested XML Processing - AWS Glue with DynamicFrames, Relationalize, and Databricks Spark-XML

This blog tackles efficient methods for reading complex XML structures with dynamic data types. We explore three approaches: 1. AWS Glue & DynamicFrame: Reads XML, adapts to data types (arrays, structs), but complexity increases with nested structures.2. AWS Glue, Databricks & Relationalize: Flattens nested data during reading using Databricks Spark-XML for precise control. 3. AWS Glue & Defined Schema: Declares schema upfront for straightforward reading of multi-nested data, but requires schema knowledge.

Published May 4, 2024

repository,Introduction:

Nowadays, XML files are the go-to option for storing and exchanging data for many industries, including finance, books, and healthcare. Though XML files are widely used, analysing and processing them can be difficult, particularly when dealing with highly nested schemas, various nested data structures, and dynamic changes in data types (also known as schema evolution).
This blog post will explore how we can address these challenges using AWS Glue, DynamicFrames, Relationalize, and Databricks Spark XML.

Solution Overview:

Method 1: By using AWS Glue and Dynamic Frame

To handle XML files, using AWS Glue and DynamicFrame is a common method. We may investigate the data types received from the source, whether they are ArrayType, StructType, or a combination of the two, by reading XML files with DynamicFrame. This approach, however, necessitates closely examining every tag and processing records appropriately. The code gets more complex as the files get bigger and the schema gets more complicated, especially with heavily nested schemas.

Prerequisites

1) Get the sample files from the link. GitRepo. Upload the sample files in the S3 Bucket.
Upload the sample files to the S3 bucket.
File Upload in S3
File Upload in S3
2) Launch Glue Studio and choose the Notebook option.
Glue Notebook
Glue Notebook
3) Here’s a Python code snippet using AWS Glue to read an XML file using DynamicFrame and print its schema structure:
Schema Output:
The below schema has both Array Type and Struct Type
Schema Structure
Schema Structure
After you’ve determined the schema structure, extract the values and display them in a tabular style using the following Python snippet. This simplified method will provide a smooth integration into your intended tables.
Result:
Final Output
Final Output
This method works well even when the schema structure isn’t stated explicitly. However, the coding complexity dramatically rises with the amount of tags and hierarchical schema structures.

Method 2: By using Aws Glue, Databricks library and Relationalize.

Relationalize will flatten the nested structure. While reading the XML files using the DataBricks Spark-Xml, it will provide more precise control over parsing complex XML structures. SampleFile

Prerequisite

1) Download the Databricks Spark-XML JAR File: Obtain the Databricks Spark-XML JAR file from the MVN site. MVNRepository
2) Upload the JAR File to an S3 Bucket: Upload the downloaded JAR file to an S3 bucket that your AWS Glue job can access.
S3_Databricks_jarfile
S3_Databricks_jarfile
3) Configure Your AWS Glue Notebook: In the first cell of your AWS Glue notebook, add the following code to ensure that the Databricks Spark-XML library is available to your job
You can observe how DataBricks Spark-XML reads XML more effectively than dynamicframe work in the example below.
Schema for Dynamic Frame Work
Dynamic Frame Schema Structure
Dynamic Frame Schema Structure
Schema for Databricks-Spark Xml
Databrick Schema Structure
Databrick Schema Structure
The below snippet is to read the XML file using DataBrick spark-XML
The relationalize function will flatten the nested structure, storing it as keys in a tabular format. Utilize the following snippet to achieve the flattening. The relationalize needs the S3 path to store the flattened structure
Output
Relationalize Key
Relationalize Key
Use the below snippet to read the root key and the subkeys.
Output
Root Result
Root Key Result
The next step is to join the root key and the other sub-keys. The root key will have the reference identifier in the sub-keys table
For example, the root key identifier “DescriptiveDetail.Language” and the subkey identifier “id”. In the below code, we will be joining the keys to extract the flattening data
Final Result
FinalResult
FinalResult
To include the Databricks Spark-XML JAR file in the AWS Glue job, you need to specify the S3 bucket path and provide configuration details in the job parameters as illustrated below.
”–conf”: “spark.jars.packages=com.databricks:spark-xml_2.12:0.13.0”
Glue Library
Glue Library

Method 3: Using AWS Glue,Dynamic Frame and Declaring the Schema.

In this approach, we’ll employ the dynamic frame framework to read the XML, wherein we’ll assign the nested schema to the defined schema structure. This allows us to understand the structure beforehand, facilitating the straightforward reading of multi-nested files.
Output
Schema Result
Schema Result

Summary:

This blog extensively explores the efficient methods for reading multi-nested schema structures, including dynamic data type changes, through the utilization of DynamicFrame, Relationalize, Databricks Spark-XML, and schema definition.
 

Comments