Handling Large XML Files in Python

Working with large XML files can be challenging because of memory limitations. In Python, the iterparse() function from the xml.etree.ElementTree module enables incremental parsing, so a document can be processed piece by piece instead of being loaded whole. This guide explains how to use iterparse() to process large XML files in a memory-efficient way.

1. Sample Large XML File

Here’s an example of a large XML file containing multiple student records:

<Root>
    <Row>
        <id>1</id>
        <name>John Deo</name>
        <class>Four</class>
        <mark>75</mark>
        <gender>female</gender>
    </Row>
    <Row>
        <id>2</id>
        <name>Max Ruin</name>
        <class>Three</class>
        <mark>85</mark>
        <gender>male</gender>
    </Row>
    ...
</Root>

2. Using iterparse to Process Large XML Files

The iterparse() function processes an XML file incrementally, emitting events as elements are parsed, so the whole document never has to be held in memory at once.

Code Example

import xml.etree.ElementTree as ET

# Path to the large XML file
large_xml_file = "large_file.xml"

# Initialize an empty list to store processed data
data = []

# Use iterparse to parse the XML file incrementally
for event, elem in ET.iterparse(large_xml_file, events=("start", "end")):
    # Start of a <Row>: begin a new record
    if event == "start" and elem.tag == "Row":
        student_data = {}

    # End of a child element: store its text in the current record
    elif event == "end" and elem.tag in ("id", "name", "class", "mark", "gender"):
        student_data[elem.tag] = elem.text

    # End of a <Row>: the record is complete, so keep it and free memory
    elif event == "end" and elem.tag == "Row":
        data.append(student_data)
        elem.clear()  # clear the processed element to release memory

# Print a summary of the processed data
print(f"Total students processed: {len(data)}")
print("Sample data:", data[:5])

Explanation:

  • ET.iterparse(): Parses the XML file incrementally using start and end events.
  • elem.clear(): Frees memory by clearing each processed <Row> element once its data has been stored (a leaner end-events-only variant is sketched after this list).
  • student_data: A dictionary to store each row's data.
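
If you only need complete rows, a leaner variant (not part of the example above) listens for "end" events alone and also clears the root element, so it does not keep references to rows that are already processed. A minimal sketch, reusing the same large_file.xml name:

import xml.etree.ElementTree as ET

data = []

# Build the iterator first so the root element can be captured
context = ET.iterparse("large_file.xml", events=("start", "end"))
_, root = next(context)  # the first event is the start of <Root>

for event, elem in context:
    if event == "end" and elem.tag == "Row":
        # Child "end" events have already fired, so all children are available
        data.append({child.tag: child.text for child in elem})
        root.clear()  # also drop the references <Root> keeps to finished rows

print(f"Total students processed: {len(data)}")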

3. Advantages of iterparse for Large Files

  • Processes files incrementally, avoiding memory overload.
  • Efficient for datasets that cannot fit into memory.
  • Allows parsing and processing in chunks, as shown in the sketch after this list.
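
As an illustration of chunked processing, the sketch below writes completed rows to a CSV file in batches of 1,000 instead of collecting them all in a list. The batch size and the students.csv output name are arbitrary choices for this sketch:

import csv
import xml.etree.ElementTree as ET

BATCH_SIZE = 1000  # illustrative chunk size
fieldnames = ["id", "name", "class", "mark", "gender"]
batch = []

with open("students.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for event, elem in ET.iterparse("large_file.xml", events=("end",)):
        if elem.tag == "Row":
            batch.append({child.tag: child.text for child in elem})
            elem.clear()  # free memory held by the processed row
            if len(batch) >= BATCH_SIZE:
                writer.writerows(batch)  # flush the chunk to disk
                batch = []
    if batch:
        writer.writerows(batch)  # write any remaining rows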

4. Creating a Large XML File from a Small File

In some cases, you may need to generate a large XML file for testing or other purposes. This example demonstrates how to use Python to extend a small XML file by duplicating and modifying its records, producing a large file with 10,000 rows.

Code Example

import xml.etree.ElementTree as ET
import random

# Path to the existing small XML file
input_file = "small_file.xml"

# Path to save the extended large XML file
output_file = "large_file.xml"

# Load the existing XML file
tree = ET.parse(input_file)
root = tree.getroot()

# Count the current number of records
current_count = len(root.findall("Row"))

# Calculate the number of new records needed
required_count = 10000
additional_records = required_count - current_count

# Function to generate a new <Row> element based on an existing one
def generate_new_row(row_id, base_row):
    new_row = ET.Element("Row")
    for child in base_row:
        new_child = ET.SubElement(new_row, child.tag)
        if child.tag == "id":
            new_child.text = str(row_id)  # unique id for the new row
        elif child.tag == "name":
            new_child.text = f"{child.text} {row_id}"  # base name plus the row id
        elif child.tag == "class":
            new_child.text = random.choice(["One", "Two", "Three", "Four", "Five"])
        elif child.tag == "mark":
            new_child.text = str(random.randint(50, 100))
        elif child.tag == "gender":
            new_child.text = random.choice(["male", "female"])
    return new_row

# Get the first Row as a base for generating new records
base_row = root.find("Row")

# Generate and append new records
for i in range(current_count + 1, required_count + 1):
    new_row = generate_new_row(i, base_row)
    root.append(new_row)

# Save the extended XML file
tree.write(output_file, encoding="utf-8", xml_declaration=True)

print(f"XML file '{output_file}' created successfully with {required_count} records!")

Explanation:

  • Base Row: The first <Row> is used as a template for generating new rows.
  • Unique IDs: Each new row is assigned a unique <id>.
  • Randomized Data: The <name>, <class>, <mark>, and <gender> fields are randomized to make the data appear unique.
  • Appending Rows: New rows are added to the root element (<Root>), and the XML file is saved with the updated structure (a streaming alternative for much larger files is sketched after the sample output below).

Sample Output:

For a small input file:

<Root>
    <Row>
        <id>1</id>
        <name>John Deo</name>
        <class>Four</class>
        <mark>75</mark>
        <gender>female</gender>
    </Row>
</Root>

Since the new values are randomized, the output will look similar to the following:

<Root>
    <Row>
        <id>1</id>
        <name>John Deo</name>
        <class>Four</class>
        <mark>75</mark>
        <gender>female</gender>
    </Row>
    <Row>
        <id>2</id>
        <name>John Deo 2</name>
        <class>Five</class>
        <mark>88</mark>
        <gender>male</gender>
    </Row>
    ...
</Root>
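
The script above keeps the whole extended tree in memory until tree.write() saves it, which is fine at 10,000 rows. To generate a file far larger than that, one option is to stream rows straight to disk instead; the sketch below does so with a hypothetical huge_file.xml name and made-up field values:

import xml.etree.ElementTree as ET
import random

required_count = 1_000_000  # illustrative target size

with open("huge_file.xml", "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="utf-8"?>\n<Root>\n')
    for i in range(1, required_count + 1):
        # Build one <Row> at a time and write it out immediately
        row = ET.Element("Row")
        ET.SubElement(row, "id").text = str(i)
        ET.SubElement(row, "name").text = f"Student {i}"
        ET.SubElement(row, "class").text = random.choice(["One", "Two", "Three", "Four", "Five"])
        ET.SubElement(row, "mark").text = str(random.randint(50, 100))
        ET.SubElement(row, "gender").text = random.choice(["male", "female"])
        out.write("    " + ET.tostring(row, encoding="unicode") + "\n")
    out.write("</Root>\n")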

Conclusion

Using iterparse() from ElementTree, Python can handle large XML files efficiently by processing them incrementally. This technique is ideal for datasets that are too large to fit in memory.

