read_xml() Function: Read Data from XML Files into Pandas DataFrame

The `read_xml()` function in Pandas (available since version 1.3) reads data from an XML file into a DataFrame. Each node selected by an XPath expression becomes a row, with its attributes and child elements as columns. Options cover parser choice, attribute-only parsing, and namespaces. The function works best on shallow, table-like XML.
Download sample XML file: student.xml
---

Basic Example

import pandas as pd
df = pd.read_xml('E:\\testing\\data\\student.xml', parser='etree')
print(df.head())

### Output:


   id        name  class  mark  gender
0   1    John Deo   Four    75  female
1   2    Max Ruin  Three    85    male
2   3      Arnold  Three    55    male
3   4  Krish Star   Four    60  female
4   5   John Mike   Four    60  female
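
If the sample file is not at hand, the same call can be tried on an in-memory document. The sketch below is only an illustration: the `<Students>`/`<Row>` layout is an assumption about how student.xml is organised, and the real file may differ.

```python
from io import StringIO

import pandas as pd

# Assumed layout: a root element with one <Row> per student.
# The real student.xml may be organised differently.
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><class>Four</class><mark>75</mark><gender>female</gender></Row>
  <Row><id>2</id><name>Max Ruin</name><class>Three</class><mark>85</mark><gender>male</gender></Row>
</Students>"""

# StringIO wraps the string in a file-like object that read_xml() can consume
df = pd.read_xml(StringIO(xml_data), parser="etree")

print(df)
print(df.dtypes)  # numeric-looking columns such as id and mark are inferred as integers
```

Wrapping the string in `StringIO` avoids depending on a file path and works the same way as reading from disk.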

---

Parameters of read_xml()

The table below lists some commonly used parameters of `read_xml()`:

Parameter	Default Value	Description
`path_or_buffer`	(required)	The file path, URL, or file-like object containing the XML data.
`xpath`	'./*'	An XPath expression that selects the nodes to turn into rows. The default selects every direct child of the root element.
`parser`	'lxml'	The XML parser to use: `lxml` (requires the lxml package) or `etree` (Python's built-in ElementTree, which supports only a subset of XPath).
`attrs_only`	False	Parse only the attributes at the specified xpath.
`namespaces`	None	A dictionary mapping namespace prefixes to URIs, for XML that uses namespaces.
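
Because the default parser depends on the optional lxml package, a script may want to fall back to `etree` when lxml is missing. The following is a minimal sketch of that idea, not a required pattern; the inline XML and its element names are made up for illustration.

```python
import importlib.util
from io import StringIO

import pandas as pd

# Illustrative data only; element names are assumptions
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name></Row>
  <Row><id>2</id><name>Max Ruin</name></Row>
</Students>"""

# Use lxml when it is installed, otherwise the standard-library ElementTree
parser = "lxml" if importlib.util.find_spec("lxml") else "etree"

# xpath="./*" is the default, written out here to show what it selects:
# every direct child of the root element, one DataFrame row per child
df = pd.read_xml(StringIO(xml_data), xpath="./*", parser=parser)
print(parser)
print(df)
```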

Using XPath

Use XPath to extract only the `<Row>` nodes from the XML:

df = pd.read_xml('student.xml', xpath='.//Row')
print(df)

### Output:


   id        name  class  mark  gender
0   1    John Deo   Four    75  female
1   2    Max Ruin  Three    85    male
2   3      Arnold  Three    55    male
3   4  Krish Star   Four    60  female
4   5   John Mike   Four    60  female
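
XPath can also narrow the selection rather than just name the row element. The sketch below assumes a hypothetical document in which `<Row>` nodes sit below a `<Students>` wrapper next to unrelated metadata. It uses a `[tag='text']` predicate, which is part of the XPath subset that the `etree` parser supports.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: rows are nested below a wrapper, next to other data
xml_data = """<School>
  <Info><title>Term results</title></Info>
  <Students>
    <Row><id>1</id><name>John Deo</name><class>Four</class></Row>
    <Row><id>2</id><name>Max Ruin</name><class>Three</class></Row>
    <Row><id>3</id><name>Arnold</name><class>Three</class></Row>
  </Students>
</School>"""

# Every <Row>, wherever it sits in the tree; <Info> is skipped
all_rows = pd.read_xml(StringIO(xml_data), xpath=".//Row", parser="etree")

# Only <Row> nodes whose <class> child contains the text "Three"
class_three = pd.read_xml(
    StringIO(xml_data), xpath=".//Row[class='Three']", parser="etree"
)

print(all_rows)
print(class_three)
```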

---

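Extracting Specific Fields

If you need only some of the fields in each node, you can parse the file with lxml yourself, pick out the child elements you want, and convert the result into a Pandas DataFrame. The example below keeps only the text of the `<id>` and `<class>` elements.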


from lxml import etree
import pandas as pd

# Parse the XML file
tree = etree.parse('student.xml')

# Extract data from <Row> elements
students = []
for row in tree.xpath('.//Row'):  # Select all <Row> nodes
    students.append({
        'id': row.findtext('id'),         # Extract the text of <id>
        'class': row.findtext('class')   # Extract the text of <class>
    })

# Convert the extracted data to a DataFrame
df = pd.DataFrame(students)

print(df)

### Output:


   id   class
0   1   Four
1   2  Three
2   3  Three
3   4   Four
4   5   Four
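
A similar result can usually be reached without lxml by letting `read_xml()` load every field and then selecting columns from the DataFrame. This is a sketch under the assumption that each `<Row>` holds plain child elements; the inline data is illustrative.

```python
from io import StringIO

import pandas as pd

# Illustrative data; the layout is an assumption
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><class>Four</class><mark>75</mark></Row>
  <Row><id>2</id><name>Max Ruin</name><class>Three</class><mark>85</mark></Row>
  <Row><id>3</id><name>Arnold</name><class>Three</class><mark>55</mark></Row>
</Students>"""

# Load all fields, then keep only the columns of interest
df = pd.read_xml(StringIO(xml_data), xpath=".//Row", parser="etree")
subset = df[["id", "class"]]

print(subset)
```

One difference from the manual approach: `read_xml()` infers column types, so `id` arrives as an integer here, while `findtext()` returns strings.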

---

Using attrs_only to Extract Only Attributes

The attrs_only parameter in read_xml() extracts only the attributes of the selected XML elements, ignoring their text content and child elements. This is especially useful when XML nodes contain both attributes and text. Here's an example:


from io import StringIO

import pandas as pd

# Example XML with attributes and text content
xml_data = '''
<Students>
    <Student id="1" name="John Deo" class="Four">Passed</Student>
    <Student id="2" name="Max Ruin" class="Three">Failed</Student>
    <Student id="3" name="Arnold" class="Three">Passed</Student>
</Students>
'''

# Wrap the string in StringIO: passing a literal XML string
# to read_xml() is deprecated since pandas 2.1

# Reading XML with attrs_only=True
df_attrs_only = pd.read_xml(StringIO(xml_data), parser='etree', attrs_only=True)
print("With attrs_only=True:")
print(df_attrs_only)

# Reading XML without attrs_only (default behavior)
df_default = pd.read_xml(StringIO(xml_data), parser='etree')
print("\nWithout attrs_only (default):")
print(df_default)

### Output:

With attrs_only=True:


   id       name   class
0   1   John Deo    Four
1   2   Max Ruin   Three
2   3     Arnold   Three

Without attrs_only (default):


   id       name   class   Student
0   1   John Deo    Four    Passed
1   2   Max Ruin   Three    Failed
2   3     Arnold   Three    Passed

---

Explanation:

XML Structure: Each <Student> element contains attributes (id, name, class) and text content (Passed or Failed).
With attrs_only=True:
- Only the attributes (id, name, class) are included in the DataFrame.
- The text content (Passed or Failed) is ignored.
Without attrs_only:
- Both the attributes and the text content are included.
- The text content is added to a new column named after the XML element (Student).
Use Case: Use attrs_only=True when the text content of XML nodes is not relevant, and only the attributes are needed.
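
The same comparison can be made when nodes carry child elements instead of text. The sketch below uses made-up data and also shows the related `elems_only` parameter of `read_xml()`, which keeps only the child elements; it is meant as an illustration of how the three modes differ, not as a recommendation.

```python
from io import StringIO

import pandas as pd

# Illustrative data: each <Student> has attributes and child elements
xml_data = """<Students>
  <Student id="1" class="Four"><name>John Deo</name><mark>75</mark></Student>
  <Student id="2" class="Three"><name>Max Ruin</name><mark>85</mark></Student>
</Students>"""

# Attributes only: id, class
df_attrs = pd.read_xml(StringIO(xml_data), parser="etree", attrs_only=True)

# Child elements only: name, mark
df_elems = pd.read_xml(StringIO(xml_data), parser="etree", elems_only=True)

# Default: attributes followed by child elements
df_all = pd.read_xml(StringIO(xml_data), parser="etree")

print(list(df_attrs.columns))
print(list(df_elems.columns))
print(list(df_all.columns))
```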

---

Using namespaces to Handle Namespaced XML

The namespaces parameter in read_xml() allows you to parse XML files that use namespaces. Here's an example:


from io import StringIO

import pandas as pd

# Example XML with namespaces
xml_data = '''
<ns:Students xmlns:ns="http://example.com/ns">
    <ns:Student id="1" name="John Deo" class="Four">Passed</ns:Student>
    <ns:Student id="2" name="Max Ruin" class="Three">Failed</ns:Student>
    <ns:Student id="3" name="Arnold" class="Three">Passed</ns:Student>
</ns:Students>
'''

# Define the namespace
namespaces = {"ns": "http://example.com/ns"}

# Reading XML with the namespace (default parser, so lxml must be installed).
# StringIO is used because passing a literal XML string is deprecated since pandas 2.1
df = pd.read_xml(StringIO(xml_data), xpath=".//ns:Student", namespaces=namespaces)
print(df)

### Output:


   id       name   class Student
0   1   John Deo    Four  Passed
1   2   Max Ruin   Three  Failed
2   3     Arnold   Three  Passed

Explanation:

XML with Namespace:
- The XML uses the namespace prefix ns with the URI http://example.com/ns.
- Elements like <ns:Student> and <ns:Students> are namespaced.
Defining the Namespace:
- The namespaces parameter is a dictionary where keys are prefixes (e.g., ns) and values are their respective URIs (e.g., http://example.com/ns).
Using XPath with Namespace:
- Use the xpath parameter with the namespace prefix (e.g., .//ns:Student) to target specific elements.
Output:
- The resulting DataFrame includes attributes (id, name, class) and text content (Passed or Failed).
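
Documents often declare a default namespace with no prefix at all (`xmlns="..."`). In that case an unprefixed path such as `.//Student` would typically match nothing, and one workaround is to assign a prefix of your own in the `namespaces` dictionary. The sketch below assumes such a document; the prefix `s` is arbitrary and exists only in the dictionary and the XPath expression.

```python
from io import StringIO

import pandas as pd

# Illustrative document with a default (unprefixed) namespace
xml_data = """<Students xmlns="http://example.com/ns">
  <Student><id>1</id><name>John Deo</name></Student>
  <Student><id>2</id><name>Max Ruin</name></Student>
</Students>"""

# "s" is a prefix chosen here; only the URI has to match the document
namespaces = {"s": "http://example.com/ns"}

df = pd.read_xml(
    StringIO(xml_data),
    xpath=".//s:Student",
    namespaces=namespaces,
    parser="etree",
)
print(df)  # column names come out without the namespace URI
```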

Handling Nested XML

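When the XML contains nested tags, such as scores grouped within a `<marks>` tag, check how `read_xml()` treats them. The function is designed for shallow documents and reads the attributes and immediate children of the nodes selected by `xpath`, so how nested values appear depends on the file's structure and your pandas version: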

import pandas as pd
df = pd.read_xml('nested_student.xml')
print(df)

### Output:


   id         name  marks.score1  marks.score2
0   1    John Deo           75.0           NaN
1   2    Max Ruin           85.0          80.0
2   3      Arnold           55.0          90.0
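
If `read_xml()` does not give one column per nested value for your file, one option is to flatten the nodes yourself before building the DataFrame. The sketch below is an assumption-laden illustration: the XML layout is guessed from the column names above, and `flatten()` is a hypothetical helper written for this example, not part of pandas.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Assumed layout of a nested file; the real nested_student.xml may differ
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><marks><score1>75</score1></marks></Row>
  <Row><id>2</id><name>Max Ruin</name><marks><score1>85</score1><score2>80</score2></marks></Row>
  <Row><id>3</id><name>Arnold</name><marks><score1>55</score1><score2>90</score2></marks></Row>
</Students>"""


def flatten(elem, prefix=""):
    """Hypothetical helper: map each leaf under elem to a dotted key."""
    out = {}
    for child in elem:
        key = f"{prefix}{child.tag}"
        if len(child):  # the child has children of its own, so recurse
            out.update(flatten(child, key + "."))
        else:
            out[key] = child.text
    return out


root = ET.fromstring(xml_data)
df = pd.DataFrame([flatten(row) for row in root.findall(".//Row")])

# Leaf values are strings; convert the numeric columns explicitly.
# A score missing from a row becomes NaN.
num_cols = ["id", "marks.score1", "marks.score2"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

print(df)
```

For larger or more irregular documents, the `stylesheet` parameter of `read_xml()` (XSLT, lxml parser only) is another way to reshape the XML before parsing.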

---

Questions

What is the purpose of the read_xml() function in Pandas?
How can you use XPath to extract specific nodes from an XML file?
What are some common attributes of the read_xml() function?
Provide an example of reading XML data with specific attributes.
What is the purpose of the attrs_only attribute in read_xml()?
In which scenarios would you use attrs_only=True?
Provide an example where XML nodes contain both attributes and text. What changes in the DataFrame when attrs_only is used?
How does the attrs_only attribute help in simplifying XML data extraction?
What is the purpose of the namespaces parameter in read_xml()?
Provide an example of reading XML with namespaces using read_xml().
How does the xpath parameter work in conjunction with namespaces?
Why is defining namespaces important when working with namespaced XML?

Subhendu Mohapatra

Author
