Pandas read_xml(): Import XML Data into a DataFrame

The `read_xml()` function in Pandas (available since version 1.3) reads XML data into a DataFrame. It works best with shallow, table-like XML, and parameters such as `xpath`, `attrs_only`, and `namespaces` control which nodes and fields are imported.
Download sample XML file: student.xml
---

Basic Example

import pandas as pd
df = pd.read_xml('E:\\testing\\data\\student.xml', parser='etree')
print(df.head())
### Output:

   id        name  class  mark  gender
0   1    John Deo   Four    75  female
1   2    Max Ruin  Three    85    male
2   3      Arnold  Three    55    male
3   4  Krish Star   Four    60  female
4   5   John Mike   Four    60  female
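
For readers without the sample file, the following sketch reproduces the basic call on inline data. The XML string is an assumption modeled on the output above (a `<Students>` root with `<Row>` children is a guess at the layout of student.xml). It is wrapped in `StringIO` because recent pandas versions expect a path or file-like object rather than a raw XML string.

```python
from io import StringIO

import pandas as pd

# Illustrative data only: the real student.xml may be laid out differently.
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><class>Four</class><mark>75</mark><gender>female</gender></Row>
  <Row><id>2</id><name>Max Ruin</name><class>Three</class><mark>85</mark><gender>male</gender></Row>
  <Row><id>3</id><name>Arnold</name><class>Three</class><mark>55</mark><gender>male</gender></Row>
</Students>"""

# parser='etree' uses the standard library parser, so lxml is not required.
df = pd.read_xml(StringIO(xml_data), parser="etree")

print(df)
print(df.dtypes)  # shows which column types pandas inferred
```

Each `<Row>` becomes one DataFrame row, and each child element becomes a column.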
---

Parameters of read_xml()

The table below lists the main parameters of `read_xml()`:

| Parameter | Default Value | Description |
|---|---|---|
| path_or_buffer | (required) | The file path, URL, or file-like object containing the XML data. |
| xpath | './*' | An XPath expression selecting the nodes that become rows. |
| parser | 'lxml' | The XML parser to use: 'lxml' or 'etree' (standard library). |
| attrs_only | False | Parse only the attributes of the nodes selected by xpath. |
| namespaces | None | A dictionary mapping namespace prefixes to the URIs used in the XML. |
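
Because the default parser is lxml, a script may fail on machines where lxml is not installed. A minimal sketch of one way to handle this is shown below. The helper `pick_parser` is hypothetical (it is not part of pandas), and the inline XML is illustrative.

```python
from importlib.util import find_spec
from io import StringIO

import pandas as pd


def pick_parser() -> str:
    """Hypothetical helper: prefer lxml when installed, else fall back to etree."""
    return "lxml" if find_spec("lxml") is not None else "etree"


# Illustrative data, not the tutorial's sample file.
xml_data = (
    "<Students>"
    "<Row><id>1</id><name>John Deo</name></Row>"
    "<Row><id>2</id><name>Max Ruin</name></Row>"
    "</Students>"
)

parser = pick_parser()
# xpath='./*' is the default: every direct child of the root becomes a row.
df = pd.read_xml(StringIO(xml_data), xpath="./*", parser=parser)
print(parser)
print(df)
```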

Using XPath

Use XPath to extract only the `<Row>` nodes from the XML:
df = pd.read_xml('student.xml', xpath='.//Row')
print(df)
### Output:

   id        name  class  mark  gender
0   1    John Deo   Four    75  female
1   2    Max Ruin  Three    85    male
2   3      Arnold  Three    55    male
3   4  Krish Star   Four    60  female
4   5   John Mike   Four    60  female
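
XPath can also filter rows rather than only select a tag. The sketch below assumes the same `<Row>` layout (the inline XML is made up for illustration). It uses a child-value predicate, which falls within the limited XPath subset supported by the `etree` parser.

```python
from io import StringIO

import pandas as pd

# Illustrative data modeled on the output above.
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><gender>female</gender></Row>
  <Row><id>2</id><name>Max Ruin</name><gender>male</gender></Row>
  <Row><id>3</id><name>Arnold</name><gender>male</gender></Row>
</Students>"""

# Keep only <Row> nodes whose <gender> child has the text 'male'.
df_male = pd.read_xml(
    StringIO(xml_data),
    xpath=".//Row[gender='male']",
    parser="etree",
)
print(df_male)
```

More complex expressions (functions, axes) generally need the lxml parser.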
---

Including Attributes

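To pull only specific fields from each node, you can parse the file with lxml directly and build the DataFrame yourself. The example below reads the text of the `<id>` and `<class>` child elements; an attribute value would be read with `row.get('attribute_name')` instead.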

from lxml import etree
import pandas as pd

# Parse the XML file
tree = etree.parse('student.xml')

# Extract data from <Row> elements
students = []
for row in tree.xpath('.//Row'):  # Select all <Row> nodes
    students.append({
        'id': row.findtext('id'),         # Extract the text of <id>
        'class': row.findtext('class')   # Extract the text of <class>
    })

# Convert the extracted data to a DataFrame
df = pd.DataFrame(students)

print(df)
### Output:

   id  class
0   1   Four
1   2  Three
2   3  Three
3   4   Four
4   5   Four
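
If lxml is not needed for anything else, a similar result can be sketched with `read_xml()` followed by ordinary column selection. The inline XML is an assumption about the file layout. Unlike the manual loop, pandas infers column types here, so `id` comes back as an integer rather than a string.

```python
from io import StringIO

import pandas as pd

# Illustrative data; the real student.xml may differ.
xml_data = """<Students>
  <Row><id>1</id><name>John Deo</name><class>Four</class></Row>
  <Row><id>2</id><name>Max Ruin</name><class>Three</class></Row>
  <Row><id>3</id><name>Arnold</name><class>Three</class></Row>
</Students>"""

df = pd.read_xml(StringIO(xml_data), xpath=".//Row", parser="etree")

# Keep only the columns of interest.
subset = df[["id", "class"]]
print(subset)
```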
---

Using attrs_only to Extract Only Attributes

The attrs_only parameter in read_xml() extracts only the attributes of XML elements, ignoring their text content. This is especially useful when XML nodes contain both attributes and text. Here's an example:

import pandas as pd
from io import StringIO

# Example XML with attributes and text content
xml_data = '''
<Students>
    <Student id="1" name="John Deo" class="Four">Passed</Student>
    <Student id="2" name="Max Ruin" class="Three">Failed</Student>
    <Student id="3" name="Arnold" class="Three">Passed</Student>
</Students>
'''

# Passing a literal XML string is deprecated since pandas 2.1,
# so the string is wrapped in StringIO.

# Reading XML with attrs_only=True
df_attrs_only = pd.read_xml(StringIO(xml_data), parser='etree', attrs_only=True)
print("With attrs_only=True:")
print(df_attrs_only)

# Reading XML without attrs_only (default behavior)
df_default = pd.read_xml(StringIO(xml_data), parser='etree')
print("\nWithout attrs_only (default):")
print(df_default)

### Output:

With attrs_only=True:
   id      name  class
0   1  John Deo   Four
1   2  Max Ruin  Three
2   3    Arnold  Three

Without attrs_only (default):
   id      name  class  Student
0   1  John Deo   Four   Passed
1   2  Max Ruin  Three   Failed
2   3    Arnold  Three   Passed
---

Explanation:

  • XML Structure: Each <Student> element contains attributes (id, name, class) and text content (Passed or Failed).
  • With attrs_only=True:
    • Only the attributes (id, name, class) are included in the DataFrame.
    • The text content (Passed or Failed) is ignored.
  • Without attrs_only:
    • Both the attributes and the text content are included.
    • The text content is added to a new column named after the XML element (Student).
  • Use Case: Use attrs_only=True when the text content of XML nodes is not relevant, and only the attributes are needed.
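
`read_xml()` also has an `elems_only` parameter, which is the counterpart of `attrs_only`. The sketch below contrasts the two on made-up XML in which each `<Student>` carries both attributes and child elements.

```python
from io import StringIO

import pandas as pd

# Illustrative data: attributes on <Student>, values in child elements.
xml_data = """<Students>
  <Student id="1" class="Four"><name>John Deo</name><mark>75</mark></Student>
  <Student id="2" class="Three"><name>Max Ruin</name><mark>85</mark></Student>
</Students>"""

# Attributes only: columns come from id="..." and class="...".
attrs = pd.read_xml(StringIO(xml_data), parser="etree", attrs_only=True)

# Child elements only: columns come from <name> and <mark>.
elems = pd.read_xml(StringIO(xml_data), parser="etree", elems_only=True)

print(attrs)
print(elems)
```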
---

Using namespaces to Handle Namespaced XML

The namespaces parameter in read_xml() allows you to parse XML files that use namespaces. Here's an example:

import pandas as pd
from io import StringIO

# Example XML with namespaces
xml_data = '''
<ns:Students xmlns:ns="http://example.com/ns">
    <ns:Student id="1" name="John Deo" class="Four">Passed</ns:Student>
    <ns:Student id="2" name="Max Ruin" class="Three">Failed</ns:Student>
    <ns:Student id="3" name="Arnold" class="Three">Passed</ns:Student>
</ns:Students>
'''

# Define the namespace
namespaces = {"ns": "http://example.com/ns"}

# Reading XML with the namespace (StringIO avoids the deprecated literal-string input)
df = pd.read_xml(StringIO(xml_data), xpath=".//ns:Student", namespaces=namespaces)
print(df)

### Output:

   id      name  class  Student
0   1  John Deo   Four   Passed
1   2  Max Ruin  Three   Failed
2   3    Arnold  Three   Passed

Explanation:

  • XML with Namespace:
    • The XML uses the namespace prefix ns with the URI http://example.com/ns.
    • Elements like <ns:Student> and <ns:Students> are namespaced.
  • Defining the Namespace:
    • The namespaces parameter is a dictionary where keys are prefixes (e.g., ns) and values are their respective URIs (e.g., http://example.com/ns).
  • Using XPath with Namespace:
    • Use the xpath parameter with the namespace prefix (e.g., .//ns:Student) to target specific elements.
  • Output:
    • The resulting DataFrame includes attributes (id, name, class) and text content (Passed or Failed).
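
A document may also declare a default namespace with no prefix. In that case a temporary prefix of your choosing can be mapped to the URI. In this sketch `doc` is an arbitrary name and the XML is illustrative.

```python
from io import StringIO

import pandas as pd

# Illustrative XML using a default (unprefixed) namespace.
xml_data = """<Students xmlns="http://example.com/ns">
  <Student id="1" name="John Deo" class="Four"/>
  <Student id="2" name="Max Ruin" class="Three"/>
  <Student id="3" name="Arnold" class="Three"/>
</Students>"""

# 'doc' is a temporary prefix chosen for this call; only the URI must match.
namespaces = {"doc": "http://example.com/ns"}

df = pd.read_xml(
    StringIO(xml_data),
    xpath=".//doc:Student",
    namespaces=namespaces,
    parser="etree",
)
print(df)
```

Without the prefix mapping, an XPath such as `.//Student` would not match, because the elements belong to the namespace.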

Handling Nested XML

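`read_xml()` is designed for shallow documents: for each node selected by `xpath`, it reads only that node's attributes and direct child elements. When the XML contains nested tags, such as scores grouped within a `<marks>` tag, the nested values are not expanded into separate columns by a plain call like this one:
import pandas as pd
df = pd.read_xml('nested_student.xml')
print(df)
To reach the nested values, point `xpath` at the nested level (for example `xpath='.//marks'`), flatten the document first with the `stylesheet` parameter (XSLT, requires lxml), or build the rows yourself with an XML parser. A flattened result has one column per nested score:
### Output after flattening:

   id      name  marks.score1  marks.score2
0   1  John Deo          75.0           NaN
1   2  Max Ruin          85.0          80.0
2   3    Arnold          55.0          90.0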
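
Since `read_xml()` does not flatten nested elements on its own, one way to obtain columns such as `marks.score1` is to walk the tree manually. The sketch below uses the standard library's ElementTree. The XML layout (`<Student>` rows with a nested `<marks>` element) is an assumption, because nested_student.xml is not shown here.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Assumed layout of the nested file, for illustration only.
xml_data = """<Students>
  <Student><id>1</id><name>John Deo</name>
    <marks><score1>75</score1></marks></Student>
  <Student><id>2</id><name>Max Ruin</name>
    <marks><score1>85</score1><score2>80</score2></marks></Student>
  <Student><id>3</id><name>Arnold</name>
    <marks><score1>55</score1><score2>90</score2></marks></Student>
</Students>"""

root = ET.fromstring(xml_data)

rows = []
for student in root.findall("Student"):
    row = {
        "id": int(student.findtext("id")),
        "name": student.findtext("name"),
    }
    marks = student.find("marks")
    if marks is not None:
        # Prefix each nested tag with its parent name, e.g. marks.score1.
        for score in marks:
            row[f"marks.{score.tag}"] = float(score.text)
    rows.append(row)

# Scores missing for a student become NaN in the DataFrame.
df = pd.DataFrame(rows)
print(df)
```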
---

©2000-2024 plus2net.com All rights reserved worldwide