Python print xml tree structure

Hi,

I have a large multi-level XML document of a complicated structure, without any namespace definition.
I would like to generate a simplified tree view of its structure, so that every possible element from the XML is shown and only once.

As a simplified example take this XML:

<data>
	<timestamp>...</timestamp>
	<people>
		<person>
			<name>...</name>
			<age>...</age>
		</person>
		<person>
			<name>...</name>
			<age>...</age>
			<degree />
		</person>
		<person>
			<name>...</name>
			<age>...</age>
			<degree />
			<siblings>
				<brother>...</brother>
				<brother>...</brother>
				<sister>...</sister>
			</siblings>			
		</person>
	</people>
	<cities>
		<city>
			<name>...</name>
			<country>...</country>
			<continent>...</continent>
			<capital />
		</city>
		<city>
			<name>...</name>
			<country>...</country>
			<continent>...</continent>
		</city>
	</cities>
</data>

Using Python I would like to generate a view of its structure, looking something like this:

-data-
	-timestamp-
	-people-
		-person-
			-name-
			-age-
			-degree-
			-siblings-
				-brother-
				-sister-
	-cities-
		-city-
			-name-
			-country-
			-continent-
			-capital-

So, basically I am not interested in the values, or how many elements of the same type are in the XML, etc.
I only want to see which elements are in there.

I know there might be visual tools to achieve this, but I also need to be able to generate such a tree view directly inside a Python script.

Thanks for any ideas.


lxml has etree (with a pretty_print option); see: http://lxml.de/api.html


pretty_print does not help me, as it shows everything exactly as it is in the XML. That's exactly what I want to avoid. I need no duplicates and no values or attributes, only the very basic tree structure.


You can look here to see what's available as packages: https://pypi.python.org/pypi?%3Aaction=s...mit=search
You may have to write it yourself if you can't find what you're looking for.


Just getting the tag names works fine in both lxml and BeautifulSoup.
Keeping the structure in the output can be a challenge,
as I do not think either pretty_print (lxml) or prettify() (BS) helps for this kind of text output.

Example getting tag names:

from lxml import etree
from bs4 import BeautifulSoup

xml = '''\
<data>
    <timestamp>...</timestamp>
    <people>
        <person>
            <name>...</name>
            <age>...</age>
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
            <siblings>
                <brother>...</brother>
                <brother>...</brother>
                <sister>...</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
            <capital />
        </city>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
        </city>
    </cities>
</data>
'''

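root = etree.fromstring(xml)
# Use the XML parser; the 'lxml' HTML parser would wrap the document in <html><body>
soup = BeautifulSoup(xml, 'xml')

# lxml: iterate over every element in document order
for node in root.iter('*'):
    print(node.tag)

# BS: find_all(True) matches every tag (findChildren is a legacy alias)
for tag in soup.find_all(True):
    print(tag.name)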

Output:

data
timestamp
people
person
name
age
person
name
age
degree
person
name
age
degree
siblings
brother
brother
sister
cities
city
name
country
continent
capital
city
name
country
continent
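One way to keep the structure and drop the duplicates is to merge same-named siblings while walking the tree recursively. The following is only a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the helper names merge_structure and format_structure and the shortened sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def merge_structure(elements, tree=None):
    """Merge same-named elements into one nested dict of tag names."""
    if tree is None:
        tree = {}
    for element in elements:
        # dicts keep insertion order, so tags appear in first-seen order
        children = tree.setdefault(element.tag, {})
        merge_structure(list(element), children)
    return tree


def format_structure(tree, depth=0):
    """Return one '-tag-' line per unique element, indented with tabs."""
    lines = []
    for tag, children in tree.items():
        lines.append("\t" * depth + "-" + tag + "-")
        lines.extend(format_structure(children, depth + 1))
    return lines


# Shortened, hypothetical sample in the spirit of the XML from the question
xml = (
    "<data><timestamp/><people>"
    "<person><name/><age/></person>"
    "<person><name/><degree/></person>"
    "</people></data>"
)
root = ET.fromstring(xml)
structure_lines = format_structure(merge_structure([root]))
print("\n".join(structure_lines))
```

Because the children of all same-named elements are merged into a single dict, each tag is listed once under its parent, however many times it occurs in the document.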

wavic
So-and-so of the Yard


Pretty straightforward:

from lxml import etree
from collections import Counter

xml = '''\
<data>
    <timestamp>...</timestamp>
    <people>
        <person>
            <name>...</name>
            <age>...</age>
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
            <siblings>
                <brother>...</brother>
                <brother>...</brother>
                <sister>...</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
            <capital />
        </city>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
        </city>
    </cities>
</data>
'''

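root = etree.fromstring(xml)
# getpath() lives on the ElementTree, not on the element (see the follow-up post)
tree = etree.ElementTree(root)

for tag in root.iter():
    path = tree.getpath(tag)                    # e.g. /data/people/person[2]
    path = path.replace('/', '    ')            # four spaces per level
    spaces = Counter(path)
    tag_name = path.split()[-1].split('[')[0]   # last step, without the [n] index
    tag_name = ' ' * (spaces[' '] - 4) + tag_name
    print(tag_name)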

Output:

data
    timestamp
    people
        person
            name
            age
        person
            name
            age
            degree
        person
            name
            age
            degree
            siblings
                brother
                brother
                sister
    cities
        city
            name
            country
            continent
            capital
        city
            name
            country
            continent

wavic
So-and-so of the Yard


I forgot to put tree = etree.ElementTree(root) before the for loop.


Thanks to all of you for the tips!
They helped me to achieve my goal.

wavic - My aim was to have no duplicates, so your code was almost perfect, but I reworked it a bit to also include the attributes.

Here is my final code in case it helps somebody else as well.
I will use it any time I need to see clearly the structure of any XML file, to know all tags/attributes which I need to consider.

import re, collections
from lxml import etree
 
xml = '''\
<data>
    <timestamp>not important</timestamp>
    <people>
        <person name="Blue" given="John">
            <occupation>not important</occupation>
            <age>not important</age>
        </person>
        <person name="Green" given="Peter">
            <occupation>not important</occupation>
            <age>not important</age>
            <degree />
        </person>
        <person name="Red" given="Angela" maiden="Orange">
            <occupation fulltime="yes">not important</occupation>
            <age>not important</age>
            <birthday>not important</birthday>
            <degree />
            <siblings >
                <brother attrib1="no" attrib2="yes">not important</brother>
                <brother attrib1="yes">not important</brother>
                <sister>not important</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city name="Tokyo">
            <country>not important</country>
            <continent>not important</continent>
            <capital />
        </city>
        <city name="Atlanta">
            <country>not important</country>
            <continent>not important</continent>
            <olympics count="1">
            	<year>1996</year>
            	<season>summer</season>
            </olympics>
        </city>
    </cities>
</data>
'''

xml_root = etree.fromstring(xml)
raw_tree = etree.ElementTree(xml_root)
nice_tree = collections.OrderedDict()   # unique path -> list of attribute names

for tag in xml_root.iter():
    # Strip positional indexes such as [2] so repeated elements share one path
    path = re.sub(r'\[[0-9]+\]', '', raw_tree.getpath(tag))
    if path not in nice_tree:
        nice_tree[path] = []
    # Collect each attribute name once, in first-seen order
    for attrib in tag.keys():
        if attrib not in nice_tree[path]:
            nice_tree[path].append(attrib)

for path, attribs in nice_tree.items():
    indent = path.count('/') - 1   # the root path '/data' has depth 0
    print('{0}{1}: {2} [{3}]'.format('    ' * indent, indent, path.split('/')[-1], ', '.join(attribs) if attribs else '-'))

Which gives me the following result:

Output:

0: data [-]
    1: timestamp [-]
    1: people [-]
        2: person [name, given, maiden]
            3: occupation [fulltime]
            3: age [-]
            3: degree [-]
            3: birthday [-]
            3: siblings [-]
                4: brother [attrib1, attrib2]
                4: sister [-]
    1: cities [-]
        2: city [name]
            3: country [-]
            3: continent [-]
            3: capital [-]
            3: olympics [count]
                4: year [-]
                4: season [-]

wavic
So-and-so of the Yard


Good! At first I thought this would be a difficult task, but it seems that XPath is of great help.


Aug-12-2020, 08:51 AM (This post was last modified: Aug-12-2020, 08:51 AM by mreshko.)

Hi sonicblind.

Great code! Very useful. Thank you.

It would be great if you could add these two features to the code:

(1) show the child's number after the level, e.g.
3.0: occupation [fulltime]
3.1: age [-]
3.2: degree [-]
3.3: birthday [-]
3.4: siblings [-]

(2) show the number of identical siblings; for example, if there were, say, 100 "person" elements, it would
be displayed as
2: person [name, given, maiden] [100]

Many thanks
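A rough sketch of how these two features might be added is shown below. It is not sonicblind's code: it uses the standard library's xml.etree.ElementTree instead of lxml, builds the paths by hand, and the names collect and render are hypothetical. It also assumes that the "number of identical siblings" means the total number of elements found at a given path across the whole document, and it combines both requests into one line format, level.index: tag [attributes] [count].

```python
import xml.etree.ElementTree as ET


def collect(element, path, info):
    """Record attributes, occurrence count and unique child paths per tag path."""
    entry = info.setdefault(path, {"attribs": [], "count": 0, "children": []})
    entry["count"] += 1
    for name in element.keys():
        if name not in entry["attribs"]:
            entry["attribs"].append(name)
    for child in element:
        child_path = path + "/" + child.tag
        if child_path not in entry["children"]:
            entry["children"].append(child_path)
        collect(child, child_path, info)
    return info


def render(info, path, level=0, index=0):
    """Return lines of the form 'level.index: tag [attribs] [count]'."""
    entry = info[path]
    attribs = ", ".join(entry["attribs"]) or "-"
    tag = path.rsplit("/", 1)[-1]
    lines = ["{}{}.{}: {} [{}] [{}]".format(
        "    " * level, level, index, tag, attribs, entry["count"])]
    # The index is the child's position among the unique children of its parent
    for i, child_path in enumerate(entry["children"]):
        lines.extend(render(info, child_path, level + 1, i))
    return lines


# Small hypothetical sample, not the full XML from the thread
xml = (
    '<people>'
    '<person name="Blue"><age/></person>'
    '<person name="Red" maiden="Orange"><age/><degree/></person>'
    '</people>'
)
root = ET.fromstring(xml)
info = collect(root, root.tag, {})
lines = render(info, root.tag)
print("\n".join(lines))
```

Rendering from the collected child lists, rather than in first-seen path order, keeps every element directly under its parent even when a child first appears in a later sibling.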

How do you visualize an XML tree?

Here are some options for viewing your XML in a tree structure: Open the XML in a web browser and get an outline view with collapsible elements. Open the XML in graphics view in Oxygen, QTAssistant, or XMLSpy. Use Graphviz or DotML ant build to create your own visual representations.

How do I print XML data in Python?

You have a few options:
xml.etree.ElementTree.indent()
BeautifulSoup.prettify()
lxml.etree.parse()
xml.dom.minidom.parse()
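As an illustration of the first option, here is a small sketch using only the standard library. It assumes Python 3.9 or newer, where xml.etree.ElementTree.indent() is available; the sample XML is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line document to be pretty-printed
root = ET.fromstring(
    "<data><people><person><name>Blue</name></person></people></data>"
)

# indent() modifies the tree in place by inserting whitespace (Python 3.9+)
ET.indent(root, space="    ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Note that this prints the whole document, values and duplicates included, so it is a formatting aid rather than a structure summary.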

How do I print a pretty XML string in Python?

Use lxml.etree.parse(source) to parse the XML file source and return an ElementTree object. Call lxml.etree.tostring(element_or_tree, encoding="unicode", pretty_print=True) to pretty-print the contents of the XML file, with element_or_tree as the result of the previous step.

How do I parse XML in ElementTree?

To parse an XML file:
ElementTree(): this class represents the hierarchical structure of elements as a tree object. ...
getroot(): this method returns the root element of the tree, e.g. root = tree.getroot().
getchildren(): this method returned the list of sub-elements one level below an element; it was removed in Python 3.9, so iterate over the element or use list(element) instead.
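The steps above can be sketched with the standard library as follows. An in-memory io.StringIO object stands in for a real file here, and the sample content is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# parse() accepts a filename or a file-like object; StringIO stands in for a file
source = io.StringIO(
    "<data><timestamp>...</timestamp><people><person/></people></data>"
)
tree = ET.parse(source)
root = tree.getroot()

# Iterating an element yields its direct children (the getchildren() replacement)
child_tags = [child.tag for child in root]

# iter() walks the element itself and all of its descendants in document order
all_tags = [element.tag for element in root.iter()]

print(root.tag, child_tags, all_tags)
```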
