Hi,
I have a large multi-level XML document of a complicated structure, without any namespace definition.
I would like to generate a simplified tree view of its structure, so that every possible element from the XML is shown and only once.
As a simplified example take this XML:
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Using Python I would like to generate a view of its structure, looking something like this:
-data-
    -timestamp-
    -people-
        -person-
            -name-
            -age-
            -degree-
            -siblings-
                -brother-
                -sister-
    -cities-
        -city-
            -name-
            -country-
            -continent-
            -capital-
So, basically, I am not interested in the values or in how many elements of the same type are in the XML.
I only want to see which elements are in there.
I know there might be visual tools to achieve this, but I also need to be able to generate such a tree view directly inside a Python script.
Thanks for any ideas.
lxml's etree has a pretty_print option; see: //lxml.de/api.html
pretty_print does not help me, as it shows everything exactly as it is in the XML. That's what I want to avoid. I need no duplicates and no values or attributes, only the very basic tree structure.
You can look here to see what's available as packages: //pypi.python.org/pypi?%3Aaction=s...mit=search
You may have to write it yourself if you can't find what you're looking for.
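If you do end up writing it yourself, one possible approach is to merge all elements into a nested dict keyed by tag name and then print that dict with indentation. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree instead of lxml, and the SAMPLE document, merge() and outline() are made-up names for illustration, not the XML from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the XML from the question is not reproduced here.
SAMPLE = """\
<data>
  <timestamp>2017-10-01</timestamp>
  <people>
    <person><name>Ann</name><age>30</age></person>
    <person><name>Bob</name><age>41</age><degree>MSc</degree></person>
  </people>
</data>
"""

def merge(elem, node):
    """Merge the children of elem into the nested dict node (first-seen order)."""
    for child in elem:
        merge(child, node.setdefault(child.tag, {}))

def outline(xml_text, indent="  "):
    """Return one indented line per distinct element path."""
    root = ET.fromstring(xml_text)
    tree = {root.tag: {}}
    merge(root, tree[root.tag])

    lines = []

    def walk(node, depth):
        for tag, children in node.items():
            lines.append(indent * depth + tag)
            walk(children, depth + 1)

    walk(tree, 0)
    return lines

print("\n".join(outline(SAMPLE)))
```

Because children are merged per parent, a tag that only shows up in a later sibling (degree in the sample) still ends up under the right parent.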
Just getting the names of the tags works fine in both lxml and BeautifulSoup.
Keeping the structure in the output can be a challenge, as I do not think that either pretty_print (lxml) or prettify() (BS) works for this kind of text output.
Example getting tag names:
from lxml import etree
from bs4 import BeautifulSoup

xml = '''\
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
'''

root = etree.fromstring(xml)
soup = BeautifulSoup(xml, 'lxml')

# lxml
for node in root.iter('*'):
    print(node.tag)

# BS
for tag in soup.findChildren():
    print(tag.name)
Output:
data
timestamp
people
person
name
age
person
name
age
degree
person
name
age
degree
siblings
brother
brother
sister
cities
city
name
country
continent
capital
city
name
country
continent
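To drop the repeats from a flat list like the one above, the tag names can be passed through dict.fromkeys(), which keeps the first occurrence of each name in order. This is a rough sketch using the standard library's xml.etree.ElementTree rather than lxml or BS; the SAMPLE string and unique_tags() are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modelled on the tag names in this thread.
SAMPLE = """\
<data>
  <people>
    <person><name>Ann</name></person>
    <person><name>Bob</name></person>
  </people>
  <cities>
    <city><name>Oslo</name></city>
  </cities>
</data>
"""

def unique_tags(xml_text):
    """Return each tag name once, in order of first appearance."""
    root = ET.fromstring(xml_text)
    # dict.fromkeys() keeps insertion order and silently drops repeats.
    return list(dict.fromkeys(elem.tag for elem in root.iter()))

print(unique_tags(SAMPLE))
```

Note that this works on bare names only, so the name under person and the name under city collapse into a single entry. Keeping them apart needs the full path of each element.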
wavic
Pretty straightforward:
from lxml import etree
from collections import Counter

xml = '''\
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
'''

root = etree.fromstring(xml)

for tag in root.iter():
    path = tree.getpath(tag)
    path = path.replace('/', ' ')
    spaces = Counter(path)
    tag_name = path.split()[-1].split('[')[0]
    tag_name = ' ' * (spaces[' '] - 4) + tag_name
    print(tag_name)
Output:
data
timestamp
people
person
name
age
person
name
age
degree
person
name
age
degree
siblings
brother
brother
sister
cities
city
name
country
continent
capital
city
name
country
continent
wavic
I forgot to put tree = etree.ElementTree(root)
before the for loop.
Thanks to all of you for the tips!
They helped me to achieve my goal.
wavic - My aim was to have no duplicates, so your code was almost perfect, but I reworked it a bit also to include the attributes.
Here is my final code in case it helps somebody else as well.
I will use it any time I need to see clearly the structure of any XML file, to know all tags/attributes which I need to consider.
import re, collections
from lxml import etree

xml = '''\
not important not important not important not important not important not important not important not important not important not important not important not important not important not important not important 1996 summer
'''

xml_root = etree.fromstring(xml)
raw_tree = etree.ElementTree(xml_root)
nice_tree = collections.OrderedDict()

for tag in xml_root.iter():
    # Strip positional indices such as [2] so repeated elements share one path.
    path = re.sub(r'\[[0-9]+\]', '', raw_tree.getpath(tag))
    if path not in nice_tree:
        nice_tree[path] = []
    # Collect every attribute name seen on any element with this path.
    if len(tag.keys()) > 0:
        nice_tree[path].extend(attrib for attrib in tag.keys() if attrib not in nice_tree[path])

for path, attribs in nice_tree.items():
    # Nesting level: number of slashes in the path, minus one for the root.
    indent = int(path.count('/') - 1)
    print('{0}{1}: {2} [{3}]'.format(' ' * indent, indent, path.split('/')[-1], ', '.join(attribs) if len(attribs) > 0 else '-'))
Which gives me the following result:
Output:
0: data [-]
 1: timestamp [-]
 1: people [-]
  2: person [name, given, maiden]
   3: occupation [fulltime]
   3: age [-]
   3: degree [-]
   3: birthday [-]
   3: siblings [-]
    4: brother [attrib1, attrib2]
    4: sister [-]
 1: cities [-]
  2: city [name]
   3: country [-]
   3: continent [-]
   3: capital [-]
   3: olympics [count]
    4: year [-]
    4: season [-]
wavic
Good! At first I thought this would be a difficult task, but it seems that XPath is of great help.
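For anyone wondering why the XPath strings help so much: once the positional indices are stripped, every repeated element maps to the same path, and the number of slashes gives the nesting level. Below is a small sketch of just that step. The path strings are hand-written in the shape that getpath() returns, and generic_path() and depth() are hypothetical helper names.

```python
import re

# Matches positional predicates such as [1] or [27] in an XPath string.
INDEX = re.compile(r"\[[0-9]+\]")

def generic_path(xpath):
    """Drop positional indices so repeated elements share one path."""
    return INDEX.sub("", xpath)

def depth(xpath):
    """Nesting level of an absolute path; the root element is level 0."""
    return xpath.count("/") - 1

# Hand-written examples of indexed paths (assumed, not taken from a real run).
paths = [
    "/data",
    "/data/people/person[1]/name",
    "/data/people/person[2]/name",
    "/data/people/person[2]/siblings/brother[1]",
]

seen = []
for p in paths:
    g = generic_path(p)
    if g not in seen:
        seen.append(g)

for g in seen:
    print(" " * depth(g) + g.split("/")[-1])
```

The two person[...]/name paths collapse into one entry, which is what removes the duplicates from the tree view.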
Aug-12-2020, 08:51 AM [This post was last modified: Aug-12-2020, 08:51 AM by mreshko.]
Hi sonicblind.
Great code! Very useful. Thank you.
It would be great if you could add these two features to the code:
[1] show the child's number after the level, e.g.
3.0: occupation [fulltime]
3.1: age [-]
3.2: degree [-]
3.3: birthday [-]
3.4: siblings [-]
[2] show the number of identical siblings. For example, if there were, say, 100 "person" elements, it would display it as
2: person [name, given, maiden] [100]
Many thanks
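One way the two requested features might be added is sketched below. It is not sonicblind's code: it uses the standard library's xml.etree.ElementTree and a nested dict instead of lxml paths, and SAMPLE, new_node(), merge() and outline() are hypothetical names. It also assumes that "number of identical siblings" means the total number of elements sharing the same path, and that the child number is the 0-based position among the distinct children of the parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not the XML used earlier in the thread.
SAMPLE = """\
<data>
  <people>
    <person name="Ann"><age>30</age></person>
    <person name="Bob" maiden="Lee"><age>41</age><degree>MSc</degree></person>
  </people>
</data>
"""

def new_node():
    return {"count": 0, "attribs": [], "children": {}}

def merge(elem, node):
    """Fold elem into node: bump the count, collect attribute names, recurse."""
    node["count"] += 1
    for key in elem.keys():
        if key not in node["attribs"]:
            node["attribs"].append(key)
    for child in elem:
        merge(child, node["children"].setdefault(child.tag, new_node()))

def outline(xml_text):
    """Return lines of the form 'level.child: tag [attributes] [count]'."""
    root = ET.fromstring(xml_text)
    top = new_node()
    merge(root, top)
    lines = []

    def walk(tag, node, depth, index):
        attribs = ", ".join(node["attribs"]) or "-"
        lines.append("{}{}.{}: {} [{}] [{}]".format(
            " " * depth, depth, index, tag, attribs, node["count"]))
        for i, (child_tag, child) in enumerate(node["children"].items()):
            walk(child_tag, child, depth + 1, i)

    walk(root.tag, top, 0, 0)
    return lines

print("\n".join(outline(SAMPLE)))
```

Here the count is printed for every element, not only for repeated ones; it would be easy to hide it when it is 1.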