python reporting line/column of origin of XML node

By | January 12, 2018

I’m currently using xml.dom.minidom to parse some XML in Python. After parsing, I’m doing some reporting on the content, and would like to report the line (and column) where each tag started in the source XML document, but I don’t see how that’s possible.

I’d like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that — ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.

Any suggestions on how to do this? Hopefully I’m just overlooking something in the docs and this is extremely easy.


By monkeypatching the minidom content handler I was able to record line and column number for each node (as the ‘parse_position’ attribute). It’s a little dirty, but I couldn’t see any “officially sanctioned” way of doing it 🙂 Here’s my test script:

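from xml.dom import minidom
import xml.sax

# Sample document. The element text is placeholder content; only the
# layout matters for the positions reported below.
doc = """\
<File>
  <name>Name</name>
  <pos>0</pos>
</File>
"""

def set_content_handler(dom_handler):
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        # The handler has just pushed the new element onto its stack.
        cur_elem = dom_handler.elementStack[-1]
        # Line and column of the start tag, read through the parser's
        # Locator methods.
        cur_elem.parse_position = (
            parser.getLineNumber(),
            parser.getColumnNumber()
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    # Pass the patched handler on to the real setContentHandler.
    orig_set_content_handler(dom_handler)

parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    if child.localName is None:
        continue  # skip whitespace-only text nodes
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))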

It outputs the following:

Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
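
If you need this in more than one place, the same monkeypatch can be wrapped in a function. The following is only a sketch under a few assumptions: parse_with_positions is a name made up for this example, the positions are read through the parser's getLineNumber() / getColumnNumber() Locator methods, and it still relies on minidom handing a handler with an elementStack to setContentHandler, which is an implementation detail rather than documented behaviour.

```python
import xml.sax
from xml.dom import minidom


def parse_with_positions(text):
    """Parse XML text into a minidom document whose elements carry a
    (line, column) tuple in a 'parse_position' attribute.

    Hypothetical helper: it depends on minidom/pulldom internals.
    """
    parser = xml.sax.make_parser()
    orig_set_content_handler = parser.setContentHandler

    def set_content_handler(dom_handler):
        orig_start_cb = dom_handler.startElementNS

        def start_element_ns(name, tag_name, attrs):
            orig_start_cb(name, tag_name, attrs)
            # The element that was just created is on top of the stack.
            dom_handler.elementStack[-1].parse_position = (
                parser.getLineNumber(), parser.getColumnNumber())

        dom_handler.startElementNS = start_element_ns
        orig_set_content_handler(dom_handler)

    # Intercept the handler that minidom installs on the parser.
    parser.setContentHandler = set_content_handler
    return minidom.parseString(text, parser)


sample = "<File>\n  <name>a</name>\n  <meta>\n    <pos>b</pos>\n  </meta>\n</File>\n"
dom = parse_with_positions(sample)

# getElementsByTagName("*") walks every element in document order,
# so nested elements are reported as well.
positions = [(el.tagName, el.parse_position)
             for el in dom.getElementsByTagName("*")]
for tag, (line, col) in positions:
    print("{0} at {1}:{2}".format(tag, line, col))
```

Creating a fresh parser on every call keeps the patched setContentHandler from leaking into other code that uses xml.sax.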


A different way to hack around the problem is by patching line number information into the document before parsing it. Here’s the idea:

import re
from xml.dom import minidom

LINE_DUMMY_ATTR = '_DUMMY_LINE' # Make sure this string is unique!

def parseXml(filename):
  content = []
  with open(filename, 'r') as f:
    # Add a dummy attribute holding the line number to every opening tag.
    for l, line in enumerate(f, start=1):
      content.append(re.sub(r'<(\w+)', r'<\1 ' + LINE_DUMMY_ATTR + '="' + str(l) + '"', line))

  return minidom.parseString("".join(content))

Then you can retrieve the line number of an element with

int(element.getAttribute(LINE_DUMMY_ATTR))

Quite clearly, this approach has its own set of drawbacks. If you really need column numbers too, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you’ll have to make sure to strip LINE_DUMMY_ATTR out of any accidental matches there.
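
For what it's worth, here is a rough sketch of how the column number could be patched in as well, and how the dummy attributes could be removed again once they have been read. COL_DUMMY_ATTR, parse_with_markers and pop_positions are names invented for this example, and it works on a string rather than a file to keep it short. It inherits the limitations of the regex: anything that looks like an opening tag inside a comment, CDATA section or text is rewritten too, and prefixed names such as ns:tag are not handled.

```python
import re
from xml.dom import minidom

# Hypothetical marker names; they must not clash with real attributes.
LINE_DUMMY_ATTR = "_DUMMY_LINE"
COL_DUMMY_ATTR = "_DUMMY_COL"


def parse_with_markers(text):
    """Add line/column marker attributes to every opening tag, then parse."""
    patched = []
    for number, line in enumerate(text.splitlines(keepends=True), start=1):
        def add_markers(match, number=number):
            # match.start() is the 0-based column of '<' in the original line.
            return '<{0} {1}="{2}" {3}="{4}"'.format(
                match.group(1), LINE_DUMMY_ATTR, number,
                COL_DUMMY_ATTR, match.start())
        patched.append(re.sub(r"<(\w+)", add_markers, line))
    return minidom.parseString("".join(patched))


def pop_positions(dom):
    """Return [(tagName, line, column)] and remove the marker attributes."""
    found = []
    for elem in dom.getElementsByTagName("*"):
        if elem.hasAttribute(LINE_DUMMY_ATTR):
            found.append((elem.tagName,
                          int(elem.getAttribute(LINE_DUMMY_ATTR)),
                          int(elem.getAttribute(COL_DUMMY_ATTR))))
            elem.removeAttribute(LINE_DUMMY_ATTR)
            elem.removeAttribute(COL_DUMMY_ATTR)
    return found


sample = "<File>\n  <name>a</name>\n  <pos>b</pos>\n</File>\n"
dom = parse_with_markers(sample)
positions = pop_positions(dom)
print(positions)
# The markers are gone, so toxml() no longer contains them.
print(LINE_DUMMY_ATTR in dom.toxml())
```

Reading all positions in one pass and removing the markers straight away means later calls to toxml() don't see them, at the cost of having to keep the positions in a separate structure.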

The one advantage of this solution over aknuds1’s answer is that it does not require messing with minidom internals.
