Using SAX (Java) to parse multiple XML messages from a single TCP-stream

January 12, 2018

I use Java to connect to a TCP port and receive a stream of XML documents, one after another, each starting with its own <?xml declaration. An example which demonstrates the format:

<?xml version="1.0"?>
    <name>Fred Bloggs</name>
<?xml version="1.0"?>
    <name>Peter Jones</name>

I’m using the org.xml.sax.* API. The SAX parsing works perfectly for the first document but throws an exception when it reaches the start of the second document:

Exception in thread "main" org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.

The following skeleton class demonstrates the setup I’m using:

import java.net.Socket;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class XMLTest extends DefaultHandler {

  public XMLTest() {
  }

  public static void main(String[] args) throws Exception {
    XMLReader xr = XMLReaderFactory.createXMLReader();

    // Register the handler so the parser delivers its events and errors to it.
    XMLTest handler = new XMLTest();
    xr.setContentHandler(handler);
    xr.setErrorHandler(handler);

    xr.parse(new InputSource(new Socket("", 4555).getInputStream()));
  }
}

I have no control over the format of the XML (it’s a financial data feed), but I need to parse it efficiently and parse all of the documents. I’ve spent the afternoon and evening trying different things, but none have yielded results. Any help would be greatly appreciated.


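You need to split the stream at every <?xml version="1.0"?> declaration and parse each document separately, because a well-formed XML document may contain the declaration only once, at its very start. A BufferedReader can help with this. Kickoff example: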

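BufferedReader reader = new BufferedReader(new InputStreamReader(input, "UTF-8"));
StringBuilder builder = null;
for (String line; (line = reader.readLine()) != null;) {
    if (line.startsWith("<?xml")) {
        if (builder != null) {
            // InputSource(String) expects a system ID (URI), so wrap the text in a java.io.StringReader.
            xr.parse(new InputSource(new StringReader(builder.toString())));
        }
        builder = new StringBuilder();
    }
    if (builder != null) builder.append(line).append('\n');
}
// The last document has no following declaration, so parse it when the stream ends.
if (builder != null) xr.parse(new InputSource(new StringReader(builder.toString())));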
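
For reference, here is a self-contained sketch of the same splitting idea that can be run without the feed. It makes several assumptions that are not in the original post: the class name MultiDocSax, the NameHandler helper and the parseAll method are hypothetical names, a ByteArrayInputStream stands in for the socket stream, the reader is obtained through SAXParserFactory rather than XMLReaderFactory, and every <?xml declaration is assumed to start at the beginning of a line, as in the sample above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MultiDocSax {

    /** Hypothetical handler: collects the text of every <name> element. */
    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder text;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("name".equals(qName)) text = new StringBuilder();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                names.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Splits the stream at each line starting with "<?xml" and parses every chunk on its own. */
    static List<String> parseAll(InputStream input) throws Exception {
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NameHandler handler = new NameHandler();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);

        BufferedReader reader = new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8));
        StringBuilder doc = null;
        for (String line; (line = reader.readLine()) != null;) {
            if (line.startsWith("<?xml")) {
                if (doc != null) parseOne(xr, doc); // previous document is complete
                doc = new StringBuilder();
            }
            if (doc != null) doc.append(line).append('\n');
        }
        if (doc != null) parseOne(xr, doc); // last document ends at end of stream
        return handler.names;
    }

    private static void parseOne(XMLReader xr, StringBuilder doc) throws Exception {
        // Wrap the text in a StringReader; InputSource(String) would treat it as a system ID.
        xr.parse(new InputSource(new StringReader(doc.toString())));
    }

    public static void main(String[] args) throws Exception {
        String feed = "<?xml version=\"1.0\"?>\n    <name>Fred Bloggs</name>\n"
                    + "<?xml version=\"1.0\"?>\n    <name>Peter Jones</name>\n";
        InputStream in = new ByteArrayInputStream(feed.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseAll(in)); // prints [Fred Bloggs, Peter Jones]
    }
}
```

One consequence of this approach on a live socket: a document is only handed to the parser once the next declaration arrives or the stream closes, so the most recent message is always held back until then. If the feed ever puts a declaration in the middle of a line, the line-based split would need to be replaced with a search for "<?xml" within the buffered text.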

