Getting the encoding type of an XML file in Java

By | January 12, 2018
Questions:

I am parsing XML using DocumentBuilder in Java 1.4.
The first line of the XML is

<?xml version="1.0" encoding="GBK"?>

I want to get the encoding type of the XML and use it. How can I get "GBK"?
Basically, I will be creating one more XML file in which I want encoding="GBK" to be retained.
Currently it is getting lost and set to the default, UTF-8.
There are many XML files with different encodings. I need to read the encoding of the source and handle each file accordingly.

Please help

Answers:

One way to do this:

final XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader( new FileReader( testFile ) );

//running on MS Windows fileEncoding is "CP1251"
String fileEncoding = xmlStreamReader.getEncoding(); 

//the XML declares UTF-8 so encodingFromXMLDeclaration is "UTF-8"
String encodingFromXMLDeclaration = xmlStreamReader.getCharacterEncodingScheme(); 
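The same call can be wrapped in a small helper. The following is only a sketch: the class name `DeclaredEncoding` and the method `declaredEncoding` are made up for illustration, and it assumes a StAX implementation is available (Java 6 or later, so not the Java 1.4 mentioned in the question). It passes an `InputStream` rather than a `FileReader`, so the parser sees the raw bytes instead of text already decoded with the platform charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DeclaredEncoding {

    /**
     * Hypothetical helper: returns the encoding named in the XML
     * declaration of the given stream.
     */
    static String declaredEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // Value of the encoding pseudo-attribute in <?xml ... ?>
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample so the sketch runs without a file on disk.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes("ISO-8859-1");
        System.out.println(declaredEncoding(new ByteArrayInputStream(xml)));
    }
}
```

For a real file, a `FileInputStream` would take the place of the `ByteArrayInputStream`.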

Answers:

This one works for various encodings, taking both the BOM and the XML declaration into account. It defaults to UTF-8 if neither applies:

String encoding;
FileReader reader = null;
XMLStreamReader xmlStreamReader = null;
try {
    InputSource is = new InputSource(file.toURI().toASCIIString());
    XMLInputSource xis = new XMLInputSource(is.getPublicId(), is.getSystemId(), null);
    xis.setByteStream(is.getByteStream());
    PropertyManager pm = new PropertyManager(PropertyManager.CONTEXT_READER);
    for (Field field : PropertyManager.class.getDeclaredFields()) {
        if (field.getName().equals("supportedProps")) {
            field.setAccessible(true);
            ((HashMap<String, Object>) field.get(pm)).put(
                    Constants.XERCES_PROPERTY_PREFIX + Constants.ERROR_REPORTER_PROPERTY,
                    new XMLErrorReporter());
            break;
        }
    }
    encoding = new XMLEntityManager(pm).setupCurrentEntity("[xml]".intern(), xis, false, true);
    // Compare string contents; != only compares object references.
    if (!"UTF-8".equals(encoding)) {
        return encoding;
    }

    // From @matthias-heinrich’s answer:
    reader = new FileReader(file);
    xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(reader);
    encoding = xmlStreamReader.getCharacterEncodingScheme();

    if (encoding == null) {
        encoding = "UTF-8";
    }
} catch (RuntimeException e) {
    throw e;
} catch (Exception e) {
    throw new UndeclaredThrowableException(e);
} finally {
    if (xmlStreamReader != null) {
        try {
            xmlStreamReader.close();
        } catch (XMLStreamException e) {
        }
    }
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
        }
    }
}
return encoding;

Tested on Java 6 with:

  • UTF-8 XML file with BOM, with XML declaration ✓
  • UTF-8 XML file without BOM, with XML declaration ✓
  • UTF-8 XML file with BOM, without XML declaration ✓
  • UTF-8 XML file without BOM, without XML declaration ✓
  • ISO-8859-1 XML file (no BOM), with XML declaration ✓
  • UTF-16LE XML file with BOM, without XML declaration ✓
  • UTF-16BE XML file with BOM, without XML declaration ✓

Standing on the shoulders of these giants:

import java.io.*;
import java.lang.reflect.*;
import java.util.*;
import javax.xml.stream.*;
import org.xml.sax.*;
import com.sun.org.apache.xerces.internal.impl.*;
import com.sun.org.apache.xerces.internal.xni.parser.*;

Answers:

Use javax.xml.stream.XMLStreamReader to parse your file; then you can call getEncoding().
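For the original goal of keeping the source encoding in a newly written file, a DOM-based sketch might look like the one below. It is not taken from any of the answers. It assumes `Document.getXmlEncoding()` is available (DOM Level 3, so Java 5 or later rather than Java 1.4), and the class name `KeepEncoding` and method `copyWithSameEncoding` are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class KeepEncoding {

    /** Hypothetical helper: re-serializes the XML using the encoding declared in the source. */
    static byte[] copyWithSameEncoding(byte[] source) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(source));

        // Encoding from the XML declaration; fall back to UTF-8 if absent.
        String encoding = doc.getXmlEncoding();
        if (encoding == null) {
            encoding = "UTF-8";
        }

        // Identity transform that writes the document in the same encoding.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>\u00e9</root>"
                .getBytes("ISO-8859-1");
        byte[] copy = copyWithSameEncoding(source);
        System.out.println(new String(copy, "ISO-8859-1"));
    }
}
```

The same pattern should apply to GBK sources, provided the runtime ships that charset.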
