strange encoding problem

Discussion:

Andreas Schildbach

2002-11-01 11:57:03 UTC

i've got an utf-8 encoded xml file (test.xml) with an umlaut character, like
this:

<?xml version="1.0" encoding="UTF-8"?>
<a>ü</a>

i want to apply a simple xsl transformation (test.xsl) to html, like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="html" indent="yes" />
</xsl:stylesheet>

when i use the xslt-task of jakarta ant

<xslt in="test.xml" style="test.xsl" out="test.html" />

the result (test.html) is ü
which is correct.

PROBLEM:

when i use tomcat, jsp and the jstl (java standard tag library) to apply the
transformation

<%@ taglib prefix="c" uri="http://java.sun.com/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jstl/xml" %>
<c:import url="test.xml" var="xml"/>
<c:import url="test.xsl" var="xsl"/>
<x:transform xml="${xml}" xslt="${xsl}"/>

the result is Ã¼
which is NOT correct in my opinion.

- i made sure that the utf-8 encoded files are really utf-8 encoded (textpad
4.5.0 save-as encoding utf-8)
- i updated my software to the latest revisions:
jdk 1.4.1_01
tomcat 4.1.12
jstl 1.02
jakarta ant 1.5.1
- i searched for this problem on the internet/faqs

none of this helped. what am i doing wrong?

andreas

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list

Jeni Tennison

2002-11-01 12:20:09 UTC


Hi Andreas,

Post by Andreas Schildbach
when i use tomcat, jsp and the jstl (java standard tag library) to apply the
transformation
<c:import url="test.xml" var="xml"/>
<c:import url="test.xsl" var="xsl"/>
<x:transform xml="${xml}" xslt="${xsl}"/>
the result is Ã¼
which is NOT correct in my opinion.

When you say it's &Atilde;&frac14;, do you mean that when you open up
the result you actually see those entity references, or do you see the
actual characters Ã¼?

I suspect it's the latter, in which case make sure that the text
editor (or whatever) that you're using to look at the result of the
transformation is reading in that result as UTF-8 rather than as
ISO-8859-1.

If the former, then something really weird's going on -- it looks as
though the result is being serialised as UTF-8, then read as
ISO-8859-1 and then serialised again using HTML entity references.
Perhaps knowing that's what's going on will help you track down the
bug...
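
The chain Jeni describes can be reproduced outside Tomcat. The following is a minimal sketch, not the code path JSTL actually takes: the class and method names are made up for illustration, and it uses `StandardCharsets`, which needs a newer JDK than the 1.4.1 mentioned above. It encodes ü as UTF-8, decodes those bytes as ISO-8859-1, and then writes the two resulting characters as HTML named entities.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Replace the two characters of interest with their HTML named entities.
    // A real HTML serializer covers many more characters; this handles only
    // the pair that a misread u-umlaut produces.
    static String toEntities(String s) {
        return s.replace("\u00C3", "&Atilde;").replace("\u00BC", "&frac14;");
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // u with diaeresis, written as an escape
        String misreadText = misread(umlaut);
        System.out.println(misreadText.length());    // 2
        System.out.println(toEntities(misreadText)); // &Atilde;&frac14;
    }
}
```

One character going in and two coming out is the usual fingerprint of UTF-8 bytes being decoded as a single-byte encoding somewhere along the way.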

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


Mike Brown

2002-11-01 17:12:39 UTC



I'm surprised Tommie isn't scolding you both for straying off topic.

Post by Jeni Tennison
When you say it's &Atilde;&frac14;, do you mean that when you open up
the result you actually see those entity references, or do you see the
actual characters Ã¼?
I suspect it's the latter, in which case make sure that the text
editor (or whatever) that you're using to look at the result of the
transformation is reading in that result as UTF-8 rather than as
ISO-8859-1.
If the former, then something really weird's going on -- it looks as
though the result is being serialised as UTF-8, then read as
ISO-8859-1 and then serialised again using HTML entity references.
Perhaps knowing that's what's going on will help you track down the
bug...

Yes, it's very typical in servlet/JSP applications to do something like this:

1. The client requests page via HTTP.

2. The server sends an HTML form, wherein the Unicode chars of the document
have been serialized in the HTTP response using some encoding (iso-8859-1 or
the local platform default). The response may or may not indicate which
encoding that is, via the charset parameter in the Content-Type header.
The client may or may not use the indicated encoding to know how to
decode the document and present the form (the user can usually override
the decoding on their end, because there is a long history of Japanese
and Chinese multibyte character sets being misrepresented as iso-8859-1).

3. Due to convention, not formal standard, the client will try to use the same
encoding when it submits the form data, no matter how it is sent (GET,
POST, application/x-www-form-urlencoded or multipart/form-data ... doesn't matter).
Unencodable characters in the form data might be first translated to
numeric character references... again, there is no standard, so browser
behavior varies. The browser most likely will *not* indicate what encoding
was used in the form data submission, "for backward compatibility".

The form data in the HTTP request may look like this,
for example, if you entered a copyright notice consisting of
"<copyright symbol U+00A9> 2002 Acme Inc." into a form field named
foo on a utf-8 encoded HTML page: foo=%C2%A9%202002%20Acme%20Inc.
Note that the copyright symbol character in utf-8 is 0xC2 0xA9, while
in iso-8859-1 it is just 0xA9. I recommend monitoring the HTTP
traffic so you can see the raw request before making any assumptions.
Use a proxy server with extended logging options like Proxomitron, or
use a packet sniffer like tcpdump and your favorite binary file viewer.
e.g., on my BSD box I can use "tcpdump -s 0 -w - port 80 | hexdump -C"

4. The server (servlet/JSP engine like Tomcat or Weblogic) will make an
assumption about what encoding was used in the form data. Most likely,
it will choose to use iso-8859-1 or whatever the platform default
encoding is. Thus it will give you access to what it calls a
"parameter" (bad name.. URIs and MIME headers have parameters too,
but they aren't the same thing) named foo, containing the Unicode
string you get if you decode the URL-encoded bytes as iso-8859-1:
roughly, <capital A with circumflex: U+00C2> <copyright symbol: U+00A9>
+ " 2002 Acme Inc.".
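
The difference between the two interpretations in steps 3 and 4 can be checked with the JDK's own `URLDecoder`. This is only a sketch under assumptions: the class name and the `decode` helper are hypothetical, and a servlet container does its own parameter parsing rather than calling this code. It decodes the example value of foo once as UTF-8 and once as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecodingDemo {
    // Decode a URL-encoded form value using the named character encoding.
    static String decode(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // Value of foo taken from the example request above.
        String raw = "%C2%A9%202002%20Acme%20Inc.";
        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");
        System.out.println(asUtf8.length());                   // 16
        System.out.println(asLatin1.length());                 // 17
        System.out.println(asLatin1.charAt(0) == '\u00C2');    // true
        System.out.println(asLatin1.charAt(1) == '\u00A9');    // true
    }
}
```

The ISO-8859-1 result is one character longer because the two UTF-8 bytes of the copyright symbol each become a character of their own.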

Now you can see how things start to go awry. It snowballs from there.
The solution I recommend is this:

1. Always know the encoding of the HTML form that you send to the browser. For
maximum predictability and Unicode support I recommend using utf-8. Ensure
that the HTML declares itself as utf-8 in a meta tag and/or in the HTTP
response headers.

2. Make it a requirement for using your application that the browser be set to
auto-detect encoding, not override it, so you can assume the form data will
come back using the same encoding as the form. OR you can look at the
Accept-Charset and/or Accept-Language headers in the HTTP requests to make an
intelligent *guess* as to what encoding the browser is using. I don't
recommend this because, well, it's still a guess, and you probably wouldn't
know when to choose utf-8.

3. If you sent out the form in utf-8, the form data is probably coming back as
utf-8, so take your decoded "parameter", re-encode it as iso-8859-1 bytes, and
decode those bytes back as if they were utf-8. Something like this, in Java,
plus the appropriate try-catch for the possible UnsupportedEncodingException.

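// The container decoded the URL-encoded bytes as ISO-8859-1, so this is garbled.
String badString = request.getParameter("foo");
// Recover the raw bytes, then decode them with the encoding the form really used.
byte[] bytes = badString.getBytes("ISO-8859-1");
String goodString = new String(bytes, "UTF-8");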
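
The same round trip can be tried without a servlet container. The sketch below is an illustration under assumptions, not part of the servlet API: `ParameterRepair` and `repair` are hypothetical names, the input string stands in for what `request.getParameter("foo")` would return in the scenario above, and `StandardCharsets` is used so that no checked exception needs catching.

```java
import java.nio.charset.StandardCharsets;

public class ParameterRepair {
    // Undo an ISO-8859-1 decode of bytes that were really UTF-8.
    static String repair(String misdecoded) {
        if (misdecoded == null) {
            return null; // getParameter returns null for a missing parameter
        }
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes, so this
        // recovers the raw bytes exactly as they arrived.
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the misdecoded parameter from step 4.
        String bad = "\u00C2\u00A9 2002 Acme Inc.";
        String good = repair(bad);
        System.out.println(good.equals("\u00A9 2002 Acme Inc.")); // true
        System.out.println(good.length());                         // 16
    }
}
```

This only works when the container really did decode as ISO-8859-1. If it used some other encoding, or the string already holds characters above U+00FF, the re-encode step loses information. Servlet 2.3 also defines `request.setCharacterEncoding`, which may make the round trip unnecessary for request bodies if it is called before the first parameter is read; whether it helps in a given setup is worth verifying against the raw request.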

Now, that just covers the general stuff.. I think if you can understand this
much of it, you can get to a point where you can figure out how the XSLT
transformation output gets munged. Like I said, it really helps if you can
peek into the data as it is going back and forth, if you know how to spot
faulty data... put lots of System.out.println()s in and get yourself something
to look at the HTTP messages with.
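
When printing strings for this kind of debugging, the console's own encoding can hide or add to the damage. One possible approach, sketched here with a hypothetical helper name, is to print code points in U+XXXX form instead of the characters themselves.

```java
public class CodePointDump {
    // Render each code point as U+XXXX so the output survives any console encoding.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u00FC"));       // U+00FC
        System.out.println(dump("\u00C3\u00BC")); // U+00C3 U+00BC
    }
}
```

A single U+00FC means the string is intact at that point; U+00C3 U+00BC means it has already been through a UTF-8 to ISO-8859-1 misread.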

- Mike
____________________________________________________________________________
mike j. brown | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/


Gregory Murphy

2002-11-01 20:10:33 UTC


Post by Andreas Schildbach
i've got an utf-8 encoded xml file (test.xml) with an umlaut character, like
<?xml version="1.0" encoding="UTF-8"?>
<a>ü</a>
[...]
when i use tomcat, jsp and the jstl (java standard tag library) to apply the
transformation
<c:import url="test.xml" var="xml"/>
<c:import url="test.xsl" var="xsl"/>
<x:transform xml="${xml}" xslt="${xsl}"/>
the result is Ã¼
which is NOT correct in my opinion.

The following hack might help you to work around the problem. Redefine the
character entity so that it refers to a numeric character reference. In other
words, make your XML look something like

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
<!ENTITY uuml "&#252;">
]>

I have found that, in general, numeric character references survive
repeated processing better than do the HTML named references.
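
The reason numeric references are robust is that they are plain ASCII, so every step in the chain decodes them the same way whatever encoding it assumes. As a rough illustration (the class and method names are hypothetical, and this is not what any XSLT serializer does internally), non-ASCII characters can be rewritten as decimal references before the text is handed on:

```java
public class NumericReferences {
    // Replace every non-ASCII code point with a decimal numeric character reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp); // ASCII passes through unchanged
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("<a>\u00FC</a>")); // <a>&#252;</a>
    }
}
```

Note that this must run on a correctly decoded string. Applied after a misread it would faithfully preserve the wrong characters.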

// Gregory Murphy <***@sun.com>
// Software Engineer
// Customer Network Platform, Sun Microsystems
