Problem(Abstract)
When WebSphere Application Server converts content from a file or string, numbers may be converted to their word equivalents, particularly when PDFBox is used to extract text. A sun.io.MalformedInputException may also be thrown.
Symptom
Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. This problem is seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.
We have a stand-alone test case that can confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command, using the Java™ executable that is bundled with your WebSphere Application Server.
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"
If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
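To confirm that the argument actually reached the JVM, a small check along the following lines can be run with the same Java executable and arguments. This is a minimal sketch: the class name NioFlagCheck and the helper isNioForced are hypothetical; only the property name ibm.stream.nio comes from this document.

```java
import java.nio.charset.Charset;

public class NioFlagCheck {
    // Hypothetical helper: true only when the property value is "true"
    // (case-insensitive); null or any other value yields false.
    static boolean isNioForced(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        // -Dibm.stream.nio=true is exposed to the JVM as a system property.
        String flag = System.getProperty("ibm.stream.nio");
        System.out.println("ibm.stream.nio  = " + flag);
        System.out.println("NIO forced      = " + isNioForced(flag));
        // The default charset is what conversions use when none is specified.
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}
```

If the first line prints null, the argument was not passed to this JVM.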
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are encountered. When you switch to NIO, these exceptions are caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see whether it is set to a UTF-8 locale. It may read something like this:
# echo $LANG
en_US.UTF-8
Alter the variable by removing the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
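Because the environment of the server process can differ from that of your shell, it may help to check LANG from inside the JVM. The following is a minimal sketch, not part of the product: the class name LangCheck and both helper methods are hypothetical.

```java
public class LangCheck {
    // Hypothetical helper: true when the value ends in a UTF-8 codeset
    // suffix such as ".UTF-8" or ".utf8" (case-insensitive).
    static boolean hasUtf8Suffix(String lang) {
        return lang != null && lang.matches("(?i).*\\.utf-?8$");
    }

    // Hypothetical helper: returns the value with that suffix removed,
    // for example "en_US.UTF-8" becomes "en_US".
    static String stripUtf8Suffix(String lang) {
        return lang == null ? null : lang.replaceFirst("(?i)\\.utf-?8$", "");
    }

    public static void main(String[] args) {
        // Reads LANG as seen by this JVM process, not by the login shell.
        String lang = System.getenv("LANG");
        System.out.println("LANG = " + lang);
        if (hasUtf8Suffix(lang)) {
            System.out.println("Suggested value: " + stripUtf8Suffix(lang));
        }
    }
}
```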
A MalformedInputException may also occur when your application runs on WebSphere Application Server, in which case it is written to standard error.
Why is Java IO used for converting text?
The IBM SDK retains Java IO, rather than NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.
The JVM can be forced to use NIO with the -Dibm.stream.nio=true JVM argument described above.
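The difference between reporting and silently handling malformed input can be illustrated with the standard java.nio.charset API. This sketch is an assumption-laden illustration only: it does not exercise the IBM SDK's internal sun.io converters, and the class name DecodeModes and its methods are hypothetical. It contrasts a lenient decode, which substitutes a replacement character, with a decoder explicitly configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeModes {
    // Lenient: malformed bytes are replaced with U+FFFD and no exception is thrown.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: the decoder is told to report malformed input, so it throws
    // a CharacterCodingException instead of substituting a character.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // The byte 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("lenient: " + decodeLenient(bad));
        try {
            System.out.println("strict: " + decodeStrict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

In both cases the valid characters before the bad byte are decoded normally; only the handling of the malformed byte differs.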
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.