• [Reference] IBM: sun.io.MalformedInputException and text encoding conversions transform numerals to their word equivalents



    Problem(Abstract)

    When WebSphere Application Server converts the contents of a file or string, numbers may be converted to their word equivalents, especially when PDFBox is used to extract text. sun.io.MalformedInputException errors may accompany the conversion.

    Symptom

    Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.

    For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.



    Diagnosing the problem

    When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. This problem is seen on Linux; however, MalformedInputException errors are likely to be seen on other operating systems as well.

     

    A stand-alone test case is available to confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.

    [JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"


    PDF_Test_Case.zip

    If the test fails, you will see output similar to the following:
    onetwothreespaceHellospaceMottospace

    Resolving the problem

    These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:


    -Dibm.stream.nio=true
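
    To confirm that the argument actually reached the JVM, the property can be read back at run time. The following is a minimal sketch: the class name NioFlagCheck and the helper isNioForced are hypothetical, and treating the value as a plain boolean is an assumption about how the IBM SDK interprets it. The check only shows that the property is set, not that the converters behave differently.

```java
public class NioFlagCheck {
    // Property name taken from the JVM argument -Dibm.stream.nio=true.
    static final String NIO_PROPERTY = "ibm.stream.nio";

    // Assumption: the value is treated as a boolean.
    // Boolean.parseBoolean returns false for null or any value other than "true".
    static boolean isNioForced(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        String value = System.getProperty(NIO_PROPERTY);
        System.out.println(NIO_PROPERTY + " = " + value);
        System.out.println("NIO converters requested: " + isNioForced(value));
    }
}
```

    Run it with the same Java executable and arguments as the server, for example: [JAVA_EXECUTABLE] -Dibm.stream.nio=true NioFlagCheck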


    I am getting a MalformedInputException. How can I resolve this?

    This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are encountered. After switching to NIO, these exceptions are caught and not reported to the log.

    You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see whether it is set to a UTF-8 locale. It may read something like this:

    # echo $LANG
    en_US.UTF-8

    Alter the variable by removing the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:

    # export LANG=en_US

    Alternatively, you can add this environment variable from the administration console. 
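
    To see what the JVM derived from LANG, the default encoding can be printed from Java. This is a sketch using only standard APIs; the class name DefaultEncodingCheck is hypothetical. Note that on JDK 18 and later the default charset is UTF-8 regardless of the locale, so the LANG change is only expected to show up on older runtimes.

```java
import java.nio.charset.Charset;

public class DefaultEncodingCheck {
    // Collects the locale-related settings as the running JVM sees them.
    static String describe() {
        return "LANG=" + System.getenv("LANG")
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

    Running it once before and once after the export shows whether the change took effect for new JVM processes.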

    A MalformedInputException may also occur when your application runs on WebSphere Application Server, in which case it is written to standard error.


     


    Why is Java IO used for converting text?
    Java IO is retained in the IBM SDK, instead of NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.

    The JVM can be forced to use NIO by adding the JVM argument described above.
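
    The difference between reporting and silently handling malformed input can be illustrated with the standard java.nio.charset API. This is only an analogy, not the IBM converter internals: the class DecodeModes and its methods are hypothetical, and the exception involved here is java.nio.charset.MalformedInputException, a different class from sun.io.MalformedInputException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeModes {
    // Strict mode: malformed bytes are reported as an exception.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Lenient mode: new String(byte[], Charset) never throws and
    // substitutes the replacement character U+FFFD for malformed bytes.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience wrapper: true when strict decoding succeeds.
    static boolean isWellFormed(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "123" followed by 0xFF, a byte that is never valid in UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("well formed: " + isWellFormed(bad));
        System.out.println("lenient length: " + decodeLenient(bad).length());
    }
}
```

    In both modes the valid part of the input decodes the same way; only the handling of the bad byte differs, which matches the statement above that the exception does not alter the resulting string.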



    Does the Oracle JDK suffer similar problems?
    Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
  • Original article: https://www.cnblogs.com/zhangxsh/p/3494510.html