I happened upon a nice-looking Lisp implementation, ABCL (Armed Bear Common Lisp), which runs on and compiles to the JVM, integrating with the JSR-223 Java Scripting API.
Unfortunately, I found this article on their blog.
[Note: Corrected URL] Basically, they're trying to handle "bad UTF-8 input" by substituting a character ('?' in this case) for what they don't understand.
I tried to post a response, but my first attempt wouldn't authenticate via OpenID against my TypePad account (this blog), and my finely-crafted prose, better than this in many ways, was discarded.

So I rewrote it as what you see below, and tried again with my Google account. First it complained that my HTML was over 4096 characters -- and submitting a smaller, partial input drew the same complaint, despite being well below that limit. Presumably some rewriting into HTML was pushing it over the limit.
Finally, I got it to accept the comment -- only to get a Google error message that the URI was too long to process. Integration failure!
So I'm posting this on my own blog, and directing it to any and all who would consider trying to be nice by "handling" the "bad input".
Please, please do not do this!
Your intentions are good, and this seems like an obvious good thing -- but it's not. It's a trap.
This is such a serious mistake as to completely disqualify your system from use for many purposes (unless one bypasses your text input stream handling entirely!). What Java does here is correct and vital. Do not bypass it.
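To make the difference concrete, here's a minimal Java sketch -- illustrative only, not ABCL's actual reader code -- contrasting a strict decoder, which reports malformed input, with a substituting one, which silently corrupts it:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // "café" encoded as ISO-8859-1: the 'é' becomes the single byte 0xE9,
        // which is not a valid UTF-8 sequence on its own.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Strict decoder: reports the problem so the caller can recover properly.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(latin1));
        } catch (CharacterCodingException e) {
            System.out.println("Malformed input detected: " + e);
        }

        // Substituting decoder: "succeeds", silently replacing the bad byte.
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer corrupted = lenient.decode(ByteBuffer.wrap(latin1));
        System.out.println("Silently corrupted to: " + corrupted); // caf\uFFFD
    }
}
```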
I quite understand the instinct to just "make things work" in the face of invalid input. For example, Interlisp's DWIM -- Do What I Mean -- more aptly titled, "Do what Warren Teitelman taught Interlisp to guess that I might mean."
This had the unfortunate effect of allowing bugs to slip into systems undetected.
You're thinking along the lines of Jon Postel's "be conservative in what you do, be liberal in what you accept from others." But that's a different situation: it seeks robustness in establishing communication, not integrity of the data. And even there, the principle has had a lot of unfortunate consequences and has largely been abandoned in favor of stricter compliance.
Or consider the history of HTML, and the myriad messes which have resulted from browsers being liberal in what they accept coupled with page authors being sloppy.
I consider what you are doing here to be EXTREMELY seriously bad. You are, of your own initiative, silently corrupting data. There are few worse things a computer can do!
I cannot begin to tell you how often I have had to track down bad data caused by EXACTLY this behavior. Such data is usually detected FAR from the bad input. Often, extensive and expensive manual fixup has been required. Often, repair has simply been impossible.
You state "not everybody produces strict ascii or strict utf8 sources." Well, kind of true. People often produce strict ISO-8859-1, for example. That's the situation here, apparently. Rather than fix your real bug (reading an ISO-8859-1 file as UTF-8, thus potentially corrupting the data, if you're unlucky, or getting an error if you are lucky, as in this case) -- you are simply eliminating the error -- guaranteeing you'll be unlucky and the mistake will go undetected, until someone has to investigate why there's this funny '?' in some output somewhere, when some quite different appeared in the source code!
Input files can be in strict ISO-8859-2, -3, etc., or strict Shift-JIS, or various strict combinations of {EUC, ISO-2022} and national coding systems.
The reality is that, interpreted in the correct format, malformed input is extremely rare. When it does occur, it is nearly always the result of concatenating raw byte data in different encodings -- for example, a source file that has been edited as UTF-8 by one author and as ISO-8859-1 by another, then merged in a revision control system.
That's something to fix (the guy with the ISO-8859-1 editor needs to fix his setup), not paper over and ignore.
The solution here is simple: if you find you are reading the file in the wrong format, provide a restart to select the correct format and reread it -- from the beginning, restarting all processing.
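In Common Lisp terms, that's a RESTART-CASE wrapped around the reader. Since ABCL sits on the JVM, here's a rough Java sketch of the same idea (readSource and the charset parameters are my own placeholders, not anything in ABCL):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SourceReader {
    /**
     * Read a whole source file, failing loudly if the bytes are not valid in
     * the expected charset, and rereading the entire file from the beginning
     * in a corrected charset rather than substituting characters mid-stream.
     */
    public static List<String> readSource(Path path, Charset expected, Charset corrected)
            throws IOException {
        try {
            // Files.readAllLines uses a strict decoder: malformed bytes throw.
            return Files.readAllLines(path, expected);
        } catch (MalformedInputException e) {
            // Wrong guess about the encoding. Do NOT substitute '?' and press on;
            // restart the whole read with the corrected charset. In an interactive
            // system, this is where you would offer the user a choice of encodings.
            return Files.readAllLines(path, corrected);
        }
    }
}
```

The essential property is that the reread starts over from the first byte, so nothing decoded under the wrong assumption survives.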
In most cases, the correct action will be to convert the input to UTF-8. In this day and age, nobody should be using any lesser character set, nor any other format, for data interchange. (UTF-16 or full 32-bit Unicode is fine for internal processing, but both will cause problems for interchange, as support is limited in most environments.)
And yes, I am aware that a lot of tools still default the charset to the platform encoding, which for legacy reasons is often some national encoding -- Eclipse, for example. This is unfortunate, but data corruption is not the solution.
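The one-time conversion itself is trivial. A hedged sketch, assuming Java 11+ and with placeholder file names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path in = Path.of("legacy-latin1.lisp");   // placeholder input file
        Path out = Path.of("converted-utf8.lisp"); // placeholder output file

        // Decode with the encoding the file was actually written in...
        String text = Files.readString(in, StandardCharsets.ISO_8859_1);
        // ...and write it back out as UTF-8, the sensible interchange format.
        Files.writeString(out, text, StandardCharsets.UTF_8);
    }
}
```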
The only exceptions I see, where substitution is justified, are: (1) providing visual context to the user about where the problem lies, to aid in investigating and correcting the input -- e.g. selecting the right encoding, or re-encoding the input to the expected encoding; or (2) situations where proceeding without failure is vital, and data integrity is secondary to that goal. (Reading source files does not begin to qualify!)
So please, please, do not do this.
--- Argument by Authority section :=) ---
Developer on:
- MacLisp
- Macsyma
- Symbolics (including Japanese support)
- ANSI X3J13
- CLIM
- Various character-set/encoding libraries.
- Emacs's Edict library
My distaste for this strategy has been honed and reinforced over decades of experience with it.
And you have a lot of company in this, so I've had a lot of opportunity to get experience with it, unfortunately.
So please do not do this!