Is There A New Way to Correct Errors
Anxiao (Andrew) Jiang
Computer Science and Eng. Dept.
Texas A&M University
[email protected]

Jehoshua Bruck
Electrical Engineering Dept.
California Institute of Technology
[email protected]
Abstract—The classic approach for error correction is to add controlled external redundancy to data. This approach, known as error-correcting codes (ECCs), has been studied extensively, and the rates of ECCs are approaching theoretical limits.
In this work, we explore a second approach for error correction: using the redundancy inside data, even if it is only the residual redundancy left after data compression. We focus on text data, and show that this approach based on language processing can significantly improve error correction performance.
I. INTRODUCTION
In recent work [4], the approach of using the internal redundancy of data for error correction has been explored. It shows that for text data, which have rich features, a language-based error correction algorithm can obtain useful soft information on the noisy bits, which can be provided to error-correcting codes (ECCs) for significantly improved performance. For example, for a binary symmetric channel (BSC) with bit error rates in the range [0.2%, 1.3%] (a useful range of BER for data storage), a reduction of 70% of the parity-check bits was shown to be achievable by combining language-based decoding with ECC-based decoding [4]. The study was extended to applications in flash memories, where experiments show that the new approach substantially improves data retention [5]. The algorithm used in [4], [5] for language-based error correction focuses on a basic method named word recognition, namely, recognizing probable words in texts. It is expected that as more techniques based on natural language processing are used, the error correction performance will be further, and substantially, improved.
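To make the idea concrete, below is a minimal Python sketch of how word recognition can yield soft information, assuming a BSC with crossover probability p, ASCII-coded words, and a uniform prior over dictionary words; the dictionary, function names and parameters are illustrative assumptions, not the actual algorithm of [4], [5].

# Sketch: word recognition producing per-bit soft information.
# Assumes a BSC with crossover probability p and ASCII-coded text;
# a uniform prior over same-length dictionary words is assumed.

def hamming(a, b):
    """Bit-level Hamming distance between equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def bit_posteriors(noisy, dictionary, p=0.01):
    """Weight each same-length dictionary word by its BSC likelihood
    p^d * (1-p)^(n-d), then return, for each bit position, the
    posterior probability that the bit equals 1."""
    n_bits = 8 * len(noisy)
    weights, candidates = [], []
    for w in dictionary:
        if len(w) != len(noisy):
            continue
        d = hamming(noisy, w)
        weights.append((p ** d) * ((1 - p) ** (n_bits - d)))
        candidates.append(w)
    total = sum(weights)
    post = []
    for i in range(n_bits):
        byte, bit = divmod(i, 8)
        mask = 1 << (7 - bit)
        p1 = sum(wt for wt, w in zip(weights, candidates) if w[byte] & mask)
        post.append(p1 / total if total > 0 else 0.5)
    return post  # these probabilities can be turned into LLRs for an ECC decoder

# Example: a one-bit corruption of b"cat" still yields sharp posteriors.
dictionary = [b"cat", b"car", b"cap", b"dog"]
print(bit_posteriors(b"cau", dictionary))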
Note that text data form an important part of the big data in storage systems [1]. They are concise representations of knowledge, and have low tolerance for errors. Text data consist mainly of natural language, along with characters for format control, etc. In data storage systems, text data are either uncompressed, or compressed using Huffman coding, arithmetic coding, etc. [3]. The compression is often at the character level or byte level, because the complexity of the compression scheme grows exponentially in the codeword length of the code book [1]. The algorithms typically view texts as Markov processes of characters (sometimes combinations of characters), and do not utilize the many features of languages (e.g., syntax, part-of-speech structure) [6]. Consequently, in storage systems, there is plenty of residual redundancy in stored data, which forms the basis for language-based error correction.
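As a rough illustration of this residual redundancy, the short Python sketch below estimates the zeroth-order (character-frequency) entropy of a text and compares it with the 8 bits per character of uncompressed ASCII storage; the sample string is hypothetical, and real texts and compressors will differ.

# Sketch: zeroth-order character entropy vs. 8 bits/char raw storage.
# Character-level compressors approach this entropy, but redundancy
# from words, syntax, etc. remains available for error correction.
import math
from collections import Counter

def char_entropy(text):
    """Empirical entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = "the first paragraph of a typical english text ..."
h = char_entropy(text)
print(f"{h:.2f} bits/char vs 8 bits/char raw; residual structure remains")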
II. EXTENSIONS
In addition to natural languages, many text files have extra
structural redundancy that can be used for error correction. An
important class of files is webpages, which are mostly written in HTML
and are highly structured.
Example 1. Here is a simple example of a webpage (written in
one line): <!DOCTYPE html> <html> <body> <h1> The
First Heading </h1> <p> The first paragraph. </p> · · · · · ·
</body> </html>
The tags (such as “<html>” and “</html>”) usually
come in pairs and together form a tree structure [7]. They
appear frequently in webpages, and their regular structures
enable effective error correction using efficient algorithms,
e.g., dynamic programming. Another important class of files
is databases, including both relational and semi-structured databases [2]. Relational databases use highly structured tables with well-defined schemas to store large amounts of data. Semi-structured databases use structured languages such as XML to store big data in flexible forms. This structural redundancy is well suited to language-based decoding, and the techniques can be combined with
those for natural languages.
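As a minimal sketch of this idea, the Python code below corrects a corrupted tag by matching it against a small vocabulary of legal HTML tags with the standard edit-distance dynamic program; the tag list and function names are illustrative assumptions, not the method of [4], [5].

# Sketch: correct a corrupted HTML tag by dynamic programming.
# The closest legal tag under edit distance is chosen; the tag
# vocabulary here is a small illustrative subset.

def edit_distance(s, t):
    """Levenshtein distance via the standard one-row dynamic program."""
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (cs != ct))   # substitution
    return dp[-1]

HTML_TAGS = ["<html>", "</html>", "<body>", "</body>",
             "<h1>", "</h1>", "<p>", "</p>"]

def correct_tag(noisy):
    """Return the legal tag closest to the noisy string."""
    return min(HTML_TAGS, key=lambda tag: edit_distance(noisy, tag))

print(correct_tag("<bodx>"))   # -> "<body>"

In a full decoder, such edit distances could be converted to likelihoods and weighed against channel soft information, similar to the word-recognition sketch above.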
Language-based decoding has some fundamental differences from error-correcting codes. First, languages are open systems: unlike the valid codewords of ECCs, which are predefined and form a closed set, languages are well known to be open [6]. New words, expressions and grammatical rules continue to appear, and rare or new cases are abundant. Second, unlike ECCs, whose codewords form a regular (often linear) space, the codewords for language tokens (e.g., words, punctuation marks) are much more chaotic, although not totally random. New decoding algorithms with efficient search procedures therefore need to be developed.
REFERENCES
[1] EMC Education Services, Information Storage and Management, 2nd edition, Wiley, 2012.
[2] H. Garcia-Molina, J. D. Ullman and J. Widom, Database Systems: The Complete Book, 2nd edition, 2008.
[3] Hitachi Data Systems Academy, Storage Concepts, Amazon Digital Services, 2014.
[4] A. Jiang, Y. Li and J. Bruck, “Error correction through language processing,” to appear in Proc. IEEE Information Theory Workshop (ITW), April 2015.
[5] A. Jiang, Y. Li and J. Bruck, “Enhanced error correction via language processing,” to appear in Proc. Non-Volatile Memories Workshop (NVMW), March 2015. Full version available at http://faculty.cs.tamu.edu/ajiang/enhanced.pdf.
[6] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, 1st edition, The MIT Press, 1999.
[7] J. N. Robbins, Learning Web Design, 4th edition, O’Reilly Media, 2012.