- Character encoding in corpus construction (2004) - Anthony McEnery, Zhonghua Xiao
- Page 10: “While UTF-32 is wasteful of memory and disk space for all languages, UTF-16 also doubles the size of a file containing single-byte characters (such as English), though for CJK languages that have already used 2-byte encodings traditionally, the file size remains more or less the same.”
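
The size comparison in the quote can be checked with a minimal sketch. The sample strings, the helper name `encoded_sizes`, and the choice of ASCII and GB2312 as the "traditional" encodings are illustrative assumptions, not taken from the paper.

```python
def encoded_sizes(text, legacy_encoding):
    """Return the byte length of `text` under a legacy encoding and three Unicode forms."""
    return {
        legacy_encoding: len(text.encode(legacy_encoding)),
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Hypothetical samples: a 6-letter English word and a 3-character Chinese word.
english = encoded_sizes("corpus", "ascii")
chinese = encoded_sizes("语料库", "gb2312")

print(english)  # {'ascii': 6, 'utf-8': 6, 'utf-16': 12, 'utf-32': 24}
print(chinese)  # {'gb2312': 6, 'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
```

For these samples, UTF-16 doubles the English text relative to ASCII, while the Chinese text is the same size in UTF-16 as in GB2312. UTF-32 is the largest form in both cases.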