November 11, 2018

Not all quotation marks are made equal

As part of moving special collections stock, we needed to identify item records containing a non-public note (MARC field 852 $x, or MARC field 952 $x in Koha). Historically, this free-text field has been used to record the precise location of each item, such as section, folder, etc. The initial report drawn up by the systems librarian, however, showed only a fraction of the items housed in the special collection location. On closer examination we discovered that the records missing from the report had double quotation marks around the note text, which had carried through into the MARC data. Clearly Koha was not happy about that. The systems librarian opened a single record and manually replaced the double quotes. This seemed to work and the record showed up on the report… except that the record length in the MARC record leader (LDR) had changed. This was unexpected, as the number of characters in the MARC record had remained the same, yet there was a difference of 4 between the original and the edited record length.

Greeted with this peculiar news the next morning, I had another look at the record. The graphic design education must have kicked in, and I noticed that the original record had left (U+201C) and right (U+201D) double quotation marks, which take up three times as many bits in UTF-8 as the standard double quotation mark (U+0022).

"   U+0022 QUOTATION MARK               00100010                    = 8 bits
“   U+201C LEFT DOUBLE QUOTATION MARK   11100010:10000000:10011100  = 24 bits
”   U+201D RIGHT DOUBLE QUOTATION MARK  11100010:10000000:10011101  = 24 bits

When you do the simple math, each typographic quotation mark shrinks from 24 bits to 8 bits when it is replaced with the standard quotation mark, so replacing the left and right pair shortened the data by 2 × 16 = 32 bits. Cross-referencing with the latest MARC 21 Specifications for Record Structure, I found that the term length refers to “a measure of the size of a data element, field, or record and is expressed in number of octets”. In turn, an octet is “a string of 8 bits. In some cases (e.g., ASCII) each octet represents a character; in other cases (e.g., Unicode) multiple octets may represent a character”. Mystery solved! My record length difference of 4 is actually 4 octets, 4 bytes or 32 bits.
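
The arithmetic can be double-checked with a few lines of Python. This is only a sketch: the note text is invented for illustration, and Python is used here as a convenient calculator, not as part of the original troubleshooting.

```python
# Hypothetical non-public note wrapped in typographic quotes (invented sample).
original = "\u201cSection B, folder 12\u201d"
# The manual fix: swap both typographic quotes for the ASCII quotation mark.
edited = original.replace("\u201c", '"').replace("\u201d", '"')

# Print the UTF-8 bit pattern of each of the three quotation marks.
for ch in ('"', "\u201c", "\u201d"):
    encoded = ch.encode("utf-8")
    bits = ":".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}  {bits}  = {len(encoded) * 8} bits")

# Same number of characters, different number of octets.
octets_before = len(original.encode("utf-8"))
octets_after = len(edited.encode("utf-8"))
print(len(original) == len(edited))   # True
print(octets_before - octets_after)   # 4
```

The character count stays the same while the octet count drops by 4, which is exactly the change seen in the leader.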

Lessons learned:

  • Do not use punctuation unnecessarily in MARC records
  • If you need to use it, stick to ASCII characters (0-127) when possible
  • Use digraphs in vim to insert characters that are not available on your keyboard. While in insert mode, press CTRL-K (a question mark will appear) followed by the two-character digraph value to enter any of the characters from the digraph-table. However, if you want to enter the infamous left and right double quotation marks, vim must be compiled with multibyte support and use a multibyte encoding, so that the extended digraph set from the digraph-table-mbyte is available. To check for multibyte support, run :echo has('multi_byte'); if it returns 1 you are good to go. Then, in insert mode, press CTRL-K and enter the digraph value: "6 for the left double quotation mark and "9 for the right double quotation mark. If this doesn’t work, most likely in Windows gVim, make sure the following is in your _vimrc file:
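" Font used by the GUI version (gVim)
if has('gui_running')
  set guifont=Lucida_Console:h11
endif

if has("multi_byte")
  " Remember the terminal's encoding before changing Vim's internal one
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  " Use UTF-8 internally and as the default for new files
  set encoding=utf-8
  setglobal fileencoding=utf-8
  "setglobal bomb
  " Encodings to try, in order, when opening an existing file
  set fileencodings=ucs-bom,utf-8,latin1
endif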

Hopefully that will enable you to see the typographical quotes in the GUI. You can check that a character is correctly encoded by positioning the cursor over it and running the :as (short for :ascii) command, which will show you the code point of the character. Use :w ++enc=utf-8 when saving your file, just to be on the safe side.
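
Checking one character at a time gets tedious for a long note, so the same idea can be scripted. The sketch below is an assumption of mine rather than anything Koha or vim provides: non_ascii_report is a hypothetical helper and the sample note is invented. It lists every character outside the ASCII range along with its code point, Unicode name and UTF-8 octet count.

```python
import unicodedata


def non_ascii_report(text):
    """Return (index, code point, name, UTF-8 octets) for each non-ASCII character."""
    report = []
    for index, ch in enumerate(text):
        if ord(ch) > 127:  # anything above 127 is outside ASCII
            report.append((
                index,
                f"U+{ord(ch):04X}",
                unicodedata.name(ch, "UNKNOWN"),
                len(ch.encode("utf-8")),
            ))
    return report


# Invented sample note wrapped in typographic quotes.
note = "\u201cSection B, folder 12\u201d"
for row in non_ascii_report(note):
    print(row)
```

An empty report means the text is plain ASCII, so every character takes exactly one octet.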

As soon as I created the _vimrc file in gVim with the above contents, the backspace key stopped working in insert mode. Apparently this is a feature, not a bug, and can be fixed by putting set backspace=indent,eol,start in your _vimrc file. See :help 'backspace' for the long explanation.

Oh wait, another thing: as soon as you create a _vimrc file in gVim (Edit>Startup Settings), the syntax highlighting disappears. You can turn it back on with :syntax on, or just put syntax on in your _vimrc file to enable it at startup. Custom syntax highlighting is a topic for another post.

© 2018 Miglena Minkova