Summer of Code 2012 - Text processing
Last week I finally merged the test code for my Summer of Code project into Mapnik’s code.
The code currently doesn’t render any output image but it already does this:
- Load correct font
- Process each part of the text with correct direction, script and format settings
- Print a list of all glyphs
I also added test cases from the bug reports and removed lots of code that is no longer needed.
Line breaking was not added yet, it turned out to be even more complex than I initially thought. First I want to get my code to actually render output because this makes verifying correct operation of the line breaking algorithm much easier.
I was asked why line breaking is harder and why special libraries (like liblinebreak or ICU) don’t help much. Here is why:
In simple text processing one byte is assumed to be one character and result one glyph. These assumptions are wrong. As 256 possible values for one byte are a much to small value to represent all characters in all languages some characters need two or more bytes. This is handled by ICU by storing them in 16 bit values (UTF-16). However even 16 bits sometimes are to little and several UTF16 codepoints are needed. One has to keep track of this and make sure to never break a sequence.
This is the easy part and mostly handled by ICU. However for line breaking one has to know how long the line is. To do this one has to determine the width of each character. But there is no such thing as the “width of a character”. Only glyphs have widths. And here it starts to get tricky: HarfBuzz does the shaping (i.e. returning glyphs for a given text), but one can’t use glyphs for line breaking because the don’t directly map to characters.
One glyph can represent several characters (like the ﬁ-ligature) and one character can result in several glyphs. Another big problem is that glyphs are in visual order (left-to-right) after shaping but line breaking has to be done in logical order (left-to-right or right-to-left depending on text direction).
As the algorithm should not reorder the glyphs several times or build a character to glyph mapping to be reasonable fast it has to be careful to keep track which glyphs go to which line.
As written above I will work on rendering next. This is a big task and I don’t expect to get more than rendering done this week.Posted by Hermann Kraus on 02 July 2012.