qt6-bb10/util/unicode
Ievgenii Meshcheriakov 1f73d4b87c Unicode line breaking: Implement rules LB15a and LB15b
The new rules were added in Unicode 15.1 (TR #14, revision 51).

The rules read:

    LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW)
           [\p{Pi}&QU] SP* ×
    LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX
           | IS | SY | BK | CR | LF | NL | ZW | eot)

Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to
represent quotation characters with context that matches left
side of LB15a and right side of LB15b respectively. This way
it is still possible to use the line breaking classes table.
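
The reclassification described above can be pictured with a small
sketch. This is not the code in main.cpp: the enum, the Item struct and
the function names are hypothetical, only a reduced set of classes is
modelled, and other rules (such as combining-mark handling) are
ignored. It assumes each character already carries its line breaking
class plus a flag saying whether it is \p{Pi} or \p{Pf}.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of line breaking classes (AL stands in for
// "anything else"). QU_Pi and QU_Pf mirror the two new classes.
enum class LB { BK, CR, LF, NL, OP, QU, GL, SP, ZW, WJ, CL, CP, EX, IS, SY, AL,
                QU_Pi, QU_Pf };

// Whether a character is initial (\p{Pi}) or final (\p{Pf}) punctuation.
enum class Quote { None, Initial, Final };

struct Item { LB cls; Quote quote; };

// Classes allowed before [\p{Pi}&QU] in LB15a ("sot" is handled by the caller).
static bool lb15aLeft(LB c)
{
    switch (c) {
    case LB::BK: case LB::CR: case LB::LF: case LB::NL: case LB::OP:
    case LB::QU: case LB::GL: case LB::SP: case LB::ZW:
        return true;
    default:
        return false;
    }
}

// Classes allowed after [\p{Pf}&QU] in LB15b ("eot" is handled by the caller).
static bool lb15bRight(LB c)
{
    switch (c) {
    case LB::SP: case LB::GL: case LB::WJ: case LB::CL: case LB::QU:
    case LB::CP: case LB::EX: case LB::IS: case LB::SY: case LB::BK:
    case LB::CR: case LB::LF: case LB::NL: case LB::ZW:
        return true;
    default:
        return false;
    }
}

// Returns the classes to feed into a pair table: quotes whose context
// matches are promoted to QU_Pi or QU_Pf, everything else is unchanged.
// Context is always read from the original classes, not the promoted ones.
std::vector<LB> resolveQuotes(const std::vector<Item> &text)
{
    std::vector<LB> out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const Item &item = text[i];
        // Only QU characters may carry a quote kind in this sketch.
        assert(item.quote == Quote::None || item.cls == LB::QU);
        LB cls = item.cls;
        if (item.quote == Quote::Initial) {
            if (i == 0 || lb15aLeft(text[i - 1].cls))
                cls = LB::QU_Pi;
        } else if (item.quote == Quote::Final) {
            if (i + 1 == text.size() || lb15bRight(text[i + 1].cls))
                cls = LB::QU_Pf;
        }
        out.push_back(cls);
    }
    return out;
}
```

With the context folded into the class in this way, the remaining
parts of the rules ("SP* ×" after QU_Pi, "×" before QU_Pf) only
depend on adjacent classes, which is what a pair table can express.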

Also add a comment about the original source of the line
break table.

Task-number: QTBUG-121529
Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2024-02-08 17:43:58 +01:00
data unicode: Import version 15.1 (UCD version 32) 2024-02-08 16:43:58 +00:00
x11 Initial import from the monolithic Qt. 2011-04-27 12:05:43 +02:00
.gitattributes Initial import from the monolithic Qt. 2011-04-27 12:05:43 +02:00
README Unicode: Add script to facilitate UCD update 2021-10-05 20:38:02 +02:00
main.cpp Unicode line breaking: Implement rules LB15a and LB15b 2024-02-08 17:43:58 +01:00
unicode.pro unicode: More compact IDNA mapping tables implementation 2021-09-03 14:43:16 +02:00
update_ucd.sh Use SPDX license identifiers 2022-05-16 16:37:38 +02:00
writingSystems.sh Use SPDX license identifiers 2022-05-16 16:37:38 +02:00

README

The unicode tool in this directory is used to generate the Unicode
data in src/corelib/text/.

To update:
* Run `./update_ucd.sh $Version`. This automates the following steps:
  * Find the data (UAX #44, UCD; not the XML version) at
    https://www.unicode.org/Public/zipped/$Version/
  * Unpack the zip file; for each file in data/, replace with the new
    version; find the *BreakProperty.txt in auxiliary/ and emoji-data.txt
    in emoji/.
  * In tst_QTextBoundaryFinder's data/ sub-directory, update its files
    from the auxiliary/ sub-directory of the UCD data.
  * Download https://www.unicode.org/Public/idna/$Version/IdnaMappingTable.txt
    and put it into data/.
  * Download https://www.unicode.org/Public/idna/$Version/IdnaTestV2.txt
    and put it into tests/auto/corelib/io/qurluts46/testdata.
* If needed, add an entry to enum QChar::UnicodeVersion for the new
  Unicode version
* In that case, also update main.cpp's initAgeMap and DATA_VERSION_S*
  to match
* Build this project. Its binary, unicode, ignores command-line
  options and assumes it is being run from this directory. When run,
  it produces lots of output. If it gets as far as updating
  qunicodetables.cpp the output hopefully doesn't matter.
* It'll end prematurely with a qFatal() message if it needs updates,
  either in main.cpp or in QChar. For each message, update the places
  listed after it:
  * "unassigned or unhandled age value:" initAgeMap() and
    QChar::UnicodeVersion;
  * "Unhandled script property value:" initScriptMap(), QChar::Script,
    qharfbuzzng.cpp's _qtscript_to_hbscript[] array and
    qfontconfigdatabase.cpp's specialLanguages;
  * "unassigned word break class:" enum WordBreakClass,
    word_break_class_string and initWordBreak().
* Assertions or other qFatal()s may trigger: if so, study the code
  and understand what's more complicated about this update; talk to
  folk named in the git logs, maybe push a WIP to gerrit to solicit
  advice. Some bit-field may need to be expanded, for example. In some
  cases QChar may need additions to some of its enums.
* Build with the modified code, fix any compilation issues, make check
  in suitable directories, including tst_QTextBoundaryFinder.
* That may have updated qtbase/src/corelib/text/qunicodetables.cpp; if
  so, the update matters. Be sure to commit the changes to data/ at
  the same time and update text/qt_attribution.json to match. In the
  attribution, use the UCD Revision number as the Version, not the
  Unicode standard number, even though qunicodetables.cpp uses the
  latter (see the 'UAX #44, UCD' page linked from
  https://www.unicode.org/ucd/ for the table relating the two).
* If there are enum additions in qchar.h (public API), be sure to also
  update the documentation in qchar.cpp for each affected enum,
  respecting the existing ordering.
* If you don't normally build in the source tree, remember to delete
  qtbase/.qmake.stash while you're cleaning up.
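
To illustrate why a new Unicode version needs both an enum entry and a
map entry, here is a minimal sketch of the kind of lookup behind the
"unassigned or unhandled age value:" message listed above. It is an
assumption about the general pattern, not the code in main.cpp: the
UnicodeVersion enum here is a hypothetical stand-in for
QChar::UnicodeVersion, makeAgeMap() is a stand-in for initAgeMap(),
and the Unhandled sentinel replaces the qFatal() call so that the
sketch can be exercised without aborting.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for QChar::UnicodeVersion; the real enum has
// many more entries. Unhandled exists only in this sketch.
enum class UnicodeVersion { Unhandled, V15_0, V15_1 };

using AgeMap = std::map<std::string, UnicodeVersion>;

// Stand-in for initAgeMap(): maps "Age" strings from the UCD to enum
// values. A new Unicode version needs a new entry here.
AgeMap makeAgeMap()
{
    AgeMap ages;
    ages["15.0"] = UnicodeVersion::V15_0;
    ages["15.1"] = UnicodeVersion::V15_1;
    return ages;
}

// Looks up an age string; returns Unhandled where the tool described
// above would instead stop with a qFatal() message.
UnicodeVersion lookupAge(const AgeMap &ages, const std::string &age)
{
    assert(!age.empty());
    const AgeMap::const_iterator it = ages.find(age);
    return it == ages.end() ? UnicodeVersion::Unhandled : it->second;
}
```

In this sketch, looking up "15.1" succeeds, while an age string that
has no entry yet comes back as Unhandled, which corresponds to the
point where the enum and the map would both need extending.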

The script writingSystems.sh generates a list of writing systems,
ostensibly as the basis for updating the QFontDatabase::WritingSystem
enum; however, its Release 20 output contains many more writing
systems than are present in that enum, suggesting it has not been run
in a very long time. Further research needed.