From b14ebce47dc0a1c1ecc676661e12e12a1b7d1f47 Mon Sep 17 00:00:00 2001 From: Reuben Thomas Date: Mon, 15 Jan 2018 15:40:52 +0000 Subject: [PATCH] Move maintainer documentation into HACKING.org Minor updates, and add how to release with woger --- HACKING.org | 152 ++++++++++++++++++++++++++++++++++++++++++++++++ README.org | 163 +--------------------------------------------------- 2 files changed, 155 insertions(+), 160 deletions(-) create mode 100644 HACKING.org diff --git a/HACKING.org b/HACKING.org new file mode 100644 index 0000000..53264b4 --- /dev/null +++ b/HACKING.org @@ -0,0 +1,152 @@ +#+TITLE: HACKING for Recode +#+OPTIONS: H:2 toc:2 + +* Building from git + +#+BEGIN_SRC sh + git clone https://github.com/rrthomas/Recode.git + cd Recode + ./bootstrap + ./configure + make check + make install +#+END_SRC + +* Maintenance tools +Beyond the usual Unix programs needed for configuring and installing +any GNU package, you need Cython, Flex and Python to achieve simple +modifications to Recode. + +For more encompassing modifications, you might also need recent +versions of Autoconf, automake, Flex, Gettext, Help2man, libtool, m4, +GNU Make, Perl, tar and wget. Just make sure you install m4 before +Autoconf, and Perl before automake or Help2man. + +** Making a release + +To make a release, you'll need [[https://github.com/rrthomas/woger][woger]] and [[https://github.com/aktau/github-release][github-release]], suitably configured. Having tested and pushed all the changes, update the version number in =configure.ac= and write the =NEWS= entry and push those too. + +Then execute: ~make release~ + +* The future +** Motivation +Recode is due for a major overhaul. I want to add a run-time dependency +between Recode and Python, with the admitted goal of shifting the internals +of Recode from C to Python. + +For experimenting what Recode might become and experimenting new +concepts more easily, I created a subsidiary and standalone Python +project named [[https://github.com/pinard/Recodec][Recodec]], which reproduces a good part of Recode +functionality. My goal is now to merge Recodec back into Recode, +rather than slowly stretching the distance between Recode and Recodec. +Recode is going to be a mix of Python, C and Cython. + +** Overall plan +Recode 4 should be organised thus: + +- The main program is written in Python, and through a Cython + interface, calls the existing C API for doing the real work. +- The C API gets merely able to use Cython written steps internally, + besides the actual C steps, but with no Cython steps yet. +- New Cython steps wrap many standard Python codecs, with some + trickery to force Python codecs over actual, older Recode steps. +- Recode library initialization is moved from C to Python, and gets + called through Cython from the C API. +- Initialization is extended to cover the Recodec Python API, which + uses different tables and descriptive data. +- More steps from Recodec get moved into Recode, either coexisting + with or taking over the previous wrapping of Python codecs. +- The remaining code from the Recodec engine gets moved into Recode, + replacing C code having the same fonctionality. +- Special care is given to GNU *libc* or *libiconv* support, maybe going + from the C side to the Python side. +- Proper documentation and decisions follow extensive comparison and + diagnostic of multiple implementations of same charsets or surfaces. +- Profiling allows to fine tune when and how Cython gets used over + Python; standard Python codecs might even be cythonized in Recode. +- Program and library initialization get revised to spare disk + accesses and building descriptive structures, whenever possible. +- The main program directly links to the Python API rather than + through the C API, while the C API becomes a separate facility. + +** Planned differences +Whenever the Python library offers a charset or a surface which Recode +also has, the Python library codec is used. In some cases, this +introduces differences, those will have to be resolved one by one, +either by accepting that the Python library does better, getting the +Python team to improve some codecs, or overriding these from Recodec. + +Other differences may occur, especially in the Asian charset area, +from the fact *libiconv*, GNU *libc* recoding facilities, and various +contributors to the Python codecs project, do not fully agree on how +things should be done. Recodec is likely to offer configuration +mechanisms to choose among various possibilities, but will not likely +attempt to rule out who is right and who is wrong! ☺ + +Issues about reversibility and canonicity, which were much present in +Recode 3.X, are fading out. While some of these were moderately easy +to implement, other cases stayed pending as fairly difficult to solve +without a significant loss of efficiency. I think these issues are +better abandoned than forever kept as half-hearted and not wholly +dependable. Any user concerned about such things might try the +reverse coding to find out if the original file is recoverable, some +new option might automate a (costly) reversibility test. + +One drawback of the whole move is that the Global Interpreter Lock in +Python gets in the way of parallel execution of the code. This would +have been more of a concern if GNU *libc* recoding facilities were +relying on the Recode library, but as things stand by now, I'm +guessing that users will not be much impacted in practice. + +* Other pointers +** Documentation +- IETF references + + - [[ftp://nic.ddn.mil/rfc/rfc1345.txt][Character Mnemonics & Character Sets]], by [[mailto:keld@dkuug.dk][Keld Simonsen]], 1992-06. + - [[ftp://nic.ddn.mil/rfc/rfc1642.txt][UTF-7 - A Mail-Safe Transformation Format of Unicode]], by [[mailto:david_goldsmith@taligent.com][David + Goldsmith]] and [[mailto:mark_davis@taligent.com][Mark Davis]], 1994-07. + - [[ftp://nic.ddn.mil/rfc/rfc2044.txt][UTF-8, a transformation format of Unicode and ISO 10646]], by [[mailto:yergeau@alis.com][François Yergeau]], 1997-10. + +- Various references + + - [[ftp://ftp.unicode.org:/Public/MAPPINGS/][Unicode charset mappings]]. The Unicode consortium makes available + plenty of charset mappings for converting /legacy/ charsets to + Unicode. + - [[ftp://ftp.iro.umontreal.ca/pub/contrib/pinard/accents/oqil-tome1.ps.gz][Normalisation et internationalisation: Inventaire et prospectives + des normes clefs pour le traitement informatique du français.]] + (392p.) or [[http://www.ceveil.qc.ca/Normes][this other copy]]. This is a report, written in French, + discussing charset issues and many other topics as well. [[mailto:bourbeau@progiciels-bpi.ca][Laurent + Bourbeau]] and [[mailto:pinard@iro.umontreal.ca][François Pinard]], 1995-10. + +- Recode specific + + - ETL presentation + + In 1999, the organisers of the [[http://www.m17n.org/conference/m17n99_all_but_registration/welcome.en.html][m17n99 conference]] in Tsukuba, + Japan, were kind enough to invite me. This has been for me a + fabulous trip and experience, and I met many extraordinary people + in there. At the conference, I presented the Translation Project, + and Recode. The Recode [[http:/m17n99.html][presentation slides]] are available. + +** Programs +- libiconv :: This comprehensive [[http://www.gnu.org/software/libiconv/][charset converter library]], by [[mailto:haible@ilog.fr][Bruno + Haible]], revolves around Unicode, and support Asian + encodings among many others. Even Recode uses it! +- tcs :: Here is the [[ftp://research.att.com/dist/tcs.shar.Z][main recoding tool]] from the Plan9 project. +- yuedit :: This [[ftp://sunsite.unc.edu/pub/Linux/apps/editors/X/yudit-1.2.tar.gz][GUI editor]], by [[mailto:gsinai@iname.com][Gaspar Sinai]], 1999-01, handles many + encodings, among which UTF-8. It also installs *uniconv*, a + recoding program, and *uniprint*, a printing tool. +- ucs-fonts :: These [[http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz][6x13 fonts]], by [[mailto:Markus.Kuhn@cl.cam.ac.uk][Markus Kuhn]], 1998-11, covering + Unicode characters besides the Asian sets, merely + replace the Linux fixed 6x13 font. Works nicely with + *yudit*. +- MtRecode :: This [[http://www.lpl.univ-aix.fr/projects/multext/MtRecode/][charset converter]] is oriented towards SGML text + manipulation. It may be freely downloaded for + non-commercial, non-military use. Pointer given by [[mailto:veronis@univ-aix.fr][Jean + Véronis]], 1996-06. +- sp :: This quite nice SGML [[ftp://ftp.jclark.com/pub/sp/sp-1.3.tar.gz][structure analyser]], by [[mailto:jjc@jclark.com][James Clark]], + contains internal C++ modules for handling many charsets. +- b2c :: This [[http://research.de.uu.net:8080/~gnu/b2c/b2c-2.1.tar.gz][program]], by [[mailto:Joerg.Heitkoetter@de.uu.net][Jörg Heitkötter]], 1997-11, is able to + generate interpreted character dumps, but properly embedded + within complete C header files. +- PyRecode :: This [[http://www.suxers.de/PyRecode.tgz][wrapper]], by [[mailto:ajung@server.python.net][Andreas Jung]], provides Recode functionality to Python programs. Also see [[http://www.vex.net/parnassus/apyllo.py?find%3Drecode][this link]] and [[http://www.suxers.de/python/pyrecode.htm][this other link]]. diff --git a/README.org b/README.org index 088816a..6fb0b2e 100644 --- a/README.org +++ b/README.org @@ -44,16 +44,7 @@ past, pretest or official, as well as individual files. * Installation ** In a hurry? -You may then try: - -#+BEGIN_SRC sh - git clone https://github.com/rrthomas/Recode.git - cd Recode - ./bootstrap - ./configure - make check - make install -#+END_SRC +See =INSTALL=. More fine-grained instructions follow. @@ -61,7 +52,7 @@ More fine-grained instructions follow. Simple installation of Recode requires the usual tools and facilities as those needed for most GNU packages. If not already bundled with your system, you also need to pre-install [[http://www.python.org][Python]], version 2.2 or -better. +later. It is also convenient to have some *iconv* library already present on your system, this much extends Recode capabilities, especially in the @@ -89,157 +80,9 @@ Once you have an unpacked distribution, see files: |-------------+------------------------------------------------| Besides those configure options documented in files =INSTALL= and -=ABOUT-NLS=, a few extra options may be accepted after =./configure=: - -- Options =--disable-shared= or =--disable-static= - - to inhibit the building of shared libraries or static libraries; the - default is to always build static libraries, and to attempt building - shared libraries if there is some known recipe for this. - -- Option =--with-gnu-ld= - - to force the assumption that the C compiler uses GNU *ld*. +=ABOUT-NLS=, a few extra options may be accepted by =./configure=: - Option =--with-dmalloc= to trigger a debugging feature for looking at memory management problems, it pre-requires Gray Watson's [[ftp://ftp.letters.com/src/dmalloc/dmalloc.tar.gz][dmalloc package]]. - -** Installation hints -Here are a few hints which might help installing Recode on some -systems. Many may be applied by temporary presetting environment -variables while calling =./configure=. File =INSTALL= explains this. - -* Maintenance tools -Beyond the usual Unix programs needed for configuring and installing -any GNU package, you need Cython, Flex and Python to achieve simple -modifications to Recode. - -For more encompassing modifications, you might also need recent -versions of Autoconf, automake, Flex, Gettext, Help2man, libtool, m4, -GNU Make, Perl, tar and wget. Just make sure you install m4 before -Autoconf, and Perl before automake or Help2man. - -* The future -** Motivation -Recode is due for a major overhaul. I want to add a run-time dependency -between Recode and Python, with the admitted goal of shifting the internals -of Recode from C to Python. - -For experimenting what Recode might become and experimenting new -concepts more easily, I created a subsidiary and standalone Python -project named [[https://github.com/pinard/Recodec][Recodec]], which reproduces a good part of Recode -functionality. My goal is now to merge Recodec back into Recode, -rather than slowly stretching the distance between Recode and Recodec. -Recode is going to be a mix of Python, C and Cython. - -** Overall plan -Recode 4 should be organised thus: - -- The main program is written in Python, and through a Cython - interface, calls the existing C API for doing the real work. -- The C API gets merely able to use Cython written steps internally, - besides the actual C steps, but with no Cython steps yet. -- New Cython steps wrap many standard Python codecs, with some - trickery to force Python codecs over actual, older Recode steps. -- Recode library initialization is moved from C to Python, and gets - called through Cython from the C API. -- Initialization is extended to cover the Recodec Python API, which - uses different tables and descriptive data. -- More steps from Recodec get moved into Recode, either coexisting - with or taking over the previous wrapping of Python codecs. -- The remaining code from the Recodec engine gets moved into Recode, - replacing C code having the same fonctionality. -- Special care is given to GNU *libc* or *libiconv* support, maybe going - from the C side to the Python side. -- Proper documentation and decisions follow extensive comparison and - diagnostic of multiple implementations of same charsets or surfaces. -- Profiling allows to fine tune when and how Cython gets used over - Python; standard Python codecs might even be cythonized in Recode. -- Program and library initialization get revised to spare disk - accesses and building descriptive structures, whenever possible. -- The main program directly links to the Python API rather than - through the C API, while the C API becomes a separate facility. - -** Planned differences -Whenever the Python library offers a charset or a surface which Recode -also has, the Python library codec is used. In some cases, this -introduces differences, those will have to be resolved one by one, -either by accepting that the Python library does better, getting the -Python team to improve some codecs, or overriding these from Recodec. - -Other differences may occur, especially in the Asian charset area, -from the fact *libiconv*, GNU *libc* recoding facilities, and various -contributors to the Python codecs project, do not fully agree on how -things should be done. Recodec is likely to offer configuration -mechanisms to choose among various possibilities, but will not likely -attempt to rule out who is right and who is wrong! ☺ - -Issues about reversibility and canonicity, which were much present in -Recode 3.X, are fading out. While some of these were moderately easy -to implement, other cases stayed pending as fairly difficult to solve -without a significant loss of efficiency. I think these issues are -better abandoned than forever kept as half-hearted and not wholly -dependable. Any user concerned about such things might try the -reverse coding to find out if the original file is recoverable, some -new option might automate a (costly) reversibility test. - -One drawback of the whole move is that the Global Interpreter Lock in -Python gets in the way of parallel execution of the code. This would -have been more of a concern if GNU *libc* recoding facilities were -relying on the Recode library, but as things stand by now, I'm -guessing that users will not be much impacted in practice. - -* Other pointers -** Documentation -- IETF references - - - [[ftp://nic.ddn.mil/rfc/rfc1345.txt][Character Mnemonics & Character Sets]], by [[mailto:keld@dkuug.dk][Keld Simonsen]], 1992-06. - - [[ftp://nic.ddn.mil/rfc/rfc1642.txt][UTF-7 - A Mail-Safe Transformation Format of Unicode]], by [[mailto:david_goldsmith@taligent.com][David - Goldsmith]] and [[mailto:mark_davis@taligent.com][Mark Davis]], 1994-07. - - [[ftp://nic.ddn.mil/rfc/rfc2044.txt][UTF-8, a transformation format of Unicode and ISO 10646]], by [[mailto:yergeau@alis.com][François Yergeau]], 1997-10. - -- Various references - - - [[ftp://ftp.unicode.org:/Public/MAPPINGS/][Unicode charset mappings]]. The Unicode consortium makes available - plenty of charset mappings for converting /legacy/ charsets to - Unicode. - - [[ftp://ftp.iro.umontreal.ca/pub/contrib/pinard/accents/oqil-tome1.ps.gz][Normalisation et internationalisation: Inventaire et prospectives - des normes clefs pour le traitement informatique du français.]] - (392p.) or [[http://www.ceveil.qc.ca/Normes][this other copy]]. This is a report, written in French, - discussing charset issues and many other topics as well. [[mailto:bourbeau@progiciels-bpi.ca][Laurent - Bourbeau]] and [[mailto:pinard@iro.umontreal.ca][François Pinard]], 1995-10. - -- Recode specific - - - ETL presentation - - In 1999, the organisers of the [[http://www.m17n.org/conference/m17n99_all_but_registration/welcome.en.html][m17n99 conference]] in Tsukuba, - Japan, were kind enough to invite me. This has been for me a - fabulous trip and experience, and I met many extraordinary people - in there. At the conference, I presented the Translation Project, - and Recode. The Recode [[http:/m17n99.html][presentation slides]] are available. - -** Programs -- libiconv :: This comprehensive [[http://www.gnu.org/software/libiconv/][charset converter library]], by [[mailto:haible@ilog.fr][Bruno - Haible]], revolves around Unicode, and support Asian - encodings among many others. Even Recode uses it! -- tcs :: Here is the [[ftp://research.att.com/dist/tcs.shar.Z][main recoding tool]] from the Plan9 project. -- yuedit :: This [[ftp://sunsite.unc.edu/pub/Linux/apps/editors/X/yudit-1.2.tar.gz][GUI editor]], by [[mailto:gsinai@iname.com][Gaspar Sinai]], 1999-01, handles many - encodings, among which UTF-8. It also installs *uniconv*, a - recoding program, and *uniprint*, a printing tool. -- ucs-fonts :: These [[http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz][6x13 fonts]], by [[mailto:Markus.Kuhn@cl.cam.ac.uk][Markus Kuhn]], 1998-11, covering - Unicode characters besides the Asian sets, merely - replace the Linux fixed 6x13 font. Works nicely with - *yudit*. -- MtRecode :: This [[http://www.lpl.univ-aix.fr/projects/multext/MtRecode/][charset converter]] is oriented towards SGML text - manipulation. It may be freely downloaded for - non-commercial, non-military use. Pointer given by [[mailto:veronis@univ-aix.fr][Jean - Véronis]], 1996-06. -- sp :: This quite nice SGML [[ftp://ftp.jclark.com/pub/sp/sp-1.3.tar.gz][structure analyser]], by [[mailto:jjc@jclark.com][James Clark]], - contains internal C++ modules for handling many charsets. -- b2c :: This [[http://research.de.uu.net:8080/~gnu/b2c/b2c-2.1.tar.gz][program]], by [[mailto:Joerg.Heitkoetter@de.uu.net][Jörg Heitkötter]], 1997-11, is able to - generate interpreted character dumps, but properly embedded - within complete C header files. -- PyRecode :: This [[http://www.suxers.de/PyRecode.tgz][wrapper]], by [[mailto:ajung@server.python.net][Andreas Jung]], provides Recode functionality to Python programs. Also see [[http://www.vex.net/parnassus/apyllo.py?find%3Drecode][this link]] and [[http://www.suxers.de/python/pyrecode.htm][this other link]]. -- 2.40.0