Ticket #28 (closed defect: worksforme)

Opened 7 years ago

Last modified 7 years ago

Encoding problems when converting svn -> darcs (or bzr).

Reported by: Luca <luca@…> Owned by: lele
Priority: critical Milestone: VersionOne
Component: svn Version: 0.9
Keywords: svn svndump non ascii error í Cc:

Description

I'm having some problems with the encoding of the character í ('i' with acute accent, &iacute; as an HTML entity). When converting an svn repository to darcs I get this message:

00:37:07 [I] Changeset "43"
00:37:07 [I] Log message: - Nuevo nivel de logging CRITICAL (L_CRI) para concordar con python.
- Mínimo cambio en el formato de logging.
- Cambio de sección de configuración de DB_DataObject a DBO.
00:37:07 [I] 110 pending changesets in state file
00:37:07 [C] Upstream change application failed
Configuration error: 'ascii' codec can't encode character u'\xed' in position 216: ordinal not in range(128): it seems that current encoding "UTF-8" cannot properly represent at least one of the characters in the upstream changelog. You need to use a wider character set, using "encoding" option.

My locale is UTF-8, and I even tried the encoding option, with no results. The weird thing is that other non-ASCII characters seem to work fine (á, é, ó, ú). When I use the svndump as the source I get no errors, but 'í' characters are not encoded properly:

Fri Feb  4 12:19:47 ART 2005  luca
  * - Nuevo nivel de logging CRITICAL (L_CRI) para concordar con python.
  - MÃ\adnimo cambio en el formato de logging.
  - Cambio de sección de configuración de DB_DataObject a DBO.

As you can see, ó in configuración is just fine, but í in Mínimo comes out as Ã\ad (MÃ\adnimo), which is wrong.
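One possible reading of that output, offered only as a guess and not confirmed anywhere in this ticket: the log message is stored as UTF-8, and something downstream looks at the bytes one at a time as if they were Latin-1. The sketch below (the helper name `utf8_seen_as_latin1` is made up for illustration) shows that both í and ó start with byte 0xC3, but the second byte of í is 0xAD, which in Latin-1 is the invisible soft hyphen, while the second byte of ó is a printable character. That would be consistent with only í showing up as an escape.

```python
import unicodedata


def utf8_seen_as_latin1(ch):
    """Return (utf8_bytes, text obtained by misreading those bytes as Latin-1)."""
    raw = ch.encode("utf-8")
    return raw, raw.decode("latin-1")


for ch in ("í", "ó"):
    raw, view = utf8_seen_as_latin1(ch)
    second = view[1]  # the character produced by the second UTF-8 byte
    print(repr(ch), raw, unicodedata.name(second), unicodedata.category(second))
```

The second byte of í maps to SOFT HYPHEN (category Cf, a format character), the second byte of ó to SUPERSCRIPT THREE. Whether darcs really decides what to escape on that basis is an assumption here.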

It's easy to reproduce the problem:

cd /tmp
svnadmin create testrepo
svn co file:///tmp/testrepo testwc
touch testwc/test
svn add testwc/test
svn ci -m 'í' testwc

Now you can tailor this repository to convert it to darcs with svn as the repo kind and you'll get the error, or 'svnadmin dump' it and use svndump as the repo kind to get the wrong encoding.

Versions:

  • Subversion: 1.2.3 (r15833)
  • Darcs: 1.0.4
  • Tailor: 0.9.19

Change History

comment:1 Changed 7 years ago by Lele Gaifax

Uhm, I did a quick test, and everything is working smoothly here. With this configuration

[project]
source = svn:source
target = darcs:target
root-directory = /tmp/tt/luca
start-revision = INITIAL

[svn:source]
repository=file:///tmp/tt/testrepo
module=/

[darcs:target]

I obtained a darcs repository where the following happens:

$ darcs changes
Tue Jan  3 10:18:35 CET 2006  lele
  * [project @ 1]
  í
$ echo $DARCS_DONT_ESCAPE_8BIT
1
$ unset DARCS_DONT_ESCAPE_8BIT
$ darcs changes
Tue Jan  3 10:18:35 CET 2006  lele
  * [project @ 1]
  \c3\ad

My environment says:

$ locale
LANG=it_IT.UTF-8
LC_CTYPE="it_IT.UTF-8"
...
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=
$ python -m locale
Locale aliasing:

Locale defaults as determined by getdefaultlocale():
------------------------------------------------------------------------
Language:  it_IT
Encoding:  utf-8
...

So, it must be something in your setup: I'll do whatever is needed to make it easier to spot this kind of problem, which is the most annoying misfeature of the millennium. In particular I'd like to understand what causes the selection of the ascii codec instead of the utf8 one...
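To help compare setups, here is a small diagnostic sketch. It is not part of tailor; the function names `describe_encodings` and `try_encode` are invented for this note. It reports the encodings Python picks up from the environment and shows that encoding í with the ascii codec fails with the same kind of message as in the report. On the Python 2 interpreters of that time, `sys.getdefaultencoding()` was normally 'ascii' regardless of the locale, which is one common source of this error; whether that is what happens inside tailor is not established in this ticket.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derives from the current environment."""
    return {
        "locale preferred": locale.getpreferredencoding(False),
        "default str encoding": sys.getdefaultencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


def try_encode(text, encoding):
    """Return None if text encodes cleanly, otherwise the error message."""
    try:
        text.encode(encoding)
        return None
    except UnicodeEncodeError as exc:
        return str(exc)


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print("%-22s %s" % (name, value))
    print("ascii :", try_encode("í", "ascii"))
    print("utf-8 :", try_encode("í", "utf-8"))
```

Running it in the same shell used for tailor, on both the working and the failing machine, should show whether the two environments disagree.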

comment:2 Changed 7 years ago by blindglobe@…

  • Summary changed from Encoding problems when converting svn -> darcs to Encoding problems when converting svn -> darcs (or bzr).

I'm getting the same problem, but with ä (a-umlaut), from a Swiss keyboard setting. I tried setting LANG to a UTF-8 locale, but it still gives an error similar to the one above (it thinks I'm trying to use an ASCII codec). So I think it's the same problem. Is there any way to force python/tailor to use a UTF-8 codec?

Tailor 0.9.19 Debian unstable (upgraded today)

repository:  https://svn.r-project.org/ESS/trunk/ revision: 1641 is the bad one...

(I'm converting from SVN to BZR).

comment:3 Changed 7 years ago by lele

Uhm, so maybe it's still that old Subversion habit of not properly escaping its own XML output, the one that led to the filter-badchars option.

Could you try enabling that option and report back the result?

comment:4 Changed 7 years ago by lele

  • Component changed from tailor to svn

comment:5 Changed 7 years ago by lele

  • Status changed from new to closed
  • Resolution set to worksforme

I'm closing this, as the following config file works for me:

[DEFAULT]
encoding=utf-8

[project]
source = svn:source
target = darcs:target
root-directory = /tmp/test#28
start-revision = 1640

[svn:source]
repository=https://svn.r-project.org/ESS
module=/trunk

[darcs:target]
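
As a side note on why a single encoding line in [DEFAULT] is enough: under Python's standard ConfigParser semantics, values in [DEFAULT] are visible from every other section. The sketch below assumes tailor reads its config file with those semantics (not verified here) and uses Python 3's configparser module to parse the file shown above.

```python
import configparser

CONFIG = """\
[DEFAULT]
encoding=utf-8

[project]
source = svn:source
target = darcs:target
root-directory = /tmp/test#28
start-revision = 1640

[svn:source]
repository=https://svn.r-project.org/ESS
module=/trunk

[darcs:target]
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# Every section, including the empty [darcs:target], inherits "encoding".
for section in parser.sections():
    print(section, "->", parser.get(section, "encoding"))
```

A section can still override the inherited value by setting its own encoding key.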