Fixing garbled multibyte comments in Gitweb
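I've been using Gitweb and thought it was handy that multibyte characters display correctly by default, but it turns out some comments were coming out garbled.
Looking for what the garbled comments had in common, it seemed to be the long ones, so I checked "chop_str", the string-truncation function in "gitweb.cgi".
The cause seems to be that the internal encoding is UTF-8 and the string is cut as-is with a regular expression and a character count.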
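A minimal sketch of that failure mode, assuming the string reaches the cut as raw UTF-8 bytes. The sample string and variable names are illustrative and not taken from gitweb:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for a 5-character hiragana string (3 bytes per character).
my $bytes = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86\xE3\x81\x88\xE3\x81\x8A";

# On a byte string, length() and "." both count bytes, not characters.
my $byte_len = length($bytes);
my $char_len = length(decode('UTF-8', $bytes));

# Asking the regex for 4 "characters" really takes 4 bytes:
# one complete character plus the first byte of the next one.
my ($cut) = $bytes =~ m/^(.{4})/s;

# Strict decoding dies on the truncated sequence, so the cut is not valid UTF-8.
my $copy = $cut;
my $cut_is_valid = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;

print "bytes=$byte_len chars=$char_len cut_is_valid=$cut_is_valid\n";
```

Short comments never get cut, which would explain why only the long ones were affected.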
So I handled it by converting the string to EUC-JP before the cut and back to UTF-8 afterwards.
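To make the conversion step concrete, here is a small sketch of the Encode::from_to round trip on its own, outside gitweb. The sample string and variable names are made up for illustration. Note that Encode::from_to works in place on a byte string.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 bytes for two hiragana characters (3 bytes each).
my $utf8 = "\xE3\x81\x82\xE3\x81\x84";
my $str  = $utf8;
my $utf8_len = length($str);

# Convert in place: UTF-8 bytes become EUC-JP bytes (2 bytes per character here).
Encode::from_to($str, 'utf-8', 'euc-jp');
my $euc_len = length($str);
my $euc_hex = unpack('H*', $str);

# Convert back in place and check that the original bytes are restored.
Encode::from_to($str, 'euc-jp', 'utf-8');
my $round_trip_ok = ($str eq $utf8) ? 1 : 0;

print "utf8_len=$utf8_len euc_len=$euc_len euc_hex=$euc_hex round_trip_ok=$round_trip_ok\n";
```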
Here's what the "chop_str" function looks like after editing.
# Try to chop given string on a word boundary between position
# $len and $len+$add_len. If there is no word boundary there,
# chop at $len+$add_len. Do not chop if chopped part plus ellipsis
# (marking chopped part) would be longer than given string.
sub chop_str {
    my $str = shift;
    my $len = shift;
    my $add_len = shift || 10;
    my $where = shift || 'right'; # 'left' | 'center' | 'right'

    # allow only $len chars, but don't cut a word if it would fit in $add_len
    # if it doesn't fit, cut it if it's still longer than the dots we would add
    # remove chopped character entities entirely

    # when chopping in the middle, distribute $len into left and right part
    # return early if chopping wouldn't make string shorter
    if ($where eq 'center') {
        return $str if ($len + 5 >= length($str)); # filler is length 5
        $len = int($len/2);
    } else {
        return $str if ($len + 4 >= length($str)); # filler is length 4
    }

    # regexps: ending and beginning with word part up to $add_len
# deleted 2 lines, added 2 lines.
#   my $endre = qr/.{$len}\w{0,$add_len}/;
#   my $begre = qr/\w{0,$add_len}.{$len}/;
    # qr// keeps \w as a regex escape; inside a double-quoted string
    # it would turn into a literal "w"
    my $endre = qr/.{0,$len}\w{0,$add_len}/;
    my $begre = qr/\w{0,$add_len}.{0,$len}/;

# added 2 lines.
    Encode::from_to($str, 'utf-8', 'euc-jp');

    if ($where eq 'left') {
        $str =~ m/^(.*?)($begre)$/;
        my ($lead, $body) = ($1, $2);
        if (length($lead) > 4) {
            $body =~ s/^[^;]*;// if ($lead =~ m/&[^;]*$/);
            $lead = " ...";
        }
# deleted 1 line, added 1 line.
#       return "$lead$body";
        $str = "$lead$body";
    } elsif ($where eq 'center') {
        $str =~ m/^($endre)(.*)$/;
        # named $rest so that the outer $str is not shadowed; the result
        # assigned below has to reach the conversion at the end
        my ($left, $rest) = ($1, $2);
        $rest =~ m/^(.*?)($begre)$/;
        my ($mid, $right) = ($1, $2);
        if (length($mid) > 5) {
            $left =~ s/&[^;]*$//;
            $right =~ s/^[^;]*;// if ($mid =~ m/&[^;]*$/);
            $mid = " ... ";
        }
# deleted 1 line, added 1 line.
#       return "$left$mid$right";
        $str = "$left$mid$right";
    } else {
        $str =~ m/^($endre)(.*)$/;
        my $body = $1;
        my $tail = $2;
        if (length($tail) > 4) {
            $body =~ s/&[^;]*$//;
            $tail = "... ";
        }
# deleted 1 line, added 1 line.
#       return "$body$tail";
        $str = "$body$tail";
    }

# added 2 lines.
    Encode::from_to($str, 'euc-jp', 'utf-8');
    return $str;
}
The modified spots are marked with comments at indentation level 0.
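For comparison, a different way to avoid cutting inside a character is to decode the bytes into Perl characters, cut by character count, and encode again. The sketch below is not the change made above and not gitweb's code. It only covers the 'right' case, and "chop_utf8_right" is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of gitweb): keep at most $len characters of a
# UTF-8 byte string and append "... " when something was removed.
sub chop_utf8_right {
    my ($bytes, $len) = @_;
    my $chars = decode('UTF-8', $bytes);            # bytes -> characters
    return $bytes if $len + 4 >= length($chars);    # same early return as chop_str
    my $body = substr($chars, 0, $len);             # counts characters, not bytes
    return encode('UTF-8', "$body... ");            # characters -> bytes
}

# Ten hiragana characters cut to three, followed by the filler.
my $long  = "\xE3\x81\x82" x 10;
my $short = chop_utf8_right($long, 3);
print length($short), " bytes after chopping\n";
```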
That's all!