Attachment #13461 for bug #46493





Section: utils
Priority: optional
Maintainer: Radovan Garabík <garabik@kassiopeia.juls.savba.sk>
Build-Depends: debhelper (>= 4), dh-python, python3
Standards-Version: 4.3.0

Package: unicode
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Suggests: bzip2
Recommends: unicode-data
Description: display unicode character properties
 unicode is a simple command line utility that displays

Lines 7-11 The sources and package can be downloade

(-)unicode.orig/./debian/copyright (-1 / +1 lines)
7	http://kassiopeia.juls.savba.sk/~garabik/software/unicode/	7	http://kassiopeia.juls.savba.sk/~garabik/software/unicode/
8		8
9		9
10	Copyright: © 2003-2016 Radovan Garabík <garabik @ kassiopeia.juls.savba.sk>	10	Copyright: © 2003-2022 Radovan Garabík <garabik @ kassiopeia.juls.savba.sk>
11	released under GPL v3, see /usr/share/common-licenses/GPL	11	released under GPL v3, see /usr/share/common-licenses/GPL




unicode (2.9-1) unstable; urgency=low

  * better protection against changed/corrupted data files (closes: #932846)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Fri, 03 Jun 2022 16:09:26 +0200

unicode (2.8-1) unstable; urgency=low

  * display ASCII table (either traditional or the EU–UK Trade and Cooperation
    Agreement version)
  * tidy up manpage (closes: #972047) (closes:#972063)
  * fix decoding paracode arguments (closes: #939196)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Wed, 30 Dec 2020 17:13:32 +0100

unicode (2.7-1) unstable; urgency=low

  * add East Asian width




paracode \- command line Unicode conversion tool
.SH SYNOPSIS
.B paracode
.RB [ \-t
.IR tables ]
string
.SH DESCRIPTION
This manual page documents the
.B paracode
command.
.PP
\fBparacode\fP exploits the full power of the Unicode standard to convert
text into a visually similar stream of glyphs while using completely
different codepoints.
It is an excellent didactic tool demonstrating the principles and advanced
use of the Unicode standard.
.PP
\fBparacode\fP is a command line tool working as
a filter, reading standard input in UTF-8 encoding and writing to
standard output.
.
.SH OPTIONS
.TP
.BI \-t tables
.BI \-\-tables tables

Use the given list of conversion tables, separated by plus signs.

The special name 'all' selects all the tables.

Note that the 'other', 'cyrillic_plus' and 'cherokee' tables (and 'all')
use rather esoteric characters that not all fonts contain.


The special table 'mirror' uses a quite different character substitution,
is not selected automatically by 'all', and does not work well
with anything except plain ASCII alphabetical characters.

Example:

paracode \-t cyrillic+greek+cherokee

paracode \-t cherokee  <input >output

paracode \-r \-t mirror  <input >output




cherokee

all
.
.TP
.B \-r

Display text in reverse order after conversion,
best used together with \-t mirror.
.
.SH SEE ALSO
.BR iconv (1)
.

.SH AUTHOR
Radovan Garab\('ik <garabik @ kassiopeia.juls.savba.sk>
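
The conversion described in the paracode manual page above can be illustrated with a minimal sketch. It assumes a small hand-picked Latin-to-Cyrillic mapping; the table contents and the function name paraconvert are hypothetical and are not taken from paracode's source, whose real tables are larger and may differ.

```python
# Hypothetical homoglyph table: Latin letters mapped to look-alike
# Cyrillic letters. This is an illustration, not paracode's own table.
CYRILLIC_LOOKALIKES = {
    'a': '\u0430',  # CYRILLIC SMALL LETTER A
    'c': '\u0441',  # CYRILLIC SMALL LETTER ES
    'e': '\u0435',  # CYRILLIC SMALL LETTER IE
    'o': '\u043e',  # CYRILLIC SMALL LETTER O
    'p': '\u0440',  # CYRILLIC SMALL LETTER ER
}


def paraconvert(text, table=CYRILLIC_LOOKALIKES, reverse=False):
    """Replace every character that has a look-alike; keep the rest."""
    converted = ''.join(table.get(ch, ch) for ch in text)
    # reverse=True imitates what the -r option is described as doing:
    # the text is reversed after the conversion.
    return converted[::-1] if reverse else converted
```

The result looks like the input on screen but compares unequal to it, because every substituted letter has a different codepoint.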





unicode \- command line unicode database query tool
.SH SYNOPSIS
.B unicode
.RI [ options ]
string
.SH DESCRIPTION
This manual page documents the


.SH OPTIONS
.TP
.B \-h
.B \-\-help

Show help and exit.

.TP
.B \-x
.B \-\-hexadecimal

Assume
.I string
to be a hexadecimal number

.TP
.B \-d
.B \-\-decimal

Assume
.I string
to be a decimal number

.TP
.B \-o
.B \-\-octal

Assume
.I string
to be an octal number

.TP
.B \-b
.B \-\-binary

Assume
.I string
to be a binary number

.TP
.B \-r
.B \-\-regexp

Assume
.I string
to be a regular expression

.TP
.B \-s
.B \-\-string

Assume
.I string
to be a sequence of characters

.TP
.B \-a
.B \-\-auto

Try to guess type of
.I string
from one of the above (default)

.TP
.BI \-m MAXCOUNT
.BI \-\-max= MAXCOUNT

Maximal number of codepoints to display, default: 20; use 0 for unlimited

.TP
.BI \-i CHARSET
.BI \-\-io= IOCHARSET

I/O character set. For maximal pleasure, run \fBunicode\fP on a UTF-8
capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP

locale, you should not need to specify it.

.TP
.BI \-\-fcp= CHARSET
.BI \-\-fromcp= CHARSET

Convert numerical arguments from this encoding, default: no conversion.
Multibyte encodings are supported. This is ignored for non-numerical



.TP
.BI \-c ADDCHARSET
.BI \-\-charset\-add= ADDCHARSET

Show hexadecimal representation of displayed characters in this additional charset.

.TP
.BI \-C USE_COLOUR
.BI \-\-colour= USE_COLOUR

USE_COLOUR is one of
.B on
.B off
.B auto

.B \-\-colour=on
will use ANSI colour codes to colourise the output

.B \-\-colour=off
won't use colours.

.B \-\-colour=auto
will test if standard output is a tty, and use colours only when it is.

.B \-\-color
is a synonym of
.B \-\-colour

.TP
.B \-v
.B \-\-verbose

Be more verbose about displayed characters, e.g. display Unihan information, if available.

.TP
.B \-w
.B \-\-wikipedia

Spawn browser pointing to English Wikipedia entry about the character.

.TP
.B \-\-wt
.B \-\-wiktionary

Spawn browser pointing to English Wiktionary entry about the character.

.TP
.B \-\-brief

Display character information in brief format

.TP
.BI \-\-format= fmt

Use your own format for character information display. See the README for details.


.TP
.B \-\-list

List (approximately) all known encodings.

.TP
.B \-\-download

Try to download UnicodeData.txt into ~/.unicode/

.TP
.B \-\-ascii

Display ASCII table

.TP
.B \-\-brexit\-ascii
.B \-\-brexit

Display ASCII table (EU–UK Trade and Cooperation Agreement 2020 version)


.SH USAGE

\fBunicode\fP tries to guess the type of an argument. In particular,
if the argument looks like a valid hexadecimal representation of a
Unicode codepoint, it will be considered to be such. Using


and it will not search for 'face' in character descriptions \- for the latter,
use:

\fBunicode\fP \-r face
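
The guessing rule described above can be approximated as follows. This is a sketch under assumptions: the regular expression, the accepted prefixes and both helper names are hypothetical, and the heuristics of the real tool may differ.

```python
import re
import unicodedata

# Assumed pattern: optional "U+" or "0x" prefix, then 1 to 6 hex digits.
HEX_CODEPOINT = re.compile(r'^(?:U\+|0x)?([0-9A-Fa-f]{1,6})$')


def looks_like_codepoint(arg):
    """Return True if arg can be read as a hexadecimal Unicode codepoint."""
    m = HEX_CODEPOINT.match(arg)
    return bool(m) and int(m.group(1), 16) <= 0x10FFFF


def search_names(pattern, limit=0x3000):
    """Regex search over character names, roughly what -r asks for.

    Uses Python's bundled unicodedata module instead of UnicodeData.txt
    and only scans codepoints below limit to keep the example fast.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), '')
        if name and rx.search(name):
            hits.append((cp, name))
    return hits
```

With this rule the argument face is taken as U+FACE, which is why a search for the word needs the explicit -r option.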


For example, you can use any of the following to display information


You can specify a range of characters as arguments; \fBunicode\fP will
show these characters in a nice tabular format, aligned to 256-codepoint boundaries.
Use two dots ".." to indicate the range, e.g.

\fBunicode\fP 0450..0520

will display the whole Cyrillic and Hebrew blocks (characters from U+0400 to U+05FF)

\fBunicode\fP 0400..

will display just characters from U+0400 up to U+04FF
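
The alignment in the two range examples above can be sketched like this. The function name aligned_range is hypothetical, and treating an open-ended range as a single 256-codepoint block is an assumption drawn from the second example, not from the tool's source.

```python
def aligned_range(spec):
    """Parse 'LOW..HIGH' (hexadecimal) and widen it to 256-codepoint blocks."""
    low_text, high_text = spec.split('..', 1)
    low = int(low_text, 16)
    # Assumption: an open-ended range such as '0400..' ends in the same
    # 256-codepoint block in which it starts.
    high = int(high_text, 16) if high_text else low
    # Clear the low 8 bits of the start, set the low 8 bits of the end.
    return low & ~0xFF, high | 0xFF
```

For 0450..0520 this gives U+0400 to U+05FF, and for 0400.. it gives U+0400 to U+04FF, matching the ranges quoted in the text.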

Use \-\-fromcp to query codepoints from other encodings:

\fBunicode\fP \-\-fromcp cp1250 \-d 200

Multibyte encodings are supported:
\fBunicode\fP \-\-fromcp big5 \-x aff3

and multi-char strings are supported, too:

\fBunicode\fP \-\-fromcp utf-8 \-x c599c3adc5a5
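
What the \-\-fromcp examples above amount to can be shown with a short Python 3 sketch: the number is turned into bytes, and the bytes are decoded from the given encoding. The helper names are hypothetical and the real tool may handle edge cases differently.

```python
def decode_hex(hex_digits, charset):
    """Read hex digits as a byte string and decode it from charset."""
    return bytes.fromhex(hex_digits).decode(charset)


def decode_decimal(value, charset):
    """Read a decimal number as big-endian bytes and decode it from charset."""
    number = int(value)
    length = max(1, (number.bit_length() + 7) // 8)
    return number.to_bytes(length, 'big').decode(charset)
```

Here decode_decimal('200', 'cp1250') yields U+010C and decode_hex('c599c3adc5a5', 'utf-8') yields the three characters U+0159, U+00ED and U+0165.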

.SH BUGS
Tabular format does not deal well with full-width, combining, control




#!/usr/bin/python3

from __future__ import unicode_literals, print_function

import os, glob, sys, unicodedata, locale, gzip, re, traceback, encodings, io, codecs, shutil
import webbrowser, textwrap, struct

#from pprint import pprint

# bz2 was introduced in 2.3, but we want this to work even if for some

    import subprocess as cmd
    from urllib.parse import quote as urlquote
    import io
    from urllib.request import urlopen

    def out(*args):
        "print args, converting them to output charset"

    import commands as cmd

    from urllib import quote as urlquote
    from urllib import urlopen

    def out(*args):
        "print args, converting them to output charset"


from optparse import OptionParser

VERSION='2.9'


# list of terminals that support bidi

    for line in f:
        if line.startswith('#') or ';' not in line or '..' not in line:
            continue
        spl = line.split(';', 1)
        ran, desc = spl
        desc = desc.strip()
        low, high = ran.split('..', 1)
        low = int(low, 16)
        high = int(high, 16)
        unicodeblocks[ (low,high) ] = desc
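
The effect of limiting both splits to one cut can be seen in a self-contained sketch of the same loop. The function name parse_blocks and the sample lines are made up for illustration; only the parsing logic follows the fragment above.

```python
def parse_blocks(lines):
    """Map (low, high) codepoint ranges to block descriptions."""
    blocks = {}
    for line in lines:
        if line.startswith('#') or ';' not in line or '..' not in line:
            continue
        # maxsplit=1 keeps any further ';' inside the description
        # instead of raising ValueError on unpacking.
        ran, desc = line.split(';', 1)
        low, high = ran.split('..', 1)
        blocks[(int(low, 16), int(high, 16))] = desc.strip()
    return blocks


SAMPLE = [
    '# comment line',
    '',
    '0000..007F; Basic Latin',
    '0400..04FF; Cyrillic; unexpected extra field',
]
```

With an unlimited split the last sample line would produce three fields and the two-name unpacking would fail; with the limit it is kept as one description.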

        proplist = ['codepoint', 'name', 'category', 'combining', 'bidi', 'decomposition', 'dummy', 'digit_value', 'numeric_value', 'mirrored', 'unicode1name', 'iso_comment', 'uppercase', 'lowercase', 'titlecase']
        for i, prop in enumerate(proplist):
            if prop!='dummy':
                if i<len(fields):
                    properties[prop] = fields[i]
        if properties['lowercase']:
            properties['lowercase'] = chr(int(properties['lowercase'], 16))
        if properties['uppercase']:

            line = l.strip()
            if not line:
                continue
            spl = line.strip().split('\t')
            if len(spl) != 3:
                continue
            char, key, value = spl
            if int(char[2:], 16) == ch:
                properties[key] = value
            elif int(char[2:], 16)>ch:
                break
    return properties
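
The tolerant parsing in the fragment above can be tried out in isolation. This sketch assumes tab-separated lines of the form U+XXXX, key, value sorted by codepoint; the function name and the sample data are hypothetical.

```python
def unihan_properties(lines, ch):
    """Collect key/value pairs for codepoint ch, skipping malformed lines."""
    properties = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        spl = line.split('\t')
        if len(spl) != 3:
            # changed or corrupted line: ignore it instead of crashing
            continue
        char, key, value = spl
        cp = int(char[2:], 16)  # drop the leading "U+"
        if cp == ch:
            properties[key] = value
        elif cp > ch:
            break  # input is sorted, nothing more to find
    return properties


SAMPLE = [
    'U+4E00\tkDefinition\tone',
    'this line has no tabs and is skipped',
    'U+4E00\tkMandarin\tyi',
    'U+4E01\tkDefinition\tmale adult',
]
```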

        fo = codecs.getreader('utf-8')(fo)
        return fo

def get_unicode_cur_version():
    # return current version of the Unicode standard, hardwired for now
    return '14.0.0'

def get_unicodedata_url():
    unicode_version = get_unicode_cur_version()
    url = 'http://www.unicode.org/Public/{}/ucd/UnicodeData.txt'.format(unicode_version)
    return url

def download_unicodedata():
    url = get_unicodedata_url()
    out('Downloading UnicodeData.txt from ', url, '\n')
    HomeDir = os.path.expanduser('~/.unicode')
    HomeUnicodeData = os.path.join(HomeDir, "UnicodeData.txt.gz")

    # we want to minimize the chance of leaving a corrupted file around
    tmp_file = HomeUnicodeData+'.tmp'
    try:
        if not os.path.exists(HomeDir):
            os.makedirs(HomeDir)
        response = urlopen(url)
        r = response.getcode()
        if r != 200:
            # this is handled automatically in python3, the exception will be raised by urlopen
            raise IOError('HTTP response code '+str(r))
        if os.path.exists(HomeUnicodeData):
            out(HomeUnicodeData, ' already exists, but downloading as requested\n')
        out('downloading...')
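        # close the gzip stream explicitly so all data is flushed before the move
        gz = gzip.open(tmp_file, 'wb')
        try:
            shutil.copyfileobj(response, gz)
        finally:
            gz.close()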
        shutil.move(tmp_file, HomeUnicodeData)
        out(HomeUnicodeData, ' downloaded\n')
    finally:
        if os.path.exists(tmp_file):
            os.remove(tmp_file)
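
The write-to-temporary-file-then-move pattern used by download_unicodedata() can be exercised without network access. The sketch below is a simplified stand-in, not the function from the patch: the name save_gzipped is hypothetical and an in-memory stream replaces the HTTP response.

```python
import gzip
import io
import os
import shutil
import tempfile


def save_gzipped(source, target):
    """Compress the stream source into target without leaving partial files."""
    tmp_file = target + '.tmp'
    try:
        with gzip.open(tmp_file, 'wb') as gz:
            shutil.copyfileobj(source, gz)
        # The move happens only after the gzip stream is closed and flushed.
        shutil.move(tmp_file, target)
    finally:
        # If anything failed before the move, remove the leftover temp file.
        if os.path.exists(tmp_file):
            os.remove(tmp_file)


def demo():
    """Write sample data into a scratch directory and read it back."""
    directory = tempfile.mkdtemp()
    target = os.path.join(directory, 'UnicodeData.txt.gz')
    save_gzipped(io.BytesIO(b'0041;LATIN CAPITAL LETTER A\n'), target)
    with gzip.open(target, 'rb') as f:
        return f.read(), os.path.exists(target + '.tmp')
```

A reader of the target path therefore sees either the old complete file or the new complete file, never a half-written one.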

def GrepInNames(pattern, prefill_cache=False):
    f = None
    for name in UnicodeDataFileNames:

Cannot find UnicodeData.txt, please place it into
/usr/share/unidata/UnicodeData.txt,
/usr/share/unicode/UnicodeData.txt, ~/.unicode/ or current
working directory (optionally you can gzip, bzip2 or xz it).
Without the file, searching will be much slower.

You can download the file from {} (or replace {} with current Unicode version); or run {} --download

""".format(get_unicodedata_url(), get_unicode_cur_version(), sys.argv[0]))

    if prefill_cache:
        if f:

        if maxcount:
            counter += 1
        if counter > options.maxcount:
            sys.stdout.flush()
            sys.stderr.write("\nToo many characters to display, more than %s, use --max 0 (or other value) option to change it\n" % options.maxcount)
            return
        properties = get_unicode_properties(c)
        ordc = ord(c)

def unescape(s):
    return s.replace(r'\n', '\n')

ascii_cc_names = ('NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL', 'BS', 'HT', 'LF', 'VT', 'FF', 'CR', 'SO', 'SI', 'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB', 'CAN', 'EM', 'SUB', 'ESC', 'FS', 'GS', 'RS', 'US')

def display_ascii_table():
    print('Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex')
    for row in range(0, 16):
        for col in range(0, 8):
            cp = 16*col+row
            ch = chr(cp) if 32<=cp else ascii_cc_names[cp]
            ch = 'DEL' if cp==127 else ch
            frm = '{:3d} {:02X} {:2s}'
            if cp < 32:
                frm = '{:3d} {:02X} {:4s}'
            elif cp >= 96:
                frm = '{:4d} {:02X} {:2s}'
            cell = frm.format(cp, cp, ch)
            print(cell, end='')
        print()
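
The layout logic of display_ascii_table() can be separated from the printing to make it easier to check. This is a sketch with hypothetical helper names; it reuses the control-character names from the fragment above and the same column-major formula cp = 16*col + row.

```python
ASCII_CC_NAMES = (
    'NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL',
    'BS', 'HT', 'LF', 'VT', 'FF', 'CR', 'SO', 'SI',
    'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB',
    'CAN', 'EM', 'SUB', 'ESC', 'FS', 'GS', 'RS', 'US',
)


def ascii_label(cp):
    """Printable label for an ASCII codepoint (0..127)."""
    if cp == 127:
        return 'DEL'
    return chr(cp) if cp >= 32 else ASCII_CC_NAMES[cp]


def ascii_rows():
    """Codepoints arranged as 16 rows of 8 columns, read down the columns."""
    return [[16 * col + row for col in range(8)] for row in range(16)]
```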

brexit_ascii_diffs = {
 30: ' ',
 31: ' ',
 34: "'",
123: '{}{',
125: '}}',
127: ' ',
128: ' ',
129: ' ',
        }

def display_brexit_ascii_table():
    print(' + | 0    1    2    3    4    5    6    7    8    9')
    print('---+-----------------------------------------------')
    for row in range(30, 130, 10):
        print('{:3d}'.format(row), end='|')
        for col in range(0, 10):
            cp = col+row
            ch = brexit_ascii_diffs.get(cp, chr(cp))
            cell = ' {:3s} '.format(ch)
            print(cell, end='')
        print()



format_string_default = '''{yellow}{bold}U+{ordc:04X} {name}{default}
{green}UTF-8:{default} {utf8} {green}UTF-16BE:{default} {utf16be} {green}Decimal:{default} {decimal} {green}Octal:{default} {octal}{opt_additional}
{pchar}{opt_flipcase}{opt_uppercase}{opt_lowercase}

          action="store", dest="format_string", type="string",
          default=format_string_default,
          help="formatting string")
    parser.add_option("--brief", "--terse", "--br",
          action="store_const", dest="format_string",
          const='{pchar} U+{ordc:04X} {name}\n',
          help="Brief format")
    parser.add_option("--download",
          action="store_const", dest="download_unicodedata",
          const=True,
          help="Try to download UnicodeData.txt")
    parser.add_option("--ascii",
          action="store_const", dest="ascii_table",
          const=True,
          help="Display ASCII table")
    parser.add_option("--brexit-ascii", "--brexit",
          action="store_const", dest="brexit_ascii_table",
          const=True,
          help="Display ASCII table (EU-UK Trade and Cooperation Agreement version)")

    global options
    (options, arguments) = parser.parse_args()

        print (textwrap.fill(' '.join(all_encodings)))
        sys.exit()

    if options.ascii_table:
        display_ascii_table()
        sys.exit()

    if options.brexit_ascii_table:
        display_brexit_ascii_table()
        sys.exit()

    if options.download_unicodedata:
        download_unicodedata()
        sys.exit()

    if len(arguments)==0:
        parser.print_help()
        sys.exit()
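
How the store_const flags feed the dispatch above can be checked with a reduced parser. The sketch keeps only the two table options from the patch; the function names and the returned strings are hypothetical and stand in for the real calls followed by sys.exit().

```python
from optparse import OptionParser


def build_parser():
    """Parser with just the two ASCII table flags."""
    parser = OptionParser()
    parser.add_option("--ascii",
                      action="store_const", dest="ascii_table",
                      const=True,
                      help="Display ASCII table")
    parser.add_option("--brexit-ascii", "--brexit",
                      action="store_const", dest="brexit_ascii_table",
                      const=True,
                      help="Display ASCII table (EU-UK Trade and Cooperation Agreement version)")
    return parser


def choose_action(argv):
    """Return the name of the action the option handling would take."""
    options, arguments = build_parser().parse_args(argv)
    # Unset store_const options default to None, which is falsy.
    if options.ascii_table:
        return 'ascii'
    if options.brexit_ascii_table:
        return 'brexit-ascii'
    if len(arguments) == 0:
        return 'help'
    return 'query'
```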

Lines 8-14 os.chdir(os.path.abspath(os.path.dirname

(-)unicode.orig/./setup.py (-1 / +1 lines)
8		8
9		9
10	setup(name='unicode',	10	setup(name='unicode',
11	version='2.7',	11	version='2.8',
12	scripts=['unicode', 'paracode'],	12	scripts=['unicode', 'paracode'],
13	# entry_points={'console_scripts': [	13	# entry_points={'console_scripts': [
14	# 'unicode = unicode:main',	14	# 'unicode = unicode:main',

Lines 201-207 def main():

(-)unicode.orig/./paracode (-1 / +1 lines)
201	(options, args) = parser.parse_args()	201	(options, args) = parser.parse_args()
202		202
203	if args:	203	if args:
204	to_convert = ' '.join(args).decode('utf-8')	204	to_convert = decode(' '.join(args), 'utf-8')
205	else:	205	else:
206	to_convert = None	206	to_convert = None
207		207
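
The hunk above replaces a str.decode() call, which does not exist on Python 3 strings, with a call to a decode() helper whose definition is not shown here. A minimal sketch of what such a helper might look like, as an assumption and not paracode's actual code:

```python
def decode(value, encoding):
    """Return text: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        # Python 2 command line arguments are byte strings; so is any
        # bytes object on Python 3.
        return value.decode(encoding)
    # Python 3 command line arguments are already text.
    return value
```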

Lines 4-10 To use unicode utility, you need:

(-)unicode.orig/./README (-1 / +1 lines)
4	- python >=2.6 (str format() method is needed), preferrably wide	4	- python >=2.6 (str format() method is needed), preferrably wide
5	unicode build, however, python3 is recommended	5	unicode build, however, python3 is recommended
6	- python optparse library (part of since python2.3)	6	- python optparse library (part of since python2.3)
7	- UnicodeData.txt file (http://www.unicode.org/Public) which	7	- UnicodeData.txt file (http://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt; or replace 13.0.0 with current Unicode version) which
8	you should put into /usr/share/unicode/, ~/.unicode/ or current	8	you should put into /usr/share/unicode/, ~/.unicode/ or current
9	working directory.	9	working directory.
10	- apt-get install unicode-data # Debian	10	- apt-get install unicode-data # Debian


Lines 2-13 Source: unicode

(-)unicode.orig/./debian/control (-1 / +2 lines)
2	Section: utils	2	Section: utils
3	Priority: optional	3	Priority: optional
4	Maintainer: Radovan Garabík <garabik@kassiopeia.juls.savba.sk>	4	Maintainer: Radovan Garabík <garabik@kassiopeia.juls.savba.sk>
5	Build-Depends: debhelper (>= 4), dh-python	5	Build-Depends: debhelper (>= 4), dh-python, python3
6	Standards-Version: 4.3.0	6	Standards-Version: 4.3.0
7		7
8	Package: unicode	8	Package: unicode
9	Architecture: all	9	Architecture: all
10	Depends: ${misc:Depends}, ${python3:Depends}	10	Depends: ${misc:Depends}, ${python3:Depends}
		11	Suggests: bzip2
11	Recommends: unicode-data	12	Recommends: unicode-data
12	Description: display unicode character properties	13	Description: display unicode character properties
13	unicode is a simple command line utility that displays	14	unicode is a simple command line utility that displays

Lines 1-3

(-)unicode.orig/./debian/changelog (+15 lines)
		1	unicode (2.9-1) unstable; urgency=low
		2
		3	* better protection against changed/corrupted data files (closes: #932846)
		4
		5	-- Radovan Garabík <garabik@kassiopeia.juls.savba.sk> Fri, 03 Jun 2022 16:09:26 +0200
		6
		7	unicode (2.8-1) unstable; urgency=low
		8
		9	* display ASCII table (either traditional or the EU–UK Trade and Cooperation
		10	Agreement version)
		11	* tidy up manpage (closes: #972047) (closes:#972063)
		12	* fix decoding paracode arguments (closes: #939196)
		13
		14	-- Radovan Garabík <garabik@kassiopeia.juls.savba.sk> Wed, 30 Dec 2020 17:13:32 +0100
		15
1	unicode (2.7-1) unstable; urgency=low	16	unicode (2.7-1) unstable; urgency=low
2		17
3	* add East Asian width	18	* add East Asian width

Lines 4-49

(-)unicode.orig/./paracode.1 (-19 / +20 lines)
4	paracode \- command line Unicode conversion tool	4	paracode \- command line Unicode conversion tool
5	.SH SYNOPSIS	5	.SH SYNOPSIS
6	.B paracode	6	.B paracode
7	.RI [ -t tables ]	7	.RB [ \-t
		8	.IR tables ]
8	string	9	string
9	.SH DESCRIPTION	10	.SH DESCRIPTION
10	This manual page documents the	11	This manual page documents the
11	.B paracode	12	.B paracode
12	command.	13	command.
13	.PP	14	.PP
14	\fBparacode\fP exploits the full power of the Unicode standard to convert the text	15	\fBparacode\fP exploits the full power of the Unicode standard to convert
15	into visually similar stream of glyphs, while using completely different codepoints.	16	the text into visually similar stream of glyphs, while using completely
16	It is an excellent didactic tool demonstrating the principles and advanced use of	17	different codepoints.
17	the Unicode standard.	18	It is an excellent didactic tool demonstrating the principles and advanced
		19	use of the Unicode standard.
18	.PP	20	.PP
19	\fBparacode\fP is a command line tool working as	21	\fBparacode\fP is a command line tool working as
20	a filter, reading standard input in UTF-8 encoding and writing to	22	a filter, reading standard input in UTF-8 encoding and writing to
21	standard output.	23	standard output.
22		24	.
23	.SH OPTIONS	25	.SH OPTIONS
24	.TP	26	.TP
25	.BI \-t tables	27	.BI \-t tables
26	.BI \-\-tables	28	.BI \-\-tables tables
27		29
28	Use given list of conversion tables, separated by a plus sign.	30	Use given list of conversion tables, separated by a plus sign.
29		31
30	Special name 'all' selects all the tables.	32	Special name 'all' selects all the tables.
31		33
32	Note that selecting 'other', 'cyrillic_plus' and 'cherokee' tables (and 'all')	34	Note that selecting 'other', 'cyrillic_plus' and 'cherokee' tables (and 'all')
33	makes use of rather esoteric characters, and not all fonts contain them.	35	makes use of rather esoteric characters, and not all fonts contain them.
34		36
35
36	Special table 'mirror' uses quite different character substitution,	37	Special table 'mirror' uses quite different character substitution,
37	is not selected automatically with 'all' and does not work well	38	is not selected automatically with 'all' and does not work well
38	with anything except plain ascii alphabetical characters.	39	with anything except plain ascii alphabetical characters.
39		40
40	Example:	41	Example:
41		42
42	paracode -t cyrillic+greek+cherokee	43	paracode \-t cyrillic+greek+cherokee
43		44
44	paracode -t cherokee <input >output	45	paracode \-t cherokee <input >output
45		46
46	paracode -r -t mirror <input >output	47	paracode \-r \-t mirror <input >output
47		48
48		49
49		50
Lines 60-75 other
60	cherokee	61	cherokee
61		62
62	all	63	all
63		64	.
64	.TP	65	.TP
65	.BI \-r	66	.B \-r
66
67	Display text in reverse order after conversion, best used together with -t mirror.
68		67
		68	Display text in reverse order after conversion,
		69	best used together with \-t mirror.
		70	.
69	.SH SEE ALSO	71	.SH SEE ALSO
70	iconv(1)	72	.BR iconv (1)
71		73	.
72
73	.SH AUTHOR	74	.SH AUTHOR
74	Radovan Garab\('ik <garabik @ kassiopeia.juls.savba.sk>	75	Radovan Garab\('ik <garabik @ kassiopeia.juls.savba.sk>
75		76

Lines 4-10

(-)unicode.orig/./unicode.1 (-61 / +77 lines)
4	unicode \- command line unicode database query tool	4	unicode \- command line unicode database query tool
5	.SH SYNOPSIS	5	.SH SYNOPSIS
6	.B unicode	6	.B unicode
7	.RI [ options ]	7	.RI [ options ]
8	string	8	string
9	.SH DESCRIPTION	9	.SH DESCRIPTION
10	This manual page documents the	10	This manual page documents the
Lines 15-90 command.
15		15
16	.SH OPTIONS	16	.SH OPTIONS
17	.TP	17	.TP
18	.BI \-h	18	.B \-h
19	.BI \-\-help	19	.B \-\-help
20		20
21	Show help and exit.	21	Show help and exit.
22		22
23	.TP	23	.TP
24	.BI \-x	24	.B \-x
25	.BI \-\-hexadecimal	25	.B \-\-hexadecimal
26		26
27	Assume	27	Assume
28	.I string	28	.I string
29	to be a hexadecimal number	29	to be a hexadecimal number
30		30
31	.TP	31	.TP
32	.BI \-d	32	.B \-d
33	.BI \-\-decimal	33	.B \-\-decimal
34		34
35	Assume	35	Assume
36	.I string	36	.I string
37	to be a decimal number	37	to be a decimal number
38		38
39	.TP	39	.TP
40	.BI \-o	40	.B \-o
41	.BI \-\-octal	41	.B \-\-octal
42		42
43	Assume	43	Assume
44	.I string	44	.I string
45	to be an octal number	45	to be an octal number
46		46
47	.TP	47	.TP
48	.BI \-b	48	.B \-b
49	.BI \-\-binary	49	.B \-\-binary
50		50
51	Assume	51	Assume
52	.I string	52	.I string
53	to be a binary number	53	to be a binary number
54		54
55	.TP	55	.TP
56	.BI \-r	56	.B \-r
57	.BI \-\-regexp	57	.B \-\-regexp
58		58
59	Assume	59	Assume
60	.I string	60	.I string
61	to be a regular expression	61	to be a regular expression
62		62
63	.TP	63	.TP
64	.BI \-s	64	.B \-s
65	.BI \-\-string	65	.B \-\-string
66		66
67	Assume	67	Assume
68	.I string	68	.I string
69	to be a sequence of characters	69	to be a sequence of characters
70		70
71	.TP	71	.TP
72	.BI \-a	72	.B \-a
73	.BI \-\-auto	73	.B \-\-auto
74		74
75	Try to guess type of	75	Try to guess type of
76	.I string	76	.I string
77	from one of the above (default)	77	from one of the above (default)
78		78
79	.TP	79	.TP
80	.BI \-mMAXCOUNT	80	.BI \-m MAXCOUNT
81	.BI \-\-max=MAXCOUNT	81	.BI \-\-max= MAXCOUNT
82		82
83	Maximal number of codepoints to display, default: 20; use 0 for unlimited	83	Maximal number of codepoints to display, default: 20; use 0 for unlimited
84		84
85	.TP	85	.TP
86	.BI \-iCHARSET	86	.BI \-i CHARSET
87	.BI \-\-io=IOCHARSET	87	.BI \-\-io= IOCHARSET
88		88
89	I/O character set. For maximal pleasure, run \fBunicode\fP on UTF-8	89	I/O character set. For maximal pleasure, run \fBunicode\fP on UTF-8
90	capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP	90	capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP
Lines 92-99 tries to guess this value from your loca
92	locale, you should not need to specify it.	92	locale, you should not need to specify it.
93		93
94	.TP	94	.TP
95	.BI \-\-fcp=CHARSET	95	.BI \-\-fcp= CHARSET
96	.BI \-\-fromcp=CHARSET	96	.BI \-\-fromcp= CHARSET
97		97
98	Convert numerical arguments from this encoding, default: no conversion.	98	Convert numerical arguments from this encoding, default: no conversion.
99	Multibyte encodings are supported. This is ignored for non-numerical	99	Multibyte encodings are supported. This is ignored for non-numerical
Lines 101-119 arguments.
101		101
102		102
103	.TP	103	.TP
104	.BI \-cADDCHARSET	104	.BI \-c ADDCHARSET
105	.BI \-\-charset\-add=ADDCHARSET	105	.BI \-\-charset\-add= ADDCHARSET
106		106
107	Show hexadecimal reprezentation of displayed characters in this additional charset.	107	Show hexadecimal reprezentation of displayed characters in this additional charset.
108		108
109	.TP	109	.TP
110	.BI \-CUSE_COLOUR	110	.BI \-C USE_COLOUR
111	.BI \-\-colour=USE_COLOUR	111	.BI \-\-colour= USE_COLOUR
112		112
113	USE_COLOUR is one of	113	USE_COLOUR is one of
114	.I on	114	.B on
115	.I off	115	.B off
116	.I auto	116	.B auto
117		117
118	.B \-\-colour=on	118	.B \-\-colour=on
119	will use ANSI colour codes to colourise the output	119	will use ANSI colour codes to colourise the output
Lines 121-170 will use ANSI colour codes to colourise
121	.B \-\-colour=off	121	.B \-\-colour=off
122	won't use colours.	122	won't use colours.
123		123
124	.B \-\-colour=auto	124	.B \-\-colour=auto
125	will test if standard output is a tty, and use colours only when it is.	125	will test if standard output is a tty, and use colours only when it is.
126		126
127	.BI \-\-color	127	.B \-\-color
128	is a synonym of	128	is a synonym of
129	.BI \-\-colour	129	.B \-\-colour
130		130
131	.TP	131	.TP
132	.BI \-v	132	.B \-v
133	.BI \-\-verbose	133	.B \-\-verbose
134		134
135	Be more verbose about displayed characters, e.g. display Unihan information, if available.	135	Be more verbose about displayed characters, e.g. display Unihan information, if available.
136		136
137	.TP	137	.TP
138	.BI \-w	138	.B \-w
139	.BI \-\-wikipedia	139	.B \-\-wikipedia
140		140
141	Spawn browser pointing to English Wikipedia entry about the character.	141	Spawn browser pointing to English Wikipedia entry about the character.
142		142
143	.TP	143	.TP
144	.BI \-\-wt	144	.B \-\-wt
145	.BI \-\-wiktionary	145	.B \-\-wiktionary
146		146
147	Spawn browser pointing to English Wiktionary entry about the character.	147	Spawn browser pointing to English Wiktionary entry about the character.
148		148
149	.TP	149	.TP
150	.BI \-\-brief	150	.B \-\-brief
151		151
152	Display character information in brief format	152	Display character information in brief format
153		153
154	.TP	154	.TP
155	.BI \-\-format=fmt	155	.BI \-\-format= fmt
156		156
157	Use your own format for character information display. See the README for details.	157	Use your own format for character information display. See the README for details.
158		158
159
160	.TP	159	.TP
161	.BI \-\-list	160	.B \-\-list
162		161
163	List (approximately) all known encodings.	162	List (approximately) all known encodings.
164		163
		164	.TP
		165	.B \-\-download
		166
		167	Try to download UnicodeData.txt into ~/.unicode/
		168
		169	.TP
		170	.B \-\-ascii
		171
		172	Display ASCII table
		173
		174	.TP
		175	.B \-\-brexit\-ascii
		176	.B \-\-brexit
		177
		178	Display ASCII table (EU–UK Trade and Cooperation Agreement 2020 version)
		179
		180
165	.SH USAGE	181	.SH USAGE
166		182
167	\fBunicode\fP tries to guess the type of an argument. In particular,	183	\fBunicode\fP tries to guess the type of an argument. In particular,
168	if the arguments looks like a valid hexadecimal representation of a	184	if the arguments looks like a valid hexadecimal representation of a
169	Unicode codepoint, it will be considered to be such. Using	185	Unicode codepoint, it will be considered to be such. Using
170		186
Lines 174-180 will display information about U+FACE CJ
174	and it will not search for 'face' in character descriptions \- for the latter,	190	and it will not search for 'face' in character descriptions \- for the latter,
175	use:	191	use:
176		192
177	\fBunicode\fP -r face	193	\fBunicode\fP \-r face
178		194
179		195
180	For example, you can use any of the following to display information	196	For example, you can use any of the following to display information
Lines 191-216 about U+00E1 LATIN SMALL LETTER A WITH
191		207
192	You can specify a range of characters as argumets, \fBunicode\fP will	208	You can specify a range of characters as argumets, \fBunicode\fP will
193	show these characters in nice tabular format, aligned to 256-byte boundaries.	209	show these characters in nice tabular format, aligned to 256-byte boundaries.
194	Use two dots ".." to indicate the range, e.g.	210	Use two dots ".." to indicate the range, e.g.
195		211
196	\fBunicode\fP 0450..0520	212	\fBunicode\fP 0450..0520
197		213
198	will display the whole cyrillic and hebrew blocks (characters from U+0400 to U+05FF)	214	will display the whole cyrillic and hebrew blocks (characters from U+0400 to U+05FF)
199		215
200	\fBunicode\fP 0400..	216	\fBunicode\fP 0400..
201		217
202	will display just characters from U+0400 up to U+04FF	218	will display just characters from U+0400 up to U+04FF
203		219
204	Use --fromcp to query codepoints from other encodings:	220	Use \-\-fromcp to query codepoints from other encodings:
205		221
206	\fBunicode\fP --fromcp cp1250 -d 200	222	\fBunicode\fP \-\-fromcp cp1250 \-d 200
207		223
208	Multibyte encodings are supported:	224	Multibyte encodings are supported:
209	\fBunicode\fP --fromcp big5 -x aff3	225	\fBunicode\fP \-\-fromcp big5 \-x aff3
210		226
211	and multi-char strings are supported, too:	227	and multi-char strings are supported, too:
212		228
213	\fBunicode\fP --fromcp utf-8 -x c599c3adc5a5	229	\fBunicode\fP \-\-fromcp utf-8 \-x c599c3adc5a5
214		230
215	.SH BUGS	231	.SH BUGS
216	Tabular format does not deal well with full-width, combining, control	232	Tabular format does not deal well with full-width, combining, control

Lines 1-9

(-)unicode.orig/./unicode (-12 / +125 lines)
1	#!/usr/bin/python3	1	#!/usr/bin/python3
2		2
3	from __future__ import unicode_literals	3	from __future__ import unicode_literals, print_function
4		4
5	import os, glob, sys, unicodedata, locale, gzip, re, traceback, encodings, io, codecs	5	import os, glob, sys, unicodedata, locale, gzip, re, traceback, encodings, io, codecs, shutil
6	import webbrowser, textwrap, struct	6	import webbrowser, textwrap, struct
		7
7	#from pprint import pprint	8	#from pprint import pprint
8		9
9	# bz2 was introduced in 2.3, but we want this to work even if for some	10	# bz2 was introduced in 2.3, but we want this to work even if for some
Lines 31-36 if PY3:
31	import subprocess as cmd	32	import subprocess as cmd
32	from urllib.parse import quote as urlquote	33	from urllib.parse import quote as urlquote
33	import io	34	import io
		35	from urllib.request import urlopen
34		36
35	def out(*args):	37	def out(*args):
36	"pring args, converting them to output charset"	38	"pring args, converting them to output charset"
Lines 50-55 else: # python2
50	import commands as cmd	52	import commands as cmd
51		53
52	from urllib import quote as urlquote	54	from urllib import quote as urlquote
		55	from urllib import urlopen
53		56
54	def out(*args):	57	def out(*args):
55	"pring args, converting them to output charset"	58	"pring args, converting them to output charset"
Lines 66-72 else: # python2
66		69
67	from optparse import OptionParser	70	from optparse import OptionParser
68		71
69	VERSION='2.7'	72	VERSION='2.9'
70		73
71		74
72	# list of terminals that support bidi	75	# list of terminals that support bidi
Lines 230-238 def get_unicode_blocks_descriptions():
230	for line in f:	233	for line in f:
231	if line.startswith('#') or ';' not in line or '..' not in line:	234	if line.startswith('#') or ';' not in line or '..' not in line:
232	continue	235	continue
233	ran, desc = line.split(';')	236	spl = line.split(';', 1)
		237	ran, desc = spl
234	desc = desc.strip()	238	desc = desc.strip()
235	low, high = ran.split('..')	239	low, high = ran.split('..', 1)
236	low = int(low, 16)	240	low = int(low, 16)
237	high = int(high, 16)	241	high = int(high, 16)
238	unicodeblocks[ (low,high) ] = desc	242	unicodeblocks[ (low,high) ] = desc
Lines 256-262 def get_unicode_properties(ch):
256	proplist = ['codepoint', 'name', 'category', 'combining', 'bidi', 'decomposition', 'dummy', 'digit_value', 'numeric_value', 'mirrored', 'unicode1name', 'iso_comment', 'uppercase', 'lowercase', 'titlecase']	260	proplist = ['codepoint', 'name', 'category', 'combining', 'bidi', 'decomposition', 'dummy', 'digit_value', 'numeric_value', 'mirrored', 'unicode1name', 'iso_comment', 'uppercase', 'lowercase', 'titlecase']
257	for i, prop in enumerate(proplist):	261	for i, prop in enumerate(proplist):
258	if prop!='dummy':	262	if prop!='dummy':
259	properties[prop] = fields[i]	263	if i<len(fields):
		264	properties[prop] = fields[i]
260	if properties['lowercase']:	265	if properties['lowercase']:
261	properties['lowercase'] = chr(int(properties['lowercase'], 16))	266	properties['lowercase'] = chr(int(properties['lowercase'], 16))
262	if properties['uppercase']:	267	if properties['uppercase']:
Lines 330-338 def get_unihan_properties_internal(ch):
330	line = l.strip()	335	line = l.strip()
331	if not line:	336	if not line:
332	continue	337	continue
333	char, key, value = line.strip().split('\t')	338	spl = line.strip().split('\t')
		339	if len(spl) != 3:
		340	continue
		341	char, key, value = spl
334	if int(char[2:], 16) == ch:	342	if int(char[2:], 16) == ch:
335	properties[key] = value.decode('utf-8')	343	properties[key] = value
336	elif int(char[2:], 16)>ch:	344	elif int(char[2:], 16)>ch:
337	break	345	break
338	return properties	346	return properties
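Review note: the new `len(spl) != 3` check skips malformed rows instead of failing on unpacking. The sketch below isolates that guard in a small function. The name `parse_unihan_line`, the `U+` prefix check and the `ValueError` handling are assumptions added for illustration; the patch itself only checks the field count.

```python
def parse_unihan_line(line):
    """Return (codepoint, key, value), or None for a malformed line.

    Hypothetical helper, not part of the patch.
    """
    parts = line.strip().split('\t')
    if len(parts) != 3:
        # same guard as in the patch: wrong number of tab-separated fields
        return None
    char, key, value = parts
    if not char.startswith('U+'):
        # assumption: the first field is always written as U+XXXX
        return None
    try:
        return int(char[2:], 16), key, value
    except ValueError:
        return None
```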
Lines 412-417 def OpenGzip(fname):
412	fo = codecs.getreader('utf-8')(fo)	420	fo = codecs.getreader('utf-8')(fo)
413	return fo	421	return fo
414		422
		423	def get_unicode_cur_version():
		424	# return current version of the Unicode standard, hardwired for now
		425	return '14.0.0'
		426
		427	def get_unicodedata_url():
		428	unicode_version = get_unicode_cur_version()
		429	url = 'http://www.unicode.org/Public/{}/ucd/UnicodeData.txt'.format(unicode_version)
		430	return url
		431
		432	def download_unicodedata():
		433	url = get_unicodedata_url()
		434	out('Downloading UnicodeData.txt from ', url, '\n')
		435	HomeDir = os.path.expanduser('~/.unicode')
		436	HomeUnicodeData = os.path.join(HomeDir, "UnicodeData.txt.gz")
		437
		438	# we want to minimize the chance of leaving a corrupted file around
		439	tmp_file = HomeUnicodeData+'.tmp'
		440	try:
		441	if not os.path.exists(HomeDir):
		442	os.makedirs(HomeDir)
		443	response = urlopen(url)
		444	r = response.getcode()
		445	if r != 200:
		446	# this is handled automatically in python3, the exception will be raised by urlopen
		447	raise IOError('HTTP response code '+str(r))
		448	if os.path.exists(HomeUnicodeData):
		449	out(HomeUnicodeData, ' already exists, but downloading as requested\n')
		450	out('downloading...')
		451	shutil.copyfileobj(response, gzip.open(tmp_file, 'wb'))
		452	shutil.move(tmp_file, HomeUnicodeData)
		453	out(HomeUnicodeData, ' downloaded\n')
		454	finally:
		455	if os.path.exists(tmp_file):
		456	os.remove(tmp_file)
		457
415	def GrepInNames(pattern, prefill_cache=False):	458	def GrepInNames(pattern, prefill_cache=False):
416	f = None	459	f = None
417	for name in UnicodeDataFileNames:	460	for name in UnicodeDataFileNames:
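Review note on `download_unicodedata()` in this hunk: the gzip file object passed to `shutil.copyfileobj` is never closed explicitly before `shutil.move`, so the final flush relies on the interpreter collecting the object in time. One possible tightening is sketched below, assuming Python 3 only (`os.replace` does not exist in Python 2, which the script still supports). The function name `save_gzipped_atomically` is hypothetical and the network part is left out; any readable binary file object works as the source.

```python
import gzip
import os
import shutil


def save_gzipped_atomically(source, dest_path):
    """Gzip the bytes read from `source` into `dest_path`.

    Hypothetical helper, not part of the patch. The data goes to a
    temporary file first, so an interrupted write does not leave a
    truncated file under the final name.
    """
    tmp_path = dest_path + '.tmp'
    try:
        # the with-block closes the gzip stream, which writes the trailer
        with gzip.open(tmp_path, 'wb') as gz:
            shutil.copyfileobj(source, gz)
        # assumption: Python 3.3+, where os.replace overwrites the target
        os.replace(tmp_path, dest_path)
    finally:
        # only reached with the temporary file still present on failure
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

In the patch's setting, `source` would be the `urlopen(url)` response and `dest_path` the `~/.unicode/UnicodeData.txt.gz` path.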
Lines 428-437 def GrepInNames(pattern, prefill_cache=F
428	Cannot find UnicodeData.txt, please place it into	471	Cannot find UnicodeData.txt, please place it into
429	/usr/share/unidata/UnicodeData.txt,	472	/usr/share/unidata/UnicodeData.txt,
430	/usr/share/unicode/UnicodeData.txt, ~/.unicode/ or current	473	/usr/share/unicode/UnicodeData.txt, ~/.unicode/ or current
431	working directory (optionally you can gzip it).	474	working directory (optionally you can gzip, bzip2 or xz it).
432	Without the file, searching will be much slower.	475	Without the file, searching will be much slower.
433		476
434	""" )	477	You can download the file from {} (or replace {} with current Unicode version); or run {} --download
		478
		479	""".format(get_unicodedata_url(), get_unicode_cur_version(), sys.argv[0]))
435		480
436	if prefill_cache:	481	if prefill_cache:
437	if f:	482	if f:
Lines 635-641 def print_characters(clist, maxcount, fo
635	if maxcount:	680	if maxcount:
636	counter += 1	681	counter += 1
637	if counter > options.maxcount:	682	if counter > options.maxcount:
638	out("\nToo many characters to display, more than %s, use --max 0 (or other value) option to change it\n" % options.maxcount)	683	sys.stdout.flush()
		684	sys.stderr.write("\nToo many characters to display, more than %s, use --max 0 (or other value) option to change it\n" % options.maxcount)
639	return	685	return
640	properties = get_unicode_properties(c)	686	properties = get_unicode_properties(c)
641	ordc = ord(c)	687	ordc = ord(c)
Lines 809-814 def is_range(s, typ):
809	def unescape(s):	855	def unescape(s):
810	return s.replace(r'\n', '\n')	856	return s.replace(r'\n', '\n')
811		857
		858	ascii_cc_names = ('NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL', 'BS', 'HT', 'LF', 'VT', 'FF', 'CR', 'SO', 'SI', 'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB', 'CAN', 'EM', 'SUB', 'ESC', 'FS', 'GS', 'RS', 'US')
		859
		860	def display_ascii_table():
		861	print('Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex')
		862	for row in range(0, 16):
		863	for col in range(0, 8):
		864	cp = 16*col+row
		865	ch = chr(cp) if 32<=cp else ascii_cc_names[cp]
		866	ch = 'DEL' if cp==127 else ch
		867	frm = '{:3d} {:02X} {:2s}'
		868	if cp < 32:
		869	frm = '{:3d} {:02X} {:4s}'
		870	elif cp >= 96:
		871	frm = '{:4d} {:02X} {:2s}'
		872	cell = frm.format(cp, cp, ch)
		873	print(cell, end='')
		874	print()
		875
		876	brexit_ascii_diffs = {
		877	30: ' ',
		878	31: ' ',
		879	34: "'",
		880	123: '{}{',
		881	125: '}}',
		882	127: ' ',
		883	128: ' ',
		884	129: ' ',
		885	}
		886
		887	def display_brexit_ascii_table():
		888	print(' + | 0 1 2 3 4 5 6 7 8 9')
		889	print('---+-----------------------------------------------')
		890	for row in range(30, 130, 10):
		891	print('{:3d}'.format(row), end='|')
		892	for col in range(0, 10):
		893	cp = col+row
		894	ch = brexit_ascii_diffs.get(cp, chr(cp))
		895	cell = ' {:3s} '.format(ch)
		896	print(cell, end='')
		897	print()
		898
		899
		900
812	format_string_default = '''{yellow}{bold}U+{ordc:04X} {name}{default}	901	format_string_default = '''{yellow}{bold}U+{ordc:04X} {name}{default}
813	{green}UTF-8:{default} {utf8} {green}UTF-16BE:{default} {utf16be} {green}Decimal:{default} {decimal} {green}Octal:{default} {octal}{opt_additional}	902	{green}UTF-8:{default} {utf8} {green}UTF-16BE:{default} {utf16be} {green}Decimal:{default} {decimal} {green}Octal:{default} {octal}{opt_additional}
814	{pchar}{opt_flipcase}{opt_uppercase}{opt_lowercase}	903	{pchar}{opt_flipcase}{opt_uppercase}{opt_lowercase}
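Review note on `display_ascii_table()` in this hunk: the table is laid out column-major, with `cp = 16*col + row`, so each printed row holds code points 16 apart. The sketch below separates that indexing and the label selection from the printing so both can be checked on their own. The function names `ascii_label` and `ascii_rows` are hypothetical; the control-code names are copied from the patch.

```python
ASCII_CC_NAMES = (
    'NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL',
    'BS', 'HT', 'LF', 'VT', 'FF', 'CR', 'SO', 'SI',
    'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB',
    'CAN', 'EM', 'SUB', 'ESC', 'FS', 'GS', 'RS', 'US',
)


def ascii_label(cp):
    """Label shown for code point cp (0..127). Hypothetical helper."""
    if cp == 127:
        return 'DEL'
    if cp < 32:
        return ASCII_CC_NAMES[cp]
    return chr(cp)


def ascii_rows():
    """16 rows of 8 code points each, using the patch's 16*col + row layout."""
    return [[16 * col + row for col in range(8)] for row in range(16)]
```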
Lines 880-889 def main():
880	action="store", dest="format_string", type="string",	969	action="store", dest="format_string", type="string",
881	default=format_string_default,	970	default=format_string_default,
882	help="formatting string")	971	help="formatting string")
883	parser.add_option("--brief", "--terse",	972	parser.add_option("--brief", "--terse", "--br",
884	action="store_const", dest="format_string",	973	action="store_const", dest="format_string",
885	const='{pchar} U+{ordc:04X} {name}\n',	974	const='{pchar} U+{ordc:04X} {name}\n',
886	help="Brief format")	975	help="Brief format")
		976	parser.add_option("--download",
		977	action="store_const", dest="download_unicodedata",
		978	const=True,
		979	help="Try to download UnicodeData.txt")
		980	parser.add_option("--ascii",
		981	action="store_const", dest="ascii_table",
		982	const=True,
		983	help="Display ASCII table")
		984	parser.add_option("--brexit-ascii", "--brexit",
		985	action="store_const", dest="brexit_ascii_table",
		986	const=True,
		987	help="Display ASCII table (EU-UK Trade and Cooperation Agreement version)")
887		988
888	global options	989	global options
889	(options, arguments) = parser.parse_args()	990	(options, arguments) = parser.parse_args()
Lines 899-904 def main():
899	print (textwrap.fill(' '.join(all_encodings)))	1000	print (textwrap.fill(' '.join(all_encodings)))
900	sys.exit()	1001	sys.exit()
901		1002
		1003	if options.ascii_table:
		1004	display_ascii_table()
		1005	sys.exit()
		1006
		1007	if options.brexit_ascii_table:
		1008	display_brexit_ascii_table()
		1009	sys.exit()
		1010
		1011	if options.download_unicodedata:
		1012	download_unicodedata()
		1013	sys.exit()
		1014
902	if len(arguments)==0:	1015	if len(arguments)==0:
903	parser.print_help()	1016	parser.print_help()
904	sys.exit()	1017	sys.exit()
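Review note on the three new options: `action="store_const"` with `const=True` leaves the attribute as `None` when the flag is absent, which is still falsy, so the `if options.ascii_table:` checks behave as intended. `optparse` also offers `action="store_true"`, which does the same with an explicit `False` default. A minimal sketch of that variant follows; the function name `build_parser` is hypothetical, while the option names and `dest` values are taken from the patch.

```python
from optparse import OptionParser


def build_parser():
    """Parser with only the three flags added by this patch (sketch)."""
    parser = OptionParser()
    parser.add_option("--download",
                      action="store_true", dest="download_unicodedata",
                      default=False,
                      help="Try to download UnicodeData.txt")
    parser.add_option("--ascii",
                      action="store_true", dest="ascii_table",
                      default=False,
                      help="Display ASCII table")
    parser.add_option("--brexit-ascii", "--brexit",
                      action="store_true", dest="brexit_ascii_table",
                      default=False,
                      help="Display ASCII table (EU-UK Trade and "
                           "Cooperation Agreement version)")
    return parser
```

Usage: `options, arguments = build_parser().parse_args(['--ascii'])` sets `options.ascii_table` to `True` and leaves the other two flags `False`.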