Bug 31460

Summary: wp2git: Import Wikipedia page history to git
Product: New/proposed packages Reporter: Ivan Zakharyaschev <imz>
Component: Regular repository Assignee: Andrey Cherepanov <cas>
Status: NEW --- QA Contact: Andrey Cherepanov <cas>
Severity: normal    
Priority: P3 CC: viy
Version: unspecified   
Hardware: all   
OS: Linux   
URL: http://blog.thecybershadow.net/2010/06/16/import-wikipedia-page-history-to-git/
Bug Depends on:    
Bug Blocks: 31414    

Description Ivan Zakharyaschev 2015-11-09 13:26:28 MSK
wp2git:  Import Wikipedia page history to git

https://github.com/CyberShadow/wp2git is the original in D.

https://github.com/dlenski/wp2git is a fork in Python (based on mwclient, which is already in Sisyphus).
Comment 1 Ivan Zakharyaschev 2015-11-09 13:32:22 MSK
* It could also be used to import into Git repos some of ALT's documentation that is edited at http://altlinux.org

* As for me, I'm going to use it to import the text of the GOST that is implemented by the LaTeX package in https://bugzilla.altlinux.org/show_bug.cgi?id=31414 from Wikisource (https://ru.wikisource.org/wiki/%D0%93%D0%9E%D0%A1%D0%A2_7.32%E2%80%942001 ), where the text is collaboratively maintained.
Comment 2 Ivan Zakharyaschev 2015-11-09 22:14:26 MSK
BTW, when I try to use it, there are some problems.

I can't file an issue against the project on GitHub, probably because it is a fork. Still, the fork is where the issue belongs, because the problems look Python-related.

Here are the errors I get (the last run, against English Wikipedia,
is successful; my default language is perhaps Russian because of the
locale).

For now, I have no idea whether this should be fixed in the program
or in my environment.
Comment 3 Ivan Zakharyaschev 2015-11-09 22:15:26 MSK
$ wp2git.py --help
usage: wp2git.py [-h] [-n] [-o OUT] [--lang LANG | --site SITE] article_name

Create a git repository with the history of the specified Wikipedia article.

positional arguments:
  article_name

optional arguments:
  -h, --help         show this help message and exit
  -n, --no-import    Don't invoke git fast-import; only generate fast-import data stream
  -o OUT, --out OUT  Output directory or fast-import stream file
  --lang LANG        Wikipedia language code (default ru)
  --site SITE        Alternate site (e.g. http://commons.wikimedia.org[/w/])
$ wp2git.py --site https://ru.wikisource.org 'ГОСТ 7.32—2001'
Connected to https://ru.wikisource.org/w/
Traceback (most recent call last):
  File "/home/imz/bin/wp2git.py", line 110, in <module>
    main()
  File "/home/imz/bin/wp2git.py", line 63, in main
    page = site.pages[args.article_name]
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 156, in __getitem__
    return self.get(name, None)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 166, in get
    namespace = self.guess_namespace(name)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 178, in guess_namespace
    if name.startswith(u'%s:' % self.site.namespaces[ns].replace(' ', '_')):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$ wp2git.py --site http://ru.wikisource.org 'ГОСТ 7.32—2001'
Connected to http://ru.wikisource.org/w/
Traceback (most recent call last):
  File "/home/imz/bin/wp2git.py", line 110, in <module>
    main()
  File "/home/imz/bin/wp2git.py", line 63, in main
    page = site.pages[args.article_name]
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 156, in __getitem__
    return self.get(name, None)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 166, in get
    namespace = self.guess_namespace(name)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 178, in guess_namespace
    if name.startswith(u'%s:' % self.site.namespaces[ns].replace(' ', '_')):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
$ wp2git.py Bear
Connected to http://ru.wikipedia.org/w/
Traceback (most recent call last):
  File "/home/imz/bin/wp2git.py", line 110, in <module>
    main()
  File "/home/imz/bin/wp2git.py", line 65, in main
    p.error('Page %s does not exist' % s)
NameError: global name 's' is not defined
$ wp2git.py Медведь
Connected to http://ru.wikipedia.org/w/
Traceback (most recent call last):
  File "/home/imz/bin/wp2git.py", line 110, in <module>
    main()
  File "/home/imz/bin/wp2git.py", line 63, in main
    page = site.pages[args.article_name]
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 156, in __getitem__
    return self.get(name, None)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 166, in get
    namespace = self.guess_namespace(name)
  File "/usr/lib64/python2.7/site-packages/mwclient/listing.py", line 178, in guess_namespace
    if name.startswith(u'%s:' % self.site.namespaces[ns].replace(' ', '_')):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
$ wp2git.py --lang en Bear
Connected to http://en.wikipedia.org/w/
Initialized empty Git repository in /home/imz/tests/test-wp2git/Bear/
 >> Revision 239584 by TimShell at Wed Oct 10 21:50:27 2001: *
 >> Revision 346214979 by Alan Millar at Wed Oct 10 22:43:35 2001: Fixing panda back to giant panda
 >> Revision 50758 by Conversion script at Mon Feb 25 15:43:11 2002: Automated conversion
 >> Revision 87603 by Mirwin at Thu Apr 11 20:33:54 2002: Added grizzly bear to list
 >> Revision 88030 by 24.53.240.203 at Fri Jun  7 12:00:50 2002: *
 >> Revision 112194 by Stephen Gilbert at Fri Jun  7 17:39:15 2002: removing dictionary.com link
 >> Revision 132079 by PierreAbbat at Sun Jul  7 08:40:44 2002: restore accidentally deleted end of sentence
 >> Revision 192849 by Andre Engels at Wed Jul 31 07:41:08 2002: de-orphanizing an image
 >> Revision 227848 by Montrealais at Tue Sep  3 11:17:32 2002: 
 >> Revision 227861 by 203.48.160.12 at Wed Sep 18 23:44:43 2002: 
 >> Revision 398194 by Mav at Wed Sep 18 23:56:46 2002: REVERT from VANDALISM by 203.48.160.12
 >> Revision 398212 by Fred Bauder at Fri Nov  1 17:01:16 2002: further reading
 >> Revision 590676 by Stormwriter at Fri Nov  1 17:07:50 2002: 
 >> Revision 590687 by Karen Johnson at Thu Jan 16 11:07:36 2003: I'm not sure which type of bear this is, but uploading a pic I took
 >> Revision 590917 by MartinHarper at Thu Jan 16 11:24:04 2003: link [[bear market]]
 >> Revision 626017 by Robert Merkel at Thu Jan 16 13:28:02 2003: link to koala (mention it's *not* a bear
 >> Revision 626559 by Sannse at Tue Jan 28 12:39:35 2003: [[American]] -> [[United States|American]]
 >> Revision 629029 by 207.213.160.63 at Tue Jan 28 18:52:45 2003: 
 >> Revision 629038 by Bronco~enwiki at Wed Jan 29 18:50:37 2003: Our fifth graders have finished for the time being.
 >> Revision 629079 by Bronco~enwiki at Wed Jan 29 18:53:21 2003: Done?
 >> Revision 659547 by Fred Bauder at Wed Jan 29 19:14:46 2003: removed information about authors of the article
 >> Revision 660956 by Alan Peakall at Tue Feb 11 12:53:34 2003: Copy edit and rationalised links to the Panda articles
 >> Revision 735916 by Ahoerstemeier at Tue Feb 11 22:09:12 2003: cave bear
 >> Revision 748674 by Montrealais at Mon Mar 10 01:37:20 2003: 
 >> Revision 769708 by Kricxjo at Sat Mar 15 09:44:21 2003: eo:
 >> Revision 816500 by Fred Bauder at Sun Mar 23 12:31:46 2003: re use
 >> Revision 930991 by ArnoLagrange at Thu Apr 10 08:00:09 2003: de
 >> Revision 931028 by Tannin at Sat May 17 19:38:29 2003: 
 >> Revision 988458 by Tannin at Sat May 17 19:48:11 2003: 
 >> Revision 988462 by Eclecticology at Mon Jun  2 04:11:48 2003: fixing capitalization
 >> Revision 988465 by Eclecticology at Mon Jun  2 04:12:28 2003: 
 >> Revision 988504 by Tannin at Mon Jun  2 04:12:56 2003: revert to correct case
 >> Revision 988507 by Eclecticology at Mon Jun  2 04:25:27 2003: revert to correct capitalization
 >> Revision 988932 by Tannin at Mon Jun  2 04:26:22 2003: revert
 >> Revision 988936 by Eclecticology at Mon Jun  2 08:13:45 2003: revert
 >> Revision 1015549 by Tannin at Mon Jun  2 08:14:53 2003: revert to correct version
 >> Revision 1122344 by &#178;&#185;&#178; at Mon Jun  9 12:42:36 2003: 
 >> Revision 1122374 by TeunSpaans at Mon Jul  7 12:34:28 2003: +nl
 >> Revision 1122394 by Andre Engels at Mon Jul  7 12:46:16 2003: merged Ursidae in here
 >> Revision 1122415 by Andre Engels at Mon Jul  7 12:55:18 2003: 
 >> Revision 1122422 by Jimfbleak at Mon Jul  7 13:11:09 2003: treid to make text more grown-up
 >> Revision 1152613 by Rmhermen at Mon Jul  7 13:16:15 2003: typos
 >> Revision 1152615 by Andre Engels at Tue Jul 15 17:32:46 2003: made images wrap-around
 >> Revision 1160907 by Andre Engels at Tue Jul 15 17:33:18 2003: 
 >> Revision 1320504 by Baldhur at Thu Jul 17 17:16:36 2003: + taxobox, standardising classification
 >> Revision 1320681 by 81.203.98.109 at Wed Aug 20 20:46:26 2003: 
 >> Revision 1406742 by Rmhermen at Wed Aug 20 21:24:43 2003: 
 >> Revision 1406748 by 62.64.204.83 at Sun Sep  7 20:43:36 2003: 
 >> Revision 1411109 by 62.64.204.83 at Sun Sep  7 20:44:49 2003: 
 >> Revision 1411169 by Rmhermen at Mon Sep  8 18:51:45 2003: 
$
Comment 4 Ivan Zakharyaschev 2015-11-10 01:48:17 MSK
That error happens with python-module-mwclient-0.6.5-alt1.1 from t7. Sisyphus has a newer version. I shall try that one.
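
Separately, the NameError from the `wp2git.py Bear` run does not look related to mwclient: line 65 of wp2git.py formats its message with a name `s` that is not defined anywhere in the traceback's scope. Below is a minimal sketch of what that error path was probably meant to do. It assumes that `p` is an argparse parser (the --help output looks like argparse) and that the page title is `args.article_name`, as on line 63 of the traceback. `build_parser` and `report_missing_page` are hypothetical names for illustration, not the script's own.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the parser that wp2git.py builds.
    p = argparse.ArgumentParser(
        description='Create a git repository with the history of the '
                    'specified Wikipedia article.')
    p.add_argument('article_name')
    return p


def report_missing_page(p, article_name):
    # Use the name that actually holds the title instead of the undefined `s`.
    # ArgumentParser.error() prints the usage and the message to stderr,
    # then exits with status 2.
    p.error('Page %s does not exist' % article_name)
```

With something like this, a missing page such as "Bear" on ru.wikipedia.org would end with a readable "does not exist" message instead of a traceback.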
Comment 5 Ivan Zakharyaschev 2015-11-10 01:58:43 MSK
No, the same error happens with python-module-mwclient-0.7-alt1.dev.git20140622
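
A possible explanation, offered as a guess rather than a confirmed diagnosis: under Python 2.7 the title from the command line is a byte string, and `name.startswith(u'...')` in mwclient's listing.py makes Python decode it implicitly with the ASCII codec. Both 'ГОСТ ...' and 'Медведь' start with byte 0xd0 in UTF-8, which matches "byte 0xd0 in position 0" in the tracebacks. The sketch below reproduces that decode failure explicitly and shows one way the caller could decode the argument before handing it to mwclient. `to_text` is a hypothetical helper, and treating the argument as UTF-8 (or the locale's preferred encoding) is an assumption.

```python
# -*- coding: utf-8 -*-
import locale


def to_text(value, encoding=None):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; the fallback encoding is an assumption.
    """
    if isinstance(value, bytes):
        enc = encoding or locale.getpreferredencoding() or 'utf-8'
        return value.decode(enc)
    return value


# What sys.argv would carry for this title in a UTF-8 locale under Python 2.
raw_title = u'Медведь'.encode('utf-8')

# The implicit conversion in Python 2 behaves like an explicit ASCII decode.
try:
    raw_title.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = str(exc)

# Decoding with the right codec first yields a proper text string.
title = to_text(raw_title, 'utf-8')
```

If this guess is right, calling something like `site.pages[to_text(args.article_name)]` in wp2git.py might avoid the error with either mwclient version, but I have not verified that.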