I'd like to support Python 2.5 and 2.6 too, so I'd rather keep Python 3 support in a separate branch.
Python 2.5 and 2.6 are quite old; even Django and big libraries like it don't support them anymore… But yeah, that's your choice. So why use str? Make readability accept only unicode input, and then use unicode everywhere in readability; that's way simpler. OK, BTW, I'm really interested in this Python library. Before, I was using a custom solution, and I discovered that it's really not easy…
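A minimal sketch of the "unicode-only input" idea, assuming Python 3 (where `str` is the unicode type). The helper name `require_text` is hypothetical and is not part of readability's API; it only illustrates rejecting bytes at the boundary so that everything past it can assume decoded text.

```python
def require_text(html):
    """Return html unchanged if it is decoded text; reject raw bytes.

    Hypothetical helper, not part of readability: it pushes the decoding
    decision to the caller instead of guessing an encoding internally.
    """
    if isinstance(html, (bytes, bytearray)):
        raise TypeError(
            "expected decoded text (str), got bytes; decode before calling"
        )
    return html


# The caller decodes once, at the edge, with an encoding it knows or detected.
raw = "<p>café</p>".encode("utf-8")
text = require_text(raw.decode("utf-8"))
```

With a guard like this, no internal function needs to handle both bytes and text.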
Well, mostly I do updates for my users -- I rarely get a chance to use the package myself more than once a year (until this year). I mean, on more than a few sites at a time.
Libxml, which is the base of lxml, uses UTF-8 under the covers. You'll get automatic conversion to UTF-8 anyway; it's really just a matter of whether you'd like that to be implicit or explicit. In older lxml with Python 2 there were no implicit UTF-8/unicode conversions, which is why I used an explicit one. Maybe things have changed a little.
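To make "explicit" concrete, here is a small sketch of transcoding bytes to UTF-8 by hand before they reach a parser. The function name `to_utf8` and the Latin-1 sample are assumptions for illustration only; this is not the code in readability's encoding.py.

```python
def to_utf8(raw, source_encoding):
    """Explicitly transcode raw bytes from source_encoding to UTF-8 bytes."""
    text = raw.decode(source_encoding)  # bytes -> unicode text
    return text.encode("utf-8")         # unicode text -> UTF-8 bytes


# A Latin-1 encoded page fragment, converted explicitly.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```

The implicit alternative is to hand the parser already-decoded text and let it do the equivalent conversion internally.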
Except that in real life, the requests package doesn't work for a lot of real pages.
Yes, I know this, but most of all I'm interested in scalable approaches. If you only parse a few sites and a few pages from each, almost any tool is OK, but if you parse thousands of sites, you need a tool that won't break and won't need much customization for each specific site.
Thanks a lot!
Hi,
I'm very sad that this library hasn't been ported to Python 3.
I have made a port that is quite different from the one by @Ftzeng, as I have removed all the encoding stuff, and the tests still seem to pass with Python 2.7 and Python 3. I use requests for downloading the webpages and detecting the correct encoding.
I'm nearly sure there is still work to do, as my tests were very shallow, but I would really like to have a port done… (and encoding.py is quite… bad, I think: if you use UTF-8 strings everywhere, there should be no problem).
What do you think? :)
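A rough sketch of the "let requests detect the encoding" approach, assuming Python 3. The helper name `decode_response` and the UTF-8 fallback are assumptions, not code from the port. It relies only on three attributes that a requests `Response` provides: `content` (raw bytes), `encoding` (charset taken from the HTTP headers, possibly `None`) and `apparent_encoding` (a guess based on the body bytes).

```python
def decode_response(response, fallback="utf-8"):
    """Decode a requests-style response body to text.

    Hypothetical helper: prefers the header-declared charset, then the
    charset guessed from the body, then the fallback encoding.
    """
    encoding = response.encoding or response.apparent_encoding or fallback
    try:
        return response.content.decode(encoding, errors="replace")
    except LookupError:
        # The server declared a charset name Python does not know.
        return response.content.decode(fallback, errors="replace")


# Usage, assuming requests is installed and `url` is defined by the caller:
#   import requests
#   text = decode_response(requests.get(url))
```

Header-declared charsets can be missing or wrong on real pages, so a sketch like this is only as reliable as the server's headers and the byte-level guess.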