Skip to content

Rockys tweaks#8

Merged
rocky merged 32 commits intomasterfrom
rockys-tweaks
Feb 6, 2021
Merged

Rockys tweaks#8
rocky merged 32 commits intomasterfrom
rockys-tweaks

Conversation

@rocky
Copy link
Member

@rocky rocky commented Jan 31, 2021

  • Start a character table roundtrip test.
  • Document what is in the various directories
  • Add code to generate GNU Readline inputrc's
  • Split off generate code into its own directory.

@GarkGarcia I notice that already this is detecting a mismatch on what you recently commited today.

rocky added 4 commits January 22, 2021 22:25
* Start a character table roundtrip test.
* Document what is in the various directories
* Add code to generate GNU Readline inputrc's
* Split off generate code into its own directory.
@rocky
Copy link
Member Author

rocky commented Jan 31, 2021

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

@GarkGarcia
Copy link
Contributor

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

Ohh, I see.

@GarkGarcia
Copy link
Contributor

Hum, I was not able to reproduce the tests locally:

Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mathics_scanner.characters import replace_wl_with_plain_text as wl2uni
>>> from mathics_scanner.characters import replace_unicode_with_wl as uni2wl
>>> uni2wl(wl2uni("\u2062")) == "\u2062"
True
>>>

@GarkGarcia
Copy link
Contributor

GarkGarcia commented Jan 31, 2021

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

Ohh, I get it, the test isn't properly implemented: when the unicode equivalent of named character is precisely the WL unicode rerepresentation (c["unicode-equivalent"] == c["wl-unicode"]) the character isn't included in any of the JSON tables. This is implemented like so to make the JSON tables smaller (so that they load faster), since the conversion is redundant in this case.

For example, if we were to include \u2062 (\[InvisibleTimes]) in json_data["wl-to-unicode-dict"] and json_data["unicode-to-wl-dict"] then json_data["wl-to-unicode-dict"]["\u2062"] would be "\u2062" and json_data["unicode-to-wl-dict"]["\u2062"] would be "\u2062": we would be replacing "\u2062" by "\u2062" and then replacing "\u2062" by "\u2062" again. Including "\u2062" in the tables would not only make the tables larger, but it would also make wl_to_unicode_re and unicode_to_wl_re check for "\u2062" (which isn't needed since the conversion is redundant).

There's three way we can deal with this:

  1. Add notes in the documentation about this, so that users don't make the same mistake.
  2. Remove wl_to_unicode, wl_to_ascii and unicode_to_wl from our public API, so that users are forced to use replace_wl_with_plain_text and replace_unicode_with_wl (in which case roundtriping would work fine).
  3. Add the redundant keys to the JSON tables but avoid including redundant keys in the regex patterns.

Personally, I think we should go with the second solution. The JSON tables really are an implementation detail and users should not expect them to include each and every existing named character, some of them don't have unicode equivalents/inverses and some of them are redundant when it comes to the conversion scheme.

When it comes to the tests, we should test replace_wl_with_plain_text and replace_unicode_with_wl, not the raw tables.

@GarkGarcia
Copy link
Contributor

@rocky If you really want to tests the tables themselves, we should test the YAML table, not the JSON tables.

@rocky
Copy link
Member Author

rocky commented Jan 31, 2021

Ohh, I get it, the test isn't properly implemented: when the unicode equivalent of named character is precisely the WL unicode rerepresentation (c["unicode-equivalent"] == c["wl-unicode"]) the character isn't included in any of the JSON tables.

If that were the case then nothing would have failed. So this is not why the test is failing. In fact, if you go back to the data before your most recent changes the test was not failing. The most likely explanation of why things are failing is that something was marked invertable when it wasn't.

Add notes in the documentation about this, so that users don't make the same mistake.

Please do. In fact I had asked you before about going over adding comments in the YAML like the ones I added about subtllties or why a decision was made to do something in the way it was done.

Remove wl_to_unicode, wl_to_ascii and unicode_to_wl from our public API, so that users are forced to use replace_wl_with_plain_text and replace_unicode_with_wl (in which case roundtriping would work fine).

It is early enough that this is fine too. But aslo note as mentioned yesterday in a chat comment there are two kinds of unicode and there needs to be a way to indicate which unicode you want.

The data in the YAML should be designed to so that there isn't duplication of anything and so by definition it will be self consistent. What we are checking in the tests is the JSON generated from the YAML. There are probably some things that can be tested in the JSON alone, For example that escape sequences for in inputrc start and in in an escape. However there are going to be some things where we have to compare YAML against JSON.

Some of these could be counts for example: that the number of letterlike entires in the YAML is the same number of letterlike entries in JSON tables.

And you should be the one who should have been writing this all along. And it would be nice if I wouldn't have to remind you as many times as I have to do this. You seem to prioritize doing more tasks which potentially (and some cases do) break things over completing large breaking tasks undertook before. This makes it is unpleasant for me to work on this project.

Untill we have a release of everything would you please stop doing new things and focus on shoring up what we currently have. Make a pass over the developer docs (I still haven't merged in the character table stuff in that project but I will do so soon). And add more tests here.

Untill we have finished with the release of everything. please don't work on incorprating unicode into formatting.

@GarkGarcia
Copy link
Contributor

If that were the case then nothing would have failed. So this is not why the test is failing. In fact, if you go back to the data before your most recent changes the test was not failing. The most likely explanation of why things are failing is that something was marked invertable when it wasn't.

Take a look at the JSON data and you'll see \u2062 isn't a key in any of the tables. The problem is we're iteration through the keys of the YAML table, which does contain redundant entries (since it's not optimized for anything at all).

This is failing right now because I implemented this optimization yesterday. As I've shown in my previous comment, the roundtrip does work if you use replace_unicode_with_plain_text and replace_unicode_with_wl instead of the raw tables.

Please do. In fact I had asked you before about going over adding comments in the YAML like the ones I added about subtllties or why a decision was made to do something in the way it was done.

Ok, I'll add extra notes about this in build-tables.py.

It is early enough that this is fine too. But aslo note as mentioned yesterday in a chat comment there are two kinds of unicode and there needs to be a way to indicate which unicode you want.

Fair enough, but this is already handled by the use_unicode parameter. In fact, all use_unicode does is switch between the tables for the two "kinds of unicode" (strict ASCII and full unicode).

The data in the YAML should be designed to so that there isn't duplication of anything and so by definition it will be self consistent.

I disagree with this a bit. Yes the data should be self-consistent (in fact, it is), but even when a key is redundant it should be listed in the YAML table. The point of the YAML table is that we have all of the information in a single place, even if the data is redundant.

There are probably some things that can be tested in the JSON alone, For example that escape sequences for in inputrc start and in in an escape. However there are going to be some things where we have to compare YAML against JSON.

Absolutely. For example, we probably should check if there are redundant keys in any of the JSON tables (because there shouldn't be). I guess my point is that anything concerning conversions should be tested with the conversion routines, not the raw tables.

And you should be the one who should have been writing this all along.

I didn't implement the test because I don't know how pytest works and I didn't know how to set that up, but now you've added the tests I can work on those.

I'll work on implementing everything we discussed in here today.

@rocky
Copy link
Member Author

rocky commented Jan 31, 2021

Take a look at the JSON data and you'll see \u2062 isn't a key in any of the tables. The problem is we're iteration through the keys of the YAML table, which does contain redundant entries (since it's not optimized for anything at all).

Ok. Then the simple adjustment is to iterate over the keys of the JSON table. However we need to cross check against the YAML data (using the JSON key) to see if the character is invertable. If it is, then the assert as written should be done.

Ok, I'll add extra notes about this in build-tables.py.

Great! Also add some these notes to the YAML file as well. Don't assume people will look at Python files to understand how data is organized in the YAML file and sublteties the YAML data might have. It is okay in the YAML file to refer to the Python program docstring as well.

I didn't implement the test because I don't know how pytest works and I didn't know how to set that up, but now you've added the tests I can work on those.

Ok. Great do what you can.

But let's also learn from this. There is this pattern of starting something a bit ambitious and a little more than you can handle all on your own. That's okay and best done after a release. But right now we are trying to solidify what has already been done so we can get it out there. So moving onto new projects right now is a bad idea.

I really regret not having put out a relase after the Django separation was done. And then having this release which woudn't have included that as part of this release but just the things since then.

@GarkGarcia
Copy link
Contributor

I have to go now, but I'll work on more tests, documentation and fixing the errors later today. @rocky I'll also properly answer your last comment when I come back.

@GarkGarcia
Copy link
Contributor

Ok. Then the simple adjustment is to iterate over the keys of the JSON table. However we need to cross check against the YAML data (using the JSON key) to see if the character is invertable. If it is, then the assert as written should be done.

Hum, makes sense. I ended up using the conversion routines instead of key-lookup (as I had proposed), but I'll the checks you described too.

Great! Also add some these notes to the YAML file as well. Don't assume people will look at Python files to understand how data is organized in the YAML file and sublteties the YAML data might have. It is okay in the YAML file to refer to the Python program docstring as well.

Makes sense too. I'll add the notes to the YAML and the README too.

But let's also learn from this. There is this pattern of starting something a bit ambitious and a little more than you can handle all on your own. That's okay and best done after a release. But right now we are trying to solidify what has already been done so we can get it out there. So moving onto new projects right now is a bad idea.

I agree. There was also a communication deficit on my part. The algorithms for resolving translations ended up being quite involved, with settle optimizations, and I failed to communicate with you guys as it developed.

@GarkGarcia
Copy link
Contributor

GarkGarcia commented Jan 31, 2021

@rocky That's interesting: https://github.com/Mathics3/mathics-scanner/pull/8/checks?check_run_id=1801929677 led me to find a bug in the official Wolfram listing of named characters (both \[Sampi] and \[CapitalSampi] are supposedly represented by the unicode character GREEK LETTER SAMPI).

@GarkGarcia
Copy link
Contributor

@rocky I finally got the tests to pass. The sanity tests I implemented caught multiples little errors in the YAML table. I wasn't expecting so many, but I'm glad we caught them. Anyway, I wrote all the tests I thought would be relevant, is there something else you'd like to include in the tests?

I'll work on documentation later today.

@GarkGarcia
Copy link
Contributor

It's also worth noting this tests only guarantee the YAML data is consistent (they don't have any way of knowing if there's a better choice for translating a character or not).

@GarkGarcia
Copy link
Contributor

GarkGarcia commented Jan 31, 2021

Btw sorry for possibly spaming your mail-box with "GitHub run failed" messages, I couldn't get pytest to work locally so I had to keep testing things trough CI.

Um, ok. pip install pytest usually works. Before merging it would be good to squash all of the "Fix another tests" into one commit.

This isn't the first time I've noticed you use CI where others typically test locally. However, if that's going to be the case, it is much preferable that you do that in in a smaller module like this one rather than on Mathics which takes 15 minutes per run and consumes massive resources.

I suspect the one reason we are getting longer delay times when starting Travis is because it keeps an account of testing time we use. I don't know if it is still the situation, but they used to keep a running average of the resources used and if you exceeded a thresshold, your jobs would start getting delayed more often.

@rocky
Copy link
Member Author

rocky commented Jan 31, 2021

Ok. Thanks for doing all of this!

As I watched the commits flying by, one thing that struck me is that there is a lot of the same code that reads in YAML and JSON data. Maybe a helper function in a common module would be useful. Also, within a test maybe the data could be read in just once instead of in each test.

@GarkGarcia
Copy link
Contributor

This isn't the first time I've noticed you use CI where others typically test locally. However, if that's going to be the case, it is much preferable that you do that in in a smaller module like this one rather than on Mathics which takes 15 minutes per run and consumes massive resources.

Yeah, I have the habit of doing so, mainly because I'm quite unexperienced when it comes to developing software for production and because I never can get pytest to work on my machine. I just tried running pip3 install pytest but after install it complains about "yaml not having a FullLoader attribute" (which it clearly does, since I can access from the repo).

As I watched the commits flying by, one thing that struck me is that there is a lot of the same code that reads in YAML and JSON data. Maybe a helper function in a common module would be useful. Also, within a test maybe the data could be read in just once instead of in each test.

Sure, I'll work on that after the documentation.

@rocky
Copy link
Member Author

rocky commented Jan 31, 2021

because I never can get pytest to work on my machine.

Ok. Some weekend we should go over your setup to make sure you can run pytest.

@GarkGarcia
Copy link
Contributor

GarkGarcia commented Jan 31, 2021

@rocky There's still some things missing before we can release the package:

  • We need to properly document functions and classes in our public API.
  • We need to mark functions/constants that shouldn't be in the public API with a _ at the start.

Check-out implementation.rst for the internal documentation I promised. We may want to convert this to Sphinx before the release.

@GarkGarcia GarkGarcia mentioned this pull request Jan 31, 2021
@GarkGarcia
Copy link
Contributor

GarkGarcia commented Feb 1, 2021

@rocky I implemented all the tests you asked for and all of the tests I thought were relevant. Please let me know if there's something else you want me to implement. For now, I get to work on other things I'm interested in.

@rocky
Copy link
Member Author

rocky commented Feb 6, 2021

@GarkGarcia just had a chance to look at in more detail. Overall looks pretty good.

I updated a docstring and added a couple of tests. Look over if you have a chance. Tomorrow, I'll look at the other PR, try merging them in and cutting a release of this part.

@rocky rocky merged commit 83f430f into master Feb 6, 2021
@rocky rocky deleted the rockys-tweaks branch March 19, 2021 01:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants