Rockys tweaks by rocky · Pull Request #8 · Mathics3/Mathics3-scanner

rocky · 2021-01-31T02:35:37Z

Start a character table roundtrip test.
Document what is in the various directories
Add code to generate GNU Readline inputrc's
Split off generate code into its own directory.

@GarkGarcia I notice that already this is detecting a mismatch on what you recently commited today.

* Start a character table roundtrip test. * Document what is in the various directories * Add code to generate GNU Readline inputrc's * Split off generate code into its own directory.

rocky · 2021-01-31T02:37:03Z

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

GarkGarcia · 2021-01-31T12:25:22Z

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

Ohh, I see.

GarkGarcia · 2021-01-31T12:30:26Z

Hum, I was not able to reproduce the tests locally:

Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mathics_scanner.characters import replace_wl_with_plain_text as wl2uni
>>> from mathics_scanner.characters import replace_unicode_with_wl as uni2wl
>>> uni2wl(wl2uni("\u2062")) == "\u2062"
True
>>>

GarkGarcia · 2021-01-31T12:53:18Z

Failure seen in recent commit:

yaml_data = {'AAcute': {'esc-alias': "a'", 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'á', ...},...eDot': {'esc-alias': 'a"', 'has-unicode-inverse': False, 'is-letter-like': False, 'unicode-equivalent': 'ä', ...}, ...}
json_data = {'aliased-characters': {' =>': '⇒', '!': '¬', '!<': '≮', '!</': '\uf424', ...}, 'letterlikes': '\uf768\uf760∠Å˘•¸¢♣\uf...DoubleDot': 'ä', ...}, 'unicode-to-wl-dict': {' ̑': '\uf755', '#': '\uf724', '==': '\uf7d9', 'C̣': '\uf81c', ...}, ...}

    def check_roundtrip(yaml_data: dict, json_data: dict):
        wl_to_unicode = json_data["wl-to-unicode-dict"]
        unicode_to_wl = json_data["unicode-to-wl-dict"]
        for k, v in yaml_data.items():
            if v["has-unicode-inverse"]:
                u = v["wl-unicode"]
>               assert (
                    unicode_to_wl[wl_to_unicode[u]] == u
                ), f"key {k} unicode {u}, {wl_to_unicode[u]}"
E               KeyError: '\u2062'

test/test_roundtrip.py:12: KeyError
================================= short test summary info ==================================
FAILED test/test_roundtrip.py::test_roundtrip - KeyError: '\u2062'
=============================== 1 failed, 34 passed in 1.36s ===============================

Ohh, I get it, the test isn't properly implemented: when the unicode equivalent of named character is precisely the WL unicode rerepresentation (c["unicode-equivalent"] == c["wl-unicode"]) the character isn't included in any of the JSON tables. This is implemented like so to make the JSON tables smaller (so that they load faster), since the conversion is redundant in this case.

For example, if we were to include \u2062 (\[InvisibleTimes]) in json_data["wl-to-unicode-dict"] and json_data["unicode-to-wl-dict"] then json_data["wl-to-unicode-dict"]["\u2062"] would be "\u2062" and json_data["unicode-to-wl-dict"]["\u2062"] would be "\u2062": we would be replacing "\u2062" by "\u2062" and then replacing "\u2062" by "\u2062" again. Including "\u2062" in the tables would not only make the tables larger, but it would also make wl_to_unicode_re and unicode_to_wl_re check for "\u2062" (which isn't needed since the conversion is redundant).

There's three way we can deal with this:

Add notes in the documentation about this, so that users don't make the same mistake.
Remove wl_to_unicode, wl_to_ascii and unicode_to_wl from our public API, so that users are forced to use replace_wl_with_plain_text and replace_unicode_with_wl (in which case roundtriping would work fine).
Add the redundant keys to the JSON tables but avoid including redundant keys in the regex patterns.

Personally, I think we should go with the second solution. The JSON tables really are an implementation detail and users should not expect them to include each and every existing named character, some of them don't have unicode equivalents/inverses and some of them are redundant when it comes to the conversion scheme.

When it comes to the tests, we should test replace_wl_with_plain_text and replace_unicode_with_wl, not the raw tables.

GarkGarcia · 2021-01-31T12:56:12Z

@rocky If you really want to tests the tables themselves, we should test the YAML table, not the JSON tables.

rocky · 2021-01-31T13:36:12Z

Ohh, I get it, the test isn't properly implemented: when the unicode equivalent of named character is precisely the WL unicode rerepresentation (c["unicode-equivalent"] == c["wl-unicode"]) the character isn't included in any of the JSON tables.

If that were the case then nothing would have failed. So this is not why the test is failing. In fact, if you go back to the data before your most recent changes the test was not failing. The most likely explanation of why things are failing is that something was marked invertable when it wasn't.

Add notes in the documentation about this, so that users don't make the same mistake.

Please do. In fact I had asked you before about going over adding comments in the YAML like the ones I added about subtllties or why a decision was made to do something in the way it was done.

Remove wl_to_unicode, wl_to_ascii and unicode_to_wl from our public API, so that users are forced to use replace_wl_with_plain_text and replace_unicode_with_wl (in which case roundtriping would work fine).

It is early enough that this is fine too. But aslo note as mentioned yesterday in a chat comment there are two kinds of unicode and there needs to be a way to indicate which unicode you want.

The data in the YAML should be designed to so that there isn't duplication of anything and so by definition it will be self consistent. What we are checking in the tests is the JSON generated from the YAML. There are probably some things that can be tested in the JSON alone, For example that escape sequences for in inputrc start and in in an escape. However there are going to be some things where we have to compare YAML against JSON.

Some of these could be counts for example: that the number of letterlike entires in the YAML is the same number of letterlike entries in JSON tables.

And you should be the one who should have been writing this all along. And it would be nice if I wouldn't have to remind you as many times as I have to do this. You seem to prioritize doing more tasks which potentially (and some cases do) break things over completing large breaking tasks undertook before. This makes it is unpleasant for me to work on this project.

Untill we have a release of everything would you please stop doing new things and focus on shoring up what we currently have. Make a pass over the developer docs (I still haven't merged in the character table stuff in that project but I will do so soon). And add more tests here.

Untill we have finished with the release of everything. please don't work on incorprating unicode into formatting.

GarkGarcia · 2021-01-31T14:40:59Z

If that were the case then nothing would have failed. So this is not why the test is failing. In fact, if you go back to the data before your most recent changes the test was not failing. The most likely explanation of why things are failing is that something was marked invertable when it wasn't.

Take a look at the JSON data and you'll see \u2062 isn't a key in any of the tables. The problem is we're iteration through the keys of the YAML table, which does contain redundant entries (since it's not optimized for anything at all).

This is failing right now because I implemented this optimization yesterday. As I've shown in my previous comment, the roundtrip does work if you use replace_unicode_with_plain_text and replace_unicode_with_wl instead of the raw tables.

Please do. In fact I had asked you before about going over adding comments in the YAML like the ones I added about subtllties or why a decision was made to do something in the way it was done.

Ok, I'll add extra notes about this in build-tables.py.

It is early enough that this is fine too. But aslo note as mentioned yesterday in a chat comment there are two kinds of unicode and there needs to be a way to indicate which unicode you want.

Fair enough, but this is already handled by the use_unicode parameter. In fact, all use_unicode does is switch between the tables for the two "kinds of unicode" (strict ASCII and full unicode).

The data in the YAML should be designed to so that there isn't duplication of anything and so by definition it will be self consistent.

I disagree with this a bit. Yes the data should be self-consistent (in fact, it is), but even when a key is redundant it should be listed in the YAML table. The point of the YAML table is that we have all of the information in a single place, even if the data is redundant.

There are probably some things that can be tested in the JSON alone, For example that escape sequences for in inputrc start and in in an escape. However there are going to be some things where we have to compare YAML against JSON.

Absolutely. For example, we probably should check if there are redundant keys in any of the JSON tables (because there shouldn't be). I guess my point is that anything concerning conversions should be tested with the conversion routines, not the raw tables.

And you should be the one who should have been writing this all along.

I didn't implement the test because I don't know how pytest works and I didn't know how to set that up, but now you've added the tests I can work on those.

I'll work on implementing everything we discussed in here today.

rocky · 2021-01-31T15:11:26Z

Take a look at the JSON data and you'll see \u2062 isn't a key in any of the tables. The problem is we're iteration through the keys of the YAML table, which does contain redundant entries (since it's not optimized for anything at all).

Ok. Then the simple adjustment is to iterate over the keys of the JSON table. However we need to cross check against the YAML data (using the JSON key) to see if the character is invertable. If it is, then the assert as written should be done.

Ok, I'll add extra notes about this in build-tables.py.

Great! Also add some these notes to the YAML file as well. Don't assume people will look at Python files to understand how data is organized in the YAML file and sublteties the YAML data might have. It is okay in the YAML file to refer to the Python program docstring as well.

I didn't implement the test because I don't know how pytest works and I didn't know how to set that up, but now you've added the tests I can work on those.

Ok. Great do what you can.

But let's also learn from this. There is this pattern of starting something a bit ambitious and a little more than you can handle all on your own. That's okay and best done after a release. But right now we are trying to solidify what has already been done so we can get it out there. So moving onto new projects right now is a bad idea.

I really regret not having put out a relase after the Django separation was done. And then having this release which woudn't have included that as part of this release but just the things since then.

…tests

GarkGarcia · 2021-01-31T15:16:59Z

I have to go now, but I'll work on more tests, documentation and fixing the errors later today. @rocky I'll also properly answer your last comment when I come back.

GarkGarcia · 2021-01-31T15:42:32Z

Ok. Then the simple adjustment is to iterate over the keys of the JSON table. However we need to cross check against the YAML data (using the JSON key) to see if the character is invertable. If it is, then the assert as written should be done.

Hum, makes sense. I ended up using the conversion routines instead of key-lookup (as I had proposed), but I'll the checks you described too.

Great! Also add some these notes to the YAML file as well. Don't assume people will look at Python files to understand how data is organized in the YAML file and sublteties the YAML data might have. It is okay in the YAML file to refer to the Python program docstring as well.

Makes sense too. I'll add the notes to the YAML and the README too.

But let's also learn from this. There is this pattern of starting something a bit ambitious and a little more than you can handle all on your own. That's okay and best done after a release. But right now we are trying to solidify what has already been done so we can get it out there. So moving onto new projects right now is a bad idea.

I agree. There was also a communication deficit on my part. The algorithms for resolving translations ended up being quite involved, with settle optimizations, and I failed to communicate with you guys as it developed.

GarkGarcia · 2021-01-31T17:52:21Z

@rocky That's interesting: https://github.com/Mathics3/mathics-scanner/pull/8/checks?check_run_id=1801929677 led me to find a bug in the official Wolfram listing of named characters (both \[Sampi] and \[CapitalSampi] are supposedly represented by the unicode character GREEK LETTER SAMPI).

…canner into rockys-tweaks

GarkGarcia · 2021-01-31T19:15:09Z

@rocky I finally got the tests to pass. The sanity tests I implemented caught multiples little errors in the YAML table. I wasn't expecting so many, but I'm glad we caught them. Anyway, I wrote all the tests I thought would be relevant, is there something else you'd like to include in the tests?

I'll work on documentation later today.

GarkGarcia · 2021-01-31T19:25:03Z

It's also worth noting this tests only guarantee the YAML data is consistent (they don't have any way of knowing if there's a better choice for translating a character or not).

GarkGarcia · 2021-01-31T19:30:16Z

Btw sorry for possibly spaming your mail-box with "GitHub run failed" messages, I couldn't get pytest to work locally so I had to keep testing things trough CI.

Um, ok. pip install pytest usually works. Before merging it would be good to squash all of the "Fix another tests" into one commit.

This isn't the first time I've noticed you use CI where others typically test locally. However, if that's going to be the case, it is much preferable that you do that in in a smaller module like this one rather than on Mathics which takes 15 minutes per run and consumes massive resources.

I suspect the one reason we are getting longer delay times when starting Travis is because it keeps an account of testing time we use. I don't know if it is still the situation, but they used to keep a running average of the resources used and if you exceeded a thresshold, your jobs would start getting delayed more often.

rocky · 2021-01-31T19:31:44Z

Ok. Thanks for doing all of this!

As I watched the commits flying by, one thing that struck me is that there is a lot of the same code that reads in YAML and JSON data. Maybe a helper function in a common module would be useful. Also, within a test maybe the data could be read in just once instead of in each test.

GarkGarcia · 2021-01-31T20:37:11Z

This isn't the first time I've noticed you use CI where others typically test locally. However, if that's going to be the case, it is much preferable that you do that in in a smaller module like this one rather than on Mathics which takes 15 minutes per run and consumes massive resources.

Yeah, I have the habit of doing so, mainly because I'm quite unexperienced when it comes to developing software for production and because I never can get pytest to work on my machine. I just tried running pip3 install pytest but after install it complains about "yaml not having a FullLoader attribute" (which it clearly does, since I can access from the repo).

As I watched the commits flying by, one thing that struck me is that there is a lot of the same code that reads in YAML and JSON data. Maybe a helper function in a common module would be useful. Also, within a test maybe the data could be read in just once instead of in each test.

Sure, I'll work on that after the documentation.

rocky · 2021-01-31T22:15:33Z

because I never can get pytest to work on my machine.

Ok. Some weekend we should go over your setup to make sure you can run pytest.

GarkGarcia · 2021-01-31T22:19:49Z

@rocky There's still some things missing before we can release the package:

We need to properly document functions and classes in our public API.
We need to mark functions/constants that shouldn't be in the public API with a _ at the start.

Check-out implementation.rst for the internal documentation I promised. We may want to convert this to Sphinx before the release.

GarkGarcia · 2021-02-01T21:46:52Z

@rocky I implemented all the tests you asked for and all of the tests I thought were relevant. Please let me know if there's something else you want me to implement. For now, I get to work on other things I'm interested in.

Fixed the general sanity tests

I removed the part of the code that printed header comments because it's better for users in general to have control over that (they can write their own header if they want to)

…in YAML table

…characters`

rocky · 2021-02-06T05:07:38Z

@GarkGarcia just had a chance to look at in more detail. Overall looks pretty good.

I updated a docstring and added a couple of tests. Look over if you have a chance. Tomorrow, I'll look at the other PR, try merging them in and cutting a release of this part.

rocky added 4 commits January 22, 2021 22:25

Administrivia

7e97d13

Merge branch 'master' of github.com:Mathics3/mathics-scanner

869b668

Merge github.com:Mathics3/mathics-scanner

5ddb0b0

Misc tweaks:

a6a98bc

* Start a character table roundtrip test. * Document what is in the various directories * Add code to generate GNU Readline inputrc's * Split off generate code into its own directory.

GarkGarcia added 2 commits January 31, 2021 14:55

Added a comment explaining the optimization

8bc9d80

Fixed the roundtrip test

d23d0eb

Added sanity check for has-unicode-inverse and fixed the reound trip …

d3a5253

…tests

GarkGarcia added 10 commits January 31, 2021 15:59

Fixed and improved the roundtrip tests

e8f4ec0

Improved the sanity checks for has-unicode-inverse

a4c063f

Fixed typos

25cd5e8

Added sanity checks for the WL-to-ASCII conversions

73870cc

Fixed typo

7804a75

Added sanity checks for letter-like character

979b7fb

Added general sanity tests for the YAML table

fc9e9f8

Fixed the sanity tests for letter-like characters

afa4753

Improved the general tests for YAML

a5e38c3

Fixed typos

1174572

GarkGarcia added 2 commits January 31, 2021 17:54

Fixed error in the YAML table

52ffaea

Fixed error in the YAML table

8fb9905

Merge branch 'rockys-tweaks' of https://github.com/Mathics3/mathics-s…

b11adc1

…canner into rockys-tweaks

GarkGarcia mentioned this pull request Jan 31, 2021

Translation tests! #2

Closed

This was referenced Feb 2, 2021

Document the tokeniser #11

Merged

Properlly deliniate our public API #10

Closed

GarkGarcia and others added 11 commits February 5, 2021 22:50

Fixed another errors in the YAML table

e6474cc

Updated the workflows

e839146

Fixed errors in the YAML table

d20931c

Fixed the general sanity tests

Fixed syntax error

4c07d17

Fixed the general sanity tests and errors in YAML

8275a6c

Fix errors in the inputrc generation routines

0c26840

I removed the part of the code that printed header comments because it's better for users in general to have control over that (they can write their own header if they want to)

Added an awfull lot of documentation

01b83b0

Added checks for decisions about the translation that are documented …

1ae6d06

…in YAML table

Fixed minor errors in the internal documentation on `mathics_scanner.…

778039c

…characters`

Extracted the code that loads the data from disk to a separate module

155c08f

Fixed import issues

4d0def3

rocky force-pushed the rockys-tweaks branch from e17a9b0 to 4d0def3 Compare February 6, 2021 03:59

Update docstring; add some table consistency tests

707e165

rocky force-pushed the rockys-tweaks branch from da2dd25 to 707e165 Compare February 6, 2021 05:05

rocky merged commit 83f430f into master Feb 6, 2021

rocky deleted the rockys-tweaks branch March 19, 2021 01:27

Uh oh!

Conversation

rocky commented Jan 31, 2021

Uh oh!

rocky commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

rocky commented Jan 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

rocky commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021 • edited by rocky Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rocky commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021

Uh oh!

rocky commented Jan 31, 2021

Uh oh!

GarkGarcia commented Jan 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GarkGarcia commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rocky commented Feb 6, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

GarkGarcia commented Jan 31, 2021 •

edited

Loading

rocky commented Jan 31, 2021 •

edited

Loading

GarkGarcia commented Jan 31, 2021 •

edited

Loading

GarkGarcia commented Jan 31, 2021 •

edited by rocky

Loading

GarkGarcia commented Jan 31, 2021 •

edited

Loading

GarkGarcia commented Feb 1, 2021 •

edited

Loading