Conversation
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
se7entyse7en
left a comment
There was a problem hiding this comment.
Other than the comments maybe we could extract the logic of doing a query, check if the item exists, and insert it if it doesn't. But this is just an idea for eventually future improvements.
cmd/ghsync/main.go
Outdated
| var app = cli.New("ghsync", version, build, "GitHub metadata sync") | ||
|
|
||
| func main() { | ||
| app.AddCommand(&subcmd.ShallowCommand{}) |
There was a problem hiding this comment.
Maybe clearer if renamed to ShallowSyncCommand? I'd have:
ShallowSyncCommandandSyncCommand, orShallowSyncCommandandDeepSyncCommand, orShallowCommandandDeepCommandas maybe theSyncis implicit in the project purpose itself.
cmd/ghsync/subcmd/common.go
Outdated
| T http.RoundTripper | ||
| } | ||
|
|
||
| func (t *RemoveHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) { |
There was a problem hiding this comment.
Can you please explain the reason for removing these headers?
I was also reading the doc, and it says:
// RoundTrip should not modify the request, except for // consuming and closing the Request's Body.
There was a problem hiding this comment.
I don't know the reason. I didn't pay any attention to this, I just blindly copy-pasted from sync.go.
There was a problem hiding this comment.
most probably if we don't remove these headers requests won't hit the cache. Our cache lib checks for headers as well. (though I didn't really check in details just saying according to my prior knowledge)
shallow/issue.go
Outdated
| } | ||
|
|
||
| for _, i := range issues { | ||
| if i.IsPullRequest() { |
There was a problem hiding this comment.
I guess that s.client.Issues.ListByRepo that returns both issues and prs, so is the output of this actually a superset (no pun intended) of the output returned by s.client.PullRequests.ListByRepo? Given that the API calls are the most time-consuming operations, can we just remove this check and work with both issues and prs?
There was a problem hiding this comment.
There was thread on slack about this.
I thought about it, but the kallax table stores *github.PullRequest, which is not the same as *github.Issue. A PR contains a few more fields than an Issue, see the models for both, and https://developer.github.com/v3/pulls/#list-pull-requests & https://developer.github.com/v3/issues/#list-issues-for-a-repository.
I guess we could create an empty PR object and fill the fields that we do have from the Issue, since many are shared. But some others would be missing, for example things like BaseRepositoryName.
If we assume this data will only be consumed by us, and we make sure our charts do not use any of the PR specific fields, it would work... for now. But it would leave us in a very unstable situation.
If we want to improve this in any way, I would maybe do this:
- store all Issues (including PRs) in the issues table
- edit the issues model to include the
PullRequestLinksfield - edit our dashboards to use the
issuestable and separate issues from PRs but the emptyPullRequestLinkscolumn.
But in any case I think this is better to be considered as an improvement for the future, since it's not an obvious change.
There was a problem hiding this comment.
I agree about improving this later but if I remember right the objects returned from list endpoint aren't complete anyway. To get ALL the data we have to call GET for each issue/pr. So the reasoning about "unstable situation" isn't really correct.
For example there is no comments field in PR list, only in the response for single PR. Refs:
https://developer.github.com/v3/pulls/#list-pull-requests
https://developer.github.com/v3/pulls/#get-a-single-pull-request
There was a problem hiding this comment.
Opened an issue to continue this conversation: #40
shallow/repository.go
Outdated
| return err | ||
| } | ||
|
|
||
| for _, r := range repositories { |
There was a problem hiding this comment.
I don't know how much this may impact on memory consumption, but it would be better to avoid storing the repos in an array and just call s.doRepo for each repo for each page.
There was a problem hiding this comment.
That was my first approach.
Then I changed it to retrieve all repositories first for this reason:
Imagine you have 101 repos.
GET the first page, and start processing one by one their issues and PRs.
Meanwhile a repo gets deleted, and the total number of repos is 100.
After a while we finish processing all the repos in the first page, and we GET page 2. But now github will say that there is no page 2, and we missed the processing of 1 repo.
Yes, this is unlikely, but that was my reasoning.
If we think the tradeoff of storing everything in memory is too big, I can change it.
There was a problem hiding this comment.
memory consumption here compare to a lot of other stuff we do is almost nothing, I wouldn't worry about it.
But I see a problem in the reasoning. The case when the last repo is deleted is even more unlike than deleting from somewhere in the middle. So we will get error anyway if something got deleted during import.
There was a problem hiding this comment.
Opened an issue to continue the conversation: #42
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
shallow/issue.go
Outdated
There was a problem hiding this comment.
I'm not sure it's a good idea to wrap ALL the issues in a transaction.
There can be thousands of them so they won't be committed for quite a long time. (on wip branch for prettier/prettier it was taking minutes to download all of them) So when UI is open db is still empty and charts are ugly showing nulls.
I would better commit in batches by 100 for example.
Though maybe I'm missing some other case when batches can cause problems.
shallow/issue.go
Outdated
There was a problem hiding this comment.
improvement for the future. We need to save github ids instead of relying on owner and repo. It would help with renames and maybe something else in the future.
shallow/issue.go
Outdated
There was a problem hiding this comment.
we are doing hundreds to thousands of requests. The change of 5xx errors from github or any network problems is VERY high. Maybe we can keep it as is in this PR but we really need to improve it. Better before the release.
shallow/repository.go
Outdated
There was a problem hiding this comment.
please change the order PRs first and then Issues: most of the charts rely on prs table only, better to have data for them first. I understand that it doesn't really solve any problem but still, in some cases, it would improve UX and for us, it does matter which one to call fist.
smacker
left a comment
There was a problem hiding this comment.
created issues for my comments for future improvements. Good to go as soon as PR and Issues are reordered (because it's super small change)
Fix #30.
Now there are 2 sub commands,
deepfor the previous code, andshallowfor the new endpoints.The download time for
src-dis around 15m, and it takes around 500 API calls.I chose to exit early when any error is found. The other alternative would be to log the error and continue, but I thought the errors we will find will probably be a broken internet connectivity, or the DB is down. In any of these, logging the errors and continue will just flood the logs with error messages and not really continue to do any work.
I will submit another incremental PR (#34) to get rid of all the
FindOneDB calls for each individual issue and PR. But the time does not really improve, it seems most of the time is spent on the API calls.