List endpoints by carlosms · Pull Request #32 · src-d/ghsync

carlosms · 2019-06-18T12:54:45Z

Fix #30.

Now there are 2 subcommands: deep for the previous code, and shallow for the new endpoints.
The download time for src-d is around 15m, and it takes around 500 API calls.

I chose to exit early when any error is found. The alternative would be to log the error and continue, but I expect most errors to come from a broken internet connection or the DB being down. In either case, logging the errors and continuing would just flood the logs with error messages without doing any real work.
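As an illustration of the exit-early choice described above, here is a minimal sketch, not the code from this PR. The names `syncAll` and `errDBDown` are hypothetical, and the per-repository work is reduced to a callback.

```go
package main

import (
	"errors"
	"fmt"
)

// errDBDown is a hypothetical error standing in for "the DB is down".
var errDBDown = errors.New("db is down")

// syncAll processes items in order and stops at the first failure.
// It returns how many items were handled successfully before the error.
func syncAll(items []string, process func(string) error) (int, error) {
	for n, item := range items {
		if err := process(item); err != nil {
			// Exit early: a dead connection or DB would fail every later item too.
			return n, fmt.Errorf("syncing %q: %w", item, err)
		}
	}
	return len(items), nil
}

func main() {
	repos := []string{"repo-a", "repo-b", "repo-c"}
	done, err := syncAll(repos, func(name string) error {
		if name == "repo-b" {
			return errDBDown
		}
		return nil
	})
	fmt.Println(done, err)
}
```

With the log-and-continue alternative, the same loop would record one error line per remaining item instead of returning.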

I will submit another incremental PR (#34) to get rid of all the FindOne DB calls for each individual issue and PR. But the time does not really improve; it seems most of the time is spent on the API calls.

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

se7entyse7en

Other than the comments, maybe we could extract the logic of running a query, checking whether the item exists, and inserting it if it doesn't. But this is just an idea for possible future improvements.
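The extraction suggested above could look roughly like the following sketch. The `store` interface, `insertIfMissing`, and the in-memory `memStore` are hypothetical stand-ins for illustration; the real code uses kallax stores, whose API is not reproduced here.

```go
package main

import "fmt"

// store is a hypothetical, minimal view of a table: just enough
// to check for a row and insert one.
type store interface {
	Exists(id int64) (bool, error)
	Insert(id int64, payload string) error
}

// insertIfMissing inserts the row only when no row with that id exists.
// It reports whether an insert actually happened.
func insertIfMissing(s store, id int64, payload string) (bool, error) {
	found, err := s.Exists(id)
	if err != nil {
		return false, err
	}
	if found {
		return false, nil
	}
	if err := s.Insert(id, payload); err != nil {
		return false, err
	}
	return true, nil
}

// memStore is an in-memory store used only to exercise the helper.
type memStore map[int64]string

func (m memStore) Exists(id int64) (bool, error) {
	_, ok := m[id]
	return ok, nil
}

func (m memStore) Insert(id int64, payload string) error {
	m[id] = payload
	return nil
}

func main() {
	s := memStore{}
	first, _ := insertIfMissing(s, 32, "List endpoints")
	second, _ := insertIfMissing(s, 32, "List endpoints")
	fmt.Println(first, second, len(s))
}
```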

se7entyse7en · 2019-06-19T09:39:02Z

cmd/ghsync/main.go

 var app = cli.New("ghsync", version, build, "GitHub metadata sync")

 func main() {
+	app.AddCommand(&subcmd.ShallowCommand{})


Maybe clearer if renamed to ShallowSyncCommand? I'd have:

ShallowSyncCommand and SyncCommand, or

ShallowSyncCommand and DeepSyncCommand, or

ShallowCommand and DeepCommand as maybe the Sync is implicit in the project purpose itself.

ok, done in 46f4e26

se7entyse7en · 2019-06-19T09:47:56Z

cmd/ghsync/subcmd/common.go

+	T http.RoundTripper
+}
+
+func (t *RemoveHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {


Can you please explain the reason for removing these headers?

I was also reading the doc, and it says:

// RoundTrip should not modify the request, except for
// consuming and closing the Request's Body.

I don't know the reason. I didn't pay any attention to this, I just blindly copy-pasted from sync.go.

Most probably, if we don't remove these headers, requests won't hit the cache. Our cache lib checks headers as well. (I didn't check in detail, though; I'm just going by prior knowledge.)
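One way to strip headers while still honoring the RoundTrip contract quoted above is to work on a copy of the request. The sketch below is an assumption about how that could be done, not the code in common.go: the type `removeHeaderTransport`, the fake `recorder` transport, the helper `roundTripOnce`, and the header name `X-Example` are all hypothetical, since the thread does not say which headers are removed.

```go
package main

import (
	"fmt"
	"net/http"
)

// removeHeaderTransport drops the listed headers before delegating to base.
type removeHeaderTransport struct {
	base    http.RoundTripper
	headers []string // hypothetical list; the real header names are not given in the thread
}

func (t *removeHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone deep-copies the Header map, so the caller's request stays untouched.
	clone := req.Clone(req.Context())
	for _, h := range t.headers {
		clone.Header.Del(h)
	}
	return t.base.RoundTrip(clone)
}

// recorder is a fake base transport that remembers the headers it received.
type recorder struct{ seen http.Header }

func (r *recorder) RoundTrip(req *http.Request) (*http.Response, error) {
	r.seen = req.Header.Clone()
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// roundTripOnce sends one request through the wrapper and returns the header
// value on the original request and the value the base transport saw.
func roundTripOnce() (string, string, error) {
	req, err := http.NewRequest(http.MethodGet, "https://example.invalid/", nil)
	if err != nil {
		return "", "", err
	}
	req.Header.Set("X-Example", "1")
	rec := &recorder{}
	rt := &removeHeaderTransport{base: rec, headers: []string{"X-Example"}}
	resp, err := rt.RoundTrip(req)
	if err != nil {
		return "", "", err
	}
	resp.Body.Close()
	return req.Header.Get("X-Example"), rec.seen.Get("X-Example"), nil
}

func main() {
	original, forwarded, err := roundTripOnce()
	fmt.Printf("original=%q forwarded=%q err=%v\n", original, forwarded, err)
}
```

No network call is made here; the fake transport only shows that the header is gone downstream while the caller's request is unchanged.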

se7entyse7en · 2019-06-19T10:05:04Z

shallow/issue.go

+		}
+
+		for _, i := range issues {
+			if i.IsPullRequest() {


I guess that s.client.Issues.ListByRepo returns both issues and PRs, so is its output actually a superset (no pun intended) of the output returned by s.client.PullRequests.ListByRepo? Given that the API calls are the most time-consuming operations, can we just remove this check and work with both issues and PRs?

There was a thread on Slack about this.

I thought about it, but the kallax table stores *github.PullRequest, which is not the same as *github.Issue. A PR contains a few more fields than an Issue, see the models for both, and https://developer.github.com/v3/pulls/#list-pull-requests & https://developer.github.com/v3/issues/#list-issues-for-a-repository.

I guess we could create an empty PR object and fill the fields that we do have from the Issue, since many are shared. But some others would be missing, for example things like BaseRepositoryName.
If we assume this data will only be consumed by us, and we make sure our charts do not use any of the PR specific fields, it would work... for now. But it would leave us in a very unstable situation.

If we want to improve this in any way, I would maybe do this:

store all Issues (including PRs) in the issues table

edit the issues model to include the PullRequestLinks field

edit our dashboards to use the issues table and separate issues from PRs by the empty PullRequestLinks column.

But in any case I think this is better to be considered as an improvement for the future, since it's not an obvious change.
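To make the proposal above concrete, here is a small sketch of separating issues from PRs by the PullRequestLinks field. The `Issue` and `PullRequestLinks` types below are simplified local stand-ins, not the go-github models. The check mirrors what `IsPullRequest` is understood to do (a non-nil links field), and `splitIssues` is a hypothetical helper.

```go
package main

import "fmt"

// PullRequestLinks is a simplified stand-in; only its presence matters here.
type PullRequestLinks struct {
	URL string
}

// Issue is a simplified stand-in for an item returned by the issues list endpoint.
type Issue struct {
	Number           int
	Title            string
	PullRequestLinks *PullRequestLinks // nil for plain issues
}

// IsPullRequest reports whether the item is really a PR.
func (i Issue) IsPullRequest() bool {
	return i.PullRequestLinks != nil
}

// splitIssues separates plain issues from PRs, the way a dashboard query
// could by filtering on an empty PullRequestLinks column.
func splitIssues(all []Issue) (issues, prs []Issue) {
	for _, i := range all {
		if i.IsPullRequest() {
			prs = append(prs, i)
		} else {
			issues = append(issues, i)
		}
	}
	return issues, prs
}

func main() {
	all := []Issue{
		{Number: 30, Title: "an issue"},
		{Number: 32, Title: "a pull request", PullRequestLinks: &PullRequestLinks{URL: "https://example.invalid/pulls/32"}},
	}
	issues, prs := splitIssues(all)
	fmt.Println(len(issues), len(prs))
}
```

PR-specific fields such as BaseRepositoryName would still be missing from rows built this way, which is the limitation raised in the comment above.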

I agree about improving this later but if I remember right the objects returned from list endpoint aren't complete anyway. To get ALL the data we have to call GET for each issue/pr. So the reasoning about "unstable situation" isn't really correct.
For example there is no comments field in PR list, only in the response for single PR. Refs:
https://developer.github.com/v3/pulls/#list-pull-requests
https://developer.github.com/v3/pulls/#get-a-single-pull-request

Opened an issue to continue this conversation: #40

se7entyse7en · 2019-06-19T10:07:49Z

shallow/repository.go

+			return err
+		}
+
+		for _, r := range repositories {


I don't know how much this may impact on memory consumption, but it would be better to avoid storing the repos in an array and just call s.doRepo for each repo for each page.

That was my first approach.

Then I changed it to retrieve all repositories first for this reason:
Imagine you have 101 repos.
GET the first page, and start processing one by one their issues and PRs.
Meanwhile a repo gets deleted, and the total number of repos is 100.
After a while we finish processing all the repos in the first page, and we GET page 2. But now github will say that there is no page 2, and we missed the processing of 1 repo.

Yes, this is unlikely, but that was my reasoning.
If we think the tradeoff of storing everything in memory is too big, I can change it.
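The scenario above can be reproduced in miniature with 3 repos and a page size of 2. This is a simulation under simplifying assumptions: `listPage`, `fetchAll`, and `simulateDeletion` are hypothetical helpers working on in-memory slices, and a page is addressed by number only, unlike the real client.

```go
package main

import "fmt"

// listPage returns the given 1-based page of repos, or nil past the end.
func listPage(repos []string, page, perPage int) []string {
	start := (page - 1) * perPage
	if start >= len(repos) {
		return nil
	}
	end := start + perPage
	if end > len(repos) {
		end = len(repos)
	}
	return repos[start:end]
}

// fetchAll walks every page up front and returns the combined list.
func fetchAll(list func(page int) []string) []string {
	var all []string
	for page := 1; ; page++ {
		batch := list(page)
		if len(batch) == 0 {
			return all
		}
		all = append(all, batch...)
	}
}

// simulateDeletion compares both strategies when repo "a" is deleted
// after the first page has been fetched.
func simulateDeletion() (interleaved, upFront []string) {
	repos := []string{"a", "b", "c"}
	afterDelete := []string{"b", "c"}

	// Strategy 1: fetch page 1, process it (slow), then fetch page 2.
	interleaved = append(interleaved, listPage(repos, 1, 2)...)
	// "a" is deleted while page 1 is being processed, so page 2 is now empty.
	interleaved = append(interleaved, listPage(afterDelete, 2, 2)...)

	// Strategy 2: fetch all pages first, then process.
	upFront = fetchAll(func(page int) []string { return listPage(repos, page, 2) })
	return interleaved, upFront
}

func main() {
	interleaved, upFront := simulateDeletion()
	fmt.Println(interleaved, upFront)
}
```

Fetching everything first only narrows the window in which a deletion can shift pages; it does not remove it.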

Memory consumption here, compared to a lot of other stuff we do, is almost nothing; I wouldn't worry about it.
But I see a problem in the reasoning. The case where the last repo is deleted is even more unlikely than a deletion somewhere in the middle. So we will get an error anyway if something gets deleted during the import.

Opened an issue to continue the conversation: #42

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

smacker · 2019-06-19T13:37:51Z

shallow/issue.go

I'm not sure it's a good idea to wrap ALL the issues in a transaction.
There can be thousands of them, so they won't be committed for quite a long time (on the WIP branch, downloading all of them for prettier/prettier took minutes). So when the UI is opened the DB is still empty, and the charts look ugly showing nulls.
I would rather commit in batches of, for example, 100.
Though maybe I'm missing some other case when batches can cause problems.
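A rough sketch of the batching idea, assuming the items are already in a slice and that each batch maps to one transaction. `inBatches` and its `commit` callback are hypothetical names; the real code would open and commit a DB transaction inside the callback.

```go
package main

import "fmt"

// inBatches calls commit once per half-open range [start, end) of at most
// size items, stopping at the first error.
func inBatches(total, size int, commit func(start, end int) error) error {
	if size <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", size)
	}
	for start := 0; start < total; start += size {
		end := start + size
		if end > total {
			end = total
		}
		if err := commit(start, end); err != nil {
			return fmt.Errorf("committing items %d..%d: %w", start, end, err)
		}
	}
	return nil
}

func main() {
	issues := make([]int, 250) // placeholder for downloaded issues
	err := inBatches(len(issues), 100, func(start, end int) error {
		// One transaction per batch would go here.
		fmt.Printf("commit %d items\n", len(issues[start:end]))
		return nil
	})
	fmt.Println(err)
}
```

With batches, an error partway through leaves the earlier batches committed, so a retry has to tolerate rows that already exist.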

smacker · 2019-06-19T13:45:19Z

shallow/issue.go

Improvement for the future: we need to save GitHub IDs instead of relying on owner and repo. It would help with renames and maybe something else in the future.

smacker · 2019-06-19T13:48:17Z

shallow/issue.go

We are doing hundreds to thousands of requests. The chance of 5xx errors from GitHub or of network problems is VERY high. Maybe we can keep it as is in this PR, but we really need to improve it, preferably before the release.
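One possible direction for the improvement requested above is to retry transient failures with a growing delay. This is only a sketch under assumptions: `withRetry` is a hypothetical helper, the API call is reduced to a function returning an HTTP status and an error, and any error or 5xx status is treated as retryable.

```go
package main

import (
	"fmt"
	"time"
)

// withRetry runs call up to attempts times, doubling the wait between tries.
// It retries on any error and on 5xx statuses, and returns the last result.
func withRetry(attempts int, wait time.Duration, call func() (int, error)) (int, error) {
	var status int
	var err error
	for i := 0; i < attempts; i++ {
		status, err = call()
		if err == nil && status < 500 {
			return status, nil
		}
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	if err == nil {
		err = fmt.Errorf("giving up after %d attempts, last status %d", attempts, status)
	}
	return status, err
}

func main() {
	// Fake API call: fails twice with 5xx, then succeeds.
	responses := []int{502, 503, 200}
	calls := 0
	status, err := withRetry(5, 10*time.Millisecond, func() (int, error) {
		s := responses[calls]
		calls++
		return s, nil
	})
	fmt.Println(status, err, calls)
}
```

A production version would probably also distinguish rate-limit responses from server errors, which this sketch does not attempt.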

smacker · 2019-06-19T13:57:28Z

shallow/repository.go

Please change the order to PRs first and then issues: most of the charts rely on the prs table only, so it's better to have data for them first. I understand that it doesn't really solve any problem, but in some cases it would improve UX, and for us it does matter which one is called first.

smacker

Created issues for my comments about future improvements. Good to go as soon as PRs and Issues are reordered (because it's a super small change).

carlosms added 5 commits June 17, 2019 17:47

Refactor sync initialization

51ab089

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

Refactor files to 'deep' package

a722c3e

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

Rename sync cmd to deep

190f44a

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

Add a new shallow cmd that uses list endpints, no queue

dbc9caf

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

Use transactions for issues, PRs and users

1cb262b

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

carlosms force-pushed the list-endpoints branch from de9903d to 1cb262b Compare June 19, 2019 08:56

carlosms changed the title ~~[WIP] List endpoints~~ List endpoints Jun 19, 2019

carlosms requested a review from a team June 19, 2019 09:05

carlosms mentioned this pull request Jun 19, 2019

List endpoints with repository transaction #34

Open

se7entyse7en suggested changes Jun 19, 2019

carlosms mentioned this pull request Jun 19, 2019

Support a list or organizations in shallow cmd #35

Merged

Rename SyncCommand to DeepCommand

46f4e26

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>

smacker reviewed Jun 19, 2019

carlosms mentioned this pull request Jun 19, 2019

Proposal: shallow download could create PRs from the Issues endpoint #40

Open

smacker approved these changes Jun 19, 2019

smacker mentioned this pull request Jun 19, 2019

Import can error if anything got deleted on github #41

Closed

carlosms mentioned this pull request Jun 19, 2019

Deal with potential pagination problems #42

Open

se7entyse7en approved these changes Jun 19, 2019

smacker merged commit 9874cca into src-d:master Jun 19, 2019

carlosms deleted the list-endpoints branch June 19, 2019 16:21
