[SHARE-832][feature] Add Symbiota harvester #671
chrisseto merged 10 commits into CenterForOpenScience:develop from
Conversation
* Update setup
* Add harvester
* Add transformer
* Add source
* Add tests for harvester/transformer
```python
logging.info('Found %d results from swbiodiversity', total)
count = 0
while count < total:
```
Use a `for` loop and `enumerate` here:

```python
for count, record in enumerate(record_list):
    ...
```

```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use the furl library to build URLs. You can look at any other harvester for examples 👍
```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use a more descriptive name here like `resp` or `response`. `html` is misleading.
```python
count += 1
yield identifier, data

def process_text(self, text):
```
This method is unnecessary; you can just call `.get_text()` on the element.

Don't parse HTML with regex.
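To illustrate the point, BeautifulSoup can flatten an element's text directly; the markup below is made up for the example:

```python
from bs4 import BeautifulSoup

html = '<div id="innertext"><h1>Sample <em>Collection</em></h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() concatenates all descendant strings; the separator and
# strip=True tidy up whitespace without any regex
title = soup.find(id='innertext').h1.get_text(' ', strip=True)
print(title)  # Sample Collection
```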
```python
def process_collection_stat(self, list):
    stat = {}
    for item in list:
```
Avoid using `list` as a variable name; it shadows the built-in type `list`.
Here's a list of builtins/names to avoid.
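A quick demonstration of the renamed parameter; the stat-parsing logic here is invented for illustration, not the harvester's actual behavior:

```python
def process_collection_stat(stats):
    # renamed from `list` so the builtin stays usable inside the function
    stat = {}
    for item in stats:
        key, _, value = item.partition(':')
        stat[key.strip()] = value.strip()
    # the builtin is still available, e.g. to materialize the keys
    return stat, list(stat)

result, keys = process_collection_stat(['Families: 10', 'Genera: 25'])
print(result)  # {'Families': '10', 'Genera': '25'}
```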
```python
# Peel out script tags and css things to minimize size of HTML
for el in itertools.chain(
        soup('img'),
```
Looks like this might be double indented?
```python
el.extract()
```
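For context, the peel-out pattern in full looks roughly like this; the tag set and markup are illustrative, not the PR's exact list:

```python
import itertools

from bs4 import BeautifulSoup

html = '<div><script>var x;</script><img src="a.png"/><p>Kept text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Peel out noisy tags to minimize the stored HTML
for el in itertools.chain(soup('script'), soup('img')):
    el.extract()  # removes the element from the parse tree

print(soup.div)  # <div><p>Kept text</p></div>
```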
```python
record = soup.find(id='innertext')
title = self.process_text(record.h1)
```
Most, if not all, of this processing should happen in the transformer. You'll want to preserve as much of the original data as possible. If a bug pops up in the parsing, we can just reprocess the data we already have.
If you still prefer to give this data to the transformer, you can massage the data beforehand by overriding `transform`.
* … into feature/SHARE-832
* Update transformer test
```python
from bs4 import BeautifulSoup, Comment
from furl import furl
import itertools
```
`itertools` is part of the stdlib, so it belongs in the first import group.
```python
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
from bs4 import BeautifulSoup
import re
```
These should be ordered as:

```python
import re

from bs4 import BeautifulSoup

from share.transform.chain import ctx
from share.transform.chain import links as tools
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
```

@@ -0,0 +1,148 @@

```python
from datetime import timedelta

import requests_mock
```
Could you switch this to use httpretty? @laurenbarker and I thought the interface for it was a bit nicer. It would be good to standardize on a single library.
```python
end_date = end.date()
start_date = start.date()
logger.info('Harvesting swbiodiversity %s - %s', start_date, end_date)
return self.fetch_records()
```
This appears to disregard date ranges. Did we discuss this at some point?
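One way the dates could be honored is to filter the yielded records against the window; this is only a sketch, and the `last_modified` field on each record is a hypothetical stand-in, not something the source is known to provide:

```python
from datetime import date

def harvest_window(records, start_date, end_date):
    # records are assumed to be (identifier, data) pairs where data carries
    # a hypothetical 'last_modified' date; the filtering is illustrative
    for identifier, data in records:
        if start_date <= data['last_modified'] <= end_date:
            yield identifier, data

records = [
    ('1', {'last_modified': date(2017, 1, 5)}),
    ('2', {'last_modified': date(2016, 6, 1)}),
]
kept = list(harvest_window(records, date(2017, 1, 1), date(2017, 1, 31)))
print(kept)  # only record '1' falls inside the window
```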
Ticket
https://openscience.atlassian.net/browse/SHARE-832

Problem
Adding a harvester for Symbiota, harvesting Collections.

Solution: