[SHARE-832][feature] Add Symbiota harvester #671
chrisseto merged 10 commits into CenterForOpenScience:develop from
Conversation
* Update setup
* Add harvester
* Add transformer
* Add source
* Add tests for harvester/transformer
```python
logging.info('Found %d results from swbiodiversity', total)
count = 0
while count < total:
```
Use a `for` loop and `enumerate` here:

```python
for count, record in enumerate(record_list):
    ...
```

```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use the furl library to build URLs. You can look at any other harvester for examples 👍
```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use a more descriptive name here like `resp` or `response`. `html` is misleading.
```python
count += 1
yield identifier, data

def process_text(self, text):
```
This method is unnecessary; you can just call `.get_text()` on the element.

Don't parse HTML with regex.
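To illustrate the point, BeautifulSoup can flatten an element's text directly; the markup below is made up for the example:

```python
from bs4 import BeautifulSoup

html = '<div id="innertext"><h1>Sample <em>Collection</em></h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() concatenates all descendant strings; the separator and
# strip=True tidy up whitespace without any regex
title = soup.find(id='innertext').h1.get_text(' ', strip=True)
print(title)  # Sample Collection
```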
```python
def process_collection_stat(self, list):
    stat = {}
    for item in list:
```
Avoid using `list` as a variable name; it shadows the built-in type `list`.
Here's a list of builtins/names to avoid.
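A quick demonstration of the renamed parameter; the stat-parsing logic here is invented for illustration, not the harvester's actual behavior:

```python
def process_collection_stat(stats):
    # renamed from `list` so the builtin stays usable inside the function
    stat = {}
    for item in stats:
        key, _, value = item.partition(':')
        stat[key.strip()] = value.strip()
    # the builtin is still available, e.g. to materialize the keys
    return stat, list(stat)

result, keys = process_collection_stat(['Families: 10', 'Genera: 25'])
print(result)  # {'Families': '10', 'Genera': '25'}
```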
```python
# Peel out script tags and css things to minimize size of HTML
for el in itertools.chain(
        soup('img'),
```
Looks like this might be double indented?
```python
el.extract()
```
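For context, the peel-out pattern in full looks roughly like this; the tag set and markup are illustrative, not the PR's exact list:

```python
import itertools

from bs4 import BeautifulSoup

html = '<div><script>var x;</script><img src="a.png"/><p>Kept text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Peel out noisy tags to minimize the stored HTML
for el in itertools.chain(soup('script'), soup('img')):
    el.extract()  # removes the element from the parse tree

print(soup.div)  # <div><p>Kept text</p></div>
```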
```python
record = soup.find(id='innertext')
title = self.process_text(record.h1)
```
Most, if not all, of this processing should happen in the transformer. You'll want to preserve as much of the original data as possible. If a bug pops up in the parsing, we can just reprocess the data we already have.
If you still prefer to give this data to the transformer, you can massage the data beforehand by overriding `transform`.
* … into feature/SHARE-832
* Update transformer test
```python
from bs4 import BeautifulSoup, Comment
from furl import furl
import itertools
```
`itertools` is part of the stdlib, so it belongs in the first import group.
```python
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
from bs4 import BeautifulSoup
import re
```
These should be ordered as:

```python
import re

from bs4 import BeautifulSoup

from share.transform.chain import ctx
from share.transform.chain import links as tools
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
```

@@ -0,0 +1,148 @@

```python
from datetime import timedelta

import requests_mock
```
Could you switch this to use httpretty? @laurenbarker and I thought the interface for it was a bit nicer. It would be good to standardize on a single library.
```python
end_date = end.date()
start_date = start.date()
logger.info('Harvesting swbiodiversity %s - %s', start_date, end_date)
return self.fetch_records()
```
This appears to disregard date ranges. Did we discuss this at some point?
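One way the dates could be honored is to filter the yielded records against the window; this is only a sketch, and the `last_modified` field on each record is a hypothetical stand-in, not something the source is known to provide:

```python
from datetime import date

def harvest_window(records, start_date, end_date):
    # records are assumed to be (identifier, data) pairs where data carries
    # a hypothetical 'last_modified' date; the filtering is illustrative
    for identifier, data in records:
        if start_date <= data['last_modified'] <= end_date:
            yield identifier, data

records = [
    ('1', {'last_modified': date(2017, 1, 5)}),
    ('2', {'last_modified': date(2016, 6, 1)}),
]
kept = list(harvest_window(records, date(2017, 1, 1), date(2017, 1, 31)))
print(kept)  # only record '1' falls inside the window
```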
Ticket
https://openscience.atlassian.net/browse/SHARE-832

Problem
Adding a harvester for Symbiota, harvesting Collections.

Solution: