refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746
Conversation
UTs working. Generics stretched to their limits; probably not worth it to keep `BeautifulSoupCrawlingContext`.
Solved middleware issues.
HttpCrawler made generic. BeautifulSoup and Parsel crawlers inherit from this new generic.
Pull request is neither linked to an issue or epic nor labeled as adhoc!
janbuchar
left a comment
Nice! Good job for going through with this. Since this is a pretty critical overhaul though, I was extra pedantic... So please understand this is not bullying 😄
vdusek
left a comment
This is breaking because of the BeautifulSoupParser. Please add the exclamation mark to the PR title.
Co-authored-by: Jan Buchar <jan.buchar@apify.com>
Ahaaa, so that's the reason for the exclamation marks in other PRs :-)
vdusek
left a comment
We are getting there 🙂. Mostly just docs related changes.
---
id: http-crawlers
title: HTTP crawlers
description: Crawlee supports multiple HTTP crawlers that can be used to extract data from server-rendered webpages.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

The generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is the parent of <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>, and it can also serve as the parent of your own crawler with custom content-parsing requirements.

It already includes almost all the functionality needed to crawl webpages; the only missing parts are the parser that should be used to parse HTTP responses, and a context dataclass that defines which context helpers will be available to user handler functions.
## `BeautifulSoupCrawler`

<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> via the `.soup` or `.parsed_content` attribute.
## `ParselCrawler`

<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> via the `.selector` or `.parsed_content` attribute.
## `HttpCrawler`

<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink>, which does not parse the HTTP response at all; use it when no parsing is required.
I mean, this is great and definitely better than nothing. However, it is quite short and might not look good when rendered on the page. For comparison, take a look at guides like HTTP Clients or Result Storages. It should aim for similar depth and verbosity, including usage examples.
This should not be a blocker for the merging, as we have been improving the docs all the time. If you decide not to update it now, please open a new issue for it. Thanks.
I was scratching my head trying to come up with something for those docs. The problem is that the only example I can think of is implementing your own HTTP-based crawler (other examples in other files already show how to crawl). But such an example already exists in our code base — it is BSCrawler and ParselCrawler — so I can just point to those two.
If you think something specific is missing, please let me know and I can add that.
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
This should have been part of #746
…inheritance (apify#746)

Reworked HTTP-based crawlers inheritance. StaticContentCrawler is the parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler. StaticContentCrawler is generic; specific versions depend on the type of parser used for parsing the HTTP response.

**Breaking change:** Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just a string literal to properly set BeautifulSoup). BeautifulSoupParser is now the new class that is the parser used by BeautifulSoupCrawler.

- Closes: [Reconsider crawler inheritance apify#350](apify#350)

---------

Co-authored-by: Jan Buchar <jan.buchar@apify.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>