WIP: feat: Initial code to load workspaces from a specific container path #583
aponcedeleonch wants to merge 1 commit into main
…path Related: #454 This is the initial work to create workspaces when the server is initialized. The idea is that the user mounts a volume at a specific location, `/app/codegate_workspaces`, and the git repositories are read from there.
max_fim_hash_lifetime: int = 60 * 5  # Time in seconds. Default is 5 minutes.
ignore_paths_workspaces = [
    ".git", "__pycache__", ".venv", ".DS_Store", "node_modules", ".pytest_cache", ".ruff_cache"
]
Would it make sense to just include the contents of .gitignore?
Sounds like we should make this configurable down the road.
I did consider using the contents of .gitignore, but that would mean skipping files that may contain secrets and could still be leaked to LLMs.
I was planning to make ignore_paths_workspaces configurable through the CLI, I just didn't have time to do so. The values here would be the defaults.
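For illustration, a minimal sketch of how a scanner could apply these defaults while walking a mounted folder. This is not the PR's `FolderRepoScanner`; the function name `list_tracked_files` and the constant `DEFAULT_IGNORE_PATHS` are hypothetical, and matching is assumed to be on exact directory or file names only.

```python
import os
from pathlib import Path
from typing import Iterable, List

# Defaults taken from the hunk above; the constant name is hypothetical.
DEFAULT_IGNORE_PATHS = [
    ".git", "__pycache__", ".venv", ".DS_Store", "node_modules", ".pytest_cache", ".ruff_cache"
]


def list_tracked_files(root: Path, ignore_paths: Iterable[str] = DEFAULT_IGNORE_PATHS) -> List[str]:
    """Return files under root as POSIX-style relative paths, skipping ignored names."""
    ignored = set(ignore_paths)
    tracked: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in ignored)
        for filename in sorted(filenames):
            if filename not in ignored:
                tracked.append((Path(dirpath) / filename).relative_to(root).as_posix())
    return tracked
```

Because the list is independent of .gitignore, a git-ignored file such as `.env` is still tracked here, which matches the concern about secrets raised above.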
repos = FolderRepoScanner(ignore_paths).read(path)
workspaces = [
    Workspace(
        id=str(uuid.uuid4()),
Does this mean that if I restart the container I get a new Workspace per repo in the tree?
Yes, I still need to add functionality to avoid creating a new workspace if the repo already exists.
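One possible way to get there, sketched under assumptions: derive the workspace ID deterministically from the repository path instead of calling `uuid.uuid4()`, so a restart yields the same ID for the same repo. The namespace value and the helper name `stable_workspace_id` are hypothetical, and this is not what the PR implements; checking the database for an existing repo before inserting would be another option.

```python
import uuid
from pathlib import PurePosixPath

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
WORKSPACE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "codegate/workspaces")


def stable_workspace_id(repo_path: str) -> str:
    """Return the same UUID string every time for the same repository path."""
    # PurePosixPath drops a trailing slash, so both spellings map to one ID.
    normalized = str(PurePosixPath(repo_path))
    return str(uuid.uuid5(WORKSPACE_NAMESPACE, normalized))
```

With this scheme the ID changes if the repository is mounted under a different path, so it only helps while the mount layout stays stable.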
class Repository(BaseModel):
    name: str
    folder_tree: Dict[str, Folder]
Is the intent to store the whole directory tree of a repository?
What about storing the root of the repo instead of the whole filesystem?
Yes, the intent is to store the whole directory tree of a repository. The reasoning behind it is to do fast lookups when we see a path in the received code snippets. Right now, we get the path of a code snippet if it was supplied for context to the LLM. Example:
{
  "messages": [
    {
      "role": "user",
      "content": "\n\n```py codegate/src/codegate/pipeline/factory.py (1-57)\nfrom typing import List\n\nfrom codegate.config import Config\nfrom codegate.pipeline.base import PipelineStep, SequentialPipelineProcessor\nfrom codegate.pipeline.codegate_context_retriever.codegate import CodegateContextRetriever\nfrom codegate.pipeline.extract_snippets.extract_snippets import CodeSnippetExtractor\nfrom codegate.pipeline.extract_snippets.output import CodeCommentStep\nfrom codegate.pipeline.output import OutputPipelineProcessor, OutputPipelineStep\nfrom codegate.pipeline.secrets.manager import SecretsManager\nfrom codegate.pipeline.secrets.secrets import (\n CodegateSecrets,\n SecretRedactionNotifier,\n SecretUnredactionStep,\n)\nfrom codegate.pipeline.system_prompt.codegate import SystemPrompt\nfrom codegate.pipeline.version.version import CodegateVersion\n\n\nclass PipelineFactory:\n def __init__(self, secrets_manager: SecretsManager):\n self.secrets_manager = secrets_manager\n\n def create_input_pipeline(self) -> SequentialPipelineProcessor:\n input_steps: List[PipelineStep] = [\n # make sure that this step is always first in the pipeline\n # the other steps might send the request to a LLM for it to be analyzed\n # and without obfuscating the secrets, we'd leak the secrets during those\n # later steps\n CodegateSecrets(),\n CodegateVersion(),\n CodeSnippetExtractor(),\n CodegateContextRetriever(),\n SystemPrompt(Config.get_config().prompts.default_chat),\n ]\n return SequentialPipelineProcessor(input_steps, self.secrets_manager, is_fim=False)\n\n def create_fim_pipeline(self) -> SequentialPipelineProcessor:\n fim_steps: List[PipelineStep] = [\n CodegateSecrets(),\n ]\n return SequentialPipelineProcessor(fim_steps, self.secrets_manager, is_fim=True)\n\n def create_output_pipeline(self) -> OutputPipelineProcessor:\n output_steps: List[OutputPipelineStep] = [\n SecretRedactionNotifier(),\n SecretUnredactionStep(),\n CodeCommentStep(),\n ]\n return OutputPipelineProcessor(output_steps)\n\n def create_fim_output_pipeline(self) -> OutputPipelineProcessor:\n fim_output_steps: List[OutputPipelineStep] = [\n # temporarily disabled\n # SecretUnredactionStep(),\n ]\n return OutputPipelineProcessor(fim_output_steps)\n\n```\nwhats this code doing?"
    }
  ],
  "model": "hosted_vllm/unsloth/Qwen2.5-Coder-32B-Instruct",
  "max_tokens": 4096,
  "stream": true,
  "base_url": "https://inference.codegate.ai/v1"
}
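To make the lookup idea concrete, here is a rough sketch of pulling the path out of a snippet header like the one in the request above and checking it against a stored tree. The names `extract_snippet_paths` and `path_in_tree` are hypothetical, the header format is assumed from this single example, and a plain nested dict stands in for `folder_tree` because the `Folder` model's fields are not shown here.

```python
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

# Matches an opening fence with a language tag, a path and an optional line range,
# e.g. "py codegate/src/codegate/pipeline/factory.py (1-57)" right after the backticks.
SNIPPET_HEADER = re.compile(
    r"^" + re.escape(FENCE) + r"(?P<lang>\w+)[ \t]+(?P<path>\S+)(?:[ \t]+\(\d+-\d+\))?[ \t]*$",
    re.MULTILINE,
)


def extract_snippet_paths(content: str) -> List[str]:
    """Return the file paths found in snippet headers of a message."""
    return [match.group("path") for match in SNIPPET_HEADER.finditer(content)]


def path_in_tree(tree: Dict[str, Optional[dict]], path: str) -> bool:
    """Walk a nested dict (folders are dicts, files are None) one path part at a time."""
    node: Optional[dict] = tree
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True
```

Each lookup costs one dict access per path component, which is the kind of fast check the full tree is meant to enable.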
Gotcha, that makes sense. Let's give this a little bit of thought; every time a file is added or removed we'd have to rewrite the JSON blob into the database, and that's not optimal either.
Related: #583

We had been using a single DB schema that didn't change until now. This introduces migrations using `alembic`.

To create a new migration one can use:

```sh
alembic revision -m "My migration"
```

That should generate an empty migration file that needs to be hand-filled. Specifically the `upgrade` method which will be the one executed when running the migration.

```python
"""My migration

Revision ID: <some_hash>
Revises: <previous_hash>
Create Date: YYYY-MM-DD HH:MM:SS.XXXXXX

"""
from alembic import op
import sqlalchemy as sa

revision = '<some_hash>'
down_revision = '<previous_hash>'
branch_labels = None
depends_on = None


def upgrade():
    pass


def downgrade():
    pass
```
The effort to automatically detect repositories from the information provided by the client has been stopped. At the moment we don't have enough information to accurately pinpoint the repository in which a user is working. The workspaces effort will continue with #600.