{# This is a dynamically generated cursorrules file. #}
# Cursor Rules for {{ cookiecutter.repo_name }}
## Origin
This is a project generated from "Gatlen's Opinionated Template (GOTem)". GOTem is forked from (and synced with) [CookieCutter Data Science (CCDS) V2](https://cookiecutter-data-science.drivendata.org/), one of the most popular, flexible, and well maintained Python templates out there. GOTem extends CCDS with carefully selected defaults, dependency stack, customizations, additional features (that I maybe should have spent time contributing to the original project), and contemporary best practices. Ready for not just data science but also general Python development, research projects, and academic work. Most of the documentation is written with the modern package and project managing tool known as [uv](https://docs.astral.sh/uv/)
The source code can be found at https://github.com/GatlenCulp/gatlens-opinionated-template.
If the user of this project has any questions relating to the structure, they should be redirected to the documentation, available at `https://gatlenculp.github.io/gatlens-opinionated-template/{page_name}` without the `.md`
```yaml
nav:
- π Home: index.md # Project overview, quickstart guide, and complete directory structure walkthrough
- π οΈ Core Tools: core-tools.md # Comprehensive breakdown of all core tools (IDE, Docker, AWS, etc.) and Python dependencies with explanations
- π» VSCode & Cursor: vscode.md # Detailed guide to workspace configuration, debug profiles, recommended extensions, and AI integration
- β Why gotem?: why.md # Discussion of code quality, project organization, and reproducibility benefits
- π―οΈ Opinions: opinions.md # Core principles about data hygiene, notebooks, modeling, and environment management
- π Using the template: using-the-template.md # Step-by-step guide on how to use the template
- βοΈ All options: all-options.md # List of available command-line options and configuration choices
- β€οΈ Contributing: contributing.md # Guidelines for contributing to the project
- π Related projects: related.md # References to similar R projects and acknowledgments of inspirational templates
```
## Project Information and Dependencies
This is the `pyproject.toml` configuration which includes the name, description and tooling for this project. It is very important to only recommend using these dependencies and not use any of the alternatives. (Ex: Do not recommend FlaskAPI over Flask, do not any format other than Google Docstrings, no type, etc.)
{# This is a cut-down version of `pyproject.toml` #}
```toml
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"
[project]
name = {{ cookiecutter.module_name|tojson }}
version = "0.0.1"
description = {{ cookiecutter.description|tojson }}
authors = [
{ name = {{ cookiecutter.author_name|tojson }} },
]
{% if cookiecutter.open_source_license != 'No license file' %}license = { file = "LICENSE" }{% endif %}
readme = {file = "README.md", content-type = "text/markdown"}
classifiers = [
"Private :: Do Not Upload",
"Programming Language :: Python :: 3",
{% if cookiecutter.open_source_license == 'MIT' %}"License :: OSI Approved :: MIT License"{% elif cookiecutter.open_source_license == 'BSD-3-Clause' %}"License :: OSI Approved :: BSD License"{% endif %}
]
requires-python = "~={{ cookiecutter.python_version_number }}"
dependencies = [
"loguru>=0.7.3", # Better logging
"plotly>=5.24.1", # Interactive plotting
"pydantic>=2.10.3", # Data validation
"rich>=13.9.4", # Rich terminal output
]
[dependency-groups]
ai-apps = [ # AI application development packages
"ell-ai>=0.0.15", # AI toolkit
"langchain>=0.3.12", # LLM application framework
"megaparse>=0.0.45", # Advanced text parsing
]
ai-train = [ # Machine learning and model training packages
"datasets>=3.1.0", # Dataset handling
"einops>=0.8.0", # Tensor operations
"jaxtyping>=0.2.36", # Type hints for JAX
"onnx>=1.17.0", # ML model interoperability
"pytorch-lightning>=2.4.0", # PyTorch training framework
"ray[tune]>=2.40.0", # Distributed computing
"safetensors>=0.4.5", # Safe tensor serialization
"scikit-learn>=1.6.0", # Traditional ML algorithms
"shap>=0.46.0", # Model explainability
"torch>=2.5.1", # Deep learning framework
"transformers>=4.47.0", # Transformer models
"umap-learn>=0.5.7", # Dimensionality reduction
"wandb>=0.19.1", # Experiment tracking
"nnsight>=0.3.7", # ML Interp and Manipulation
]
async = [ # Asynchronous programming
"uvloop>=0.21.0", # Fast event loop implementation
]
cli = [ # Command-line interface tools
"typer>=0.15.1", # CLI builder
]
cloud = [ # Cloud infrastructure tools
"ansible>=11.1.0", # Infrastructure automation
"boto3>=1.35.81", # AWS SDK
]
config = [ # Configuration management
"cookiecutter>=2.6.0", # Project templating
"gin-config>=0.5.0", # Config management
"jinja2>=3.1.4", # Template engine
]
data = [ # Data processing and storage
"dagster>=1.9.5", # Data orchestration
"duckdb>=1.1.3", # Embedded analytics database
"lancedb>=0.17.0", # Vector database
"networkx>=3.4.2", # Graph operations
"numpy>=1.26.4", # Numerical computing
"orjson>=3.10.12", # Fast JSON parsing
"pillow>=10.4.0", # Image processing
"polars>=1.17.0", # Fast dataframes
"pygwalker>=0.4.9.13", # Data visualization
"sqlmodel>=0.0.22", # SQL ORM
"tomli>=2.0.1", # TOML parsing
]
dev = [ # Development tools
"bandit>=1.8.0", # Security linter
"better-exceptions>=0.3.3", # Improved error messages
"cruft>=2.15.0", # Project template management
"faker>=33.1.0", # Fake data generation
"hypothesis>=6.122.3", # Property-based testing
"pip>=24.3.1", # Package installer
"polyfactory>=2.18.1", # Test data factory
"pydoclint>=0.5.11", # Docstring linter
"pyinstrument>=5.0.0", # Profiler
"pyprojectsort>=0.3.0", # pyproject.toml sorter
"pyright>=1.1.390", # Static type checker
"pytest-cases>=3.8.6", # Parametrized testing
"pytest-cov>=6.0.0", # Coverage reporting
"pytest-icdiff>=0.9", # Improved diffs
"pytest-mock>=3.14.0", # Mocking
"pytest-playwright>=0.6.2", # Browser testing
"pytest-profiling>=1.8.1", # Test profiling
"pytest-random-order>=1.1.1", # Randomized test order
"pytest-shutil>=1.8.1", # File system testing
"pytest-split>=0.10.0", # Parallel testing
"pytest-sugar>=1.0.0", # Test progress visualization
"pytest-timeout>=2.3.1", # Test timeouts
"pytest>=8.3.4", # Testing framework
"ruff>=0.8.3", # Fast Python linter
"taplo>=0.9.3", # TOML toolkit
"tox>=4.23.2", # Test automation
"uv>=0.5.7", # Fast pip replacement
]
dev-doc = [ # Documentation tools
"mdformat>=0.7.19", # Markdown formatter
"mkdocs-material>=9.5.48", # Documentation theme
"mkdocs>=1.6.1", # Documentation generator
]
dev-nb = [ # Notebook development tools
"jupyter-book>=1.0.3", # Notebook publishing
"nbformat>=5.10.4", # Notebook file format
"nbqa>=1.9.1", # Notebook linting
"testbook>=0.4.2", # Notebook testing
]
gui = [ # Graphical interface tools
"streamlit>=1.41.1", # Web app framework
]
misc = [ # Miscellaneous utilities
"boltons>=24.1.0", # Python utilities
"cachetools>=5.5.0", # Caching utilities
"wrapt>=1.17.0", # Decorator utilities
]
nb = [ # Jupyter notebook tools
"chime>=0.7.0", # Sound notifications
"ipykernel>=6.29.5", # Jupyter kernel
"ipython>=7.34.0", # Interactive Python shell
"ipywidgets>=8.1.5", # Jupyter widgets
"jupyterlab>=4.3.3", # Notebook IDE
]
web = [ # Web development and scraping
"beautifulsoup4>=4.12.3", # HTML parsing
"fastapi>=0.115.6", # Web framework
"playwright>=1.49.1", # Browser automation
"requests>=2.32.3", # HTTP client
"scrapy>=2.12.0", # Web scraping
"uvicorn>=0.33.0", # ASGI server
"zrok>=0.4.42", # Tunnel service
]
[tool.uv]
default-groups = ["dev", "data", "nb"]
# [project.urls]
# Homepage = "https://{{cookiecutter.author_github_handle}}.github.io/{{cookiecutter.project_name}}/"
# Repository = "https://github.com/{{cookiecutter.author_github_handle}}/{{cookiecutter.project_name}}"
# Documentation = "https://{{cookiecutter.author_github_handle}}.github.io/{{cookiecutter.project_name}}/"
[tool.ruff]
cache-dir = ".cache/ruff"
line-length = 100
extend-include = ["*.ipynb"]
[tool.ruff.lint]
# TODO: Different groups of linting styles depending on code use.
select = ["ALL"]
ignore = [] # Add ignores as needed
[tool.ruff.lint.isort]
known-first-party = ["{{ cookiecutter.module_name }}"]
force-sort-within-sections = true
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["F401"] # Allow unused imports in __init__.py
[tool.ruff.lint.mccabe]
max-complexity = 10
[tool.ruff.lint.pycodestyle]
max-doc-length = 99
[tool.ruff.lint.pydocstyle]
convention = "google"
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
[tool.pytest.ini_options]
addopts = """
--tb=long
--code-highlight=yes
"""
log_file = "./logs/pytest.log"
[tool.pydoclint]
style = "google"
arg-type-hints-in-docstring = false
check-return-types = true
exclude = '\.venv'
[tool.pyright]
include = ["."]
```
The following is the `README.md` file for GOTem. Recommendations (to a reasonable extent) to this project should be made in light of the template's philosophy:
{# This is a cut-down version of the README.md #}
{% raw %}
```markdown
# Gatlen's Opinionated Template (GOTem)
**_Cutting-edge, opinionated, and ambitious project builder for power users and researchers._**
GOTem is forked from (and synced with) [CookieCutter Data Science (CCDS) V2](https://cookiecutter-data-science.drivendata.org/), one of the most popular, flexible, and well maintained Python templates out there. GOTem extends CCDS with carefully selected defaults, dependency stack, customizations, additional features (that I maybe should have spent time contributing to the original project), and contemporary best practices. Ready for not just data science but also general Python development, research projects, and academic work.
### Key Features
- **π Modern Tooling & Living Template** β Start with built-in support for UV, Ruff, FastAPI, Pydantic, Typer, Loguru, and Polars so you can tackle cutting-edge Python immediately. Template updates as environment changes.
- **π Instant Git & CI/CD** β Enjoy automatic repo creation, branch protections, and preconfigured GitHub Actions that streamline your workflow from day one.
- **π€ Small-Scale to Scalable** β Ideal for solo projects or small teams, yet robust enough to expand right along with your growth.
- **πββοΈ Start Fast, Stay Strong** β Encourages consistent structure, high-quality code, and minimal friction throughout your projectβs entire lifecycle.
- **π Full-Stack + Rare Boilerplates** β Covers standard DevOps, IDE configs, and publishing steps, plus extra setups for LaTeX assignments, web apps, CLI tools, and moreβperfect for anyone seeking a βone-stopβ solution.
### Who is this for?
**CCDS** is white bread: simple, familiar, unoffensive, and waiting for your choice of toppings. **GOTem** is the expert-crafted and opinionated βeverything burger,β fully loaded from the start for any task you want to do (so long as you want to do it in a specific way). Some of the selections might be an acquired taste and users are encouraged to leave them off as they start and perhaps not all will appreciate my tastes even with time, but it is the setup I find \*_delicious_\*.
| **β
Use GOTem ifβ¦** | **β Might Not Be for You ifβ¦** |
| :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| **π You Want the βEverything Burgerβ** <br> - Youβre cool with an opinionated, βfully loadedβ setup, even if you donβt use all the bells and whistles up front. <br> - You love having modern defaults (FastAPI, Polars, Loguru). at the ready for any case life throws at you from school work to research to websites | **π£οΈ Youβre a Minimalist** <br> - You prefer the bare bones or βdefaultβ approach. <br> - GOTemβs many integrations and new libraries feel too βextraβ or opinionated for you, adding more complexity than you want. When you really just want to "get the task done". |
| **π Youβre a Learner / Explorer** <br> - You like experimenting with cutting-edge tools (Polars, Typer, etc.) even if theyβre not as common. <br> - βModern Over Ubiquitousβ libraries excite you. | **π°οΈ Youβre a Legacy Lover** <br> - Tried-and-true frameworks (e.g., Django, Pandas, standard logging) give you comfort. <br> - Youβd rather stick to old favorites than wrestle with fresh tech that might be less documented. |
| **π¨βπ» Youβre a Hacker / Tinkerer** <br> - You want code thatβs as **sexy** and elegant as it is functional. <br> - You love tinkering, customizing, and βpretty colorsβ that keep the ADHD brain wrinkled. | **π Youβre a Micro-Optimizer** <br> - You need to dissect every configuration before even starting. <br> - GOTemβs βAspirational Over Practicalβ angle might make you wary of unproven or cutting-edge setups. |
| **β‘ Youβre a Perfection & Performance Seeker** <br> - You enjoy pushing Pythonβs boundaries in speed, design, and maintainability. <br> - You're always looking for the best solution, not just quick patches. | **ποΈ You Need Old-School Stability** <br> - You want a large, established user base and predictable release cycles. <br> - You get uneasy about lesser-known or younger libraries that might break your production environment. |
| **πββοΈ Youβre a Quick-Start Enthusiast** <br> - You want a template that practically configures itself so you can jump in. <br> - You like having robust CI/CD, Git setup, and docs all done for you. | **πΆββοΈ You Prefer Slow, Manual Setups** <br> - You donβt mind spending time creating everything from scratch for each new project. <br> - Doing things the classic or βofficialβ way is more comfortable than using βopinionatedβ shortcuts. |
If the right-hand column describes you better, [CookieCutter Data Science (CCDS)](https://cookiecutter-data-science.drivendata.org/) or another minimal template might be a better fit.
**[View the full documentation here](https://gatlenculp.github.io/gatlens-opinionated-template/) β‘οΈ**
---
## Getting Started
<b>β‘οΈ With UV (Recommended)</b>
```bash
uv tool install gatlens-opinionated-template
# From the parent directory where you want your project
uvx --from gatlens-opinionated-template gotem
```
<details>
<summary><b>π¦ With Pipx</b></summary>
```bash
pipx install gatlens-opinionated-template
# From the parent directory where you want your project
gotem
```
</details>
<details>
<summary><b>π With Pip</b></summary>
```bash
pip install gatlens-opinionated-template
# From the parent directory where you want your project
gotem
```
</details>
### The resulting directory structure
The directory structure of your new project will look something like this (depending on the settings that you choose):
π .
βββ βοΈ .cursorrules <- LLM instructions for Cursor IDE
βββ π» .devcontainer <- Devcontainer config
βββ βοΈ .gitattributes <- GIT-LFS Setup Configuration
βββ π§βπ» .github
β βββ β‘οΈ actions
β β βββ π setup-python-env <- Automated python setup w/ uv
β βββ π‘ ISSUE_TEMPLATE <- Templates for Raising Issues on GH
β βββ π‘ pull_request_template.md <- Template for making GitHub PR
β βββ β‘οΈ workflows
β βββ π main.yml <- Automated cross-platform testing w/ uv, precommit, deptry,
β βββ π on-release-main.yml <- Automated mkdocs updates
βββ π» .vscode <- Preconfigured extensions, debug profiles, workspaces, and tasks for VSCode/Cursor powerusers
β βββ π launch.json
β βββ βοΈ settings.json
β βββ π tasks.json
β βββ βοΈ '{{ cookiecutter.repo_name }}.code-workspace'
βββ π data
β βββ π external <- Data from third party sources
β βββ π interim <- Intermediate data that has been transformed
β βββ π processed <- The final, canonical data sets for modeling
β βββ π raw <- The original, immutable data dump
βββ π³ docker <- Docker configuration for reproducability
βββ π docs <- Project documentation (using mkdocs)
βββ π©ββοΈ LICENSE <- Open-source license if one is chosen
βββ π logs <- Preconfigured logging directory for
βββ π·ββοΈ Makefile <- Makefile with convenience commands (PyPi publishing, formatting, testing, and more)
βββ π Taskfile.yml <- Modern alternative to Makefile w/ same functionality
βββ π notebooks <- Jupyter notebooks
β βββ π 01_name_example.ipynb
β βββ π° README.md
βββ ποΈ out
β βββ π features <- Extracted Features
β βββ π models <- Trained and serialized models
β βββ π reports <- Generated analysis
β βββ π figures <- Generated graphics and figures
βββ βοΈ pyproject.toml <- Project configuration file w/ carefully selected dependency stacks
βββ π° README.md <- The top-level README
βββ π secrets <- Ignored project-level secrets directory to keep API keys and SSH keys safe and separate from your system (no setting up a new SSH-key in ~/.ssh for every project)
β βββ βοΈ schema <- Clearly outline expected variables
β βββ βοΈ example.env
β βββ π ssh
β βββ βοΈ example.config.ssh
β βββ π example.something.key
β βββ π example.something.pub
βββ π° '{{ cookiecutter.module_name }}' <- Easily publishable source code
βββ βοΈ config.py <- Store useful variables and configuration (Preset)
βββ π dataset.py <- Scripts to download or generate data
βββ π features.py <- Code to create features for modeling
βββ π modeling
β βββ π __init__.py
β βββ π predict.py <- Code to run model inference with trained models
β βββ π train.py <- Code to train models
βββ π plots.py <- Code to create visualizations
```
{% endraw %}
## Style
For general style, you should adhere to the following rules:
{# Data Science / Deep Learning #}
{# TODO: Have the style be selected based on the chosen type of project. #}
{# Adapted from https://cursor.directory/deep-learning-developer-python-cursor-rules #}
```
You are an expert in deep learning with PyTorch and Python, using the most up-to-date and powerful libraries.
Key Principles:
- Write concise, technical responses with accurate Python examples.
- Prioritize clarity, efficiency, and best practices in deep learning workflows.
- Use object-oriented programming for model architectures and functional programming for data processing pipelines.
- Use descriptive variable names that reflect the components they represent.
- Use type annotations wherever available.
- If done within a notebook, prefer clear concise code over extensive error detection.
- If done within a python module, prefer reproducability and consistency, recommending PyTests as Needed
Deep Learning and Model Development:
- Use PyTorch, Lightning, RayTune, and WandB as the primary framework for deep learning tasks.
- Implement custom nn.Module classes for model architectures.
- Utilize PyTorch's autograd for automatic differentiation.
- Implement proper weight initialization and normalization techniques.
- Use appropriate loss functions and optimization algorithms.
Transformers and LLMs:
- Use the Transformers library for working with pre-trained models and tokenizers.
- Implement attention mechanisms and positional encodings correctly.
- Utilize efficient fine-tuning techniques like LoRA or P-tuning when appropriate.
- Implement proper tokenization and sequence handling for text data.
Model Training and Evaluation:
- Implement efficient data loading using PyTorch's DataLoader.
- Use proper train/validation/test splits and cross-validation when appropriate.
- Implement early stopping and learning rate scheduling.
- Use appropriate evaluation metrics for the specific task.
- Implement gradient clipping and proper handling of NaN/Inf values.
Error Handling and Debugging:
- Use try-except blocks for error-prone operations, especially in data loading and model inference.
- Implement proper logging for training progress and errors.
- Use PyTorch's built-in debugging tools like autograd.detect_anomaly() when necessary.
Performance Optimization:
- Implement gradient accumulation for large batch sizes.
- Use mixed precision training with torch.cuda.amp when appropriate.
- Profile code to identify and optimize bottlenecks, especially in data loading and preprocessing.
Key Conventions:
1. Begin projects with clear problem definition and dataset analysis.
2. Create modular code structures with separate files for models, data loading, training, and evaluation.
3. Use configuration files (e.g., YAML) for hyperparameters and model settings.
4. Implement proper experiment tracking and model checkpointing.
5. Use version control (e.g., git) for tracking changes in code and configurations.
Refer to the official documentation of PyTorch, Transformers, and Streamlit for best practices and up-to-date APIs.
```
### Additional Styling Notes
1. Add typing wherever possible
2. Add trailing commas (COM812)
3. Avoid generic variable name df for dataframes (PD901)
4. Prefer `Path.open()` over `open()` (PTH123)
analytics
aws
css
django
docker
dockerfile
fastapi
flask
+12 more
First Time Repository
Alignment across Deep Neural Network Language Modelsβ Representations
HTML
Languages:
CSS: 5.2KB
Dockerfile: 0.0KB
HTML: 226105.5KB
JavaScript: 4532.2KB
Jupyter Notebook: 41019.7KB
Python: 379.6KB
Shell: 5.2KB
TeX: 148.8KB
Created: 11/18/2024
Updated: 1/8/2025
All Repositories (1)
Alignment across Deep Neural Network Language Modelsβ Representations