Feature Description
The machine_learning/ directory contains several ML algorithm implementations (K-means, Linear Regression, Decision Trees, etc.), but many of these files have limited test coverage and incomplete doctests, and their documentation is sparse. These gaps make it harder for learners to understand the implementations and verify their correctness.
Current Issues
- Insufficient doctests: Many ML algorithms lack comprehensive doctests covering edge cases
- Missing complexity analysis: Time and space complexity not documented for most algorithms
- Limited examples: Few practical usage examples with real-world datasets
- Incomplete type hints: Some functions missing proper type annotations
- Unclear parameter explanations: Algorithm parameters not well-documented
Proposed Enhancements
I propose a systematic improvement of the machine_learning/ directory:
1. Enhanced Doctests
- Add doctests for all public functions
- Cover edge cases (empty inputs, single data point, etc.)
- Test both typical and boundary conditions
- Include negative test cases (invalid inputs)
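As a sketch of what such edge-case and negative doctests could look like, assuming the k_means signature used in the example at the end of this issue (the error message shown is an assumed behavior, not existing code):

```python
>>> import numpy as np
>>> data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0]])
>>> centroids, labels = k_means(data, k=1)  # boundary case: a single cluster
>>> len(labels) == len(data)
True
>>> k_means(np.empty((0, 2)), k=2)  # negative case: empty input
Traceback (most recent call last):
    ...
ValueError: data must contain at least one sample
```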
2. Comprehensive Documentation
- Add algorithm descriptions with mathematical formulas where appropriate
- Document time and space complexity
- Explain hyperparameters and their effects
- Add references to papers/resources
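For instance, the description for k_means_clust.py could state the objective that K-means minimizes, the within-cluster sum of squares:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is its centroid.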
3. Practical Examples
- Include small example datasets in doctests
- Show typical use cases
- Demonstrate convergence behavior
- Compare with expected outputs
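Convergence and expected outputs can both be checked directly in a doctest. A sketch, again assuming the hypothetical k_means signature from the example below: on a well-separated dataset, the converged centroids should land on known values.

```python
>>> import numpy as np
>>> data = np.array([[1.0, 1.0], [1.1, 0.9], [9.0, 9.0], [9.1, 8.9]])
>>> centroids, labels = k_means(data, k=2, max_iterations=100)
>>> # one centroid near (1.05, 0.95), one near (9.05, 8.95)
>>> np.round(sorted(centroids.sum(axis=1)), 1).tolist()
[2.0, 18.0]
```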
4. Code Quality Improvements
- Complete type hints for all parameters and returns
- Add input validation with proper error messages
- Ensure consistent code style across all ML files
- Add docstring parameters and return value documentation
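A hedged sketch of what validation with descriptive error messages could look like (the helper name and messages are illustrative, not existing code in the repository):

```python
import numpy as np


def validate_clustering_inputs(data: np.ndarray, k: int) -> None:
    """Raise a descriptive error for invalid clustering inputs (illustrative helper)."""
    if data.ndim != 2:
        raise ValueError(f"data must be 2-dimensional, got {data.ndim} dimension(s)")
    if len(data) == 0:
        raise ValueError("data must contain at least one sample")
    if not 1 <= k <= len(data):
        raise ValueError(f"k must be between 1 and {len(data)}, got {k}")
```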
Example Files to Improve
- k_means_clust.py: Add convergence tests, visualization examples
- linear_regression.py: Add tests with known datasets, R² score validation
- decision_tree.py: Test with various tree depths, feature importance
- gradient_descent.py: Test convergence with different learning rates
- naive_bayes.py: Add probability calculation tests
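For instance, the R² score validation suggested for linear_regression.py could be a doctest against a dataset with a known exact fit (the function names here are hypothetical, used only to illustrate the shape of the test):

```python
>>> x = [1.0, 2.0, 3.0, 4.0]
>>> y = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so the fit should be perfect
>>> slope, intercept = fit_linear_regression(x, y)  # hypothetical API
>>> round(slope, 6), round(intercept, 6)
(2.0, 0.0)
>>> round(r_squared(x, y, slope, intercept), 6)  # hypothetical helper; perfect fit gives R² = 1
1.0
```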
Benefits
- Better learning experience: Learners can understand algorithms through examples
- Increased confidence: Comprehensive tests verify correctness
- Easier debugging: Better documentation helps identify issues
- Professional quality: Brings the ML code to the same standard as other directories
- Reproducibility: Clear examples make results reproducible
Suggested Approach
- Start with the most commonly used algorithms (K-means, Linear Regression)
- Create a template/standard for ML algorithm documentation
- Systematically apply it to all files in the machine_learning/ directory
- Add GitHub Actions tests to ensure doctest coverage
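For the CI piece, one possible starting point is pytest's built-in doctest collection, e.g. a workflow step that runs (exact wiring into the repository's existing workflows would need to be checked):

```
pytest --doctest-modules machine_learning/
```

This fails the build if any doctest in the directory breaks.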
Example Enhancement
Before:
```python
def k_means(data, k):
    # Basic implementation
    pass
```

After:
```python
import numpy as np


def k_means(data: np.ndarray, k: int, max_iterations: int = 100) -> tuple[np.ndarray, np.ndarray]:
    """
    K-Means clustering algorithm.

    Time Complexity: O(n * k * i * d) where n=samples, k=clusters,
    i=iterations, d=dimensions
    Space Complexity: O(n * d + k * d)

    >>> data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8]])
    >>> centroids, labels = k_means(data, k=2)
    >>> len(centroids)
    2
    """
```

I'm happy to work on this systematic improvement following the repository's contribution guidelines.