@mohanad-hafez

This PR adds a new function detect_highlights() that can automatically detect highlighted text of any color without requiring manual color range specification. This makes the tool more flexible and user-friendly as it no longer requires pre-configuring HSV color ranges for different highlighter colors.

Current Status

The implementation is functional but still needs refinement. It occasionally detects a small number of non-highlighted words or letters. I'm actively working on improving the accuracy and reducing false positives.

Implementation Details

  • Uses adaptive saturation thresholding to identify highlighted areas
  • Combines text detection with contrast analysis to reduce false positives
  • Applies morphological operations to clean up the detection mask
  • Filters out areas that are too small to be meaningful highlights
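
As a rough illustration of the first bullet, here is a NumPy-only sketch of an adaptive saturation threshold. The mean plus 1.5 standard deviations rule is the one used in the code discussed later in this thread; the function name `saturation_mask`, the parameter `k`, and the synthetic array are assumptions made for illustration and are not part of the PR.

```python
import numpy as np


def saturation_mask(s, k=1.5):
    """Return (mask, threshold); mask is 255 where s > mean + k * std, else 0."""
    s = np.asarray(s, dtype=np.float64)
    # The threshold adapts to the image: it rises with the average saturation
    # and with how widely saturation varies across the page.
    thresh = s.mean() + k * s.std()
    mask = np.where(s > thresh, 255, 0).astype(np.uint8)
    return mask, thresh


# Synthetic saturation channel: pale "paper" (10) with one highlighted band (200).
s = np.full((10, 10), 10, dtype=np.uint8)
s[2:4, :] = 200

mask, thresh = saturation_mask(s)
print(round(thresh, 2), int((mask == 255).sum()))
```

In a real pipeline `s` would be the saturation channel of the HSV image, and the strict `>` comparison mirrors what `cv2.threshold` does with `cv2.THRESH_BINARY`.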

Testing

Tested on documents with various highlighter colors with promising results, though some edge cases need further tuning.

Next Steps

I'm working on:

  • Further reducing false positives
  • Improving detection accuracy for light-colored highlights
  • Adding parameter options to fine-tune detection sensitivity

Feedback and suggestions for improvement are welcome! I can update the implementation based on your recommendations.

@zirkelc
Owner

zirkelc commented Mar 9, 2025

Hi @mohanad-hafez

that's awesome, thank you! I really didn't expect a PR on this repo.

I need some time to test it locally and will come back to you soon!

Thanks again! 👏

@mohanad-hafez
Author

Thank you for the quick response! I found your tool really helpful for a project I was working on, but my use case required detecting different highlight colors without manual configuration. I thought this addition might be useful for the project. Thanks again!



# Set path to Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
Owner

Since this path is OS-specific, could you set it based on the current OS?
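
One possible way to do this, sketched with the standard library only. `resolve_tesseract_cmd` is a hypothetical helper, and the Windows path is simply the one from the PR, which is assumed rather than guaranteed to exist on a given machine.

```python
import os
import platform
import shutil


def resolve_tesseract_cmd(system=None):
    """Return a Tesseract executable path, or None to keep pytesseract's default."""
    system = system or platform.system()

    # Prefer an executable that is already on PATH, whatever the OS.
    found = shutil.which("tesseract")
    if found:
        return found

    if system == "Windows":
        # Install location taken from the PR; only used if the file exists.
        candidate = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
        if os.path.isfile(candidate):
            return candidate

    # Nothing found: let the caller fall back to the library default.
    return None


print(resolve_tesseract_cmd())
```

The caller would assign the result to `pytesseract.pytesseract.tesseract_cmd` only when it is not `None`, so Linux and macOS setups with Tesseract on the PATH keep working unchanged.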

@zirkelc
Owner

zirkelc commented Mar 13, 2025

I got a chance to test your PR and it looks good. However, the text detection in the middle misses a few words:

image

The words of, a, (2,4) and T are not detected.


I went through detect_highlights and previewed each intermediate image. For this particular image, the saturation mask sat_mask should already be good enough for text detection, since the highlighted areas contain a lot of white:

image

Then I tried to use the sat_mask for text detection by simply returning it from detect_highlights and omitting the remaining code. This is the result with sat_mask as img_mask:

image

It detects all words inside the highlighted area, but adds a few false positives.


Then I used denoising on the sat_mask with a small kernel of (3,3) and repeated the text detection:

image

Here's the modified code:

import cv2
import numpy as np

def detect_highlights(img_src):
    """Detect highlighted areas of any color.

    This approach uses the following principles:
    1. Highlighted areas have higher saturation than plain paper
    2. We use adaptive thresholding to separate highlights from regular text
    3. We combine multiple features (saturation, value, local contrast) for better accuracy
    """
    # Convert to HSV
    img_hsv = cv2.cvtColor(img_src, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(img_hsv)

    # Convert to grayscale for text detection
    img_gray = cv2.cvtColor(img_src, cv2.COLOR_BGR2GRAY)
    cv2.imshow("img_gray", img_gray)
    cv2.waitKey(0)

    # Step 1: Find areas with high saturation (highlighted areas)
    # Calculate saturation statistics to set adaptive threshold
    sat_mean = np.mean(s)
    sat_std = np.std(s)
    sat_thresh = sat_mean + (1.5 * sat_std)  # More adaptive threshold

    # Create a binary mask where saturation is higher than threshold
    _, sat_mask = cv2.threshold(s, sat_thresh, 255, cv2.THRESH_BINARY)
    cv2.imshow("sat_mask", sat_mask)
    cv2.waitKey(0)

    return sat_mask

def denoise_image(img_src, kernel_size=5, iterations=1):
    """Denoise image with a morphological transformation."""

    # Morphological opening (erosion, then dilation) removes small noise
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    img_denoise = cv2.morphologyEx(
        img_src, cv2.MORPH_OPEN, kernel, iterations=iterations
    )
    return img_denoise

def main(args):
    # ...

    img_mask_denoised = detect_highlights(img_orig)
    img_mask_denoised = denoise_image(img_mask_denoised, kernel_size=3)
    cv2.imshow("img_mask_denoised", img_mask_denoised)
    cv2.waitKey(0)
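
To show why a small opening kernel drops isolated false positives while keeping highlighted regions, here is a NumPy-only sketch of a 3x3 rectangular opening. It is a simplified re-implementation for illustration, not the `cv2.morphologyEx` call used above; the helper names and the synthetic mask are assumptions.

```python
import numpy as np


def _neighborhood_stack(img, k, pad_value):
    """Stack the k*k shifted views of img so min/max over axis 0 scans each window."""
    r = k // 2
    padded = np.pad(img, r, mode="constant", constant_values=pad_value)
    h, w = img.shape
    return np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    )


def opening(mask, k=3):
    """Erode, then dilate, a 0/255 mask with a k x k rectangular kernel."""
    # Erosion: a pixel survives only if its whole window is white.
    eroded = _neighborhood_stack(mask, k, 255).min(axis=0)
    # Dilation: grow the survivors back to their original extent.
    return _neighborhood_stack(eroded, k, 0).max(axis=0)


# Synthetic mask: one isolated speck and one highlight-sized block.
mask = np.zeros((12, 12), dtype=np.uint8)
mask[1, 1] = 255
mask[5:9, 3:10] = 255

opened = opening(mask, k=3)
print(int(opened[1, 1]), int((opened == 255).sum()))
```

Anything narrower than the kernel in either direction is erased, which is also why a larger kernel would start removing thin strokes inside genuine highlights.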

Do you have more images to try this simplified version on?

@mohanad-hafez
Author

Sorry for the late reply! Thank you so much for testing the PR and providing feedback.

I see you found a simpler approach that works better, just using the saturation mask with a small denoising kernel. That's a really good insight.

I will upload more test images with different highlight colors to see how this simplified version performs across various cases, and I'll share the results with you soon.

Thanks for taking the time to review and improve the code!

@mohanad-hafez
Author

I tried it with these pictures and got a lot of false positives:
green
orange
pink2

@zirkelc
Owner

zirkelc commented Mar 21, 2025

Can you show me your code? I tested these images and the result was correct:

Green:
img_final


Orange:
img_final


Pink:
img_contour_and_bounding

img_final


Only the last image has some issues, but it looks like the problem here is with the OCR in general as the bounding box indicates that it doesn't recognize the word alternative correctly.

I can push my code if you want to compare it with your results.

@mohanad-hafez
Author

Hi @zirkelc,
I found and fixed the issue in my code, and now I'm getting the same great results as your screenshots show.
Thanks again for your help!
