Add automatic highlight detection for any color #4

mohanad-hafez · 2025-03-09T12:05:55Z

This PR adds a new function detect_highlights() that can automatically detect highlighted text of any color without requiring manual color range specification. This makes the tool more flexible and user-friendly as it no longer requires pre-configuring HSV color ranges for different highlighter colors.

Current Status

The implementation is functional but still needs refinement. It occasionally detects a small number of non-highlighted words or letters. I'm actively working on improving the accuracy and reducing false positives.

Implementation Details

Uses adaptive saturation thresholding to identify highlighted areas
Combines text detection with contrast analysis to reduce false positives
Applies morphological operations to clean up the detection mask
Filters out areas that are too small to be meaningful highlights

Testing

Tested on documents with various highlighter colors with promising results, though some edge cases need further tuning.

Next Steps

I'm working on:

Further reducing false positives
Improving detection accuracy for light-colored highlights
Adding parameter options to fine-tune detection sensitivity
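
One possible shape for such a sensitivity option, shown only as a sketch: expose the multiplier on the standard deviation as a parameter. The name k and the synthetic values below are assumptions for illustration. A lower k lets paler (less saturated) highlights through, at the cost of more false positives on real scans.

```python
import numpy as np


def saturation_mask(s, k=1.5):
    """Binary mask of pixels whose saturation exceeds mean + k * std.

    k is a hypothetical sensitivity parameter: lower values are more permissive.
    """
    thresh = s.mean() + k * s.std()
    return (s > thresh).astype(np.uint8) * 255


# Synthetic saturation channel: paper (10), a pale highlight (90), a vivid one (200)
s = np.full((100, 100), 10, dtype=np.uint8)
s[10:30, 10:90] = 90    # pale highlight
s[60:80, 10:90] = 200   # vivid highlight

strict = saturation_mask(s, k=1.5)   # keeps only the vivid highlight
loose = saturation_mask(s, k=0.25)   # also keeps the pale highlight
```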

Feedback and suggestions for improvement are welcome! I can update the implementation based on your recommendations.

zirkelc · 2025-03-09T13:11:23Z

Hi @mohanad-hafez

that's awesome, thank you! I really didn't expect a PR on this repo.

I need some time to test it locally and will come back to you soon!

Thanks again! 👏

mohanad-hafez · 2025-03-09T15:59:14Z

Thank you for the quick response! I found your tool really helpful for a project I was working on, but my use case required detecting different highlight colors without manual configuration. I thought this addition might be useful for the project. Thanks again!

zirkelc · 2025-03-13T09:40:46Z

main.py


+
+# Set path to Tesseract executable
+pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'


Since this is OS-specific, could you enable it based on the current OS?
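
A minimal sketch of what that could look like. The helper name resolve_tesseract_cmd and the TESSERACT_CMD environment variable are assumptions made up for this example; only the Windows path comes from the diff above.

```python
import os
import platform
import shutil

# Path taken from the diff above; only meaningful on Windows.
WINDOWS_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(system=None):
    """Return a Tesseract executable path, or None to keep pytesseract's default."""
    system = system or platform.system()
    # An explicit override wins (TESSERACT_CMD is a hypothetical variable name).
    override = os.environ.get("TESSERACT_CMD")
    if override:
        return override
    if system == "Windows":
        return WINDOWS_TESSERACT
    # On Linux and macOS, tesseract is usually on PATH already.
    return shutil.which("tesseract")


# Intended use in main.py (left as a comment so the sketch runs without pytesseract):
#   cmd = resolve_tesseract_cmd()
#   if cmd:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```

Leaving tesseract_cmd untouched when nothing is found keeps pytesseract's own default, which looks for tesseract on PATH.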

zirkelc · 2025-03-13T12:15:01Z

I got a chance to test your PR and it looks good. However, the text detection in the middle misses a few words:

The words of, a, (2,4) and T are not detected.

I went through detect_highlights and previewed each intermediate image. For this particular image, the saturation mask sat_mask should already be good enough to run text detection on, since the highlighted areas have a lot of white:

Then I tried to use the sat_mask for text detection by simply returning it from detect_highlights and omitting the remaining code. This is the result with sat_mask as img_mask:

It detects all words inside the highlighted area, but adds a few false positives.

Then I used denoising on the sat_mask with a small kernel of (3,3) and repeated the text detection:

Here's the modified code:

import cv2
import numpy as np


def detect_highlights(img_src):
    """Detect highlighted areas of any color (simplified version).

    This approach uses the following principles:
    1. Highlighted areas have higher saturation than plain paper
    2. We use adaptive thresholding to separate highlights from regular text
    """
    # Convert to HSV
    img_hsv = cv2.cvtColor(img_src, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(img_hsv)

    # Convert to grayscale (preview only; not used by the simplified version)
    img_gray = cv2.cvtColor(img_src, cv2.COLOR_BGR2GRAY)
    cv2.imshow("img_gray", img_gray)
    cv2.waitKey(0)

    # Step 1: Find areas with high saturation (highlighted areas)
    # Calculate saturation statistics to set adaptive threshold
    sat_mean = np.mean(s)
    sat_std = np.std(s)
    sat_thresh = sat_mean + (1.5 * sat_std)  # More adaptive threshold

    # Create a binary mask where saturation is higher than threshold
    _, sat_mask = cv2.threshold(s, sat_thresh, 255, cv2.THRESH_BINARY)
    cv2.imshow("sat_mask", sat_mask)
    cv2.waitKey(0)

    # Return the raw saturation mask and skip the remaining steps
    return sat_mask


def denoise_image(img_src, kernel_size=5, iterations=1):
    """Denoise image with a morphological transformation."""

    # Morphological opening to remove small noise
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    img_denoise = cv2.morphologyEx(
        img_src, cv2.MORPH_OPEN, kernel, iterations=iterations
    )

    return img_denoise


def main(args):
    # ...

    img_mask_denoised = detect_highlights(img_orig)
    img_mask_denoised = denoise_image(img_mask_denoised, kernel_size=3)
    cv2.imshow("img_mask_denoised", img_mask_denoised)
    cv2.waitKey(0)

Do you have more images to try this simplified version on?

mohanad-hafez · 2025-03-15T01:03:35Z

Sorry for the late reply! Thank you so much for testing the PR and providing feedback.

I see you found a simpler approach that works better, just using the saturation mask with a small denoising kernel. That's a really good insight.

I will upload more test images with different highlight colors to see how this simplified version performs across various cases, and I'll share the results with you soon.

Thanks for taking the time to review and improve the code!

mohanad-hafez · 2025-03-16T16:22:17Z

I tried it with these pictures, and there were a lot of false positives:

zirkelc · 2025-03-21T07:50:03Z

Can you show me your code? I tested these images and the result was correct:

Green:

Orange:

Pink:

Only the last image has some issues, but it looks like the problem here is with the OCR in general as the bounding box indicates that it doesn't recognize the word alternative correctly.

I can push my code if you want to compare it with your results.

mohanad-hafez · 2025-03-21T11:08:01Z

Hi @zirkelc,
I found and fixed the issue in my code, and now I'm getting the same great results as your screenshots show.
Thanks again for your help!

mohanad-hafez and others added 2 commits March 9, 2025 14:52

detect any color

080f2a8

Create README.md

09e002e

zirkelc reviewed Mar 13, 2025
