Searching for patterns within text is a fundamental task in computer science. One interesting variation is finding “anagram substrings” – instances where a substring within a larger text is an anagram of a given pattern. This problem often appears in coding interviews and has practical applications in areas like text analysis, data matching, and even bioinformatics.
In this post, we’ll break down the Anagram Substring Search problem, explore different approaches, and provide clear, runnable code examples in Python and Go.
What is Anagram Substring Search?
Let’s clarify the terms:
- Anagram: Two strings are anagrams of each other if they contain the same characters with the same frequencies, regardless of order. For example, “listen” and “silent” are anagrams. “Debit card” and “bad credit” are also anagrams.
- Substring: A contiguous sequence of characters within a string. “appl” is a substring of “apple”, but “aple” is not.
Anagram Substring Search (or “Find All Anagrams of a String in a Text”) involves finding all starting indices in a text
string where a substring of text
is an anagram of a given pattern
string.
Example: Text: “cbaebabacd” Pattern: “abc”
Expected output: [0, 6]
- At index 0: “cba” is an anagram of “abc”.
- At index 6: “bac” is an anagram of “abc”.
Approach 1: The Naive Method (Brute Force)
The most straightforward way to solve this is to iterate through every possible substring in the text
that has the same length as the pattern
. For each such substring, we check if it’s an anagram of the pattern
.
How to check if two strings are anagrams:
- Sorting: Sort the characters of both strings. If the sorted strings are identical, they are anagrams. This takes
O(M log M)
time for strings of lengthM
. - Frequency Maps (or Character Counts): Create a frequency map (e.g., a dictionary or an array of 26 integers for lowercase English letters) for each string. Populate the map by counting character occurrences. If the frequency maps are identical, the strings are anagrams. This takes
O(M)
time.
For the naive approach, using frequency maps to check anagrams is generally preferred as it’s faster than sorting.
Algorithm Steps:
- Get the length of the
text
(N
) andpattern
(M
). - If
M > N
, no anagrams can exist, so return an empty list. - Iterate from
i = 0
toN - M
.- Extract the current window (substring) of
text
fromi
toi + M - 1
. - Check if this window is an anagram of the
pattern
using frequency maps. - If it is, add
i
to our list of results.
- Extract the current window (substring) of
Time Complexity: O((N - M + 1) * M * A)
where N
is text length, M
is pattern length, and A
is the size of the alphabet. In the worst case (M
is close to N
), this is O(N * M * A)
. If using sorting for anagram check, it becomes O(N * M log M)
. This can be quite slow for large texts.
Python Example: Naive Approach
import collections
def are_anagrams(s1, s2):
"""
Checks if two strings are anagrams using frequency maps.
Assumes ASCII characters for simplicity, but works generally.
"""
if len(s1) != len(s2):
return False
# Using collections.Counter is very efficient for frequency maps
return collections.Counter(s1) == collections.Counter(s2)
def find_anagram_substrings_naive(text, pattern):
"""
Finds all starting indices of anagram substrings using the naive approach.
"""
n = len(text)
m = len(pattern)
results = []
if m > n:
return results
# Pre-calculate pattern frequency map
# Note: We could re-calculate this inside the loop for each comparison,
# but pre-calculating it once is slightly more efficient.
pattern_freq = collections.Counter(pattern)
for i in range(n - m + 1):
current_window = text[i : i + m]
if are_anagrams(current_window, pattern):
results.append(i)
return results
# --- Test Cases ---
print("--- Naive Approach Examples ---")
# Example 1
text1 = "cbaebabacd"
pattern1 = "abc"
print(f"Text: '{text1}', Pattern: '{pattern1}'")
output1 = find_anagram_substrings_naive(text1, pattern1)
print(f"Output: {output1}")
# Example 2
text2 = "abab"
pattern2 = "ab"
print(f"Text: '{text2}', Pattern: '{pattern2}'")
output2 = find_anagram_substrings_naive(text2, pattern2)
print(f"Output: {output2}")
# Example 3: No matches
text3 = "hello"
pattern3 = "world"
print(f"Text: '{text3}', Pattern: '{pattern3}'")
output3 = find_anagram_substrings_naive(text3, pattern3)
print(f"Output: {output3}")
# Example 4: Pattern longer than text
text4 = "a"
pattern4 = "abc"
print(f"Text: '{text4}', Pattern: '{pattern4}'")
output4 = find_anagram_substrings_naive(text4, pattern4)
print(f"Output: {output4}")
--- Naive Approach Examples ---
Text: 'cbaebabacd', Pattern: 'abc'
Output: [0, 6]
Text: 'abab', Pattern: 'ab'
Output: [0, 1, 2]
Text: 'hello', Pattern: 'world'
Output: []
Text: 'a', Pattern: 'abc'
Output: []
Approach 2: Sliding Window with Frequency Maps (Optimized)
The naive approach repeatedly builds a frequency map for each substring. We can optimize this by using a “sliding window” technique. Instead of recalculating the map for each new window, we update it as the window slides.
Key Idea: When the window slides one character to the right:
- The character at the leftmost edge of the previous window “leaves” the window. Its count in the frequency map should be decremented.
- The character at the new rightmost edge “enters” the window. Its count should be incremented.
This allows us to update the window’s frequency map in O(1)
time (amortized) instead of O(M)
.
Algorithm Steps:
- Get the length of the
text
(N
) andpattern
(M
). - If
M > N
, return an empty list. - Initialize
pattern_freq
(a frequency map forpattern
). - Initialize
window_freq
(a frequency map for the first window oftext
, i.e.,text[0:M]
). - Initialize
results
list. - Compare
pattern_freq
andwindow_freq
. If they are identical, add0
toresults
. - Iterate from
i = 1
toN - M
(sliding the window):- Remove the character
text[i-1]
fromwindow_freq
(decrement its count). - Add the character
text[i+M-1]
towindow_freq
(increment its count). - Compare
pattern_freq
andwindow_freq
. If identical, addi
toresults
.
- Remove the character
Optimized Comparison (Using a matches
counter):
Comparing two entire frequency maps (collections.Counter
objects in Python, or arrays) takes O(A)
time, where A
is the alphabet size. If A
is large, or we want to be strictly O(1)
per window slide, we can use a matches
counter.
The matches
counter keeps track of how many characters in window_freq
have the exact same frequency as in pattern_freq
.
- Initialize
matches = 0
. - When building initial
pattern_freq
andwindow_freq
:- For each character
c
in the alphabet: ifpattern_freq[c] == window_freq[c]
, incrementmatches
.
- For each character
- When sliding the window (for
char_out = text[i-1]
andchar_in = text[i+M-1]
):- Before
char_out
count decrements: ifwindow_freq[char_out] == pattern_freq[char_out]
, decrementmatches
. - Decrement
window_freq[char_out]
. - After
char_out
count decrements: ifwindow_freq[char_out] == pattern_freq[char_out]
, incrementmatches
. - Before
char_in
count increments: ifwindow_freq[char_in] == pattern_freq[char_in]
, decrementmatches
. - Increment
window_freq[char_in]
. - After
char_in
count increments: ifwindow_freq[char_in] == pattern_freq[char_in]
, incrementmatches
. - If
matches == A
(orpattern_freq.keys()
count if using only relevant chars), then the maps are identical. Addi
to results.
- Before
Note: The matches
counter approach is more complex to implement correctly due to the pre- and post-update checks. A simpler collections.Counter
comparison is often sufficient and more readable, especially if A
is small (e.g., 26 for English letters). For competitive programming, the matches
counter is a common optimization.
Time Complexity: O(N + M + A)
.
O(M)
to build initial pattern frequency map.O(M)
to build initial window frequency map.O(N - M)
slides. Each slide takesO(1)
to update counts andO(A)
to compare maps (orO(1)
using thematches
counter ifA
is constant). Overall, ifA
is constant and small (e.g., 26 for English alphabet), this simplifies toO(N)
. This is significantly better than the naive approach.
Python Example: Sliding Window Optimized
We’ll use collections.Counter
directly, as its comparison is optimized in C and usually fast enough. This is a pragmatic choice for Python.
import collections
def find_anagram_substrings_sliding_window(text, pattern):
"""
Finds all starting indices of anagram substrings using the sliding window approach.
"""
n = len(text)
m = len(pattern)
results = []
if m > n:
return results
# Build frequency map for the pattern
pattern_freq = collections.Counter(pattern)
# Build frequency map for the first window of the text
window_freq = collections.Counter(text[0:m])
# Check the first window
if window_freq == pattern_freq:
results.append(0)
# Slide the window across the rest of the text
for i in range(1, n - m + 1):
# Character leaving the window
char_out = text[i - 1]
window_freq[char_out] -= 1
if window_freq[char_out] == 0:
del window_freq[char_out] # Clean up zero counts for efficiency/correct comparison
# Character entering the window
char_in = text[i + m - 1]
window_freq[char_in] += 1
# Check if current window is an anagram of pattern
if window_freq == pattern_freq:
results.append(i)
return results
# --- Test Cases ---
print("\n--- Sliding Window Approach Examples (Python) ---")
# Example 1
text1 = "cbaebabacd"
pattern1 = "abc"
print(f"Text: '{text1}', Pattern: '{pattern1}'")
output1 = find_anagram_substrings_sliding_window(text1, pattern1)
print(f"Output: {output1}")
# Example 2
text2 = "abab"
pattern2 = "ab"
print(f"Text: '{text2}', Pattern: '{pattern2}'")
output2 = find_anagram_substrings_sliding_window(text2, pattern2)
print(f"Output: {output2}")
# Example 3: Long text, small pattern
text3 = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
pattern3 = "zyxw"
print(f"Text: '{text3}', Pattern: '{pattern3}'")
output3 = find_anagram_substrings_sliding_window(text3, pattern3)
print(f"Output: {output3}")
# Example 4: No matches, non-alphabetic characters
text4 = "helloworld!"
pattern4 = "dlo" # Note: 'dlo' is not an anagram of any substring of 'helloworld!'
print(f"Text: '{text4}', Pattern: '{pattern4}'")
output4 = find_anagram_substrings_sliding_window(text4, pattern4)
print(f"Output: {output4}")
# Example 5: Case sensitivity
text5 = "BbAcCb"
pattern5 = "abc"
print(f"Text: '{text5}', Pattern: '{pattern5}'")
output5 = find_anagram_substrings_sliding_window(text5, pattern5)
print(f"Output: {output5}") # Will yield no matches as 'A' != 'a'
--- Sliding Window Approach Examples (Python) ---
Text: 'cbaebabacd', Pattern: 'abc'
Output: [0, 6]
Text: 'abab', Pattern: 'ab'
Output: [0, 1, 2]
Text: 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz', Pattern: 'zyxw'
Output: [22, 48]
Text: 'helloworld!', Pattern: 'dlo'
Output: []
Text: 'BbAcCb', Pattern: 'abc'
Output: []
Go Example: Sliding Window Optimized
In Go, we typically use arrays (e.g., [26]int
for lowercase English letters or [128]int
for ASCII) to represent frequency maps for better performance than map[rune]int
for fixed, small alphabets. This also simplifies the matches
counter implementation.
package main
import (
"fmt"
)
// findAnagramSubstringsSlidingWindow finds all starting indices of anagram substrings.
func findAnagramSubstringsSlidingWindow(text string, pattern string) []int {
n := len(text)
m := len(pattern)
results := []int{}
if m > n {
return results
}
// Use frequency arrays for ASCII characters (0-127)
// For English lowercase letters, [26]int would be sufficient.
// For full Unicode, a map[rune]int would be necessary.
patternFreq := [128]int{}
windowFreq := [128]int{}
// Populate patternFreq
for _, char := range pattern {
patternFreq[char]++
}
// Populate windowFreq for the first window and count initial matches
matches := 0
for i := 0; i < m; i++ {
char := text[i] // byte value
windowFreq[char]++
}
// Compare initial windowFreq and patternFreq to count initial matches
// A 'match' means the character count in windowFreq matches patternFreq
for i := 0; i < 128; i++ {
if patternFreq[i] == windowFreq[i] && patternFreq[i] > 0 { // Check >0 to avoid matching empty counts
matches++
}
}
// If all pattern characters have matching frequencies in the first window
if matches == countDistinctChars(patternFreq[:]) { // countDistinctChars is helper for match comparison
results = append(results, 0)
}
// Slide the window
for i := 1; i <= n-m; i++ {
// Character leaving the window
charOut := text[i-1]
if patternFreq[charOut] > 0 { // Only consider chars relevant to pattern
if windowFreq[charOut] == patternFreq[charOut] {
matches-- // This char was matching, now it won't be
}
}
windowFreq[charOut]--
if patternFreq[charOut] > 0 {
if windowFreq[charOut] == patternFreq[charOut] {
matches++ // This char might become matching again after decrement
}
}
// Character entering the window
charIn := text[i+m-1]
if patternFreq[charIn] > 0 { // Only consider chars relevant to pattern
if windowFreq[charIn] == patternFreq[charIn] {
matches-- // This char was matching, now it won't be
}
}
windowFreq[charIn]++
if patternFreq[charIn] > 0 {
if windowFreq[charIn] == patternFreq[charIn] {
matches++ // This char might become matching again after increment
}
}
// If all pattern characters have matching frequencies
if matches == countDistinctChars(patternFreq[:]) {
results = append(results, i)
}
}
return results
}
// Helper to count distinct non-zero characters in a frequency array
// This helps determine how many `matches` are needed for a full match.
func countDistinctChars(freq []int) int {
count := 0
for _, f := range freq {
if f > 0 {
count++
}
}
return count
}
func main() {
fmt.Println("--- Sliding Window Approach Examples (Go) ---")
// Example 1
text1 := "cbaebabacd"
pattern1 := "abc"
fmt.Printf("Text: '%s', Pattern: '%s'\n", text1, pattern1)
output1 := findAnagramSubstringsSlidingWindow(text1, pattern1)
fmt.Printf("Output: %v\n", output1)
// Example 2
text2 := "abab"
pattern2 := "ab"
fmt.Printf("Text: '%s', Pattern: '%s'\n", text2, pattern2)
output2 := findAnagramSubstringsSlidingWindow(text2, pattern2)
fmt.Printf("Output: %v\n", output2)
// Example 3: Long text, small pattern
text3 := "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
pattern3 := "zyxw"
fmt.Printf("Text: '%s', Pattern: '%s'\n", text3, pattern3)
output3 := findAnagramSubstringsSlidingWindow(text3, pattern3)
fmt.Printf("Output: %v\n", output3)
// Example 4: Case sensitivity (will not match 'abc' to 'Abc')
text4 := "BbAcCb"
pattern4 := "abc"
fmt.Printf("Text: '%s', Pattern: '%s'\n", text4, pattern4)
output4 := findAnagramSubstringsSlidingWindow(text4, pattern4)
fmt.Printf("Output: %v\n", output4)
}
--- Sliding Window Approach Examples (Go) ---
Text: 'cbaebabacd', Pattern: 'abc'
Output: [0 6]
Text: 'abab', Pattern: 'ab'
Output: [0 1 2]
Text: 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
Pattern: 'zyxw'
Output: [22 48]
Text: 'BbAcCb', Pattern: 'abc'
Output: []
Variations and Considerations
- Case Sensitivity: Our examples treat ‘A’ and ‘a’ as different characters. If case-insensitivity is required, convert both
text
andpattern
to a consistent case (e.g., lowercase) before processing. - Non-alphabetic Characters: If the problem allows spaces, punctuation, or numbers, and they should be considered part of the anagram, then the frequency maps (or arrays) must accommodate their ASCII/Unicode values. If they should be ignored, filter them out before processing or during frequency counting. Our
[128]int
array for Go handles common ASCII characters, but for full Unicode,map[rune]int
would be necessary. - Empty Pattern/Text: Our code handles
m > n
(pattern longer than text) by returning an empty list. Ifpattern
is empty, it depends on the problem definition. Some might consider an empty string an anagram of itself, matching at every possible index; others might return an error or empty list. Our current code would return an empty list because thepatternFreq
for an empty pattern would be all zeros, andcountDistinctChars
would return 0, leading tomatches == 0
. - Character Set: For specific character sets (e.g., only lowercase English letters), using a fixed-size array (
[26]int
) mapping ‘a’ to index 0, ‘b’ to index 1, etc., is more memory-efficient and potentially faster than hash maps.
Choosing the Right Approach
-
Naive Approach:
- Pros: Simple to understand and implement. Good for very small inputs or when extreme clarity is prioritized over performance.
- Cons: Inefficient for larger inputs. Time complexity
O(N * M * A)
orO(N * M log M)
is generally unacceptable for competitive programming or production systems with large datasets.
-
Sliding Window with Frequency Maps:
- Pros: Highly efficient with
O(N + M + A)
(effectivelyO(N)
for small, constant alphabets). This is the standard, optimized solution. - Cons: Slightly more complex to implement correctly due to managing the sliding window’s character counts and the
matches
logic (if used).
- Pros: Highly efficient with
Verdict: For virtually any practical application or interview scenario, the Sliding Window approach is the one to use and demonstrate. It showcases a strong understanding of algorithmic optimization and efficient data structure usage.
Conclusion
The Anagram Substring Search problem is a classic example of how applying a clever data structure (frequency maps) and an algorithmic pattern (sliding window) can drastically improve performance from a naive brute-force solution. Understanding and implementing the sliding window technique is a valuable skill for any developer dealing with string manipulation or array-based problems.
Remember, practice is key. Try implementing these solutions yourself, tweaking them for different constraints (case sensitivity, character sets), and measuring their performance. Happy coding!