The ability for computers to understand and process human language is fascinating. At its core, this often involves dissecting strings of characters into meaningful units. One classic problem that pops up in this domain, and frequently in technical interviews, is the Word Break Problem.
It might sound simple, but it’s a great example of where a naive approach can quickly spiral into an efficiency nightmare, and how intelligent algorithmic design (like Dynamic Programming) can tame it.
Let’s dive in.
What is the Word Break Problem?
Imagine you have a long string of characters with no spaces, like "applepenapple". You also have a dictionary of valid words, say ["apple", "pen"]. The Word Break Problem asks: can this string be segmented into a space-separated sequence of one or more dictionary words?
For "applepenapple" and ["apple", "pen"], the answer is True, because it can be segmented as "apple pen apple".
What about "catsanddog" and ["cat", "sand", "dog", "cats"]? Also True, since it can be segmented as "cat sand dog".
What about "pineapple" and ["apple", "pen"]? False. Even though "apple" and "pen" are in the dictionary, you can't form "pineapple" using just these words without leftover characters or an invalid sequence.
Why is this important?
This problem has applications in:
- Natural Language Processing (NLP): Tokenization, text segmentation, spell-checking.
- Search Engines: Breaking down queries into keywords.
- Compilers: Lexical analysis, identifying valid tokens in source code.
- Data Validation: Checking if a user input string conforms to a set of valid segments.
Problem Definition
Given:
- A non-empty string s.
- A dictionary wordDict containing a list of non-empty words.
Determine if s can be segmented into a space-separated sequence of one or more dictionary words.
You may assume the dictionary does not contain duplicate words.
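To make the contract concrete, here is a tiny hypothetical check (assuming a function wordBreak(s, wordDict) -> bool; the implementations below provide exactly this behavior under slightly different names):

# Expected behavior for the examples above (wordBreak is a placeholder name)
assert wordBreak("applepenapple", ["apple", "pen"])
assert wordBreak("catsanddog", ["cat", "sand", "dog", "cats"])
assert not wordBreak("pineapple", ["apple", "pen"])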
Approach 1: The Naive Recursive (Backtracking) Solution
Our first instinct might be to try every possible way to break the string. This usually leads to a recursive solution.
The idea is:
- Take a prefix of the string s.
- Check if this prefix exists in our wordDict.
- If it does, recursively check whether the rest of the string (the suffix) can also be broken down into dictionary words.
- If at any point we successfully break down the entire string, return True.
- If we try all possible prefixes and none lead to a successful segmentation of the rest of the string, return False.
- The base case: an empty string "" can always be segmented (it requires no words), so we return True.
Let’s illustrate with s = "applepenapple", wordDict = ["apple", "pen"]:
wordBreak("applepenapple")
- Try prefix "a" (not in dict)
- Try prefix "ap" (not in dict)
- ...
- Try prefix "apple" (in dict!)
- Recursive call: wordBreak("penapple")
- Try prefix "p" (not in dict)
- ...
- Try prefix "pen" (in dict!)
- Recursive call: wordBreak("apple")
- Try prefix "a" (not in dict)
- ...
- Try prefix "apple" (in dict!)
- Recursive call: wordBreak("")
- Base case: return True
- Since inner call returned True, this path returns True.
- Since inner call returned True, this path returns True.
- Since inner call returned True, this path returns True.
Final result: True
Python Implementation (Naive)
def wordBreak_naive(s: str, wordDict: list[str]) -> bool:
    word_set = set(wordDict)  # Convert to set for O(1) average lookup

    # Helper recursive function
    def can_break(sub_s: str) -> bool:
        # Base case: if the substring is empty, we successfully segmented
        # everything before it, so return True.
        if not sub_s:
            return True

        # Try every possible prefix of the current substring
        for i in range(1, len(sub_s) + 1):
            prefix = sub_s[:i]  # Get the prefix
            # If the prefix is in our dictionary
            if prefix in word_set:
                # Recursively check if the remaining suffix can also be broken
                suffix = sub_s[i:]
                if can_break(suffix):
                    return True  # If the suffix can be broken, so can sub_s

        # If no prefix leads to a successful breakdown, return False
        return False

    return can_break(s)
Example Usage and Output
# Example 1: Should be True
s1 = "leetcode"
wordDict1 = ["leet", "code"]
print(f"'{s1}' with {wordDict1}: {wordBreak_naive(s1, wordDict1)}")
# Example 2: Should be True
s2 = "applepenapple"
wordDict2 = ["apple", "pen"]
print(f"'{s2}' with {wordDict2}: {wordBreak_naive(s2, wordDict2)}")
# Example 3: Should be False
s3 = "catsandog"
wordDict3 = ["cats", "dog", "sand", "and", "cat"]
print(f"'{s3}' with {wordDict3}: {wordBreak_naive(s3, wordDict3)}")
# Example 4: A challenging case for naive recursion. With roughly 150 characters,
# the naive approach must try an astronomical number of segmentations of the run
# of 'a's before it can conclude False, so the call below is left commented out.
s4 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
wordDict4 = ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "aaaaaaa", "aaaaaaaa", "aaaaaaaaa", "aaaaaaaaaa"]
print(f"'{s4[:20]}...' with {wordDict4}: {wordBreak_naive(s4, wordDict4)}")
'leetcode' with ['leet', 'code']: True
'applepenapple' with ['apple', 'pen']: True
'catsandog' with ['cats', 'dog', 'sand', 'and', 'cat']: False
Analysis of the Naive Approach
Time Complexity: This approach suffers from exponential time complexity, roughly O(2^N) in the worst case, where N is the length of the string s.
Why? Because of overlapping subproblems. Consider s = "aaaaaa" and wordDict = ["a", "aa"].
can_break("aaaaaa")
will call can_break("aaaaa")
, can_break("aaaa")
, etc.
can_break("aaaaa")
will also call can_break("aaaa")
, can_break("aaa")
, etc.
The same subproblems ("aaaa"
, "aaa"
, etc.) are computed multiple times. This leads to a massive amount of redundant computation.
For a string like s4 in the example, a truly naive implementation would take an astronomically long time to finish, which is why that call is commented out in the example above. This is where optimization is crucial.
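To see the redundancy concretely, here is a small instrumented sketch (the helper name count_naive_calls and the tiny test input are mine, not part of the solutions in this post) that counts how many times the naive recursion visits each distinct suffix:

from collections import Counter

def count_naive_calls(s: str, wordDict: list[str]) -> Counter:
    # Run the naive recursion and count visits to each distinct suffix
    word_set = set(wordDict)
    visits = Counter()

    def can_break(sub_s: str) -> bool:
        visits[sub_s] += 1  # record that this subproblem was (re)computed
        if not sub_s:
            return True
        for i in range(1, len(sub_s) + 1):
            if sub_s[:i] in word_set and can_break(sub_s[i:]):
                return True
        return False

    can_break(s)
    return visits

# Even on a tiny worst-case input the duplication is obvious: the shortest
# suffixes are recomputed many times.
print(count_naive_calls("aaaaaaaab", ["a", "aa", "aaa"]).most_common(3))

Memoization eliminates exactly this duplication by computing each suffix at most once.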
Approach 2: Memoization (Top-Down Dynamic Programming)
The problem of overlapping subproblems is a classic indicator that Dynamic Programming (DP) can be applied. When we combine recursion with storing the results of already-computed subproblems, it’s called memoization (a top-down DP approach).
The idea is simple:
- Use a cache (e.g., a dictionary or an array) to store the result of can_break(sub_s).
- Before computing can_break(sub_s), check whether its result is already in the cache. If yes, return the cached value directly.
- If not, compute the result as before, then store it in the cache before returning.
Python Implementation (Memoization)
def wordBreak_memo(s: str, wordDict: list[str]) -> bool:
    word_set = set(wordDict)
    # Cache to store results of subproblems: key = substring, value = boolean result.
    # We could also use an array `memo[i]` storing the result for `s[i:]`.
    memo = {}

    def can_break(sub_s: str) -> bool:
        if not sub_s:
            return True
        # Check if the result for this substring is already memoized
        if sub_s in memo:
            return memo[sub_s]

        for i in range(1, len(sub_s) + 1):
            prefix = sub_s[:i]
            if prefix in word_set:
                suffix = sub_s[i:]
                if can_break(suffix):
                    memo[sub_s] = True  # Memoize and return True
                    return True

        # If no path was found, memoize False
        memo[sub_s] = False
        return False

    return can_break(s)
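As an aside, Python's functools.lru_cache can handle the caching for us. Here is a minimal sketch of the same idea, keyed by start index instead of by substring (wordBreak_lru is my name for this variant, not code from the post above):

from functools import lru_cache

def wordBreak_lru(s: str, wordDict: list[str]) -> bool:
    word_set = set(wordDict)

    @lru_cache(maxsize=None)  # cache one result per start index
    def can_break(start: int) -> bool:
        if start == len(s):
            return True
        # Try every end position; any() short-circuits on the first success
        return any(
            s[start:end] in word_set and can_break(end)
            for end in range(start + 1, len(s) + 1)
        )

    return can_break(0)

Keying by index also keeps the cache compact, since it avoids storing substring copies as keys.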
Example Usage and Output
# Example 1: Should be True
s1 = "leetcode"
wordDict1 = ["leet", "code"]
print(f"'{s1}' with {wordDict1}: {wordBreak_memo(s1, wordDict1)}")
# Example 2: Should be True
s2 = "applepenapple"
wordDict2 = ["apple", "pen"]
print(f"'{s2}' with {wordDict2}: {wordBreak_memo(s2, wordDict2)}")
# Example 3: Should be False
s3 = "catsandog"
wordDict3 = ["cats", "dog", "sand", "and", "cat"]
print(f"'{s3}' with {wordDict3}: {wordBreak_memo(s3, wordDict3)}")
# Example 4: The challenging case, now it should run quickly
s4 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
wordDict4 = ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "aaaaaaa", "aaaaaaaa", "aaaaaaaaa", "aaaaaaaaaa"]
print(f"'{s4[:20]}...' with {wordDict4}: {wordBreak_memo(s4, wordDict4)}")
# Example 5: Another true case
s5 = "cars"
wordDict5 = ["car", "cars"]
print(f"'{s5}' with {wordDict5}: {wordBreak_memo(s5, wordDict5)}")
'leetcode' with ['leet', 'code']: True
'applepenapple' with ['apple', 'pen']: True
'catsandog' with ['cats', 'dog', 'sand', 'and', 'cat']: False
'aaaaaaaaaaaaaaaaaaaa...' with ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa']: False
'cars' with ['car', 'cars']: True
Analysis of Memoized Approach
Time Complexity: By memoizing, we ensure that each unique subproblem is computed only once. There are N possible subproblems (one per starting index, i.e., one per suffix of s). For a subproblem of length k, we iterate over up to k prefixes; each iteration slices a prefix (O(i) for a prefix of length i) and hashes it for the set lookup (cost proportional to its length).
So, for each of the N states we do up to N iterations, each involving a substring slice and a dictionary lookup. This is commonly summarized as O(N^2 * L_max), where N is the length of s and L_max is the maximum length of a word in wordDict; strictly speaking, because the code slices prefixes of arbitrary length, the worst case is closer to O(N^3). Either way, it is a dramatic improvement over exponential.
Space Complexity: O(N) entries in the memoization dictionary, one for each of the up to N distinct suffixes s[i:], plus the recursion stack.
Approach 3: Bottom-Up Dynamic Programming
While memoization is top-down DP, we can also solve this problem iteratively using a bottom-up DP approach. This is often preferred for its clear state transitions and avoidance of recursion overhead.
We define a boolean array dp of size N+1:
- dp[i] will be True if s[0...i-1] (the prefix of s of length i) can be segmented into dictionary words.
- dp[0] is True because an empty string (the prefix of length 0) can always be segmented; it is the base on which longer solutions are built.
The logic:
To determine dp[i], we look at all possible split points j from 0 to i-1.
If dp[j] is True (meaning s[0...j-1] can be segmented) AND s[j...i-1] (the substring from index j up to i-1) is in our wordDict, then dp[i] is True.
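In code, that transition is a single line. A sketch, using the same variable names as the full implementation further below:

# dp[i] is True if any earlier split point j works
dp[i] = any(dp[j] and s[j:i] in word_set for j in range(i))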
Let’s trace s = "leetcode", wordDict = ["leet", "code"]:
N = 8, so we use a dp array of size 9.
dp = [F, F, F, F, F, F, F, F, F]
Initialize dp[0] = True
dp = [T, F, F, F, F, F, F, F, F]
Outer loop: i from 1 to N (the length of the prefix we are checking).
i = 1: s[0...0] = "l"
- j = 0: dp[0] is True. Is s[0...0] ("l") in wordDict? No.

i = 2: s[0...1] = "le"
- j = 0: dp[0] is True. Is s[0...1] ("le") in wordDict? No.
- j = 1: dp[1] is False. Skip.

i = 3: s[0...2] = "lee"
- ... (no match)

i = 4: s[0...3] = "leet"
- j = 0: dp[0] is True. Is s[0...3] ("leet") in wordDict? Yes!
- Set dp[4] = True. Break inner loop. dp = [T, F, F, F, T, F, F, F, F]

i = 5: s[0...4] = "leetc"
- j = 0: dp[0] is True. Is s[0...4] ("leetc") in wordDict? No.
- j = 1: dp[1] is False. Skip.
- j = 2: dp[2] is False. Skip.
- j = 3: dp[3] is False. Skip.
- j = 4: dp[4] is True. Is s[4...4] ("c") in wordDict? No.

i = 6: s[0...5] = "leetco"
- ...
- j = 4: dp[4] is True. Is s[4...5] ("co") in wordDict? No.

i = 7: s[0...6] = "leetcod"
- ...
- j = 4: dp[4] is True. Is s[4...6] ("cod") in wordDict? No.

i = 8: s[0...7] = "leetcode"
- j = 0: dp[0] is True. Is s[0...7] ("leetcode") in wordDict? No.
- ...
- j = 4: dp[4] is True. Is s[4...7] ("code") in wordDict? Yes!
- Set dp[8] = True. Break inner loop. dp = [T, F, F, F, T, F, F, F, T]

Finally, return dp[N] (which is dp[8]). It’s True.
Python Implementation (Bottom-Up DP)
def wordBreak_dp(s: str, wordDict: list[str]) -> bool:
    word_set = set(wordDict)
    n = len(s)

    # dp[i] is True if s[0...i-1] can be segmented
    dp = [False] * (n + 1)
    dp[0] = True  # Empty string can always be segmented

    # Iterate through all possible end points (i) of a prefix
    for i in range(1, n + 1):
        # Iterate through all possible start points (j) of the current word
        # (s[j...i-1]) such that s[0...j-1] is already segmented (dp[j] is True)
        for j in range(i):
            if dp[j]:  # If the prefix s[0...j-1] can be segmented
                current_word = s[j:i]  # The potential word
                if current_word in word_set:  # And this word is in the dictionary
                    dp[i] = True  # Then s[0...i-1] can be segmented
                    break  # No need to check other j's for this i, we found a way

    return dp[n]
Example Usage and Output
# Example 1: Should be True
s1 = "leetcode"
wordDict1 = ["leet", "code"]
print(f"'{s1}' with {wordDict1}: {wordBreak_dp(s1, wordDict1)}")
# Example 2: Should be True
s2 = "applepenapple"
wordDict2 = ["apple", "pen"]
print(f"'{s2}' with {wordDict2}: {wordBreak_dp(s2, wordDict2)}")
# Example 3: Should be False
s3 = "catsandog"
wordDict3 = ["cats", "dog", "sand", "and", "cat"]
print(f"'{s3}' with {wordDict3}: {wordBreak_dp(s3, wordDict3)}")
# Example 4: The challenging case, now it runs quickly and correctly
s4 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
wordDict4 = ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "aaaaaaa", "aaaaaaaa", "aaaaaaaaa", "aaaaaaaaaa"]
print(f"'{s4[:20]}...' with {wordDict4}: {wordBreak_dp(s4, wordDict4)}")
# Example 5: Another true case
s5 = "cars"
wordDict5 = ["car", "cars"]
print(f"'{s5}' with {wordDict5}: {wordBreak_dp(s5, wordDict5)}")
# Example 6: "bedbathandbeyond", ["bed", "bath", "bat", "and", "hand", "beyond"] -> True
s6 = "bedbathandbeyond"
wordDict6 = ["bed", "bath", "bat", "and", "hand", "beyond"]
print(f"'{s6}' with {wordDict6}: {wordBreak_dp(s6, wordDict6)}")
'leetcode' with ['leet', 'code']: True
'applepenapple' with ['apple', 'pen']: True
'catsandog' with ['cats', 'dog', 'sand', 'and', 'cat']: False
'aaaaaaaaaaaaaaaaaaaa...' with ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa']: False
'cars' with ['car', 'cars']: True
'bedbathandbeyond' with ['bed', 'bath', 'bat', 'and', 'hand', 'beyond']: True
Analysis of Bottom-Up DP Approach
Time Complexity: Essentially the same as the memoized version. We have two nested loops: i from 1 to N and j from 0 to i-1. Inside the inner loop we slice s[j:i] (O(length of the substring)) and do a set lookup (hashing costs time proportional to the word's length). In the worst case the sliced substring can be N characters long.
The total is commonly quoted as O(N^2 * L_max), where N is the length of s and L_max is the maximum length of a word in wordDict; with unrestricted slices as written, the strict worst case is O(N^3). A simple optimization, sketched below, caps the inner loop at L_max characters.
Space Complexity: O(N) for the dp array.
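Here is a minimal sketch of that optimization (wordBreak_dp_bounded is my name for this variant, not code from the post above): since no dictionary word is longer than L_max, only split points within L_max characters of i need to be checked.

def wordBreak_dp_bounded(s: str, wordDict: list[str]) -> bool:
    word_set = set(wordDict)
    n = len(s)
    max_len = max((len(w) for w in word_set), default=0)  # L_max

    dp = [False] * (n + 1)
    dp[0] = True

    for i in range(1, n + 1):
        # Only words of length <= max_len can end at position i
        for j in range(max(0, i - max_len), i):
            if dp[j] and s[j:i] in word_set:
                dp[i] = True
                break

    return dp[n]

This bounds the work for each dp[i] to at most L_max candidate words, each at most L_max characters long.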
Variations and Further Considerations
The problem we solved focuses on whether a segmentation exists. What if you needed all possible segmentations?
Returning All Possible Segmentations
This variation changes the return type from a boolean to a list of strings (or a list of lists of strings).
The core idea remains Dynamic Programming or memoized recursion, but instead of storing True/False, dp[i] would store a list of all possible word sequences that form s[0...i-1].
When building dp[i], if dp[j] is valid (non-empty) and s[j...i-1] is a word, you combine each sequence from dp[j] with s[j...i-1] to form new sequences for dp[i]. This is often done using recursion with memoization, where the memo stores lists of strings.
Example: s = "catsanddog", wordDict = ["cat", "cats", "and", "sand", "dog"]
Output: ["cats and dog", "cat sand dog"]
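A hedged sketch of this variation, using memoized recursion keyed by start index (wordBreak_all and its internals are my naming; other formulations exist):

def wordBreak_all(s: str, wordDict: list[str]) -> list[str]:
    word_set = set(wordDict)
    memo = {}  # start index -> list of sentences covering s[start:]

    def sentences(start: int) -> list[str]:
        if start == len(s):
            return [""]  # one way to segment the empty suffix: use no words
        if start in memo:
            return memo[start]

        results = []
        for end in range(start + 1, len(s) + 1):
            word = s[start:end]
            if word in word_set:
                for rest in sentences(end):
                    results.append(word if not rest else word + " " + rest)
        memo[start] = results
        return results

    return sentences(0)

print(wordBreak_all("catsanddog", ["cat", "cats", "and", "sand", "dog"]))
# ['cat sand dog', 'cats and dog']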
Optimizing Dictionary Lookups with a Trie
If wordDict is extremely large and contains many words with common prefixes (e.g., “apple”, “applet”, “application”), converting it to a set still gives average O(1) lookups per word, but hashing each candidate substring costs time proportional to its length.
For scenarios where L_max is large or you have millions of words, a Trie (prefix tree) can be used to store wordDict. Looking up a word in a Trie takes O(L_word) time, and the Trie can also tell you as soon as a prefix matches no dictionary word, letting you stop scanning early. This can replace the current_word in word_set check if you restructure the DP inner loop around prefix matching. For the standard DP solution, a set is generally sufficient and simpler.
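For illustration, here is a hedged sketch of such a Trie-backed variant (a plain dict-of-dicts Trie; wordBreak_trie is my naming, not code from earlier in the post). From each index already known to be reachable, it walks the Trie forward instead of slicing and hashing substrings, and stops as soon as no dictionary word can continue:

def wordBreak_trie(s: str, wordDict: list[str]) -> bool:
    # Build a simple dict-of-dicts Trie; END marks the end of a complete word.
    END = object()  # sentinel key that cannot collide with any character
    trie = {}
    for word in wordDict:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    n = len(s)
    dp = [False] * (n + 1)
    dp[0] = True  # empty prefix is always segmentable

    for start in range(n):
        if not dp[start]:
            continue
        node = trie
        # Walk forward while s[start:end+1] is still a prefix of some word
        for end in range(start, n):
            node = node.get(s[end])
            if node is None:
                break  # no dictionary word continues this prefix
            if END in node:
                dp[end + 1] = True  # s[start:end+1] is a complete word

    return dp[n]

The DP itself is unchanged; only the inner scan differs, and it is naturally bounded by the longest matching prefix rather than by N.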
Conclusion
The Word Break Problem is a fantastic illustration of how identifying overlapping subproblems can transform an inefficient exponential algorithm into a performant polynomial one.
- The Naive Recursive approach is intuitive but suffers from extreme redundancy, leading to exponential time complexity.
- Memoization (Top-Down DP) prunes the redundant computations by caching results, significantly improving performance to O(N^2 * L_max) time.
- Bottom-Up Dynamic Programming provides an iterative solution with the same optimal time and space complexity, often preferred for its clear state transitions and avoidance of recursion depth limits.
Understanding these approaches not only helps you ace technical interviews but also strengthens your grasp on fundamental algorithmic design patterns applicable to a wide range of complex problems. Always look for overlapping subproblems and optimal substructure – they are the hallmarks of Dynamic Programming!