Using regex to select elements of a string with Python
Imagine you have a list of string elements, such as you might get after reading in a text file line by line that looks as follows:
data=['month::5-->', 'day::13-->']
From the above, it looks like each value sits between a label (e.g. month::) and an end string (e.g. -->). If, for example, you wanted to get at the month part, you could select its element as follows, and from that point you could start splitting it up:
list(filter(lambda k: 'month::' in k, data))
>>['month::5-->']
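To make the "splitting" step concrete, here is a minimal sketch of one way to pull the value out of the filtered element. The variable names (month_lines, month_text, month_value) are made up for illustration, and it assumes every matching element contains both the label and the end string:

```python
data = ['month::5-->', 'day::13-->']

# Keep only the elements that carry the month label.
month_lines = [line for line in data if 'month::' in line]

# Take the text after the label, then drop the end string.
month_text = month_lines[0].split('::', 1)[1]        # '5-->'
month_value = month_text.replace('-->', '').strip()

print(month_value)  # 5
```

This works as long as the label and the end string are in the same list element.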
Now say the file contained some unexpected gaps, so your data looks like this instead:
data=['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']
Notice how the label and the end string that border the address are now in different list elements:
data[0]
>>'address::1, Humpty Dumpty Lane,'
data[1]
>>'Somewhere -->'
If I want to get the address, I could try the filter approach, but it won't return everything I need: the value continues up to the -->, which is now in a different list element:
list(filter(lambda k: 'address::' in k, data))
>> ['address::1, Humpty Dumpty Lane,']
A more efficient and tidy way to deal with this is to use regular expressions through the re module from the standard library. First, concatenate everything into a single string:
all_content=' '.join(data)
Now use the following:
import re
address_info=re.findall("address::(.*?)-->", all_content)
print(address_info) # the list; note the trailing space left before -->
>>['1, Humpty Dumpty Lane, Somewhere ']
print(address_info[0].strip()) # the string, with surrounding whitespace removed
>>1, Humpty Dumpty Lane, Somewhere
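The pattern matches address::, then (.*?) captures all text up to the first following instance of -->. The ? makes the match non-greedy: if you just pass (.*) it will be greedy and capture all text up to the last occurrence of -->, which matters if you have a few of them! Note that findall returns every non-overlapping match, not just the first, which is why the result is a list.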
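The difference is easy to see by running both patterns on the joined string from the example. This is only a small sketch; the names lazy and greedy are chosen here for illustration:

```python
import re

data = ['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']
all_content = ' '.join(data)

# Non-greedy: stop at the first --> after the label.
lazy = re.findall(r"address::(.*?)-->", all_content)

# Greedy: run on to the last --> in the string.
greedy = re.findall(r"address::(.*)-->", all_content)

print(lazy)    # ['1, Humpty Dumpty Lane, Somewhere ']
print(greedy)  # ['1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13']
```

The greedy version swallows the month and day entries as well, which is rarely what you want here.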
For help with finding and testing regex patterns, check out pythex. You can see it working with the example provided here.
Thanks John Stevenson for the help!