I have a Python dictionary as given below:
ip = {
    "doc1.pdf": {
        "img1.png": ("FP", "text1"),
        "img2.png": ("NP", "text2"),
        "img3.png": ("FP", "text3"),
    },
    "doc2.pdf": {
        "img1.png": ("FP", "text4"),
        "img2.png": ("NP", "text5"),
        "img3.png": ("NP", "text6"),
        "img4.png": ("NP", "text7"),
        "img5.png": ("Others", "text8"),
        "img6.png": ("FP", "text9"),
        "img7.png": ("NP", "text10"),
    },
    "doc3.pdf": {
        "img1.png": ("Others", "text8"),
        "img2.png": ("FP", "text9"),
        "img3.png": ("Others", "text10"),
        "img4.png": ("FP", "text11"),
    },
    "doc4.pdf": {
        "img1.png": ("FP", "text12"),
        "img2.png": ("Others", "text13"),
        "img3.png": ("Others", "text14"),
        "img4.png": ("Others", "text15"),
    },
    "doc5.pdf": {
        "img1.png": ("FP", "text16"),
        "img2.png": ("FP", "text17"),
        "img3.png": ("NP", "text18"),
        "img4.png": ("NP", "text19"),
    },
}
Here the keyword FP means FirstPage, NP means NextPage, and Others means OtherPage (a page that is not part of an FP/NP sequence). FP and NP are sequential, so an FP always appears before its NPs. I want to separate each sequential run of an FP and its NPs from the other such runs.
I want to process the dictionary based on these rules:
- Remove all the elements whose tuple contains the keyword Others.
- Next, combine the elements that are sequential, i.e. consecutive FPs and NPs, into one dictionary. So if one or more NPs appear after an FP, then the FP and those NPs should be combined into one dictionary.
- If there is a lone FP with no NP following it, or if an FP (1) is followed by another FP (2), then FP (1) needs to be put in a separate dictionary.
Here is what the output would look like for the above input:
op = {
    "doc1.pdf": [
        {
            "img1.png": ("FP", "text1"),
            "img2.png": ("NP", "text2"),
        },
        {
            "img3.png": ("FP", "text3"),
        },
    ],
    "doc2.pdf": [
        {
            "img1.png": ("FP", "text4"),
            "img2.png": ("NP", "text5"),
            "img3.png": ("NP", "text6"),
            "img4.png": ("NP", "text7"),
        },
        {
            "img6.png": ("FP", "text9"),
            "img7.png": ("NP", "text10"),
        },
    ],
    "doc3.pdf": [
        {
            "img2.png": ("FP", "text9"),
        },
        {
            "img4.png": ("FP", "text11"),
        },
    ],
    "doc4.pdf": [
        {
            "img1.png": ("FP", "text12"),
        },
    ],
    "doc5.pdf": [
        {
            "img1.png": ("FP", "text16"),
        },
        {
            "img2.png": ("FP", "text17"),
            "img3.png": ("NP", "text18"),
            "img4.png": ("NP", "text19"),
        },
    ],
}
So far I have tried this but it is not working:
def remove_others(ip_dict):
    op_dict = {}
    for doc, img_dict in ip_dict.items():
        temp_list = []
        current_group = []
        for img, values in img_dict.items():
            label, text = values
            if label == "Others":
                continue
            if current_group and label == "NP" and current_group[-1][1][0] == "FP":
                current_group.append((img, (label, text)))
            else:
                if current_group:
                    temp_list.append(dict(current_group))
                current_group = [(img, (label, text))]
        if current_group:
            temp_list.append(dict(current_group))
        op_dict[doc] = temp_list
    return op_dict
Any help is appreciated!
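One likely cause of the problem: the check current_group[-1][1][0] == "FP" only lets an NP join a group when the element right before it is an FP, so a second or third NP in a row starts a new group instead of extending the current one. Below is a minimal sketch of one possible fix, not a definitive implementation. The function name group_pages is made up for illustration. The sketch rests on three assumptions that the rules above do not spell out: Others entries are dropped before grouping and so do not break a group, an NP that has no FP before it starts its own group, and the dictionaries keep insertion order (Python 3.7+).

```python
def group_pages(ip_dict):
    """Drop "Others" entries, then group each FP with the NPs that follow it.

    group_pages is a hypothetical name used only for this sketch.
    """
    op_dict = {}
    for doc, img_dict in ip_dict.items():
        groups = []  # list of dicts, one per FP/NP run
        for img, (label, text) in img_dict.items():
            if label == "Others":
                # Assumption: Others pages are simply skipped and do not
                # split a run of FP/NP pages.
                continue
            if label == "NP" and groups:
                # An NP extends the most recent group, no matter how many
                # NPs that group already holds.
                groups[-1][img] = (label, text)
            else:
                # An FP always starts a new group. Assumption: an NP with
                # nothing before it also starts its own group.
                groups.append({img: (label, text)})
        op_dict[doc] = groups
    return op_dict
```

Under those assumptions, calling group_pages(ip) on the input dictionary above should produce the op structure shown earlier. If an Others page is supposed to end a group, the continue branch would need to record that the next NP must not be appended.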