Summarizing Markdown files with Azure Cognitive Services TextAnalytics Extractive Summary

At Microsoft, documentation is written in Markdown and lives in GitHub repos, which you fork, clone, and do all the usual GitHub things with. The Markdown then gets transformed into HTML. (This is a really simplified explanation; the actual process is a lot more complicated.)

This script opens local Markdown files and summarizes them using Azure Cognitive Services TextAnalytics Extractive Summary. Extractive summarization is available in the beta release, 5.3.0b2, of the client library, but not in the current stable version, 5.2.1.

It doesn’t yet write anything to the Markdown files or create a CSV file; all it does is print the summary to the terminal. I’m planning to use similar scripts to analyze what is in my content set so I can eventually use structured documentation methods to overhaul it. First, though, I need to know what is there and how it all maps to other content.

It does return an unclosed connection error; fixing that is on my TO DO list.

import asyncio
import os

from dotenv import load_dotenv
# Get environment variables
load_dotenv()

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics.aio import TextAnalyticsClient

endpoint = os.environ["AZURE_LANGUAGE_ENDPOINT"]
key = os.environ["AZURE_LANGUAGE_KEY"]

text_analytics_client = TextAnalyticsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
)

async def get_summary(content, filename):
    print(f"Summarizing {filename}")
    document = [content]
    poller = await text_analytics_client.begin_extract_summary(document)
    extract_summary_results = await poller.result()
    async for result in extract_summary_results:
        if result.kind == "ExtractiveSummarization":
            print("Summary extracted: \n{}".format(
                " ".join([sentence.text for sentence in result.sentences]))
            )
        elif result.is_error is True:
            print("...Is an error with code '{}' and message '{}'".format(
                result.error.code, result.error.message
            ))

async def sample_extractive_summarization_async():
    direct = "C:/absolute/path/to/markdown/files/"
    for filename in os.listdir(direct):
        # Open and read the file; the context manager closes it when done
        with open(direct + filename, "r", encoding="utf-8") as file:
            content = file.read()
        await get_summary(content, filename)

async def main():
    await sample_extractive_summarization_async()

# TO DO: Fix unclosed connector error
if __name__ == '__main__':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())
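One way to tackle that TO DO: the async Azure SDK clients can be used as async context managers (or closed explicitly with `await client.close()`), which shuts down the underlying connection cleanly. Here is a minimal sketch of the pattern using a stand-in class, since the real TextAnalyticsClient needs live credentials; the `FakeClient` and its `summarize` method are made up for illustration.

```python
import asyncio

# Stand-in for an async SDK client. The real TextAnalyticsClient exposes
# the same close()/async-with behavior; only the names below are invented.
class FakeClient:
    def __init__(self):
        self.closed = False

    async def summarize(self, text):
        # Pretend "summary": just the first sentence
        return text.split(".")[0]

    async def close(self):
        # A real client closes its HTTP session here
        self.closed = True

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        await self.close()

async def main():
    async with FakeClient() as client:  # closed automatically on exit
        summary = await client.summarize("First sentence. Second sentence.")
        print(summary)
    print("client closed:", client.closed)

asyncio.run(main())
```

Wrapping the summarization loop in `async with text_analytics_client:` (or calling `await text_analytics_client.close()` at the end of `main`) should make the unclosed-connection warning go away.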

PS. For organizations that use an actual Component Content Management System (CCMS) to write and publish technical documentation and learning content, there may be a function in the system that automatically creates summaries of aggregated content. Paligo integrates with GitHub repos, although I haven’t seen it in action yet.

Use Azure OpenAI, Python and OpenAPI to create a PlantUML relationship diagram

This is not my first attempt at getting Azure OpenAI to tell me something about the relationship between OpenAPI entities, but it is the first time I have opened local files to get at the JSON. I have started with the smallest of the JSON files for Azure Media Services because I haven’t yet broken a larger file into acceptable token limit chunks. The paragraph I asked for didn’t turn out quite like I wanted it to…but that is for a different iteration of this experiment.

Python Code

import asyncio
import os
import json
import openai

from dotenv import load_dotenv
# Get environment variables
load_dotenv()
openai.api_type = "azure"
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_version = "2022-12-01"
openai.api_key = os.getenv("OPENAI_API_KEY")

# Read the JSON file
async def readJSON():
    with open('ams_api/AccountFilters.json') as f:
        data = json.load(f)
    return data

async def main():
    data = await readJSON()
    print(data.keys())

    # The completion call has to run after the JSON is loaded,
    # so that data is defined when the prompt string is built
    response = openai.Completion.create(
        engine="gpt-35-turbo-0301",
        prompt=f"<|im_start|>system\nYou are a developer who wants to understand the relationship between entities in several OpenAPI files.\n<|im_end|>\n<|im_start|>user\nRead the JSON in {data}. First return a markdown table of the entities under 'definitions' with the entity name in the left column and the description in the right column. If the entity has properties, concatenate the entity name and property with a period delimiter. Put the concatenated name in the left column and the property description in the right column. Then return a paragraph that describes the parent/child relationships of the entities. Then return a PlantUML diagram that shows the relationships between entities. Make sure you include the properties of each entity.\n\n<|im_end|>\n<|im_start|>assistant",
        temperature=1,
        max_tokens=5500,
        top_p=0.5,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None)

    print(response['choices'][0]['text'])

if __name__ == '__main__':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())
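For the larger files, the definitions will have to be split into pieces that fit under the model's token limit. Here is a rough sketch of how that chunking could work; the four-characters-per-token estimate and the chunk size are assumptions for illustration, not anything from the OpenAI docs, and the two-entity `definitions` fragment is made up.

```python
import json

def chunk_definitions(definitions, max_tokens=2000):
    """Group OpenAPI 'definitions' entries into chunks that stay under
    an approximate token budget (estimated as ~4 characters per token)."""
    chunks, current, current_tokens = [], {}, 0
    for name, schema in definitions.items():
        entry_tokens = len(json.dumps({name: schema})) // 4
        if current and current_tokens + entry_tokens > max_tokens:
            chunks.append(current)       # budget exceeded: start a new chunk
            current, current_tokens = {}, 0
        current[name] = schema
        current_tokens += entry_tokens
    if current:
        chunks.append(current)
    return chunks

# Tiny made-up example: a very small budget forces two chunks
definitions = {
    "AccountFilter": {"description": "An Account Filter.", "type": "object"},
    "FirstQuality": {"description": "Filter First Quality", "type": "object"},
}
for i, chunk in enumerate(chunk_definitions(definitions, max_tokens=20)):
    print(i, list(chunk.keys()))
```

Each chunk could then be serialized with `json.dumps` and sent in its own prompt, keeping entities whole rather than splitting mid-schema.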


Markdown Table

| Entity Name | Description |
|-------------|-------------|
| PresentationTimeRange | The presentation time range, this is asset related and not recommended for Account Filter. |
| PresentationTimeRange.startTimestamp | The absolute start time boundary. |
| PresentationTimeRange.endTimestamp | The absolute end time boundary. |
| PresentationTimeRange.presentationWindowDuration | The relative to end sliding window.|
| PresentationTimeRange.liveBackoffDuration | The relative to end right edge.|
| PresentationTimeRange.timescale | The time scale of time stamps.|
| PresentationTimeRange.forceEndTimestamp | The indicator of forcing existing of end time stamp.|
| FilterTrackPropertyCondition | The class to specify one track property condition.|
| FilterTrackPropertyCondition.property | The track property type.|
| FilterTrackPropertyCondition.value | The track property value.|
| FilterTrackPropertyCondition.operation | The track property condition operation.|
| FirstQuality | Filter First Quality|
| FirstQuality.bitrate | The first quality bitrate.|
| FilterTrackSelection | Representing a list of FilterTrackPropertyConditions to select a track. The filters are combined using a logical AND operation. |
| FilterTrackSelection.trackSelections | The track selections.|
| MediaFilterProperties | The Media Filter properties.|
| MediaFilterProperties.presentationTimeRange | The presentation time range.|
| MediaFilterProperties.firstQuality | The first quality.|
| MediaFilterProperties.tracks | The tracks selection conditions. |
| AccountFilter | An Account Filter.|
| AccountFilter.properties | The Media Filter properties. |
| AccountFilter.systemData | The system metadata relating to this resource.|
| AccountFilterCollection | A collection of AccountFilter items.|
| AccountFilterCollection.value | A collection of AccountFilter items.|
| AccountFilterCollection.@odata.nextLink | A link to the next page of the collection (when the collection contains too many results to return in one response).|
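A table like the one above can also be generated deterministically from the JSON, which makes a handy cross-check on the model's output. A minimal sketch; the one-entity `defs` fragment below is made up in the shape of the AccountFilters.json `definitions` block, not copied from it.

```python
def definitions_to_table(definitions):
    """Flatten OpenAPI definitions into Entity / Entity.property rows."""
    rows = ["| Entity Name | Description |", "|---|---|"]
    for name, schema in definitions.items():
        rows.append(f"| {name} | {schema.get('description', '')} |")
        for prop, spec in schema.get("properties", {}).items():
            # Concatenate entity and property names with a period delimiter
            rows.append(f"| {name}.{prop} | {spec.get('description', '')} |")
    return "\n".join(rows)

# Made-up fragment shaped like the 'definitions' in AccountFilters.json
defs = {
    "FirstQuality": {
        "description": "Filter First Quality",
        "properties": {
            "bitrate": {"description": "The first quality bitrate."}
        },
    },
}
print(definitions_to_table(defs))
```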

Paragraph

Kinda meh…I think this will turn out better when I am using the Accounts.json file rather than the AccountFilters.json file. What I asked for was probably too literal.

“The entities have a parent-child relationship. The `AccountFilter` entity has a property called `properties` which is of type `MediaFilterProperties`. The `MediaFilterProperties` entity has properties called `presentationTimeRange`, `firstQuality`, and `tracks`. The `tracks` property is an array of `FilterTrackSelection` entities. The `FilterTrackSelection` entity has a property called `trackSelections` which is an array of `FilterTrackPropertyCondition` entities.”

PlantUML Diagram

@startuml
class AccountFilter {
+ properties: MediaFilterProperties
+ systemData: systemData
}

class MediaFilterProperties {
+ presentationTimeRange: PresentationTimeRange
+ firstQuality: FirstQuality
+ tracks: FilterTrackSelection[]
}

class PresentationTimeRange {
+ startTimestamp: integer
+ endTimestamp: integer
+ presentationWindowDuration: integer
+ liveBackoffDuration: integer
+ timescale: integer
+ forceEndTimestamp: boolean
}

class FirstQuality {
+ bitrate: integer
}

class FilterTrackSelection {
+ trackSelections: FilterTrackPropertyCondition[]
}

class FilterTrackPropertyCondition {
+ property: string
+ value: string
+ operation: string
}

AccountFilter --> MediaFilterProperties
MediaFilterProperties --> PresentationTimeRange
MediaFilterProperties --> FirstQuality
MediaFilterProperties --> FilterTrackSelection
FilterTrackSelection --> FilterTrackPropertyCondition
@enduml
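The model got the arrows right here, but this kind of diagram can also be derived mechanically from the schema. A sketch that walks an OpenAPI `definitions` dict and emits a PlantUML class per entity plus an arrow per `$ref` link; the two-entity `defs` fragment is a made-up miniature, not the real AccountFilters.json.

```python
def schema_to_plantuml(definitions):
    """Emit a PlantUML class per definition and an arrow per $ref link."""
    lines, arrows = ["@startuml"], []
    for name, schema in definitions.items():
        lines.append(f"class {name} {{")
        for prop, spec in schema.get("properties", {}).items():
            ref = spec.get("$ref", "")
            if not ref and spec.get("type") == "array":
                # Arrays link via the $ref on their items
                ref = spec.get("items", {}).get("$ref", "")
            if ref:
                target = ref.split("/")[-1]
                lines.append(f"+ {prop}: {target}")
                arrows.append(f"{name} --> {target}")
            else:
                lines.append(f"+ {prop}: {spec.get('type', 'object')}")
        lines.append("}")
    return "\n".join(lines + arrows + ["@enduml"])

# Made-up two-entity fragment in the shape of AccountFilters.json
defs = {
    "AccountFilter": {"properties": {
        "properties": {"$ref": "#/definitions/MediaFilterProperties"}}},
    "MediaFilterProperties": {"properties": {
        "firstQuality": {"type": "object"}}},
}
print(schema_to_plantuml(defs))
```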

Using ChatGPT to Create PlantUML Entity diagrams

More messing around with ChatGPT and automating documentation for APIs.  This time, instead of using a JSON file, I used a description of one of the entities in Azure Media Services.

I also installed the PlantUML plugin on this website. The rendered chart is below the code sample. It’s not exactly right, as an Asset is really mapped to a storage container and extends it with its own properties and methods. It’s a start, though.

Test 1

PROMPT

1. Determine the entities in the following text: An Asset is mapped to a blob container in the Azure Storage account and the files in the Asset are stored as block blobs in that container. Assets contain information about digital files stored in Azure Storage (including video, audio, images, thumbnail collections, text tracks, and closed caption files).

2. Describe the relationship of the entities in PlantUML.

RESULTS

@startuml
class Asset {
+info: String
}
class BlobContainer {
}
class AzureStorage {
}
class File {
}
class BlockBlob {
}
class Video {
}
class Audio {
}
class Image {
}
class ThumbnailCollection {
}
class TextTrack {
}
class ClosedCaptionFile {
}

Asset --> BlobContainer
BlobContainer --> AzureStorage
Asset --> File
File --> BlockBlob
Asset --> Video
Asset --> Audio
Asset --> Image
Asset --> ThumbnailCollection
Asset --> TextTrack
Asset --> ClosedCaptionFile

@enduml


Test 2

This time, I tried using a truncated version of the Asset JSON for Azure Media Services.

I got the following feedback from ChatGPT about the formatting of the JSON, which I appreciate.

“This diagram shows that “Asset” class has an attribute “AssetProperties” which is an object of type “AssetProperties”, which in turn has an attribute “properties” that is an object of “Properties”. And the “Properties” class has several attributes, including “assetId”, “created”, “lastModified”, “alternateId”, “description”, “container”, “storageAccountName”, and “storageEncryptionFormat”. It should be noted that the JSON provided has duplicate information and also the properties types are not well formatted to be used in PlantUML.”

PROMPT

Create a PlantUML class diagram using the following JSON:


{
  "Asset": {
    "AssetProperties": {
      "properties": {
        "assetId": {
          "type": "string",
          "format": "uuid",
          "description": "The Asset ID.",
          "readOnly": true,
          "x-nullable": false
        },
        "created": {
          "type": "string",
          "format": "date-time",
          "description": "The creation date of the Asset.",
          "readOnly": true,
          "x-nullable": false
        },
        "lastModified": {
          "type": "string",
          "format": "date-time",
          "description": "The last modified date of the Asset.",
          "readOnly": true,
          "x-nullable": false
        },
        "alternateId": {
          "type": "string",
          "description": "The alternate ID of the Asset."
        },
        "description": {
          "type": "string",
          "description": "The Asset description."
        },
        "container": {
          "type": "string",
          "description": "The name of the asset blob container."
        },
        "storageAccountName": {
          "type": "string",
          "description": "The name of the storage account."
        },
        "storageEncryptionFormat": {
          "type": "string",
          "enum": [
            "None",
            "MediaStorageClientEncryption"
          ],
          "x-ms-enum": {
            "name": "AssetStorageEncryptionFormat",
            "values": [
              {
                "value": "None",
                "description": "The Asset does not use client-side storage encryption (this is the only allowed value for new Assets)."
              },
              {
                "value": "MediaStorageClientEncryption",
                "description": "The Asset is encrypted with Media Services client-side encryption."
              }
            ],
            "modelAsString": true
          },
          "description": "The Asset encryption format. One of None or MediaStorageEncryption.",
          "readOnly": true,
          "x-nullable": false
        }
      },
      "type": "object",
      "description": "The Asset properties."
    }
  }
}

RESULTS

Here’s the resulting PlantUML:

@startuml
class Asset {
-AssetProperties : AssetProperties
}
class AssetProperties {
-properties : Properties
-type : object
-description : string
}
class Properties {
-assetId : string
-created : string
-lastModified : string
-alternateId : string
-description : string
-container : string
-storageAccountName : string
-storageEncryptionFormat : string
}
@enduml


First Pass at Testing ChatGPT for reading a JSON OpenAPI file

Noodling on how documentation could be automated for an API.  Below is the prompt I used with an excerpt from the Azure Media Services OpenAPI JSON for AMS accounts.

PROMPT

You are a technical content writer.

Your task is to document the Azure Media Services API.

Silently do the following:

1/ Read the JSON between the <OpenAPIJSON></OpenAPIJSON> tags below.

2/ Find the “description” for each entity.

3/ Associate each entity with the description value.

4/ Output your results as follows:

Output:

1/ Print “Azure Media Services Account Entities” at the top.

2/ Print a Markdown table of the entities using the entity name in the first column and the entity description value the second column.

Constraints: Do not change any of the wording of the text in such a way that the original meaning changes.

<OpenAPIJSON>
"parameters": {
  "SubscriptionId": {
    "name": "subscriptionId",
    "in": "path",
    "required": true,
    "type": "string",
    "description": "The unique identifier for a Microsoft Azure subscription."
  },
  "ResourceGroupName": {
    "name": "resourceGroupName",
    "in": "path",
    "required": true,
    "type": "string",
    "description": "The name of the resource group within the Azure subscription.",
    "x-ms-parameter-location": "method"
  },
  "ApiVersion": {
    "name": "api-version",
    "in": "query",
    "required": true,
    "type": "string",
    "description": "The version of the API to be used with the client request."
  }
}
</OpenAPIJSON>

RESULT