Data Science: File Types Using R and Python

  • August 10, 2020

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 18 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.



  • JSON File: JSON stands for JavaScript Object Notation. It is a language-agnostic text format for data interchange that is easy to parse, read, write, and generate. It is built on two structures: ordered lists of values (arrays, also known as vectors, sequences, or lists in other languages) and collections of name-value pairs (objects), which are similar to dictionaries in Python. Most modern programming languages support JSON because these data structures are universal. JSON data can also be easily converted to a data frame for data manipulation tasks.

    R Code:

    [Image: JSON R code]

    Python Code:

    [Image: JSON Python code]
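
    As a minimal sketch of the mapping described above, the snippet below parses a JSON array of objects with Python's standard `json` module and reshapes it into columns. The payload and the names `raw`, `records`, and `columns` are made up for illustration.

```python
import json

# Hypothetical payload: a JSON array holding two objects (name-value pairs).
raw = '[{"name": "Asha", "score": 91.5}, {"name": "Ravi", "score": 78.0}]'

# A JSON array becomes a Python list; a JSON object becomes a dict.
records = json.loads(raw)

# Reshape the list of records into columns, the layout a data frame uses.
columns = {key: [row[key] for row in records] for key in records[0]}

print(columns)
print(json.dumps(records[0]))  # serialize one record back to JSON text
```

    A data frame library can usually take either `records` or `columns` as input directly.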
  • HTML File: HTML stands for Hypertext Markup Language. Its file extension is “.html”. HTML is used to create and manage the structure of web pages. It originated at CERN in Switzerland, where Tim Berners-Lee first described it publicly in 1991. An HTML file is a simple text file whose tags describe the headings, tables, images, etc. that need to be displayed on a web page.

    R Code:

    [Image: HTML R code]

    Python Code:

    [Image: HTML Python code]
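
    Because the tags mark where each table cell starts and ends, tabular data can be pulled out of a page with the standard library alone. The sketch below assumes a small, well-formed page; the `TableParser` class and the sample `page` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:                 # ignore text outside table cells
            self.rows[-1][-1] += data.strip()


# Hypothetical page content used only for this example.
page = """<html><body><table>
<tr><th>format</th><th>kind</th></tr>
<tr><td>CSV</td><td>text</td></tr>
<tr><td>ORC</td><td>binary</td></tr>
</table></body></html>"""

parser = TableParser()
parser.feed(page)
print(parser.rows)
```

    Real pages are often messier than this sample, so a dedicated HTML parsing library is usually the more robust choice.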
  • CSV File: CSV stands for Comma Separated Values. The data is stored in plain text, with the values on each line generally separated by commas or semicolons. It is easily read into a data frame in R and Python for data manipulation.

    R Code:

    [Image: CSV R code]

    Python Code:

    [Image: CSV Python code]
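
    Since the separator may be a comma or a semicolon, it helps to state it explicitly when reading. Here is a small sketch using the standard `csv` module on a made-up semicolon-separated sample (the column names are assumptions).

```python
import csv
import io

# Hypothetical semicolon-separated content, as some tools export it.
text = "name;city;age\nAsha;Pune;31\nRavi;Hyderabad;28\n"

reader = csv.DictReader(io.StringIO(text), delimiter=";")
rows = list(reader)                       # one dict per data line
ages = [int(row["age"]) for row in rows]  # values arrive as strings

print(reader.fieldnames)
print(rows[0]["city"], sum(ages) / len(ages))
```

    For a comma-separated file, dropping the `delimiter` argument is enough, because a comma is the default.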
  • ORC File: ORC stands for Optimized Row Columnar. It is used to store Hive data efficiently and to speed up queries on it. The data in an ORC file is organized into groups of rows called stripes, and within each stripe the values are laid out column by column. In addition to the stripes, a file footer holds supplementary information about the file. The stripe size is configurable; the Hive documentation describes a default of 250 MB.

    R Code:

    NA

    Python Code:

    [Image: ORC Python code]
  • SPSS File: SPSS stands for Statistical Package for the Social Sciences. SPSS was acquired by IBM and is now an IBM product. Its file extension is ".sav". A SAV file is stored in a binary form that is native to SPSS. However, SAV files can also be used in R and Python, where reader packages convert them to a data frame.

    R Code:

    [Image: SPSS R code]

    Python Code:

    [Image: SPSS Python code]
  • SAS File: SAS stands for Statistical Analysis System. It was developed at North Carolina State University; SAS Institute was incorporated in 1976 and has managed SAS to date. It is used for analytics, data management, and Business Intelligence. Its data file extension is ".sas7bdat". The data in a SAS file is stored in rows and columns, so it can easily be imported into R and Python and parsed as a data frame.

    R Code:

    [Image: SAS R code]

    Python Code:

    [Image: SAS Python code]
  • Matlab File: MATLAB is a programming tool designed by MathWorks. It is generally used by engineers, scientists, and data scientists to analyze data, design algorithms and models, and develop applications. Its data file extension is ".mat", and the file is a binary container format. Level 4 MAT-files support two-dimensional matrices and strings. Level 5 MAT-files additionally support multidimensional arrays, objects, structures, etc. These levels are MATLAB's internal versions of the storage format. MATLAB data files can easily be parsed in R and Python.

    R Code:

    [Image: Matlab R code]

    Python Code:

    [Image: Matlab Python code]
  • Parquet File: Parquet is an open-source file format that is widely used for projects in the Hadoop ecosystem. Like ORC, it stores data in a columnar layout, which is very efficient. Its file extension is ".parquet". It is extremely efficient in data encoding and compression, and it has been optimized to work with bulk data that has a complex structure. Because the layout is columnar, only the required columns need to be read from a large dataset, which keeps the computational burden low. Parquet datasets can be parsed to data frames in R and Python. In R, the arrow package is available in the CRAN repository and can be installed directly. In Python, the Pandas package reads Parquet files just as it reads other file types, provided a Parquet engine such as pyarrow or fastparquet is installed.

    R Code:

    [Image: Parquet R code]

    Python Code:

    [Image: Parquet Python code]
  • Stata File: Stata is a statistical tool developed by StataCorp in 1985. Its data file extension is ".dta". Stata is used for research in areas such as social science, bioscience, medicine, and epidemiology. Large data can be easily managed and stored using Stata, and it is effective for data analytics and visualization. Just like an Excel sheet or a comma-separated values dataset, a Stata dataset has a 2-dimensional rectangular structure: observations are arranged in rows and features are arranged in columns. Hence, it can be easily parsed as a data frame in R and Python environments.

    R Code:

    [Image: Stata R code]

    Python Code:

    [Image: Stata Python code]
  • Weka File: The full form of Weka is Waikato Environment for Knowledge Analysis. It is written in Java. It is an open-source tool and can be used for data processing, analytics, machine learning, and visualization. Its file extension is ".arff". ARFF stands for Attribute-Relation File Format. It is an ASCII text file with a header section and a data section. The header section contains the relation name and the attribute declarations (names and types). The data section contains one instance per line, with the attribute values delimited by commas. Any missing value is represented by a question mark. Data values are case sensitive, although the @relation, @attribute, and @data keywords are not. Interestingly, Weka stores string and nominal values internally as numbers. Weka files, too, are easily parsed in R and Python.

    R Code:

    [Image: Weka R code]

    Python Code:

    [Image: Weka Python code]
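
    The header/data layout described above can be illustrated with a deliberately simplified reader. This is a sketch, not a full ARFF parser: it assumes no quoted values, no sparse rows, and single-word attribute names, and the function name `parse_arff` and the sample data are invented for the example.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attribute names, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):       # blank line or comment
            continue
        lowered = line.lower()                     # keywords are case-insensitive
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A question mark marks a missing value.
            rows.append([None if v == "?" else v for v in values])
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical sample file content.
sample = """% hypothetical weather data
@relation weather
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@DATA
sunny,85
rainy,?
"""

relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```

    For real files, an established ARFF reader is the safer route, since it also handles quoting, dates, and sparse instances.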
  • YAML File: The full form of YAML is YAML Ain't Markup Language. Its extension is ".yaml" (".yml" is also common). YAML files are user friendly and can be used easily with multiple programming languages. YAML is used to store and manage data; as the name suggests, it is a data-oriented language rather than a document markup language. It is able to match the data structures of other languages such as Python, Perl, Ruby, etc. YAML is case sensitive. Any line starting with a hash (#) is treated as a comment. As in Python, whitespace indentation denotes structure; only spaces may be used for indentation, because tabs are not permitted. Data within square brackets [ ] represents a list. Key-value pairs are written with a colon (:) and can be grouped inline using curly brackets { }.

    R Code:

    [Image: YAML R code]
  • PDF File: The full form of PDF is Portable Document Format. Its extension is ".pdf". PDF was invented by Adobe. It is an extremely useful file format for storing data in the form of text, images, tables, etc. PDF documents can be easily exchanged irrespective of the operating system. PDF files are primarily used for viewing, and the format does an excellent job of preserving the layout in which the document was originally prepared.

    R Code:

    [Image: PDF R code]

    Python Code:

    [Image: PDF Python code]
  • AVRO File: Avro is a data serialization system. It was developed by Doug Cutting, who was also instrumental in developing Hadoop. Hadoop's own serialization is not easily used from many languages, whereas Avro data can be read and written from many languages, so Avro is used to serialize data for Hadoop. The schema is defined in JSON and is embedded in the file, while the data itself is stored in a compact binary form. Avro files can also be easily split and compressed. The file extension is ".avro". The file can be easily imported into Python and processed.

    R Code:

    [Image: AVRO R code]

    Python Code:

    [Image: AVRO Python code]
  • MP4 File: MP4 is an MPEG-4 video file format. The full form of MPEG is Moving Picture Experts Group. Its extension is ".mp4". It holds digital data in a compressed format. Most video players support the MP4 file format. It is mainly used to store video and audio.

    R Code:

    [Image: MP4 R code]

    Python Code:

    [Image: MP4 Python code]
  • XML File: The full form of XML is Extensible Markup Language. It is a flexible text format that works independently of the system being used. It is used to store and exchange a wide variety of data. Its file extension is “.xml”. It has markup tags that help explain the meaning of the file data. It is widely used across many platforms.

    R Code:

    [Image: XML R code]

    Python Code:

    [Image: XML Python code]
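
    To show how the tags give meaning to the data, here is a minimal sketch that reads a small XML document with the standard `xml.etree.ElementTree` module. The `catalog` document and its tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names describe what each value means.
doc = """<catalog>
  <book id="b1"><title>R Basics</title><price>20.5</price></book>
  <book id="b2"><title>Python Basics</title><price>18.0</price></book>
</catalog>"""

root = ET.fromstring(doc)

# Turn each <book> element into a flat record.
books = [
    {
        "id": book.get("id"),                     # attribute value
        "title": book.findtext("title"),          # child element text
        "price": float(book.findtext("price")),   # text converted to a number
    }
    for book in root.findall("book")
]

print(books)
```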
  • PNG File: PNG stands for Portable Network Graphics. PNG was developed as an improved replacement for GIF (Graphics Interchange Format). It is a raster graphics format that compresses image data without losing information. It is able to store greyscale images and 24-bit colour images. Its file extension is “.png”.

    R Code:

    [Image: PNG R code]

    Python Code:

    [Image: PNG Python code]
  • JPEG File: The full form of JPEG is Joint Photographic Experts Group. Generally, JPEGs are used for effortlessly sharing image files. JPEG uses lossy compression: at moderate settings the image stays visually close to the original, but heavy or repeated compression degrades the quality. It is widely used on the internet, mobile phones, and computers. It is a very efficient storage method for photographs because the files are small. Its file extensions are “.jpeg” and “.jpg”.

    R Code:

    [Image: JPEG R code]

    Python Code:

    [Image: JPEG Python code]
  • TIF File: TIF stands for Tagged Image File Format. TIF preserves high-quality images. Adobe acquired the format along with Aldus Corporation and has improved it considerably. It can contain compressed and uncompressed images. TIF files can be easily converted to PDF, GIF, JPEG, and other formats. Its file extension is “.tif” (or “.tiff”). TIF is capable of holding images with a high colour depth.

    R Code:

    [Image: TIF R code]

    Python Code:

    [Image: TIF Python code]
  • MP3 File: MP3 is used to store audio data. It is widely used to store, compress, and easily share audio files. The compression is lossy and therefore irreversible, yet it can still give very good audio quality, because it is designed to discard mainly the sounds that human ears are least likely to detect. The file extension for MP3 is “.mp3”.

    R Code:

    [Image: MP3 R code]

    Python Code:

    [Image: MP3 Python code]
  • DIF File: DIF stands for Data Interchange Format. It stores text data in a regular spreadsheet style; however, it cannot handle multiple spreadsheets at once. Its file extension is “.dif”.

    R Code:

    [Image: DIF R code]

    Python Code:

    [Image: DIF Python code]
  • WAV File: The full form of WAV is Waveform Audio File Format. Its file extension is “.wav”. It was jointly developed by Microsoft and IBM. Before MP3 became popular, audio files were generally played in WAV format.

    R Code:

    [Image: R code]

    Python Code:

    [Image: Python code]
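
    As a small illustration, the standard `wave` module can write and read WAV data without any extra packages. The sketch below builds a one-second test tone in memory; the sample rate, tone frequency, and variable names are assumptions chosen for the example.

```python
import io
import math
import struct
import wave

RATE = 8000   # samples per second (assumed for the example)
TONE = 440    # tone frequency in Hz (assumed for the example)

# Write one second of a sine tone as 16-bit mono audio into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)            # 2 bytes per sample = 16-bit
    out.setframerate(RATE)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * TONE * n / RATE))
        for n in range(RATE)
    ]
    out.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read the header information back.
buffer.seek(0)
with wave.open(buffer, "rb") as src:
    channels = src.getnchannels()
    width = src.getsampwidth()
    rate = src.getframerate()
    frames = src.getnframes()

duration = frames / rate
print(channels, width, rate, frames, duration)
```

    Passing a file path instead of the in-memory buffer reads or writes a “.wav” file on disk in the same way.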
  • ZIP File: There is no full form for ZIP. Its file extension is “.zip”. It is a binary file format used to compress files and data, and it is also used for archival purposes. It can compress many files at once, and the ZIP file can be decompressed to get back the original files stored in it. It is extremely handy for exchanging large files.

    R Code:

    [Image: ZIP R code]

    Python Code:

    [Image: ZIP Python code]
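
    The round trip described above (several files in, the same files out) can be sketched with the standard `zipfile` module. The file names and contents below are made up, and the archive is kept in memory only to make the example self-contained.

```python
import io
import zipfile

# Hypothetical files to archive: name -> content.
files = {"notes.txt": b"hello " * 1000, "data.csv": b"a,b\n1,2\n"}

# Compress several files into one archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, payload in files.items():
        zf.writestr(name, payload)

# Decompress the archive and recover the original content.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    restored = {name: zf.read(name) for name in names}
    info = zf.getinfo("notes.txt")

print(names)
print(info.file_size, "bytes stored in", info.compress_size, "bytes")
```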
  • RAR File: The full form of RAR is Roshal Archive. It is named after its developer, Eugene Roshal. Like a ZIP file, a RAR file is used to compress and archive multiple files. Its file extension is “.rar”. It is also extremely handy for exchanging large files.

    Python Code:

    [Image: RAR Python code]
  • RSS File: The full form of RSS is Rich Site Summary (it is also expanded as Really Simple Syndication). Many websites update their content regularly. To share the updated content, they generally publish feeds in RSS, an XML-based format. Users and programs can read these feeds to extract the information they need.

    R Code:

    [Image: RSS R code]

    Python Code:

    [Image: RSS Python code]
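
    An RSS feed is an XML document, so the updated items can be extracted with an ordinary XML parser. The sketch below works on a made-up RSS 2.0 snippet; in practice the text would first be downloaded from the site's feed URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content (RSS 2.0 layout: rss > channel > item).
feed = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item><title>First post</title><link>https://example.com/1</link></item>
<item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(feed)
site = root.findtext("channel/title")

# Each <item> is one piece of updated content.
items = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]

print(site)
for title, link in items:
    print(title, link)
```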
  • TXT File: TXT stands for Text file format. Its file extension is “.txt”. It stores data as plain text with extremely limited formatting options. The data is stored as a sequence of lines, much like the lines in a book.

    R Code:

    [Image: TXT R code]

    Python Code:

    [Image: TXT Python code]
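
    Because a text file is a sequence of lines, reading it usually means iterating line by line. A minimal sketch follows, using a temporary file so that it runs anywhere; the file name and contents are placeholders.

```python
import os
import tempfile

lines_out = ["first line", "second line", "third line"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")   # hypothetical file name

    # Write one line of text per entry.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines_out) + "\n")

    # Read the file back line by line, dropping the newline characters.
    with open(path, encoding="utf-8") as handle:
        lines_in = [line.rstrip("\n") for line in handle]

print(lines_in)
```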
  • ISO File: The “.iso” extension takes its name from ISO 9660, a standard published by the International Organization for Standardization (ISO) that defines the file system for media stored on optical discs. An ISO file stores an image of the data on a CD, DVD, etc.

    R Code:

    NA

    Python Code:

    [Image: ISO file Python code]
  • DBF File: DBF is the database file format introduced by dBASE. Its file extension is “.dbf”. It can store a large number of records that are properly indexed, so the data can be easily looked up, manipulated, compared, and referenced. The components of a database are the schema (which can hold multiple tables), the table (a 2-dimensional object with rows and columns), rows (which store observations), and columns (which store values of a given data type such as numeric or character).

    R Code:

    [Image: DBF file R code]

    Python Code:

    [Image: DBF file Python code]
  • Markdown File: Markdown is known as a lightweight markup language. Markdown files are often referred to as developer files. It stores data in plain text format that is easy to read and write. Markdown text files are generally converted to HTML files; however, Markdown is not treated as a replacement for HTML. The primary goal of Markdown is readability. Its file extension is “.md”.

    R Code:

    [Image: Markdown file R code]

    Python Code:

    [Image: Markdown file Python code]
  • DLL File: DLL stands for Dynamic Link Library. It is a shared library of code and resources that many programs can load and use to perform common tasks. For example, when a program saves a file locally, a DLL can supply the routines that carry out the steps internally. Because this functionality is shared, developers are able to write programs more easily. Its file extension is “.dll”.

    R Code:

    [Image: DLL file code]

    Python Code:

    [Image: DLL file Python code]
  • RTF File: The full form of RTF is Rich Text Format. Its file extension is “.rtf”. An RTF file is stored as plain text that also carries formatting instructions. A plain text file has extremely limited formatting features, whereas rich text offers more, such as fonts, bold, and italics.

    R Code:

    [Image: RTF file R code]

    Python Code:

    [Image: RTF file Python code]
  • BMP File: BMP is a bitmap image file format. Its extension is “.bmp”. It stores the image independently of the display device, so it does not depend on a particular graphics adapter, and it can be in uncompressed or compressed form. Bitmaps can hold greyscale and coloured 2-dimensional images.

    R Code:

    [Image: BMP file R code]

    Python Code:

    [Image: BMP file Python code]
  • GeoTIFF File: GeoTIFF is like a regular “.tif” image file, with spatial information added as embedded tags. GeoTIFF files carry the following metadata:
    • Image resolution
    • Layers
    • Coordinate reference system
    • Area coverage
    • No-data value

    R Code:

    [Image: GeoTIFF file R code]

    Python Code:

    [Image: GeoTIFF file Python code]
  • HDF5 File: The full form of HDF5 is Hierarchical Data Format 5. Its file extension is “.hdf5” (“.h5” is also common). HDF5 is open source and is used to store large amounts of data in a hierarchical structure.

    It is extremely handy for retrieving parts of the data rather than the whole at once. It is also powerful for accessing and searching, because it stores metadata along with the data. It supports heterogeneous and complex data.

    R Code:

    [Image: HDF5 file R code]

    Python Code:

    [Image: HDF5 file Python code]
  • AIFF File: The full form of AIFF is Audio Interchange File Format. It is a file type for storing audio data. It was developed by Apple and has been extensively used for audio purposes. Standard AIFF audio is uncompressed. Hence, the audio quality is very good, but a file generally takes more space than the equivalent MP3 file.

    Python Code:

    [Image: AIFF file Python code]
  • MOV File: MOV is a multimedia container file format that can hold video, audio, timecode, and text tracks. Its file extension is “.mov”. It was developed by Apple and is compatible with both Windows and Mac.

    Python Code:

    [Image: MOV file Python code]
  • TSV File: The full form of TSV is Tab Separated Values. Its file extension is “.tsv”. Like CSV, TSV is a 2-dimensional file format used with spreadsheets; the values on each line are separated by tab characters.

    R Code:

    [Image: TSV file R code]

    Python Code:

    [Image: TSV file Python code]
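
    Reading TSV works like reading CSV with the delimiter set to a tab. A brief sketch with the standard `csv` module, using invented sample data:

```python
import csv
import io

# Hypothetical tab-separated content.
text = "id\tname\tscore\n1\tAsha\t91.5\n2\tRavi\t78.0\n"

# The first row is the header; the remaining rows are the data.
header, *data = csv.reader(io.StringIO(text), delimiter="\t")
table = [dict(zip(header, row)) for row in data]

print(header)
print(table)
```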
  • SWF File: The full form of SWF is Small Web Format. It is also referred to as Shockwave Flash. Its file extension is “.swf”. It is an Adobe Flash file. It can contain movies and animations. It is used to deliver multimedia files over the web.

    Python Code:

    [Image: SWF file Python code]
  • PSD File: The full form of PSD is Photoshop Document. It is a layered image file. Its extension is “.psd”. Photoshop uses it as the default file format to preserve data. PSD files can be converted to non-proprietary image file formats such as “.jpg”, “.tif”, etc. However, the layered PSD cannot be recovered from the converted file, so the original PSD should be kept.

    Python Code:

    [Image: PSD file Python code]
  • SVG File: SVG stands for Scalable Vector Graphics. Its extension is “.svg”. SVG is used to describe 2-dimensional graphics. It supports three types of graphic objects:
    • Text
    • Images
    • Vector graphic shapes (straight lines or curves)

    It is primarily used to present information in a rich graphical format. It is an XML application and is HTML compatible. The graphical objects in SVG can be grouped, transformed, blended, and styled. The files can be rendered to various formats such as PDF, PNG, etc.

    R Code:

    [Image: SVG file R code]

    Python Code:

    [Image: SVG file Python code]
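
    Since SVG is an XML application, its graphic objects can be inspected with a plain XML parser. The sketch below parses a tiny hand-written drawing containing two shapes and a text object; the drawing itself is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"   # the standard SVG namespace

# Hypothetical drawing: a rectangle, a straight line, and a text label.
svg = f"""<svg xmlns="{SVG_NS}" width="200" height="100">
  <rect x="10" y="10" width="80" height="40" fill="steelblue"/>
  <line x1="10" y1="70" x2="190" y2="70" stroke="black"/>
  <text x="100" y="35">Sales</text>
</svg>"""

root = ET.fromstring(svg)

# Element names come back prefixed with the namespace, so strip it.
tags = [child.tag.replace("{" + SVG_NS + "}", "") for child in root]
label = root.find(f"{{{SVG_NS}}}text").text

print(tags, label, root.get("width"), root.get("height"))
```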